In these lessons so far you've learned a lot about the statistical programming language R: you've learned how to set variables and work with data structures, to create informative visualizations using ggplot2, and how to perform rigorous statistical tests. But datasets from the real world aren't so simple: they might contain dozens of variables across tens of thousands of observations. You might need to filter them based on quality considerations, and they might be spread across multiple datasets, leaving you to combine them together yourself. I'm David Robinson, and in this lesson we're going to tie all your R skills together, by learning how to perform exploratory data analysis using the powerful data.table package. As an example, we'll be analyzing a dataset of historical baseball statistics, learning how to preprocess and filter it, how to combine multiple datasets together, and how to answer interesting questions interactively with your data.
This lesson will assume basic familiarity with R, especially vectors and data frames, with RStudio, and with the ggplot2 package for visualization.
One of the essential functions of computers is to take repetitive, dull tasks and automate them, so they can be easily be performed as many times as you need. Part of being a good programmer is sticking to this philosophy.
So far we've been working in the interactive R terminal. This is useful for writing a line of code quickly and seeing the result.
3 + 5
## [1] 8
But most of your analyses will take multiple lines of code and they'll have to be run all in the same order, and it's impractical to have to type them all in sequence in an interactive window. So let's instead write an R script, which will contain a series of commands that you want to run in order. In RStudio, Go to File->New File->New R Script, or you can do CMD+SHIFT+N, to create a new R script. You'll see it pops up above your command prompt. Write a couple lines of R. For example, let's define two variables:
x = 4
y = x + 6
Now save your new script. First you have to choose a working directory, which you can do with "More->Set as Working Directory" in the file manager in RStudio. Then save your script, which you can do with CMD+S or with the floppy disk icon here. Let's save it as script.R
in our working directory. Now we can run this script- that is, all our commands in the script all at once- by clicking "Source" at the top of the script. Notice that a command pops up in your interactive terminal, something like
source('~/Desktop/RCourse/script.R')
That means it ran all the commands in your script in a row. You can see this by checking the values of x
and y
:
x
## [1] 4
y
## [1] 10
So even though you didn't see the commands in the interactive terminal, they did run when you hit Source.
Incidentally, there's a keyboard shortcut for sourcing a current file as well: you can do CMD+SHIFT+S.
Now what if we want some output to our script besides just setting a few variables? Normally we'd be able to view the contents of a variable just by typing that variable by itself. For instance:
y
## [1] 10
However, try putting the line y
by itself in your script, save, and source it. Notice there was no output. That's because when you run outside of the interactive terminal, values don't print unless you explicitly tell them to do so. You do that with the print function. Change your line in the file to print(y)
, then save and source.
print(y)
## [1] 10
Now we can see the output in your interactive window. Remember that: when you're running inside a script, if you want to see an output, you have to actually print it.
Importantly, the same is true of a ggplot2 plot. Let's create a basic ggplot:
library(ggplot2)
## Loading required package: methods
data(mtcars)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
As soon as we hit return on this line, we created a scatterplot. However, let's put the same lines of code into our script. Then we clear the current plot by hitting Clear All, and hit Source.
Notice that no plot showed up. The reason is that when it's in a file, a ggplot, just like a regular variable, needs to be printed to show up. You do that with:
print(ggplot(mtcars, aes(wt, mpg)) + geom_point())
Finally, there's a useful shortcut for running one line at a time in a script. Put your cursor on a particular line of code, then hit CMD+RETURN in Macs, or CONTROL+ENTER on Windows, you'll run just that line of code in your interactive terminal. This means you don't have to select a line and copy and paste it into your terminal. This is useful for if you want to run a single line of code in your script, but don't want to go through the time of rerunning the whole file.
R scripts can be as long as you like: yours may end up being hundreds or even thousands of lines of code- and they're the only practical way to organize a complicated analysis productively.
So far we've been working with built-in datasets in R. For example, remember that we can load the mtcars
dataset by doing
data(mtcars)
R comes with dozens of useful datasets like this, and they're great for learning and practicing. But of course, whatever data you're actually interested in, chances are it's probably not already built into R. So how can you read your own data into R?
Let's start by downloading some data and reading it in. We're going to be working with a dataset about baseball: specifically the 2013 version of Sean Lahman's Baseball Archive. The statistical analysis of baseball is called sabermetrics, and it has a rich and fascinating history: this dataset can be used to explore lots of extraordinary trends. Don't worry if you're not a fan of baseball, or even if you know nothing about it: we're just using it as an example of data manipulation. I'll explain everything as we go along, and you'll have just as much fun as anyone.
So the URL for this file is http://dgrtwo.github.io/pages/lahman/Salaries.csv. Take this URL and put it into your favorite web browser. At this point you can see the comma separated values in the dataset. Let's save it to our computer, particularly putting it into your current working directory. Make sure it's saved as a CSV file. We should now have the Salaries.csv file in your file manager in RStudio.
Let's look at the contents of this CSV file a little more closely. This is called CSV, or comma-separated value, format. It contains one header row, and then a series of lines made up of multiple fields separated by commas. This is one of the most common formats data is shared in. For instance, if you have an Excel spreadsheet, note that you can save it into CSV format.
You can read a CSV file into R easily with the read.csv
function:
salaries = read.csv("Salaries.csv")
This read in the CSV file as a data frame. You can see this by doing
View(salaries)
This data is organized into rows: one row per player per year. You can see a column for the year, for the ID of the player, and for the salary in US dollars. You can also see the team that the player was playing on, and the league, either the American League (AL
) or National League (NL
) that the team plays in.
I had you download the file just so we could look at it, but note that you don't actually have to download it at all. R can read a CSV file directly from the internet, if you give it the URL. Take that URL we had and put it into read.csv
directly:
salaries = read.csv("http://dgrtwo.github.io/pages/lahman/Salaries.csv")
This downloads the file, reads it in, and saves it into the same data.frame.
Now, this was the best way to read this standard CSV file. But what if you didn't have a header row? Or what if your file were separated by spaces, or by tabs, instead of by commas? Take a look at the help page for read.csv
:
help(read.csv)
You can see here that there are multiple ways to use read.csv
with different defaults and options. For example, if you didn't want to have a header, you could change the header option to header=FALSE
. If you wanted the fields to be separated by spaces, rather than commas, you could add sep=" "
. Don't do that in this case.
You can also see that R provides other functions such as read.table
whose defaults are different (space separation with no header). So by choosing the right function, and the right set of specialized option, you can read almost any kind of row/column organized file.
R can't read everything. But for the majority of cases, if your data was prepared responsibly and you set these simple options correctly, you can use the read.csv
and read.table
functions to import whatever data you need.
Now we have downloaded this data on baseball players' salaries, we're interested in filtering and manipulating it. The salary data is currently held in a data.frame, a data structure built into R. data.frame is powerful, but there's an even more powerful alternative in R that makes data manipulation fast, easy and intuitive. Like ggplot2, it's a third party package, which means it doesn't come built in, and we have to install it. So let's install the data.table
package You can do that with:
install.packages("data.table")
Now, before we learn to use data.table, let's take one last look at salaries the data.frame. If you type
salaries
it will print thousands and thousands of lines until it decides to stops. Now, let's convert the data.frame into a data table. First you load the data.table
package:
library(data.table)
Then use the as.data.table
function to replace salaries
with a data.table version.
salaries = as.data.table(salaries)
Now, let's print it again:
salaries
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
## 2: 1985 BAL AL lynnfr01 1090000
## 3: 1985 BAL AL ripkeca01 800000
## 4: 1985 BAL AL lacyle01 725000
## 5: 1985 BAL AL flanami01 641667
## ---
## 23952: 2013 WAS NL matthry01 504500
## 23953: 2013 WAS NL lombast02 501250
## 23954: 2013 WAS NL ramoswi01 501250
## 23955: 2013 WAS NL rodrihe03 501000
## 23956: 2013 WAS NL moorety01 493000
Notice that it contains the same information, but only shows the first five rows, then ---, then the last five rows, which is generally a more convenient representation. This more compact way of printing a data.table is the first benefit of using the package.
Now, a lot of things work just the same way as they do in a data.frame. You can still access a column with a dollar sign. For instance, let's say you want the salary
column.
salaries$salary
Or to get a single row of a data.table, you can do:
salaries[1, ]
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
Or you can get a range of rows:
salaries[1:5, ]
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
## 2: 1985 BAL AL lynnfr01 1090000
## 3: 1985 BAL AL ripkeca01 800000
## 4: 1985 BAL AL lacyle01 725000
## 5: 1985 BAL AL flanami01 641667
One thing that did work on data frames but doesn't work on data tables is extracting a column based on an index. In a data.frame, you could extract the first column by putting the index after the comma:
salaries[, 1]
## [1] 1
But that doesn't work in data.table. Instead, you can put the name of the column, without quotes, after the comma:
salaries[, yearID]
This retrieves the entire vector of that year.
You can also grab multiple columns (for example, just the year and the salary) using list
:
salaries[, list(yearID, salary)]
## yearID salary
## 1: 1985 1472819
## 2: 1985 1090000
## 3: 1985 800000
## 4: 1985 725000
## 5: 1985 641667
## ---
## 23952: 2013 504500
## 23953: 2013 501250
## 23954: 2013 501250
## 23955: 2013 501000
## 23956: 2013 493000
Now we've created a new data table with just the two columns yearID
and salary
.
Now, let's say we want to filter the rows based on one column, which is a common step in data preprocessing. For example, let's say you want to get only the years after 2000. There is an easy way to do this in data.table. We put a condition before the comma:
salaries[yearID > 2000, ]
## yearID teamID lgID playerID salary
## 1: 2001 ANA AL vaughmo01 13166667
## 2: 2001 ANA AL salmoti01 6500000
## 3: 2001 ANA AL anderga01 4500000
## 4: 2001 ANA AL erstada01 3450000
## 5: 2001 ANA AL percitr01 3400000
## ---
## 10853: 2013 WAS NL matthry01 504500
## 10854: 2013 WAS NL lombast02 501250
## 10855: 2013 WAS NL ramoswi01 501250
## 10856: 2013 WAS NL rodrihe03 501000
## 10857: 2013 WAS NL moorety01 493000
Now we've created a subset of the data that contains only years 2001 and after. Similarly, if you wanted to get salaries from a specific year, 2010, you could do
salaries[yearID == 2010, ]
## yearID teamID lgID playerID salary
## 1: 2010 BAL AL millwke01 12000000
## 2: 2010 BAL AL roberbr01 10000000
## 3: 2010 BAL AL lugoju01 9250000
## 4: 2010 BAL AL markani01 7100000
## 5: 2010 BAL AL gonzami02 6000000
## ---
## 826: 2010 WAS NL zimmejo01 401000
## 827: 2010 WAS NL desmoia01 400000
## 828: 2010 WAS NL detwiro01 400000
## 829: 2010 WAS NL englije01 400000
## 830: 2010 WAS NL taverwi01 400000
There are two leagues in American Major League Baseball: the American League (AL
), and the National League (NL
). If we want to filter for just the American League, we can do:
salaries[lgID == "AL", ]
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
## 2: 1985 BAL AL lynnfr01 1090000
## 3: 1985 BAL AL ripkeca01 800000
## 4: 1985 BAL AL lacyle01 725000
## 5: 1985 BAL AL flanami01 641667
## ---
## 11740: 2013 TOR AL perezlu01 500000
## 11741: 2013 TOR AL drabeky01 499500
## 11742: 2013 TOR AL delabst01 498900
## 11743: 2013 TOR AL jeffrje01 495900
## 11744: 2013 TOR AL loupaa01 494200
Now I've selected just the American League teams.
Finally, we can combine multiple filtering conditions using the and (&
) or or (|
) operators. For instance, we can filter for all the rows in the American League that were after 1990.
salaries[lgID == "AL" & yearID >= 1990, ]
## yearID teamID lgID playerID salary
## 1: 1990 BAL AL ripkeca01 1316667
## 2: 1990 BAL AL bradlph01 1150000
## 3: 1990 BAL AL tettlmi01 750000
## 4: 1990 BAL AL orsuljo01 610000
## 5: 1990 BAL AL melvibo01 350000
## ---
## 10023: 2013 TOR AL perezlu01 500000
## 10024: 2013 TOR AL drabeky01 499500
## 10025: 2013 TOR AL delabst01 498900
## 10026: 2013 TOR AL jeffrje01 495900
## 10027: 2013 TOR AL loupaa01 494200
Similarly, we can combine conditions with or (|
). For instance, say we wanted only years before 1990 or after 2010.
salaries[yearID < 1990 | yearID > 2010, ]
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
## 2: 1985 BAL AL lynnfr01 1090000
## 3: 1985 BAL AL ripkeca01 800000
## 4: 1985 BAL AL lacyle01 725000
## 5: 1985 BAL AL flanami01 641667
## ---
## 5787: 2013 WAS NL matthry01 504500
## 5788: 2013 WAS NL lombast02 501250
## 5789: 2013 WAS NL ramoswi01 501250
## 5790: 2013 WAS NL rodrihe03 501000
## 5791: 2013 WAS NL moorety01 493000
We can also sort the data easily, using the order function in the area before the comma:
salaries[order(salary), ]
## yearID teamID lgID playerID salary
## 1: 1993 NYA AL jamesdi01 0
## 2: 1999 PIT NL martija02 0
## 3: 1993 NYA AL silveda01 10900
## 4: 1994 CHA AL carych01 50000
## 5: 1997 FLO NL penaal01 50000
## ---
## 23952: 2013 NYA AL rodrial01 29000000
## 23953: 2012 NYA AL rodrial01 30000000
## 23954: 2011 NYA AL rodrial01 32000000
## 23955: 2009 NYA AL rodrial01 33000000
## 23956: 2010 NYA AL rodrial01 33000000
Now you can see that the pairs of players and years are sorted by salary, with the lower salaries at the top. This gives us an easy way to see the highest and lowest salaries. In case you're wondering, NYA here is the team ID of the New York Yankees: called NYA instead of NYY for historical reasons. We could also sort by year:
salaries[order(yearID), ]
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
## 2: 1985 BAL AL lynnfr01 1090000
## 3: 1985 BAL AL ripkeca01 800000
## 4: 1985 BAL AL lacyle01 725000
## 5: 1985 BAL AL flanami01 641667
## ---
## 23952: 2013 WAS NL matthry01 504500
## 23953: 2013 WAS NL lombast02 501250
## 23954: 2013 WAS NL ramoswi01 501250
## 23955: 2013 WAS NL rodrihe03 501000
## 23956: 2013 WAS NL moorety01 493000
What if we want to sort first by year, and then breaking ties with salary? We can do that by providing two arguments to the order function:
salaries[order(yearID, salary), ]
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL sheetla01 60000
## 2: 1985 CAL AL clibust02 60000
## 3: 1985 CAL AL mccaski01 60000
## 4: 1985 CHA AL guilloz01 60000
## 5: 1985 MIN AL salasma01 60000
## ---
## 23952: 2013 NYA AL teixema01 23125000
## 23953: 2013 NYA AL sabatcc01 24285714
## 23954: 2013 NYA AL wellsve01 24642857
## 23955: 2013 PHI NL leecl02 25000000
## 23956: 2013 NYA AL rodrial01 29000000
Now it's organized with 1985 first, and 2013 last, but within each of those years it is organized by salary.
Note that we can perform multiple operations all in a sequence, by saving the intermediate results. For instance, we can first perform a filtering operation and save it as salaries.filtered
:
salaries.filtered = salaries[lgID == "AL" & yearID >= 1990, ]
Then we can sort it by salary and save it into a new data table, which is now both filtered and sorted.
salaries.filtered.sorted = salaries.filtered[order(salary), ]
salaries.filtered.sorted
## yearID teamID lgID playerID salary
## 1: 1993 NYA AL jamesdi01 0
## 2: 1993 NYA AL silveda01 10900
## 3: 1994 CHA AL carych01 50000
## 4: 1990 BAL AL bellju01 100000
## 5: 1990 BAL AL brownma03 100000
## ---
## 10023: 2013 NYA AL rodrial01 29000000
## 10024: 2012 NYA AL rodrial01 30000000
## 10025: 2011 NYA AL rodrial01 32000000
## 10026: 2009 NYA AL rodrial01 33000000
## 10027: 2010 NYA AL rodrial01 33000000
These operations let us easily explore the data and answer basic questions.
In our last segment we learned how to download a dataset on baseball player salaries and turn it into a data table, and then to perform some basic organizations on it like filtering and sorting. Now we're going to learn about a more sophisticated and powerful way of processing the data, namely performing summary operations within groups. This is an important and omnipresent task in data analysis.
Let's look again at our salaries dataset.
salaries
## yearID teamID lgID playerID salary
## 1: 1985 BAL AL murraed02 1472819
## 2: 1985 BAL AL lynnfr01 1090000
## 3: 1985 BAL AL ripkeca01 800000
## 4: 1985 BAL AL lacyle01 725000
## 5: 1985 BAL AL flanami01 641667
## ---
## 23952: 2013 WAS NL matthry01 504500
## 23953: 2013 WAS NL lombast02 501250
## 23954: 2013 WAS NL ramoswi01 501250
## 23955: 2013 WAS NL rodrihe03 501000
## 23956: 2013 WAS NL moorety01 493000
Right now we have 24 thousand rows, each with a combination of a year and a player. Now, we can perform some general summaries of this data. For example, we can extract the salary column using a dollar sign ($):
salaries$salary
Then we can find the average salary of all players across all years with the mean function:
mean(salaries$salary)
## [1] 1864357
Similarly, we could find the highest salary across all years with max
:
max(salaries$salary)
## [1] 33000000
or the median:
median(salaries$salary)
## [1] 507950
We can also find the average salary in a given year by filtering the data before we extract the column. This code extracts the salaries only from the year 2000:
salaries[yearID == 2000, ]$salary
## [1] 11166667 6000000 5600000 4600000 4000000 3250000 3225000
## [8] 2500000 2350000 1512500 925000 900000 850000 600000
## [15] 550000 375000 275000 275000 225000 222500 215000
## [22] 210000 210000 210000 207500 207500 202500 200000
## [29] 200000 200000 12868670 7127199 6786032 6620921 6300000
## [36] 6000000 4600000 4250000 4209324 4146789 3000000 2500000
## [43] 2250000 2000000 1750000 1450000 1200000 750000 700000
## [50] 620000 500000 400000 205000 205000 204000 203500
## [57] 201000 200000 200000 11500000 6350000 6320000 5750000
## [64] 5000333 4500000 4500000 4000000 3750000 3700000 3500000
## [71] 3500000 2500000 2400000 2000000 2000000 1350000 695000
## [78] 625000 625000 612500 610000 375000 350000 295000
## [85] 270000 250000 210000 202500 200000 7100000 5400000
## [92] 4900000 3300000 1400000 1350000 1320000 675000 475000
## [99] 425000 375000 325000 305000 285000 285000 275000
## [106] 255000 255000 250000 250000 250000 225000 225000
## [113] 212500 210000 206000 200000 200000 200000 8175000
## [120] 7911948 7500000 7196656 7000000 6000000 5550000 5000000
## [127] 4250000 3000000 2700000 1850000 1617667 1600000 1200000
## [134] 1125000 1062500 1000000 550000 247000 237500 232500
## [141] 227500 225000 217500 205000 7500000 7000000 4500000
## [148] 4500000 4425000 4062500 4000000 3950000 3650000 3125000
## [155] 1975000 1816667 1300000 1100000 1000000 950000 700000
## [162] 550000 325000 300000 255000 240000 236000 205000
## [169] 200000 200000 200000 4000000 2300000 2300000 2250000
## [176] 2250000 1500000 1450000 1000000 962500 575000 550000
## [183] 525000 350000 325000 250000 250000 232500 232500
## [190] 225000 225000 220000 220000 215000 210000 205000
## [197] 205000 203000 202500 3500000 2000000 1500000 1200000
## [204] 1115000 875000 825000 700000 500000 500000 322500
## [211] 300000 285000 285000 245000 240000 230000 225000
## [218] 225000 225000 220000 202000 200000 200000 200000
## [225] 200000 12357143 12000000 10000000 7250000 7000000 6500000
## [232] 6350000 6000000 5250000 4800000 2400000 1950000 1916667
## [239] 1400000 1300000 1250000 1000000 800000 750000 350000
## [246] 250000 240000 213000 206650 203800 201000 200000
## [253] 200000 5600000 4000000 3103333 3050000 3050000 2750000
## [260] 2100000 1225000 900000 825000 750000 600000 500000
## [267] 500000 290000 260000 250000 240000 240000 240000
## [274] 228000 221000 219500 217000 211000 201000 200500
## [281] 7500000 6350000 6000000 5400000 4362500 4000000 4000000
## [288] 3950000 2750000 2225000 2000000 1850000 1500000 1450000
## [295] 1425000 950000 580000 500000 325000 290000 285000
## [302] 275000 275000 252500 215000 205000 9000000 7097962
## [309] 6250000 6000000 6000000 5945818 3300000 3000000 2947410
## [316] 2373439 2000000 1850000 1000000 800000 610000 525000
## [323] 425000 375000 350000 300000 295000 270000 265000
## [330] 265000 253000 237500 230000 200000 200000 200000
## [337] 200000 8620921 8600000 7500000 6500000 5000000 5000000
## [344] 4500000 4325000 3750000 3600000 3025000 2650000 2000000
## [351] 1100000 975000 750000 750000 305000 280000 250000
## [358] 245000 230000 220000 220000 200000 200000 10000000
## [365] 6600000 5500000 5500000 3025000 2500000 1800000 1450000
## [372] 900000 800000 800000 733333 700000 683333 500000
## [379] 500000 425000 395000 383333 375000 333333 310000
## [386] 220000 205000 200000 13350000 8500000 8000000 7000000
## [393] 5625000 5375000 5250000 3375000 3333333 2831000 2500000
## [400] 2250000 1916667 1833333 1725000 1500000 1100000 1005000
## [407] 1000000 762500 750000 500000 312500 300000 260000
## [414] 243500 215000 215000 11100000 9463237 8500000 8500000
## [421] 7600000 6250000 4750000 3700000 3700000 3365099 3000000
## [428] 3000000 2975000 1650000 1400000 1000000 750000 650000
## [435] 420000 400000 350000 290000 255000 242000 220000
## [442] 207500 200000 200000 200000 200000 11000000 6000000
## [449] 5737500 5300000 4600000 4500000 3833333 3500000 2500000
## [456] 2000000 1600000 1500000 1100000 1000000 1000000 776000
## [463] 690000 625000 600000 395000 300000 265000 245000
## [470] 235000 217500 210000 205000 205000 200000 200000
## [477] 9329700 7000000 5300000 4750000 3250000 1950000 1950000
## [484] 1700000 1600000 1400000 1400000 1100000 862500 650000
## [491] 630000 600000 500000 400000 400000 375000 300000
## [498] 300000 300000 220000 200000 200000 200000 12142857
## [505] 6350000 6250000 3700000 3600000 3416667 3183333 3000000
## [512] 3000000 2350000 2212500 1750000 1500000 1300000 1200000
## [519] 933333 750000 650000 625000 560000 525000 450000
## [526] 325000 300000 300000 270000 245000 222500 7000000
## [533] 4500000 1500000 500000 390000 380000 324000 318000
## [540] 315000 295000 295000 287000 269000 268000 265000
## [547] 265000 264000 255000 255000 245000 240000 225000
## [554] 208000 204000 203000 201000 201000 200000 6750000
## [561] 6666667 6500000 5250000 5032444 4500000 3200000 2400000
## [568] 1700000 1250000 1100000 1025000 800000 700000 700000
## [575] 500000 500000 350000 330000 300000 275000 260000
## [582] 255000 252500 240000 237500 215000 15714286 9916667
## [589] 9416667 6125000 5383333 5375000 5333333 4000000 3850000
## [596] 3700000 3500000 3000000 2250000 2000000 1450000 1400000
## [603] 1250000 1000000 750000 550000 500000 475000 300000
## [610] 250000 230000 205000 5000000 5000000 4462500 3333333
## [617] 2500000 2400000 2350000 1450000 1025000 900000 835000
## [624] 800000 775000 650000 612500 515000 350000 300000
## [631] 282000 275000 260000 258000 255000 250000 235000
## [638] 215000 210000 205000 201500 200500 200000 200000
## [645] 4125000 4000000 3633333 3500000 3500000 3200000 3000000
## [652] 1350000 825000 700000 600000 355000 350000 340000
## [659] 325000 280000 267500 265000 265000 255000 225000
## [666] 218000 210000 202500 201500 201500 200000 200000
## [673] 200000 12071429 8000000 7750000 5750000 5366667 5000000
## [680] 4375000 4333333 4225000 3633333 3437500 3350000 2250000
## [687] 2200000 2050014 2000000 750000 600000 500000 500000
## [694] 462500 260000 220000 215000 210000 5900000 5650000
## [701] 5066667 4833333 4200000 3000000 2933333 2750000 2016667
## [708] 1885000 1400000 1000000 750000 600000 550000 500000
## [715] 500000 500000 475000 390000 375000 345000 300000
## [722] 280000 250000 240000 210000 205000 203000 5625000
## [729] 3290000 2750000 2250000 2166667 2050000 1816667 1000000
## [736] 1000000 875000 750000 700000 550000 450000 400000
## [743] 400000 380000 335000 320000 300000 300000 285000
## [750] 285000 230000 215000 205000 6916667 6600000 6300000
## [757] 6100000 5750000 4583333 3750000 2600000 1750000 1750000
## [764] 875000 830000 800000 775000 700000 700000 600000
## [771] 450000 370000 360000 250000 250000 250000 230000
## [778] 230000 225000 215000 208000 203000 200000 10658826
## [785] 6000000 5500000 5500000 4750000 4000000 3000000 2450000
## [792] 1550000 1450000 1325000 1325000 1175000 1125000 1000000
## [799] 537500 462500 232000 225000 225000 220000 215000
## [806] 210000 202000 200000 200000 9333333 7600000 7417981
## [813] 6000000 4500000 4420840 2867542 2500000 2300000 2250000
## [820] 1750000 1600000 1366667 1100000 750000 750000 690000
## [827] 650000 600000 595000 550000 550000 400000 250000
## [834] 240000 220000 202500
We can then find the average salary in the year 2000:
mean(salaries[yearID == 2000, ]$salary)
## [1] 1992985
This gives us a good way to ask questions interactively about our data. But what if we want to look for a trend- for example, how baseball player salaries change over time? It would be a huge hassle to repeat this line of code for 1985, 1986, 1987 and so on, and them combine all of those. What we want is a way to perform this summary operation of averaging within each year, for every year in the dataset. It turns out the data.table package makes that easy:
Let's create a new data.table called summarized.year
, then we subset it using square brackets in a very particular way. Put nothing before the first comma. Now after the second comma, we say we want to take the mean of the salary column- but then we put another comma and by="yearID"
, which means we don't want to do it once for the whole dataset- we want to do it within each year. So we're computing the mean salary (mean(salary)
), within each year (by="yearID"
).
summarized.year = salaries[, mean(salary), by="yearID"]
## Warning in gmean(salary): Group 21 summed to more than type 'integer'
## can hold so the result has been coerced to 'numeric' automatically, for
## convenience.
This message in red is a warning: it's basically complaining that baseball players are paid too much. But you can ignore it, it doesn't have any effect on our results. Let's look at what ended up in summarized.year
:
summarized.year
## yearID V1
## 1: 1985 476299.4
## 2: 1986 417147.0
## 3: 1987 434729.5
## 4: 1988 453171.1
## 5: 1989 506323.1
## 6: 1990 511973.7
## 7: 1991 894961.2
## 8: 1992 1047520.6
## 9: 1993 976966.6
## 10: 1994 1049588.6
## 11: 1995 964979.1
## 12: 1996 1027909.3
## 13: 1997 1218687.4
## 14: 1998 1280844.6
## 15: 1999 1485316.8
## 16: 2000 1992984.6
## 17: 2001 2279841.1
## 18: 2002 2392526.6
## 19: 2003 2573472.9
## 20: 2004 2491776.1
## 21: 2005 2633830.8
## 22: 2006 2834520.9
## 23: 2007 2941435.9
## 24: 2008 3136517.1
## 25: 2009 3277647.0
## 26: 2010 3278746.8
## 27: 2011 3318838.2
## 28: 2012 3458421.2
## 29: 2013 3723344.4
## yearID V1
We now have two columns: one for year, which we were summarizing by, and one called V1. Every year has its own row, and this V1 was the result of this expression we put between the two commas.
What if we wanted to give it a more useful name, like "Average"? We can do that by changing what we put in our summarizing expression, by placing the list()
function between the commas.
summarized.year = salaries[, list(Average=mean(salary)), by="yearID"]
## Warning in gmean(salary): Group 21 summed to more than type 'integer'
## can hold so the result has been coerced to 'numeric' automatically, for
## convenience.
summarized.year
## yearID Average
## 1: 1985 476299.4
## 2: 1986 417147.0
## 3: 1987 434729.5
## 4: 1988 453171.1
## 5: 1989 506323.1
## 6: 1990 511973.7
## 7: 1991 894961.2
## 8: 1992 1047520.6
## 9: 1993 976966.6
## 10: 1994 1049588.6
## 11: 1995 964979.1
## 12: 1996 1027909.3
## 13: 1997 1218687.4
## 14: 1998 1280844.6
## 15: 1999 1485316.8
## 16: 2000 1992984.6
## 17: 2001 2279841.1
## 18: 2002 2392526.6
## 19: 2003 2573472.9
## 20: 2004 2491776.1
## 21: 2005 2633830.8
## 22: 2006 2834520.9
## 23: 2007 2941435.9
## 24: 2008 3136517.1
## 25: 2009 3277647.0
## 26: 2010 3278746.8
## 27: 2011 3318838.2
## 28: 2012 3458421.2
## 29: 2013 3723344.4
## yearID Average
This means we want to create a column called average that contains the mean salary within each year. (Ignore the warning again). Now you can see that the column is called Average
, which is more helpful. But our summary operation doesn't have to stop there. We can create two columns at the same time: one for the average, and one for the maximum:
summarized.year = salaries[, list(Average=mean(salary), Maximum=max(salary)), by="yearID"]
## Warning in gmean(salary): Group 21 summed to more than type 'integer'
## can hold so the result has been coerced to 'numeric' automatically, for
## convenience.
summarized.year
## yearID Average Maximum
## 1: 1985 476299.4 2130300
## 2: 1986 417147.0 2800000
## 3: 1987 434729.5 2127333
## 4: 1988 453171.1 2340000
## 5: 1989 506323.1 2766667
## 6: 1990 511973.7 3200000
## 7: 1991 894961.2 3800000
## 8: 1992 1047520.6 6100000
## 9: 1993 976966.6 6200000
## 10: 1994 1049588.6 6300000
## 11: 1995 964979.1 9237500
## 12: 1996 1027909.3 9237500
## 13: 1997 1218687.4 10000000
## 14: 1998 1280844.6 14936667
## 15: 1999 1485316.8 11949794
## 16: 2000 1992984.6 15714286
## 17: 2001 2279841.1 22000000
## 18: 2002 2392526.6 22000000
## 19: 2003 2573472.9 22000000
## 20: 2004 2491776.1 22500000
## 21: 2005 2633830.8 26000000
## 22: 2006 2834520.9 21680727
## 23: 2007 2941435.9 23428571
## 24: 2008 3136517.1 28000000
## 25: 2009 3277647.0 33000000
## 26: 2010 3278746.8 33000000
## 27: 2011 3318838.2 32000000
## 28: 2012 3458421.2 30000000
## 29: 2013 3723344.4 29000000
## yearID Average Maximum
Now you can see that we've created two columns along with the summarizing year: one with average, one with maximum.
We could add other columns as well, showing, for instance, the minimum or standard deviation or the salary within each year. Anything you put in this list will end up being a column.
You can group your summaries by any column in the data, not just the year. For example, you could summarize within each baseball league- those two leagues being the American League and the National League. Again we put a list()
of the columns we want to create within the commas, but this time put by="lgID"
.
summarized.lg = salaries[, list(Average=mean(salary), Maximum=max(salary)), by="lgID"]
## Warning in gmean(salary): Group 1 summed to more than type 'integer'
## can hold so the result has been coerced to 'numeric' automatically, for
## convenience.
summarized.lg
## lgID Average Maximum
## 1: AL 1891850 33000000
## 2: NL 1837917 25000000
Now you can see there are only two rows: one for the American League, one for the National League, and we can see the average and maximum salaries within each.
We've already summarized by year, and by league- we could also summarize by team. We would just change by="lgID"
to by="teamID"
.
summarized.team = salaries[, list(Average=mean(salary), Maximum=max(salary)), by="teamID"]
## Warning in gmean(salary): Group 2 summed to more than type 'integer'
## can hold so the result has been coerced to 'numeric' automatically, for
## convenience.
summarized.team
## teamID Average Maximum
## 1: BAL 1785712.3 17000000
## 2: BOS 2692113.9 22500000
## 3: CAL 739073.2 5375000
## 4: CHA 1992653.5 17000000
## 5: CLE 1525795.0 15000000
## 6: DET 1980835.0 23000000
## 7: KCA 1299025.8 13000000
## 8: MIN 1525031.7 23000000
## 9: ML4 613243.6 5875000
## 10: NYA 3608860.1 33000000
## 11: OAK 1303094.8 13500000
## 12: SEA 1932288.9 20557143
## 13: TEX 1874651.6 22000000
## 14: TOR 1768711.0 19700000
## 15: ATL 2130474.7 16061802
## 16: CHN 2185518.7 19000000
## 17: CIN 1568035.3 18910655
## 18: HOU 1705561.3 19369019
## 19: LAN 2346982.7 23854494
## 20: MON 707458.9 11500000
## 21: NYN 2317350.0 23145011
## 22: PHI 2092230.9 25000000
## 23: PIT 1077989.7 16500000
## 24: SDN 1317959.6 15505142
## 25: SFN 2044198.7 22250000
## 26: SLN 1928832.6 16333327
## 27: COL 1945628.5 20275000
## 28: FLO 1147986.4 14936667
## 29: ANA 1895109.2 13166667
## 30: TBA 1528399.5 10125000
## 31: ARI 2428195.9 16000000
## 32: MIL 2095009.0 15500000
## 33: LAA 4151107.2 26187500
## 34: WAS 2243755.2 16571429
## 35: MIA 2974115.7 19000000
## teamID Average Maximum
Now we can see one row for each team.
Note that this output is itself a data.table, just like the input was. You can process it or sort it just like you could any other. For example, you could filter it to look only at a particular range of years. Here we look for years after 2000:
summarized.year[yearID > 2000, ]
## yearID Average Maximum
## 1: 2001 2279841 22000000
## 2: 2002 2392527 22000000
## 3: 2003 2573473 22000000
## 4: 2004 2491776 22500000
## 5: 2005 2633831 26000000
## 6: 2006 2834521 21680727
## 7: 2007 2941436 23428571
## 8: 2008 3136517 28000000
## 9: 2009 3277647 33000000
## 10: 2010 3278747 33000000
## 11: 2011 3318838 32000000
## 12: 2012 3458421 30000000
## 13: 2013 3723344 29000000
Here we get a smaller subset of that summary. Similarly you can sort the output, for instance to find the highest paid and lowest paid teams.
summarized.team[order(Average), ]
## teamID Average Maximum
## 1: ML4 613243.6 5875000
## 2: MON 707458.9 11500000
## 3: CAL 739073.2 5375000
## 4: PIT 1077989.7 16500000
## 5: FLO 1147986.4 14936667
## 6: KCA 1299025.8 13000000
## 7: OAK 1303094.8 13500000
## 8: SDN 1317959.6 15505142
## 9: MIN 1525031.7 23000000
## 10: CLE 1525795.0 15000000
## 11: TBA 1528399.5 10125000
## 12: CIN 1568035.3 18910655
## 13: HOU 1705561.3 19369019
## 14: TOR 1768711.0 19700000
## 15: BAL 1785712.3 17000000
## 16: TEX 1874651.6 22000000
## 17: ANA 1895109.2 13166667
## 18: SLN 1928832.6 16333327
## 19: SEA 1932288.9 20557143
## 20: COL 1945628.5 20275000
## 21: DET 1980835.0 23000000
## 22: CHA 1992653.5 17000000
## 23: SFN 2044198.7 22250000
## 24: PHI 2092230.9 25000000
## 25: MIL 2095009.0 15500000
## 26: ATL 2130474.7 16061802
## 27: CHN 2185518.7 19000000
## 28: WAS 2243755.2 16571429
## 29: NYN 2317350.0 23145011
## 30: LAN 2346982.7 23854494
## 31: ARI 2428195.9 16000000
## 32: BOS 2692113.9 22500000
## 33: MIA 2974115.7 19000000
## 34: NYA 3608860.1 33000000
## 35: LAA 4151107.2 26187500
## teamID Average Maximum
Finally, we can group by more than one column in our analysis. Let's say we want to see the average salary within each year, separately for the two leagues: so we're summarizing by the league and year combination. To do this, we change the by argument to be a vector c("yearID", "lgID")
.
summarized.year.lg = salaries[, list(Average=mean(salary), Maximum=max(salary)), by=c("yearID", "lgID")]
summarized.year.lg
## yearID lgID Average Maximum
## 1: 1985 AL 455597.0 1795704
## 2: 1985 NL 500249.3 2130300
## 3: 1986 AL 402337.9 1984423
## 4: 1986 NL 433925.1 2800000
## 5: 1987 AL 441846.6 2110000
## 6: 1987 NL 427857.8 2127333
## 7: 1988 AL 453901.2 2305000
## 8: 1988 NL 452374.2 2340000
## 9: 1989 AL 502052.4 2766666
## 10: 1989 NL 511116.5 2766667
## 11: 1990 AL 500415.8 3200000
## 12: 1990 NL 525913.7 2513703
## 13: 1991 AL 908126.7 3791667
## 14: 1991 NL 879587.5 3800000
## 15: 1992 AL 1017651.1 5300000
## 16: 1992 NL 1085608.6 6100000
## 17: 1993 AL 1028575.6 5550000
## 18: 1993 NL 923883.0 6200000
## 19: 1994 AL 1130703.1 5550000
## 20: 1994 NL 971003.2 6300000
## 21: 1995 AL 1039864.5 9237500
## 22: 1995 NL 890698.7 8166666
## 23: 1996 AL 1055235.2 9237500
## 24: 1996 NL 1000406.7 8416667
## 25: 1997 AL 1267829.6 10000000
## 26: 1997 NL 1169651.4 8666667
## 27: 1998 AL 1364396.6 10000000
## 28: 1998 NL 1207658.0 14936667
## 29: 1999 AL 1503986.4 11949794
## 30: 1999 NL 1468880.6 10714286
## 31: 2000 AL 2004401.7 12868670
## 32: 2000 NL 1983096.5 15714286
## 33: 2001 AL 2333183.9 22000000
## 34: 2001 NL 2232801.3 15714286
## 35: 2002 AL 2449016.0 22000000
## 36: 2002 NL 2342579.4 15714286
## 37: 2003 AL 2524939.6 22000000
## 38: 2003 NL 2614933.1 17166667
## 39: 2004 AL 2517280.2 22500000
## 40: 2004 NL 2469002.5 18000000
## 41: 2005 AL 2681580.8 26000000
## 42: 2005 NL 2590986.6 22000000
## 43: 2006 AL 3088942.0 21680727
## 44: 2006 NL 2616445.7 19369019
## 45: 2007 AL 3304390.9 23428571
## 46: 2007 NL 2623749.2 16600000
## 47: 2008 AL 3449574.3 28000000
## 48: 2008 NL 2870790.4 18622809
## 49: 2009 AL 3380833.1 33000000
## 50: 2009 NL 3184368.7 23854494
## 51: 2010 AL 3431360.4 33000000
## 52: 2010 NL 3142161.2 20144707
## 53: 2011 AL 3505557.0 32000000
## 54: 2011 NL 3156654.9 21644707
## 55: 2012 AL 3662264.1 30000000
## 56: 2012 NL 3277278.0 23145011
## 57: 2013 AL 3757664.2 29000000
## 58: 2013 NL 3688940.2 25000000
## yearID lgID Average Maximum
Notice it has one row for each combination of a year and a league- for example, one for the year 1985 within AL (American League), and one for 1985 NL (National League), and has the average and maximum within each of these. You could go even farther and view the analysis within each team, within each year. Just change the lgID
to teamID
:
summarized.year.team = salaries[, list(Average=mean(salary), Maximum=max(salary)), by=c("yearID", "teamID")]
summarized.year.team
## yearID teamID Average Maximum
## 1: 1985 BAL 525486.9 1472819
## 2: 1985 BOS 435902.4 1075000
## 3: 1985 CAL 515281.9 1100000
## 4: 1985 CHA 468865.6 1242333
## 5: 1985 CLE 327583.3 1100000
## ---
## 824: 2013 PIT 2752214.3 16500000
## 825: 2013 SDN 2342339.3 9500000
## 826: 2013 SFN 5006440.5 22250000
## 827: 2013 SLN 3295003.9 16272110
## 828: 2013 WAS 4548130.8 16571429
Now, any of these summaries could be used as the result of an analysis, for example as a table in a presentation or paper. But they also make visualizing trends much easier. For example, let's say we want to examine the trend of how salary changes over time. We could produce a plot of all the points in the original salaries dataset, all 24 thousand combinations of players and years. For that we'll use ggplot2, which we covered in a previous segment:
library(ggplot2)
Perform this on our original dataset salaries
, putting on the x-axis yearID
and on the y-axis salary
.
ggplot(salaries, aes(yearID, salary)) + geom_point()
So in this graph we can see a positive trend in salary over time. However, all we can really see is the range of salaries: within this mass of points it's not possible to tell what the average is for each of these years. If we actually want to see how the average changed over time, we can instead plot the summarized data. We get this from summarized.year
:
summarized.year
## yearID Average Maximum
## 1: 1985 476299.4 2130300
## 2: 1986 417147.0 2800000
## 3: 1987 434729.5 2127333
## 4: 1988 453171.1 2340000
## 5: 1989 506323.1 2766667
## 6: 1990 511973.7 3200000
## 7: 1991 894961.2 3800000
## 8: 1992 1047520.6 6100000
## 9: 1993 976966.6 6200000
## 10: 1994 1049588.6 6300000
## 11: 1995 964979.1 9237500
## 12: 1996 1027909.3 9237500
## 13: 1997 1218687.4 10000000
## 14: 1998 1280844.6 14936667
## 15: 1999 1485316.8 11949794
## 16: 2000 1992984.6 15714286
## 17: 2001 2279841.1 22000000
## 18: 2002 2392526.6 22000000
## 19: 2003 2573472.9 22000000
## 20: 2004 2491776.1 22500000
## 21: 2005 2633830.8 26000000
## 22: 2006 2834520.9 21680727
## 23: 2007 2941435.9 23428571
## 24: 2008 3136517.1 28000000
## 25: 2009 3277647.0 33000000
## 26: 2010 3278746.8 33000000
## 27: 2011 3318838.2 32000000
## 28: 2012 3458421.2 30000000
## 29: 2013 3723344.4 29000000
## yearID Average Maximum
We have the year in one column and the average salary in another. So use that yearID
as our x-axis and Average
as our y-axis. Usually when we have one trend we put just a line:
ggplot(summarized.year, aes(yearID, Average)) + geom_line()
Now we can see the trend of the average over time. This comes directly from the summarized data, where we have the average per year.
We can even go farther- since we have it summarized by year and league, we can plot summarized.year.lg
instead. We still put yearID
on the x-axis and Average
on the y, but now we color based on the league ID (lgID
):
ggplot(summarized.year.lg, aes(yearID, Average, col=lgID)) + geom_line()
Now we can see two separate lines- one red for the American League, and one green for the National League.
Notice that data.table and ggplot2 have a natural synergy: data table lets you summarize your data to the extent you need to visualize what you want with ggplot2
. Together they work as two powerful tools for exploratory data analysis.
Let's take a closer look at the baseball player salary data that we've downloaded and imported into R. You can do that with
View(salaries)
Notice that the players are not represented by their actual first and last names- they're represented by some kind of ID. This ID looks pretty unhelpful: why not just put their names in that column?
The first reason is that there are multiple players in history that have the same name, and at that point if you used their names to identify them, it wouldn't be possible to tell them apart in the data. Meanwhile, these IDs are guaranteed to be unique per player. There are other advantages: for example, the player ID is shorter and therefore takes up less storage space- but the uniqueness is the most important. That ID can be used to connect this column to other datasets.
So what do I mean by other datasets? Well, the salary data is just one table within the Lahman baseball dataset. Let's load in a different one, from a slightly different URL. Go back to the line where we read in the salary data, but change Salary.csv
to Master.csv
, and save it to a variable called master
.
master = read.csv("http://dgrtwo.github.io/pages/lahman/Master.csv")
Before we do anything else, let's turn this data frame into a data table just like we did with salaries
:
master = as.data.table(master)
Let's take a look at the table:
View(master)
This is a master list of the baseball players based on their ID. Here in the first column you can see the playerIDs that appeared in the salaries data. But you can also see a lot of biographical information, like their birthday and birthplace, their weight and height, the date of their death, and most importantly, their full name. So now in one table (salaries
) we just have the ID, and in another table (master
) we have a way of getting from that name to their full biographical information. So, let's take the first name on this list: someone named David Aardsma. We could take this players ID, copy it, and extract just this player's salary:
salaries[playerID == "aardsda01", ]
## yearID teamID lgID playerID salary
## 1: 2004 SFN NL aardsda01 300000
## 2: 2007 CHA AL aardsda01 387500
## 3: 2008 BOS AL aardsda01 403250
## 4: 2009 SEA AL aardsda01 419000
## 5: 2010 SEA AL aardsda01 2750000
## 6: 2011 SEA AL aardsda01 4500000
## 7: 2012 NYA AL aardsda01 500000
Based on their ID, we were able to extract the years of their salary. You can see that David Aardsma played in seven years, you can see it rose from 2004 to 2010, before dropping back down again. So this is a way we can get one player's name and team along with his salary each year. But it's very clumsy to have to do it individually for each player. Luckily there's a much easier way to connect these two datasets: we can merge them, using the merge
function. Let's create a new merged dataset called merged.salaries
.
merged.salaries = merge(salaries, master, by="playerID")
The "by" argument defines what column we should use to merge them. In this case, that's what column is shared between them, which is playerID
.
merged.salaries
## playerID yearID teamID lgID salary birthYear birthMonth birthDay
## 1: aardsda01 2004 SFN NL 300000 1981 12 27
## 2: aardsda01 2007 CHA AL 387500 1981 12 27
## 3: aardsda01 2008 BOS AL 403250 1981 12 27
## 4: aardsda01 2009 SEA AL 419000 1981 12 27
## 5: aardsda01 2010 SEA AL 2750000 1981 12 27
## ---
## 23952: zumayjo01 2011 DET AL 1400000 1984 11 9
## 23953: zupcibo01 1991 BOS AL 100000 1966 8 18
## 23954: zupcibo01 1992 BOS AL 109000 1966 8 18
## 23955: zupcibo01 1993 BOS AL 222000 1966 8 18
## 23956: zuvelpa01 1989 ATL NL 145000 1958 10 31
## birthCountry birthState birthCity deathYear deathMonth deathDay
## 1: USA CO Denver NA NA NA
## 2: USA CO Denver NA NA NA
## 3: USA CO Denver NA NA NA
## 4: USA CO Denver NA NA NA
## 5: USA CO Denver NA NA NA
## ---
## 23952: USA CA Chula Vista NA NA NA
## 23953: USA PA Pittsburgh NA NA NA
## 23954: USA PA Pittsburgh NA NA NA
## 23955: USA PA Pittsburgh NA NA NA
## 23956: USA CA San Mateo NA NA NA
## deathCountry deathState deathCity nameFirst nameLast nameGiven
## 1: David Aardsma David Allan
## 2: David Aardsma David Allan
## 3: David Aardsma David Allan
## 4: David Aardsma David Allan
## 5: David Aardsma David Allan
## ---
## 23952: Joel Zumaya Joel Martin
## 23953: Bob Zupcic Robert
## 23954: Bob Zupcic Robert
## 23955: Bob Zupcic Robert
## 23956: Paul Zuvella Paul
## weight height bats throws debut finalGame retroID bbrefID
## 1: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 2: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 3: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 4: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 5: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## ---
## 23952: 215 75 R R 2006-04-03 2010-06-28 zumaj001 zumayjo01
## 23953: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23954: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23955: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23956: 173 72 R R 1982-09-04 1991-05-02 zuvep001 zuvelpa01
Notice that we can still see all the data from the salaries
dataset- the player ID, year, team, league, and salary. But each player's salary in each year- but we've also combined it with their biographical information- like their birthday and place, and most notably their name. So we've combined these two tables based on this common column: we have them all in one place. If you wanted to look for trends in salary- for instance, a connection of salary to a player's height, weight, or birth country- you now have all the information in one data table.
One note, having their first and last names as different columns is useful, but we'd like to combine them together into a new column, of first name-space-last name. One way we can create a new column in a data.table is with the :=
operator:
merged.salaries[, name:=paste(nameFirst, nameLast)]
## playerID yearID teamID lgID salary birthYear birthMonth birthDay
## 1: aardsda01 2004 SFN NL 300000 1981 12 27
## 2: aardsda01 2007 CHA AL 387500 1981 12 27
## 3: aardsda01 2008 BOS AL 403250 1981 12 27
## 4: aardsda01 2009 SEA AL 419000 1981 12 27
## 5: aardsda01 2010 SEA AL 2750000 1981 12 27
## ---
## 23952: zumayjo01 2011 DET AL 1400000 1984 11 9
## 23953: zupcibo01 1991 BOS AL 100000 1966 8 18
## 23954: zupcibo01 1992 BOS AL 109000 1966 8 18
## 23955: zupcibo01 1993 BOS AL 222000 1966 8 18
## 23956: zuvelpa01 1989 ATL NL 145000 1958 10 31
## birthCountry birthState birthCity deathYear deathMonth deathDay
## 1: USA CO Denver NA NA NA
## 2: USA CO Denver NA NA NA
## 3: USA CO Denver NA NA NA
## 4: USA CO Denver NA NA NA
## 5: USA CO Denver NA NA NA
## ---
## 23952: USA CA Chula Vista NA NA NA
## 23953: USA PA Pittsburgh NA NA NA
## 23954: USA PA Pittsburgh NA NA NA
## 23955: USA PA Pittsburgh NA NA NA
## 23956: USA CA San Mateo NA NA NA
## deathCountry deathState deathCity nameFirst nameLast nameGiven
## 1: David Aardsma David Allan
## 2: David Aardsma David Allan
## 3: David Aardsma David Allan
## 4: David Aardsma David Allan
## 5: David Aardsma David Allan
## ---
## 23952: Joel Zumaya Joel Martin
## 23953: Bob Zupcic Robert
## 23954: Bob Zupcic Robert
## 23955: Bob Zupcic Robert
## 23956: Paul Zuvella Paul
## weight height bats throws debut finalGame retroID bbrefID
## 1: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 2: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 3: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 4: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 5: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## ---
## 23952: 215 75 R R 2006-04-03 2010-06-28 zumaj001 zumayjo01
## 23953: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23954: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23955: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23956: 173 72 R R 1982-09-04 1991-05-02 zuvep001 zuvelpa01
## name
## 1: David Aardsma
## 2: David Aardsma
## 3: David Aardsma
## 4: David Aardsma
## 5: David Aardsma
## ---
## 23952: Joel Zumaya
## 23953: Bob Zupcic
## 23954: Bob Zupcic
## 23955: Bob Zupcic
## 23956: Paul Zuvella
This means assign a new column, name
, and now we can give it a value based on other columns in the dataset. The paste
function is a useful function in R for combining two vectors of strings to be separated by spaces. If we put nameFirst
and nameLast
, because we're within the data.table, that we want to combine those two names into a new name.
merged.salaries
## playerID yearID teamID lgID salary birthYear birthMonth birthDay
## 1: aardsda01 2004 SFN NL 300000 1981 12 27
## 2: aardsda01 2007 CHA AL 387500 1981 12 27
## 3: aardsda01 2008 BOS AL 403250 1981 12 27
## 4: aardsda01 2009 SEA AL 419000 1981 12 27
## 5: aardsda01 2010 SEA AL 2750000 1981 12 27
## ---
## 23952: zumayjo01 2011 DET AL 1400000 1984 11 9
## 23953: zupcibo01 1991 BOS AL 100000 1966 8 18
## 23954: zupcibo01 1992 BOS AL 109000 1966 8 18
## 23955: zupcibo01 1993 BOS AL 222000 1966 8 18
## 23956: zuvelpa01 1989 ATL NL 145000 1958 10 31
## birthCountry birthState birthCity deathYear deathMonth deathDay
## 1: USA CO Denver NA NA NA
## 2: USA CO Denver NA NA NA
## 3: USA CO Denver NA NA NA
## 4: USA CO Denver NA NA NA
## 5: USA CO Denver NA NA NA
## ---
## 23952: USA CA Chula Vista NA NA NA
## 23953: USA PA Pittsburgh NA NA NA
## 23954: USA PA Pittsburgh NA NA NA
## 23955: USA PA Pittsburgh NA NA NA
## 23956: USA CA San Mateo NA NA NA
## deathCountry deathState deathCity nameFirst nameLast nameGiven
## 1: David Aardsma David Allan
## 2: David Aardsma David Allan
## 3: David Aardsma David Allan
## 4: David Aardsma David Allan
## 5: David Aardsma David Allan
## ---
## 23952: Joel Zumaya Joel Martin
## 23953: Bob Zupcic Robert
## 23954: Bob Zupcic Robert
## 23955: Bob Zupcic Robert
## 23956: Paul Zuvella Paul
## weight height bats throws debut finalGame retroID bbrefID
## 1: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 2: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 3: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 4: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## 5: 205 75 R R 2004-04-06 2013-09-28 aardd001 aardsda01
## ---
## 23952: 215 75 R R 2006-04-03 2010-06-28 zumaj001 zumayjo01
## 23953: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23954: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23955: 220 76 R R 1991-09-07 1994-08-04 zupcb001 zupcibo01
## 23956: 173 72 R R 1982-09-04 1991-05-02 zuvep001 zuvelpa01
## name
## 1: David Aardsma
## 2: David Aardsma
## 3: David Aardsma
## 4: David Aardsma
## 5: David Aardsma
## ---
## 23952: Joel Zumaya
## 23953: Bob Zupcic
## 23954: Bob Zupcic
## 23955: Bob Zupcic
## 23956: Paul Zuvella
You can see that we've added a new column, name
.
Merging can sometimes be a bit more complicated. For example, let's bring in one more dataset, this one a history of each player's batting statistics for each year. To do that, take the earlier read.csv
line, change Master.csv
to Batting.csv
, and save it into a variable called batting
:
batting = read.csv("http://dgrtwo.github.io/pages/lahman/Batting.csv")
Then let's turn it into a data.table, just like we did for master
:
batting = as.data.table(batting)
batting
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B
## 1: aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0
## 2: aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0
## 3: aardsda01 2007 1 CHA AL 25 2 0 0 0 0 0
## 4: aardsda01 2008 1 BOS AL 47 5 1 0 0 0 0
## 5: aardsda01 2009 1 SEA AL 73 3 0 0 0 0 0
## ---
## 97885: zimmejo02 2013 1 WAS NL 32 32 65 4 8 1 0
## 97886: zimmery01 2013 1 WAS NL 147 147 568 84 156 26 2
## 97887: zitoba01 2013 1 SFN NL 30 30 34 3 5 0 0
## 97888: zobribe01 2013 1 TBA AL 157 157 612 77 168 36 3
## 97889: zuninmi01 2013 1 SEA AL 52 52 173 22 37 5 0
## HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old
## 1: 0 0 0 0 0 0 0 0 0 0 0 11
## 2: 0 0 0 0 0 0 0 0 1 0 0 45
## 3: 0 0 0 0 0 0 0 0 0 0 0 2
## 4: 0 0 0 0 0 1 0 0 0 0 0 5
## 5: 0 0 0 0 0 0 0 0 0 0 0 NA
## ---
## 97885: 0 2 0 0 1 20 0 0 6 1 0 NA
## 97886: 26 79 6 0 60 133 2 2 0 3 16 NA
## 97887: 0 2 0 0 0 8 0 0 9 0 1 NA
## 97888: 12 71 11 3 72 91 4 7 1 6 18 NA
## 97889: 5 14 1 0 16 49 0 3 0 1 5 NA
This is the most complex dataset yet. Here, like the salary data, we have one row per player per year, and their team ID and league ID. But we also have many statistics summarizing how well he did at batting that year. For instance, G represents how many games the player played in, AB represents the number of times a player went up to bat (how many chances they had to get a hit), H represents the number of hits, and HR represents the number of home runs he scored (hitting the ball out of the park, which gets a run in just one hit).
Now, let's say we want to combine this data with the salary data- for example so we can see how salary is correlated with performance. First, notice that the salary table and the batting table don't share only one column of player ID: they share four: playerID, teamID, leagueID and yearID. That's because we have multiple batting statistics and salary for each single player. This means we won't just be merging by player: we'll be merging them based on the combination of all four columns.
The way we do that is with the by
argument to merged
. Instead of giving just the playerID
to by
, we give a vector of the four shared columns.
merged.batting = merge(batting, salaries, by=c("playerID", "yearID", "teamID", "lgID"))
merged.batting
## yearID teamID lgID playerID stint G G_batting AB R H X2B X3B
## 1: 2004 SFN NL aardsda01 1 11 11 0 0 0 0 0
## 2: 2007 CHA AL aardsda01 1 25 2 0 0 0 0 0
## 3: 2008 BOS AL aardsda01 1 47 5 1 0 0 0 0
## 4: 2009 SEA AL aardsda01 1 73 3 0 0 0 0 0
## 5: 2010 SEA AL aardsda01 1 53 4 0 0 0 0 0
## ---
## 22867: 2009 DET AL zumayjo01 1 29 3 0 0 0 0 0
## 22868: 2010 DET AL zumayjo01 1 31 4 0 0 0 0 0
## 22869: 1991 BOS AL zupcibo01 1 18 18 25 3 4 0 0
## 22870: 1992 BOS AL zupcibo01 1 124 124 392 46 108 19 1
## 22871: 1993 BOS AL zupcibo01 1 141 141 286 40 69 24 2
## HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary
## 1: 0 0 0 0 0 0 0 0 0 0 0 11 300000
## 2: 0 0 0 0 0 0 0 0 0 0 0 2 387500
## 3: 0 0 0 0 0 1 0 0 0 0 0 5 403250
## 4: 0 0 0 0 0 0 0 0 0 0 0 NA 419000
## 5: 0 0 0 0 0 0 0 0 0 0 0 NA 2750000
## ---
## 22867: 0 0 0 0 0 0 0 0 0 0 0 NA 735000
## 22868: 0 0 0 0 0 0 0 0 0 0 0 NA 915000
## 22869: 1 3 0 0 1 6 0 0 1 0 0 18 100000
## 22870: 3 43 2 2 25 60 1 4 7 4 6 124 109000
## 22871: 2 26 5 2 27 54 2 2 8 3 7 141 222000
Now it has all the information that was in the batting dataset, but it also added a column for salary. Another thing to note is that we don't have salary information on every player in every year: in particular, we've lost all information on players before 1985. There is a way we can fix this, by adding the all.x
option to the merge
function:
merged.batting = merge(batting, salaries, by=c("playerID", "yearID", "teamID", "lgID"), all.x=TRUE)
merged.batting
## yearID teamID lgID playerID stint G G_batting AB R H X2B X3B
## 1: 2004 SFN NL aardsda01 1 11 11 0 0 0 0 0
## 2: 2006 CHN NL aardsda01 1 45 43 2 0 0 0 0
## 3: 2007 CHA AL aardsda01 1 25 2 0 0 0 0 0
## 4: 2008 BOS AL aardsda01 1 47 5 1 0 0 0 0
## 5: 2009 SEA AL aardsda01 1 73 3 0 0 0 0 0
## ---
## 97885: 1959 BAL AL zuverge01 1 6 6 0 0 0 0 0
## 97886: 1910 CHA AL zwilldu01 1 27 27 87 7 16 5 0
## 97887: 1914 CHF FL zwilldu01 1 154 154 592 91 185 38 8
## 97888: 1915 CHF FL zwilldu01 1 150 150 548 65 157 32 7
## 97889: 1916 CHN NL zwilldu01 1 35 35 53 4 6 1 0
## HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary
## 1: 0 0 0 0 0 0 0 0 0 0 0 11 300000
## 2: 0 0 0 0 0 0 0 0 1 0 0 45 NA
## 3: 0 0 0 0 0 0 0 0 0 0 0 2 387500
## 4: 0 0 0 0 0 1 0 0 0 0 0 5 403250
## 5: 0 0 0 0 0 0 0 0 0 0 0 NA 419000
## ---
## 97885: 0 0 0 0 2 0 0 0 0 0 0 6 NA
## 97886: 0 5 1 NA 11 NA NA 1 1 NA NA 27 NA
## 97887: 16 95 21 NA 46 68 NA 1 10 NA NA 154 NA
## 97888: 13 94 24 NA 67 65 NA 2 18 NA NA 150 NA
## 97889: 1 8 0 NA 4 6 NA 0 2 NA NA 35 NA
This means "keep everything in the first dataset we're merging," which is batting
(all.y
would mean "keep everything in the second dataset"). Notice now that now all rows have information in the salary
column: some have NA, which means "missing value," or "not applicable." So notice that all the rows where we have salary data get to keep their value, while all the ones that don't get filled in by the missing value NA.
Now we can take this merged dataset and merge it with our biographical data in the master list. Here that would be
merged.all = merge(merged.batting, master, by="playerID")
merged.all
## playerID yearID teamID lgID stint G G_batting AB R H X2B X3B
## 1: aardsda01 2004 SFN NL 1 11 11 0 0 0 0 0
## 2: aardsda01 2006 CHN NL 1 45 43 2 0 0 0 0
## 3: aardsda01 2007 CHA AL 1 25 2 0 0 0 0 0
## 4: aardsda01 2008 BOS AL 1 47 5 1 0 0 0 0
## 5: aardsda01 2009 SEA AL 1 73 3 0 0 0 0 0
## ---
## 97885: zuverge01 1959 BAL AL 1 6 6 0 0 0 0 0
## 97886: zwilldu01 1910 CHA AL 1 27 27 87 7 16 5 0
## 97887: zwilldu01 1914 CHF FL 1 154 154 592 91 185 38 8
## 97888: zwilldu01 1915 CHF FL 1 150 150 548 65 157 32 7
## 97889: zwilldu01 1916 CHN NL 1 35 35 53 4 6 1 0
## HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary birthYear
## 1: 0 0 0 0 0 0 0 0 0 0 0 11 300000 1981
## 2: 0 0 0 0 0 0 0 0 1 0 0 45 NA 1981
## 3: 0 0 0 0 0 0 0 0 0 0 0 2 387500 1981
## 4: 0 0 0 0 0 1 0 0 0 0 0 5 403250 1981
## 5: 0 0 0 0 0 0 0 0 0 0 0 NA 419000 1981
## ---
## 97885: 0 0 0 0 2 0 0 0 0 0 0 6 NA 1924
## 97886: 0 5 1 NA 11 NA NA 1 1 NA NA 27 NA 1888
## 97887: 16 95 21 NA 46 68 NA 1 10 NA NA 154 NA 1888
## 97888: 13 94 24 NA 67 65 NA 2 18 NA NA 150 NA 1888
## 97889: 1 8 0 NA 4 6 NA 0 2 NA NA 35 NA 1888
## birthMonth birthDay birthCountry birthState birthCity deathYear
## 1: 12 27 USA CO Denver NA
## 2: 12 27 USA CO Denver NA
## 3: 12 27 USA CO Denver NA
## 4: 12 27 USA CO Denver NA
## 5: 12 27 USA CO Denver NA
## ---
## 97885: 8 20 USA MI Holland NA
## 97886: 11 2 USA MO St. Louis 1978
## 97887: 11 2 USA MO St. Louis 1978
## 97888: 11 2 USA MO St. Louis 1978
## 97889: 11 2 USA MO St. Louis 1978
## deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1: NA NA David
## 2: NA NA David
## 3: NA NA David
## 4: NA NA David
## 5: NA NA David
## ---
## 97885: NA NA George
## 97886: 3 27 USA CA La Crescenta Dutch
## 97887: 3 27 USA CA La Crescenta Dutch
## 97888: 3 27 USA CA La Crescenta Dutch
## 97889: 3 27 USA CA La Crescenta Dutch
## nameLast nameGiven weight height bats throws debut
## 1: Aardsma David Allan 205 75 R R 2004-04-06
## 2: Aardsma David Allan 205 75 R R 2004-04-06
## 3: Aardsma David Allan 205 75 R R 2004-04-06
## 4: Aardsma David Allan 205 75 R R 2004-04-06
## 5: Aardsma David Allan 205 75 R R 2004-04-06
## ---
## 97885: Zuverink George 195 76 R R 1951-04-21
## 97886: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 97887: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 97888: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 97889: Zwilling Edward Harrison 160 66 L L 1910-08-14
## finalGame retroID bbrefID
## 1: 2013-09-28 aardd001 aardsda01
## 2: 2013-09-28 aardd001 aardsda01
## 3: 2013-09-28 aardd001 aardsda01
## 4: 2013-09-28 aardd001 aardsda01
## 5: 2013-09-28 aardd001 aardsda01
## ---
## 97885: 1959-06-15 zuveg101 zuverge01
## 97886: 1916-07-12 zwild101 zwilldu01
## 97887: 1916-07-12 zwild101 zwilldu01
## 97888: 1916-07-12 zwild101 zwilldu01
## 97889: 1916-07-12 zwild101 zwilldu01
Now we see we still have the same batting information, but we also have the biographical information from the master list- for example, each player's real name. We've created one mega-dataset covering all three kinds of information. The Lahman baseball dataset contains a lot more information, including player's fielding statistics, presence in the Hall of Fame, pitchers, managers, and so on, all sharing these same IDs. By merging these datasets in the right way, you can answer very complex and interesting questions.
So let's wrap up by taking all these tools together on our mega-merged dataset. Just like any other dataset, we can filter and process this. For example, this dataset includes pitchers, who might never go up to bat in a whole season. That could end up skewing our analysis.
head(merged.all)
## playerID yearID teamID lgID stint G G_batting AB R H X2B X3B HR RBI
## 1: aardsda01 2004 SFN NL 1 11 11 0 0 0 0 0 0 0
## 2: aardsda01 2006 CHN NL 1 45 43 2 0 0 0 0 0 0
## 3: aardsda01 2007 CHA AL 1 25 2 0 0 0 0 0 0 0
## 4: aardsda01 2008 BOS AL 1 47 5 1 0 0 0 0 0 0
## 5: aardsda01 2009 SEA AL 1 73 3 0 0 0 0 0 0 0
## 6: aardsda01 2010 SEA AL 1 53 4 0 0 0 0 0 0 0
## SB CS BB SO IBB HBP SH SF GIDP G_old salary birthYear birthMonth
## 1: 0 0 0 0 0 0 0 0 0 11 300000 1981 12
## 2: 0 0 0 0 0 0 1 0 0 45 NA 1981 12
## 3: 0 0 0 0 0 0 0 0 0 2 387500 1981 12
## 4: 0 0 0 1 0 0 0 0 0 5 403250 1981 12
## 5: 0 0 0 0 0 0 0 0 0 NA 419000 1981 12
## 6: 0 0 0 0 0 0 0 0 0 NA 2750000 1981 12
## birthDay birthCountry birthState birthCity deathYear deathMonth
## 1: 27 USA CO Denver NA NA
## 2: 27 USA CO Denver NA NA
## 3: 27 USA CO Denver NA NA
## 4: 27 USA CO Denver NA NA
## 5: 27 USA CO Denver NA NA
## 6: 27 USA CO Denver NA NA
## deathDay deathCountry deathState deathCity nameFirst nameLast
## 1: NA David Aardsma
## 2: NA David Aardsma
## 3: NA David Aardsma
## 4: NA David Aardsma
## 5: NA David Aardsma
## 6: NA David Aardsma
## nameGiven weight height bats throws debut finalGame retroID
## 1: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 2: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 3: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 4: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 5: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 6: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## bbrefID
## 1: aardsda01
## 2: aardsda01
## 3: aardsda01
## 4: aardsda01
## 5: aardsda01
## 6: aardsda01
An example would be David Aardsma, who in many years never even had a single At Bat (AB is 0). We can start by filtering out all the years in which someone has no At Bats.
merged.all = merged.all[AB > 0, ]
merged.all
## playerID yearID teamID lgID stint G G_batting AB R H X2B
## 1: aardsda01 2006 CHN NL 1 45 43 2 0 0 0
## 2: aardsda01 2008 BOS AL 1 47 5 1 0 0 0
## 3: aaronha01 1954 ML1 NL 1 122 122 468 58 131 27
## 4: aaronha01 1955 ML1 NL 1 153 153 602 105 189 37
## 5: aaronha01 1956 ML1 NL 1 153 153 609 106 200 34
## ---
## 84365: zuverge01 1958 BAL AL 1 45 45 9 0 2 0
## 84366: zwilldu01 1910 CHA AL 1 27 27 87 7 16 5
## 84367: zwilldu01 1914 CHF FL 1 154 154 592 91 185 38
## 84368: zwilldu01 1915 CHF FL 1 150 150 548 65 157 32
## 84369: zwilldu01 1916 CHN NL 1 35 35 53 4 6 1
## X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary birthYear
## 1: 0 0 0 0 0 0 0 0 0 1 0 0 45 NA 1981
## 2: 0 0 0 0 0 0 1 0 0 0 0 0 5 403250 1981
## 3: 6 13 69 2 2 28 39 NA 3 6 4 13 122 NA 1934
## 4: 9 27 106 3 1 49 61 5 3 7 4 20 153 NA 1934
## 5: 14 26 92 2 4 37 54 6 2 5 7 21 153 NA 1934
## ---
## 84365: 1 0 2 0 0 1 2 0 0 0 0 0 45 NA 1924
## 84366: 0 0 5 1 NA 11 NA NA 1 1 NA NA 27 NA 1888
## 84367: 8 16 95 21 NA 46 68 NA 1 10 NA NA 154 NA 1888
## 84368: 7 13 94 24 NA 67 65 NA 2 18 NA NA 150 NA 1888
## 84369: 0 1 8 0 NA 4 6 NA 0 2 NA NA 35 NA 1888
## birthMonth birthDay birthCountry birthState birthCity deathYear
## 1: 12 27 USA CO Denver NA
## 2: 12 27 USA CO Denver NA
## 3: 2 5 USA AL Mobile NA
## 4: 2 5 USA AL Mobile NA
## 5: 2 5 USA AL Mobile NA
## ---
## 84365: 8 20 USA MI Holland NA
## 84366: 11 2 USA MO St. Louis 1978
## 84367: 11 2 USA MO St. Louis 1978
## 84368: 11 2 USA MO St. Louis 1978
## 84369: 11 2 USA MO St. Louis 1978
## deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1: NA NA David
## 2: NA NA David
## 3: NA NA Hank
## 4: NA NA Hank
## 5: NA NA Hank
## ---
## 84365: NA NA George
## 84366: 3 27 USA CA La Crescenta Dutch
## 84367: 3 27 USA CA La Crescenta Dutch
## 84368: 3 27 USA CA La Crescenta Dutch
## 84369: 3 27 USA CA La Crescenta Dutch
## nameLast nameGiven weight height bats throws debut
## 1: Aardsma David Allan 205 75 R R 2004-04-06
## 2: Aardsma David Allan 205 75 R R 2004-04-06
## 3: Aaron Henry Louis 180 72 R R 1954-04-13
## 4: Aaron Henry Louis 180 72 R R 1954-04-13
## 5: Aaron Henry Louis 180 72 R R 1954-04-13
## ---
## 84365: Zuverink George 195 76 R R 1951-04-21
## 84366: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84367: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84368: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84369: Zwilling Edward Harrison 160 66 L L 1910-08-14
## finalGame retroID bbrefID
## 1: 2013-09-28 aardd001 aardsda01
## 2: 2013-09-28 aardd001 aardsda01
## 3: 1976-10-03 aaroh101 aaronha01
## 4: 1976-10-03 aaroh101 aaronha01
## 5: 1976-10-03 aaroh101 aaronha01
## ---
## 84365: 1959-06-15 zuveg101 zuverge01
## 84366: 1916-07-12 zwild101 zwilldu01
## 84367: 1916-07-12 zwild101 zwilldu01
## 84368: 1916-07-12 zwild101 zwilldu01
## 84369: 1916-07-12 zwild101 zwilldu01
Now we can see that all At Bat's are at least 1.
Now, one thing baseball fans like looking for is career records. That means we want to summarize across all the years that a batter played, and find, for example, the total number of home runs each player hit. Recall that we learned to do that with "by". For example:
summarized.batters = merged.all[, list(Total.HR=sum(HR)), by="playerID"]
Here we create one column, Total.HR
, which we define as the sum of home runs for each player, and we tell it to perform these summaries on each player individually.
summarized.batters
## playerID Total.HR
## 1: aardsda01 0
## 2: aaronha01 755
## 3: aaronto01 13
## 4: aasedo01 0
## 5: abadan01 0
## ---
## 16336: zupcibo01 7
## 16337: zupofr01 0
## 16338: zuvelpa01 2
## 16339: zuverge01 0
## 16340: zwilldu01 30
Now we can see that we've created a new data.table that contains each player's ID and their total career home runs. But in the process, since the only thing we're summarizing by is the player ID, we lost track of their actual first and last names. There's a simple way around that. First, recall that we can create a new column that combines the players' first and last names using paste
and :=
, and let's try the same trick again, this time on merged.all
:
merged.all[, name:=paste(nameFirst, nameLast)]
## playerID yearID teamID lgID stint G G_batting AB R H X2B
## 1: aardsda01 2006 CHN NL 1 45 43 2 0 0 0
## 2: aardsda01 2008 BOS AL 1 47 5 1 0 0 0
## 3: aaronha01 1954 ML1 NL 1 122 122 468 58 131 27
## 4: aaronha01 1955 ML1 NL 1 153 153 602 105 189 37
## 5: aaronha01 1956 ML1 NL 1 153 153 609 106 200 34
## ---
## 84365: zuverge01 1958 BAL AL 1 45 45 9 0 2 0
## 84366: zwilldu01 1910 CHA AL 1 27 27 87 7 16 5
## 84367: zwilldu01 1914 CHF FL 1 154 154 592 91 185 38
## 84368: zwilldu01 1915 CHF FL 1 150 150 548 65 157 32
## 84369: zwilldu01 1916 CHN NL 1 35 35 53 4 6 1
## X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary birthYear
## 1: 0 0 0 0 0 0 0 0 0 1 0 0 45 NA 1981
## 2: 0 0 0 0 0 0 1 0 0 0 0 0 5 403250 1981
## 3: 6 13 69 2 2 28 39 NA 3 6 4 13 122 NA 1934
## 4: 9 27 106 3 1 49 61 5 3 7 4 20 153 NA 1934
## 5: 14 26 92 2 4 37 54 6 2 5 7 21 153 NA 1934
## ---
## 84365: 1 0 2 0 0 1 2 0 0 0 0 0 45 NA 1924
## 84366: 0 0 5 1 NA 11 NA NA 1 1 NA NA 27 NA 1888
## 84367: 8 16 95 21 NA 46 68 NA 1 10 NA NA 154 NA 1888
## 84368: 7 13 94 24 NA 67 65 NA 2 18 NA NA 150 NA 1888
## 84369: 0 1 8 0 NA 4 6 NA 0 2 NA NA 35 NA 1888
## birthMonth birthDay birthCountry birthState birthCity deathYear
## 1: 12 27 USA CO Denver NA
## 2: 12 27 USA CO Denver NA
## 3: 2 5 USA AL Mobile NA
## 4: 2 5 USA AL Mobile NA
## 5: 2 5 USA AL Mobile NA
## ---
## 84365: 8 20 USA MI Holland NA
## 84366: 11 2 USA MO St. Louis 1978
## 84367: 11 2 USA MO St. Louis 1978
## 84368: 11 2 USA MO St. Louis 1978
## 84369: 11 2 USA MO St. Louis 1978
## deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1: NA NA David
## 2: NA NA David
## 3: NA NA Hank
## 4: NA NA Hank
## 5: NA NA Hank
## ---
## 84365: NA NA George
## 84366: 3 27 USA CA La Crescenta Dutch
## 84367: 3 27 USA CA La Crescenta Dutch
## 84368: 3 27 USA CA La Crescenta Dutch
## 84369: 3 27 USA CA La Crescenta Dutch
## nameLast nameGiven weight height bats throws debut
## 1: Aardsma David Allan 205 75 R R 2004-04-06
## 2: Aardsma David Allan 205 75 R R 2004-04-06
## 3: Aaron Henry Louis 180 72 R R 1954-04-13
## 4: Aaron Henry Louis 180 72 R R 1954-04-13
## 5: Aaron Henry Louis 180 72 R R 1954-04-13
## ---
## 84365: Zuverink George 195 76 R R 1951-04-21
## 84366: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84367: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84368: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84369: Zwilling Edward Harrison 160 66 L L 1910-08-14
## finalGame retroID bbrefID name
## 1: 2013-09-28 aardd001 aardsda01 David Aardsma
## 2: 2013-09-28 aardd001 aardsda01 David Aardsma
## 3: 1976-10-03 aaroh101 aaronha01 Hank Aaron
## 4: 1976-10-03 aaroh101 aaronha01 Hank Aaron
## 5: 1976-10-03 aaroh101 aaronha01 Hank Aaron
## ---
## 84365: 1959-06-15 zuveg101 zuverge01 George Zuverink
## 84366: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
## 84367: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
## 84368: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
## 84369: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
Now we've added to merged.all
a name
column:
merged.all
## playerID yearID teamID lgID stint G G_batting AB R H X2B
## 1: aardsda01 2006 CHN NL 1 45 43 2 0 0 0
## 2: aardsda01 2008 BOS AL 1 47 5 1 0 0 0
## 3: aaronha01 1954 ML1 NL 1 122 122 468 58 131 27
## 4: aaronha01 1955 ML1 NL 1 153 153 602 105 189 37
## 5: aaronha01 1956 ML1 NL 1 153 153 609 106 200 34
## ---
## 84365: zuverge01 1958 BAL AL 1 45 45 9 0 2 0
## 84366: zwilldu01 1910 CHA AL 1 27 27 87 7 16 5
## 84367: zwilldu01 1914 CHF FL 1 154 154 592 91 185 38
## 84368: zwilldu01 1915 CHF FL 1 150 150 548 65 157 32
## 84369: zwilldu01 1916 CHN NL 1 35 35 53 4 6 1
## X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary birthYear
## 1: 0 0 0 0 0 0 0 0 0 1 0 0 45 NA 1981
## 2: 0 0 0 0 0 0 1 0 0 0 0 0 5 403250 1981
## 3: 6 13 69 2 2 28 39 NA 3 6 4 13 122 NA 1934
## 4: 9 27 106 3 1 49 61 5 3 7 4 20 153 NA 1934
## 5: 14 26 92 2 4 37 54 6 2 5 7 21 153 NA 1934
## ---
## 84365: 1 0 2 0 0 1 2 0 0 0 0 0 45 NA 1924
## 84366: 0 0 5 1 NA 11 NA NA 1 1 NA NA 27 NA 1888
## 84367: 8 16 95 21 NA 46 68 NA 1 10 NA NA 154 NA 1888
## 84368: 7 13 94 24 NA 67 65 NA 2 18 NA NA 150 NA 1888
## 84369: 0 1 8 0 NA 4 6 NA 0 2 NA NA 35 NA 1888
## birthMonth birthDay birthCountry birthState birthCity deathYear
## 1: 12 27 USA CO Denver NA
## 2: 12 27 USA CO Denver NA
## 3: 2 5 USA AL Mobile NA
## 4: 2 5 USA AL Mobile NA
## 5: 2 5 USA AL Mobile NA
## ---
## 84365: 8 20 USA MI Holland NA
## 84366: 11 2 USA MO St. Louis 1978
## 84367: 11 2 USA MO St. Louis 1978
## 84368: 11 2 USA MO St. Louis 1978
## 84369: 11 2 USA MO St. Louis 1978
## deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1: NA NA David
## 2: NA NA David
## 3: NA NA Hank
## 4: NA NA Hank
## 5: NA NA Hank
## ---
## 84365: NA NA George
## 84366: 3 27 USA CA La Crescenta Dutch
## 84367: 3 27 USA CA La Crescenta Dutch
## 84368: 3 27 USA CA La Crescenta Dutch
## 84369: 3 27 USA CA La Crescenta Dutch
## nameLast nameGiven weight height bats throws debut
## 1: Aardsma David Allan 205 75 R R 2004-04-06
## 2: Aardsma David Allan 205 75 R R 2004-04-06
## 3: Aaron Henry Louis 180 72 R R 1954-04-13
## 4: Aaron Henry Louis 180 72 R R 1954-04-13
## 5: Aaron Henry Louis 180 72 R R 1954-04-13
## ---
## 84365: Zuverink George 195 76 R R 1951-04-21
## 84366: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84367: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84368: Zwilling Edward Harrison 160 66 L L 1910-08-14
## 84369: Zwilling Edward Harrison 160 66 L L 1910-08-14
## finalGame retroID bbrefID name
## 1: 2013-09-28 aardd001 aardsda01 David Aardsma
## 2: 2013-09-28 aardd001 aardsda01 David Aardsma
## 3: 1976-10-03 aaroh101 aaronha01 Hank Aaron
## 4: 1976-10-03 aaroh101 aaronha01 Hank Aaron
## 5: 1976-10-03 aaroh101 aaronha01 Hank Aaron
## ---
## 84365: 1959-06-15 zuveg101 zuverge01 George Zuverink
## 84366: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
## 84367: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
## 84368: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
## 84369: 1916-07-12 zwild101 zwilldu01 Dutch Zwilling
Now when we perform this summary, let's do it not just on the player ID, but also on their name:
summarized.batters = merged.all[, list(Total.HR=sum(HR)), by=c("playerID", "name")]
summarized.batters
## playerID name Total.HR
## 1: aardsda01 David Aardsma 0
## 2: aaronha01 Hank Aaron 755
## 3: aaronto01 Tommie Aaron 13
## 4: aasedo01 Don Aase 0
## 5: abadan01 Andy Abad 0
## ---
## 16336: zupcibo01 Bob Zupcic 7
## 16337: zupofr01 Frank Zupo 0
## 16338: zuvelpa01 Paul Zuvella 2
## 16339: zuverge01 George Zuverink 0
## 16340: zwilldu01 Dutch Zwilling 30
By summarizing based on these two columns, we can keep both their ID and their real name.
Now, just like any data.table, we can sort it to find out who the top home-run hitters are. For this we use the order
function:
summarized.batters[order(Total.HR), ]
## playerID name Total.HR
## 1: aardsda01 David Aardsma 0
## 2: aasedo01 Don Aase 0
## 3: abadan01 Andy Abad 0
## 4: abadfe01 Fernando Abad 0
## 5: abadijo01 John Abadie 0
## ---
## 16336: rodrial01 Alex Rodriguez 654
## 16337: mayswi01 Willie Mays 660
## 16338: ruthba01 Babe Ruth 714
## 16339: aaronha01 Hank Aaron 755
## 16340: bondsba01 Barry Bonds 762
Baseball fans won't be surprised that at the top we can see Barry Bonds, Hank Aaron, Babe Ruth, and some other legendary baseball hitters. In the same way we can summarize by other statistics, like total number of hits or runs. For instance, here let's add Total.R
for total number of runs, and Total.H
for total number of hits.
summarized.batters = merged.all[, list(Total.HR=sum(HR), Total.R=sum(R), Total.H=sum(H)), by=c("playerID", "name")]
summarized.batters
## playerID name Total.HR Total.R Total.H
## 1: aardsda01 David Aardsma 0 0 0
## 2: aaronha01 Hank Aaron 755 2174 3771
## 3: aaronto01 Tommie Aaron 13 102 216
## 4: aasedo01 Don Aase 0 0 0
## 5: abadan01 Andy Abad 0 1 2
## ---
## 16336: zupcibo01 Bob Zupcic 7 99 199
## 16337: zupofr01 Frank Zupo 0 3 3
## 16338: zuvelpa01 Paul Zuvella 2 41 109
## 16339: zuverge01 George Zuverink 0 4 21
## 16340: zwilldu01 Dutch Zwilling 30 167 364
Now we've saved all that career information into summarized.batters
.
The more a player gets hits in baseball, the more chance they have to actually score runs. That means it's not surprising that there's a correlation between them. We can take a look at that correlation through ggplot. We'll put total hits (Total.H
) on the x-axis and total runs (Total.R
) on the y-axis.
ggplot(summarized.batters, aes(Total.H, Total.R)) + geom_point()
Here we can see a clear correlation between the number of hits a player gets and the number of runs.
So far each of these summaries has been of one statistic: the total number of home runs, or the total number of hits. But some baseball statistics are calculated based on several of a player's statistics. For example, consider the batting average, which is the number of hits a player gets, divided by the number of times he goes up to bat.
head(merged.all)
## playerID yearID teamID lgID stint G G_batting AB R H X2B X3B HR
## 1: aardsda01 2006 CHN NL 1 45 43 2 0 0 0 0 0
## 2: aardsda01 2008 BOS AL 1 47 5 1 0 0 0 0 0
## 3: aaronha01 1954 ML1 NL 1 122 122 468 58 131 27 6 13
## 4: aaronha01 1955 ML1 NL 1 153 153 602 105 189 37 9 27
## 5: aaronha01 1956 ML1 NL 1 153 153 609 106 200 34 14 26
## 6: aaronha01 1957 ML1 NL 1 151 151 615 118 198 27 6 44
## RBI SB CS BB SO IBB HBP SH SF GIDP G_old salary birthYear birthMonth
## 1: 0 0 0 0 0 0 0 1 0 0 45 NA 1981 12
## 2: 0 0 0 0 1 0 0 0 0 0 5 403250 1981 12
## 3: 69 2 2 28 39 NA 3 6 4 13 122 NA 1934 2
## 4: 106 3 1 49 61 5 3 7 4 20 153 NA 1934 2
## 5: 92 2 4 37 54 6 2 5 7 21 153 NA 1934 2
## 6: 132 1 1 57 58 15 0 0 3 13 151 NA 1934 2
## birthDay birthCountry birthState birthCity deathYear deathMonth
## 1: 27 USA CO Denver NA NA
## 2: 27 USA CO Denver NA NA
## 3: 5 USA AL Mobile NA NA
## 4: 5 USA AL Mobile NA NA
## 5: 5 USA AL Mobile NA NA
## 6: 5 USA AL Mobile NA NA
## deathDay deathCountry deathState deathCity nameFirst nameLast
## 1: NA David Aardsma
## 2: NA David Aardsma
## 3: NA Hank Aaron
## 4: NA Hank Aaron
## 5: NA Hank Aaron
## 6: NA Hank Aaron
## nameGiven weight height bats throws debut finalGame retroID
## 1: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 2: David Allan 205 75 R R 2004-04-06 2013-09-28 aardd001
## 3: Henry Louis 180 72 R R 1954-04-13 1976-10-03 aaroh101
## 4: Henry Louis 180 72 R R 1954-04-13 1976-10-03 aaroh101
## 5: Henry Louis 180 72 R R 1954-04-13 1976-10-03 aaroh101
## 6: Henry Louis 180 72 R R 1954-04-13 1976-10-03 aaroh101
## bbrefID name
## 1: aardsda01 David Aardsma
## 2: aardsda01 David Aardsma
## 3: aaronha01 Hank Aaron
## 4: aaronha01 Hank Aaron
## 5: aaronha01 Hank Aaron
## 6: aaronha01 Hank Aaron
So in our batting dataset, for Hank Aaron in 1955, we can see that he had 189 hits out of 602 at-bats. We'd calculate his batting average as
189 / 602
## [1] 0.3139535
for that year. What if we wanted to compute each player's career batting average? It turns out that's easy with the summary operation. We add a column BattingAverage
to the summary data.table, which we put as the sum of all hits divided by the sum of all at-bats.
summarized.batters = merged.all[, list(Total.HR=sum(HR), Total.R=sum(R), Total.H=sum(H), BattingAverage=sum(H) / sum(AB)), by=c("playerID", "name")]
summarized.batters
## playerID name Total.HR Total.R Total.H BattingAverage
## 1: aardsda01 David Aardsma 0 0 0 0.0000000
## 2: aaronha01 Hank Aaron 755 2174 3771 0.3049984
## 3: aaronto01 Tommie Aaron 13 102 216 0.2288136
## 4: aasedo01 Don Aase 0 0 0 0.0000000
## 5: abadan01 Andy Abad 0 1 2 0.0952381
## ---
## 16336: zupcibo01 Bob Zupcic 7 99 199 0.2503145
## 16337: zupofr01 Frank Zupo 0 3 3 0.1666667
## 16338: zuvelpa01 Paul Zuvella 2 41 109 0.2219959
## 16339: zuverge01 George Zuverink 0 4 21 0.1478873
## 16340: zwilldu01 Dutch Zwilling 30 167 364 0.2843750
This kind of summary operation thus lets us generate any statistic we're interested in. We could then, for instance, put it into a histogram to find out its distribution:
ggplot(summarized.batters, aes(BattingAverage)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
We can see that they center around about 25%, with a large number of people with close to 0 batting average, which would mostly be pitchers.
In this way you're able to test hypotheses almost as fast as you can think of them. This loop of asking questions about your data and getting answers back is the core of exploratory data analysis.