4.6 Exploratory Data Analysis

View the "merged" data.table (created in Quiz 4.5) like a spreadsheet. It contains the pitching dataset with salary information added.

View(merged)

Let's say you want to predict a pitcher's salary based on his performance. First, you might want to get a sense of the distribution of salaries.

Use ggplot2 on the "merged" dataset to create a histogram of pitchers' salaries.

ggplot(merged, aes(salary)) + geom_histogram()

The salary distribution is heavily skewed, which will be a problem for some of our graphs and tests. Try plotting the x-axis on a log scale instead.

ggplot(merged, aes(salary)) + geom_histogram() + scale_x_log10()

Since the log data is closer to symmetrical (which works better for statistical tests and graphs), we know we'll want to work on the log scale from now on.
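To see why the log scale helps, here is a minimal sketch on simulated data. The salaries below are made up (drawn from a log-normal distribution) and are not the "merged" dataset, and skewness() is a small helper defined here for illustration, not a base R function.

```r
# Simulated, made-up salaries: log-normal values are heavily right-skewed.
set.seed(1)
salary <- rlnorm(1000, meanlog = 13.5, sdlog = 1)

# Hypothetical helper: moment-based skewness (0 means symmetric).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_skew <- skewness(salary)       # skewness on the original scale
log_skew <- skewness(log(salary))  # skewness after taking logs

print(c(raw = raw_skew, log = log_skew))
```

A skewness near zero on the log scale is what "closer to symmetrical" means here. You could apply the same helper to merged$salary, after dropping any missing values.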

Examine whether salaries change over time by creating a scatterplot with year on the x-axis and salary on the y-axis. Add a smoothing curve, and put the y-axis on a log scale.

ggplot(merged, aes(yearID, salary)) + geom_point() + geom_smooth() + scale_y_log10()

Perform a statistical test for a correlation between year and log(salary).

cor.test(merged$yearID, log(merged$salary))
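cor.test() returns a list-like object, so you can pull out individual pieces instead of reading the printed output. The sketch below uses a toy data frame with made-up values (the name "toy" and the simulated 0.05-per-year trend are assumptions for illustration, not results from the real data).

```r
# Toy data: made-up years and salaries with a built-in upward trend.
set.seed(42)
toy <- data.frame(yearID = sample(1985:2013, 500, replace = TRUE))
toy$salary <- exp(12 + 0.05 * (toy$yearID - 1985) + rnorm(500, sd = 1))

# Pearson correlation test between year and log(salary).
ct <- cor.test(toy$yearID, log(toy$salary))

print(ct$estimate)  # the correlation coefficient r
print(ct$p.value)   # p-value for the null hypothesis that r is 0
print(ct$conf.int)  # 95% confidence interval for r
```

The same three components are available on the result of the cor.test() call above.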

Examine whether the league (American, AL, versus National, NL) has an effect on salary, by creating a boxplot comparing salary between leagues. Put the y-axis on a log scale.

ggplot(merged, aes(lgID, salary)) + geom_boxplot() + scale_y_log10()

Perform a t-test to test whether there is a statistically significant difference in log(salary) between the two leagues.

t.test(log(salary) ~ lgID, data=merged)
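By default, t.test() runs Welch's t-test, which does not assume the two groups have equal variances. Because the test is on log(salary), the difference in group means can be back-transformed into a ratio of geometric means. The sketch below shows this on a toy data frame with made-up salaries (the name "toy" and the simulated league difference are assumptions for illustration only).

```r
# Toy data: made-up salaries for two leagues.
set.seed(7)
toy <- data.frame(
  lgID = rep(c("AL", "NL"), each = 200),
  salary = c(rlnorm(200, meanlog = 13.6, sdlog = 1),
             rlnorm(200, meanlog = 13.4, sdlog = 1))
)

# Welch two-sample t-test on the log scale (the default for t.test).
tt <- t.test(log(salary) ~ lgID, data = toy)

print(tt$method)    # name of the test that was run
print(tt$estimate)  # mean of log(salary) in each league
print(tt$p.value)

# Difference of log means, back-transformed: ratio of geometric means (AL / NL).
ratio <- unname(exp(tt$estimate[1] - tt$estimate[2]))
print(ratio)
```

A ratio of 1 would mean the typical (geometric mean) salary is the same in both leagues.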

Perform a linear regression to predict log(salary) based on year. Save it to a variable called "fit".

fit = lm(log(salary) ~ yearID, data=merged)

Display a summary of the linear fit.

summary(fit)

Based on the reported p-value, does including a player's year significantly improve your prediction of a player's salary?

true
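The p-value in question is in the "Pr(>|t|)" column of the coefficient table, in the row for yearID. The sketch below extracts it directly and converts the slope into an approximate percent change in salary per year. It runs on a toy data frame with made-up values; "toy" and "toy_fit" are hypothetical names chosen so the example does not overwrite your own "fit".

```r
# Toy data: made-up salaries that grow by about 5% per year on average.
set.seed(3)
toy <- data.frame(yearID = sample(1985:2013, 500, replace = TRUE))
toy$salary <- exp(12 + 0.05 * (toy$yearID - 1985) + rnorm(500))

toy_fit <- lm(log(salary) ~ yearID, data = toy)

# Coefficient table: one row per term, with estimate, std. error, t and p.
coefs <- summary(toy_fit)$coefficients
year_p <- coefs["yearID", "Pr(>|t|)"]
print(year_p)

# A slope b on the log scale means salary is multiplied by exp(b) each year.
pct_per_year <- (exp(coef(toy_fit)[["yearID"]]) - 1) * 100
print(pct_per_year)
```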

ERA, for "Earned Run Average," is one of the most popular statistics used to measure a pitcher's performance. It measures how many earned runs a pitcher allows per nine innings pitched, so it is adjusted for how long the pitcher played. Thus, a *lower* ERA is better.

Perform a multiple linear regression to predict log(salary) based on both year and ERA. Save it to a variable called "mfit".

mfit = lm(log(salary) ~ yearID + ERA, data=merged)

Display a summary of the multiple regression.

summary(mfit)

Based on the reported p-value, does including a player's ERA significantly improve your prediction of a player's salary?

true
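Another way to ask whether ERA improves the prediction is to compare the two nested models with anova(). When only one term is added, the F-test p-value matches the p-value for that term in the summary. The sketch below uses a toy data frame with made-up values; "toy", "toy_fit" and "toy_mfit" are hypothetical names, and the simulated ERA effect is an assumption for illustration.

```r
# Toy data: made-up years, ERAs and salaries (lower ERA -> higher salary).
set.seed(11)
n <- 500
toy <- data.frame(
  yearID = sample(1985:2013, n, replace = TRUE),
  ERA = pmax(rnorm(n, mean = 4.2, sd = 1), 0.5)
)
toy$salary <- exp(12 + 0.05 * (toy$yearID - 1985) - 0.3 * toy$ERA + rnorm(n))

toy_fit  <- lm(log(salary) ~ yearID, data = toy)        # year only
toy_mfit <- lm(log(salary) ~ yearID + ERA, data = toy)  # year and ERA

# p-value for the ERA coefficient in the multiple regression.
era_p <- summary(toy_mfit)$coefficients["ERA", "Pr(>|t|)"]

# F-test comparing the smaller model against the larger one.
cmp <- anova(toy_fit, toy_mfit)
f_p <- cmp[["Pr(>F)"]][2]

print(era_p)
print(f_p)
```

Note that anova() needs both models to be fit to the same rows. If ERA has missing values in your data, lm() drops those rows for the larger model only, so you would need to refit the smaller model on the same subset first.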
