The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation, analysis, and visualization. Popular packages like dplyr, tidyr and ggplot2 take great advantage of this framework, as explored in several recent posts by others.
But there’s an important step in a tidy data workflow that so far has been missing: the output of R statistical modeling functions isn’t tidy, meaning it’s difficult to manipulate and recombine in downstream analyses and visualizations. Hadley’s paper makes a convincing statement of this problem (emphasis mine):
While model inputs usually require tidy inputs, such attention to detail doesn’t carry over to model outputs. Outputs such as predictions and estimated coefficients aren’t always tidy. This makes it more difficult to combine results from multiple models. For example, in R, the default representation of model coefficients is not tidy because it does not have an explicit variable that records the variable name for each estimate, they are instead recorded as row names. In R, row names must be unique, so combining coefficients from many models (e.g., from bootstrap resamples, or subgroups) requires workarounds to avoid losing important information. This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models. I’m not currently aware of any packages that resolve this problem.
In this new paper I introduce the broom package (available on CRAN), which bridges the gap from untidy outputs of predictions and estimations to the tidy data we want to work with. It takes the messy output of built-in statistical functions in R, such as
t.test, as well as popular third-party packages, like gam, glmnet, survival or lme4, and turns them into tidy data frames. This allows the results to be handed to other tidy packages for downstream analysis: they can be recombined using dplyr or visualized using ggplot2.
Example: linear regression
As a simple example, consider a linear regression on the built-in
This summary shows many kinds of statistics describing the regression: coefficient estimates and p-values, information about the residuals, and model statistics like $R^2$ and the F statistic. But this format isn’t convenient if you want to combine and compare multiple models, or plot it using ggplot2: you need to turn it into a data frame.
The broom package provides three tidying methods for turning the contents of this object into a data frame, depending on the level of statistics you’re interested in. If you want statistics about each of the coefficients fit by the model, use the
Note that the rownames are now added as a column,
term, meaning that the data can be combined with other models. Note also that the columns have been given names like
p.value that are more easily accessed than
Std. Error and
Pr(>|t|). This is true of all data frames broom returns: they’re designed so they can be processed in additional steps.
If you’re interested in extracting per-observation information, such as fitted values and residuals, use the
augment() method, which adds these to the original data:
glance() computes per-model statistics, such as $R^2$, AIC, and BIC:
As pointed out by Mike Love, the
tidy method makes it easy to construct coefficient plots using ggplot2:
When combined with dplyr’s
do, broom also lets you perform regressions within groups, such as within automatic and manual cars separately:
This is useful for performing regressions or other analyses within each gene, country, or any other kind of division in your tidy dataset.
Using tidiers for visualization with ggplot2
The broom package provides tidying methods for many other packages as well. These tidiers serve to connect various statistical models seamlessly with packages like dplyr and ggplot2. For instance, we could create a LASSO regression with the glmnet package:
Then we tidy it with broom and plot it using ggplot2:
By plotting with ggplot2 rather than relying on glmnet’s built-in plotting methods, we gain access to all the tools and framework of the package. This allows us to customize or add attributes, or even to compare multiple LASSO cross-validations in the same plot. The same is true of the survival package:
The vignettes for the broom package offer other useful examples, including one on combining broom and dplyr, a demonstration of bootstrapping with broom, and a simulation of k-means clustering. The broom manuscript offers still more examples.
Tidying model outputs is not an exact science, and it is based on a judgment of the kinds of values a data scientist typically wants out of a tidy analysis (for instance, estimates, test statistics, and p-values). It is my hope that data scientists will propose and contribute their own features feature requests are welcome!) to help expand the universe of tidy analysis tools.