All Posts
2021
July 21, 2021
This summer I competed in the SLICED ML competition, where contestants have two hours to create a Kaggle submission. I share what I learned about competitive machine learning and R.
2020
November 23, 2020
Solving a puzzle from 538's The Riddler column: if you pass cranberry sauce randomly around a table of 20, who is most likely to be the last person to get it?
May 04, 2020
Solving a puzzle from 538's The Riddler column: if N prisoners each have the choice to flip a coin, and go free as long as at least one coin is flipped and every flipped coin is heads, what strategy should they take to maximize their chances? Another demonstration of probabilistic reasoning and tidy simulation.
April 13, 2020
Solving a puzzle from 538's The Riddler column: if new spam comments appear at an average rate of one per day, including on other spam comments, how many can we expect after three days?
January 17, 2020
If you toss n coins, what's the probability there are no streaks of k heads?
January 06, 2020
Solving a puzzle from 538's The Riddler column: what is the honeycomb (from the New York Times's Spelling Bee puzzle) that scores the highest?
January 03, 2020
A demonstration of a tidy approach to simulating the classic birthday paradox problem.
2019
December 24, 2019
A demonstration of a fast tidyverse approach to an interview problem.
2018
December 10, 2018
A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.
October 16, 2018
A live screencast of an exploratory data analysis from the Tidy Tuesday series. This one explores college major and income data from 538.
September 06, 2018
An analysis of an anonymous op-ed in the New York Times, using document similarity metrics to match it to Twitter accounts.
May 10, 2018
Introducing an analogy to 'technical debt' for data scientists.
April 10, 2018
Two months after starting as Chief Data Scientist at DataCamp, I share some thoughts on the role and the team's future.
February 04, 2018
A statistical examination of the least significant digit of NFL scores.
January 22, 2018
Demonstrating an exploratory data analysis approach to a classic image classification problem.
January 09, 2018
An oversimplified definition of the difference between three important fields.
2017
November 14, 2017
Reasons that a data science blog is particularly valuable to people at the start of their career.
November 09, 2017
Introducing my new DataCamp course that teaches ggplot2 and dplyr, and how they relate.
September 21, 2017
Discussion of a particular teaching approach that I believe is a mistake.
August 09, 2017
A followup to last summer's analysis of Donald Trump's Twitter account.
July 05, 2017
An argument for teaching R packages like dplyr and tidyr as the first part of a data science course.
June 22, 2017
Looking back at my second year at the first job I've had outside academia.
June 08, 2017
An analysis of one million Hacker News titles, and what topics and technologies are changing in frequency over time.
May 22, 2017
Some notes from the 2017 New York R Conference, and slides and video from my talk.
May 15, 2017
A discussion of the algorithm training and development behind our TechCrunch Disrupt hackathon project.
April 27, 2017
An analysis of "he" vs. "she" in plot descriptions from Wikipedia.
April 26, 2017
An analysis of over 100,000 plot descriptions downloaded from Wikipedia, particularly examining which words tend to occur at which point in a story.
February 07, 2017
Releasing my e-book 'Introduction to Empirical Bayes: Examples from Baseball Statistics', adapted from my series of posts on applying empirical Bayesian methods to baseball.
January 11, 2017
Examining the assumptions and accuracy of empirical Bayes through simulations.
January 05, 2017
Turning many of the statistical methods described in the baseball/empirical Bayes series into a convenient R package.
January 03, 2017
Modeling the prior as a mixture of two beta distributions, and classifying each player with an iterative expectation-maximization algorithm.
2016
December 01, 2016
An analysis of how programmers' use of programming languages differs across cities, based on Stack Overflow traffic.
October 19, 2016
A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.
October 12, 2016
Allowing the priors to depend on observed variables such as year and position.
September 30, 2016
A directory of talks, slides, and professional information from graduates presenting at the Metis bootcamp.
September 06, 2016
An example of tidying Bioconductor model objects, such as from the limma package, with the biobroom package.
August 25, 2016
An analysis of people's favorite R packages, as shared in the #7FavPackages hashtag.
August 23, 2016
My reports on the useR and JSM conferences that I attended in summer 2016.
August 09, 2016
Text and sentiment analysis on Trump's tweets confirms that the tweets posted from the iPhone appear to come from his campaign, while his tweets from Android are the 'off-the-cuff' observations he's known for.
July 21, 2016
How well does sentiment analysis work at predicting customer satisfaction? We examine a Yelp dataset using the tidytext package.
July 19, 2016
Sharing the answers of 56,000 developers in an R package well suited for analysis.
July 18, 2016
Sharing a new resource for analyzing Stack Overflow questions.
June 20, 2016
Looking back at a year at my new job, and the transition from academia to industry.
May 31, 2016
Allowing our priors to depend on the number of times each player went up to bat.
May 23, 2016
How to detect a difference between two proportions using Bayesian hypothesis testing.
April 01, 2016
Introducing a new package that blocks the greedy, distracting ads from the heinous monetizr package.
April 01, 2016
Introducing a new package that lets you monetize R open source development.
March 14, 2016
An example of replacing a pie chart with a bar chart that communicates more information.
February 12, 2016
A response to Jeff Leek about base plotting and the Grammar of Graphics.
2015
December 25, 2015
An analysis and visualization of a holiday classic.
December 11, 2015
A simulation of the 'lost boarding pass' puzzle, showing one way to perform efficient simulations in R even when they need to keep track of state.
November 25, 2015
An example of fitting models to each gene in an expression dataset using tidy tools (dplyr and broom).
November 19, 2015
An example of cleaning and graphing a gene expression dataset using tidy tools (dplyr, tidyr, and ggplot2).
November 04, 2015
An analysis of what technologies are liked and disliked on Stack Overflow Careers.
November 03, 2015
Taking an empirical Bayesian approach to false discovery rates, in order to assemble a 'Hall of Fame' of great batters.
October 21, 2015
Computing posterior credible intervals after empirical Bayesian estimation, in terms of baseball batting averages.
October 01, 2015
An intuitive explanation of empirical Bayes estimation in terms of estimating baseball batting averages.
August 21, 2015
Bayesian A/B testing doesn't 'solve' the problems of frequentist testing; it just makes different promises.
June 08, 2015
There are stupid questions, but they're not what you're thinking of.
April 17, 2015
A confession, and a plea, about 'code-shaming'.
April 13, 2015
Slides and some highlights from my talk on the broom package at UP-STAT 2015.
March 19, 2015
Introducing a package that turns statistical objects from R into tidy data frames that can be used with packages like dplyr and ggplot2.
March 06, 2015
A Shiny app to visualize downloads from RStudio's CRAN mirror.
February 04, 2015
An example analysis of a Stack Overflow user (me) in R.
February 02, 2015
A Shiny visualization for personalized data from Stack Exchange's machine learning system, Providence.
January 16, 2015
A response to a Cross Validated question, exploring the assumptions underlying the k-means algorithm.
2014
December 23, 2014
Creating an interactive visualization of the podcast Serial's infamous 'call log,' using ggvis.
December 20, 2014
An intuitive explanation of the beta distribution in terms of predicting baseball batting averages.
December 16, 2014
If you're teaching built-in plotting in your statistics class, you're doing it wrong.
December 15, 2014
What anyone doing multiple hypothesis testing should know.
December 15, 2014
An analysis based on answering trends over time.