# All Posts

### 2021

July 21, 2021

This summer I competed in the SLICED ML competition, where contestants have two hours to create a Kaggle submission. I share what I learned about competitive machine learning and R.

### 2020

November 23, 2020

Solving a puzzle from 538's The Riddler column: if you pass cranberry sauce randomly around a table of 20, who is most likely to be the last person to get it?

May 04, 2020

Solving a puzzle from 538's The Riddler column: if N prisoners each have a choice to flip a coin, and go free as long as at least one coin is flipped and every flipped coin comes up heads, what strategy should they take to maximize their chances? Another demonstration of probabilistic reasoning and tidy simulation.

April 13, 2020

Solving a puzzle from 538's The Riddler column: if new spam comments appear at an average rate of one per day, including on other spam comments, how many can we expect after three days?

January 17, 2020

If you toss n coins, what's the probability there are no streaks of k heads?

January 06, 2020

Solving a puzzle from 538's The Riddler column: what is the honeycomb (from the New York Times's Spelling Bee puzzle) that scores the highest?

January 03, 2020

A demonstration of a tidy approach to simulating the classic birthday paradox problem.

### 2019

December 24, 2019

A demonstration of a fast tidyverse approach to an interview problem.

### 2018

December 10, 2018

A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.

October 16, 2018

A live screencast of an exploratory data analysis from the Tidy Tuesday series. This one explores college major and income data from 538.

September 06, 2018

An analysis of an anonymous op-ed in the New York Times, using document similarity metrics to match it to Twitter accounts.

May 10, 2018

Introducing an analogy to 'technical debt' for data scientists.

April 10, 2018

Two months after starting as Chief Data Scientist at DataCamp, I share some thoughts on the role and the team's future.

February 04, 2018

A statistical examination of the least significant digit of NFL scores.

January 22, 2018

Demonstrating an exploratory data analysis approach to a classic image classification problem.

January 09, 2018

An oversimplified definition of the difference between three important fields.

### 2017

November 14, 2017

Reasons that a data science blog is particularly valuable to people at the start of their career.

November 09, 2017

Introducing my new DataCamp course that teaches ggplot2 and dplyr, and how they relate.

September 21, 2017

Discussion of a particular teaching approach that I believe is a mistake.

August 09, 2017

A followup to last summer's analysis of Donald Trump's Twitter account.

July 05, 2017

An argument for teaching R packages like dplyr and tidyr as the first part of a data science course.

June 22, 2017

Looking back at my second year at the first job I've had outside academia.

June 08, 2017

An analysis of one million Hacker News titles, and what topics and technologies are changing in frequency over time.

May 22, 2017

Some notes from the 2017 New York R Conference, and slides and video from my talk.

May 15, 2017

A discussion of the algorithm training and development behind our TechCrunch Disrupt hackathon project.

April 27, 2017

An analysis of "he" vs. "she" in plot descriptions from Wikipedia.

April 26, 2017

An analysis of over 100,000 plot descriptions downloaded from Wikipedia, particularly examining which words tend to occur at which point in a story.

February 07, 2017

Releasing my e-book 'Introduction to Empirical Bayes: Examples from Baseball Statistics', adapted from my series of posts on applying empirical Bayesian methods to baseball.

January 11, 2017

Examining the assumptions and accuracy of empirical Bayes through simulations.

January 05, 2017

Turning many of the statistical methods described in the baseball/empirical Bayes series into a convenient R package.

January 03, 2017

Modeling the prior as a mixture of two beta distributions, and classifying each player with an iterative expectation-maximization algorithm.

### 2016

December 01, 2016

An analysis of how programming language use differs among cities, based on Stack Overflow traffic.

October 19, 2016

A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.

October 12, 2016

Allowing the priors to depend on observed variables such as year and position.

September 30, 2016

A directory of talks, slides, and professional information from graduates presenting at the Metis bootcamp.

September 06, 2016

An example of tidying Bioconductor model objects, such as from the limma package, with the biobroom package.

August 25, 2016

An analysis of people's favorite R packages, as shared in the #7FavPackages hashtag.

August 23, 2016

My reports on the useR and JSM conferences that I attended in summer 2016.

August 09, 2016

Text and sentiment analysis on Trump's tweets confirms that the tweets posted from the iPhone appear to come from his campaign, while his tweets from Android are the 'off-the-cuff' observations he's known for.

July 21, 2016

How well does sentiment analysis work at predicting customer satisfaction? We examine a Yelp dataset using the tidytext package.

July 19, 2016

Sharing the answers of 56,000 developers in an R package well suited for analysis.

July 18, 2016

Sharing a new resource for analyzing Stack Overflow questions.

June 20, 2016

Looking back at a year at my new job, and the transition from academia to industry.

May 31, 2016

Allowing our priors to depend on the number of times each player went up to bat.

May 23, 2016

How to detect a difference between two proportions using Bayesian hypothesis testing.

April 01, 2016

Introducing a new package that blocks the greedy, distracting ads from the heinous monetizr package.

April 01, 2016

Introducing a new package that lets you monetize R open source development.

March 14, 2016

An example of replacing a pie chart with a bar chart that communicates more information.

February 12, 2016

A response to Jeff Leek about base plotting and the Grammar of Graphics.

### 2015

December 25, 2015

An analysis and visualization of a holiday classic.

December 11, 2015

A simulation of the 'lost boarding pass' puzzle, showing one way to perform efficient simulations in R even when they need to keep track of state.

November 25, 2015

An example of fitting models to each gene in an expression dataset using tidy tools (dplyr and broom).

November 19, 2015

An example of cleaning and graphing a gene expression dataset using tidy tools (dplyr, tidyr, and ggplot2).

November 04, 2015

An analysis of what technologies are liked and disliked on Stack Overflow Careers.

November 03, 2015

Taking an empirical Bayesian approach to false discovery rates, in order to assemble a 'Hall of Fame' of great batters.

October 21, 2015

Computing posterior credible intervals after empirical Bayesian estimation, in terms of baseball batting averages.

October 01, 2015

An intuitive explanation of empirical Bayes estimation in terms of estimating baseball batting averages.

August 21, 2015

Bayesian A/B testing doesn't 'solve' the problems of frequentist testing; it just makes different promises.

June 08, 2015

There are stupid questions, but they're not what you're thinking of.

April 17, 2015

A confession, and a plea, about 'code-shaming'.

April 13, 2015

Slides and some highlights from my talk on the broom package at UP-STAT 2015.

March 19, 2015

Introducing a package that turns statistical objects from R into tidy data frames that can be used with packages like dplyr and ggplot2.

March 06, 2015

A Shiny app to visualize downloads from RStudio's CRAN mirror.

February 04, 2015

An example analysis of a Stack Overflow user (me) in R.

February 02, 2015

A Shiny visualization for personalized data from Stack Exchange's machine learning system, Providence.

January 16, 2015

A response to a Cross Validated question, exploring the assumptions underlying the k-means algorithm.

### 2014

December 23, 2014

Creating an interactive visualization of the podcast Serial's infamous 'call log,' using ggvis.

December 20, 2014

An intuitive explanation of the beta distribution in terms of predicting baseball batting averages.

December 16, 2014

If you're teaching built-in plotting in your statistics class, you're doing it wrong.

December 15, 2014

What anyone doing multiple hypothesis testing should know.

December 15, 2014

An analysis based on answering trends over time.