### David Robinson

Director of Data Science at Heap, works in R.

# All Posts

## Machine learning in a hurry: what I've learned from the SLICED ML competition

July 21, 2021

Over this summer I've competed in the SLICED ML competition, where contestants have two hours to create a Kaggle submission. I share what I learned about competitive machine learning and R.

## The 'circular random walk' puzzle: tidy simulation of stochastic processes in R

November 23, 2020

Solving a puzzle from 538's The Riddler column: if you pass cranberry sauce randomly around a table of 20, who is most likely to be the last person to get it?

## The 'prisoner coin flipping' puzzle: tidy simulation in R

May 04, 2020

Solving a puzzle from 538's The Riddler column: if N prisoners have a choice to flip a coin, and go free as long as at least one coin is flipped and all flipped coins are heads, what strategy should they take to maximize their chances? Another demonstration of probabilistic reasoning and tidy simulation.

## The 'spam comments' puzzle: tidy simulation of stochastic processes in R

April 13, 2020

Solving a puzzle from 538's The Riddler column: if new spam comments appear at an average rate of one per day, including on other spam comments, how many can we expect after three days?

## Feller's coin-tossing puzzle: tidy simulation in R

January 17, 2020

If you toss n coins, what's the probability there are no streaks of k heads?

## The 'Spelling Bee Honeycomb' puzzle: efficient computation in R

January 06, 2020

Solving a puzzle from 538's The Riddler column: what is the honeycomb (from the New York Times's Spelling Bee puzzle) that scores the highest?

## The birthday paradox puzzle: tidy simulation in R

January 03, 2020

A demonstration of a tidy approach to simulating the classic birthday paradox problem.

## The 'largest stock profit or loss' puzzle: efficient computation in R

December 24, 2019

A demonstration of a fast tidyverse approach to an interview problem.

## The 'knight on an infinite chessboard' puzzle: efficient simulation in R

December 10, 2018

A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.

## Exploring college major and income: a live data analysis in R

October 16, 2018

A live screencast of an exploratory data analysis from the Tidy Tuesday series. This one explores college major and income data from 538.

## Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity

September 06, 2018

An analysis of an anonymous op-ed in the New York Times, using document similarity metrics to match it to Twitter accounts.

## Scientific debt

May 10, 2018

Introducing an analogy to 'technical debt' for data scientists.

## Data science at DataCamp

April 10, 2018

Two months after starting as Chief Data Scientist at DataCamp, I share some thoughts on the role and the team's future.

## What digits should you bet on in Super Bowl squares?

February 04, 2018

A statistical examination of the least significant digit of NFL scores.

## Exploring handwritten digit classification: a tidy analysis of the MNIST dataset

January 22, 2018

Demonstrating an approach of exploratory data analysis on a classic image classification problem.

## What's the difference between data science, machine learning, and artificial intelligence?

January 09, 2018

An oversimplified definition of the difference between three important fields.

## Advice to aspiring data scientists: start a blog

November 14, 2017

Reasons that a data science blog is particularly valuable to people at the start of their career.

## Announcing "Introduction to the Tidyverse", my new DataCamp course

November 09, 2017

Introducing my new DataCamp course that teaches ggplot2 and dplyr, and how they relate.

## Don't teach students the hard way first

September 21, 2017

Discussion of a particular teaching approach that I believe is a mistake.

## Trump's Android and iPhone tweets, one year later

August 09, 2017

A followup to last summer's analysis of Donald Trump's Twitter account.

## Teach the tidyverse to beginners

July 05, 2017

An argument for teaching R packages like dplyr and tidyr as the first part of a data science course.

## Two years as a Data Scientist at Stack Overflow

June 22, 2017

Looking back at my second year at the first job I've had outside academia.

## Words growing or shrinking in Hacker News titles: a tidy analysis

June 08, 2017

An analysis of one million Hacker News titles, and what topics and technologies are changing in frequency over time.

## Slides, videos, and tweets from the 2017 New York R Conference

May 22, 2017

Some notes from the 2017 New York R Conference, and slides and video from my talk.

## How we built Tagger News: machine learning on a tight schedule

May 15, 2017

A discussion of the algorithm training and development behind our TechCrunch Disrupt hackathon project.

## Gender and verbs across 100,000 stories: a tidy analysis

April 27, 2017

An analysis of "he" vs. "she" in plot descriptions from Wikipedia.

## Examining the arc of 100,000 stories: a tidy analysis

April 26, 2017

An analysis of over 100,000 plot descriptions downloaded from Wikipedia, particularly examining which words tend to occur at which point in a story.

## Announcing the release of my e-book: Introduction to Empirical Bayes

February 07, 2017

Releasing my e-book 'Introduction to Empirical Bayes: Examples from Baseball Statistics', adapted from my series of posts on applying empirical Bayesian methods to baseball.

## Simulation of empirical Bayesian methods (using baseball statistics)

January 11, 2017

Examining the assumptions and accuracy of empirical Bayes through simulations.

## Introducing the ebbr package for empirical Bayes estimation (using baseball statistics)

January 05, 2017

Turning many of the statistical methods described in the baseball/empirical Bayes series into a convenient R package.

## Understanding mixture models and expectation-maximization (using baseball statistics)

January 03, 2017

Modeling the prior as a mixture of two beta distributions, and classifying each player with an iterative expectation-maximization algorithm.

## Analysis of software developers in New York, San Francisco, London and Bangalore

December 01, 2016

An analysis of how programmers' use of different programming languages varies across cities, based on Stack Overflow traffic.

## The 'deadly board game' puzzle: efficient simulation in R

October 19, 2016

A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.

## Understanding empirical Bayesian hierarchical modeling (using baseball statistics)

October 12, 2016

Allowing the priors to depend on observed variables such as year and position.

## Slides from graduates of the Metis Data Science Career Day

September 30, 2016

A directory of talks, slides, and professional information from graduates presenting at the Metis bootcamp.

## Tidying computational biology models with biobroom: a case study in tidy analysis

September 06, 2016

An example of tidying Bioconductor model objects, such as from the limma package, with the biobroom package.

## Analysis of the #7FavPackages hashtag

August 25, 2016

An analysis of people's favorite R packages, as shared in the #7FavPackages hashtag.

## useR and JSM 2016 conferences: a story in tweets

August 23, 2016

My reports on the useR and JSM conferences that I attended in summer 2016.

## Text analysis of Trump's tweets confirms he writes only the (angrier) Android half

August 09, 2016

Text and sentiment analysis on Trump's tweets confirms that the tweets posted from the iPhone appear to come from his campaign, while his tweets from Android are the 'off-the-cuff' observations he's known for.

## Does sentiment analysis work? A tidy analysis of Yelp reviews

July 21, 2016

How well does sentiment analysis work at predicting customer satisfaction? We examine a Yelp dataset using the tidytext package.

## stacksurveyr: An R package with the 2016 Developer Survey Results

July 19, 2016

Sharing the answers of 56,000 developers in an R package well suited for analysis.

## Releasing the StackLite dataset of Stack Overflow questions and tags

July 18, 2016

Sharing a new resource for analyzing Stack Overflow questions.

## One year as a Data Scientist at Stack Overflow

June 20, 2016

Looking back at a year at my new job, and the transition from academia to industry.

## Understanding beta binomial regression (using baseball statistics)

May 31, 2016

Allowing our priors to depend on the number of times each player went up to bat.

## Understanding Bayesian A/B testing (using baseball statistics)

May 23, 2016

How to detect a difference between two proportions using Bayesian hypothesis testing.

## The adblockr package: block ads from the monetizr package

April 01, 2016

Introducing a new package that blocks the greedy, distracting ads from the heinous monetizr package.

## The monetizr package: make money on your open source R packages

April 01, 2016

Introducing a new package that lets you monetize R open source development.

## How to replace a pie chart

March 14, 2016

An example of replacing a pie chart with a bar chart that communicates more information.

## Why I use ggplot2

February 12, 2016

A response to Jeff Leek about base plotting and the Grammar of Graphics.

## Analyzing networks of characters in 'Love Actually'

December 25, 2015

An analysis and visualization of a holiday classic.

## The 'lost boarding pass' puzzle: efficient simulation in R

December 11, 2015

A simulation of the 'lost boarding pass' puzzle, showing one way to perform efficient simulations in R even when they need to keep track of state.

## Modeling gene expression with broom: a case study in tidy analysis

November 25, 2015

An example of fitting models to each gene in an expression dataset using tidy tools (dplyr and broom).

## Cleaning and visualizing genomic data: a case study in tidy analysis

November 19, 2015

An example of cleaning and graphing a gene expression dataset using tidy tools (dplyr, tidyr, and ggplot2).

## What are the most polarizing programming languages?

November 04, 2015

An analysis of what technologies are liked and disliked on Stack Overflow Careers.

## Understanding the Bayesian approach to false discovery rates (using baseball statistics)

November 03, 2015

Taking an empirical Bayesian approach to false discovery rates, in order to assemble a 'Hall of Fame' of great batters.

## Understanding credible intervals (using baseball statistics)

October 21, 2015

Computing posterior credible intervals after empirical Bayesian estimation, in terms of baseball batting averages.

## Understanding empirical Bayes estimation (using baseball statistics)

October 01, 2015

An intuitive explanation of empirical Bayes estimation in terms of estimating baseball batting averages.

## Is Bayesian A/B Testing Immune to Peeking? Not Exactly

August 21, 2015

Bayesian A/B testing doesn't 'solve' the problems of frequentist testing; it just makes different promises.

## Yes, There is Such a Thing as a Stupid Question

June 08, 2015

There are stupid questions, but they're not what you're thinking of.

## A Million Lines of Bad Code

April 17, 2015

A confession, and a plea, about 'code-shaming'.

## Slides from my talk on the broom package

April 13, 2015

Slides and some highlights from my talk on the broom package at UP-STAT 2015.

## broom: a package for tidying statistical models into data frames

March 19, 2015

Introducing a package that turns statistical objects from R into tidy data frames that can be used with packages like dplyr and ggplot2.

## Introducing stackr: An R package for querying the Stack Exchange API

February 04, 2015

An example analysis of a Stack Overflow user (me) in R.

## What kind of programmer are you? Stack Exchange can predict it, Shiny can graph it

February 02, 2015

A Shiny visualization for personalized data from Providence, Stack Exchange's machine learning system.

## K-means clustering is not a free lunch

January 16, 2015

A response to a Cross Validated question, exploring the assumptions underlying the k-means algorithm.

## Can R and ggvis help solve Serial's murder?

December 23, 2014

Creating an interactive visualization of the podcast Serial's infamous 'call log,' using ggvis.

## Understanding the beta distribution (using baseball statistics)

December 20, 2014

An intuitive explanation of the beta distribution in terms of predicting baseball batting averages.

## Don't teach built-in plotting to beginners (teach ggplot2)

December 16, 2014

If you're teaching built-in plotting in your statistics class, you're doing it wrong.

## How to interpret a p-value histogram

December 15, 2014

What anyone doing multiple hypothesis testing should know.

## Are high-reputation users quitting Stack Overflow?

December 15, 2014

An analysis based on answering trends over time.