David Robinson

Director of Engineering at Contentsquare

All Posts

2021

Machine learning in a hurry: what I've learned from the SLICED ML competition
July 21, 2021

Over this summer I've competed in the SLICED ML competition, where contestants have two hours to create a Kaggle submission. I share what I learned about competitive machine learning and R.

2020

The 'circular random walk' puzzle: tidy simulation of stochastic processes in R
November 23, 2020

Solving a puzzle from 538's The Riddler column: if you pass cranberry sauce randomly around a table of 20, who is most likely to be the last person to get it?

The 'prisoner coin flipping' puzzle: tidy simulation in R
May 04, 2020

Solving a puzzle from 538's The Riddler column: if N prisoners have a choice to flip a coin, and go free as long as one coin is flipped and all coins are heads, what strategy should they take to maximize their chances? Another demonstration of probabilistic reasoning and tidy simulation.

The 'spam comments' puzzle: tidy simulation of stochastic processes in R
April 13, 2020

Solving a puzzle from 538's The Riddler column: if new spam comments appear at an average rate of one per day, including on other spam comments, how many can we expect after three days?

Feller's coin-tossing puzzle: tidy simulation in R
January 17, 2020

If you toss n coins, what's the probability there are no streaks of k heads?

The 'Spelling Bee Honeycomb' puzzle: efficient computation in R
January 06, 2020

Solving a puzzle from 538's The Riddler column: what is the honeycomb (from the New York Times's Spelling Bee puzzle) that scores the highest?

The birthday paradox puzzle: tidy simulation in R
January 03, 2020

A demonstration of a tidy approach to simulating the classic birthday paradox problem.

2019

The 'largest stock profit or loss' puzzle: efficient computation in R
December 24, 2019

A demonstration of a fast tidyverse approach to an interview problem.

2018

The 'knight on an infinite chessboard' puzzle: efficient simulation in R
December 10, 2018

A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.

Exploring college major and income: a live data analysis in R
October 16, 2018

A live screencast of an exploratory data analysis from the Tidy Tuesday series. This one explores college major and income data from 538.

Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity
September 06, 2018

An analysis of an anonymous op-ed in the New York Times, using document similarity metrics to match it to Twitter accounts.

Scientific debt
May 10, 2018

Introducing an analogy to 'technical debt' for data scientists.

Data science at DataCamp
April 10, 2018

Two months after starting as Chief Data Scientist at DataCamp, I share some thoughts on the role and the team's future.

What digits should you bet on in Super Bowl squares?
February 04, 2018

A statistical examination of the least significant digit of NFL scores.

Exploring handwritten digit classification: a tidy analysis of the MNIST dataset
January 22, 2018

Demonstrating an approach of exploratory data analysis on a classic image classification problem.

What's the difference between data science, machine learning, and artificial intelligence?
January 09, 2018

An oversimplified definition of the difference between three important fields.

2017

Advice to aspiring data scientists: start a blog
November 14, 2017

Reasons that a data science blog is particularly valuable to people at the start of their career.

Announcing "Introduction to the Tidyverse", my new DataCamp course
November 09, 2017

Introducing my new DataCamp course that teaches ggplot2 and dplyr, and how they relate.

Don't teach students the hard way first
September 21, 2017

Discussion of a particular teaching approach that I believe is a mistake.

Trump's Android and iPhone tweets, one year later
August 09, 2017

A followup to last summer's analysis of Donald Trump's Twitter account.

Teach the tidyverse to beginners
July 05, 2017

An argument for teaching R packages like dplyr and tidyr as the first part of a data science course.

Two years as a Data Scientist at Stack Overflow
June 22, 2017

Looking back at my second year at the first job I've had outside academia.

Words growing or shrinking in Hacker News titles: a tidy analysis
June 08, 2017

An analysis of one million Hacker News titles, and what topics and technologies are changing in frequency over time.

Slides, videos, and tweets from the 2017 New York R Conference
May 22, 2017

Some notes from the 2017 New York R Conference, and slides and video from my talk.

How we built Tagger News: machine learning on a tight schedule
May 15, 2017

A discussion of the algorithm training and development behind our TechCrunch Disrupt hackathon project

Gender and verbs across 100,000 stories: a tidy analysis
April 27, 2017

An analysis of "he " vs "she " in plot descriptions from Wikipedia.

Examining the arc of 100,000 stories: a tidy analysis
April 26, 2017

An analysis of over 100,000 plot descriptions downloaded from Wikipedia, particularly examining which words tend to occur at which point in a story.

Announcing the release of my e-book: Introduction to Empirical Bayes
February 07, 2017

Releasing my e-book 'Introduction to Empirical Bayes: Examples from Baseball Statistics', adapted from my series of posts on applying empirical Bayesian methods to baseball.

Simulation of empirical Bayesian methods (using baseball statistics)
January 11, 2017

Examining the assumptions and accuracy of empirical Bayes through simulations.

Introducing the ebbr package for empirical Bayes estimation (using baseball statistics)
January 05, 2017

Turning many of the statistical methods described in the baseball/empirical Bayes series into a convenient R package.

Understanding mixture models and expectation-maximization (using baseball statistics)
January 03, 2017

Modeling the prior as a mixture of two beta distributions, and classifying each player with an iterative expectation-maximization algorithm.

2016

Analysis of software developers in New York, San Francisco, London and Bangalore
December 01, 2016

An analysis of how programmers use different programming languages among different cities, based on Stack Overflow traffic.

The 'deadly board game' puzzle: efficient simulation in R
October 19, 2016

A simulation of a probabilistic puzzle from the Riddler column on FiveThirtyEight.

Understanding empirical Bayesian hierarchical modeling (using baseball statistics)
October 12, 2016

Allowing the priors to depend on observed variables such as year and position.

Slides from graduates of the Metis Data Science Career Day
September 30, 2016

A directory of talks, slides, and professional information from graduates presenting at the Metis bootcamp.

Tidying computational biology models with biobroom: a case study in tidy analysis
September 06, 2016

An example of tidying Bioconductor model objects, such as from the limma package, with the biobroom package.

Analysis of the #7FavPackages hashtag
August 25, 2016

An analysis of people's favorite R packages, as shared in the #7FavPackages hashtag.

useR and JSM 2016 conferences: a story in tweets
August 23, 2016

My reports on the useR and JSM conferences that I attended in summer 2016.

Text analysis of Trump's tweets confirms he writes only the (angrier) Android half
August 09, 2016

Text and sentiment analysis on Trump's tweets confirms that the tweets posted from the iPhone appear to come from his campaign, while his tweets from Android are the 'off-the-cuff' observations he's known for.

Does sentiment analysis work? A tidy analysis of Yelp reviews
July 21, 2016

How well does sentiment analysis work at predicting customer satisfaction? We examine a Yelp dataset using the tidytext package

stacksurveyr: An R package with the 2016 Developer Survey Results
July 19, 2016

Sharing the answers of 56,000 developers in an R package easily suited for analysis

Releasing the StackLite dataset of Stack Overflow questions and tags
July 18, 2016

Sharing a new resource for analyzing Stack Overflow questions

One year as a Data Scientist at Stack Overflow
June 20, 2016

Looking back at a year at my new job, and the transition from academia to industry.

Understanding beta binomial regression (using baseball statistics)
May 31, 2016

Allowing our priors to depend on the number of times each player went up to bat.

Understanding Bayesian A/B testing (using baseball statistics)
May 23, 2016

How to detect a difference between two proportions using Bayesian hypothesis testing

The adblockr package: block ads from the monetizr package
April 01, 2016

Introducing a new package that blocks the greedy, distracting ads from the heinous monetizr package.

The monetizr package: make money on your open source R packages
April 01, 2016

Introducing a new package that lets you monetize R open source development.

How to replace a pie chart
March 14, 2016

An example of replacing a pie chart with a bar chart that communicates more information.

Why I use ggplot2
February 12, 2016

A response to Jeff Leek about base plotting and the Grammar of Graphics.

2015

Analyzing networks of characters in 'Love Actually'
December 25, 2015

An analysis and visualization of a holiday classic.

The 'lost boarding pass' puzzle: efficient simulation in R
December 11, 2015

A simulation of the 'lost boarding pass' puzzle, showing one way to perform efficient simulations in R even when they need to keep track of state.

Modeling gene expression with broom: a case study in tidy analysis
November 25, 2015

An example of fitting models to each gene in an expression dataset using tidy tools (dplyr and broom).

Cleaning and visualizing genomic data: a case study in tidy analysis
November 19, 2015

An example of cleaning and graphing a gene expression dataset using tidy tools (dplyr, tidyr, and ggplot2).

What are the most polarizing programming languages?
November 04, 2015

An analysis of what technologies are liked and disliked on Stack Overflow Careers.

Understanding the Bayesian approach to false discovery rates (using baseball statistics)
November 03, 2015

Taking an empirical Bayesian approach to false discovery rates, in order to assemble a 'Hall of Fame' of great batters.

Understanding credible intervals (using baseball statistics)
October 21, 2015

Computing posterior credible intervals after empirical Bayesian estimation, in terms of baseball batting averages.

Understanding empirical Bayes estimation (using baseball statistics)
October 01, 2015

An intuitive explanation of empirical Bayes estimation in terms of estimating baseball batting averages.

Is Bayesian A/B Testing Immune to Peeking? Not Exactly
August 21, 2015

Bayesian A/B testing doesn't 'solve' the problems of frequentist testing; it just makes different promises.

Yes, There is Such a Thing as a Stupid Question
June 08, 2015

There are stupid questions, but they're not what you're thinking of.

A Million Lines of Bad Code
April 17, 2015

A confession, and a plea, about 'code-shaming'.

Slides from my talk on the broom package
April 13, 2015

Slides and some highlights from my talk on the broom package at UP-STAT 2015.

broom: a package for tidying statistical models into data frames
March 19, 2015

Introducing a package that turns statistical objects from R into tidy data frames that can be used with packages like dplyr and ggplot2.

View package downloads over time with Shiny
March 06, 2015

A Shiny app to visualize downloads from RStudio's CRAN mirror.

Introducing stackr: An R package for querying the Stack Exchange API
February 04, 2015

An example analysis of a Stack Overflow user (me) in R

What kind of programmer are you? Stack Exchange can predict it, Shiny can graph it
February 02, 2015

A Shiny visualization for personalized data from Stack Exchange's machine learning system Providence

K-means clustering is not a free lunch
January 16, 2015

A response to a Cross Validated question, exploring the assumptions underlying the k-means algorithm.

2014

Can R and ggvis help solve Serial's murder?
December 23, 2014

Creating an interactive visualization of the podcast Serial's infamous 'call log,' using ggvis.

Understanding the beta distribution (using baseball statistics)
December 20, 2014

An intuitive explanation of the beta distribution in terms of predicting baseball batting averages.

Don't teach built-in plotting to beginners (teach ggplot2)
December 16, 2014

If you're teaching built-in plotting in your statistics class, you're doing it wrong.

How to interpret a p-value histogram
December 15, 2014

What anyone doing multiple hypothesis testing should know

Are high-reputation users quitting Stack Overflow?
December 15, 2014

An analysis based on answering trends over time.
