In May, some friends and I built Tagger News, a real-time automatic classifier of Hacker News articles based on their text (see here for more about how we built it). This process started me down some interesting paths, particularly analyzing trends in titles. By finding words that became more or less common in Hacker News titles over time, we can see what topics were gaining or losing interest in the tech world.
I thought I’d share some of these analyses here. As with most of my text analysis posts, I’ll be analyzing it with the tidytext package written by me and Julia Silge. (Check out our soon-to-be-released book on the subject, Text Mining with R!)
Processing one million Hacker News articles
The titles (and dates, and links) of Hacker News articles are helpfully stored on Google Bigquery. We start with a dataset of a million Hacker News article titles, which covers about 3 and a half years of posts. I downloaded it on 2017-06-04 with the following query:
SELECT id, score, time, title, url FROM [bigquery-public-data:hacker_news.full]
WHERE deleted IS null AND dead IS null AND type = 'story'
ORDER BY time DESC
LIMIT 1000000
I’m hosting the dataset so that you can read it yourself (note that it’s a moderately large and compressed file so it may take a minute to read) as follows:
Whenever we read in a text dataset, the first step is usually to tokenize it (split it up into words) and remove stopwords (like “the” or “and”). As described here, we can do this with tidytext’s unnest_tokens
function.
This creates a pretty large data frame (4443083 rows) with one row per word per post. For example, we could use this to find and visualize the most common words in Hacker News posts during this 3.5 year period.
The top result, “hn”, comes from the common constructions of “Show HN” and “Ask HN”, which are often prepended to a title. Among the other words, there’s nothing we wouldn’t have expected from following Hacker News. There some notable technology companies like Google, Apple, and Facebook, along with common topics in tech discussions like “app”, “data” and “startup”.
Change over time
What words and topics have become more frequent, or less frequent, over time? These could give us a sense of the changing technology ecosystem, and let us predict what topics will continue to grow in relevance.
To achieve this, we’ll first count the occurrences of words in titles by month.
We can then use my broom package to fit a model (logistic regression) to examine whether the frequency of each word is increasing or decreasing over time. Every term will then have a growth rate (as an exponential term) associated with it.
We can now ask: what terms have been increasing in frequency in Hacker News titles?
It makes sense that “Trump” is the word that’s growing most quickly. While it shows a moderate growth after Trump announced his candidacy in mid-2015, it shows a sharp increase around the time of the 2016 election (though it hasn’t been quite as dominant in the months since his inauguration). The next fastest growing terms include “AI” and “blockchain”, both topics that have shown a recent surge of interest. (We can also see “machine learning”, “AWS”, and “bot” among the growing terms).
What words have been decreasing in frequency in Hacker News titles?
This shows a few topics in which interest has died out since 2013, including Google Glass and Gmail. Discussion of Edward Snowden and the NSA was fresh news around 2013-2014, so it makes sense that it’s declined since. There are also some notable technologies that people write about less, such as AngularJS, HTML5, and Ruby on Rails. It’s interesting to compare these to Stack Overflow Trends during that time period, in which AngularJS has been growing in terms of Stack Overflow questions asked (HTML5 and Rails have leveled off). It’s possible that discussion on HN is a “leading indicator”, showing a surge articles when a technology first gets popular.
(I don’t currently have a guess for why “million” and “billion” had sudden dropoffs in 2014. Is it some artifact of the Hacker News policy, with the word becoming edited or deleted in newer posts? Or is it a real change in what the site discusses?)
Interestingly, the conversation around “bitcoin” peaked around 2013-2014 (with a recent uptick due to a surge in Bitcoin’s price), while discussion of the blockchain has been growing in the years since. Comparing them on the same axes makes in clear that the blockchain has roughly “caught up to” bitcoin in terms of general interest:
Words that passed their peak
We can learn a lot by examining words , but we’re misses an important piece of the puzzle: there are some words that grew in interest for part of this time, then “passed their peak.” This is more complicated to explore quantitatively, but one approach is to find words where the ratio of the peak in the trend to the average is especially high. To reduce the effect of random noise, we use a cubic spline to smooth each trend before determining the peak.
Here are 16 terms that had strong peaks at various point in the last 3.5 years.
We can see a peak of discussion around net neutrality in 2015 (though it’s shown a recent resurgence of interest). You can spot the introduction of Swift and the Apple Watch, then a moderate decline in discussion around them, and there are sudden jolts of discussion around Oculus in 2014 (with Facebook’s purchase of the company) and the FBI in 2016 (news around Clinton’s email server). The graph suggests the discussion around Slack, bots, Tesla, and virtual reality may have leveled off (though in some cases it may be too early to tell).
What’s next
I have some more analyses of Hacker News posts I’ll be sharing in the future, including analyses of word correlations, examination of what words and topics tend to get upvoted, and some improvements on the Tagger News classification algorithm. The series will also give me the chance to highlight more techniques from Text Mining with R, including word networks and topic modeling.