This year, more than fifty thousand programmers answered the Stack Overflow 2016 Developer Survey, making it the largest survey of professional developers in history.
Last week Stack Overflow released the full (anonymized) results of the survey at stackoverflow.com/research. To make analysis in R even easier, today I’m also releasing the stacksurveyr package, which contains:
The full survey results as a processed data frame (stack_survey)
A data frame with the survey’s schema, including the original text of each question (stack_schema)
A function that works easily with multiple-response questions (stack_multi)
This makes it easier than ever to explore this rich dataset and answer questions about the world’s developers.
Examples: Basic exploration
I’ll give a few examples of survey analyses using the dplyr package. For instance, you could discover the most common occupations of survey respondents:
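A minimal sketch of that count. It assumes the occupation column in stack_survey is named `occupation`; that name is my assumption here, so check names(stack_survey) if it fails.

```r
library(stacksurveyr)
library(dplyr)

# Count respondents per occupation, most common first.
# `occupation` is an assumed column name, not confirmed by the text above.
occupation_counts <- stack_survey %>%
  count(occupation, sort = TRUE)

occupation_counts
```

The result has one row per occupation and a count column `n`, sorted in descending order.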
We can also use group_by and summarize to find the highest paid (on average) occupations:
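One way this could look, assuming the salary is stored as a numeric column called `salary_midpoint` and the occupation as `occupation` (both names are assumptions on my part):

```r
library(stacksurveyr)
library(dplyr)

# Average salary per occupation, highest first.
# `occupation` and `salary_midpoint` are assumed column names.
salary_by_occupation <- stack_survey %>%
  filter(!is.na(occupation)) %>%
  group_by(occupation) %>%
  summarize(average_salary = mean(salary_midpoint, na.rm = TRUE)) %>%
  arrange(desc(average_salary))

salary_by_occupation
```

Passing na.rm = TRUE drops respondents who did not report a salary, so each average covers only those who did.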
This can be visualized in a bar plot:
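A sketch of such a bar plot with ggplot2. It recomputes the per-occupation averages so that it runs on its own, and it again assumes the `occupation` and `salary_midpoint` column names.

```r
library(stacksurveyr)
library(dplyr)
library(ggplot2)

# Average salary per occupation (assumed columns: occupation, salary_midpoint).
salary_by_occupation <- stack_survey %>%
  filter(!is.na(occupation)) %>%
  group_by(occupation) %>%
  summarize(average_salary = mean(salary_midpoint, na.rm = TRUE)) %>%
  filter(!is.na(average_salary))

# Horizontal bars, ordered by average salary so the ranking is easy to read.
salary_plot <- salary_by_occupation %>%
  mutate(occupation = reorder(occupation, average_salary)) %>%
  ggplot(aes(occupation, average_salary)) +
  geom_col() +
  coord_flip() +
  xlab("Occupation") +
  ylab("Average salary (USD)") +
  scale_y_continuous(labels = scales::dollar_format())

salary_plot
```

Flipping the coordinates keeps the occupation labels legible, since many of them are long.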
Examples: Multi-response answers
Ten of the questions allow multiple responses, as noted in the stack_schema data frame:
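A sketch of how those questions might be listed. It assumes stack_schema has a `type` column in which multiple-response questions are marked "multi"; both the column name and the value are assumptions, so inspect stack_schema first if the filter returns nothing.

```r
library(stacksurveyr)
library(dplyr)

# Keep only the schema rows describing multiple-response questions.
# The `type` column and its "multi" value are assumed, not confirmed.
multi_questions <- stack_schema %>%
  filter(type == "multi")

multi_questions
```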
In these cases, the responses are delimited by semicolons (;). Often, these columns are easier to work with and analyze when they are “unnested” into one user-answer pair per row. The package provides the stack_multi function as a shortcut for that unnesting. For example, consider the tech_do column (“Which of the following languages or technologies have you done extensive development with in the last year?”):
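A minimal sketch of the call. It assumes stack_multi takes the column name as a string and returns one row per respondent-answer pair, with the answer stored in a column named `answer` (that output column name is an assumption).

```r
library(stacksurveyr)

# Unnest the semicolon-delimited tech_do responses into one row per
# respondent and technology. The `answer` output column is assumed.
tech_answers <- stack_multi("tech_do")

head(tech_answers)
```

A respondent who listed three technologies contributes three rows, all sharing the same respondent_id.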
Using this data, we could find the most common answers:
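For instance, the unnested answers could be counted as follows, assuming the answers sit in a column named `answer`:

```r
library(stacksurveyr)
library(dplyr)

# Count how many respondents selected each technology, most common first.
# `answer` is an assumed column name in the stack_multi output.
tech_counts <- stack_multi("tech_do") %>%
  count(tech = answer, sort = TRUE)

tech_counts
```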
We can join this with the stack_survey dataset using the respondent_id column. For example, we could look at the most common development technologies used by data scientists:
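A sketch of that join. The `occupation` column, the exact label "Data scientist", and the `answer` column are all assumptions; only respondent_id is named in the text above.

```r
library(stacksurveyr)
library(dplyr)

# Restrict to data scientists, attach their unnested tech_do answers,
# then count technologies. The occupation label is an assumed value.
data_scientist_tech <- stack_survey %>%
  filter(occupation == "Data scientist") %>%
  inner_join(stack_multi("tech_do"), by = "respondent_id") %>%
  count(answer, sort = TRUE)

data_scientist_tech
```

If the result is empty, the occupation label probably differs; counting the occupation column will show the exact spelling.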
Or we could find out the average age and salary of people using each technology, and compare them:
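One possible version, assuming numeric columns named `age_midpoint` and `salary_midpoint` in stack_survey and an `answer` column in the stack_multi output (all three names are assumptions). The scatter plot is just one way to make the comparison.

```r
library(stacksurveyr)
library(dplyr)
library(ggplot2)

# Average age and salary of the respondents using each technology.
# age_midpoint, salary_midpoint and answer are assumed column names.
tech_averages <- stack_survey %>%
  inner_join(stack_multi("tech_do"), by = "respondent_id") %>%
  group_by(tech = answer) %>%
  summarize(
    average_age = mean(age_midpoint, na.rm = TRUE),
    average_salary = mean(salary_midpoint, na.rm = TRUE)
  )

# One labelled point per technology: average age against average salary.
tech_age_salary_plot <- ggplot(tech_averages, aes(average_age, average_salary)) +
  geom_point() +
  geom_text(aes(label = tech), vjust = 1, hjust = 1, check_overlap = TRUE) +
  xlab("Average age of people using this technology") +
  ylab("Average salary (USD)") +
  scale_y_continuous(labels = scales::dollar_format())

tech_age_salary_plot
```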
If we want to be a bit more adventurous, we can use the (in-development) widyr package to find correlations among technologies, and the ggraph package to display them as a network of related technologies:
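A rough sketch of that idea. It assumes widyr's pairwise_cor, igraph's graph_from_data_frame, and an `answer` column in the stack_multi output; the 0.15 correlation cutoff and the "fr" layout are arbitrary illustrative choices, not values from the survey analysis.

```r
library(stacksurveyr)
library(dplyr)
library(widyr)
library(igraph)
library(ggraph)

set.seed(2016)  # the graph layout is randomized, so fix the seed

# Correlation between each pair of technologies, based on which
# respondents selected both. `answer` is an assumed column name.
tech_correlations <- stack_multi("tech_do") %>%
  pairwise_cor(answer, respondent_id) %>%
  filter(correlation > 0.15)  # illustrative threshold

# Draw the remaining pairs as a network, with edge opacity
# proportional to the correlation.
tech_network <- tech_correlations %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), check_overlap = TRUE) +
  theme_void()

tech_network
```

Raising the cutoff leaves fewer, stronger links and makes the clusters of related technologies easier to see.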