David Robinson

Data Scientist at Stack Overflow, works in R and Python.


This year, more than fifty thousand programmers answered the Stack Overflow 2016 Developer Survey, the largest survey of professional developers in history.

Last week Stack Overflow released the full (anonymized) results of the survey at stackoverflow.com/research. To make analysis in R even easier, today I’m also releasing the stacksurveyr package, which contains:

  • The full survey results as a processed data frame (stack_survey)
  • A data frame with the survey’s schema, including the original text of each question (stack_schema)
  • A function that works easily with multiple-response questions (stack_multi)

This makes it easier than ever to explore this rich dataset and answer questions about the world’s developers.
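Installation isn't covered above. As a sketch, assuming the package is hosted on GitHub under dgrtwo/stacksurveyr (an assumption worth checking against the package's README) and that the remotes package is available, it could be installed like this:

```r
# Assumption: the package lives on GitHub at dgrtwo/stacksurveyr
# install.packages("remotes")  # uncomment if remotes is not installed yet
remotes::install_github("dgrtwo/stacksurveyr")
```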

Examples: Basic exploration

I’ll give a few examples of survey analyses using the dplyr package. For instance, you could discover the most common occupations of survey respondents:

library(stacksurveyr)
library(dplyr)

stack_survey %>%
  count(occupation, sort = TRUE)
## # A tibble: 28 x 2
##                             occupation     n
##                                  <chr> <int>
## 1             Full-stack web developer 13886
## 2                                 <NA>  6511
## 3               Back-end web developer  6061
## 4                              Student  5619
## 5                    Desktop developer  3390
## 6              Front-end web developer  2873
## 7                                other  2585
## 8  Enterprise level services developer  1471
## 9           Mobile developer - Android  1462
## 10                    Mobile developer  1373
## # ... with 18 more rows

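We can also use group_by and summarize to find the occupations with the highest average salary. Note that the filter below drops respondents with a missing occupation as well as those who answered "other", since filter() discards rows where the condition evaluates to NA: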

salary_by_occupation <- stack_survey %>%
  filter(occupation != "other") %>%
  group_by(occupation) %>%
  summarize(average_salary = mean(salary_midpoint, na.rm = TRUE)) %>%
  arrange(desc(average_salary))

salary_by_occupation
## # A tibble: 26 x 2
##                                               occupation average_salary
##                                                    <chr>          <dbl>
## 1                 Executive (VP of Eng., CTO, CIO, etc.)      103073.93
## 2                                    Engineering manager      101047.08
## 3                    Enterprise level services developer       79855.62
## 4                                                 DevOps       68731.96
## 5                                        Product manager       68598.62
## 6                                          Growth hacker       67878.79
## 7                             Machine learning developer       67041.80
## 8                                         Data scientist       66508.75
## 9       Business intelligence or data warehousing expert       65660.92
## 10 Developer with a statistics or mathematics background       65625.76
## # ... with 16 more rows

This can be visualized in a bar plot:

library(ggplot2)
library(scales)

salary_by_occupation %>%
  mutate(occupation = reorder(occupation, average_salary)) %>%
  ggplot(aes(occupation, average_salary)) +
  geom_col() +
  ylab("Average salary (USD)") +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip()

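An average on its own doesn't say how many respondents stand behind it, and a few very high salaries can pull it up. As a rough sketch of how the same group_by and summarize pattern could report a count and a median next to the mean, here is a version on a small made-up data frame. The rows in toy_salaries are hypothetical, not survey data; only the column names follow stack_survey.

```r
library(dplyr)

# Hypothetical toy rows (not real survey data); column names follow stack_survey
toy_salaries <- tibble(
  occupation = c("DevOps", "DevOps", "DevOps",
                 "Data scientist", "Data scientist", NA),
  salary_midpoint = c(45000, 55000, 200000, 65000, NA, 75000)
)

toy_summary <- toy_salaries %>%
  filter(!is.na(occupation)) %>%
  group_by(occupation) %>%
  summarize(
    # number of non-missing salaries behind each figure
    respondents = sum(!is.na(salary_midpoint)),
    average_salary = mean(salary_midpoint, na.rm = TRUE),
    # the median is less sensitive to a handful of very high salaries
    median_salary = median(salary_midpoint, na.rm = TRUE)
  ) %>%
  arrange(desc(average_salary))

toy_summary
```

Swapping toy_salaries for stack_survey should give the same kind of table for the real data.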

Examples: Multi-response answers

Ten of the questions allow multiple responses, as the type column of stack_schema shows:

stack_schema %>%
  filter(type == "multi")
## # A tibble: 10 x 4
##                              column  type
##                               <chr> <chr>
## 1               self_identification multi
## 2                           tech_do multi
## 3                         tech_want multi
## 4                   dev_environment multi
## 5                         education multi
## 6                     new_job_value multi
## 7  how_to_improve_interview_process multi
## 8            star_wars_vs_star_trek multi
## 9              developer_challenges multi
## 10               why_stack_overflow multi
## # ... with 2 more variables: question <chr>, description <chr>

In these cases, each respondent's answers are stored in a single string, delimited by "; ". These columns are often easier to analyze when they are "unnested" into one respondent-answer pair per row. The package provides the stack_multi function as a shortcut for that unnesting. For example, consider the tech_do column (“Which of the following languages or technologies have you done extensive development with in the last year?”):

stack_multi("tech_do")
## # A tibble: 225,075 x 3
##    respondent_id  column                 answer
##            <int>   <chr>                  <chr>
## 1           4637 tech_do                    iOS
## 2           4637 tech_do            Objective-C
## 3          31743 tech_do                Android
## 4          31743 tech_do Arduino / Raspberry Pi
## 5          31743 tech_do              AngularJS
## 6          31743 tech_do                      C
## 7          31743 tech_do                    C++
## 8          31743 tech_do                     C#
## 9          31743 tech_do              Cassandra
## 10         31743 tech_do           CoffeeScript
## # ... with 225,065 more rows
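To make the unnesting concrete, here is a minimal sketch of the same idea on a made-up data frame, using tidyr's separate_rows. This is not the package's actual implementation. It assumes the "; " delimiter described above, and the rows in toy_survey are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking a multi-response column (not real survey rows)
toy_survey <- tibble(
  respondent_id = c(1L, 2L, 3L),
  tech_do = c("R; Python; SQL", "JavaScript", NA)
)

toy_unnested <- toy_survey %>%
  filter(!is.na(tech_do)) %>%              # skip respondents who gave no answer
  separate_rows(tech_do, sep = "; ") %>%   # one row per respondent-answer pair
  transmute(respondent_id, column = "tech_do", answer = tech_do)

toy_unnested
```

The result has the same three-column shape (respondent_id, column, answer) as the stack_multi output shown above.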

Using this data, we could find the most common answers:

stack_multi("tech_do") %>%
  count(tech = answer, sort = TRUE)
## # A tibble: 42 x 2
##          tech     n
##         <chr> <int>
## 1  JavaScript 27385
## 2         SQL 21976
## 3        Java 17942
## 4          C# 15283
## 5         PHP 12780
## 6      Python 12282
## 7         C++  9589
## 8  SQL Server  9306
## 9   AngularJS  8823
## 10    Android  8601
## # ... with 32 more rows

We can join this with the stack_survey dataset using the respondent_id column. For example, we could look at the most common development technologies used by data scientists:

stack_survey %>%
  filter(occupation == "Data scientist") %>%
  inner_join(stack_multi("tech_do"), by = "respondent_id") %>%
  count(answer, sort = TRUE)
## # A tibble: 42 x 2
##        answer     n
##         <chr> <int>
## 1      Python   507
## 2         SQL   356
## 3           R   352
## 4        Java   240
## 5  JavaScript   207
## 6         C++   155
## 7           C   125
## 8      Hadoop   108
## 9  SQL Server    98
## 10      Spark    97
## # ... with 32 more rows

Or we could find out the average age and salary of people using each technology, and compare them:

stack_survey %>%
  inner_join(stack_multi("tech_do"), by = "respondent_id") %>%
  group_by(answer) %>%
  summarize(across(c(age_midpoint, salary_midpoint), ~ mean(.x, na.rm = TRUE))) %>%
  ggplot(aes(age_midpoint, salary_midpoint)) +
  geom_point() +
  geom_text(aes(label = answer), vjust = 1, hjust = 1) +
  xlab("Average age of people using this technology") +
  ylab("Average salary (USD)") +
  scale_y_continuous(labels = dollar_format())


If we want to be a bit more adventurous, we can use the (in-development) widyr package to find correlations among technologies, and the ggraph package to display them as a network of related technologies:

library(widyr)
library(ggraph)
library(igraph)

set.seed(2016)

stack_multi("tech_do") %>%
  pairwise_cor(answer, respondent_id) %>%
  filter(correlation > .15) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation)) +
  geom_node_point(color = "lightblue", size = 7) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

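For intuition about what the correlation step measures, here is a base-R sketch on made-up data. It is not widyr's implementation. It assumes that a correlation between two technologies means the Pearson correlation of their 0/1 "this respondent uses it" indicators, with one row per respondent-technology pair. The rows in toy_answers are hypothetical.

```r
# Hypothetical toy answers (not real survey rows), one row per respondent-technology pair
toy_answers <- data.frame(
  respondent_id = c(1, 1, 2, 2, 3, 4),
  answer = c("R", "Python", "R", "Python", "Java", "Java"),
  stringsAsFactors = FALSE
)

# Respondent-by-technology indicator matrix: 1 if the respondent listed the
# technology, 0 otherwise (relies on each pair appearing at most once)
indicator <- unclass(table(toy_answers$respondent_id, toy_answers$answer))

# Pearson correlation between the technology columns
toy_cor <- cor(indicator)
toy_cor
```

In this toy example R and Python are always listed together, so their correlation is 1, while Java is only listed by respondents who list neither, so its correlation with both is -1.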

Try the data out for yourself!