David Robinson

Data Scientist at Stack Overflow, works in R and Python.

Like much of America, I followed season one of Sarah Koenig’s true-crime podcast Serial with an interest that bordered on obsession. Serial tells the story of the 1999 Baltimore murder of high-schooler Hae Min Lee, and of state prisoner Adnan Syed, who was convicted of the crime but to this day maintains his innocence. One especially gripping episode of the podcast was Ep 5: Route Talk, where Sarah and her producer Dana physically retrace the path Adnan allegedly took that day, all while comparing the prosecution’s timeline (built from the testimony of Jay, their eyewitness) to the call log from Adnan’s phone.

It struck me that while most of the podcast makes great treadmill listening, the discussions of the call log are best understood visually. Each of the calls from Adnan’s phone comes with a time, a duration, and a cell tower that it “pinged”, which gives an approximate idea of where in Baltimore the call originated. All that is hard to keep straight when you’re hearing it described on a podcast. Even with a copy of the map and call log on hand, though, the timeline takes great effort to understand: it weaves information across longitude, latitude, time, towers, and people. This makes it the kind of problem best tackled not with a static visualization, but with an interactive timeline. I decided to give it a shot with the young but powerful ggvis package, creating this interactive visualization.

Thanks to some useful packages and resources, the visualization was straightforward to make. Someone had already transformed the Serial map and call log information from the official site into CSV format, and posted it in a GitHub repository. After reading it in, I used dplyr to clean and merge it into a format that could be visualized. I combined it with map shape files of Baltimore City and County downloaded from the Maryland State Data Center, processing them with rgdal and my own broom package (you can find the processing code here).
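
To make the merge step concrete, here is a rough sketch of the idea in plain Python. It is not the dplyr code from the project (which is in R), and every column name, tower ID, and coordinate below is made up for illustration; the real CSV layout may well differ. It attaches tower coordinates to each call, keeping calls whose tower is missing from the lookup, much like a left join.

```python
# Hypothetical call log rows: every value here is invented for illustration.
calls = [
    {"time": "12:07", "duration_s": 21, "person": "A", "tower": "T1"},
    {"time": "12:41", "duration_s": 84, "person": "B", "tower": "T2"},
    {"time": "15:15", "duration_s": 20, "person": "A", "tower": "T9"},  # not in lookup
]

# Hypothetical tower lookup: tower id -> (longitude, latitude).
towers = {
    "T1": (-76.70, 39.30),
    "T2": (-76.75, 39.32),
}


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def merge_calls_with_towers(calls, towers):
    """Left-join calls to tower coordinates; unknown towers get None."""
    merged = []
    for call in calls:
        lon, lat = towers.get(call["tower"], (None, None))
        merged.append(
            {**call, "minute": to_minutes(call["time"]), "lon": lon, "lat": lat}
        )
    return merged


merged = merge_calls_with_towers(calls, towers)
```

Each merged row then carries both a position on the timeline (`minute`, `person`) and a position on the map (`lon`, `lat`), which is the shape a two-panel visualization needs.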

The data naturally lend themselves to visualization as a timeline (two dimensions: time and person called) and a map (two dimensions: longitude and latitude). The real puzzle is how to connect the two so that they can be understood and interpreted in parallel. This is where ggvis’s interactivity, specifically the linked brush, is invaluable. By dragging and selecting calls on the timeline, the user can see which cell towers were “pinged” by those calls.
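
The logic behind a linked brush can be sketched without any plotting library. The Python below is only an illustration of the idea, not ggvis’s API and not the code from the Shiny app. The call data are invented, and the brush is simplified to a time interval, whereas a brush on the real timeline also spans the person axis.

```python
# Hypothetical calls: (minutes since midnight, person called, tower id).
# All values are invented for illustration.
calls = [
    (727, "A", "T1"),
    (761, "B", "T2"),
    (915, "A", "T1"),
    (1143, "C", "T3"),
]


def brushed_call_indices(calls, start_minute, end_minute):
    """Return indices of calls whose time falls inside the brushed interval."""
    return [
        i
        for i, (minute, _person, _tower) in enumerate(calls)
        if start_minute <= minute <= end_minute
    ]


def highlighted_towers(calls, selected):
    """Return the set of towers pinged by the selected calls."""
    return {calls[i][2] for i in selected}


# Simulate dragging a brush over the timeline from 11:40 to 15:20.
selected = brushed_call_indices(calls, 700, 920)
towers_hit = highlighted_towers(calls, selected)
```

In an interactive setting the first function would rerun whenever the brush moves, and the map layer would restyle the towers in `towers_hit`. Sharing a single selection between the two views is what links them.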

ggvis visualization

While the first season of Serial ended last week, interest in the case continues unabated. Statisticians are already tackling Serial’s mysteries from the perspective of Bayesian reasoning, but I like to think that the data science and visualization community has something important to contribute as well, and the development of great R tools like ggvis and Shiny makes that easy. There’s a lot that can be added to visualizations like these (for instance, annotation of key events on the timeline, both those in Jay’s testimony and those corroborated by other witnesses), and I’ve shared the project on GitHub in the hope that others use it as inspiration.


You can find the data pre-processing code here and the code for the Shiny app here (with most of the interesting ggvis code in server.R).