This post was originally published by RV Data Scientist Ryan Angi on Medium.
Red Ventures is headquartered in Charlotte, NC, which is an amazing place to live and raise a family. However, like every city, it has its challenges: skyrocketing home costs, low economic mobility, rising crime rates, and increasing traffic incidents to name a few. In response, a few tech leaders in our community launched the Queen City Hackathon in order to bring the impressive technical talent in this city together and use machine learning and data science to help solve some of these problems.
Overwhelmingly successful, the Queen City Hackathon became the largest data science hackathon in the Southeast, with over 250 participants in its first year.
One of the central tenets of our culture and who we are here at Red Ventures is “leaving the woodpile higher than we found it.” So, when given the chance on a Friday afternoon to stay “a little late” and participate in a 24-hour hackathon for a great cause, several of my coworkers and I jumped at the opportunity to use our skills to make a difference for the city of Charlotte.
Data Gathering and Cleaning
The specific task of the hackathon was to “build an application using a data science model to improve the lives of Charlotteans.” As a resource, we were provided seven datasets:
- Charlotte Agenda News Articles
- Charlotte Arrest Records
- Charlotte Reddit Posts
- Charlotte Traffic Incidents
- Charlotte 311 Non-Emergency Requests
- Health Claims Data
- Speak Up Magazine Articles
With no clear solution in mind, our first step was to understand the available data through exploratory data analysis, which included plotting distributions of the variables to determine which data points were reliable and which were sparsely or unreliably coded. Once we had a good understanding of the different datasets we could work with and the useful features we could incorporate in our solution, we got excited. The traffic incidents dataset contained high-quality data and required less processing of unstructured text, which left us more time to focus on the modeling side of things. We started building a plan.
The traffic incident dataset included one row for each traffic incident with a “severity of accident” score, which we converted into a binary variable (0 for less severe, and 1 for highly severe). This dataset also included latitude and longitude coordinates, which allowed us to assign each accident to a cell in a 1 km grid and aggregate the data by cell. From this, we computed the share of each accident characteristic within each grid cell (e.g. 80% of the accidents in a cell occurred on coarse asphalt and 90% occurred with no stop sign or stop light present).
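The grid aggregation described above can be sketched in a few lines. This is a minimal illustration in Python (the project itself used R), not the code we ran at the hackathon. The field names (lat, lon, severity, surface, traffic_control), the severity cutoff of 3, and the flat degrees-to-kilometers conversion are all assumptions made for the example.

```python
import math
from collections import defaultdict

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude, in km


def grid_cell(lat, lon, cell_km=1.0, ref_lat=35.2271):
    """Return an integer (row, col) id for the ~cell_km square containing a point.

    ref_lat (roughly Charlotte's latitude) fixes the east-west length of a
    degree of longitude, so cells are only approximately square.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    row = math.floor(lat * KM_PER_DEG_LAT / cell_km)
    col = math.floor(lon * km_per_deg_lon / cell_km)
    return row, col


def aggregate_by_cell(incidents, severe_threshold=3):
    """Group incidents by grid cell and compute per-cell shares.

    Each incident is a dict with hypothetical keys: lat, lon, severity,
    surface, traffic_control. severe_threshold is an assumed cutoff used to
    turn the severity score into a 0/1 flag.
    """
    cells = defaultdict(list)
    for inc in incidents:
        cells[grid_cell(inc["lat"], inc["lon"])].append(inc)

    summary = {}
    for cell, rows in cells.items():
        n = len(rows)
        summary[cell] = {
            "n_accidents": n,
            "severe_rate": sum(r["severity"] >= severe_threshold for r in rows) / n,
            "share_coarse_asphalt": sum(r["surface"] == "coarse asphalt" for r in rows) / n,
            "share_no_control": sum(r["traffic_control"] == "none" for r in rows) / n,
        }
    return summary
```

Each key of the returned dictionary identifies one grid cell, and each value is one row of per-cell features that a model could be trained on.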
Next, we used the coordinates from the traffic dataset to join to the coordinates from the 311 non-emergency reports. This extended our feature set from solely information about accidents in a grid cell of Charlotte to a more holistic picture of the roads in the area. The 311 dataset was mostly text data, so we had to use NLP techniques to identify which reports concerned road conditions (e.g. frequent reports of potholes or traffic lights out). Geospatial mapping via coordinate systems is extremely useful in data gathering and cleaning, because one can use the coordinates as a key to join separate datasets and put together a fuller picture of a physical area.
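To make the "coordinates as a join key" idea concrete, here is a small Python sketch (again, the project used R). Simple keyword matching stands in for the NLP step, which this post does not describe in detail, and the keyword list, tag names, and feature names are hypothetical. It assumes each 311 report has already been mapped to the same grid-cell id used for the accident features.

```python
from collections import Counter

# Hypothetical tags and trigger phrases; a stand-in for the real NLP step.
ROAD_KEYWORDS = {
    "pothole": ("pothole",),
    "signal_out": ("traffic light", "signal out", "signal is out"),
}


def tag_report(text):
    """Return the set of road-issue tags whose phrases appear in a report."""
    lowered = text.lower()
    return {
        tag
        for tag, phrases in ROAD_KEYWORDS.items()
        if any(phrase in lowered for phrase in phrases)
    }


def count_tags_by_cell(reports):
    """reports is an iterable of (cell_id, text); returns {cell_id: Counter of tags}."""
    counts = {}
    for cell, text in reports:
        counts.setdefault(cell, Counter()).update(tag_report(text))
    return counts


def left_join(accident_features, tag_counts):
    """Add 311 tag counts to each accident cell, using the cell id as the key.

    Cells with no matching reports get a count of 0 for every tag.
    """
    joined = {}
    for cell, features in accident_features.items():
        row = dict(features)
        cell_counts = tag_counts.get(cell, Counter())
        for tag in ROAD_KEYWORDS:
            row[f"reports_{tag}"] = cell_counts[tag]
        joined[cell] = row
    return joined
```

This is a left join on the accident cells: cells that only appear in the 311 data are dropped, which matches a model whose unit of prediction is a grid cell with accident history.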
In the exploratory process, one variable we became very interested in predicting was “severity of traffic accidents”. The important part to us was not only to predict which areas in Charlotte have a high likelihood of having a severe accident, but also to build an interpretable model to explain the factors that went into the prediction. City officials could then use these explanations to make the decision to pave a road, add a light, or increase police patrols in the area, which could reduce the likelihood of a severe accident.
Once we completed our data gathering and cleaning, we split our data into training and validation sets and began our process of model iteration and hyperparameter tuning. After trying several different model types, we found that an XGBoost model fit the data best.
To quickly give a high-level overview of how XGBoost models work, let’s start with gradient boosted machines (GBMs). GBMs are similar to random forests, but instead of building many independent trees and averaging the results, GBMs build trees sequentially, with each new tree fit to the residuals left by the trees before it. XGBoost is very similar to a GBM algorithm in that it still builds each tree on the residuals of the previous trees, but it adds regularization to prevent overfitting and improve performance on data it has not seen in training.
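The "trees on residuals" idea is easier to see in code. The following is a toy Python sketch of plain gradient boosting for squared error, using one-split trees (stumps) on a single feature. It is not XGBoost and leaves out its regularization entirely. The function names, learning rate, and number of trees are choices made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best splits the residuals (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each tree's correction so no single tree dominates.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

A random forest would instead fit every tree to the original targets on a resampled dataset and average them. Here each stump only has to correct what the previous ones got wrong.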
The XGBoost algorithm is fantastic at prediction and at finding interactions between features. However, historically it has been quite a black box when it comes to understanding the impact of the variables that went into a prediction. To make these models interpretable, we leveraged a package in R called xgboostExplainer. This package is similar to Local Interpretable Model-Agnostic Explanations (LIME), but built specifically for graphical explanations of XGBoost predictions. It is much more insightful than a feature importance plot because it shows the relative positive and negative effect of each feature on each individual prediction.
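To give a feel for what a per-prediction breakdown looks like, here is a toy Python sketch of path-based attribution on a single hand-built tree: follow one row down the tree and credit each change in the predicted log-odds to the feature that was split on. This is not xgboostExplainer's code, and treating it as a description of that package's internals is an assumption. The tree, feature names, and numbers are made up for illustration.

```python
import math

# A tiny hand-built tree. "value" is the log-odds the tree would predict if it
# stopped at that node; leaves have no "feature" key. All numbers are made up.
TREE = {
    "value": 0.0, "feature": "share_no_control", "threshold": 0.5,
    "left": {"value": -1.0},
    "right": {
        "value": 0.8, "feature": "share_coarse_asphalt", "threshold": 0.5,
        "left": {"value": 0.2},
        "right": {"value": 1.5},
    },
}


def explain(tree, row):
    """Walk one row down the tree and credit each value change to the split feature.

    Returns (intercept, contributions, leaf_value), where
    intercept + sum(contributions.values()) equals leaf_value.
    """
    intercept = tree["value"]
    contributions = {}
    node = tree
    while "feature" in node:
        feature = node["feature"]
        child = node["left"] if row[feature] <= node["threshold"] else node["right"]
        contributions[feature] = contributions.get(feature, 0.0) + child["value"] - node["value"]
        node = child
    return intercept, contributions, node["value"]


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))
```

For a boosted model, the same walk would be repeated over every tree and the per-feature contributions summed, so the breakdown differs from one grid cell to the next even though the model is the same.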
This was useful for the hackathon project because we already know that repaving roads and adding lights throughout the city generally help prevent severe accidents. What we don’t know is which streets and neighborhoods in Charlotte would benefit most from better-lit streets or more police patrols to prevent speeding. Understanding this helps policy makers target specific areas with actions that might differ from the ones needed to reduce the severity of accidents in another neighborhood.
We opted to use the leaflet package in R to produce the background map, and we plotted our latitude and longitude data points on top. We then scored each point on our grid using the XGBoost model, which produced a prediction from 0 to 1 for the likelihood of a severe accident occurring in that 1 km area. We used a gradient color palette, with a green dot indicating a less severe prediction and a red dot a more severe one. Reactive values and observe functions in Shiny allow a user to select a data point, and the XGBoost Explainer plot on the right-hand side of the application updates with an explanation of what led to the prediction at that point.
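The green-to-red palette amounts to interpolating a color from the predicted probability. A minimal Python sketch of that mapping follows. The app itself was built in R with leaflet, and the function name and exact color ramp here are assumptions.

```python
def severity_color(probability):
    """Map a predicted probability in [0, 1] to a hex color from green to red."""
    p = min(1.0, max(0.0, probability))  # clamp out-of-range scores
    red = round(255 * p)
    green = round(255 * (1 - p))
    return f"#{red:02x}{green:02x}00"
```

A score of 0 maps to pure green (#00ff00), a score of 1 maps to pure red (#ff0000), and scores in between blend the two channels linearly.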