Sam Pastoriza

Insights on Factors of Fatal Car Accidents

Methods

Section 1: Understanding the Data

Data Sourcing and Collection

The data for this analysis comes from the National Highway Traffic Safety Administration's Fatality Analysis Reporting System (FARS), which provides detailed information on all car accidents resulting in at least one fatality in the United States from 1975 to 2019. The raw data is publicly available at the following location: NHTSA-FARS → (Year-Folder) → National → FARS~year~NationalCSV.zip. Each zip file contains .csv files covering different aspects of the fatal car accidents. This study uses the accident.csv, vehicle.csv, and person.csv files to obtain data on each accident and on the vehicles and persons involved. The data describes each accident in considerable detail, including the vehicles and people involved, the time and location of the accident, the weather and road conditions at the site, and more. Analyzing the geographic, temporal, and conditional distributions of these fatal accidents through visualizations provides actionable insights that transportation engineers, automotive engineers, and all drivers could use to make driving safer overall. By illuminating the trends involved with fatal car accidents, we aim to spur action on how to prevent these accidents in the future from both an engineering and an individual driver perspective.

Data Cleaning and Preprocessing

The processing pipeline merged the three files for each year, across all years of interest, into a single dataset for exploration, joining them on a common “Case ID” variable. The data was then cleaned and organized into a tidy format in which the unit of analysis is each individual person involved in a fatal accident, so that each row contains information on a single person, the vehicle they were in, and the accident they were involved in. The final data contains information on approximately 1.7 million fatal accidents and 26 variables describing, for each accident, its location (State, Longitude, Latitude, Rural vs Urban, Route Signing), its time (Month, Day, Hour, Minute), the road conditions (Weather, Light Condition, Road Condition, Speed Limit, Number of Lanes), the people involved (Age, Sex, Severity of Injury, Seat Position, Seat Belt Usage, Ejection Status, Drinking, Drugs), and the vehicles involved (Make, Model, Year, Traveling Speed). In addition to these variables, the total number of fatalities per accident is recorded. Finally, the date of each accident was derived from the raw values to make it easier to visualize fatalities over time.
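The merge described above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: the column names (`case_id`, `veh_no`, and so on) and the toy records are hypothetical stand-ins for the FARS fields, and matching a person to a vehicle through a vehicle number is an assumption on top of the shared Case ID.

```python
from datetime import date

# Toy stand-ins for rows read from accident.csv, vehicle.csv, and person.csv.
# All column names and values here are hypothetical, not real FARS fields.
accidents = [
    {"case_id": 1, "year": 2019, "month": 7, "day": 4, "fatalities": 1},
]
vehicles = [
    {"case_id": 1, "veh_no": 1, "make": "MakeA"},
    {"case_id": 1, "veh_no": 2, "make": "MakeB"},
]
persons = [
    {"case_id": 1, "veh_no": 1, "age": 34},
    {"case_id": 1, "veh_no": 2, "age": 52},
    {"case_id": 1, "veh_no": 2, "age": 17},
]


def merge_person_level(accidents, vehicles, persons):
    """Return one dict per person with vehicle and accident fields attached."""
    acc_by_case = {a["case_id"]: a for a in accidents}
    veh_by_key = {(v["case_id"], v["veh_no"]): v for v in vehicles}
    rows = []
    for p in persons:
        acc = acc_by_case[p["case_id"]]
        # A person without a matching vehicle simply gets no vehicle fields.
        veh = veh_by_key.get((p["case_id"], p["veh_no"]), {})
        row = {**acc, **veh, **p}
        # Derive a single date value from the separate year/month/day fields.
        row["date"] = date(acc["year"], acc["month"], acc["day"])
        rows.append(row)
    return rows


tidy = merge_person_level(accidents, vehicles, persons)
```

Each element of `tidy` is one person-level row, which matches the unit of analysis described above.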

Section 2: Visualization

General Description of Visualization Interface

The visualization interface consists of five pages of narrative and visualizations that investigate fatal car accidents in the United States. First, the "About the Data" page introduces the reader to the topic and provides a few visualizations that set the scene and explain why the data is relevant. Four further pages then investigate specific areas of the topic. The "Weather Across the US" page looks at weather (overall and per state) as a potential contributing factor to fatal car accidents. The "Road Features" page focuses on the types of roads on which accidents occur, examining the impact of speed limit, curvature, and number of lanes. The "Passenger Details & Safety Features" page investigates the age of passengers involved in accidents as well as how safe they are (level of injury and ejection status) based on their seat location and restraint usage. Finally, the "Vehicle Manufacturers & Body Types" page looks at the car manufacturers and body types for vehicles involved in fatal car accidents. Splitting the interface into five separate pages makes it possible to analyze different aspects of fatal car accidents distinctly and in an easy-to-navigate manner.

About the Data

This section aims to provide an understanding of the data's source and composition through text and three visualizations that examine the distribution of fatal car accidents over time and by location. Figure 1 is a time series plot of the number of fatal car accidents per year over the period of analysis (1975-2019). For this figure, a line chart was used with a single point shown at each year; in this way, the viewer can read the specific value for each year while also seeing how the number of fatal car accidents per year has changed over time. The plot is interactive, so the reader can hover over each data point and observe a tool-tip showing the exact number of fatal car accidents that occurred in the given year. This figure is critical for setting the stage of the current and changing climate of fatal car accidents, and the interactive tool-tips make both the overall trend and the individual yearly values easy to read.

The next figure (Figure 2) is another time series plot, but it shows the number of fatal car accidents that occurred during each month within each 5-year period from 1975-2015. A multi-line plot is used, with points marking each month of each 5-year period. This shows how the number of accidents varies by month along with how the overall number of accidents has changed over time. The interactivity of this plot also allows the reader to select individual lines (5-year periods) of interest, and to highlight a specific month and see every 5-year period's total number of fatal car accidents in that month via a tool-tip. Breaking down the fatal car accidents by month was another step in understanding the data and "setting the scene" so that the audience gains interest in the topic. It also gives engineers a starting point for investigating how the time of year may affect safety, which could spur further research. This plot is the first to use the color theme that was established for all plots in the dashboard. Across the rest of the visualizations, one can notice a consistent set of colors, text, and overall theme. This consistency provides the viewer with a seamless experience and flow.
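As a rough sketch of the aggregation behind a plot like Figure 2, the snippet below bins accident years into 5-year periods and counts accidents per month within each period. The bin boundaries (periods starting in 1975, each labeled by its first and last year) are an assumption, and the function names are hypothetical.

```python
from collections import Counter


def period_label(year, start=1975, width=5):
    """Map a year to its 5-year bin, e.g. 1977 -> '1975-1979' (assumed binning)."""
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"


def monthly_counts_by_period(accidents):
    """Count accidents per (5-year period, month).

    `accidents` is an iterable of (year, month) pairs, one per accident.
    """
    counts = Counter()
    for year, month in accidents:
        counts[(period_label(year), month)] += 1
    return counts


# Toy input: three accidents, two of which fall in the same period and month.
example = monthly_counts_by_period([(1976, 7), (1979, 7), (1981, 1)])
```

Each key of the resulting counter corresponds to one point on one line of the multi-line plot.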

Finally, the third image on this page (Figure 3) depicts the total number of fatal car accidents per capita in each county in the United States in the year 2019. To best show this, a map of the United States is displayed, with each county outlined by its borders. The number of fatal accidents per capita that occurred in each county in 2019 is then encoded through color and shade, spanning from dark burgundy for large magnitudes to light grey for small magnitudes. In this way, the viewer can very quickly understand where the most and fewest fatal car accidents per capita occur in the United States. The plot has a zoom function to allow the reader to zoom in and out on different areas of the country. There is also a tool-tip which shows the reader the number of fatal car accidents per capita in 2019 along with the 2019 population of the highlighted county. This figure is essential to the narrative as it shows the spread of fatal car accidents across the United States, further setting the scene for the importance of this investigation and drawing the audience into the study.
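The per capita measure used for a map like Figure 3 can be illustrated with a short sketch. The scaling to accidents per 100,000 residents is an assumption made for readability (the text above only says "per capita"), and the county names and numbers in the example are invented.

```python
def fatal_accidents_per_capita(accident_counts, populations, per=100_000):
    """Return fatal accidents per `per` residents for each county.

    Counties with no recorded accidents get a rate of 0; counties with a
    missing or zero population are skipped to avoid dividing by zero.
    """
    rates = {}
    for county, population in populations.items():
        if population and population > 0:
            rates[county] = accident_counts.get(county, 0) * per / population
    return rates


# Hypothetical counties and values, for illustration only.
rates = fatal_accidents_per_capita(
    {"County A": 5},
    {"County A": 50_000, "County B": 10_000, "County C": 0},
)
```

Setting `per=1` gives the raw per capita value instead of a scaled rate.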

Weather Across the US

As mentioned above, this page provides information regarding weather conditions that occurred during fatal car accidents. First, two static visualizations show 1) the counts of fatal car accidents from 1975-2019 that occurred during each weather condition, and 2) the percent of fatal car accidents from 1975-2019 that occurred during each non-clear weather condition. The first visualization (Figure 4) is a bar chart with the total for each weather condition added as an annotation at the top of its bar for ease of understanding. Here, a bar plot was chosen as it is an excellent way to compare magnitudes across several different categories. Then, since the vast majority of accidents occurred in clear weather, the second visualization (Figure 5) uses a pie chart to break down all non-clear weather conditions. A pie chart was chosen for this visualization as this type of figure provides an effortless way to show parts-of-a-whole data. In Figure 5, each slice of the pie represents the share of non-clear-weather accidents that happened during that slice's weather condition. Both of these plots are essential to the narrative for understanding the distribution of weather conditions in the data. For both plots, the themed colors and formats are used for seamless consistency and flow.

Next, an understanding of weather's impact on fatal car crashes in different states was desired. Therefore, Figure 6 visualizes the predominant non-clear weather condition in each state of the United States from 1975-2019. The predominant non-clear weather condition across the US is either rain or snow, so these conditions were coded as "blue" and "light blue" for rain and snow, respectively. These colors were chosen to aid visual understanding, as they are often associated with these conditions in everyday life. The viewer can then click on any state and observe a bar plot breaking down the number of fatal car accidents that occurred during each of the non-clear weather conditions in that state from 1975-2019. This additional breakdown allows a viewer to get details on any state that they are interested in. A bar plot was chosen for these additional views as it shows comparative magnitudes very well. Using bar plots for these views also connects the values easily back to what was seen in Figure 4.
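One plausible way to compute the "predominant non-clear weather condition" per state is to count accidents by weather within each state and take the most frequent non-clear category. The sketch below assumes that definition; the label "Clear" and the state and weather values in the example are hypothetical.

```python
from collections import Counter, defaultdict


def predominant_non_clear_weather(records, clear_label="Clear"):
    """Return the most frequent non-clear weather condition for each state.

    `records` is an iterable of (state, weather) pairs, one per accident.
    States with only clear-weather accidents do not appear in the result.
    """
    by_state = defaultdict(Counter)
    for state, weather in records:
        if weather != clear_label:
            by_state[state][weather] += 1
    return {
        state: counts.most_common(1)[0][0]
        for state, counts in by_state.items()
    }


# Toy records, for illustration only.
predominant = predominant_non_clear_weather([
    ("State A", "Clear"), ("State A", "Rain"), ("State A", "Rain"),
    ("State A", "Snow"), ("State B", "Snow"), ("State B", "Clear"),
])
```

The per-state counters built along the way are the same breakdowns a click-through bar plot would display.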

Next, Figure 7 is very similar to Figure 6 in all of its encodings, but instead of looking at all accidents together, it breaks down the accidents by month through an interactive time lapse visualization. Here, the user can press the "Start" button, and the map will show the predominant weather condition during fatal car accidents in each state for each month (January through December). Again, the coloring for rain and snow closely aligns with their commonly associated colors to make the plot easier to understand. The viewer can also click on a state during the time lapse to view a bar plot of the non-clear weather conditions in that state in the current month. This is shown as a bar plot for consistency with the break-out plots seen in Figure 6.

Road Features

On the "Road Features" page, three static bar plots depict the breakdown of the number of fatal car accidents occurring on roads with different characteristics: speed limit (Figure 8), roadway alignment (Figure 9), and the number of lanes (Figure 10). Each plot was deliberately kept as a very simple bar chart to allow fast understanding of the distribution of these variables across all fatal car accidents from 2000 to 2019. The existing color scheme continues into these plots. Each plot also shows the total number of accidents for each category as an annotation at the top of that category's bar, again for quick analysis of the data. These three plots, though simple and static, are necessary to introduce the data involved in this portion of the narrative. By providing a basis of understanding, these plots make the interactive plot in Figure 11 easier to interpret and more useful.

With the individual distributions understood, the next chart (Figure 11) combines these three categories (speed limit, roadway alignment, and number of lanes) to show the interaction between them. To do this, a sunburst chart breaks all fatal car accidents (2000-2019) first by the road's speed limit, then by its curvature, and finally by its number of lanes. In this way, each slice of the outermost ring represents the total number of fatal car accidents that occurred under one combination of the three variables. To provide maximum understanding, a reader can click on any desired slice of the chart and see a more detailed (zoomed-in) view that subsets the data based on the selected condition. This allows the reader to gain a better understanding of the common road conditions for fatal car accidents. A reader can also hover over a section of the sunburst chart and view the number of fatal car accidents within that subset through a tool-tip. As always, the consistent color theme continues into this chart. This plot is essential to the narrative as it displays the combined effect that the three specified road conditions have on fatal car accidents. This understanding is essential for a transportation engineer investigating how to improve the safety design of roads.
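The three-level breakdown behind a sunburst chart can be sketched as a flat list of nodes, each with an id, a parent id, and a count. This is an illustrative data-preparation step under assumed field names, not the project's code; the exact input shape a given charting library expects should be checked against its documentation.

```python
from collections import Counter


def sunburst_nodes(records):
    """Build flat id/parent/value nodes for a three-level hierarchy.

    `records` is an iterable of (speed_limit, alignment, lanes) tuples, one
    per accident. Every node's value is the number of accidents beneath it.
    """
    leaf_counts = Counter(records)
    nodes = {}
    for (speed, alignment, lanes), n in leaf_counts.items():
        path = [str(speed), str(alignment), str(lanes)]
        parent = ""  # top-level nodes hang off an implicit root
        for depth in range(len(path)):
            node_id = "/".join(path[: depth + 1])
            node = nodes.setdefault(
                node_id,
                {"id": node_id, "parent": parent, "name": path[depth], "value": 0},
            )
            node["value"] += n
            parent = node_id
    return list(nodes.values())


# Toy records, for illustration only.
nodes = sunburst_nodes([
    (55, "Straight", 2), (55, "Straight", 2), (55, "Curve", 2), (35, "Straight", 4),
])
by_id = {node["id"]: node for node in nodes}
```

Clicking a slice to zoom in then corresponds to keeping only the nodes whose id starts with the selected node's id.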

Passenger Details and Safety Features

The "Passenger Details & Safety Features" page focuses on three main characteristics related to individuals involved in fatal car accidents: age, ejection status, and proper seat belt usage. First, three bar plots illustrate the percentages of passengers involved in fatal car accidents for each age group (Figure 12), ejection status (Figure 13), and seat belt usage (Figure 14), serving as exploratory data analysis for the reader before the more advanced visualizations. Bar plots were used to give readers a simple and straightforward understanding of the distribution of each of these important characteristics. The metrics are represented as percentages because raw counts can at times mislead viewers and do not show how each component contributes to the whole. Also, for each of the three topics, two bar plots are shown: one for accidents occurring before 2000, and one for accidents occurring during or after 2000. Splitting the data in this way shows the viewer how these three passenger characteristics have changed over time. The colors of these bar charts are consistent with the custom theme, providing continuity to viewers.
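The percentage breakdown with a before/after-2000 split can be sketched as below. The function name, period labels, and the category values in the example are hypothetical, and percentages are computed within each period so that each period's bars sum to 100.

```python
from collections import Counter, defaultdict


def percent_by_period(records, split_year=2000):
    """Return each category's percentage share within each time period.

    `records` is an iterable of (year, category) pairs, one per person.
    Years before `split_year` go to 'before'; all others to 'during_or_after'.
    """
    counts = defaultdict(Counter)
    for year, category in records:
        period = "before" if year < split_year else "during_or_after"
        counts[period][category] += 1
    result = {}
    for period, category_counts in counts.items():
        total = sum(category_counts.values())
        result[period] = {
            category: 100 * n / total for category, n in category_counts.items()
        }
    return result


# Toy records, for illustration only.
shares = percent_by_period([
    (1990, "Belted"), (1995, "Unbelted"), (1998, "Unbelted"),
    (1999, "Unbelted"), (2005, "Belted"),
])
```

Normalizing within each period keeps the two periods comparable even when they contain very different numbers of people.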

Next, grouped side-by-side bar charts in Figure 15 show readers a breakdown of individuals involved in a fatal car accident based on seat location. Within each bar graph, these bars are further subset to look at the counts of the levels of severity of injury for these individuals based on seat location. The plot is interactive such that one can hover over a bar and view the number of individuals within a given category via a tool-tip. The reader can also click on a bar and view the age group distribution, ejection status distribution, and seat belt usage for that group of individuals across the two different time periods. Interactive bar charts were chosen to display this data as there were several variables of interest and several groups to highlight, such as seat location, injury status, and age of the individual. Grouped bar charts afforded the ability to clearly visualize all of these interactions without making something too complex for readers to digest. Interactivity of these bar charts was also important as it provided readers with an added layer of information, allowing one to get a better idea of the interactions between usage of seat belts, location of individuals in the car, and fatal injuries. Without this interactivity, the connection between all of these crucial variables would have been lost. Again, the custom theme was used to keep graphs looking consistent throughout the entirety of the site.

Vehicle Manufacturers & Body Types

The "Vehicle Manufacturers & Body Types" page looks at the makes and models of vehicles involved in fatal car accidents. The initial bar plots serve as exploratory data analysis, giving viewers an understanding of the distributions and characteristics of the data prior to diving into deeper analysis. The bar plots look at the number of fatal car accidents by car make (Figure 16) and car model (Figure 17). Bar plots were used to visualize this data as they provide readers with an understanding of the distribution of each of these variables: car make and car model. The count of the number of fatal accidents for each of these categories is displayed above each respective bar, allowing viewers to quickly get an idea of the quantity associated with each group. Since the numbers associated with each of these categories are rather large, having the value displayed above the bar makes it easier for readers to quickly understand the distribution of the data. Similarly to all other visualizations on the site, the custom theme was applied to these bar plots for continuity.

Next, Figure 18 is an interactive bubble chart that breaks down the top ten car manufacturers and the number of fatal car accidents for the top five body types of each manufacturer. Interactivity allows readers to hover over the bubbles and see the number of fatal car accidents for a body type of a given manufacturer via a tool-tip. One can also highlight the name of a manufacturer and the plot will highlight the bubble for that manufacturer, making the visualization easier to read. A bubble plot was used to illustrate these relationships because the size of each bubble gives viewers a clear sense of the magnitude of fatal car accidents for each category. This magnitude is also visible in the nesting of the smaller bubbles within each manufacturer's bubble. For example, when looking at Harley-Davidson (the purple bubble), one can see that the predominant body type associated with fatal car accidents is a motorcycle, which is not surprising; the example nonetheless shows how easily viewers can draw conclusions about a manufacturer from this visualization. Like all of the other visualizations on this site, the custom theme was applied, allowing for continuity. Also, the color of each manufacturer in the first bar chart was reused for that manufacturer's bubble in this bubble plot, making it easy to connect the two plots.

Section 3: Reflection

Project Development Over Time: Changes in Visualization and Technical Goals

Our team was very flexible in adapting to changes in visualization and technical goals. Originally we were working with a 50/50 split of static and dynamic visualizations given our limited knowledge of interactive visualizations at the beginning of the project. As we learned more about interactive visualizations, we then shifted our visualization and technical goals to be 100% interactive visualizations in order to provide the information in the best and most compelling way to the viewers. Following our mid-point evaluation, we were happy with the choice to move to completely interactive; however, we did want to give viewers an understanding of the distributions and attributes of the data prior to each interactive visualization. As a result, we implemented static exploratory data visualizations ahead of each interactive display. This gives viewers a comprehensive idea of the distributions of the data and what is being used as input for each visualization. It was due to the team's flexibility that we were able to adapt our visualization and technical goals to put forth the best possible product.

How Realistic Were The Original Technical Goals?

Our technical goals were very realistic when compared to what is possible in D3. D3 is a library that allows for complete customization in JavaScript and is a far more powerful tool than we needed for this particular project. For this project, it was important to strike a balance between what we needed to do in low-level JavaScript and what a framework could do for us. For our purposes, D3.js gave too much flexibility without providing the structure that a framework would provide. Using a JavaScript library called HighCharts, we were able to create highly customized visualizations that were dynamic, configurable, and easy to work with. Given our original technical goals and proposal, we were able to implement every single visualization and add more to them without any issues. Plotly and Altair would not be able to build these visualizations unless the developer could access the JavaScript powering the library, and in Python that was very difficult if not impossible. Every one of our visualizations was achievable in D3 (as most visualizations are), but creating a linked and animated choropleth map would have required far more effort and a steeper learning curve in D3 than it did in HighCharts.

Aspects We Were Not Able to Implement

One visualization that we would like to see improved is the bubble chart of car manufacturers and models. Due to limitations in access to data, we were not able to properly normalize the number of fatal car accidents for each manufacturer by the actual number of cars on the road. For example, the current visualization shows Chevrolet cars as having the most fatal car accidents; however, we also know that Chevrolet is one of the most common makes on the road in the United States. Without proper normalization, it would be incorrect to claim that Chevrolet is the “most dangerous” car manufacturer, when that number could be so high simply because of the sheer volume of Chevrolet cars in the United States. We spent a good amount of time looking for metrics that could be used to properly normalize the fatal car accidents data, but unfortunately came up empty. Even though we were unable to normalize the data the way that we wanted to, it was good practice for our team to think of creative solutions to these types of problems, because in industry this is a more frequent problem than people realize. We didn't want to completely abandon the unique bubble chart visualization or the idea of looking at fatal car accidents by make and model, so instead, we kept the visualization in our final deliverable and chose to caveat the conclusions that can be drawn.
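To make the normalization concern concrete, the sketch below compares raw accident counts with a rate per 100,000 vehicles on the road. All figures are invented purely for illustration (no such fleet-size data was available to the project), the manufacturer names are placeholders, and the per-100,000 scaling is an assumption.

```python
def accidents_per_vehicles(accident_counts, vehicles_on_road, per=100_000):
    """Return fatal accidents per `per` vehicles on the road for each make.

    Makes with a missing or zero fleet size are skipped.
    """
    return {
        make: accident_counts[make] * per / vehicles_on_road[make]
        for make in accident_counts
        if vehicles_on_road.get(make)
    }


# Invented numbers: MakeA has more accidents but also a much larger fleet.
counts = {"MakeA": 900, "MakeB": 300}
fleet = {"MakeA": 9_000_000, "MakeB": 1_500_000}

rates = accidents_per_vehicles(counts, fleet)
highest_raw = max(counts, key=counts.get)
highest_rate = max(rates, key=rates.get)
```

In this made-up case the make with the most accidents is not the make with the highest rate, which is exactly why raw counts alone cannot support a "most dangerous" claim.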

Final Remarks: What Would We Do Differently?

Overall, our group had a great workflow that allowed us to really push ourselves in what we were able to deliver. One thing that we did for some, but not all, visualizations was to start with a simple static baseline visualization. From this baseline, we were then able to add complexity and interactivity in layers, as opposed to having to start immediately with a daunting interactive visualization. If we were tasked with a similar project or visualization need in the future, we think fully adopting this approach would be helpful. Starting simple and then increasing capabilities and complexity could prove more helpful in the long run than jumping straight into a complex interactive visualization at the beginning.

Another improvement we could make is in our data collection. Fully understanding all aspects of the visualizations we wanted to create and what is needed from a data perspective to make that vision come to life would prove to be helpful. As mentioned earlier, we ran into some issues with properly normalizing some of the data for the visualizations. Thinking about all possible aspects of the visualization, down to the specifics of whether the data needs to be normalized, ahead of implementation could be a helpful approach to take in the future.