Impact of Weather Conditions on Accidents in US
Project Overview
For this project, two datasets, population data and US accidents data, were combined into a final analysis-ready dataset containing over 2.8 million accident records.
The project investigates whether certain weather-related factors have the potential to act as predictors that can be used to develop a model for accurately predicting the likelihood of future accidents.
Limitations
The dataset contained no personally identifiable information (PII), so I did not face any data privacy issues.
Objective
I conducted this analysis as a personal project in order to gain familiarity with the fundamental concepts surrounding different machine learning algorithms and their applications in creating predictive models.
Tools and Techniques
To identify patterns within the data, regression analysis was used as the supervised machine learning method and cluster analysis as the unsupervised method.
The scikit-learn library was used to implement the machine learning algorithms.
The data was transformed and scaled using scikit-learn's StandardScaler class in Python.
Data Cleaning and Scaling
After addressing missing values within the dataset, I conducted a brief descriptive analysis through aggregations and visualizations to identify extreme values. Then, I created subsets of the dataset that excluded the extreme outliers, which skewed the data and could distort the results. I repeated this process a few times and ran the algorithms on the new subsets to examine whether they yielded different results.
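The subsetting step above could look roughly like the sketch below. The 1.5 × IQR rule, the column names `precipitation_in` and `distance_mi`, and the sample values are all assumptions made for illustration; the project does not state which rule or thresholds were actually applied.

```python
import pandas as pd


def drop_extreme_outliers(df, columns, k=1.5):
    """Return a copy of df without rows outside the k * IQR fences of the given columns."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Keep only rows that fall inside the fences for this column.
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].copy()


# Hypothetical sample: the last row has an extreme precipitation value.
accidents = pd.DataFrame({
    "precipitation_in": [0.0, 0.01, 0.02, 0.05, 0.03, 10.0],
    "distance_mi": [0.1, 0.5, 0.3, 0.2, 0.4, 0.3],
})
subset = drop_extreme_outliers(accidents, ["precipitation_in", "distance_mi"])
```

Changing `k` gives stricter or looser subsets, which is one way to rerun the algorithms on several versions of the data.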
I also scaled the data in this phase in order to avoid numerical instability, loss of precision, and distortion in graphical plots.
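As a minimal sketch of the scaling step, the snippet below standardizes two features with scikit-learn's `StandardScaler`. The feature matrix is made up for illustration (precipitation in inches and visibility in miles are assumed columns, not the project's actual feature set).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: column 0 is precipitation (in), column 1 is visibility (mi).
X = np.array([
    [0.0, 10.0],
    [0.1, 7.0],
    [0.5, 2.0],
    [1.0, 1.0],
])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales the column to zero mean and unit variance.
X_scaled = scaler.fit_transform(X)
```

Keeping the fitted `scaler` object around makes it possible to apply the same transformation to new data with `scaler.transform(...)`.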
Exploratory Analysis
During the initial exploratory analysis, a correlation heatmap revealed a few weak correlations. I then created scatterplots for the strongest correlations in this heatmap to visualize and examine the relationships between these variables.
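One way to pick the strongest pairs out of a correlation matrix is sketched below. The helper `strongest_correlations`, the column names, and the sample values are hypothetical; the matrix it ranks is the same one a heatmap would display.

```python
import numpy as np
import pandas as pd


def strongest_correlations(df, top_n=3):
    """Return the top_n variable pairs ranked by absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)


# Hypothetical sample values, for illustration only.
sample = pd.DataFrame({
    "precipitation_in": [0.0, 0.1, 0.2, 0.5, 1.0, 0.3],
    "visibility_mi": [10.0, 9.0, 8.0, 5.0, 1.0, 7.0],
    "distance_mi": [0.5, 0.2, 0.9, 0.1, 0.6, 0.3],
})
top = strongest_correlations(sample, top_n=2)
```

Each entry in `top` is indexed by a pair of column names, which can then be passed to a scatterplot.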
Regression Analysis
To model how changes in weather conditions, mainly precipitation levels, would affect visibility and the extent of the road affected by an accident, I performed regression analysis to examine whether an equation could describe the relationship between these variables. The fitted regression lines behaved as I expected.
However, testing the regression models revealed that they failed to capture the underlying patterns in the data, most likely because the relationships between the variables are non-linear.
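The sketch below illustrates how a straight-line fit can be checked on held-out data. It uses synthetic data with an assumed exponential relationship between precipitation and visibility, chosen only to show a non-linear case; it is not the project's data, and the 70/30 split and R² metric are assumptions about how such a check might be done.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, assumed data: visibility decays non-linearly as precipitation rises.
precip = rng.uniform(0, 1, size=(500, 1))
visibility = 10 * np.exp(-5 * precip[:, 0]) + rng.normal(0, 0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    precip, visibility, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
# R^2 on the held-out set measures how much variance the straight line explains.
r2 = r2_score(y_test, model.predict(X_test))
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, test R^2={r2:.2f}")
```

In a case like this the slope has the expected sign, yet the held-out score shows that the line leaves part of the curved pattern unexplained.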
Cluster Analysis
After regression proved unsuitable for building a predictive model, I used k-means, a centroid-based clustering algorithm, to determine whether the data points could be grouped into clusters based on similar characteristics. However, only small differences were observed between clusters. For example, in the precipitation and distance chart, data points in cluster 0 are densest in regions with either longer distances and precipitation levels under 0.5 inches, or distances under 2 miles with up to 1 inch of precipitation.
Data points in cluster 1 exhibit the same characteristics as those in cluster 0, with a few additional data points that affected either a very short distance with precipitation levels over 1 inch or a very long distance with precipitation levels above 0.5 inches.
The last group, cluster 2, mostly contains accidents that affected either long distances with very low precipitation (under 0.3 inches) or short distances with precipitation between 0 and 1 inch.
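A minimal k-means sketch with three clusters, matching the cluster labels 0 to 2 described above, is shown below. The data is randomly generated and the two features (precipitation and distance) are stand-ins; `n_init` and `random_state` are illustrative choices, not the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: column 0 is precipitation (in), column 1 is distance (mi).
X = np.column_stack([
    rng.uniform(0, 1.5, size=300),
    rng.uniform(0, 10, size=300),
])

# Scale first so that neither feature dominates the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)  # one cluster label (0, 1, or 2) per row

# Cluster centers in scaled feature space, one row per cluster.
centers = kmeans.cluster_centers_
```

Comparing per-cluster summaries of the original features, for example with a pandas `groupby` on `labels`, is one way to see whether the clusters differ meaningfully.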
Conclusions and Next Steps
Although this analysis did not produce significant results, I really enjoyed working on this project. I learned many new techniques and methodologies for detecting outliers while cleaning data, as well as for running machine learning algorithms.
As a next step, the dataset could be analyzed at the state level to examine whether different insights emerge and to gain a more detailed understanding of how weather conditions affect accident patterns in each individual state.
Deliverables
Tableau Dashboard
GitHub Repository