More than 38,000 people die every year in crashes on U.S. roadways. The U.S. traffic fatality rate is 12.4 deaths per 100,000 inhabitants.
The economic and societal impact of road crashes costs U.S. citizens $871 billion.
Road crashes cost the U.S. more than $380 million in direct medical costs.
In this study we will - Explore the influence of various road features and environmental conditions to accident severity in the state of Pennsylvania. - Identify locations that are in dire need of intervention to prevent more accidents.
The data used in this analysis was sourced from Kaggle dataset that holds a collection of accidents across US from 2015-19.
This data-set records 2.24million accidents across 49 features.
https://www.asirt.org/safe-travel/road-safety-facts/
https://www.kaggle.com/sobhanmoosavi/us-accidents
library(MASS)
library(plyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(readr)
library(ggforce)
library(ggridges)
library(cowplot)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(sf)
library(tools)
US_Accidents_May19 <- read_csv("C:/Users/krish/Downloads/mofojo/US_Accidents_May19.csv")
tab_colname<-c('Severity',
'Start_Time',
'End_Time',
'Start_Lat',
'Start_Lng',
'End_Lat',
'End_Lng',
'Distance(mi)',
'Description',
'Number',
'Street',
'Side',
'City',
'County',
'State',
'Zipcode',
'Country',
'Timezone',
'Airport_Code',
'Weather_Timestamp',
'Temperature(F)',
'Wind_Chill(F)',
'Humidity(%)',
'Pressure(in)',
'Visibility(mi)',
'Wind_Direction',
'Wind_Speed(mph)',
'Precipitation(in)',
'Weather_Condition',
'Amenity',
'Bump',
'Crossing',
'Give_Way',
'Junction',
'No_Exit',
'Railway',
'Roundabout',
'Station',
'Stop',
'Traffic_Calming',
'Traffic_Signal',
'Turning_Loop',
'Sunrise_Sunset',
'Civil_Twilight',
'Nautical_Twilight',
'Astronomical_Twilight')
tab_desc<-c('Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). Note that severity reported by different sources may differ in their underlying impact on traffic, so please separate data from different sources when doing severity-based analysis.',
'Shows start time of the accident in local time zone.',
'Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.',
'Shows latitude in GPS coordinate of the start point.',
'Shows longitude in GPS coordinate of the start point.',
'Shows latitude in GPS coordinate of the end point.',
'Shows longitude in GPS coordinate of the end point.',
'The length of the road extent affected by the accident.',
'Shows natural language description of the accident.',
'Shows the street number in address record.',
'Shows the street name in address record.',
'Shows the relative side of the street (Right/Left) in address record.',
'Shows the city in address record.',
'Shows the county in address record.',
'Shows the state in address record.',
'Shows the zipcode in address record.',
'Shows the country in address record.',
'Shows timezone based on the location of the accident (eastern, central, etc.).',
'Denotes an airport-based weather station which is the closest one to location of the accident.',
'Shows the time-stamp of weather observation record (in local time).',
'Shows the temperature (in Fahrenheit).',
'Shows the wind chill (in Fahrenheit).',
'Shows the humidity (in percentage).',
'Shows the air pressure (in inches).',
'Shows visibility (in miles).',
'Shows wind direction.',
'Shows wind speed (in miles per hour).',
'Shows precipitation amount in inches, if there is any.',
'Shows the weather condition (rain, snow, thunderstorm, fog, etc.)',
'A POI annotation which indicates presence of amenity in a nearby location.',
'A POI annotation which indicates presence of speed bump or hump in a nearby location.',
'A POI annotation which indicates presence of crossing in a nearby location.',
'A POI annotation which indicates presence of give_way in a nearby location.',
'A POI annotation which indicates presence of junction in a nearby location.',
'A POI annotation which indicates presence of junction in a nearby location.',
'A POI annotation which indicates presence of railway in a nearby location.',
'A POI annotation which indicates presence of roundabout in a nearby location.',
'A POI annotation which indicates presence of station in a nearby location.',
'A POI annotation which indicates presence of stop in a nearby location.',
'A POI annotation which indicates presence of traffic_calming in a nearby location.',
'A POI annotation which indicates presence of traffic_signal in a nearby location.',
'A POI annotation which indicates presence of turning_loop in a nearby location.',
'Shows the period of day (i.e. day or night) based on sunrise/sunset.',
'Shows the period of day (i.e. day or night) based on civil twilight.',
'Shows the period of day (i.e. day or night) based on nautical twilight.',
'Shows the period of day (i.e. day or night) based on astronomical twilight.')
df<- data.frame(tab_colname,tab_desc)
names(df) <- c("Feature", "Description")
writeLines("td, th { padding : 6px } th { background-color : brown ; color : white; border : 1px solid white; } td { color : brown ; border : 1px solid brown }", con = "mystyle.css")
knitr::kable(df, format = "html")
Feature | Description |
---|---|
Severity | Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). Note that severity reported by different sources may differ in their underlying impact on traffic, so please separate data from different sources when doing severity-based analysis. |
Start_Time | Shows start time of the accident in local time zone. |
End_Time | Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |
Start_Lat | Shows latitude in GPS coordinate of the start point. |
Start_Lng | Shows longitude in GPS coordinate of the start point. |
End_Lat | Shows latitude in GPS coordinate of the end point. |
End_Lng | Shows longitude in GPS coordinate of the end point. |
Distance(mi) | The length of the road extent affected by the accident. |
Description | Shows natural language description of the accident. |
Number | Shows the street number in address record. |
Street | Shows the street name in address record. |
Side | Shows the relative side of the street (Right/Left) in address record. |
City | Shows the city in address record. |
County | Shows the county in address record. |
State | Shows the state in address record. |
Zipcode | Shows the zipcode in address record. |
Country | Shows the country in address record. |
Timezone | Shows timezone based on the location of the accident (eastern, central, etc.). |
Airport_Code | Denotes an airport-based weather station which is the closest one to location of the accident. |
Weather_Timestamp | Shows the time-stamp of weather observation record (in local time). |
Temperature(F) | Shows the temperature (in Fahrenheit). |
Wind_Chill(F) | Shows the wind chill (in Fahrenheit). |
Humidity(%) | Shows the humidity (in percentage). |
Pressure(in) | Shows the air pressure (in inches). |
Visibility(mi) | Shows visibility (in miles). |
Wind_Direction | Shows wind direction. |
Wind_Speed(mph) | Shows wind speed (in miles per hour). |
Precipitation(in) | Shows precipitation amount in inches, if there is any. |
Weather_Condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.) |
Amenity | A POI annotation which indicates presence of amenity in a nearby location. |
Bump | A POI annotation which indicates presence of speed bump or hump in a nearby location. |
Crossing | A POI annotation which indicates presence of crossing in a nearby location. |
Give_Way | A POI annotation which indicates presence of give_way in a nearby location. |
Junction | A POI annotation which indicates presence of junction in a nearby location. |
No_Exit | A POI annotation which indicates presence of junction in a nearby location. |
Railway | A POI annotation which indicates presence of railway in a nearby location. |
Roundabout | A POI annotation which indicates presence of roundabout in a nearby location. |
Station | A POI annotation which indicates presence of station in a nearby location. |
Stop | A POI annotation which indicates presence of stop in a nearby location. |
Traffic_Calming | A POI annotation which indicates presence of traffic_calming in a nearby location. |
Traffic_Signal | A POI annotation which indicates presence of traffic_signal in a nearby location. |
Turning_Loop | A POI annotation which indicates presence of turning_loop in a nearby location. |
Sunrise_Sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
Civil_Twilight | Shows the period of day (i.e. day or night) based on civil twilight. |
Nautical_Twilight | Shows the period of day (i.e. day or night) based on nautical twilight. |
Astronomical_Twilight | Shows the period of day (i.e. day or night) based on astronomical twilight. |
str(US_Accidents_May19)
## spec_tbl_df [2,243,939 x 49] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : chr [1:2243939] "A-1" "A-2" "A-3" "A-4" ...
## $ Source : chr [1:2243939] "MapQuest" "MapQuest" "MapQuest" "MapQuest" ...
## $ TMC : num [1:2243939] 201 201 201 201 201 201 201 201 201 201 ...
## $ Severity : num [1:2243939] 3 2 2 3 2 3 2 3 2 3 ...
## $ Start_Time : POSIXct[1:2243939], format: "2016-02-08 05:46:00" "2016-02-08 06:07:59" ...
## $ End_Time : POSIXct[1:2243939], format: "2016-02-08 11:00:00" "2016-02-08 06:37:59" ...
## $ Start_Lat : num [1:2243939] 39.9 39.9 39.1 39.7 39.6 ...
## $ Start_Lng : num [1:2243939] -84.1 -82.8 -84 -84.2 -84.2 ...
## $ End_Lat : num [1:2243939] NA NA NA NA NA NA NA NA NA NA ...
## $ End_Lng : num [1:2243939] NA NA NA NA NA NA NA NA NA NA ...
## $ Distance(mi) : num [1:2243939] 0.01 0.01 0.01 0.01 0.01 0.01 0 0.01 0 0.01 ...
## $ Description : chr [1:2243939] "Right lane blocked due to accident on I-70 Eastbound at Exit 41 OH-235 State Route 4." "Accident on Brice Rd at Tussing Rd. Expect delays." "Accident on OH-32 State Route 32 Westbound at Dela Palma Rd. Expect delays." "Accident on I-75 Southbound at Exits 52 52B US-35. Expect delays." ...
## $ Number : num [1:2243939] NA 2584 NA NA NA ...
## $ Street : chr [1:2243939] "I-70 E" "Brice Rd" "State Route 32" "I-75 S" ...
## $ Side : chr [1:2243939] "R" "L" "R" "R" ...
## $ City : chr [1:2243939] "Dayton" "Reynoldsburg" "Williamsburg" "Dayton" ...
## $ County : chr [1:2243939] "Montgomery" "Franklin" "Clermont" "Montgomery" ...
## $ State : chr [1:2243939] "OH" "OH" "OH" "OH" ...
## $ Zipcode : chr [1:2243939] "45424" "43068-3402" "45176" "45417" ...
## $ Country : chr [1:2243939] "US" "US" "US" "US" ...
## $ Timezone : chr [1:2243939] "US/Eastern" "US/Eastern" "US/Eastern" "US/Eastern" ...
## $ Airport_Code : chr [1:2243939] "KFFO" "KCMH" "KI69" "KDAY" ...
## $ Weather_Timestamp : POSIXct[1:2243939], format: "2016-02-08 05:58:00" "2016-02-08 05:51:00" ...
## $ Temperature(F) : num [1:2243939] 36.9 37.9 36 35.1 36 37.9 34 34 33.3 37.4 ...
## $ Wind_Chill(F) : num [1:2243939] NA NA 33.3 31 33.3 35.5 31 31 NA 33.8 ...
## $ Humidity(%) : num [1:2243939] 91 100 100 96 89 97 100 100 99 100 ...
## $ Pressure(in) : num [1:2243939] 29.7 29.6 29.7 29.6 29.6 ...
## $ Visibility(mi) : num [1:2243939] 10 10 10 9 6 7 7 7 5 3 ...
## $ Wind_Direction : chr [1:2243939] "Calm" "Calm" "SW" "SW" ...
## $ Wind_Speed(mph) : num [1:2243939] NA NA 3.5 4.6 3.5 3.5 3.5 3.5 1.2 4.6 ...
## $ Precipitation(in) : num [1:2243939] 0.02 0 NA NA NA 0.03 NA NA NA 0.02 ...
## $ Weather_Condition : chr [1:2243939] "Light Rain" "Light Rain" "Overcast" "Mostly Cloudy" ...
## $ Amenity : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Bump : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Crossing : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Give_Way : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Junction : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ No_Exit : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Railway : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Roundabout : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Station : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Stop : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Traffic_Calming : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Traffic_Signal : logi [1:2243939] FALSE FALSE TRUE FALSE TRUE FALSE ...
## $ Turning_Loop : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Sunrise_Sunset : chr [1:2243939] "Night" "Night" "Night" "Night" ...
## $ Civil_Twilight : chr [1:2243939] "Night" "Night" "Night" "Day" ...
## $ Nautical_Twilight : chr [1:2243939] "Night" "Night" "Day" "Day" ...
## $ Astronomical_Twilight: chr [1:2243939] "Night" "Day" "Day" "Day" ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_character(),
## .. Source = col_character(),
## .. TMC = col_double(),
## .. Severity = col_double(),
## .. Start_Time = col_datetime(format = ""),
## .. End_Time = col_datetime(format = ""),
## .. Start_Lat = col_double(),
## .. Start_Lng = col_double(),
## .. End_Lat = col_double(),
## .. End_Lng = col_double(),
## .. `Distance(mi)` = col_double(),
## .. Description = col_character(),
## .. Number = col_double(),
## .. Street = col_character(),
## .. Side = col_character(),
## .. City = col_character(),
## .. County = col_character(),
## .. State = col_character(),
## .. Zipcode = col_character(),
## .. Country = col_character(),
## .. Timezone = col_character(),
## .. Airport_Code = col_character(),
## .. Weather_Timestamp = col_datetime(format = ""),
## .. `Temperature(F)` = col_double(),
## .. `Wind_Chill(F)` = col_double(),
## .. `Humidity(%)` = col_double(),
## .. `Pressure(in)` = col_double(),
## .. `Visibility(mi)` = col_double(),
## .. Wind_Direction = col_character(),
## .. `Wind_Speed(mph)` = col_double(),
## .. `Precipitation(in)` = col_double(),
## .. Weather_Condition = col_character(),
## .. Amenity = col_logical(),
## .. Bump = col_logical(),
## .. Crossing = col_logical(),
## .. Give_Way = col_logical(),
## .. Junction = col_logical(),
## .. No_Exit = col_logical(),
## .. Railway = col_logical(),
## .. Roundabout = col_logical(),
## .. Station = col_logical(),
## .. Stop = col_logical(),
## .. Traffic_Calming = col_logical(),
## .. Traffic_Signal = col_logical(),
## .. Turning_Loop = col_logical(),
## .. Sunrise_Sunset = col_character(),
## .. Civil_Twilight = col_character(),
## .. Nautical_Twilight = col_character(),
## .. Astronomical_Twilight = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(US_Accidents_May19)
## ID Source TMC Severity
## Length:2243939 Length:2243939 Min. :200.0 Min. :0.000
## Class :character Class :character 1st Qu.:201.0 1st Qu.:2.000
## Mode :character Mode :character Median :201.0 Median :2.000
## Mean :207.4 Mean :2.383
## 3rd Qu.:201.0 3rd Qu.:3.000
## Max. :406.0 Max. :4.000
## NA's :516762
## Start_Time End_Time Start_Lat
## Min. :2015-03-09 07:00:00 Min. :2016-02-08 06:37:08 Min. :24.57
## 1st Qu.:2017-03-24 08:24:00 1st Qu.:2017-03-24 10:26:56 1st Qu.:33.48
## Median :2017-12-28 08:45:00 Median :2017-12-28 09:57:54 Median :35.86
## Mean :2017-12-04 17:37:24 Mean :2017-12-04 19:48:21 Mean :36.46
## 3rd Qu.:2018-08-22 14:10:46 3rd Qu.:2018-08-22 15:14:52 3rd Qu.:40.42
## Max. :2019-04-01 03:26:00 Max. :2020-10-31 23:59:00 Max. :49.00
##
## Start_Lng End_Lat End_Lng Distance(mi)
## Min. :-124.62 Min. :24.6 Min. :-124.5 Min. : 0.0000
## 1st Qu.:-117.14 1st Qu.:33.9 1st Qu.:-117.9 1st Qu.: 0.0000
## Median : -88.18 Median :38.0 Median : -90.2 Median : 0.0000
## Mean : -94.86 Mean :37.4 Mean : -96.5 Mean : 0.2879
## 3rd Qu.: -80.85 3rd Qu.:41.4 3rd Qu.: -80.9 3rd Qu.: 0.0100
## Max. : -67.11 Max. :49.1 Max. : -67.1 Max. :333.6300
## NA's :1727177 NA's :1727177
## Description Number Street Side
## Length:2243939 Min. : 1 Length:2243939 Length:2243939
## Class :character 1st Qu.: 803 Class :character Class :character
## Mode :character Median : 2672 Mode :character Mode :character
## Mean : 5625
## 3rd Qu.: 6846
## Max. :961052
## NA's :1458402
## City County State Zipcode
## Length:2243939 Length:2243939 Length:2243939 Length:2243939
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Country Timezone Airport_Code
## Length:2243939 Length:2243939 Length:2243939
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Weather_Timestamp Temperature(F) Wind_Chill(F)
## Min. :2016-02-08 00:53:00 Min. :-77.80 Min. :-65.9
## 1st Qu.:2017-03-23 09:15:00 1st Qu.: 48.90 1st Qu.: 19.2
## Median :2017-12-27 15:53:00 Median : 63.00 Median : 28.7
## Mean :2017-12-04 05:22:14 Mean : 61.23 Mean : 26.0
## 3rd Qu.:2018-08-22 15:53:00 3rd Qu.: 75.90 3rd Qu.: 36.4
## Max. :2019-03-31 23:52:00 Max. :170.60 Max. : 45.2
## NA's :47170 NA's :62265 NA's :1852370
## Humidity(%) Pressure(in) Visibility(mi) Wind_Direction
## Min. : 4.00 Min. : 0.00 Min. : 0.00 Length:2243939
## 1st Qu.: 50.00 1st Qu.:29.92 1st Qu.: 10.00 Class :character
## Median : 68.00 Median :30.03 Median : 10.00 Mode :character
## Mean : 65.93 Mean :30.04 Mean : 9.12
## 3rd Qu.: 85.00 3rd Qu.:30.15 3rd Qu.: 10.00
## Max. :100.00 Max. :33.04 Max. :140.00
## NA's :64467 NA's :57280 NA's :71360
## Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity
## Min. : 1.2 Min. : 0.0 Length:2243939 Mode :logical
## 1st Qu.: 5.8 1st Qu.: 0.0 Class :character FALSE:2217962
## Median : 8.1 Median : 0.0 Mode :character TRUE :25977
## Mean : 8.8 Mean : 0.1
## 3rd Qu.: 11.5 3rd Qu.: 0.0
## Max. :822.8 Max. :10.8
## NA's :442954 NA's :1979466
## Bump Crossing Give_Way Junction
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:2243700 FALSE:2122156 FALSE:2239215 FALSE:2056574
## TRUE :239 TRUE :121783 TRUE :4724 TRUE :187365
##
##
##
##
## No_Exit Railway Roundabout Station
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:2241773 FALSE:2225741 FALSE:2243811 FALSE:2207714
## TRUE :2166 TRUE :18198 TRUE :128 TRUE :36225
##
##
##
##
## Stop Traffic_Calming Traffic_Signal Turning_Loop
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:2222168 FALSE:2243321 FALSE:1885291 FALSE:2243939
## TRUE :21771 TRUE :618 TRUE :358648
##
##
##
##
## Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
## Length:2243939 Length:2243939 Length:2243939 Length:2243939
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
We can see that Pennsylvania ranks 7th on a national scale based on collected accident records over a period of 5 years(2015-19)
PA had roughly 75k accidents in the 2016-19 period.
On a comparison with the its neighboring(NYC,NJ,DE,MA,WV & OH) states, only New York out ranks Pennsylvania in number of accidents
state_slice<-US_Accidents_May19 %>%
group_by(State) %>%
dplyr::summarize(count = n())
ggplot(state_slice) +
geom_bar(aes(x=reorder(State,-count),y=count,fill=count),stat="identity")+
ggtitle("Record count by states")+xlab("States")+
scale_y_continuous(name="Record count", labels = scales::comma)+
scale_fill_continuous(name = "Record Count",low = "green", high = "red",labels = scales::comma)+
theme(axis.text.x=element_text(angle=90, hjust=1))+
ggplot2::annotate("text", label = "PA:-75,814",size=4,x='VA',y=nrow(filter(US_Accidents_May19,State=="PA"))+25000,colour="blue")
When checking for presence of null values we see that the following features have high number of null values:- - Latitude and Longitude of end point of accident - Street number of accident - Wind chill,precipitation and windspeed
The nulls in location attributes wont hinder our analysis as we still have the zip code and start lat,longs.
However the absence of the above mentioned weaher factores are worrisome.
pa_slice<-filter(US_Accidents_May19,State=="PA")
missing_val_count<-as.data.frame(sapply(pa_slice, function(x) sum(is.na(x))*100/nrow(pa_slice)))
missing_val_count$ftName<-row.names(missing_val_count)
names(missing_val_count)[1] <- "fillRate"
row.names(missing_val_count) <- NULL
ggplot(filter(missing_val_count,fillRate >= 1)) +
geom_bar(aes(x=reorder(ftName,fillRate),y=fillRate,fill=fillRate),stat="identity")+
coord_flip()+
scale_fill_continuous(name = "Missing value %")+
ggtitle("Fetaures with 5% or Missing values")+xlab("Features")+
ylab("Missing value%")
We observe that there was a spike in accidents in 2016 (especially in September from multi vehicle collision)
However,the general trend of accidents have remained stable over the years
pa_slice$date<-format(as.Date(pa_slice$Start_Time), "%Y-%m")
pa_slice_dt_sev<-pa_slice %>%
group_by(date,Severity) %>%
dplyr::summarize(count = n())
pa_slice_dt_sev$Severity<-as.character(pa_slice_dt_sev$Severity)
pa_slice_dt_sev %>%
ggplot( aes(x=date, y=count, group=Severity, color=Severity)) +
geom_line()+
ggtitle("Time trend of Accidents by severity in PA")+
xlab("Accident count")+
ylab("Date(Year-month)")+
theme(axis.text.x=element_text(angle=90, hjust=1))+
scale_color_manual(values=c("darkgreen", "blue","orange","red"))
We observe that that most of the accidents occur under normal weather conditions
There are only few sever accidents - similar trend as in previous time series analysis
weather_con_slice<-pa_slice %>%
group_by(Weather_Condition,Severity) %>%
dplyr::summarize(count = n())
weather_con_slice_1<-pa_slice %>%
group_by(Weather_Condition) %>%
dplyr::summarize(count = n())
weather_con_slice_1<-weather_con_slice_1 %>% slice_max(count, n = 5)
weather_con_slice$Severity<-as.character(weather_con_slice$Severity)
weather_con_slice <- filter(weather_con_slice,Weather_Condition %in% c(weather_con_slice_1$Weather_Condition))
ggplot(data = weather_con_slice, aes(x = Weather_Condition,
y = count,
fill = Severity))+
geom_bar(stat="identity")+
ggtitle("Top 5 weather condition prevailing during accidents")+
ylab("Accident count")+
xlab("Weather")+
scale_fill_manual(values=c("darkgreen", "blue","orange","red"))
Even when filtering weather conditions for most sever accidents we end up with same conditions as in previous situation - a clear or cloudy da
Thus the missing values in weather features are less concerning now!
weather_con_slice<-pa_slice %>%
group_by(Weather_Condition,Severity) %>%
dplyr::summarize(count = n())
weather_con_slice_1<-pa_slice %>%
group_by(Weather_Condition) %>%
dplyr::summarize(count = n())
weather_con_slice_1<-weather_con_slice_1 %>% slice_max(count, n = 5)
knitr::kable(weather_con_slice_1)
Weather_Condition | count |
---|---|
Clear | 28687 |
Overcast | 18184 |
Mostly Cloudy | 7729 |
Scattered Clouds | 6179 |
Partly Cloudy | 3889 |
We observe that a vast majority of the accidents occur at traffic signals
sample_data<-pa_slice[,c(names(US_Accidents_May19)[33:45],c("Severity","ID"))]
sample_data[,c(names(pa_slice)[33:45])] <- sapply(sample_data[,c(names(pa_slice)[33:45])],as.numeric)
sample_data$RS= rowSums(sample_data[,c(names(pa_slice)[33:45])])
sample_data<-sample_data %>% filter(RS == 1)
k<-as.data.frame(colSums(sample_data[,c(names(pa_slice)[33:45])]))
k$Location_Type<-row.names(k)
names(k)[1] <- "Accident"
row.names(k) <- NULL
ggplot(k %>% slice_max(Accident, n = 5)) +
geom_bar(aes(x=reorder(Location_Type,-Accident),y=Accident,fill=Accident),stat="identity")+
ggtitle("Accident count by Location Type")+xlab("Location Type")+
scale_y_continuous(name="#Accident", labels = scales::comma)+
theme(axis.text.x=element_text(angle=90, hjust=1))
Vast majority of accidents at traffic signals are of severity 2
Tfs<-filter(pa_slice,Traffic_Signal==TRUE)
Tfs<-Tfs %>%
group_by(Severity) %>%
dplyr::summarize(count = n())
Tfs$Severity<-as.character(Tfs$Severity)
ggplot(Tfs, aes(x="", y=count, fill=Severity)) +
geom_bar(stat="identity", width=1,position = "fill") +
coord_polar("y", start=0)+
ggtitle("Distribution of accidents at traffic signals by severity")+
ylab("")+
xlab("")+
scale_fill_manual(values=c("darkgreen", "blue","orange","red"))
Vast majority of accidents occur during day time
And the distribution is multimodal with a lot of data clustered around severity 2
dayNight_slice<-pa_slice %>%
group_by(Sunrise_Sunset) %>%
dplyr::summarize(count = n())
dayNight_slice<-dayNight_slice %>% tidyr::drop_na()
ggplot(dayNight_slice) +
geom_bar(aes(x=reorder(Sunrise_Sunset,-count),y=count,fill=count),stat="identity")+
ggtitle("Record count by Day/Night")+xlab("Conditon")+
scale_y_continuous(name="Record count", labels = scales::comma)+
scale_fill_continuous(name="Record count", labels = scales::comma)
And the distribution is multimodal with a lot of data clustered around severity 2
qplot(Sunrise_Sunset,
Severity,
data = filter(pa_slice,Sunrise_Sunset %in% c("Day","Night")),
geom=c("violin"),
main = "Severity distribution at different times of day",
trim = FALSE,
fill = Sunrise_Sunset,
xlab = "Time of Day",
ylab = "Severity")+
labs(fill = "Day/Night")
gpdata <- ggplot(data=pa_slice,aes(x=`Visibility(mi)`, y=Severity))
gpdata+ stat_binhex()+ggtitle("Plot of Visibility vs Severity")+
scale_y_continuous(name="#Accidents", labels = scales::comma)+
ggplot2::annotate("text", label = "< High number of \n Accidents",size=4,x=12.5,y=1.9,colour="red")
Inspection of traffic congestions caused by these accidents reveal that -Severity 1,2 and 3 have similar range of values - their means differ but the central quarters overlap well. -These last about 30-50 minutes mostly. -Sevrity 2 and 3 accidents have a lot of outliers towards higher duration direction and their mean duration is lower than that of severity 1’s -Severity 4 accidents cause major congestion with mean duration of 6hrs,this also a very tight distribution -due to fewer number of accidents in this category
pa_td_slice<-pa_slice
pa_td_slice$duration<-pa_td_slice$End_Time - pa_td_slice$Start_Time
boxplot(pa_td_slice$duration ~ pa_td_slice$Severity,
ylim = c(0, 100),
xlim=c(0.5,3.5),
main = 'Delays(in minutes) vs severity[1-3]',
col=c("blue","green","red"),
ylab="severity",
xlab="Duration(min)")
boxplot(pa_td_slice$duration ~ pa_td_slice$Severity,
ylim = c(300, 400),
xlim=c(3.5,4.5),
main = 'Delays(in minutes) vs severity[4]',
col=c("yellow"),
ylab="severity",
xlab="Duration(min)")
# https://www.r-graph-gallery.com/294-basic-ridgeline-plot.html
ggplot(pa_td_slice, aes(x = log10(as.numeric(duration)), y=as.factor(Severity) ,fill=as.factor(Severity))) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")+
ggtitle("Severity vs log(Duration)")+
ylab("Severity")+
xlab("log(Duration)")
Performing a t test reveals that there is not much difference between severity 2 and 3 accidents when compared at duration.
The p value of the test is very low indicating a similar distribution.
sev2<-pa_td_slice%>%filter(Severity==2)
sev3<-pa_td_slice%>%filter(Severity==3)
t.test(sev2$duration,sev3$duration)
##
## Welch Two Sample t-test
##
## data: sev2$duration and sev3$duration
## t = 4.5101 mins, df = 34855, p-value = 6.499e-06 mins
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 13.12042 mins 33.28928 mins
## sample estimates:
## Time differences in mins
## mean of x mean of y
## 73.76587 50.56102
We are interested in the accident data over the latest 3 years as these will lead to more actionable insights
Thus we will focus more on years 2017,2018 and 2019.
We will count
We are rounding down the latitude,longitude coordinates by two positions to collate incidents that are close to each other - we will refer to this as simplified coordinates.
-Eg:- (Lat:40.45029,Long:-79.95139) and (Lat:40.45030,Long:-79.95120) will be grouped under Lat:40.450,Long:-79.951 as these points are close by
We are creating a metric “recurrence_index” where we take a weighted average of annual number of accidents occurring at a location over latest three years.
recurrence_index= (Accidents in 2017)x1/6 + (Accidents in 2017)x2/6 + (Accidents in 2017)x3/6
We are also creating a new index for severity “severity_index” which gives a direction to weather the accident severity is increasing in a location or decreasing in a location. The procedure to generate it is as follows:-
The index values are assigned in decresing order of severity with time relevance
pa_slice_re<-pa_slice
pa_slice_re$start_year<-format(as.Date(pa_slice_re$Start_Time, format="%d/%m/%Y"),"%Y")
test<-pa_slice_re %>%
dplyr::filter(start_year %in% c("2019","2018","2017"))%>%
dplyr::group_by(start_year,Start_Lat=round(Start_Lat,2),Start_Lng=round(Start_Lng,2),Street,Zipcode)%>%
dplyr::summarise(UNIQUE_COUNT = n_distinct(ID),Severity=mean(Severity))%>%
dplyr::mutate(cnt_2017 = case_when(start_year=="2017" ~ as.numeric(UNIQUE_COUNT),TRUE ~ 0))%>%
dplyr::mutate(cnt_2018 = case_when(start_year=="2018" ~ as.numeric(UNIQUE_COUNT),TRUE ~ 0))%>%
dplyr::mutate(cnt_2019 = case_when(start_year=="2019" ~ as.numeric(UNIQUE_COUNT),TRUE ~ 0))%>%
dplyr::mutate(sev_2017 = case_when(start_year=="2017" ~ as.numeric(Severity),TRUE ~ 0))%>%
dplyr::mutate(sev_2018 = case_when(start_year=="2018" ~ as.numeric(Severity),TRUE ~ 0))%>%
dplyr::mutate(sev_2019 = case_when(start_year=="2019" ~ as.numeric(Severity),TRUE ~ 0))
test<-test %>%
dplyr::group_by(Start_Lat,Start_Lng)%>%
dplyr::summarise_at(c("cnt_2017", "cnt_2018","cnt_2019","sev_2017", "sev_2018","sev_2019"), sum, na.rm=FALSE)
test_All_3Yrs<-test%>%
dplyr::filter(cnt_2017>0&cnt_2018>0&cnt_2019>0)%>%
dplyr::mutate(recurrence_index=((cnt_2017*1/6)+(cnt_2018*2/6)+(cnt_2017*3/6)))
test_Sev_3Yrs<-test%>%
dplyr::filter(sev_2017>0&sev_2018>0&sev_2019>0)%>%
dplyr::mutate(severity_index=case_when(sev_2019>=sev_2018&sev_2018>=sev_2017 ~ 4,
sev_2019>=sev_2018&sev_2018<=sev_2017 ~ 3,
sev_2019<=sev_2018&sev_2018>=sev_2017 ~ 2,
sev_2019<=sev_2018&sev_2018<=sev_2017 ~ 1,TRUE ~ 0))
# https://r-spatial.org/r/2018/10/25/ggplot2-sf-2.html
# http://zevross.com/blog/2014/07/16/mapping-in-r-using-the-ggplot2-package/
# https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
# https://slcladal.github.io/maps.html
world <- ne_countries(scale = "medium", returnclass = "sf")
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
sf::sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
states <- cbind(states, st_coordinates(st_centroid(states)))
states$ID <- toTitleCase(states$ID)
counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("pennsylvania", counties$ID))
test_All_3Yrs<-test_All_3Yrs %>%
arrange(recurrence_index)
sites <- st_as_sf(test_All_3Yrs, coords = c("Start_Lng", "Start_Lat"),
crs = 4326, agr = "constant")
ggplot(data = world) +
geom_sf(fill = "antiquewhite1") +
geom_sf(data = counties, fill = "white")+
geom_sf(data = states, fill = NA) +
geom_sf(data = sites, size = 1.75, shape = 24,aes( fill = sites$recurrence_index),colour=alpha("white",0.2))+
scale_fill_distiller(palette = "Spectral")+
scale_fill_fermenter(name = "# Accidents",palette = "Spectral", breaks = c(0,20,40,60,80,100,120,140,160))+
coord_sf(xlim = c(-80.65, -74.6), ylim = c(39.5, 42.5), expand = FALSE)+
theme(legend.position='left')+
ggtitle("Recurring Accident spots and #Accidents in PA")+xlab("Longitude")+
ylab("Latitude")+
ggplot2::annotate("text", label = "\nSchullykill \n expressway",size=3.2,y=39.8,x=-75.2,colour="red")
Bucks county has a very high density of reported accidents over the latest 3 years
ggplot(data = world) +
geom_sf(fill = "antiquewhite1") +
# geom_sf(data = counties) +
geom_sf(data = counties, fill = "white")+
geom_sf(data = states, fill = NA) +
geom_sf(data = sites, aes( fill = sites$recurrence_index),shape=23,size = 1.75,colour=alpha("white",0.2))+
scale_fill_distiller(palette = "Spectral")+
scale_fill_fermenter(name = "# Accidents",palette = "Spectral", breaks = c(0,20,40,60,80,100,120,140,160))+
coord_sf(xlim = c(-75.75, -74.5), ylim = c(39.5, 40.5), expand = FALSE)+
theme(legend.position='left')+
ggtitle("Recurring Accident spots and #Accidents in SE PA")+xlab("Longitude")+
ylab("Latitude")+
ggplot2::annotate("text", label = "\nSchullykill \n expressway",size=3.2,y=39.8,x=-75.2,colour="red")
test_Sev_3Yrs<-test_Sev_3Yrs %>%
arrange(severity_index)
sites <- st_as_sf(test_Sev_3Yrs, coords = c("Start_Lng", "Start_Lat"),
crs = 4326, agr = "constant")
ggplot(data = world) +
geom_sf(fill = "antiquewhite1") +
geom_sf(data = counties, fill = "white")+
geom_sf(data = states, fill = NA) +
geom_sf(data = sites, size = 1.75, shape = 24,aes( fill = sites$severity_index),colour=alpha("white",0.2))+
scale_fill_distiller(palette = "Set1")+
scale_fill_fermenter(name = "Severity Index",palette = "Set1", breaks = c(1,2,3,4))+
coord_sf(xlim = c(-80.65, -74.6), ylim = c(39.5, 42.5), expand = FALSE)+
theme(legend.position='left')+
ggtitle("Recurring Accident spots and Severity Index in PA")+xlab("Longitude")+
ylab("Latitude")+
ggplot2::annotate("text", label = "Pittsburgh\n|",size=4,y=40.8,x=-80,colour="blue")
ggplot(data = world) +
geom_sf(fill = "antiquewhite1") +
geom_sf(data = counties, fill = "white")+
geom_sf(data = states, fill = NA) +
geom_sf(data = sites, size = 1.75, shape = 24,aes( fill = sites$severity_index),colour=alpha("white",0.2))+
scale_fill_distiller(palette = "Set1")+
scale_fill_fermenter(name = "Severity Index",palette = "Set1", breaks = c(1,2,3,4))+
coord_sf(xlim = c(-75.75, -74.5), ylim = c(39.5, 40.5), expand = FALSE)+
theme(legend.position='left')+
ggtitle("Recurring Accident spots and Severity Index in SE PA")+xlab("Longitude")+
ylab("Latitude")+
ggplot2::annotate("text", label = "<-Bucks County",size=3.2,y=40.3,x=-75.1,colour="blue")
We calculate a risk index by multiplying the recurrance and severity index from the previous steps.
On stack ranking locations by this index we seek to identifty the riskiest roads in Pennsylvania.
test_Sev_3Yrs<-test_Sev_3Yrs[,c("Start_Lat","Start_Lng","severity_index")]
test_All_3Yrs<-test_All_3Yrs[,c("Start_Lat","Start_Lng","recurrence_index")]
inference_batch<-head(test_Sev_3Yrs %>% inner_join(test_All_3Yrs) %>%
mutate(risk_index = severity_index*recurrence_index) %>%arrange(desc(risk_index)),10)
address_set<-pa_slice[,c("Start_Lat","Start_Lng","Street","City","State","Zipcode")]%>%
mutate(Start_Lat=round(Start_Lat,2),Start_Lng=round(Start_Lng,2))
inference_batch<-inference_batch %>% inner_join(address_set%>%group_by(Start_Lat,Start_Lng) %>% filter(row_number() == 1))
knitr::kable(inference_batch[,c( "risk_index","Street","City","Zipcode")])
risk_index | Street | City | Zipcode |
---|---|---|---|
326.0000 | I-76 E | Philadelphia | 19104 |
300.0000 | Parkway East | Pittsburgh | 15218 |
224.6667 | Schuylkill Expy W | Conshohocken | 19428 |
193.0000 | Aramingo Ave | Philadelphia | 19125 |
187.0000 | I-95 S | Philadelphia | 19136 |
173.3333 | I-95 N | Bristol | 19007 |
168.0000 | Launfall Rd | Plymouth Meeting | 19462-2113 |
153.3333 | Pennsylvania Tpke W | Willow Grove | 19090 |
152.0000 | I-83 N | Lemoyne | 17043 |
147.3333 | Veterans Memorial Hwy N | Conshohocken | 19428 |
From our analysis we can observe about road accidents in Pennsylvania: