• The cost of accidents
  • Data Snippet
  • Summary statistcs
  • Exploring by State
  • Exploring feature fillrate
  • Exploring time trend of accidents in PA by severity
  • Top weather conditions during accidents
  • Weather condition during most severe accidents
  • Severity of accident by location type
  • Accident severity at traffic signals
  • Accidents at different times of day
  • Accident by visibility
  • Time delay caused by these accidents
    • Another way to look at the time duration vs Severity:-
  • Comparing duration distribution of severity 2 vs 3
  • Defining severity & frequency metrics
  • Exploring recurring accidents in Pennsylvania
    • A closer look at South eastern pennsylvania
  • Trends of accident severity over time
  • Trends of accident severity over time - Bucks county
  • Top 10 Locations that need intervention
  • Conclusion

The cost of accidents

More than 38,000 people die every year in crashes on U.S. roadways. The U.S. traffic fatality rate is 12.4 deaths per 100,000 inhabitants.

The economic and societal impact of road crashes costs U.S. citizens $871 billion.

Road crashes cost the U.S. more than $380 million in direct medical costs.

In this study we will - Explore the influence of various road features and environmental conditions to accident severity in the state of Pennsylvania. - Identify locations that are in dire need of intervention to prevent more accidents.

The data used in this analysis was sourced from Kaggle dataset that holds a collection of accidents across US from 2015-19.

This data-set records 2.24million accidents across 49 features.

https://www.asirt.org/safe-travel/road-safety-facts/

https://www.kaggle.com/sobhanmoosavi/us-accidents

library(MASS)
library(plyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(readr)
library(ggforce)
library(ggridges)
library(cowplot)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(sf)
library(tools)


US_Accidents_May19 <- read_csv("C:/Users/krish/Downloads/mofojo/US_Accidents_May19.csv")




tab_colname<-c('Severity',
'Start_Time',
'End_Time',
'Start_Lat',
'Start_Lng',
'End_Lat',
'End_Lng',
'Distance(mi)',
'Description',
'Number',
'Street',
'Side',
'City',
'County',
'State',
'Zipcode',
'Country',
'Timezone',
'Airport_Code',
'Weather_Timestamp',
'Temperature(F)',
'Wind_Chill(F)',
'Humidity(%)',
'Pressure(in)',
'Visibility(mi)',
'Wind_Direction',
'Wind_Speed(mph)',
'Precipitation(in)',
'Weather_Condition',
'Amenity',
'Bump',
'Crossing',
'Give_Way',
'Junction',
'No_Exit',
'Railway',
'Roundabout',
'Station',
'Stop',
'Traffic_Calming',
'Traffic_Signal',
'Turning_Loop',
'Sunrise_Sunset',
'Civil_Twilight',
'Nautical_Twilight',
'Astronomical_Twilight')
tab_desc<-c('Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). Note that severity reported by different sources may differ in their underlying impact on traffic, so please separate data from different sources when doing severity-based analysis.',
'Shows start time of the accident in local time zone.',
'Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.',
'Shows latitude in GPS coordinate of the start point.',
'Shows longitude in GPS coordinate of the start point.',
'Shows latitude in GPS coordinate of the end point.',
'Shows longitude in GPS coordinate of the end point.',
'The length of the road extent affected by the accident.',
'Shows natural language description of the accident.',
'Shows the street number in address record.',
'Shows the street name in address record.',
'Shows the relative side of the street (Right/Left) in address record.',
'Shows the city in address record.',
'Shows the county in address record.',
'Shows the state in address record.',
'Shows the zipcode in address record.',
'Shows the country in address record.',
'Shows timezone based on the location of the accident (eastern, central, etc.).',
'Denotes an airport-based weather station which is the closest one to location of the accident.',
'Shows the time-stamp of weather observation record (in local time).',
'Shows the temperature (in Fahrenheit).',
'Shows the wind chill (in Fahrenheit).',
'Shows the humidity (in percentage).',
'Shows the air pressure (in inches).',
'Shows visibility (in miles).',
'Shows wind direction.',
'Shows wind speed (in miles per hour).',
'Shows precipitation amount in inches, if there is any.',
'Shows the weather condition (rain, snow, thunderstorm, fog, etc.)',
'A POI annotation which indicates presence of amenity in a nearby location.',
'A POI annotation which indicates presence of speed bump or hump in a nearby location.',
'A POI annotation which indicates presence of crossing in a nearby location.',
'A POI annotation which indicates presence of give_way in a nearby location.',
'A POI annotation which indicates presence of junction in a nearby location.',
'A POI annotation which indicates presence of junction in a nearby location.',
'A POI annotation which indicates presence of railway in a nearby location.',
'A POI annotation which indicates presence of roundabout in a nearby location.',
'A POI annotation which indicates presence of station in a nearby location.',
'A POI annotation which indicates presence of stop in a nearby location.',
'A POI annotation which indicates presence of traffic_calming in a nearby location.',
'A POI annotation which indicates presence of traffic_signal in a nearby location.',
'A POI annotation which indicates presence of turning_loop in a nearby location.',
'Shows the period of day (i.e. day or night) based on sunrise/sunset.',
'Shows the period of day (i.e. day or night) based on civil twilight.',
'Shows the period of day (i.e. day or night) based on nautical twilight.',
'Shows the period of day (i.e. day or night) based on astronomical twilight.')

df<- data.frame(tab_colname,tab_desc)


names(df) <- c("Feature", "Description")

writeLines("td, th { padding : 6px } th { background-color : brown ; color : white; border : 1px solid white; } td { color : brown ; border : 1px solid brown }", con = "mystyle.css")

knitr::kable(df, format = "html")
Feature Description
Severity Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). Note that severity reported by different sources may differ in their underlying impact on traffic, so please separate data from different sources when doing severity-based analysis.
Start_Time Shows start time of the accident in local time zone.
End_Time Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.
Start_Lat Shows latitude in GPS coordinate of the start point.
Start_Lng Shows longitude in GPS coordinate of the start point.
End_Lat Shows latitude in GPS coordinate of the end point.
End_Lng Shows longitude in GPS coordinate of the end point.
Distance(mi) The length of the road extent affected by the accident.
Description Shows natural language description of the accident.
Number Shows the street number in address record.
Street Shows the street name in address record.
Side Shows the relative side of the street (Right/Left) in address record.
City Shows the city in address record.
County Shows the county in address record.
State Shows the state in address record.
Zipcode Shows the zipcode in address record.
Country Shows the country in address record.
Timezone Shows timezone based on the location of the accident (eastern, central, etc.).
Airport_Code Denotes an airport-based weather station which is the closest one to location of the accident.
Weather_Timestamp Shows the time-stamp of weather observation record (in local time).
Temperature(F) Shows the temperature (in Fahrenheit).
Wind_Chill(F) Shows the wind chill (in Fahrenheit).
Humidity(%) Shows the humidity (in percentage).
Pressure(in) Shows the air pressure (in inches).
Visibility(mi) Shows visibility (in miles).
Wind_Direction Shows wind direction.
Wind_Speed(mph) Shows wind speed (in miles per hour).
Precipitation(in) Shows precipitation amount in inches, if there is any.
Weather_Condition Shows the weather condition (rain, snow, thunderstorm, fog, etc.)
Amenity A POI annotation which indicates presence of amenity in a nearby location.
Bump A POI annotation which indicates presence of speed bump or hump in a nearby location.
Crossing A POI annotation which indicates presence of crossing in a nearby location.
Give_Way A POI annotation which indicates presence of give_way in a nearby location.
Junction A POI annotation which indicates presence of junction in a nearby location.
No_Exit A POI annotation which indicates presence of junction in a nearby location.
Railway A POI annotation which indicates presence of railway in a nearby location.
Roundabout A POI annotation which indicates presence of roundabout in a nearby location.
Station A POI annotation which indicates presence of station in a nearby location.
Stop A POI annotation which indicates presence of stop in a nearby location.
Traffic_Calming A POI annotation which indicates presence of traffic_calming in a nearby location.
Traffic_Signal A POI annotation which indicates presence of traffic_signal in a nearby location.
Turning_Loop A POI annotation which indicates presence of turning_loop in a nearby location.
Sunrise_Sunset Shows the period of day (i.e. day or night) based on sunrise/sunset.
Civil_Twilight Shows the period of day (i.e. day or night) based on civil twilight.
Nautical_Twilight Shows the period of day (i.e. day or night) based on nautical twilight.
Astronomical_Twilight Shows the period of day (i.e. day or night) based on astronomical twilight.

Data Snippet

str(US_Accidents_May19)
## spec_tbl_df [2,243,939 x 49] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ID                   : chr [1:2243939] "A-1" "A-2" "A-3" "A-4" ...
##  $ Source               : chr [1:2243939] "MapQuest" "MapQuest" "MapQuest" "MapQuest" ...
##  $ TMC                  : num [1:2243939] 201 201 201 201 201 201 201 201 201 201 ...
##  $ Severity             : num [1:2243939] 3 2 2 3 2 3 2 3 2 3 ...
##  $ Start_Time           : POSIXct[1:2243939], format: "2016-02-08 05:46:00" "2016-02-08 06:07:59" ...
##  $ End_Time             : POSIXct[1:2243939], format: "2016-02-08 11:00:00" "2016-02-08 06:37:59" ...
##  $ Start_Lat            : num [1:2243939] 39.9 39.9 39.1 39.7 39.6 ...
##  $ Start_Lng            : num [1:2243939] -84.1 -82.8 -84 -84.2 -84.2 ...
##  $ End_Lat              : num [1:2243939] NA NA NA NA NA NA NA NA NA NA ...
##  $ End_Lng              : num [1:2243939] NA NA NA NA NA NA NA NA NA NA ...
##  $ Distance(mi)         : num [1:2243939] 0.01 0.01 0.01 0.01 0.01 0.01 0 0.01 0 0.01 ...
##  $ Description          : chr [1:2243939] "Right lane blocked due to accident on I-70 Eastbound at Exit 41 OH-235 State Route 4." "Accident on Brice Rd at Tussing Rd. Expect delays." "Accident on OH-32 State Route 32 Westbound at Dela Palma Rd. Expect delays." "Accident on I-75 Southbound at Exits 52 52B US-35. Expect delays." ...
##  $ Number               : num [1:2243939] NA 2584 NA NA NA ...
##  $ Street               : chr [1:2243939] "I-70 E" "Brice Rd" "State Route 32" "I-75 S" ...
##  $ Side                 : chr [1:2243939] "R" "L" "R" "R" ...
##  $ City                 : chr [1:2243939] "Dayton" "Reynoldsburg" "Williamsburg" "Dayton" ...
##  $ County               : chr [1:2243939] "Montgomery" "Franklin" "Clermont" "Montgomery" ...
##  $ State                : chr [1:2243939] "OH" "OH" "OH" "OH" ...
##  $ Zipcode              : chr [1:2243939] "45424" "43068-3402" "45176" "45417" ...
##  $ Country              : chr [1:2243939] "US" "US" "US" "US" ...
##  $ Timezone             : chr [1:2243939] "US/Eastern" "US/Eastern" "US/Eastern" "US/Eastern" ...
##  $ Airport_Code         : chr [1:2243939] "KFFO" "KCMH" "KI69" "KDAY" ...
##  $ Weather_Timestamp    : POSIXct[1:2243939], format: "2016-02-08 05:58:00" "2016-02-08 05:51:00" ...
##  $ Temperature(F)       : num [1:2243939] 36.9 37.9 36 35.1 36 37.9 34 34 33.3 37.4 ...
##  $ Wind_Chill(F)        : num [1:2243939] NA NA 33.3 31 33.3 35.5 31 31 NA 33.8 ...
##  $ Humidity(%)          : num [1:2243939] 91 100 100 96 89 97 100 100 99 100 ...
##  $ Pressure(in)         : num [1:2243939] 29.7 29.6 29.7 29.6 29.6 ...
##  $ Visibility(mi)       : num [1:2243939] 10 10 10 9 6 7 7 7 5 3 ...
##  $ Wind_Direction       : chr [1:2243939] "Calm" "Calm" "SW" "SW" ...
##  $ Wind_Speed(mph)      : num [1:2243939] NA NA 3.5 4.6 3.5 3.5 3.5 3.5 1.2 4.6 ...
##  $ Precipitation(in)    : num [1:2243939] 0.02 0 NA NA NA 0.03 NA NA NA 0.02 ...
##  $ Weather_Condition    : chr [1:2243939] "Light Rain" "Light Rain" "Overcast" "Mostly Cloudy" ...
##  $ Amenity              : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Bump                 : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Crossing             : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Give_Way             : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Junction             : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ No_Exit              : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Railway              : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Roundabout           : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Station              : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Stop                 : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Traffic_Calming      : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Traffic_Signal       : logi [1:2243939] FALSE FALSE TRUE FALSE TRUE FALSE ...
##  $ Turning_Loop         : logi [1:2243939] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Sunrise_Sunset       : chr [1:2243939] "Night" "Night" "Night" "Night" ...
##  $ Civil_Twilight       : chr [1:2243939] "Night" "Night" "Night" "Day" ...
##  $ Nautical_Twilight    : chr [1:2243939] "Night" "Night" "Day" "Day" ...
##  $ Astronomical_Twilight: chr [1:2243939] "Night" "Day" "Day" "Day" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_character(),
##   ..   Source = col_character(),
##   ..   TMC = col_double(),
##   ..   Severity = col_double(),
##   ..   Start_Time = col_datetime(format = ""),
##   ..   End_Time = col_datetime(format = ""),
##   ..   Start_Lat = col_double(),
##   ..   Start_Lng = col_double(),
##   ..   End_Lat = col_double(),
##   ..   End_Lng = col_double(),
##   ..   `Distance(mi)` = col_double(),
##   ..   Description = col_character(),
##   ..   Number = col_double(),
##   ..   Street = col_character(),
##   ..   Side = col_character(),
##   ..   City = col_character(),
##   ..   County = col_character(),
##   ..   State = col_character(),
##   ..   Zipcode = col_character(),
##   ..   Country = col_character(),
##   ..   Timezone = col_character(),
##   ..   Airport_Code = col_character(),
##   ..   Weather_Timestamp = col_datetime(format = ""),
##   ..   `Temperature(F)` = col_double(),
##   ..   `Wind_Chill(F)` = col_double(),
##   ..   `Humidity(%)` = col_double(),
##   ..   `Pressure(in)` = col_double(),
##   ..   `Visibility(mi)` = col_double(),
##   ..   Wind_Direction = col_character(),
##   ..   `Wind_Speed(mph)` = col_double(),
##   ..   `Precipitation(in)` = col_double(),
##   ..   Weather_Condition = col_character(),
##   ..   Amenity = col_logical(),
##   ..   Bump = col_logical(),
##   ..   Crossing = col_logical(),
##   ..   Give_Way = col_logical(),
##   ..   Junction = col_logical(),
##   ..   No_Exit = col_logical(),
##   ..   Railway = col_logical(),
##   ..   Roundabout = col_logical(),
##   ..   Station = col_logical(),
##   ..   Stop = col_logical(),
##   ..   Traffic_Calming = col_logical(),
##   ..   Traffic_Signal = col_logical(),
##   ..   Turning_Loop = col_logical(),
##   ..   Sunrise_Sunset = col_character(),
##   ..   Civil_Twilight = col_character(),
##   ..   Nautical_Twilight = col_character(),
##   ..   Astronomical_Twilight = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summary statistcs

summary(US_Accidents_May19)
##       ID               Source               TMC            Severity    
##  Length:2243939     Length:2243939     Min.   :200.0    Min.   :0.000  
##  Class :character   Class :character   1st Qu.:201.0    1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :201.0    Median :2.000  
##                                        Mean   :207.4    Mean   :2.383  
##                                        3rd Qu.:201.0    3rd Qu.:3.000  
##                                        Max.   :406.0    Max.   :4.000  
##                                        NA's   :516762                  
##    Start_Time                     End_Time                     Start_Lat    
##  Min.   :2015-03-09 07:00:00   Min.   :2016-02-08 06:37:08   Min.   :24.57  
##  1st Qu.:2017-03-24 08:24:00   1st Qu.:2017-03-24 10:26:56   1st Qu.:33.48  
##  Median :2017-12-28 08:45:00   Median :2017-12-28 09:57:54   Median :35.86  
##  Mean   :2017-12-04 17:37:24   Mean   :2017-12-04 19:48:21   Mean   :36.46  
##  3rd Qu.:2018-08-22 14:10:46   3rd Qu.:2018-08-22 15:14:52   3rd Qu.:40.42  
##  Max.   :2019-04-01 03:26:00   Max.   :2020-10-31 23:59:00   Max.   :49.00  
##                                                                             
##    Start_Lng          End_Lat           End_Lng         Distance(mi)     
##  Min.   :-124.62   Min.   :24.6      Min.   :-124.5    Min.   :  0.0000  
##  1st Qu.:-117.14   1st Qu.:33.9      1st Qu.:-117.9    1st Qu.:  0.0000  
##  Median : -88.18   Median :38.0      Median : -90.2    Median :  0.0000  
##  Mean   : -94.86   Mean   :37.4      Mean   : -96.5    Mean   :  0.2879  
##  3rd Qu.: -80.85   3rd Qu.:41.4      3rd Qu.: -80.9    3rd Qu.:  0.0100  
##  Max.   : -67.11   Max.   :49.1      Max.   : -67.1    Max.   :333.6300  
##                    NA's   :1727177   NA's   :1727177                     
##  Description            Number           Street              Side          
##  Length:2243939     Min.   :     1    Length:2243939     Length:2243939    
##  Class :character   1st Qu.:   803    Class :character   Class :character  
##  Mode  :character   Median :  2672    Mode  :character   Mode  :character  
##                     Mean   :  5625                                         
##                     3rd Qu.:  6846                                         
##                     Max.   :961052                                         
##                     NA's   :1458402                                        
##      City              County             State             Zipcode         
##  Length:2243939     Length:2243939     Length:2243939     Length:2243939    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    Country            Timezone         Airport_Code      
##  Length:2243939     Length:2243939     Length:2243939    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Weather_Timestamp             Temperature(F)   Wind_Chill(F)    
##  Min.   :2016-02-08 00:53:00   Min.   :-77.80   Min.   :-65.9    
##  1st Qu.:2017-03-23 09:15:00   1st Qu.: 48.90   1st Qu.: 19.2    
##  Median :2017-12-27 15:53:00   Median : 63.00   Median : 28.7    
##  Mean   :2017-12-04 05:22:14   Mean   : 61.23   Mean   : 26.0    
##  3rd Qu.:2018-08-22 15:53:00   3rd Qu.: 75.90   3rd Qu.: 36.4    
##  Max.   :2019-03-31 23:52:00   Max.   :170.60   Max.   : 45.2    
##  NA's   :47170                 NA's   :62265    NA's   :1852370  
##   Humidity(%)      Pressure(in)   Visibility(mi)   Wind_Direction    
##  Min.   :  4.00   Min.   : 0.00   Min.   :  0.00   Length:2243939    
##  1st Qu.: 50.00   1st Qu.:29.92   1st Qu.: 10.00   Class :character  
##  Median : 68.00   Median :30.03   Median : 10.00   Mode  :character  
##  Mean   : 65.93   Mean   :30.04   Mean   :  9.12                     
##  3rd Qu.: 85.00   3rd Qu.:30.15   3rd Qu.: 10.00                     
##  Max.   :100.00   Max.   :33.04   Max.   :140.00                     
##  NA's   :64467    NA's   :57280   NA's   :71360                      
##  Wind_Speed(mph)  Precipitation(in) Weather_Condition   Amenity       
##  Min.   :  1.2    Min.   : 0.0      Length:2243939     Mode :logical  
##  1st Qu.:  5.8    1st Qu.: 0.0      Class :character   FALSE:2217962  
##  Median :  8.1    Median : 0.0      Mode  :character   TRUE :25977    
##  Mean   :  8.8    Mean   : 0.1                                        
##  3rd Qu.: 11.5    3rd Qu.: 0.0                                        
##  Max.   :822.8    Max.   :10.8                                        
##  NA's   :442954   NA's   :1979466                                     
##     Bump          Crossing        Give_Way        Junction      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2243700   FALSE:2122156   FALSE:2239215   FALSE:2056574  
##  TRUE :239       TRUE :121783    TRUE :4724      TRUE :187365   
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##   No_Exit         Railway        Roundabout       Station       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2241773   FALSE:2225741   FALSE:2243811   FALSE:2207714  
##  TRUE :2166      TRUE :18198     TRUE :128       TRUE :36225    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##     Stop         Traffic_Calming Traffic_Signal  Turning_Loop   
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2222168   FALSE:2243321   FALSE:1885291   FALSE:2243939  
##  TRUE :21771     TRUE :618       TRUE :358648                   
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Sunrise_Sunset     Civil_Twilight     Nautical_Twilight  Astronomical_Twilight
##  Length:2243939     Length:2243939     Length:2243939     Length:2243939       
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
## 

Exploring by State

We can see that Pennsylvania ranks 7th on a national scale based on collected accident records over a period of 5 years(2015-19)

PA had roughly 75k accidents in the 2016-19 period.

On a comparison with the its neighboring(NYC,NJ,DE,MA,WV & OH) states, only New York out ranks Pennsylvania in number of accidents

state_slice<-US_Accidents_May19 %>%         
  group_by(State) %>% 
  dplyr::summarize(count = n())

ggplot(state_slice) +
  geom_bar(aes(x=reorder(State,-count),y=count,fill=count),stat="identity")+
  ggtitle("Record count by states")+xlab("States")+
  scale_y_continuous(name="Record count", labels = scales::comma)+
  scale_fill_continuous(name = "Record Count",low = "green", high = "red",labels = scales::comma)+
  theme(axis.text.x=element_text(angle=90, hjust=1))+
  ggplot2::annotate("text", label = "PA:-75,814",size=4,x='VA',y=nrow(filter(US_Accidents_May19,State=="PA"))+25000,colour="blue")

Exploring feature fillrate

When checking for presence of null values we see that the following features have high number of null values:- - Latitude and Longitude of end point of accident - Street number of accident - Wind chill,precipitation and windspeed

The nulls in location attributes wont hinder our analysis as we still have the zip code and start lat,longs.

However the absence of the above mentioned weaher factores are worrisome.

pa_slice<-filter(US_Accidents_May19,State=="PA")

missing_val_count<-as.data.frame(sapply(pa_slice, function(x) sum(is.na(x))*100/nrow(pa_slice)))
missing_val_count$ftName<-row.names(missing_val_count)
names(missing_val_count)[1] <- "fillRate"
row.names(missing_val_count) <- NULL

ggplot(filter(missing_val_count,fillRate >= 1)) +
  geom_bar(aes(x=reorder(ftName,fillRate),y=fillRate,fill=fillRate),stat="identity")+
  coord_flip()+
  scale_fill_continuous(name = "Missing value %")+
  ggtitle("Fetaures with 5% or Missing values")+xlab("Features")+
  ylab("Missing value%")

Exploring time trend of accidents in PA by severity

We observe that there was a spike in accidents in 2016 (especially in September from multi vehicle collision)

However,the general trend of accidents have remained stable over the years

pa_slice$date<-format(as.Date(pa_slice$Start_Time), "%Y-%m")

pa_slice_dt_sev<-pa_slice %>%         
  group_by(date,Severity) %>% 
  dplyr::summarize(count = n())

pa_slice_dt_sev$Severity<-as.character(pa_slice_dt_sev$Severity)

pa_slice_dt_sev %>%
  ggplot( aes(x=date, y=count, group=Severity, color=Severity)) +
    geom_line()+
  ggtitle("Time trend of Accidents by severity in PA")+
  xlab("Accident count")+
  ylab("Date(Year-month)")+
  theme(axis.text.x=element_text(angle=90, hjust=1))+ 
  scale_color_manual(values=c("darkgreen", "blue","orange","red"))

Top weather conditions during accidents

We observe that that most of the accidents occur under normal weather conditions

There are only few sever accidents - similar trend as in previous time series analysis

weather_con_slice<-pa_slice %>%         
  group_by(Weather_Condition,Severity) %>% 
  dplyr::summarize(count = n())

weather_con_slice_1<-pa_slice %>%         
  group_by(Weather_Condition) %>% 
  dplyr::summarize(count = n())

weather_con_slice_1<-weather_con_slice_1 %>% slice_max(count, n = 5)

weather_con_slice$Severity<-as.character(weather_con_slice$Severity)

weather_con_slice <- filter(weather_con_slice,Weather_Condition %in% c(weather_con_slice_1$Weather_Condition))


ggplot(data = weather_con_slice, aes(x = Weather_Condition, 
                                     y = count, 
                                     fill = Severity))+
  geom_bar(stat="identity")+
  ggtitle("Top 5 weather condition prevailing during accidents")+
  ylab("Accident count")+
  xlab("Weather")+
  scale_fill_manual(values=c("darkgreen", "blue","orange","red"))

Weather condition during most severe accidents

Even when filtering weather conditions for most sever accidents we end up with same conditions as in previous situation - a clear or cloudy da

Thus the missing values in weather features are less concerning now!

weather_con_slice<-pa_slice %>%         
  group_by(Weather_Condition,Severity) %>% 
  dplyr::summarize(count = n())

weather_con_slice_1<-pa_slice %>%         
  group_by(Weather_Condition) %>% 
  dplyr::summarize(count = n())

weather_con_slice_1<-weather_con_slice_1 %>% slice_max(count, n = 5)


knitr::kable(weather_con_slice_1)
Weather_Condition count
Clear 28687
Overcast 18184
Mostly Cloudy 7729
Scattered Clouds 6179
Partly Cloudy 3889

Severity of accident by location type

We observe that a vast majority of the accidents occur at traffic signals

sample_data<-pa_slice[,c(names(US_Accidents_May19)[33:45],c("Severity","ID"))]

sample_data[,c(names(pa_slice)[33:45])] <- sapply(sample_data[,c(names(pa_slice)[33:45])],as.numeric)

sample_data$RS= rowSums(sample_data[,c(names(pa_slice)[33:45])])

sample_data<-sample_data %>% filter(RS == 1)

k<-as.data.frame(colSums(sample_data[,c(names(pa_slice)[33:45])]))

k$Location_Type<-row.names(k)
names(k)[1] <- "Accident"
row.names(k) <- NULL


ggplot(k %>% slice_max(Accident, n = 5)) +
  geom_bar(aes(x=reorder(Location_Type,-Accident),y=Accident,fill=Accident),stat="identity")+
  ggtitle("Accident count by Location Type")+xlab("Location Type")+
  scale_y_continuous(name="#Accident", labels = scales::comma)+
  theme(axis.text.x=element_text(angle=90, hjust=1))

Accident severity at traffic signals

Vast majority of accidents at traffic signals are of severity 2

Tfs<-filter(pa_slice,Traffic_Signal==TRUE)


Tfs<-Tfs %>%         
  group_by(Severity) %>% 
  dplyr::summarize(count = n())

Tfs$Severity<-as.character(Tfs$Severity)


ggplot(Tfs, aes(x="", y=count, fill=Severity)) +
  geom_bar(stat="identity", width=1,position = "fill") +
  coord_polar("y", start=0)+
  ggtitle("Distribution of accidents at traffic signals by severity")+
  ylab("")+
  xlab("")+
  scale_fill_manual(values=c("darkgreen", "blue","orange","red"))

Accidents at different times of day

Vast majority of accidents occur during day time

And the distribution is multimodal with a lot of data clustered around severity 2

dayNight_slice<-pa_slice %>%         
  group_by(Sunrise_Sunset) %>% 
  dplyr::summarize(count = n())

dayNight_slice<-dayNight_slice %>% tidyr::drop_na()

ggplot(dayNight_slice) +
  geom_bar(aes(x=reorder(Sunrise_Sunset,-count),y=count,fill=count),stat="identity")+
  ggtitle("Record count by Day/Night")+xlab("Conditon")+
  scale_y_continuous(name="Record count", labels = scales::comma)+
  scale_fill_continuous(name="Record count", labels = scales::comma)

And the distribution is multimodal with a lot of data clustered around severity 2

qplot(Sunrise_Sunset,
      Severity,
      data = filter(pa_slice,Sunrise_Sunset %in% c("Day","Night")), 
      geom=c("violin"),
      main = "Severity distribution at different times of day",
      trim = FALSE,
      fill = Sunrise_Sunset,
      xlab = "Time of Day",
      ylab = "Severity")+
  labs(fill = "Day/Night")

Accident by visibility

  • The hexplot revealed some outliers for each category
  • Density of accidents to lower end of visibility is observed
  • A particular bright spot for Severity 2 accidents are observed at 10miles
gpdata <- ggplot(data=pa_slice,aes(x=`Visibility(mi)`, y=Severity))

gpdata+ stat_binhex()+ggtitle("Plot of Visibility vs  Severity")+
scale_y_continuous(name="#Accidents", labels = scales::comma)+
  ggplot2::annotate("text", label = "< High number of \n Accidents",size=4,x=12.5,y=1.9,colour="red")

Time delay caused by these accidents

Inspection of traffic congestions caused by these accidents reveal that -Severity 1,2 and 3 have similar range of values - their means differ but the central quarters overlap well. -These last about 30-50 minutes mostly. -Sevrity 2 and 3 accidents have a lot of outliers towards higher duration direction and their mean duration is lower than that of severity 1’s -Severity 4 accidents cause major congestion with mean duration of 6hrs,this also a very tight distribution -due to fewer number of accidents in this category

pa_td_slice<-pa_slice

pa_td_slice$duration<-pa_td_slice$End_Time - pa_td_slice$Start_Time



boxplot(pa_td_slice$duration ~ pa_td_slice$Severity,
        ylim = c(0, 100),
        xlim=c(0.5,3.5),
        main = 'Delays(in minutes) vs severity[1-3]',
        col=c("blue","green","red"),
        ylab="severity",
        xlab="Duration(min)")

boxplot(pa_td_slice$duration ~ pa_td_slice$Severity,
        ylim = c(300, 400),
        xlim=c(3.5,4.5),
        main = 'Delays(in minutes) vs severity[4]',
        col=c("yellow"),
        ylab="severity",
        xlab="Duration(min)")

Another way to look at the time duration vs Severity:-

  • The time axis is a log measure to make the comparison easier
  • Here again the stark difference in duration between Severity 4 and other categories are visible
  • There is also a very slight overlap of lowest Severity 4 mode with highest peaks of other classes - indicating some common ground among them
# https://www.r-graph-gallery.com/294-basic-ridgeline-plot.html

ggplot(pa_td_slice, aes(x = log10(as.numeric(duration)), y=as.factor(Severity) ,fill=as.factor(Severity))) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none")+
  ggtitle("Severity vs log(Duration)")+
  ylab("Severity")+
  xlab("log(Duration)")

Comparing duration distribution of severity 2 vs 3

Performing a t test reveals that there is not much difference between severity 2 and 3 accidents when compared at duration.

The p value of the test is very low indicating a similar distribution.

sev2<-pa_td_slice%>%filter(Severity==2)
sev3<-pa_td_slice%>%filter(Severity==3)
t.test(sev2$duration,sev3$duration)
## 
##  Welch Two Sample t-test
## 
## data:  sev2$duration and sev3$duration
## t = 4.5101 mins, df = 34855, p-value = 6.499e-06 mins
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  13.12042 mins 33.28928 mins
## sample estimates:
## Time differences in mins
## mean of x mean of y 
##  73.76587  50.56102

Defining severity & frequency metrics

We are interested in the accident data over the latest 3 years as these will lead to more actionable insights

Thus we will focus more on years 2017,2018 and 2019.

We will count

We are rounding down the latitude,longitude coordinates by two positions to collate incidents that are close to each other - we will refer to this as simplified coordinates.

-Eg:- (Lat:40.45029,Long:-79.95139) and (Lat:40.45030,Long:-79.95120) will be grouped under Lat:40.450,Long:-79.951 as these points are close by

We are creating a metric “recurrence_index” where we take a weighted average of annual number of accidents occurring at a location over latest three years.

  • The highest weightage of 1/2 will be given to No. of accidents in 2019
  • The second highest weightage of 1/3 will be given to No. of accidents in 2018
  • The lowest weightage of 1/6 will be given to No. of accidents in 2017

recurrence_index= (Accidents in 2017)x1/6 + (Accidents in 2017)x2/6 + (Accidents in 2017)x3/6

We are also creating a new index for severity “severity_index” which gives a direction to weather the accident severity is increasing in a location or decreasing in a location. The procedure to generate it is as follows:-

  • We compute the average annual severity index of accidents recurring at a location(simplified by lat,lon coordinates) for latest 3 years
  • We will compare if this average annual severity index is increasing over time or decreasing year on year.

The index values are assigned in decresing order of severity with time relevance

  • If the index has increased continuously or remained steady from 2017->2018->2019 then we will give it a value of 4
  • If the index has increased or remained steady from 2018->2019(most recent years) but decrease from 2017->2018 we give it a value of 3
  • If the index has decreased or remained steady from 2018->2019(most recent years) but increased from 2017->2018 we give it a value of 2 If the index has continuously decreased from 2017->2018->2019(most recent years) then we give it a value of 1
pa_slice_re<-pa_slice

pa_slice_re$start_year<-format(as.Date(pa_slice_re$Start_Time, format="%d/%m/%Y"),"%Y")



test<-pa_slice_re %>%
  dplyr::filter(start_year %in% c("2019","2018","2017"))%>%
  dplyr::group_by(start_year,Start_Lat=round(Start_Lat,2),Start_Lng=round(Start_Lng,2),Street,Zipcode)%>%
  dplyr::summarise(UNIQUE_COUNT = n_distinct(ID),Severity=mean(Severity))%>% 
  dplyr::mutate(cnt_2017 = case_when(start_year=="2017" ~ as.numeric(UNIQUE_COUNT),TRUE ~ 0))%>%
  dplyr::mutate(cnt_2018 = case_when(start_year=="2018" ~ as.numeric(UNIQUE_COUNT),TRUE ~ 0))%>%
  dplyr::mutate(cnt_2019 = case_when(start_year=="2019" ~ as.numeric(UNIQUE_COUNT),TRUE ~ 0))%>%
  dplyr::mutate(sev_2017 = case_when(start_year=="2017" ~ as.numeric(Severity),TRUE ~ 0))%>%
  dplyr::mutate(sev_2018 = case_when(start_year=="2018" ~ as.numeric(Severity),TRUE ~ 0))%>%
  dplyr::mutate(sev_2019 = case_when(start_year=="2019" ~ as.numeric(Severity),TRUE ~ 0))

  
test<-test %>%
  dplyr::group_by(Start_Lat,Start_Lng)%>%
  dplyr::summarise_at(c("cnt_2017", "cnt_2018","cnt_2019","sev_2017", "sev_2018","sev_2019"), sum, na.rm=FALSE)

test_All_3Yrs<-test%>%
  dplyr::filter(cnt_2017>0&cnt_2018>0&cnt_2019>0)%>%
  dplyr::mutate(recurrence_index=((cnt_2017*1/6)+(cnt_2018*2/6)+(cnt_2017*3/6)))

test_Sev_3Yrs<-test%>%
  dplyr::filter(sev_2017>0&sev_2018>0&sev_2019>0)%>%
  dplyr::mutate(severity_index=case_when(sev_2019>=sev_2018&sev_2018>=sev_2017 ~ 4,
                                    sev_2019>=sev_2018&sev_2018<=sev_2017 ~ 3,
                                    sev_2019<=sev_2018&sev_2018>=sev_2017 ~ 2,
                                    sev_2019<=sev_2018&sev_2018<=sev_2017 ~ 1,TRUE ~ 0))

Exploring recurring accidents in Pennsylvania

  • We observe that most of Pennsylvania have a low recurrence_index of 20-60.
  • However south eastern(SE) Pennsylvania has a high frequency of these accidents recurring
  • The highest recurrence_index is also observed there at schullykill expressway,West Philadelphia,PA.
# https://r-spatial.org/r/2018/10/25/ggplot2-sf-2.html

# http://zevross.com/blog/2014/07/16/mapping-in-r-using-the-ggplot2-package/
# https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
# https://slcladal.github.io/maps.html

world <- ne_countries(scale = "medium", returnclass = "sf")


states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))



sf::sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
states <- cbind(states, st_coordinates(st_centroid(states)))
states$ID <- toTitleCase(states$ID)

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("pennsylvania", counties$ID))

test_All_3Yrs<-test_All_3Yrs %>%
 arrange(recurrence_index)

sites <- st_as_sf(test_All_3Yrs, coords = c("Start_Lng", "Start_Lat"), 
    crs = 4326, agr = "constant")



ggplot(data = world) +
    geom_sf(fill = "antiquewhite1") +
    geom_sf(data = counties, fill = "white")+
    geom_sf(data = states, fill = NA) + 
    geom_sf(data = sites, size = 1.75, shape = 24,aes( fill = sites$recurrence_index),colour=alpha("white",0.2))+
    scale_fill_distiller(palette = "Spectral")+
    scale_fill_fermenter(name = "# Accidents",palette = "Spectral", breaks = c(0,20,40,60,80,100,120,140,160))+
    coord_sf(xlim = c(-80.65, -74.6), ylim = c(39.5, 42.5), expand = FALSE)+
    theme(legend.position='left')+
    ggtitle("Recurring Accident spots and #Accidents in PA")+xlab("Longitude")+
    ylab("Latitude")+
  ggplot2::annotate("text", label = "\nSchullykill \n expressway",size=3.2,y=39.8,x=-75.2,colour="red")

A closer look at South eastern pennsylvania

Bucks county has a very high density of reported accidents over the latest 3 years

ggplot(data = world) +
    geom_sf(fill = "antiquewhite1") +
    # geom_sf(data = counties) +
    geom_sf(data = counties, fill = "white")+
    geom_sf(data = states, fill = NA) + 
    geom_sf(data = sites, aes( fill = sites$recurrence_index),shape=23,size = 1.75,colour=alpha("white",0.2))+
    scale_fill_distiller(palette = "Spectral")+
    scale_fill_fermenter(name = "# Accidents",palette = "Spectral", breaks = c(0,20,40,60,80,100,120,140,160))+
    coord_sf(xlim = c(-75.75, -74.5), ylim = c(39.5, 40.5), expand = FALSE)+
    theme(legend.position='left')+
    ggtitle("Recurring Accident spots and #Accidents in SE PA")+xlab("Longitude")+
    ylab("Latitude")+
  ggplot2::annotate("text", label = "\nSchullykill \n expressway",size=3.2,y=39.8,x=-75.2,colour="red")

Top 10 Locations that need intervention

We calculate a risk index by multiplying the recurrance and severity index from the previous steps.

On stack ranking locations by this index we seek to identifty the riskiest roads in Pennsylvania.

test_Sev_3Yrs<-test_Sev_3Yrs[,c("Start_Lat","Start_Lng","severity_index")]
test_All_3Yrs<-test_All_3Yrs[,c("Start_Lat","Start_Lng","recurrence_index")]

inference_batch<-head(test_Sev_3Yrs %>% inner_join(test_All_3Yrs) %>% 
  mutate(risk_index = severity_index*recurrence_index) %>%arrange(desc(risk_index)),10)

address_set<-pa_slice[,c("Start_Lat","Start_Lng","Street","City","State","Zipcode")]%>%
  mutate(Start_Lat=round(Start_Lat,2),Start_Lng=round(Start_Lng,2))


inference_batch<-inference_batch %>% inner_join(address_set%>%group_by(Start_Lat,Start_Lng) %>% filter(row_number() == 1))


knitr::kable(inference_batch[,c( "risk_index","Street","City","Zipcode")])
risk_index Street City Zipcode
326.0000 I-76 E Philadelphia 19104
300.0000 Parkway East Pittsburgh 15218
224.6667 Schuylkill Expy W Conshohocken 19428
193.0000 Aramingo Ave Philadelphia 19125
187.0000 I-95 S Philadelphia 19136
173.3333 I-95 N Bristol 19007
168.0000 Launfall Rd Plymouth Meeting 19462-2113
153.3333 Pennsylvania Tpke W Willow Grove 19090
152.0000 I-83 N Lemoyne 17043
147.3333 Veterans Memorial Hwy N Conshohocken 19428

Conclusion

From our analysis we can observe about road accidents in Pennsylvania:

  • Most accidents are of type severity 2
  • Severity 2 accidents are minor accidents that occur in regular weather and blocking traffic by approx 45mins
  • Most accidents occur in day time,almost double the count of those at night
  • Most accidents occur under clear weather and some under cloudy weather
  • Most accidents occur at traffic signal,followed by Junctions and stops
  • Most accidents occur under visibility of 10m, lower the visibility more the severity
  • There is no difference in distribution of duration between severity 2 and 3 accidents - proved by t-test.
  • Bucks county seem to have a very high number of accidents recurring
  • Pittsburgh is showing an increasing trend in accident severity - the county is becoming more dangerous to drive year over year