
Reproducing DfT pedestrian stats sheet #240

Open
BlaiseKelly opened this issue Jun 24, 2024 · 15 comments
Comments

@BlaiseKelly

BlaiseKelly commented Jun 24, 2024

This is maybe not an issue as such, but to use this package for an analysis I wanted to make sure I was using the right data, so I decided to try to reproduce the 2018-2022 sections of the government summary sheet for pedestrian casualties: https://www.gov.uk/government/statistics/reported-road-casualties-great-britain-pedestrian-factsheet-2022/reported-road-casualties-in-great-britain-pedestrian-factsheet-2022#how-far-do-pedestrians-travel. This could also become a future test dataset for the package once the discrepancies have been ironed out?

My version is output in the fork: https://github.com/BlaiseKelly/stats19 and the qmd file is in the tests folder tests/Pedestrian_factsheet_2022.qmd

There are quite a few differences, no doubt some due to errors I have made. The first paragraph is maybe a good one to post here, as it will likely feed into many of the other stats.

DfT get: "an average of 8 pedestrians died and 109 were seriously injured (adjusted) per week in reported road collisions"

I get 41 pedestrians died and 505 seriously injured, which does seem quite high. Below is how I did it:

library(stats19) 
library(dplyr)
library(lubridate)

yrz <- c(2018, 2019, 2020, 2021, 2022)
col_list <- list()
cas_list <- list()
for(y in yrz){

## request collision data
crashes = get_stats19(year = y, type = "collision", ask = FALSE, format = TRUE, output_format = "data.frame") %>%
  select(-accident_year)

## request casualty
casualties = get_stats19(year = y, type = "casualty", ask = FALSE, format = TRUE, output_format = "data.frame") %>%
  select(-accident_year)


## add each year to a list
  col_list[[as.character(y)]] <- crashes
  cas_list[[as.character(y)]] <- casualties

}

## extract from list
all_cra <- do.call(rbind,col_list)
all_cas <- do.call(rbind,cas_list)

## casualty type data fundamental to this analysis
crash_cas <- all_cra %>% 
  left_join(all_cas, by = "accident_reference")

When the casualty data is joined there are multiple matches per accident reference. This is consistent with the number_of_casualties column, and there is one row per casualty, so adding a count column and summing it up should give the correct casualty totals?
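As a toy check of that one-to-many behaviour (entirely hypothetical data, not the real STATS19 fields beyond the two column names), a left join duplicates each collision row once per matching casualty, so counting rows per accident_reference recovers the casualty total:

```r
library(dplyr)

## hypothetical mini-tables: one row per collision, one row per casualty
crashes_toy <- tibble(accident_reference = c("A1", "A2"),
                      number_of_casualties = c(2, 1))
casualties_toy <- tibble(accident_reference = c("A1", "A1", "A2"),
                         casualty_type = c("Pedestrian", "Cyclist", "Pedestrian"))

## one-to-many join: the A1 collision row is duplicated for its 2 casualties
joined <- left_join(crashes_toy, casualties_toy, by = "accident_reference")

count(joined, accident_reference)  # row counts match number_of_casualties
```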

dths_per_wk <- crash_cas %>% 
  mutate(wk = isoweek(date)) %>% ## calculate the ISO week of the year (1-53)
  select(wk, accident_severity, casualty_type, number_of_casualties,accident_reference) %>% ## pick out the columns needed
  mutate(count = 1) %>% ## there are multiple rows for each casualty, add a 1 so can sum up the number for each circumstance
  filter(casualty_type == "Pedestrian") %>%  ## pick out only pedestrian casualties
  group_by(wk, accident_severity) %>%
  summarise(casualties = sum(count))

  ## filter for only fatal collisions
fatal_wk <- dths_per_wk %>% 
  filter(accident_severity == "Fatal")

## filter for only serious injuries
serious_wk <- dths_per_wk %>% 
  filter(accident_severity == "Serious")

This gives two dataframes of the total deaths per week for each injury severity level.

Now pick out the mean of this for each injury severity.

## output single value
fatalities = round(mean(fatal_wk$casualties))
serious = round(mean(serious_wk$casualties))

## stats document said 8
fatalities 
## stats document said 109
serious

Amongst some other queries: the report refers to different age bands, and there don't appear to be contributory factors in the CSVs this package accesses?

@BlaiseKelly
Author

I had left out year:

dths_per_wk <- crash_cas %>% 
  mutate(wk = isoweek(date),
         yr = year(date)) %>%
  select(wk,yr, accident_severity, casualty_type, number_of_casualties,accident_reference) %>% 
  mutate(count = 1) %>% 
  filter(casualty_type == "Pedestrian") %>%
  group_by(wk, yr, accident_severity) %>%
  summarise(casualties = sum(count))

fatal_wk <- dths_per_wk %>% 
  filter(accident_severity == "Fatal")

## filter for only serious injuries
serious_wk <- dths_per_wk %>% 
  filter(accident_severity == "Serious")

## recompute the weekly means
fatalities = round(mean(fatal_wk$casualties))
serious = round(mean(serious_wk$casualties))

## stats document said 8
fatalities 
## stats document said 109
serious

@BlaiseKelly
Author

"In 2022, 385 pedestrians were killed in Great Britain, whilst 5,901 were reported to be seriously injured (adjusted) and 13,041 slightly injured (adjusted)."

I get:

## 2022
severity_2022 <- crash_cas %>% 
  select(accident_severity, casualty_type, date, number_of_casualties,accident_reference) %>% 
  mutate(accident_year = year(date)) %>% 
  filter(casualty_type == "Pedestrian" & accident_year == 2022) %>% 
  mutate(count = 1) %>% 
  group_by(accident_severity) %>% 
  summarise(casualties = sum(number_of_casualties)) 

fatal_2022 <- filter(severity_2022, accident_severity == "Fatal")

serious_2022 <- filter(severity_2022, accident_severity == "Serious")

slight_2022 <- filter(severity_2022, accident_severity == "Slight")

fatal = round(fatal_2022$casualties)

serious = round(serious_2022$casualties)

slight = round(slight_2022$casualties)

fatal: 556
serious: 6986
slight: 14485

@Robinlovelace
Member

This is awesome! Big thumbs up to reproducing official stats based on the STATS19 datasets and interesting that the numbers differ. It may be worth emailing the relevant address for clarification, if we can confirm that the results do indeed differ. I cannot off the top of my head think of any clear explanation for this so cc others involved in STATS19 data analysis @wengraf, @layik and @rogerbeecham.

@Robinlovelace
Member

Also cc @PublicHealthDataGeek FYI

@layik
Member

layik commented Jul 17, 2024

Hi @Robinlovelace and @BlaiseKelly. I just had time to see this thread. As Robin said this is very interesting. Here is my reprex:

I get 41 pedestrians died and 505 seriously injured

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#

##### same code as above...

## stats document said 8
fatalities 
#> [1] 41
## stats document said 109
serious
#> [1] 505

Created on 2024-07-17 with reprex v2.1.1

We are not saying this is an issue with the package, right?

@wengraf
Contributor

wengraf commented Jul 17, 2024

I'll have a go with MAST and see what that says. Will report back.

@BlaiseKelly
Author

Thanks for taking a look!

I don't think it's an issue with the package as such, although it seems the underlying data they have used might be different (i.e. age bands, and where are the 'contributory factors'?)

@wengraf
Contributor

wengraf commented Jul 17, 2024

Using MAST, I get 2,018 pedestrian fatalities over the 5 years of 2018-22. That averages out to 404 per annum, and 7.76 per week. I've looked at unadjusted serious (not adjusted as in the quote), and the corresponding figures are 25,509, 5,102 and 98.11 respectively... that sounds like it will be in line with the adjusted serious numbers they have. So I would say MAST is in line with what @BlaiseKelly says the DfT said (I'll admit, I've not gone back to the source material you're comparing to).

Contributory factors (CFs) are not in the public data... you need special permission to access those, and they have to be quite carefully interpreted.

The public data has single year of age and also age bands.

Let me now see if I can get my R code to make the DfT result...

@wengraf
Contributor

wengraf commented Jul 17, 2024

I've had a crack at this, and I get the MAST/DfT result...

library(stats19) 
library(dplyr)
library(lubridate)

yrz <-  c(2018,2019,2020, 2021,2022)
col_list <- list()
cas_list <- list()
for(y in yrz){
  ## request collision data
  crashes = get_stats19(year = y, type = "collision", ask = FALSE, format = TRUE, output_format = "data.frame")
  
  ## request casualty
  casualties = get_stats19(year = y, type = "casualty", ask = FALSE, format = TRUE, output_format = "data.frame")
  casualties$lsoa_of_casualty <- as.character(casualties$lsoa_of_casualty)
  
  ## add each year to a list
  col_list[[as.character(y)]] <- crashes
  cas_list[[as.character(y)]] <- casualties
  
  message(y)
}

## extract from list
all_cra <- bind_rows(col_list)
all_cas <- bind_rows(cas_list)

## Check total casualties and total crashes against MAST.
# Total casualties should be 693028
print(nrow(all_cas))
# Total crashes should be 538461
print(nrow(all_cra))

cas_ped <- subset(all_cas, all_cas$casualty_class == "Pedestrian")

# Clearly an issue with formatting numbers (see Age_of_casualty)...see issue #235

# Make a summary table of pedestrian casualties by severity, then make annual and weekly averages.
summary.df <- cas_ped %>% group_by(casualty_severity) %>% dplyr::summarise("count" = n())

summary.df$annual_average <- summary.df$count/5
summary.df$weekly_average <- summary.df$annual_average/52

There is definitely an issue with formatting numbers, which I've raised previously (#235) and which is stopping access to single year of age of casualty. For the purpose of reproducing this DfT stat, I cannot see the logic of needing a table join or much of what follows from it... I suspect something weird is going on in all of that...

@layik
Member

layik commented Jul 17, 2024

Thank you @wengraf! I see. I actually cannot see where the flaw (if any) is in @BlaiseKelly's code. The left join is required to get the "date" of the accident (I think). So I got the date out and worked with "casualty_severity", but I am still getting close to 5x the reported/your results. It looks like assigning the casualties to weeks rather than years makes a difference, which cannot be right.

@BlaiseKelly, the wk values do contain the number 53. I will leave it there.
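A small sketch of the pitfall, using hypothetical daily data rather than the real STATS19 tables: grouping by ISO week alone pools all five years into the same 52-53 bins, so each weekly count is roughly 5x a single year's, and week 53 shows up for dates whose ISO week belongs to a 53-week year (e.g. late December 2020 and the first days of January 2021).

```r
library(dplyr)
library(lubridate)

## one hypothetical row per day across the five years
toy <- tibble(date = seq(as.Date("2018-01-01"), as.Date("2022-12-31"), by = "day"))

## grouping by week only: every week bin pools five years of days
pooled <- toy %>%
  mutate(wk = isoweek(date)) %>%
  count(wk)          # a normal week holds 7 days x 5 years = 35 rows

## grouping by year AND week: one bin per year-week, 7 days at most
per_year <- toy %>%
  mutate(wk = isoweek(date), yr = year(date)) %>%
  count(yr, wk)
```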

@wengraf
Contributor

wengraf commented Jul 17, 2024

Thank you @wengraf! I see, I actually cannot see where the flaw (if any) is in @BlaiseKelly's code. The left join is required to get the "date" of the accident (I think).

For some reason, @BlaiseKelly drops accident year from the downloading function. Don't drop that, and then you have accident year.

I haven't followed through the code from the left_join onwards, but I cannot see the reasoning for it in the first place. You join tables when your base unit (casualty, crash or vehicle) doesn't have some field from another you'd like (e.g., engine capacity of vehicle of casualty, or similar). I can't see how that's needed to reproduce the stat here.

I've not assigned anything to weeks...the initial quote is a five year average turned into a weekly average. In fact, you don't need date or year at all (other than for the initial downloading loop).
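That calculation can be sketched without any join or date handling at all: a five-year total of pedestrian casualty rows, divided down to annual and weekly averages. The column names below match the formatted stats19 casualty table; a tiny hypothetical stand-in is built in place of the stacked `all_cas` from the download loop so the sketch runs on its own.

```r
library(dplyr)

## hypothetical stand-in for the five-year stacked casualty table
all_cas_toy <- tibble(
  casualty_class    = c("Pedestrian", "Pedestrian", "Driver or rider"),
  casualty_severity = c("Fatal", "Serious", "Fatal")
)

## one count per severity, then divide by 5 years and 52 weeks
weekly_avg <- all_cas_toy %>%
  filter(casualty_class == "Pedestrian") %>%
  count(casualty_severity, name = "count") %>%
  mutate(annual_average = count / 5,
         weekly_average = count / (5 * 52))
```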

@layik
Member

layik commented Jul 17, 2024

That is it then. In @BlaiseKelly's code the casualties from all five years are pooled into the same week numbers, hence the roughly 5x figures.

@layik
Member

layik commented Jul 17, 2024

Please close this when you are happy @BlaiseKelly.

@Robinlovelace
Member

Phew! Sounds like a resolution... This could be hugely useful, not only in showing the validity of the package but also of the official stats. I suggest making the updated, correct result into a vignette; that would really be closure for me.

@BlaiseKelly
Author

Thanks for all your efforts looking at this.

Yes, I agree the left_join was maybe unnecessary for this stat, but this was an example from the qmd trying to reproduce the full report https://github.com/BlaiseKelly/stats19/blob/master/tests%2FPedestrian_factsheet_2022.qmd. Further down it is necessary to join the three datasets together (I think), so I thought I might as well do it at the start to have one big data frame.

Will have a proper look at this in the next few days and hopefully close, and will definitely try to leave it as an example of some form, as @Robinlovelace suggests.
