---
title: "Small area unit multimodal travel time matrices in Great Britain"
subtitle: "Data descriptor"
author:
- name: "J Rafael Verduzco-Torres"
affiliation: "Urban Big Data Centre, University of Glasgow"
url: "JoseRafael.Verduzco-Torres@glasgow.ac.uk"
orcid_id: 0000-0002-1324-1714
- name: "David P McArthur"
affiliation: "Urban Big Data Centre, University of Glasgow"
orcid_id: 0000-0002-9142-3126
date: "February 2024"
bibliography: bib/ttm_references.bib
output: distill::distill_article
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(kableExtra)
```
# Background
The estimated travel time from multiple origins to several potential destinations is a key input for many regional and urban analyses. However, computing these values at a granular geographic level can be technically complex, time-consuming, and computationally intensive [@Conway2017; @Higgins2022]. The present dataset provides ready-to-use travel time estimates from each lower super output area (LSOA) and data zone (DZ) in Great Britain (42,000 LSOA/DZ in total) to all others for three sustainable modes of transport, namely public transport, bicycle, and walking. The public transport estimates consider two times of departure, one during the morning peak and one at night. Altogether, this dataset presents a range of opportunities for researchers and practitioners, such as the development of tailored accessibility measures, spatial connectivity analyses, and the evaluation of public transport service changes throughout the day.
Additionally, a public transport travel time matrix (TTM) at the output area (OA) level is available upon request. This is estimated from each of the 230,000 2011 Census OAs to all other OAs in GB, at the morning peak only.
This dataset is related to the 'Public Transport Accessibility Indicators 2022' (PTAI22) dataset [@VerduzcoTorres2024; @VerduzcoTorres2022a]. The PTAI22 is openly available at <https://zenodo.org/record/8037156>. The LSOA/DZ TTM by public transport provided in this iteration is directly comparable to the TTM offered in the PTAI 2022. It should be noted that despite being published in 2022, the estimates in PTAI22 utilized public transport schedules corresponding to November 2021.
# Method{#method}
## Geographies
The designated points representing origins and destinations are population-weighted centroids (PWC). For this dataset, we employ the definitions corresponding to the 2011 Census.
For England and Wales, the LSOA PWC data were accessed from the Government Digital Service (GDS) in December 2021 (<https://data.gov.uk/dataset/a40f54f7-b123-4185-952f-da90c56b0564/lower-layer-super-output-areas-december-2011-population-weighted-centroids>). The OA PWC locations were sourced from the Office for National Statistics (ONS) in May 2023 (<https://geoportal.statistics.gov.uk/datasets/ons::output-areas-dec-2011-pwc/explore>).
The PWC data for Scottish DZs were obtained from the GDS platform in December 2023 (<https://data.gov.uk/dataset/8aabd120-6e15-41bf-be7c-2536cbc4b2e5/data-zone-centroids-2011>). The PWC data at OA level were accessed from the National Records of Scotland digital platform (<https://www.nrscotland.gov.uk/statistics-and-data/geography/our-products/census-datasets/2011-census/2011-boundaries>) in May 2023.
## Travel times
The travel times were estimated using version 1.0.1 of the `r5r` package [@R-r5r] for the `R` programming language. This is an enhanced `R` implementation of the Conveyal R5 routing engine (<https://github.com/conveyal/r5>) [@Conway2018; @Conway2019]. The main inputs are public transport timetables and the road network, which were combined into a single routing graph for GB. This model enables the estimation of multimodal door-to-door journeys which, for public transport estimates, can combine pedestrian and on-vehicle journey stages.
The road and pedestrian network used was sourced from OpenStreetMap (OSM) and accessed via the Geofabrik platform (<https://download.geofabrik.de/>). The data for the whole of GB were manually downloaded in `.osm.pbf` format in May 2023. The files were filtered to keep only the features relevant for public transport routing (see script `R/00_prepare_osm.sh`).
All of the travel time estimates are limited to a maximum duration of 150 minutes.
### Walking
Travel times on foot use the OSM network as their main input. The routing model considers the segments of the network deemed appropriate for pedestrian navigation according to the classification from the source. Further details can be found at <https://wiki.openstreetmap.org/wiki/Guidelines_for_pedestrian_navigation>. The average walking speed was set to 4.8 kph, following the Journey Time Statistics (JTS) [@DfT2019].
### Bicycle
Routes by bicycle consider the characteristics of the OSM network. The routing engine includes a parameter to control the maximum level of traffic stress tolerated by cyclists, where "[a] value of 1 means cyclists will only travel through the quietest streets, while a value of 4 indicates cyclists can travel through any road" [@R-r5r]. The value chosen for our estimates is '2', which reflects an average cyclist.
The average cycling speed used is 16 kph, consistent with the JTS [@DfT2019]. This applies to segments where a cyclist is allowed to ride. If the fastest route includes a pedestrian-only segment, the speed considered on that segment is 4.8 kph. Further details regarding the cycle route classification are available at <https://wiki.openstreetmap.org/wiki/Cycle_routes>.
The estimates include a column with the adjusted cycling travel time, which adds five minutes for unlocking and locking a bicycle at the start and end of a journey, respectively. This time period is consistent with the JTS.
### Public transport: sources and parameters
The timetable data used for the public transport estimates are drawn from two sources. The first is the Rail Delivery Group (<https://data.atoc.org/data-download>), which mainly covers heavy rail services, e.g. regional trains in GB. The data used were published for the week corresponding to the 4th of March 2023 and downloaded on the 7th of March 2023. These data, originally accessed in CIF format, were then converted to the General Transit Feed Specification (GTFS) using the `uk2gtfs` package for `R` [@Morgan2022a].
The second timetable source used is the Bus Open Data Service (BODS, <https://www.bus-data.dft.gov.uk/>), which primarily includes local services across various transport modes, such as bus, tram, and ferry. The BODS data, downloaded in GTFS format on the 7th of March 2023, come from operating companies and are supplemented by additional sources such as Traveline (<https://www.travelinedata.org.uk/>). We compared the BODS data and the Traveline National Dataset (TNDS), both of which aim to provide equivalent information. [Appendix A](#appedix-comparison) shows the results of this comparison, which suggest a high level of consistency. However, transforming TNDS from TransXChange to GTFS format leads to some loss of information. Thus, we preferred to use the data provided by BODS.
#### Routing parameters
The mode specified is a combination of walking and public transport (defined as `c('WALK', 'TRANSIT')` in the software). The maximum access/egress walking time to/from public transport stations is unlimited, as long as the total journey does not exceed the maximum duration. This implies that some routes can be completed by walking alone if no public transport services are available. The walking speed was set to 4 km/h. The maximum number of public transport rides (on-vehicle stages) was set to 3, which allows a maximum of two transfers.
The travel time via public transport can fluctuate depending on the operational characteristics of the services and the time of departure chosen [@Conway2018]. To account for this variability, we considered a three-hour time window. The output representing the travel time in this dataset corresponds to the 25th, 50th, and 75th percentiles of the options available within this time span (for more details, see the Conveyal R5 user manual <https://docs.conveyal.com/analysis/configuration>).
The date of departure set for all travel times in this dataset is the 7th of March 2023. The departure time for the morning peak TTM is 7 a.m., or effectively from 7 to 10 a.m. considering the three-hour time window. The TTM at the LSOA/DZ level considers a second departure time in the evening, specifically 9 p.m. with a three-hour time window, or equivalently between 9 p.m. and 12 a.m.
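The routing parameters described above can be sketched in `r5r` as follows. This is an illustrative sketch rather than the exact production script (the full code is in the repository listed under Code availability): the `data/graph` path and the `origins` data frame are placeholders, and the arguments shown (`time_window`, `percentiles`, `max_trip_duration`, `max_rides`, `walk_speed`) follow the `r5r` 1.0.x API.

```{r echo=TRUE, eval=FALSE}
library(r5r)

# Build the routing graph from the OSM extract and GTFS feeds
# stored in the (placeholder) data directory
r5r_core <- setup_r5(data_path = "data/graph")

# `origins`: a data frame with columns id, lon, lat holding the
# LSOA/DZ population-weighted centroids
ttm_am <- travel_time_matrix(
  r5r_core,
  origins = origins,
  destinations = origins,
  mode = c("WALK", "TRANSIT"),
  departure_datetime = as.POSIXct("07-03-2023 07:00:00",
                                  format = "%d-%m-%Y %H:%M:%S"),
  time_window = 180,            # three-hour departure window (minutes)
  percentiles = c(25, 50, 75),  # travel time percentiles reported
  max_trip_duration = 150,      # minutes
  max_rides = 3,                # at most two transfers
  walk_speed = 4                # access/egress walking speed, km/h
)
```

The night-time matrix follows the same call with `departure_datetime` set to 9 p.m.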
# Data records
The TTMs follow a long format, where each row represents a unique origin-destination pair. The TTMs are offered in a set of sequentially named `.parquet` files (more information about the Parquet format at: <https://parquet.apache.org/>). The structure contains one directory for each mode, where 'bike', 'pt', and 'walk' correspond to bicycle, public transport, and walking, respectively.
```{r}
#fs::dir_tree('./output/ttm/lsoa/', recurse = TRUE)
```
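Individual part files can also be read directly into `R` with the `arrow` package. The snippet below is a self-contained sketch that writes and reads a toy table shaped like one TTM part file; the file name and column names (`from_id`, `to_id`, `travel_time`) are illustrative only — see the codebook tables below for the actual schema of each mode.

```{r echo=TRUE, eval=FALSE}
library(arrow)

# Toy table shaped like one TTM part file (values are made up)
toy <- data.frame(
  from_id     = c("E01000001", "E01000001"),
  to_id       = c("E01000002", "E01000003"),
  travel_time = c(35L, 120L)
)
write_parquet(toy, "ttm_toy.parquet")

# Real part files are read back the same way
ttm_part <- read_parquet("ttm_toy.parquet")
nrow(ttm_part)
```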
## Walking
The walking TTM contains 13.3 million rows and three columns. The table below offers a description of the columns.
```{r walk-desc}
caption_losadesc <- "Walking travel time matrix codebook."
desc_walk <- read.csv('output/ttm/lsoa/descrip_walk.csv')
desc_walk %>%
kbl(caption = caption_losadesc) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
## Bicycle
The bicycle TTM includes 40 million rows and four columns which are described in the table below.
```{r bike-desc}
caption_bikedesc <- "Bicycle travel time matrix codebook."
desc_bike <- read.csv('output/ttm/lsoa/descrip_bike.csv')
desc_bike %>%
kbl(caption = caption_bikedesc) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
## Public transport
The LSOA/DZ TTM consists of six columns and 265 million rows. The internal structure of the records is displayed in the table below:
```{r pt-desc}
caption_losadesc <- "Public transport LSOA/DZ travel time matrix codebook."
desc_lsoa <- read.csv('output/ttm/lsoa/descrip_pt.csv')
desc_lsoa %>%
kbl(caption = caption_losadesc) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The OA TTM consists of five columns and 4.8 billion rows. The data are organised in a series of `.parquet` files in the `ttm/oa/` directory (available upon request), split into 1,519 consecutive files ranging from `ttm_part-1` to `ttm_part-1519`. Each file contains the travel times from 150 origins to all of their corresponding destinations. The structure of the records is equivalent to the previous table, except that these are estimated for the morning peak only.
# Usage notes
Given the considerable size of the TTMs, we recommend using a database management system (DBMS). The following examples use DuckDB (<https://duckdb.org/>), an in-process (serverless) DBMS designed for analytical query workloads (OLAP) in Structured Query Language (SQL). Creating a local database alleviates memory limitations. DuckDB, implemented in `C++`, can be executed from `R` through the `duckdb` package. We provide a few examples in the sections below.
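A minimal smoke test, assuming the `DBI` and `duckdb` packages are installed, confirms that SQL queries can be run against an in-memory DuckDB instance from `R`:

```{r echo=TRUE, eval=FALSE}
library(DBI)
library(duckdb)

# In-memory database; nothing is written to disk
con <- dbConnect(duckdb::duckdb())
res <- dbGetQuery(con, "SELECT 1 + 1 AS two")
res$two
dbDisconnect(con, shutdown = TRUE)
```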
## Extracting and summarising LSOA/DZ travel time matrix data
As an illustrative example, we perform a basic data extraction and summary, examining the fluctuation of public transport services across the day. We compute a measure that reflects the relative change between nighttime and morning peak service availability. This is calculated as the percentage change between the number of destinations reachable within 150 minutes at night and at the morning peak. The snippet below showcases the workflow in `R`.
```{r echo=TRUE, eval=FALSE}
# Load packages -----------------------------------------------------------
library(DBI)
library(duckdb)
library(dplyr)
library(tidyr)
# Establish connection ----------------------------------------------------
# Define in-memory DuckDB
con <- dbConnect(duckdb::duckdb())
# Extract data from DuckDB ----------------------------------------------
# Query and summarize the data by origin and time of the day using SQL
ttm_summary <- dbGetQuery(con, "
  SELECT from_id,
         time_of_day,
         COUNT(*) AS count,
         MEDIAN(travel_time_p50) AS median_travel_time
    FROM read_parquet('<path>/lsoa/ttm_pt/*.parquet')
   GROUP BY from_id, time_of_day;
")
# Estimate percent change
dest_difference <- ttm_summary %>%
  pivot_wider(
    names_from = time_of_day,
    values_from = c(count, median_travel_time)
  ) %>%
  mutate(
    # NA if the number of destinations is too low
    count_am = if_else(count_am <= 5, NA, count_am),
    # Compute the percentage change
    difference_dest_pct = (count_pm / count_am - 1) * 100
  )
# Close the DB connection
dbDisconnect(con, shutdown = TRUE)
```
In this example, the data is summarized and extracted through an SQL query, and then the result is imported into the session as a data frame. The processing time for the query is approximately one second. Once we have a manageable summary, the rest of the process follows the usual `R` workflow (using `dplyr` verbs in this instance).
The LSOA/DZ boundaries, available on the 'InFuse' platform (<https://infuse.ukdataservice.ac.uk/>) [@OfficeForNationalStatistics2017], can be joined to the results and plotted on a map. This is shown in the figure below.
![Changes in public transport services between night and the morning peak across Great Britain.](plots/map_difference.png)
It should be noted that the number of destinations reachable corresponds to the 25th percentile travel time, as the SQL query disregards the NA values present in other travel time percentiles.
## Extracting and summarising OA travel time matrix data
The workflow for processing the OA TTM is similar to the one shown in the previous section. However, the processing times might be considerably longer due to the larger size of the matrix. The computing time will depend on the complexity of the query and the computing resources used.
In the following example, the first query generates a summary grouped by origin (OA code), counting the number of destinations reachable within different time thresholds at the 50th percentile. The second part simply subsets the TTM for a single origin or for multiple origins.
```{r echo=TRUE, eval=FALSE}
# Packages ----------------------------------------------------------------
library(DBI)
library(duckdb)
library(dplyr)
# Establish a connection to the DB ----------------------------------------
# Define in-memory DuckDB
con <- dbConnect(duckdb::duckdb())
# Summary by origin --------------------------------------------------------
# Count number of destinations available at various timecuts
cum_summary <- dbGetQuery(con, "
SELECT from_id,
COUNT_IF(travel_time_p50 <= 30) AS count_30,
COUNT_IF(travel_time_p50 <= 60) AS count_60,
COUNT_IF(travel_time_p50 <= 120) AS count_120,
COUNT(*) AS count_all
FROM read_parquet('<path>/oa/ttm_oa/*.parquet')
WHERE travel_time_p50 IS NOT NULL
GROUP BY from_id
")
# Extract travel times -----------------------------------------------------
# Extract travel times from a single origin
singleorigin_query <- dbGetQuery(con, "
SELECT *
FROM read_parquet('<path>/oa/ttm_oa/*.parquet')
WHERE from_id = 'E00004187'
")
# Extract travel time from multiple origins
# Define OA codes to extract the data from
oa_code_str <-
"('E00004667','E00176055','E00175644','E00166680','S00116383','E00175585','E00176659','E00169534','E00172508')"
# Pass SQL query
filtered_query <- dbGetQuery(con, sprintf("
SELECT *
FROM read_parquet('<path>/oa/ttm_oa/*.parquet')
WHERE from_id IN %s
", oa_code_str))
# Close the DB connection
dbDisconnect(con, shutdown = TRUE)
```
The execution time for the summary was approximately 17 seconds, while the data extraction for a single origin and multiple origins took less than a second and 6.7 minutes, respectively.
The multiple OA codes used in the example above correspond to nine origins located in central areas of the larger cities in GB, out of the 230,000 estimated at this level. We can combine the OA boundaries from the InFuse dataset with the estimated trip durations and map these as isochrones. This is illustrated in the figure below.
![Public transport isochrones at the output area level for larger cities in Great Britain.](plots/isochrone_maps.png)
# Code availability
All the code used to create the dataset is available at the following public GitHub repository: <https://github.com/urbanbigdatacentre/ttm_greatbritain>.
## System and software details
```
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] arrow_12.0.1    sf_1.0-12       lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0
 [6] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1
[11] ggplot2_3.4.2   tidyverse_2.0.0 r5r_1.0.1

loaded via a namespace (and not attached):
 [1] bit_4.0.5          gtable_0.3.3       compiler_4.3.0     tidyselect_1.2.0   Rcpp_1.0.10
 [6] assertthat_0.2.1   scales_1.2.1       R6_2.5.1           generics_0.1.3     classInt_0.4-9
[11] units_0.8-2        munsell_0.5.0      DBI_1.1.3          pillar_1.9.0       tzdb_0.3.0
[16] rlang_1.1.1        utf8_1.2.3         stringi_1.7.12     bit64_4.0.5        timechange_0.2.0
[21] cli_3.6.1          withr_2.5.0        magrittr_2.0.3     class_7.3-21       grid_4.3.0
[26] rstudioapi_0.14    hms_1.1.3          lifecycle_1.0.3    vctrs_0.6.2        KernSmooth_2.23-20
[31] proxy_0.4-27       glue_1.6.2         data.table_1.14.8  e1071_1.7-13       fansi_1.0.4
[36] colorspace_2.1-0   tools_4.3.0        pkgconfig_2.0.3
```
# Acknowledgements
This work was developed at the [Urban Big Data Centre](https://www.ubdc.ac.uk/) of the University of Glasgow and is supported by the Economic and Social Research Council (ESRC) (Grant No. ES/S007105/1).
# Appendix A. Timetable data comparison{#appedix-comparison}
For the comparison, we constructed two alternative graphs using the information from TNDS and BODS, both of which include regional train services from the Rail Delivery Group.
We selected 1,000 random stops included in the TNDS dataset as origins and another 1,000 as destinations for each region. We then ran equivalent route estimates using the parameters described in the [Method](#method) section, except that the maximum duration was increased to 180 minutes.
The figure below compares the total number of routes generated using the alternative feed sources. The total number of estimated routes is very similar for most regions, with few cases where one source produced more routes than the other. The exception is North East (NE) England, where BODS resulted in almost twice as many routes.
![Comparison of timetable feed sources by the total number of routes generated.](plots/feed_compare_count.png)
The following figure presents a comparison of travel times using the two alternative sources, broken down by region. As observed before, the estimates are very similar to each other, especially for London. The notable exception is North East (NE), where all travel time estimates derived from the TNDS are longer than those generated using the BODS data.
![Comparison of the travel time estimated with alternative timetable feed sources.](plots/feed_compare_tt.png)
In summary, both sources produce consistent estimates. The significant differences observed for the NE are due to data loss during the transformation process from TransXChange to GTFS format. Hence, we decided to utilize the data provided by the BODS as it proved to be a more stable source in our case.
# References