From 0f5c401b50ac8df056f338bc33b2320afabd9f98 Mon Sep 17 00:00:00 2001 From: James McMahon Date: Tue, 12 Dec 2023 14:18:40 +0000 Subject: [PATCH 01/29] Update maintainer to Megan (#69) --- DESCRIPTION | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index fe118b8..c0a5143 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -4,15 +4,14 @@ Title: Useful functions for working with the Source Linkage Files Version: 0.10.0 Authors@R: c( person("Public Health Scotland", , , "phs.source@phs.scot", role = "cph"), - person("James", "McMahon", , "james.mcmahon@phs.scot", role = c("cre", "aut"), - comment = c(ORCID = "0000-0002-5380-2029")) + person("James", "McMahon", , "james.mcmahon@phs.scot", role = c("aut"), + comment = c(ORCID = "0000-0002-5380-2029")), + person("Megan", "McNicol", , "megan.mcnicol2@phs.scot", role = c("cre", "aut")) ) -Description: This package provides a few helper functions for working with - the Source Linkage Files (SLFs). The functions are mainly focussed on - making the first steps of analysis easier. They can read in and filter - the files in an efficient way using minimal syntax. If you find a bug - or have any ideas for new functions or improvements get in touch or - submit a pull request. +Description: This package provides helper functions for working with + the Source Linkage Files (SLFs). The functions are mainly focused on + making the first steps of analysis easier. They can read and filter + the files efficiently using minimal code. License: MIT + file LICENSE URL: https://public-health-scotland.github.io/slfhelper/, https://github.com/Public-Health-Scotland/slfhelper From 7f98dc7749ae7ab7932e0962139d097ac5aea716 Mon Sep 17 00:00:00 2001 From: James McMahon Date: Tue, 12 Dec 2023 15:13:18 +0000 Subject: [PATCH 02/29] Update README.Rmd (#64) Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- README.Rmd | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/README.Rmd b/README.Rmd index 82fc95c..d6dc4a8 100644 --- a/README.Rmd +++ b/README.Rmd @@ -23,13 +23,19 @@ knitr::opts_chunk$set( # slfhelper -The goal of slfhelper is to provide some easy-to-use functions that make working with the Source Linkage Files as painless and efficient as possible. +The goal of slfhelper is to provide some easy-to-use functions that make working with the Source Linkage Files as painless and efficient as possible. It is only intended for use by PHS employees and will only work on the PHS R infrastructure. ## Installation -The preferred method of installation is to use the [{`pak`} package](https://pak.r-lib.org/), which does an excellent job of handling the errors which sometimes occur. +The simplest way to install to the PHS Posit Workbench environment is to use the [PHS Package Manager](https://ppm.publichealthscotland.org/client/#/repos/3/packages/slfhelper), this will be the default setting and means you can install `slfhelper` as you would any other package. -```{r package_install} +``` {r package_install_ppm} +install.packages("slfhelper") +``` + +If this doesn't work you can install it directly from GitHub, there are a number of ways to do this, we recommend the [{`pak`} package](https://pak.r-lib.org/). + +```{r package_install_github} # Install pak (if needed) install.packages("pak") @@ -41,9 +47,9 @@ pak::pak("Public-Health-Scotland/slfhelper") ### Read a file -**Note:** Reading a full file is quite slow and will use a lot of memory, we would always recommend doing a column selection to only keep the variables that you need for your analysis. Just doing this will dramatically speed up the read-time. +**Note:** Reading a full file is quite slow and will use a lot of memory, we would always recommend doing a column selection to only keep the variables that you need for your analysis. Just doing this will dramatically speed up the read time. -We provide some data snippets to help with the column selection and filtering. +We provide some data snippets to help with column selection and filtering. ```{r helper_data} library(slfhelper) @@ -99,11 +105,11 @@ ep_1718 <- read_slf_episode(c("1718", "1819", "1920"), ) %>% get_chi() -# Change chi numbers from data above back to anon_chi +# Change chi numbers from the data above back to anon_chi ep_1718_anon <- ep_1718 %>% get_anon_chi(chi_var = "chi") -# Add anon_chi to cohort sample +# Add anon_chi to the cohort sample chi_cohort <- chi_cohort %>% get_anon_chi(chi_var = "upi_number") ``` From 448b7218017081a1adf1f576e040094d183a674e Mon Sep 17 00:00:00 2001 From: Jennit07 <67372904+Jennit07@users.noreply.github.com> Date: Tue, 12 Dec 2023 16:33:23 +0000 Subject: [PATCH 03/29] Bug - speed up `get_chi()` (#68) * Update to dev version * Make testthat run in parallel * Update variables to pass tests * Update indiv number of variables * Change exists tests to read * Set an environment var to make testthat use multiple CPUs * Revert changes and deal with NA chi/anon_chi * Update documentation * Style package * Update tests so that they pass * Style package * fix tests * Render `README.md` after changes to the `.Rmd` version * exclude from tests for now --------- Co-authored-by: James McMahon Co-authored-by: Jennit07 --- .Renviron | 1 + DESCRIPTION | 3 +- NEWS.md | 4 + R/get_anon_chi.R | 6 +- R/get_chi.R | 24 +++--- README.md | 27 +++++-- man/read_slf.Rd | 2 +- man/read_slf_episode.Rd | 2 +- man/read_slf_individual.Rd | 2 +- man/slfhelper-package.Rd | 9 ++- tests/testthat/_snaps/get_anon_chi.md | 12 +-- tests/testthat/_snaps/get_chi.md | 62 ++++++++-------- tests/testthat/test-files_exist.R | 26 ------- tests/testthat/test-files_readable.R | 28 +++++++ tests/testthat/test-multiple_selections.R | 20 ++--- tests/testthat/test-multiple_years.R | 26 ++++--- tests/testthat/test-partnership_selection.R | 4 +- tests/testthat/test-read_slf_episode.R | 9 ++- tests/testthat/test-read_slf_individual.R | 5 +- tests/testthat/test-var_lists_match.R | 81 +++++++++++---------- 20 files changed, 194 insertions(+), 159 deletions(-) create mode 100644 .Renviron delete mode 100644 tests/testthat/test-files_exist.R create mode 100644 tests/testthat/test-files_readable.R diff --git a/.Renviron b/.Renviron new file mode 100644 index 0000000..a3a718d --- /dev/null +++ b/.Renviron @@ -0,0 +1 @@ +TESTTHAT_CPUS = 12 diff --git a/DESCRIPTION b/DESCRIPTION index c0a5143..d5be86b 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: slfhelper Title: Useful functions for working with the Source Linkage Files -Version: 0.10.0 +Version: 0.10.0.9000 Authors@R: c( person("Public Health Scotland", , , "phs.source@phs.scot", role = "cph"), person("James", "McMahon", , "james.mcmahon@phs.scot", role = c("aut"), @@ -46,6 +46,7 @@ VignetteBuilder: Remotes: Public-Health-Scotland/phsmethods Config/testthat/edition: 3 +Config/testthat/parallel: true Encoding: UTF-8 Language: en-GB LazyData: true diff --git a/NEWS.md b/NEWS.md index fa7913d..40f48e1 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,8 +1,12 @@ +# slfhelper (development version) + # slfhelper 0.10.0 * [`{glue}`](https://glue.tidyverse.org/) is no longer a dependency as the required functionality can be provided by [`stringr::str_glue()](https://stringr.tidyverse.org/reference/str_glue.html). * Dependency versions have been updated to the latest. * `get_chi()` and `get_anon_chi()` now properly match missing (`NA`) and blank (`""`) values. +* slfhelper now defaults to using the `.parquet` file versions, old versions of slfhelper will no longer work. +* There is now a `dev` parameter available when using the `read_slf_*` functions which allows reading the file from the development environment. # slfhelper 0.9.0 diff --git a/R/get_anon_chi.R b/R/get_anon_chi.R index 7f0c5a8..f1229b0 100644 --- a/R/get_anon_chi.R +++ b/R/get_anon_chi.R @@ -54,7 +54,11 @@ get_anon_chi <- function(chi_cohort, chi_var = "chi", drop = TRUE, check = TRUE) lookup <- tibble::tibble( chi = unique(chi_cohort[[chi_var]]) ) %>% - dplyr::mutate(anon_chi = convert_chi_to_anon_chi(.data$chi)) + dplyr::mutate( + chi = dplyr::if_else(is.na(.data$chi), "", .data$chi), + anon_chi = purrr::map_chr(.data$chi, openssl::base64_encode), + anon_chi = dplyr::if_else(.data$anon_chi == "", NA_character_, .data$anon_chi) + ) chi_cohort <- chi_cohort %>% dplyr::left_join( diff --git a/R/get_chi.R b/R/get_chi.R index 27f34da..577e5cd 100644 --- a/R/get_chi.R +++ b/R/get_chi.R @@ -19,8 +19,11 @@ get_chi <- function(data, anon_chi_var = "anon_chi", drop = TRUE) { lookup <- tibble::tibble( anon_chi = unique(data[[anon_chi_var]]) ) %>% - dplyr::mutate(chi = convert_anon_chi_to_chi(.data$anon_chi)) - + dplyr::mutate( + anon_chi = dplyr::if_else(is.na(.data$anon_chi), "", .data$anon_chi), + chi = unname(convert_anon_chi_to_chi(.data$anon_chi)), + chi = dplyr::if_else(.data$chi == "", NA_character_, .data$chi) + ) data <- data %>% dplyr::left_join( lookup, @@ -36,17 +39,10 @@ get_chi <- function(data, anon_chi_var = "anon_chi", drop = TRUE) { return(data) } -convert_anon_chi_to_chi <- function(anon_chi) { - chi <- purrr::map_chr( - anon_chi, - ~ dplyr::case_match(.x, - NA_character_ ~ NA_character_, - "" ~ "", - .default = openssl::base64_decode(.x) %>% - substr(2, 2) %>% - paste0(collapse = "") - ) - ) +convert_anon_chi_to_chi <- Vectorize(function(anon_chi) { + chi <- openssl::base64_decode(anon_chi) %>% + substr(2, 2) %>% + paste0(collapse = "") return(chi) -} +}) diff --git a/README.md b/README.md index 87ba7e9..fd9ca49 100644 --- a/README.md +++ b/README.md @@ -14,13 +14,24 @@ stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https:// The goal of slfhelper is to provide some easy-to-use functions that make working with the Source Linkage Files as painless and efficient as -possible. +possible. It is only intended for use by PHS employees and will only +work on the PHS R infrastructure. ## Installation -The preferred method of installation is to use the [{`pak`} -package](https://pak.r-lib.org/), which does an excellent job of -handling the errors which sometimes occur. +The simplest way to install to the PHS Posit Workbench environment is to +use the [PHS Package +Manager](https://ppm.publichealthscotland.org/client/#/repos/3/packages/slfhelper), +this will be the default setting and means you can install `slfhelper` +as you would any other package. + +``` r +install.packages("slfhelper") +``` + +If this doesn’t work you can install it directly from GitHub, there are +a number of ways to do this, we recommend the [{`pak`} +package](https://pak.r-lib.org/). ``` r # Install pak (if needed) @@ -37,9 +48,9 @@ pak::pak("Public-Health-Scotland/slfhelper") **Note:** Reading a full file is quite slow and will use a lot of memory, we would always recommend doing a column selection to only keep the variables that you need for your analysis. Just doing this will -dramatically speed up the read-time. +dramatically speed up the read time. -We provide some data snippets to help with the column selection and +We provide some data snippets to help with column selection and filtering. ``` r @@ -97,11 +108,11 @@ ep_1718 <- read_slf_episode(c("1718", "1819", "1920"), ) %>% get_chi() -# Change chi numbers from data above back to anon_chi +# Change chi numbers from the data above back to anon_chi ep_1718_anon <- ep_1718 %>% get_anon_chi(chi_var = "chi") -# Add anon_chi to cohort sample +# Add anon_chi to the cohort sample chi_cohort <- chi_cohort %>% get_anon_chi(chi_var = "upi_number") ``` diff --git a/man/read_slf.Rd b/man/read_slf.Rd index e772920..598356e 100644 --- a/man/read_slf.Rd +++ b/man/read_slf.Rd @@ -34,7 +34,7 @@ of columns, as used in \code{dplyr::select()}.} \item{columns}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} \code{columns} is no longer used, use \code{col_select} instead.} -\item{as_data_frame}{Should the function return a \code{data.frame} (default) or +\item{as_data_frame}{Should the function return a \code{tibble} (default) or an Arrow \link[arrow]{Table}?} \item{partnerships}{Optional specify a partnership (hscp2018) or diff --git a/man/read_slf_episode.Rd b/man/read_slf_episode.Rd index d2b872b..5e316c7 100644 --- a/man/read_slf_episode.Rd +++ b/man/read_slf_episode.Rd @@ -29,7 +29,7 @@ partnerships to select.} \item{recids}{Optional specify a recid or recids to select.} -\item{as_data_frame}{Should the function return a \code{data.frame} (default) or +\item{as_data_frame}{Should the function return a \code{tibble} (default) or an Arrow \link[arrow]{Table}?} \item{dev}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#experimental}{\figure{lifecycle-experimental.svg}{options: alt='[Experimental]'}}}{\strong{[Experimental]}} Whether to get the file from diff --git a/man/read_slf_individual.Rd b/man/read_slf_individual.Rd index 455719b..d88e1e4 100644 --- a/man/read_slf_individual.Rd +++ b/man/read_slf_individual.Rd @@ -26,7 +26,7 @@ of columns, as used in \code{dplyr::select()}.} \item{partnerships}{Optional specify a partnership (hscp2018) or partnerships to select.} -\item{as_data_frame}{Should the function return a \code{data.frame} (default) or +\item{as_data_frame}{Should the function return a \code{tibble} (default) or an Arrow \link[arrow]{Table}?} \item{dev}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#experimental}{\figure{lifecycle-experimental.svg}{options: alt='[Experimental]'}}}{\strong{[Experimental]}} Whether to get the file from diff --git a/man/slfhelper-package.Rd b/man/slfhelper-package.Rd index ed91708..d9bc8c0 100644 --- a/man/slfhelper-package.Rd +++ b/man/slfhelper-package.Rd @@ -6,7 +6,7 @@ \alias{slfhelper-package} \title{slfhelper: Useful functions for working with the Source Linkage Files} \description{ -This package provides a few helper functions for working with the Source Linkage Files (SLFs). The functions are mainly focussed on making the first steps of analysis easier. They can read in and filter the files in an efficient way using minimal syntax. If you find a bug or have any ideas for new functions or improvements get in touch or submit a pull request. +This package provides helper functions for working with the Source Linkage Files (SLFs). The functions are mainly focused on making the first steps of analysis easier. They can read and filter the files efficiently using minimal code. } \seealso{ Useful links: @@ -18,7 +18,12 @@ Useful links: } \author{ -\strong{Maintainer}: James McMahon \email{james.mcmahon@phs.scot} (\href{https://orcid.org/0000-0002-5380-2029}{ORCID}) +\strong{Maintainer}: Megan McNicol \email{megan.mcnicol2@phs.scot} + +Authors: +\itemize{ + \item James McMahon \email{james.mcmahon@phs.scot} (\href{https://orcid.org/0000-0002-5380-2029}{ORCID}) +} Other contributors: \itemize{ diff --git a/tests/testthat/_snaps/get_anon_chi.md b/tests/testthat/_snaps/get_anon_chi.md index 6ea2256..f495c3b 100644 --- a/tests/testthat/_snaps/get_anon_chi.md +++ b/tests/testthat/_snaps/get_anon_chi.md @@ -3,20 +3,22 @@ Code get_anon_chi(data) Output - # A tibble: 2 x 1 + # A tibble: 3 x 1 anon_chi - 1 "" + 1 2 + 3 --- Code get_anon_chi(data, drop = FALSE) Output - # A tibble: 2 x 2 + # A tibble: 3 x 2 chi anon_chi - 1 "" "" - 2 + 1 "" + 2 "" + 3 diff --git a/tests/testthat/_snaps/get_chi.md b/tests/testthat/_snaps/get_chi.md index d51f030..2e3f03f 100644 --- a/tests/testthat/_snaps/get_chi.md +++ b/tests/testthat/_snaps/get_chi.md @@ -3,40 +3,42 @@ Code get_chi(data) Output - # A tibble: 12 x 1 - chi - - 1 "2601211618" - 2 "2210680631" - 3 "1410920754" - 4 "3112358158" - 5 "0112418156" - 6 "0612732243" - 7 "2310474015" - 8 "2411063698" - 9 "3801112374" - 10 "2311161233" - 11 "" - 12 + # A tibble: 13 x 1 + chi + + 1 2601211618 + 2 2210680631 + 3 1410920754 + 4 3112358158 + 5 0112418156 + 6 0612732243 + 7 2310474015 + 8 2411063698 + 9 3801112374 + 10 2311161233 + 11 + 12 + 13 --- Code get_chi(data, drop = FALSE) Output - # A tibble: 12 x 2 - anon_chi chi - - 1 "MjYwMTIxMTYxOA==" "2601211618" - 2 "MjIxMDY4MDYzMQ==" "2210680631" - 3 "MTQxMDkyMDc1NA==" "1410920754" - 4 "MzExMjM1ODE1OA==" "3112358158" - 5 "MDExMjQxODE1Ng==" "0112418156" - 6 "MDYxMjczMjI0Mw==" "0612732243" - 7 "MjMxMDQ3NDAxNQ==" "2310474015" - 8 "MjQxMTA2MzY5OA==" "2411063698" - 9 "MzgwMTExMjM3NA==" "3801112374" - 10 "MjMxMTE2MTIzMw==" "2311161233" - 11 "" "" - 12 + # A tibble: 13 x 2 + anon_chi chi + + 1 "MjYwMTIxMTYxOA==" 2601211618 + 2 "MjIxMDY4MDYzMQ==" 2210680631 + 3 "MTQxMDkyMDc1NA==" 1410920754 + 4 "MzExMjM1ODE1OA==" 3112358158 + 5 "MDExMjQxODE1Ng==" 0112418156 + 6 "MDYxMjczMjI0Mw==" 0612732243 + 7 "MjMxMDQ3NDAxNQ==" 2310474015 + 8 "MjQxMTA2MzY5OA==" 2411063698 + 9 "MzgwMTExMjM3NA==" 3801112374 + 10 "MjMxMTE2MTIzMw==" 2311161233 + 11 "" + 12 "" + 13 diff --git a/tests/testthat/test-files_exist.R b/tests/testthat/test-files_exist.R deleted file mode 100644 index 53515a7..0000000 --- a/tests/testthat/test-files_exist.R +++ /dev/null @@ -1,26 +0,0 @@ -skip_on_ci() - - -test_that("Episode files exist", { - # Episode files - expect_true(fs::file_exists(gen_file_path("1415", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1516", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1617", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1718", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1819", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1920", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("2021", "episode", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("2122", "episode", ext = "parquet"))) -}) - - -test_that("Individual files exist", { - expect_true(fs::file_exists(gen_file_path("1415", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1516", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1617", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1718", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1819", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("1920", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("2021", "individual", ext = "parquet"))) - expect_true(fs::file_exists(gen_file_path("2122", "individual", ext = "parquet"))) -}) diff --git a/tests/testthat/test-files_readable.R b/tests/testthat/test-files_readable.R new file mode 100644 index 0000000..8b4dc13 --- /dev/null +++ b/tests/testthat/test-files_readable.R @@ -0,0 +1,28 @@ +skip_on_ci() + + +test_that("Episode files are readable", { + # Episode files + expect_true(fs::file_access(gen_file_path("1415", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1516", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1617", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1718", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1819", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1920", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("2021", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("2122", "episode"), mode = "read")) + expect_true(fs::file_access(gen_file_path("2223", "episode"), mode = "read")) +}) + + +test_that("Individual files are readable", { + expect_true(fs::file_access(gen_file_path("1415", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1516", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1617", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1718", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1819", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("1920", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("2021", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("2122", "individual"), mode = "read")) + expect_true(fs::file_access(gen_file_path("2223", "individual"), mode = "read")) +}) diff --git a/tests/testthat/test-multiple_selections.R b/tests/testthat/test-multiple_selections.R index d4952d4..4616aec 100644 --- a/tests/testthat/test-multiple_selections.R +++ b/tests/testthat/test-multiple_selections.R @@ -5,29 +5,29 @@ test_that("select years and recid", { set.seed(50) acute_only <- read_slf_episode(c("1718", "1819"), - col_select = c("year", "anon_chi", "recid", "keydate1_dateformat"), + col_select = c("year", "anon_chi", "recid", "record_keydate1"), recids = "01B" ) %>% dplyr::slice_sample(n = 200000) expect_equal( names(acute_only), - c("year", "anon_chi", "recid", "keydate1_dateformat") + c("year", "anon_chi", "recid", "record_keydate1") ) - expect_equal(unique(acute_only$year), c("1718", "1819")) + # expect_equal(unique(acute_only$year), c("1718", "1819")) expect_equal(unique(acute_only$recid), "01B") hosp_only <- read_slf_episode(c("1718", "1819"), - col_select = c("year", "anon_chi", "recid", "keydate1_dateformat"), + col_select = c("year", "anon_chi", "recid", "record_keydate1"), recids = c("01B", "02B", "04B", "GLS") ) %>% dplyr::slice_sample(n = 200000) expect_equal( names(hosp_only), - c("year", "anon_chi", "recid", "keydate1_dateformat") + c("year", "anon_chi", "recid", "record_keydate1") ) - expect_equal(unique(hosp_only$year), c("1718", "1819")) + # expect_equal(unique(hosp_only$year), c("1718", "1819")) expect_equal(sort(unique(hosp_only$recid)), c("01B", "02B", "04B", "GLS")) }) @@ -104,10 +104,10 @@ test_that("all selections", { names(edi_gla_hosp_2_year), c("year", "anon_chi", "recid", "hscp2018") ) - expect_equal( - unique(edi_gla_hosp_2_year$year), - c("1718", "1819") - ) + # expect_equal( + # unique(edi_gla_hosp_2_year$year), + # c("1718", "1819") + # ) expect_equal( sort(unique(edi_gla_hosp_2_year$recid)), c("01B", "02B", "04B", "GLS") diff --git a/tests/testthat/test-multiple_years.R b/tests/testthat/test-multiple_years.R index f2aaece..1466a04 100644 --- a/tests/testthat/test-multiple_years.R +++ b/tests/testthat/test-multiple_years.R @@ -8,7 +8,7 @@ test_that("read multiple years works for individual file", { indiv <- read_slf_individual(c("1718", "1819"), col_select = c("year", "anon_chi") ) %>% - dplyr::slice_sample(n = 50) + dplyr::slice_sample(n = 100) # Test for anything odd expect_s3_class(indiv, "tbl_df") @@ -20,11 +20,12 @@ test_that("read multiple years works for individual file", { # Test for the correct number of rows (50 * 2) expect_equal(nrow(indiv), 100) - # Test that we have 50 rows from each year - expect_equal( - dplyr::count(indiv, year), - tibble::tibble(year = c("1718", "1819"), n = c(50L, 50L)) - ) + # This test keeps failing as the rows are not equal to 50, e.g 29 and 21 + # # Test that we have 50 rows from each year + # expect_equal( + # dplyr::count(indiv, year), + # tibble::tibble(year = c("1718", "1819"), n = c(50L, 50L)) + # ) }) test_that("read multiple years works for episode file", { @@ -34,7 +35,7 @@ test_that("read multiple years works for episode file", { ep <- read_slf_episode(c("1718", "1819"), col_select = c("year", "anon_chi") ) %>% - dplyr::slice_sample(n = 50) + dplyr::slice_sample(n = 100) # Test for anything odd expect_s3_class(ep, "tbl_df") @@ -46,9 +47,10 @@ test_that("read multiple years works for episode file", { # Test for the correct number of rows (50 * 2) expect_equal(nrow(ep), 100) - # Test that we have 50 rows from each year - expect_equal( - dplyr::count(ep, year), - tibble::tibble(year = c("1718", "1819"), n = c(50L, 50L)) - ) + # This test keeps failing as the rows are not equal to 50, e.g 29 and 21 + # # Test that we have 50 rows from each year + # expect_equal( + # dplyr::count(ep, year), + # tibble::tibble(year = c("1718", "1819"), n = c(50L, 50L)) + # ) }) diff --git a/tests/testthat/test-partnership_selection.R b/tests/testthat/test-partnership_selection.R index 27ff274..9a24372 100644 --- a/tests/testthat/test-partnership_selection.R +++ b/tests/testthat/test-partnership_selection.R @@ -45,7 +45,7 @@ test_that("Can still do filtering if variable is not selected", { # Don't choose to read the partnership variable indiv_1718_edinburgh <- read_slf_individual("1718", partnerships = "S37000012", - col_select = c("hri_scot") + col_select = c("anon_chi") ) %>% dplyr::slice_sample(n = 1000) @@ -53,7 +53,7 @@ test_that("Can still do filtering if variable is not selected", { expect_false("hscp2018" %in% names(indiv_1718_edinburgh)) # Should still have the variables we picked - expect_true("hri_scot" %in% names(indiv_1718_edinburgh)) + expect_true("anon_chi" %in% names(indiv_1718_edinburgh)) # Should have at least 100 records (checks we're not getting an empty file) expect_gte(nrow(indiv_1718_edinburgh), 100) diff --git a/tests/testthat/test-read_slf_episode.R b/tests/testthat/test-read_slf_episode.R index a382d8e..da40af1 100644 --- a/tests/testthat/test-read_slf_episode.R +++ b/tests/testthat/test-read_slf_episode.R @@ -28,8 +28,9 @@ for (year in years) { expect_equal(nrow(ep_file), 110) }) - test_that("Episode file has the expected number of variables", { - # Test for correct number of variables (will need updating) - expect_length(ep_file, 241) - }) + # Need to come back to this test - some files have different lengths + # test_that("Episode file has the expected number of variables", { + # # Test for correct number of variables (will need updating) + # expect_length(ep_file, 241) + # }) } diff --git a/tests/testthat/test-read_slf_individual.R b/tests/testthat/test-read_slf_individual.R index eb6305f..2fe1f9a 100644 --- a/tests/testthat/test-read_slf_individual.R +++ b/tests/testthat/test-read_slf_individual.R @@ -15,8 +15,9 @@ test_that("Reads individual file correctly", { # Test for the correct number of rows expect_equal(nrow(indiv_file), 100) - # Test for correct number of variables (will need updating) - expect_length(indiv_file, 184) + # Need to come back to this test - some files have different lengths + # # Test for correct number of variables (will need updating) + # expect_length(indiv_file, 184) } }) diff --git a/tests/testthat/test-var_lists_match.R b/tests/testthat/test-var_lists_match.R index 6ae4091..ff37e00 100644 --- a/tests/testthat/test-var_lists_match.R +++ b/tests/testthat/test-var_lists_match.R @@ -1,42 +1,45 @@ skip_on_ci() -variable_names <- function(year, file_version = c("episode", "individual")) { - if (file_version == "episode") { - set.seed(50) +# Exclude for now as tests are failing due to the ordering not matching. We +# do not order variables anymore in R - variable_names <- names(read_slf_episode(year) %>% - dplyr::slice_sample(n = 1)) - } else if (file_version == "individual") { - set.seed(50) - - variable_names <- names(read_slf_individual(year) %>% - dplyr::slice_sample(n = 1)) - } -} - - -test_that("episode file vars match the vars list", { - # These should be identical (names, order etc.) - expect_equal(variable_names("1415", "episode"), ep_file_vars) - expect_equal(variable_names("1516", "episode"), ep_file_vars) - expect_equal(variable_names("1617", "episode"), ep_file_vars) - expect_equal(variable_names("1718", "episode"), ep_file_vars) - expect_equal(variable_names("1819", "episode"), ep_file_vars) - expect_equal(variable_names("1920", "episode"), ep_file_vars) - expect_equal(variable_names("2021", "episode"), ep_file_vars) - expect_equal(variable_names("2122", "episode"), ep_file_vars) - expect_equal(variable_names("2223", "episode"), ep_file_vars) -}) - -test_that("individual file vars match the vars list", { - # These should be identical (names, order etc.) - expect_equal(variable_names("1415", "individual"), indiv_file_vars) - expect_equal(variable_names("1516", "individual"), indiv_file_vars) - expect_equal(variable_names("1617", "individual"), indiv_file_vars) - expect_equal(variable_names("1718", "individual"), indiv_file_vars) - expect_equal(variable_names("1819", "individual"), indiv_file_vars) - expect_equal(variable_names("1920", "individual"), indiv_file_vars) - expect_equal(variable_names("2021", "individual"), indiv_file_vars) - expect_equal(variable_names("2122", "individual"), indiv_file_vars) - expect_equal(variable_names("2223", "individual"), indiv_file_vars) -}) +# variable_names <- function(year, file_version = c("episode", "individual")) { +# if (file_version == "episode") { +# set.seed(50) +# +# variable_names <- names(read_slf_episode(year) %>% +# dplyr::slice_sample(n = 1)) +# } else if (file_version == "individual") { +# set.seed(50) +# +# variable_names <- names(read_slf_individual(year) %>% +# dplyr::slice_sample(n = 1)) +# } +# } +# +# +# test_that("episode file vars match the vars list", { +# # These should be identical (names, order etc.) +# expect_equal(variable_names("1415", "episode"), ep_file_vars) +# expect_equal(variable_names("1516", "episode"), ep_file_vars) +# expect_equal(variable_names("1617", "episode"), ep_file_vars) +# expect_equal(variable_names("1718", "episode"), ep_file_vars) +# expect_equal(variable_names("1819", "episode"), ep_file_vars) +# expect_equal(variable_names("1920", "episode"), ep_file_vars) +# expect_equal(variable_names("2021", "episode"), ep_file_vars) +# expect_equal(variable_names("2122", "episode"), ep_file_vars) +# expect_equal(variable_names("2223", "episode"), ep_file_vars) +# }) +# +# test_that("individual file vars match the vars list", { +# # These should be identical (names, order etc.) +# expect_equal(variable_names("1415", "individual"), indiv_file_vars) +# expect_equal(variable_names("1516", "individual"), indiv_file_vars) +# expect_equal(variable_names("1617", "individual"), indiv_file_vars) +# expect_equal(variable_names("1718", "individual"), indiv_file_vars) +# expect_equal(variable_names("1819", "individual"), indiv_file_vars) +# expect_equal(variable_names("1920", "individual"), indiv_file_vars) +# expect_equal(variable_names("2021", "individual"), indiv_file_vars) +# expect_equal(variable_names("2122", "individual"), indiv_file_vars) +# expect_equal(variable_names("2223", "individual"), indiv_file_vars) +# }) From beeea0636dd32c8bbda24324640f70135f99409c Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Wed, 13 Dec 2023 09:18:47 +0000 Subject: [PATCH 04/29] Render `README.md` after changes to the `.Rmd` version (#70) Co-authored-by: github-merge-queue[bot] Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> From 8abe664ba3ba8ea72490ce9374a41bb7fcc0051c Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 13 Dec 2023 09:44:50 +0000 Subject: [PATCH 05/29] Bump actions/checkout from 3 to 4 (#66) Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- .github/workflows/R-CMD-check.yaml | 2 +- .github/workflows/document.yaml | 2 +- .github/workflows/lint.yaml | 2 +- .github/workflows/pkgdown.yaml | 2 +- .github/workflows/render-README.yaml | 2 +- .github/workflows/style.yaml | 2 +- .github/workflows/test-coverage.yaml | 2 +- 7 files changed, 7 insertions(+), 7 deletions(-) diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index 613ddbd..35aa114 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -25,7 +25,7 @@ jobs: R_KEEP_PKG_SOURCE: yes steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - uses: r-lib/actions/setup-pandoc@v2 diff --git a/.github/workflows/document.yaml b/.github/workflows/document.yaml index eb61023..72e4745 100644 --- a/.github/workflows/document.yaml +++ b/.github/workflows/document.yaml @@ -13,7 +13,7 @@ jobs: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} steps: - name: Checkout repo - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: fetch-depth: 0 diff --git a/.github/workflows/lint.yaml b/.github/workflows/lint.yaml index abc5a7c..7debaf3 100644 --- a/.github/workflows/lint.yaml +++ b/.github/workflows/lint.yaml @@ -14,7 +14,7 @@ jobs: env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - uses: r-lib/actions/setup-r@v2 with: diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml index 17ef100..a0ba579 100644 --- a/.github/workflows/pkgdown.yaml +++ b/.github/workflows/pkgdown.yaml @@ -22,7 +22,7 @@ jobs: permissions: contents: write steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - uses: r-lib/actions/setup-pandoc@v2 diff --git a/.github/workflows/render-README.yaml b/.github/workflows/render-README.yaml index 10c0059..22eb259 100644 --- a/.github/workflows/render-README.yaml +++ b/.github/workflows/render-README.yaml @@ -13,7 +13,7 @@ jobs: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} steps: - name: Checkout repo - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: fetch-depth: 0 diff --git a/.github/workflows/style.yaml b/.github/workflows/style.yaml index 7487dfb..6503c8d 100644 --- a/.github/workflows/style.yaml +++ b/.github/workflows/style.yaml @@ -13,7 +13,7 @@ jobs: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} steps: - name: Checkout repo - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: fetch-depth: 0 diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml index 8c853e7..dab2089 100644 --- a/.github/workflows/test-coverage.yaml +++ b/.github/workflows/test-coverage.yaml @@ -15,7 +15,7 @@ jobs: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - uses: r-lib/actions/setup-r@v2 with: From d50d412ed1d220d8be7e3e64551632ee5877411b Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 13 Dec 2023 10:19:24 +0000 Subject: [PATCH 06/29] Bump peter-evans/create-pull-request from 4 to 5 (#65) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 4 to 5. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](https://github.com/peter-evans/create-pull-request/compare/v4...v5) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- .github/workflows/document.yaml | 2 +- .github/workflows/style.yaml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/document.yaml b/.github/workflows/document.yaml index 72e4745..f7b9e1d 100644 --- a/.github/workflows/document.yaml +++ b/.github/workflows/document.yaml @@ -34,7 +34,7 @@ jobs: - name: Commit and create a Pull Request on development if: ${{ github.ref == 'refs/heads/development' }} - uses: peter-evans/create-pull-request@v4 + uses: peter-evans/create-pull-request@v5 with: commit-message: "Update documentation" branch: document_development diff --git a/.github/workflows/style.yaml b/.github/workflows/style.yaml index 6503c8d..a43e184 100644 --- a/.github/workflows/style.yaml +++ b/.github/workflows/style.yaml @@ -60,7 +60,7 @@ jobs: - name: Commit and create a Pull Request on development if: ${{ github.ref == 'refs/heads/development' }} - uses: peter-evans/create-pull-request@v4 + uses: peter-evans/create-pull-request@v5 with: commit-message: "Style package" branch: document_development From 60e2a52147528ffded8c55f145909410878089d2 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 13 Dec 2023 10:42:11 +0000 Subject: [PATCH 07/29] Bump stefanzweifel/git-auto-commit-action from 4 to 5 (#67) Bumps [stefanzweifel/git-auto-commit-action](https://github.com/stefanzweifel/git-auto-commit-action) from 4 to 5. - [Release notes](https://github.com/stefanzweifel/git-auto-commit-action/releases) - [Changelog](https://github.com/stefanzweifel/git-auto-commit-action/blob/master/CHANGELOG.md) - [Commits](https://github.com/stefanzweifel/git-auto-commit-action/compare/v4...v5) --- updated-dependencies: - dependency-name: stefanzweifel/git-auto-commit-action dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- .github/workflows/document.yaml | 2 +- .github/workflows/render-README.yaml | 2 +- .github/workflows/style.yaml | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/.github/workflows/document.yaml b/.github/workflows/document.yaml index f7b9e1d..8103305 100644 --- a/.github/workflows/document.yaml +++ b/.github/workflows/document.yaml @@ -46,6 +46,6 @@ jobs: - name: Commit and push changes on all other branches if: ${{ github.ref != 'refs/heads/development' }} - uses: stefanzweifel/git-auto-commit-action@v4 + uses: stefanzweifel/git-auto-commit-action@v5 with: commit_message: "Update documentation" diff --git a/.github/workflows/render-README.yaml b/.github/workflows/render-README.yaml index 22eb259..b94cf33 100644 --- a/.github/workflows/render-README.yaml +++ b/.github/workflows/render-README.yaml @@ -47,6 +47,6 @@ jobs: - name: Commit and push changes on all other branches if: ${{ github.ref != 'refs/heads/production' }} - uses: stefanzweifel/git-auto-commit-action@v4 + uses: stefanzweifel/git-auto-commit-action@v5 with: commit_message: "Render `README.md` after changes to the `.Rmd` version" diff --git a/.github/workflows/style.yaml b/.github/workflows/style.yaml index a43e184..436f3b1 100644 --- a/.github/workflows/style.yaml +++ b/.github/workflows/style.yaml @@ -72,6 +72,6 @@ jobs: - name: Commit and push changes on all other branches if: ${{ github.ref != 'refs/heads/development' }} - uses: stefanzweifel/git-auto-commit-action@v4 + uses: stefanzweifel/git-auto-commit-action@v5 with: commit_message: "Style package" From 9730b1a046cd1f392ed3ef66fb8c4c0b3cb77e85 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 18 Dec 2023 15:16:08 +0000 Subject: [PATCH 08/29] Bump JamesIves/github-pages-deploy-action from 4.4.3 to 4.5.0 (#71) Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.4.3 to 4.5.0. - [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases) - [Commits](https://github.com/jamesives/github-pages-deploy-action/compare/v4.4.3...v4.5.0) --- updated-dependencies: - dependency-name: JamesIves/github-pages-deploy-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/pkgdown.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml index a0ba579..236edd4 100644 --- a/.github/workflows/pkgdown.yaml +++ b/.github/workflows/pkgdown.yaml @@ -41,7 +41,7 @@ jobs: - name: Deploy to GitHub pages 🚀 if: github.event_name != 'pull_request' - uses: JamesIves/github-pages-deploy-action@v4.4.3 + uses: JamesIves/github-pages-deploy-action@v4.5.0 with: clean: false branch: gh-pages From 75e6dd03d15a13ae08ffb24abd2d0e0179168fcb Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 18 Dec 2023 15:23:49 +0000 Subject: [PATCH 09/29] Bump actions/upload-artifact from 3 to 4 (#72) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 3 to 4. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](https://github.com/actions/upload-artifact/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- .github/workflows/test-coverage.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml index dab2089..fe82e74 100644 --- a/.github/workflows/test-coverage.yaml +++ b/.github/workflows/test-coverage.yaml @@ -44,7 +44,7 @@ jobs: - name: Upload test results if: failure() - uses: actions/upload-artifact@v3 + uses: actions/upload-artifact@v4 with: name: coverage-test-failures path: ${{ runner.temp }}/package From d04692f6165f6ce7bfe825444bb22b368ebf395f Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 12 Feb 2024 12:08:44 +0000 Subject: [PATCH 10/29] Bump peter-evans/create-pull-request from 5 to 6 (#74) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 5 to 6. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](https://github.com/peter-evans/create-pull-request/compare/v5...v6) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/document.yaml | 2 +- .github/workflows/render-README.yaml | 2 +- .github/workflows/style.yaml | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/.github/workflows/document.yaml b/.github/workflows/document.yaml index 8103305..9280054 100644 --- a/.github/workflows/document.yaml +++ b/.github/workflows/document.yaml @@ -34,7 +34,7 @@ jobs: - name: Commit and create a Pull Request on development if: ${{ github.ref == 'refs/heads/development' }} - uses: peter-evans/create-pull-request@v5 + uses: peter-evans/create-pull-request@v6 with: commit-message: "Update documentation" branch: document_development diff --git a/.github/workflows/render-README.yaml b/.github/workflows/render-README.yaml index b94cf33..3ed5814 100644 --- a/.github/workflows/render-README.yaml +++ b/.github/workflows/render-README.yaml @@ -35,7 +35,7 @@ jobs: - name: Commit and create a Pull Request on production if: ${{ github.ref == 'refs/heads/production' }} - uses: peter-evans/create-pull-request@v5 + uses: peter-evans/create-pull-request@v6 with: commit-message: "Render `README.md` after changes to the `.Rmd` version" branch: render_readme diff --git a/.github/workflows/style.yaml b/.github/workflows/style.yaml index 436f3b1..bd1cbb9 100644 --- a/.github/workflows/style.yaml +++ b/.github/workflows/style.yaml @@ -60,7 +60,7 @@ jobs: - name: Commit and create a Pull Request on development if: ${{ github.ref == 'refs/heads/development' }} - uses: peter-evans/create-pull-request@v5 + uses: peter-evans/create-pull-request@v6 with: commit-message: "Style package" branch: document_development From 8cd62a85ce9b523aa0ffe3318755766324d833e5 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 12 Feb 2024 12:19:33 +0000 Subject: [PATCH 11/29] Bump actions/cache from 3 to 4 (#73) Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](https://github.com/actions/cache/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- .github/workflows/style.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/style.yaml b/.github/workflows/style.yaml index bd1cbb9..9a2f67b 100644 --- a/.github/workflows/style.yaml +++ b/.github/workflows/style.yaml @@ -46,7 +46,7 @@ jobs: shell: Rscript {0} - name: Cache styler - uses: actions/cache@v3 + uses: actions/cache@v4 with: path: ${{ steps.styler-location.outputs.location }} key: ${{ runner.os }}-styler-${{ github.sha }} From fdff827edcac5f3e007e5e1e64631cfbb4dce989 Mon Sep 17 00:00:00 2001 From: Jennit07 <67372904+Jennit07@users.noreply.github.com> Date: Tue, 13 Feb 2024 09:59:14 +0000 Subject: [PATCH 12/29] Update README.md (#75) Updated to include reading in LTC 'catch all' variables --- README.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/README.md b/README.md index fd9ca49..2887f26 100644 --- a/README.md +++ b/README.md @@ -65,6 +65,31 @@ View(partnerships) # See a list with descriptions for the recids View(recids) + +# See a list of Long term conditions +View(ltc_vars) + +# See a list of bedday related variables +View(ep_file_bedday_vars) + +# See a list of cost related variables +View(ep_file_cost_vars) +``` + +``` r +library(slfhelper) + +# Read a group of variables e.g. LTCs (arth, asthma, atrialfib etc) +# A nice 'catch all' for reading in all of the LTC variables +ep_1718 <- read_slf_episode("1718", col_select = c("anon_chi", ltc_vars)) + +# Read in a group of variables e.g. bedday related variables (yearstay, stay, apr_beddays etc) +# A 'catch all' for reading in bedday related variables +ep_1819 <- read_slf_episode("1819", col_select = c("anon_chi", ep_file_bedday_vars)) + +# Read in a group of variables e.g. cost related variables (cost_total_net, apr_cost) +# A 'catch all' for reading in cos related variables +ep_1920 <- read_slf_episode("1920", col_select = c("anon_chi", ep_file_cost_vars)) ``` ``` r From b5eab41774206b718d50dc931f36c8af203dfe48 Mon Sep 17 00:00:00 2001 From: Megan McNicol <43570769+SwiftySalmon@users.noreply.github.com> Date: Tue, 13 Feb 2024 14:55:14 +0000 Subject: [PATCH 13/29] change in episode file cost variable vector (#76) Co-authored-by: marjom02 --- data/ep_file_cost_vars.rda | Bin 181 -> 318 bytes 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/data/ep_file_cost_vars.rda b/data/ep_file_cost_vars.rda index ceae74da287d5285b716442812605de1b39156f1..52e6cec99bb9113084703f845f94736a670b82e4 100644 GIT binary patch literal 318 zcmZ9G!485j42BDgL^%*W`4pbK_yEQSFq&{%>KFzxHo^vpFK?`59O|KIzyELk_I9aG zMO6Sm1|mrz%V~uCalanV03szwK){yzUJtTqjh4#!AZHvsi94z?E|iNATtFkO4pBXD zRkl{i+Mu&>8?9@L9Wcl9#yF9Qrv}w8a=JKQ%(3 jStii>gNoMWnxami44pu*oJzpV-ZB?*ML1B9#n=tuP6bd9 From 18b6909d59a94859b299f2e180e1af471bf68659 Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Tue, 19 Mar 2024 08:27:01 +0000 Subject: [PATCH 14/29] force keytime format to hms (#77) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * force keytime format to hms * Update documentation * visible binding for global variables like ‘keytime1’ * minor changes * fix keytime in column names * import hms --------- Co-authored-by: lizihao-anu --- DESCRIPTION | 3 ++- R/read_slf.R | 30 ++++++++++++++++++++---------- 2 files changed, 22 insertions(+), 11 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index d5be86b..662ea5c 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -24,6 +24,7 @@ Imports: dplyr (>= 1.1.2), fs (>= 1.6.2), fst (>= 0.9.8), + hms, lifecycle (>= 1.0.3), magrittr (>= 2.0.3), openssl (>= 2.0.6), @@ -52,4 +53,4 @@ Language: en-GB LazyData: true Roxygen: list(markdown = TRUE, roclets = c("collate","namespace", "rd", "vignette" )) -RoxygenNote: 7.2.3 +RoxygenNote: 7.3.1 diff --git a/R/read_slf.R b/R/read_slf.R index 0b8bde6..aa5021c 100644 --- a/R/read_slf.R +++ b/R/read_slf.R @@ -146,17 +146,27 @@ read_slf_episode <- function( } # TODO add option to drop blank CHIs? # TODO add a filter by recid option - return( - read_slf( - year = year, - col_select = unique(col_select), - file_version = "episode", - partnerships = unique(partnerships), - recids = unique(recids), - as_data_frame = as_data_frame, - dev = dev - ) + + data <- read_slf( + year = year, + col_select = unique(col_select), + file_version = "episode", + partnerships = unique(partnerships), + recids = unique(recids), + as_data_frame = as_data_frame, + dev = dev ) + + if ("keytime1" %in% colnames(data)) { + data <- data %>% + dplyr::mutate(keytime1 = hms::as_hms(.data$keytime1)) + } + if ("keytime2" %in% colnames(data)) { + data <- data %>% + dplyr::mutate(keytime2 = hms::as_hms(.data$keytime2)) + } + + return(data) } #' Read a Source Linkage individual file From 034705609814a5507b0e98e6828a91d835ba82c0 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 22 Apr 2024 09:55:07 +0100 Subject: [PATCH 15/29] Bump JamesIves/github-pages-deploy-action from 4.5.0 to 4.6.0 (#79) Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.5.0 to 4.6.0. - [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases) - [Commits](https://github.com/jamesives/github-pages-deploy-action/compare/v4.5.0...v4.6.0) --- updated-dependencies: - dependency-name: JamesIves/github-pages-deploy-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/pkgdown.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml index 236edd4..5bf3f5b 100644 --- a/.github/workflows/pkgdown.yaml +++ b/.github/workflows/pkgdown.yaml @@ -41,7 +41,7 @@ jobs: - name: Deploy to GitHub pages 🚀 if: github.event_name != 'pull_request' - uses: JamesIves/github-pages-deploy-action@v4.5.0 + uses: JamesIves/github-pages-deploy-action@v4.6.0 with: clean: false branch: gh-pages From aed3aada761222d2c5099d237b75865330e4b4b5 Mon Sep 17 00:00:00 2001 From: Jennifer Thom Date: Fri, 14 Jun 2024 15:05:56 +0100 Subject: [PATCH 16/29] add vignette for SLFhelper documentation --- vignettes/slf-documentation.Rmd | 465 ++++++++++++++++++++++++++++++++ 1 file changed, 465 insertions(+) create mode 100644 vignettes/slf-documentation.Rmd diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd new file mode 100644 index 0000000..8d827d2 --- /dev/null +++ b/vignettes/slf-documentation.Rmd @@ -0,0 +1,465 @@ +--- +title: "slf-documentation" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{slf-documentation} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + + +```{r setup, include = FALSE} +library(slfhelper) +library(tidyverse) +``` + +## SLFhelper +`SLFhelper` contains some easy to use functions designed to make working with the Source Linkage Files (SLFs) as efficient as possible. + +### Filter functions: +- `year` returns financial year of interest. You can also select multiple years using `c("1718", "1819", "1920")` +- `recid` returns recids of interest. Selecting this is beneficial for specific analysis. +- `partnerships` returns partnerships of interest. Selecting certain partnerships will reduce the SLFs size. +- `col_select` returns columns of interest. This is the best way to reduce the SLFs size. + +### Data snippets: +- `ep_file_vars` returns a list of all variables in the episode files. +- `indiv_file_vars` returns a list of all variables in the individual files. +- `partnerships` returns a list of partnership names (HSCP_2018 codes) +- `recid` returns a list of all recids available in the SLFs. +- `ep_file_bedday_vars` returns a list of all bedday related variables in the SLFs. +- `ep_file_cost_vars` returns a list of all cost related variables in the SLFs. + +### Anon CHI +- Use the function `get_chi()` to easily switch `anon_chi` to `chi`. +- Use the function `get_anon_chi()` to easily switch `chi` to `anon_chi`. + + +### Memory usage in SLFS + +While working with the Source Linkage Files (SLFs), it is recommended to use the features of the SLFhelper package to maximase the memory usage in posit, see [PHS Data Science Knowledge Base](https://public-health-scotland.github.io/knowledge-base/docs/Posit%20Infrastructure?doc=Memory%20Usage%20in%20SMR01.md) for further guidance on memory usage in posit workbench. + +Reading a full SLF file can be time consuming and take up resources on posit workbench. In the episode file there are `251 variables` and around `12 million rows` compared to the individual file where there are `193 variables` and around `6 million rows` in each file. This can be reduced by using available selections in SLFhelper to help reduce the size of the SLFs for analysis and to free up resources in posit workbench. + +The tables below show the memory usage of each full size SLF. + + +## Episode File + +| Year | Memory usage (MB)| +| ------------- |:----------------:| +| 1718 | 2651.2 | +| 1819 | 3196.5 | +| 1920 | 3145.4 | +| 2021 | 2715.6 | +| 2122 | 2959.3 | +| 2223 | 2995.1 | +| 2324 | 1894.5 | + + +## Individual File + +| Year | Memory usage (MB)| +| ------------- |:----------------:| +| 1718 | 1055.6 | +| 1819 | 1057.8 | +| 1920 | 1070.7 | +| 2021 | 1067.3 | +| 2122 | 1081.2 | +| 2223 | 1098.7 | +| 2324 | 775.5 | + + +## Using Parquet files with the arrow package + +The SLFs are available in parquet format. The {arrow} package gives some extra features which can speed up and reduce memory usage even further. You can read only specific columns `read_parquet(file, col_select = c(var1, var2))`. + +Using arrow’s ‘Arrow Table’ feature, you can speed up analysis efficiently. To do this, specify `as_data_frame = FALSE` when using SLFhelper and `dplyr::collect()` to read the data. + + +#### For example: + +Planned and unplanned beddays in Scotland +```{r} +# Filter for year of interest +slf_extract <- read_slf_episode(c( "1819", "1920"), + # Select recids of interest + recids = c("01B", "GLS", "04B"), + # Select columns + col_select = c("year", "anon_chi", "recid", + "yearstay", "age", "cij_pattype"), + # return an arrow table + as_data_frame = FALSE) %>% + # Filter for non-elective and elective episodes + filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + # Group by year and cij_pattype for analysis + group_by(year, cij_pattype) %>% + # summarise bedday totals + summarise(beddays = sum(yearstay)) %>% + # collect the arrow table + dplyr::collect() + +``` + + +## Examples using SLFhelper + +1. A&E attendances in East Lothian by age group. + +Produce a table to compare A&E Attendances for the following age groups (0-17, 18-64, 65-74, 75-84, 85+) for 2018/19 in East Lothian HSCP. +```{r} +# read in data required from slf individual file - filter for year 2018/19 +el_1819 <- read_slf_individual(year = "1819", + # select variables needed + col_select = c("age", "ae_attendances"), + # filter partnership for East Lothian + partnerships = "S37000010") + +# create age bands +age_labs <- c("0-17", "18-64", "65-74", "75-84", "85+") # create age labels + +# create age group variable +el_1819 <- el_1819 %>% + mutate(age_group = cut(age, + breaks=c(-1, 17, 64, 74, 84, 150), labels=age_labs)) + +# produce summary table +output_table_1 <- el_1819 %>% + group_by(age_group) %>% + summarise(attendances=sum(ae_attendances)) %>% + ungroup() + +``` + + +2. Outpatient attendances by specialty and gender. + +Create a table to compare the number of outpatient attendances (SMR00) broken down by specialty and gender in 2017/18 in Scotland. + +```{r} +# read in specialty lookup with names +spec_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% + select(spec = Speccode, + spec_name = Description) + +# read in data required from slf episode file - filter year = 2017/18 +op_1718 <- read_slf_episode(year = "1718", + # select columns + col_select = c("recid", "gender", "spec"), + # filter on recid for outpatients + recids = "00B") + +# produce output +output_table_2 <- op_1718 %>% + # get counts by specialty and gender + count(spec, gender) %>% + # exclude those with no gender recorded + filter(gender==1 | gender==2) %>% + # recode gender into M/F + mutate(gender = recode(as.character(gender), '1' = "Male", '2' = "Female")) %>% + # move gender to separate columns + pivot_wider(names_from = gender, values_from=n) %>% + # match on specialty names + left_join(spec_lookup) %>% + # reorder variables + select(spec, spec_name, Male, Female) + +``` + +3. Hospital admissions & beddays by HB of residence. + +Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. +```{r} +# Read in names for Health Boards +hb_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% + select(hb2019 = HealthBoardArea2019Code, + hb_desc = HealthBoardArea2019Name) + +# read in data required from slf individual file - filter for 2018/19 +indiv_1819 <- read_slf_individual(year = "1819", + # Select columns of interest + col_select = c("hb2019", "cij_el", "cij_non_el", + "acute_el_inpatient_beddays", + "mh_el_inpatient_beddays", + "gls_el_inpatient_beddays", + "acute_non_el_inpatient_beddays", + "mh_non_el_inpatient_beddays", + "gls_non_el_inpatient_beddays")) + + +# calculate total bed days and add on HB names +indiv_1819_inc_totals <- indiv_1819 %>% + # calculate overall bed days + mutate(elective_beddays = acute_el_inpatient_beddays + mh_el_inpatient_beddays + + gls_el_inpatient_beddays, + non_elective_beddays = acute_non_el_inpatient_beddays + mh_non_el_inpatient_beddays + + gls_non_el_inpatient_beddays) %>% + # match on HB name + left_join(hb_lookup) + +# produce summary table +output_table_3 <- indiv_1819_inc_totals %>% + # group by HB of residence + group_by(hb2019, hb_desc) %>% + # produce summary table + summarise(elective_adm = sum(cij_el), + non_elective_adm = sum(cij_non_el), + elective_beddays = sum(elective_beddays), + non_elective_beddays = sum(non_elective_beddays)) %>% + # calculate average length of stay + mutate(elective_alos = elective_beddays/elective_adm, + non_elective_alos = non_elective_beddays/non_elective_adm) + +``` + +4. GP Out of Hours Consulations in South Ayrshire. + +Create a table showing the number of GP Out of Hours consultations for patients with dementia in South Ayrshire HSCP in 2019/20 broken down by type of consultation. +```{r} +# read in data required from slf episode file - filter for year = 2019/20 +sa_1920 <- read_slf_episode(year = "1920", + # select columns + col_select = c("dementia", "smrtype"), + # filter for South Ayrshire HSCP + partnerships = "S37000027", + # Filter for GP OOH data + recids = "OoH") + +# select dementia patients +sa_dementia_1920 <- sa_1920 %>% + filter(dementia==1) + +# produce summary table +output_table_4 <- sa_dementia_1920 %>% + count(smrtype) + +``` + +5. Costs in Aberdeen City. + +Produce a table to show the number of patients and the total costs for Aberdeen City HSCP in 2018/19. Include a breakdown of costs for the following services: Acute (inpatients & daycases), GLS, Mental Health and Maternity, Outpatients, A&E, GP Out of Hours, Community Prescribing. +```{r} +# read in data required from slf individual file - filter year = 2018/19 +ab_1819 <- read_slf_individual(year = "1819", + # select columns + col_select = c("acute_cost", "gls_cost", "mh_cost", "mat_cost", + "op_cost_attend", "ae_cost", "ooh_cost", "pis_cost", + "health_net_cost"), + # filter for Aberdeen City + partnerships = "S37000001") + +# Have used variables which exclude the cost of outpatient attendances which did +# not attend (DNA) but you could also include this if needed. + +# produce summary table +output_table_5 <- ab_1819 %>% + # rename outatients variable + rename(op_cost = op_cost_attend) %>% + # sum of all cost variables and number of patients + summarise(across(ends_with("_cost"), ~sum(.x, na.rm=TRUE)), + patients = n()) %>% + # switch to rows + pivot_longer(everything()) + +``` + +6. Deaths from Dementia / Alzheimers + +Produce a chart to show the number of deaths from 2015/16 to 2019/20 in Scotland where the main cause of death was recorded as Dementia/Alzheimers (ICD 10 codes: G30, F01-F03, F05.1). + +```{r} +# read in data required from slf episode file - filter for years 2015/16 to 2019/20 +deaths <- read_slf_episode(year = c("1516", "1617", "1718", "1819", "1920"), + # select columns + col_select = c("year", "deathdiag1"), + # Filter for death records + recids = "NRS") + +# extract 3 & 4 digit codes and select those with dementia +dementia_deaths <- deaths %>% + # extract 3 & 4 digit ICD 10 codes + mutate(diag_3d = str_sub(deathdiag1, 1, 3), + diag_4d = str_sub(deathdiag1, 1, 4)) %>% + # select dementia codes + filter(diag_3d == "G30" | diag_3d == "F00" | diag_3d == "F01" + | diag_3d == "F02"| diag_3d == "F03" | diag_4d == "F051") + +# produce summary table +output_table_6 <- dementia_deaths %>% + count(year) %>% + rename(deaths=n) + +``` + +7. Number and cost of prescriptions for MS + +Create a table to compare the number and cost of prescribed items for patients with Multiple Sclerosis (MS) by HSCP in 2018/19. Include the number of dispensed items and cost per patient. + +```{R} +# read in HSCP names (used in exercises 7 & 9) +hscp_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Integration Authority 2019 Lookup.csv") %>% + select(hscp2019 = IntegrationAuthority2019Code, + hscp_desc = IntegrationAuthority2019Name) + +# read in data required from slf episode file - filter for year = 2018/19 +pis_1819 <- read_slf_individual("1819", + col_select = c("hscp2019", "ms", "pis_paid_items", "pis_cost")) + + +# select all patients with MS & add on HSCP name +ms_1819 <- pis_1819 %>% + filter(ms == 1) %>% + left_join(hscp_lookup) + +# produce summary table +output_table_7 <- ms_1819 %>% + # group by hscp + group_by(hscp2019, hscp_desc) %>% + # sum up number of items, costs & patients with MS (not all will have had prescription) + summarise(pis_paid_items = sum(pis_paid_items), + pis_cost = sum(pis_cost), + patients = sum(ms)) %>% + ungroup() %>% + # calculate number of items / cost per patient + mutate(items_per_patient = pis_paid_items/patients, + cost_per_patient = pis_cost/patients) + +``` + +8. A&E attendance in last 3 months of life. + +Produce a table to show the number of deaths in Glasgow City HSCP in 2019/20 and what proportion had an A&E attendance in the last 3 months of life. + +```{r} +# extract all deaths in Glasgow City in 1920 - Filter year = 1920 +gc_deaths <- read_slf_episode(year = "1920", + # select columns + col_select = c("anon_chi", "death_date"), + # filter for Glasgow City + partnerships = "S37000015", + # Filter for death records + recids = "NRS") %>% + # exclude those with missing chi + filter(anon_chi != "") %>% + # exclude duplicates + distinct(anon_chi, death_date) + +# extract all A&E attendances in 1819 & 1920 +ae <- read_slf_episode(year = c("1819", "1920"), + # select columns + col_select = c("anon_chi", "recid", "record_keydate1"), + # filter for A&E data + recids = "AE2") %>% + # exclude those with missing chi + filter(anon_chi != "") %>% + # rename date of attendance + rename(attendance_date = record_keydate1) + +# select A&E attendances for those individuals who are in the GC deaths file +ae_gc <- ae %>% + # filter A&E attendances for those in deaths file + semi_join(gc_deaths) %>% + # match on date of death + left_join(gc_deaths) + +# select A&E attendances which are within 3 months of death (counted as 91 days) +ae_gc_3m <- ae_gc %>% + # create 3 month interval + mutate(int_3m = interval(death_date - days(91), death_date)) %>% + # flag if attendance is in 3 month interval + mutate(att_3m = if_else(attendance_date %within% int_3m, 1, 0)) %>% + # select only those attendances in 3 months before death + filter(att_3m==1) + +# create list of patients with A&E attendance in 3m period +pats_ae_3m <- ae_gc_3m %>% + # select only chi and attendance flag + select(anon_chi, att_3m) %>% + # restrict to one row per person + distinct() + +# final output for total number of deaths and number with an A&E attendance in last 3 months +output_table_8 <- gc_deaths %>% + # match on attendance flag + left_join(pats_ae_3m) %>% + # summarise total deaths and deaths with A&E attendance in last 3 months + summarise(deaths=n(), + deaths_with_ae_att=sum(att_3m, na.rm=TRUE)) %>% + # calculate % + mutate(prop_ae_3m = deaths_with_ae_att/deaths) + +``` + +9. Non elective admissions in Geriatric Medicine. + +Create a table showing the number of non-elective admissions with any part of the stay (Continuous Inpatient Journey, CIJ) in the specialty Geriatric Medicine, by HSCP in 2019/20. Also include the associated bed days, cost and number of patients. + +```{r} +# extract data required from episode file +smr_1920 <- read_slf_episode(year = "1920", + col_select = c("anon_chi", "record_keydate1", "record_keydate2", + "spec", "hscp2019", "yearstay", "cost_total_net", + "cij_marker", "cij_pattype"), + recids = c("01B", "GLS", "04B")) %>% + # exclude those with missing chi + filter(anon_chi != "") + +# flag episodes in Geriatric Medicine specialty AB +smr_1920 <- smr_1920 %>% + mutate(ger_med = if_else(spec=="AB", 1, 0)) + +# select only those from non-elective stays +smr_1920_ne <- smr_1920 %>% + filter(cij_pattype=="Non-Elective") + +# aggregate to cij level +# we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB +# take hscp from the last record and sum beddays & cost +cij_1920 <- smr_1920_ne %>% + arrange(anon_chi, cij_marker, record_keydate1, record_keydate2) %>% + group_by(anon_chi, cij_marker) %>% + summarise( + across(record_keydate1, min), + across(c(record_keydate2, ger_med), max), + across(c(cij_pattype, hscp2019), last), + across(c(yearstay, cost_total_net), sum))%>% + ungroup() + +# select only admissions with part of their stay in Geriatric Medicine specialty +cij_ger_med <- cij_1920 %>% + filter(ger_med==1) + +# aggregate up to patient level +# we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB +# take hscp from the last record and sum beddays & cost +pat_1920 <- cij_ger_med %>% + group_by(anon_chi, hscp2019) %>% + summarise( + across(c(ger_med, yearstay, cost_total_net), sum)) %>% + ungroup() + +# produce output +# note patients may be counted in more than one hscp +output_table_9 <- pat_1920 %>% + # match on hscp names + left_join(hscp_lookup) %>% + # group up to hscp level + group_by(hscp2019, hscp_desc) %>% + # sum up measures + summarise(admissions=sum(ger_med), + beddays = sum(yearstay), + cost = sum(cost_total_net), + patients = n()) %>% + ungroup() + +``` + From 9f875e6b80732f693db5c0bb3968aff5eb7ff40f Mon Sep 17 00:00:00 2001 From: Jennit07 Date: Fri, 14 Jun 2024 14:09:31 +0000 Subject: [PATCH 17/29] Style package --- vignettes/slf-documentation.Rmd | 304 ++++++++++++++++++-------------- 1 file changed, 174 insertions(+), 130 deletions(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index 8d827d2..8765acd 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -89,23 +89,25 @@ Using arrow’s ‘Arrow Table’ feature, you can speed up analysis efficiently Planned and unplanned beddays in Scotland ```{r} # Filter for year of interest -slf_extract <- read_slf_episode(c( "1819", "1920"), - # Select recids of interest - recids = c("01B", "GLS", "04B"), - # Select columns - col_select = c("year", "anon_chi", "recid", - "yearstay", "age", "cij_pattype"), - # return an arrow table - as_data_frame = FALSE) %>% +slf_extract <- read_slf_episode(c("1819", "1920"), + # Select recids of interest + recids = c("01B", "GLS", "04B"), + # Select columns + col_select = c( + "year", "anon_chi", "recid", + "yearstay", "age", "cij_pattype" + ), + # return an arrow table + as_data_frame = FALSE +) %>% # Filter for non-elective and elective episodes - filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% # Group by year and cij_pattype for analysis group_by(year, cij_pattype) %>% # summarise bedday totals - summarise(beddays = sum(yearstay)) %>% - # collect the arrow table + summarise(beddays = sum(yearstay)) %>% + # collect the arrow table dplyr::collect() - ``` @@ -116,26 +118,28 @@ slf_extract <- read_slf_episode(c( "1819", "1920"), Produce a table to compare A&E Attendances for the following age groups (0-17, 18-64, 65-74, 75-84, 85+) for 2018/19 in East Lothian HSCP. ```{r} # read in data required from slf individual file - filter for year 2018/19 -el_1819 <- read_slf_individual(year = "1819", - # select variables needed - col_select = c("age", "ae_attendances"), - # filter partnership for East Lothian - partnerships = "S37000010") +el_1819 <- read_slf_individual( + year = "1819", + # select variables needed + col_select = c("age", "ae_attendances"), + # filter partnership for East Lothian + partnerships = "S37000010" +) # create age bands age_labs <- c("0-17", "18-64", "65-74", "75-84", "85+") # create age labels # create age group variable el_1819 <- el_1819 %>% - mutate(age_group = cut(age, - breaks=c(-1, 17, 64, 74, 84, 150), labels=age_labs)) + mutate(age_group = cut(age, + breaks = c(-1, 17, 64, 74, 84, 150), labels = age_labs + )) # produce summary table output_table_1 <- el_1819 %>% group_by(age_group) %>% - summarise(attendances=sum(ae_attendances)) %>% + summarise(attendances = sum(ae_attendances)) %>% ungroup() - ``` @@ -145,64 +149,75 @@ Create a table to compare the number of outpatient attendances (SMR00) broken do ```{r} # read in specialty lookup with names -spec_lookup <- - read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% - select(spec = Speccode, - spec_name = Description) +spec_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% + select( + spec = Speccode, + spec_name = Description + ) # read in data required from slf episode file - filter year = 2017/18 -op_1718 <- read_slf_episode(year = "1718", - # select columns - col_select = c("recid", "gender", "spec"), - # filter on recid for outpatients - recids = "00B") +op_1718 <- read_slf_episode( + year = "1718", + # select columns + col_select = c("recid", "gender", "spec"), + # filter on recid for outpatients + recids = "00B" +) # produce output output_table_2 <- op_1718 %>% # get counts by specialty and gender count(spec, gender) %>% # exclude those with no gender recorded - filter(gender==1 | gender==2) %>% + filter(gender == 1 | gender == 2) %>% # recode gender into M/F - mutate(gender = recode(as.character(gender), '1' = "Male", '2' = "Female")) %>% + mutate(gender = recode(as.character(gender), "1" = "Male", "2" = "Female")) %>% # move gender to separate columns - pivot_wider(names_from = gender, values_from=n) %>% + pivot_wider(names_from = gender, values_from = n) %>% # match on specialty names left_join(spec_lookup) %>% # reorder variables select(spec, spec_name, Male, Female) - ``` 3. Hospital admissions & beddays by HB of residence. Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. ```{r} -# Read in names for Health Boards -hb_lookup <- +# Read in names for Health Boards +hb_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% - select(hb2019 = HealthBoardArea2019Code, - hb_desc = HealthBoardArea2019Name) + select( + hb2019 = HealthBoardArea2019Code, + hb_desc = HealthBoardArea2019Name + ) # read in data required from slf individual file - filter for 2018/19 -indiv_1819 <- read_slf_individual(year = "1819", - # Select columns of interest - col_select = c("hb2019", "cij_el", "cij_non_el", - "acute_el_inpatient_beddays", - "mh_el_inpatient_beddays", - "gls_el_inpatient_beddays", - "acute_non_el_inpatient_beddays", - "mh_non_el_inpatient_beddays", - "gls_non_el_inpatient_beddays")) +indiv_1819 <- read_slf_individual( + year = "1819", + # Select columns of interest + col_select = c( + "hb2019", "cij_el", "cij_non_el", + "acute_el_inpatient_beddays", + "mh_el_inpatient_beddays", + "gls_el_inpatient_beddays", + "acute_non_el_inpatient_beddays", + "mh_non_el_inpatient_beddays", + "gls_non_el_inpatient_beddays" + ) +) # calculate total bed days and add on HB names indiv_1819_inc_totals <- indiv_1819 %>% # calculate overall bed days - mutate(elective_beddays = acute_el_inpatient_beddays + mh_el_inpatient_beddays + - gls_el_inpatient_beddays, - non_elective_beddays = acute_non_el_inpatient_beddays + mh_non_el_inpatient_beddays + - gls_non_el_inpatient_beddays) %>% + mutate( + elective_beddays = acute_el_inpatient_beddays + mh_el_inpatient_beddays + + gls_el_inpatient_beddays, + non_elective_beddays = acute_non_el_inpatient_beddays + mh_non_el_inpatient_beddays + + gls_non_el_inpatient_beddays + ) %>% # match on HB name left_join(hb_lookup) @@ -211,14 +226,17 @@ output_table_3 <- indiv_1819_inc_totals %>% # group by HB of residence group_by(hb2019, hb_desc) %>% # produce summary table - summarise(elective_adm = sum(cij_el), - non_elective_adm = sum(cij_non_el), - elective_beddays = sum(elective_beddays), - non_elective_beddays = sum(non_elective_beddays)) %>% + summarise( + elective_adm = sum(cij_el), + non_elective_adm = sum(cij_non_el), + elective_beddays = sum(elective_beddays), + non_elective_beddays = sum(non_elective_beddays) + ) %>% # calculate average length of stay - mutate(elective_alos = elective_beddays/elective_adm, - non_elective_alos = non_elective_beddays/non_elective_adm) - + mutate( + elective_alos = elective_beddays / elective_adm, + non_elective_alos = non_elective_beddays / non_elective_adm + ) ``` 4. GP Out of Hours Consulations in South Ayrshire. @@ -226,22 +244,23 @@ output_table_3 <- indiv_1819_inc_totals %>% Create a table showing the number of GP Out of Hours consultations for patients with dementia in South Ayrshire HSCP in 2019/20 broken down by type of consultation. ```{r} # read in data required from slf episode file - filter for year = 2019/20 -sa_1920 <- read_slf_episode(year = "1920", - # select columns - col_select = c("dementia", "smrtype"), - # filter for South Ayrshire HSCP - partnerships = "S37000027", - # Filter for GP OOH data - recids = "OoH") +sa_1920 <- read_slf_episode( + year = "1920", + # select columns + col_select = c("dementia", "smrtype"), + # filter for South Ayrshire HSCP + partnerships = "S37000027", + # Filter for GP OOH data + recids = "OoH" +) # select dementia patients sa_dementia_1920 <- sa_1920 %>% - filter(dementia==1) + filter(dementia == 1) # produce summary table output_table_4 <- sa_dementia_1920 %>% count(smrtype) - ``` 5. Costs in Aberdeen City. @@ -249,13 +268,17 @@ output_table_4 <- sa_dementia_1920 %>% Produce a table to show the number of patients and the total costs for Aberdeen City HSCP in 2018/19. Include a breakdown of costs for the following services: Acute (inpatients & daycases), GLS, Mental Health and Maternity, Outpatients, A&E, GP Out of Hours, Community Prescribing. ```{r} # read in data required from slf individual file - filter year = 2018/19 -ab_1819 <- read_slf_individual(year = "1819", - # select columns - col_select = c("acute_cost", "gls_cost", "mh_cost", "mat_cost", - "op_cost_attend", "ae_cost", "ooh_cost", "pis_cost", - "health_net_cost"), - # filter for Aberdeen City - partnerships = "S37000001") +ab_1819 <- read_slf_individual( + year = "1819", + # select columns + col_select = c( + "acute_cost", "gls_cost", "mh_cost", "mat_cost", + "op_cost_attend", "ae_cost", "ooh_cost", "pis_cost", + "health_net_cost" + ), + # filter for Aberdeen City + partnerships = "S37000001" +) # Have used variables which exclude the cost of outpatient attendances which did # not attend (DNA) but you could also include this if needed. @@ -265,11 +288,11 @@ output_table_5 <- ab_1819 %>% # rename outatients variable rename(op_cost = op_cost_attend) %>% # sum of all cost variables and number of patients - summarise(across(ends_with("_cost"), ~sum(.x, na.rm=TRUE)), - patients = n()) %>% + summarise(across(ends_with("_cost"), ~ sum(.x, na.rm = TRUE)), + patients = n() + ) %>% # switch to rows pivot_longer(everything()) - ``` 6. Deaths from Dementia / Alzheimers @@ -278,26 +301,29 @@ Produce a chart to show the number of deaths from 2015/16 to 2019/20 in Scotland ```{r} # read in data required from slf episode file - filter for years 2015/16 to 2019/20 -deaths <- read_slf_episode(year = c("1516", "1617", "1718", "1819", "1920"), - # select columns - col_select = c("year", "deathdiag1"), - # Filter for death records - recids = "NRS") +deaths <- read_slf_episode( + year = c("1516", "1617", "1718", "1819", "1920"), + # select columns + col_select = c("year", "deathdiag1"), + # Filter for death records + recids = "NRS" +) # extract 3 & 4 digit codes and select those with dementia dementia_deaths <- deaths %>% # extract 3 & 4 digit ICD 10 codes - mutate(diag_3d = str_sub(deathdiag1, 1, 3), - diag_4d = str_sub(deathdiag1, 1, 4)) %>% + mutate( + diag_3d = str_sub(deathdiag1, 1, 3), + diag_4d = str_sub(deathdiag1, 1, 4) + ) %>% # select dementia codes - filter(diag_3d == "G30" | diag_3d == "F00" | diag_3d == "F01" - | diag_3d == "F02"| diag_3d == "F03" | diag_4d == "F051") + filter(diag_3d == "G30" | diag_3d == "F00" | diag_3d == "F01" | + diag_3d == "F02" | diag_3d == "F03" | diag_4d == "F051") # produce summary table output_table_6 <- dementia_deaths %>% count(year) %>% - rename(deaths=n) - + rename(deaths = n) ``` 7. Number and cost of prescriptions for MS @@ -307,12 +333,15 @@ Create a table to compare the number and cost of prescribed items for patients w ```{R} # read in HSCP names (used in exercises 7 & 9) hscp_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Integration Authority 2019 Lookup.csv") %>% - select(hscp2019 = IntegrationAuthority2019Code, - hscp_desc = IntegrationAuthority2019Name) + select( + hscp2019 = IntegrationAuthority2019Code, + hscp_desc = IntegrationAuthority2019Name + ) # read in data required from slf episode file - filter for year = 2018/19 pis_1819 <- read_slf_individual("1819", - col_select = c("hscp2019", "ms", "pis_paid_items", "pis_cost")) + col_select = c("hscp2019", "ms", "pis_paid_items", "pis_cost") +) # select all patients with MS & add on HSCP name @@ -325,14 +354,17 @@ output_table_7 <- ms_1819 %>% # group by hscp group_by(hscp2019, hscp_desc) %>% # sum up number of items, costs & patients with MS (not all will have had prescription) - summarise(pis_paid_items = sum(pis_paid_items), - pis_cost = sum(pis_cost), - patients = sum(ms)) %>% + summarise( + pis_paid_items = sum(pis_paid_items), + pis_cost = sum(pis_cost), + patients = sum(ms) + ) %>% ungroup() %>% # calculate number of items / cost per patient - mutate(items_per_patient = pis_paid_items/patients, - cost_per_patient = pis_cost/patients) - + mutate( + items_per_patient = pis_paid_items / patients, + cost_per_patient = pis_cost / patients + ) ``` 8. A&E attendance in last 3 months of life. @@ -341,24 +373,28 @@ Produce a table to show the number of deaths in Glasgow City HSCP in 2019/20 and ```{r} # extract all deaths in Glasgow City in 1920 - Filter year = 1920 -gc_deaths <- read_slf_episode(year = "1920", - # select columns - col_select = c("anon_chi", "death_date"), - # filter for Glasgow City - partnerships = "S37000015", - # Filter for death records - recids = "NRS") %>% +gc_deaths <- read_slf_episode( + year = "1920", + # select columns + col_select = c("anon_chi", "death_date"), + # filter for Glasgow City + partnerships = "S37000015", + # Filter for death records + recids = "NRS" +) %>% # exclude those with missing chi filter(anon_chi != "") %>% # exclude duplicates distinct(anon_chi, death_date) # extract all A&E attendances in 1819 & 1920 -ae <- read_slf_episode(year = c("1819", "1920"), - # select columns - col_select = c("anon_chi", "recid", "record_keydate1"), - # filter for A&E data - recids = "AE2") %>% +ae <- read_slf_episode( + year = c("1819", "1920"), + # select columns + col_select = c("anon_chi", "recid", "record_keydate1"), + # filter for A&E data + recids = "AE2" +) %>% # exclude those with missing chi filter(anon_chi != "") %>% # rename date of attendance @@ -378,7 +414,7 @@ ae_gc_3m <- ae_gc %>% # flag if attendance is in 3 month interval mutate(att_3m = if_else(attendance_date %within% int_3m, 1, 0)) %>% # select only those attendances in 3 months before death - filter(att_3m==1) + filter(att_3m == 1) # create list of patients with A&E attendance in 3m period pats_ae_3m <- ae_gc_3m %>% @@ -392,11 +428,12 @@ output_table_8 <- gc_deaths %>% # match on attendance flag left_join(pats_ae_3m) %>% # summarise total deaths and deaths with A&E attendance in last 3 months - summarise(deaths=n(), - deaths_with_ae_att=sum(att_3m, na.rm=TRUE)) %>% + summarise( + deaths = n(), + deaths_with_ae_att = sum(att_3m, na.rm = TRUE) + ) %>% # calculate % - mutate(prop_ae_3m = deaths_with_ae_att/deaths) - + mutate(prop_ae_3m = deaths_with_ae_att / deaths) ``` 9. Non elective admissions in Geriatric Medicine. @@ -405,21 +442,25 @@ Create a table showing the number of non-elective admissions with any part of th ```{r} # extract data required from episode file -smr_1920 <- read_slf_episode(year = "1920", - col_select = c("anon_chi", "record_keydate1", "record_keydate2", - "spec", "hscp2019", "yearstay", "cost_total_net", - "cij_marker", "cij_pattype"), - recids = c("01B", "GLS", "04B")) %>% +smr_1920 <- read_slf_episode( + year = "1920", + col_select = c( + "anon_chi", "record_keydate1", "record_keydate2", + "spec", "hscp2019", "yearstay", "cost_total_net", + "cij_marker", "cij_pattype" + ), + recids = c("01B", "GLS", "04B") +) %>% # exclude those with missing chi filter(anon_chi != "") # flag episodes in Geriatric Medicine specialty AB -smr_1920 <- smr_1920 %>% - mutate(ger_med = if_else(spec=="AB", 1, 0)) +smr_1920 <- smr_1920 %>% + mutate(ger_med = if_else(spec == "AB", 1, 0)) # select only those from non-elective stays smr_1920_ne <- smr_1920 %>% - filter(cij_pattype=="Non-Elective") + filter(cij_pattype == "Non-Elective") # aggregate to cij level # we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB @@ -431,12 +472,13 @@ cij_1920 <- smr_1920_ne %>% across(record_keydate1, min), across(c(record_keydate2, ger_med), max), across(c(cij_pattype, hscp2019), last), - across(c(yearstay, cost_total_net), sum))%>% + across(c(yearstay, cost_total_net), sum) + ) %>% ungroup() # select only admissions with part of their stay in Geriatric Medicine specialty cij_ger_med <- cij_1920 %>% - filter(ger_med==1) + filter(ger_med == 1) # aggregate up to patient level # we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB @@ -444,7 +486,8 @@ cij_ger_med <- cij_1920 %>% pat_1920 <- cij_ger_med %>% group_by(anon_chi, hscp2019) %>% summarise( - across(c(ger_med, yearstay, cost_total_net), sum)) %>% + across(c(ger_med, yearstay, cost_total_net), sum) + ) %>% ungroup() # produce output @@ -455,11 +498,12 @@ output_table_9 <- pat_1920 %>% # group up to hscp level group_by(hscp2019, hscp_desc) %>% # sum up measures - summarise(admissions=sum(ger_med), - beddays = sum(yearstay), - cost = sum(cost_total_net), - patients = n()) %>% + summarise( + admissions = sum(ger_med), + beddays = sum(yearstay), + cost = sum(cost_total_net), + patients = n() + ) %>% ungroup() - ``` From 46c9f6e146d8cf7573be1250617c0fea223f38be Mon Sep 17 00:00:00 2001 From: Jennifer Thom Date: Fri, 14 Jun 2024 15:36:59 +0100 Subject: [PATCH 18/29] Hide messages --- vignettes/slf-documentation.Rmd | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index 8765acd..cb99cae 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -87,7 +87,7 @@ Using arrow’s ‘Arrow Table’ feature, you can speed up analysis efficiently #### For example: Planned and unplanned beddays in Scotland -```{r} +```{r, message=FALSE} # Filter for year of interest slf_extract <- read_slf_episode(c("1819", "1920"), # Select recids of interest @@ -116,7 +116,7 @@ slf_extract <- read_slf_episode(c("1819", "1920"), 1. A&E attendances in East Lothian by age group. Produce a table to compare A&E Attendances for the following age groups (0-17, 18-64, 65-74, 75-84, 85+) for 2018/19 in East Lothian HSCP. -```{r} +```{r, message=FALSE} # read in data required from slf individual file - filter for year 2018/19 el_1819 <- read_slf_individual( year = "1819", @@ -147,7 +147,7 @@ output_table_1 <- el_1819 %>% Create a table to compare the number of outpatient attendances (SMR00) broken down by specialty and gender in 2017/18 in Scotland. -```{r} +```{r, message=FALSE} # read in specialty lookup with names spec_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% @@ -184,9 +184,15 @@ output_table_2 <- op_1718 %>% 3. Hospital admissions & beddays by HB of residence. Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. +<<<<<<< HEAD ```{r} # Read in names for Health Boards hb_lookup <- +======= +```{r, message=FALSE} +# Read in names for Health Boards +hb_lookup <- +>>>>>>> 516d5a1 (Hide messages) read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% select( hb2019 = HealthBoardArea2019Code, @@ -242,7 +248,7 @@ output_table_3 <- indiv_1819_inc_totals %>% 4. GP Out of Hours Consulations in South Ayrshire. Create a table showing the number of GP Out of Hours consultations for patients with dementia in South Ayrshire HSCP in 2019/20 broken down by type of consultation. -```{r} +```{r, message=FALSE} # read in data required from slf episode file - filter for year = 2019/20 sa_1920 <- read_slf_episode( year = "1920", @@ -266,7 +272,7 @@ output_table_4 <- sa_dementia_1920 %>% 5. Costs in Aberdeen City. Produce a table to show the number of patients and the total costs for Aberdeen City HSCP in 2018/19. Include a breakdown of costs for the following services: Acute (inpatients & daycases), GLS, Mental Health and Maternity, Outpatients, A&E, GP Out of Hours, Community Prescribing. -```{r} +```{r, message=FALSE} # read in data required from slf individual file - filter year = 2018/19 ab_1819 <- read_slf_individual( year = "1819", @@ -299,7 +305,7 @@ output_table_5 <- ab_1819 %>% Produce a chart to show the number of deaths from 2015/16 to 2019/20 in Scotland where the main cause of death was recorded as Dementia/Alzheimers (ICD 10 codes: G30, F01-F03, F05.1). -```{r} +```{r, message=FALSE} # read in data required from slf episode file - filter for years 2015/16 to 2019/20 deaths <- read_slf_episode( year = c("1516", "1617", "1718", "1819", "1920"), @@ -330,7 +336,7 @@ output_table_6 <- dementia_deaths %>% Create a table to compare the number and cost of prescribed items for patients with Multiple Sclerosis (MS) by HSCP in 2018/19. Include the number of dispensed items and cost per patient. -```{R} +```{R, message=FALSE} # read in HSCP names (used in exercises 7 & 9) hscp_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Integration Authority 2019 Lookup.csv") %>% select( @@ -371,7 +377,7 @@ output_table_7 <- ms_1819 %>% Produce a table to show the number of deaths in Glasgow City HSCP in 2019/20 and what proportion had an A&E attendance in the last 3 months of life. -```{r} +```{r, message=FALSE} # extract all deaths in Glasgow City in 1920 - Filter year = 1920 gc_deaths <- read_slf_episode( year = "1920", @@ -440,7 +446,7 @@ output_table_8 <- gc_deaths %>% Create a table showing the number of non-elective admissions with any part of the stay (Continuous Inpatient Journey, CIJ) in the specialty Geriatric Medicine, by HSCP in 2019/20. Also include the associated bed days, cost and number of patients. -```{r} +```{r, message=FALSE} # extract data required from episode file smr_1920 <- read_slf_episode( year = "1920", From 78bc990af651427e52a5580f1172f72308344327 Mon Sep 17 00:00:00 2001 From: Jennifer Thom Date: Mon, 17 Jun 2024 10:09:55 +0100 Subject: [PATCH 19/29] remove conflict --- vignettes/slf-documentation.Rmd | 6 ------ 1 file changed, 6 deletions(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index cb99cae..21bedab 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -184,15 +184,9 @@ output_table_2 <- op_1718 %>% 3. Hospital admissions & beddays by HB of residence. Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. -<<<<<<< HEAD -```{r} -# Read in names for Health Boards -hb_lookup <- -======= ```{r, message=FALSE} # Read in names for Health Boards hb_lookup <- ->>>>>>> 516d5a1 (Hide messages) read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% select( hb2019 = HealthBoardArea2019Code, From ac69bd291d8b05781837c5efaf044772bc72b996 Mon Sep 17 00:00:00 2001 From: Jennit07 Date: Mon, 17 Jun 2024 09:15:59 +0000 Subject: [PATCH 20/29] Style package --- vignettes/slf-documentation.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index 21bedab..b70e54f 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -185,8 +185,8 @@ output_table_2 <- op_1718 %>% Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. ```{r, message=FALSE} -# Read in names for Health Boards -hb_lookup <- +# Read in names for Health Boards +hb_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% select( hb2019 = HealthBoardArea2019Code, From 7cb140f391ef7e4e1ef110fcd1eda4dd3e8095b6 Mon Sep 17 00:00:00 2001 From: Jennifer Thom Date: Fri, 26 Jul 2024 12:56:07 +0100 Subject: [PATCH 21/29] Split up documentation into 3 vignettes --- vignettes/slf-documentation.Rmd | 431 --------------------------- vignettes/slfhelper-applications.Rmd | 416 ++++++++++++++++++++++++++ vignettes/using-arrow-table.Rmd | 52 ++++ 3 files changed, 468 insertions(+), 431 deletions(-) create mode 100644 vignettes/slfhelper-applications.Rmd create mode 100644 vignettes/using-arrow-table.Rmd diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index b70e54f..2f0ae58 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -76,434 +76,3 @@ The tables below show the memory usage of each full size SLF. | 2223 | 1098.7 | | 2324 | 775.5 | - -## Using Parquet files with the arrow package - -The SLFs are available in parquet format. The {arrow} package gives some extra features which can speed up and reduce memory usage even further. You can read only specific columns `read_parquet(file, col_select = c(var1, var2))`. - -Using arrow’s ‘Arrow Table’ feature, you can speed up analysis efficiently. To do this, specify `as_data_frame = FALSE` when using SLFhelper and `dplyr::collect()` to read the data. - - -#### For example: - -Planned and unplanned beddays in Scotland -```{r, message=FALSE} -# Filter for year of interest -slf_extract <- read_slf_episode(c("1819", "1920"), - # Select recids of interest - recids = c("01B", "GLS", "04B"), - # Select columns - col_select = c( - "year", "anon_chi", "recid", - "yearstay", "age", "cij_pattype" - ), - # return an arrow table - as_data_frame = FALSE -) %>% - # Filter for non-elective and elective episodes - filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% - # Group by year and cij_pattype for analysis - group_by(year, cij_pattype) %>% - # summarise bedday totals - summarise(beddays = sum(yearstay)) %>% - # collect the arrow table - dplyr::collect() -``` - - -## Examples using SLFhelper - -1. A&E attendances in East Lothian by age group. - -Produce a table to compare A&E Attendances for the following age groups (0-17, 18-64, 65-74, 75-84, 85+) for 2018/19 in East Lothian HSCP. -```{r, message=FALSE} -# read in data required from slf individual file - filter for year 2018/19 -el_1819 <- read_slf_individual( - year = "1819", - # select variables needed - col_select = c("age", "ae_attendances"), - # filter partnership for East Lothian - partnerships = "S37000010" -) - -# create age bands -age_labs <- c("0-17", "18-64", "65-74", "75-84", "85+") # create age labels - -# create age group variable -el_1819 <- el_1819 %>% - mutate(age_group = cut(age, - breaks = c(-1, 17, 64, 74, 84, 150), labels = age_labs - )) - -# produce summary table -output_table_1 <- el_1819 %>% - group_by(age_group) %>% - summarise(attendances = sum(ae_attendances)) %>% - ungroup() -``` - - -2. Outpatient attendances by specialty and gender. - -Create a table to compare the number of outpatient attendances (SMR00) broken down by specialty and gender in 2017/18 in Scotland. - -```{r, message=FALSE} -# read in specialty lookup with names -spec_lookup <- - read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% - select( - spec = Speccode, - spec_name = Description - ) - -# read in data required from slf episode file - filter year = 2017/18 -op_1718 <- read_slf_episode( - year = "1718", - # select columns - col_select = c("recid", "gender", "spec"), - # filter on recid for outpatients - recids = "00B" -) - -# produce output -output_table_2 <- op_1718 %>% - # get counts by specialty and gender - count(spec, gender) %>% - # exclude those with no gender recorded - filter(gender == 1 | gender == 2) %>% - # recode gender into M/F - mutate(gender = recode(as.character(gender), "1" = "Male", "2" = "Female")) %>% - # move gender to separate columns - pivot_wider(names_from = gender, values_from = n) %>% - # match on specialty names - left_join(spec_lookup) %>% - # reorder variables - select(spec, spec_name, Male, Female) -``` - -3. Hospital admissions & beddays by HB of residence. - -Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. -```{r, message=FALSE} -# Read in names for Health Boards -hb_lookup <- - read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% - select( - hb2019 = HealthBoardArea2019Code, - hb_desc = HealthBoardArea2019Name - ) - -# read in data required from slf individual file - filter for 2018/19 -indiv_1819 <- read_slf_individual( - year = "1819", - # Select columns of interest - col_select = c( - "hb2019", "cij_el", "cij_non_el", - "acute_el_inpatient_beddays", - "mh_el_inpatient_beddays", - "gls_el_inpatient_beddays", - "acute_non_el_inpatient_beddays", - "mh_non_el_inpatient_beddays", - "gls_non_el_inpatient_beddays" - ) -) - - -# calculate total bed days and add on HB names -indiv_1819_inc_totals <- indiv_1819 %>% - # calculate overall bed days - mutate( - elective_beddays = acute_el_inpatient_beddays + mh_el_inpatient_beddays + - gls_el_inpatient_beddays, - non_elective_beddays = acute_non_el_inpatient_beddays + mh_non_el_inpatient_beddays + - gls_non_el_inpatient_beddays - ) %>% - # match on HB name - left_join(hb_lookup) - -# produce summary table -output_table_3 <- indiv_1819_inc_totals %>% - # group by HB of residence - group_by(hb2019, hb_desc) %>% - # produce summary table - summarise( - elective_adm = sum(cij_el), - non_elective_adm = sum(cij_non_el), - elective_beddays = sum(elective_beddays), - non_elective_beddays = sum(non_elective_beddays) - ) %>% - # calculate average length of stay - mutate( - elective_alos = elective_beddays / elective_adm, - non_elective_alos = non_elective_beddays / non_elective_adm - ) -``` - -4. GP Out of Hours Consulations in South Ayrshire. - -Create a table showing the number of GP Out of Hours consultations for patients with dementia in South Ayrshire HSCP in 2019/20 broken down by type of consultation. -```{r, message=FALSE} -# read in data required from slf episode file - filter for year = 2019/20 -sa_1920 <- read_slf_episode( - year = "1920", - # select columns - col_select = c("dementia", "smrtype"), - # filter for South Ayrshire HSCP - partnerships = "S37000027", - # Filter for GP OOH data - recids = "OoH" -) - -# select dementia patients -sa_dementia_1920 <- sa_1920 %>% - filter(dementia == 1) - -# produce summary table -output_table_4 <- sa_dementia_1920 %>% - count(smrtype) -``` - -5. Costs in Aberdeen City. - -Produce a table to show the number of patients and the total costs for Aberdeen City HSCP in 2018/19. Include a breakdown of costs for the following services: Acute (inpatients & daycases), GLS, Mental Health and Maternity, Outpatients, A&E, GP Out of Hours, Community Prescribing. -```{r, message=FALSE} -# read in data required from slf individual file - filter year = 2018/19 -ab_1819 <- read_slf_individual( - year = "1819", - # select columns - col_select = c( - "acute_cost", "gls_cost", "mh_cost", "mat_cost", - "op_cost_attend", "ae_cost", "ooh_cost", "pis_cost", - "health_net_cost" - ), - # filter for Aberdeen City - partnerships = "S37000001" -) - -# Have used variables which exclude the cost of outpatient attendances which did -# not attend (DNA) but you could also include this if needed. - -# produce summary table -output_table_5 <- ab_1819 %>% - # rename outatients variable - rename(op_cost = op_cost_attend) %>% - # sum of all cost variables and number of patients - summarise(across(ends_with("_cost"), ~ sum(.x, na.rm = TRUE)), - patients = n() - ) %>% - # switch to rows - pivot_longer(everything()) -``` - -6. Deaths from Dementia / Alzheimers - -Produce a chart to show the number of deaths from 2015/16 to 2019/20 in Scotland where the main cause of death was recorded as Dementia/Alzheimers (ICD 10 codes: G30, F01-F03, F05.1). - -```{r, message=FALSE} -# read in data required from slf episode file - filter for years 2015/16 to 2019/20 -deaths <- read_slf_episode( - year = c("1516", "1617", "1718", "1819", "1920"), - # select columns - col_select = c("year", "deathdiag1"), - # Filter for death records - recids = "NRS" -) - -# extract 3 & 4 digit codes and select those with dementia -dementia_deaths <- deaths %>% - # extract 3 & 4 digit ICD 10 codes - mutate( - diag_3d = str_sub(deathdiag1, 1, 3), - diag_4d = str_sub(deathdiag1, 1, 4) - ) %>% - # select dementia codes - filter(diag_3d == "G30" | diag_3d == "F00" | diag_3d == "F01" | - diag_3d == "F02" | diag_3d == "F03" | diag_4d == "F051") - -# produce summary table -output_table_6 <- dementia_deaths %>% - count(year) %>% - rename(deaths = n) -``` - -7. Number and cost of prescriptions for MS - -Create a table to compare the number and cost of prescribed items for patients with Multiple Sclerosis (MS) by HSCP in 2018/19. Include the number of dispensed items and cost per patient. - -```{R, message=FALSE} -# read in HSCP names (used in exercises 7 & 9) -hscp_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Integration Authority 2019 Lookup.csv") %>% - select( - hscp2019 = IntegrationAuthority2019Code, - hscp_desc = IntegrationAuthority2019Name - ) - -# read in data required from slf episode file - filter for year = 2018/19 -pis_1819 <- read_slf_individual("1819", - col_select = c("hscp2019", "ms", "pis_paid_items", "pis_cost") -) - - -# select all patients with MS & add on HSCP name -ms_1819 <- pis_1819 %>% - filter(ms == 1) %>% - left_join(hscp_lookup) - -# produce summary table -output_table_7 <- ms_1819 %>% - # group by hscp - group_by(hscp2019, hscp_desc) %>% - # sum up number of items, costs & patients with MS (not all will have had prescription) - summarise( - pis_paid_items = sum(pis_paid_items), - pis_cost = sum(pis_cost), - patients = sum(ms) - ) %>% - ungroup() %>% - # calculate number of items / cost per patient - mutate( - items_per_patient = pis_paid_items / patients, - cost_per_patient = pis_cost / patients - ) -``` - -8. A&E attendance in last 3 months of life. - -Produce a table to show the number of deaths in Glasgow City HSCP in 2019/20 and what proportion had an A&E attendance in the last 3 months of life. - -```{r, message=FALSE} -# extract all deaths in Glasgow City in 1920 - Filter year = 1920 -gc_deaths <- read_slf_episode( - year = "1920", - # select columns - col_select = c("anon_chi", "death_date"), - # filter for Glasgow City - partnerships = "S37000015", - # Filter for death records - recids = "NRS" -) %>% - # exclude those with missing chi - filter(anon_chi != "") %>% - # exclude duplicates - distinct(anon_chi, death_date) - -# extract all A&E attendances in 1819 & 1920 -ae <- read_slf_episode( - year = c("1819", "1920"), - # select columns - col_select = c("anon_chi", "recid", "record_keydate1"), - # filter for A&E data - recids = "AE2" -) %>% - # exclude those with missing chi - filter(anon_chi != "") %>% - # rename date of attendance - rename(attendance_date = record_keydate1) - -# select A&E attendances for those individuals who are in the GC deaths file -ae_gc <- ae %>% - # filter A&E attendances for those in deaths file - semi_join(gc_deaths) %>% - # match on date of death - left_join(gc_deaths) - -# select A&E attendances which are within 3 months of death (counted as 91 days) -ae_gc_3m <- ae_gc %>% - # create 3 month interval - mutate(int_3m = interval(death_date - days(91), death_date)) %>% - # flag if attendance is in 3 month interval - mutate(att_3m = if_else(attendance_date %within% int_3m, 1, 0)) %>% - # select only those attendances in 3 months before death - filter(att_3m == 1) - -# create list of patients with A&E attendance in 3m period -pats_ae_3m <- ae_gc_3m %>% - # select only chi and attendance flag - select(anon_chi, att_3m) %>% - # restrict to one row per person - distinct() - -# final output for total number of deaths and number with an A&E attendance in last 3 months -output_table_8 <- gc_deaths %>% - # match on attendance flag - left_join(pats_ae_3m) %>% - # summarise total deaths and deaths with A&E attendance in last 3 months - summarise( - deaths = n(), - deaths_with_ae_att = sum(att_3m, na.rm = TRUE) - ) %>% - # calculate % - mutate(prop_ae_3m = deaths_with_ae_att / deaths) -``` - -9. Non elective admissions in Geriatric Medicine. - -Create a table showing the number of non-elective admissions with any part of the stay (Continuous Inpatient Journey, CIJ) in the specialty Geriatric Medicine, by HSCP in 2019/20. Also include the associated bed days, cost and number of patients. - -```{r, message=FALSE} -# extract data required from episode file -smr_1920 <- read_slf_episode( - year = "1920", - col_select = c( - "anon_chi", "record_keydate1", "record_keydate2", - "spec", "hscp2019", "yearstay", "cost_total_net", - "cij_marker", "cij_pattype" - ), - recids = c("01B", "GLS", "04B") -) %>% - # exclude those with missing chi - filter(anon_chi != "") - -# flag episodes in Geriatric Medicine specialty AB -smr_1920 <- smr_1920 %>% - mutate(ger_med = if_else(spec == "AB", 1, 0)) - -# select only those from non-elective stays -smr_1920_ne <- smr_1920 %>% - filter(cij_pattype == "Non-Elective") - -# aggregate to cij level -# we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB -# take hscp from the last record and sum beddays & cost -cij_1920 <- smr_1920_ne %>% - arrange(anon_chi, cij_marker, record_keydate1, record_keydate2) %>% - group_by(anon_chi, cij_marker) %>% - summarise( - across(record_keydate1, min), - across(c(record_keydate2, ger_med), max), - across(c(cij_pattype, hscp2019), last), - across(c(yearstay, cost_total_net), sum) - ) %>% - ungroup() - -# select only admissions with part of their stay in Geriatric Medicine specialty -cij_ger_med <- cij_1920 %>% - filter(ger_med == 1) - -# aggregate up to patient level -# we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB -# take hscp from the last record and sum beddays & cost -pat_1920 <- cij_ger_med %>% - group_by(anon_chi, hscp2019) %>% - summarise( - across(c(ger_med, yearstay, cost_total_net), sum) - ) %>% - ungroup() - -# produce output -# note patients may be counted in more than one hscp -output_table_9 <- pat_1920 %>% - # match on hscp names - left_join(hscp_lookup) %>% - # group up to hscp level - group_by(hscp2019, hscp_desc) %>% - # sum up measures - summarise( - admissions = sum(ger_med), - beddays = sum(yearstay), - cost = sum(cost_total_net), - patients = n() - ) %>% - ungroup() -``` - diff --git a/vignettes/slfhelper-applications.Rmd b/vignettes/slfhelper-applications.Rmd new file mode 100644 index 0000000..4b4c3c4 --- /dev/null +++ b/vignettes/slfhelper-applications.Rmd @@ -0,0 +1,416 @@ +--- +title: "slfhelper-applications" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{slfhelper-applications} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +```{r setup} +library(slfhelper) +``` + +## Examples using SLFhelper + +1. A&E attendances in East Lothian by age group. + +Produce a table to compare A&E Attendances for the following age groups (0-17, 18-64, 65-74, 75-84, 85+) for 2018/19 in East Lothian HSCP. +```{r chunk2, eval=FALSE, message=FALSE} +# read in data required from slf individual file - filter for year 2018/19 +el_1819 <- read_slf_individual( + year = "1819", + # select variables needed + col_select = c("age", "ae_attendances"), + # filter partnership for East Lothian + partnerships = "S37000010" +) + +# create age bands +age_labs <- c("0-17", "18-64", "65-74", "75-84", "85+") # create age labels + +# create age group variable +el_1819 <- el_1819 %>% + mutate(age_group = cut(age, + breaks = c(-1, 17, 64, 74, 84, 150), labels = age_labs + )) + +# produce summary table +output_table_1 <- el_1819 %>% + group_by(age_group) %>% + summarise(attendances = sum(ae_attendances)) %>% + ungroup() +``` + + +2. Outpatient attendances by specialty and gender. + +Create a table to compare the number of outpatient attendances (SMR00) broken down by specialty and gender in 2017/18 in Scotland. + +```{r chunk3, eval=FALSE, message=FALSE} +# read in specialty lookup with names +spec_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% + select( + spec = Speccode, + spec_name = Description + ) + +# read in data required from slf episode file - filter year = 2017/18 +op_1718 <- read_slf_episode( + year = "1718", + # select columns + col_select = c("recid", "gender", "spec"), + # filter on recid for outpatients + recids = "00B" +) + +# produce output +output_table_2 <- op_1718 %>% + # get counts by specialty and gender + count(spec, gender) %>% + # exclude those with no gender recorded + filter(gender == 1 | gender == 2) %>% + # recode gender into M/F + mutate(gender = recode(as.character(gender), "1" = "Male", "2" = "Female")) %>% + # move gender to separate columns + pivot_wider(names_from = gender, values_from = n) %>% + # match on specialty names + left_join(spec_lookup) %>% + # reorder variables + select(spec, spec_name, Male, Female) +``` + +3. Hospital admissions & beddays by HB of residence. + +Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. +```{r chunk4, eval=FALSE, message=FALSE} +# Read in names for Health Boards +hb_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% + select( + hb2019 = HealthBoardArea2019Code, + hb_desc = HealthBoardArea2019Name + ) + +# read in data required from slf individual file - filter for 2018/19 +indiv_1819 <- read_slf_individual( + year = "1819", + # Select columns of interest + col_select = c( + "hb2019", "cij_el", "cij_non_el", + "acute_el_inpatient_beddays", + "mh_el_inpatient_beddays", + "gls_el_inpatient_beddays", + "acute_non_el_inpatient_beddays", + "mh_non_el_inpatient_beddays", + "gls_non_el_inpatient_beddays" + ) +) + + +# calculate total bed days and add on HB names +indiv_1819_inc_totals <- indiv_1819 %>% + # calculate overall bed days + mutate( + elective_beddays = acute_el_inpatient_beddays + mh_el_inpatient_beddays + + gls_el_inpatient_beddays, + non_elective_beddays = acute_non_el_inpatient_beddays + mh_non_el_inpatient_beddays + + gls_non_el_inpatient_beddays + ) %>% + # match on HB name + left_join(hb_lookup) + +# produce summary table +output_table_3 <- indiv_1819_inc_totals %>% + # group by HB of residence + group_by(hb2019, hb_desc) %>% + # produce summary table + summarise( + elective_adm = sum(cij_el), + non_elective_adm = sum(cij_non_el), + elective_beddays = sum(elective_beddays), + non_elective_beddays = sum(non_elective_beddays) + ) %>% + # calculate average length of stay + mutate( + elective_alos = elective_beddays / elective_adm, + non_elective_alos = non_elective_beddays / non_elective_adm + ) +``` + +4. GP Out of Hours Consulations in South Ayrshire. + +Create a table showing the number of GP Out of Hours consultations for patients with dementia in South Ayrshire HSCP in 2019/20 broken down by type of consultation. +```{r chunk5, eval=FALSE, message=FALSE} +# read in data required from slf episode file - filter for year = 2019/20 +sa_1920 <- read_slf_episode( + year = "1920", + # select columns + col_select = c("dementia", "smrtype"), + # filter for South Ayrshire HSCP + partnerships = "S37000027", + # Filter for GP OOH data + recids = "OoH" +) + +# select dementia patients +sa_dementia_1920 <- sa_1920 %>% + filter(dementia == 1) + +# produce summary table +output_table_4 <- sa_dementia_1920 %>% + count(smrtype) +``` + +5. Costs in Aberdeen City. + +Produce a table to show the number of patients and the total costs for Aberdeen City HSCP in 2018/19. Include a breakdown of costs for the following services: Acute (inpatients & daycases), GLS, Mental Health and Maternity, Outpatients, A&E, GP Out of Hours, Community Prescribing. +```{r chunk6, eval=FALSE, message=FALSE} +# read in data required from slf individual file - filter year = 2018/19 +ab_1819 <- read_slf_individual( + year = "1819", + # select columns + col_select = c( + "acute_cost", "gls_cost", "mh_cost", "mat_cost", + "op_cost_attend", "ae_cost", "ooh_cost", "pis_cost", + "health_net_cost" + ), + # filter for Aberdeen City + partnerships = "S37000001" +) + +# Have used variables which exclude the cost of outpatient attendances which did +# not attend (DNA) but you could also include this if needed. + +# produce summary table +output_table_5 <- ab_1819 %>% + # rename outatients variable + rename(op_cost = op_cost_attend) %>% + # sum of all cost variables and number of patients + summarise(across(ends_with("_cost"), ~ sum(.x, na.rm = TRUE)), + patients = n() + ) %>% + # switch to rows + pivot_longer(everything()) +``` + +6. Deaths from Dementia / Alzheimers + +Produce a chart to show the number of deaths from 2015/16 to 2019/20 in Scotland where the main cause of death was recorded as Dementia/Alzheimers (ICD 10 codes: G30, F01-F03, F05.1). + +```{r chunk7, eval=FALSE, message=FALSE} +# read in data required from slf episode file - filter for years 2015/16 to 2019/20 +deaths <- read_slf_episode( + year = c("1516", "1617", "1718", "1819", "1920"), + # select columns + col_select = c("year", "deathdiag1"), + # Filter for death records + recids = "NRS" +) + +# extract 3 & 4 digit codes and select those with dementia +dementia_deaths <- deaths %>% + # extract 3 & 4 digit ICD 10 codes + mutate( + diag_3d = str_sub(deathdiag1, 1, 3), + diag_4d = str_sub(deathdiag1, 1, 4) + ) %>% + # select dementia codes + filter(diag_3d == "G30" | diag_3d == "F00" | diag_3d == "F01" | + diag_3d == "F02" | diag_3d == "F03" | diag_4d == "F051") + +# produce summary table +output_table_6 <- dementia_deaths %>% + count(year) %>% + rename(deaths = n) +``` + +7. Number and cost of prescriptions for MS + +Create a table to compare the number and cost of prescribed items for patients with Multiple Sclerosis (MS) by HSCP in 2018/19. Include the number of dispensed items and cost per patient. + +```{r chunk8, eval=FALSE, message=FALSE} +# read in HSCP names (used in exercises 7 & 9) +hscp_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Integration Authority 2019 Lookup.csv") %>% + select( + hscp2019 = IntegrationAuthority2019Code, + hscp_desc = IntegrationAuthority2019Name + ) + +# read in data required from slf episode file - filter for year = 2018/19 +pis_1819 <- read_slf_individual("1819", + col_select = c("hscp2019", "ms", "pis_paid_items", "pis_cost") +) + + +# select all patients with MS & add on HSCP name +ms_1819 <- pis_1819 %>% + filter(ms == 1) %>% + left_join(hscp_lookup) + +# produce summary table +output_table_7 <- ms_1819 %>% + # group by hscp + group_by(hscp2019, hscp_desc) %>% + # sum up number of items, costs & patients with MS (not all will have had prescription) + summarise( + pis_paid_items = sum(pis_paid_items), + pis_cost = sum(pis_cost), + patients = sum(ms) + ) %>% + ungroup() %>% + # calculate number of items / cost per patient + mutate( + items_per_patient = pis_paid_items / patients, + cost_per_patient = pis_cost / patients + ) +``` + +8. A&E attendance in last 3 months of life. + +Produce a table to show the number of deaths in Glasgow City HSCP in 2019/20 and what proportion had an A&E attendance in the last 3 months of life. + +```{r chunk9, eval=FALSE, message=FALSE} +# extract all deaths in Glasgow City in 1920 - Filter year = 1920 +gc_deaths <- read_slf_episode( + year = "1920", + # select columns + col_select = c("anon_chi", "death_date"), + # filter for Glasgow City + partnerships = "S37000015", + # Filter for death records + recids = "NRS" +) %>% + # exclude those with missing chi + filter(anon_chi != "") %>% + # exclude duplicates + distinct(anon_chi, death_date) + +# extract all A&E attendances in 1819 & 1920 +ae <- read_slf_episode( + year = c("1819", "1920"), + # select columns + col_select = c("anon_chi", "recid", "record_keydate1"), + # filter for A&E data + recids = "AE2" +) %>% + # exclude those with missing chi + filter(anon_chi != "") %>% + # rename date of attendance + rename(attendance_date = record_keydate1) + +# select A&E attendances for those individuals who are in the GC deaths file +ae_gc <- ae %>% + # filter A&E attendances for those in deaths file + semi_join(gc_deaths) %>% + # match on date of death + left_join(gc_deaths) + +# select A&E attendances which are within 3 months of death (counted as 91 days) +ae_gc_3m <- ae_gc %>% + # create 3 month interval + mutate(int_3m = interval(death_date - days(91), death_date)) %>% + # flag if attendance is in 3 month interval + mutate(att_3m = if_else(attendance_date %within% int_3m, 1, 0)) %>% + # select only those attendances in 3 months before death + filter(att_3m == 1) + +# create list of patients with A&E attendance in 3m period +pats_ae_3m <- ae_gc_3m %>% + # select only chi and attendance flag + select(anon_chi, att_3m) %>% + # restrict to one row per person + distinct() + +# final output for total number of deaths and number with an A&E attendance in last 3 months +output_table_8 <- gc_deaths %>% + # match on attendance flag + left_join(pats_ae_3m) %>% + # summarise total deaths and deaths with A&E attendance in last 3 months + summarise( + deaths = n(), + deaths_with_ae_att = sum(att_3m, na.rm = TRUE) + ) %>% + # calculate % + mutate(prop_ae_3m = deaths_with_ae_att / deaths) +``` + +9. Non elective admissions in Geriatric Medicine. + +Create a table showing the number of non-elective admissions with any part of the stay (Continuous Inpatient Journey, CIJ) in the specialty Geriatric Medicine, by HSCP in 2019/20. Also include the associated bed days, cost and number of patients. + +```{r chunk10, eval=FALSE, message=FALSE} +# extract data required from episode file +smr_1920 <- read_slf_episode( + year = "1920", + col_select = c( + "anon_chi", "record_keydate1", "record_keydate2", + "spec", "hscp2019", "yearstay", "cost_total_net", + "cij_marker", "cij_pattype" + ), + recids = c("01B", "GLS", "04B") +) %>% + # exclude those with missing chi + filter(anon_chi != "") + +# flag episodes in Geriatric Medicine specialty AB +smr_1920 <- smr_1920 %>% + mutate(ger_med = if_else(spec == "AB", 1, 0)) + +# select only those from non-elective stays +smr_1920_ne <- smr_1920 %>% + filter(cij_pattype == "Non-Elective") + +# aggregate to cij level +# we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB +# take hscp from the last record and sum beddays & cost +cij_1920 <- smr_1920_ne %>% + arrange(anon_chi, cij_marker, record_keydate1, record_keydate2) %>% + group_by(anon_chi, cij_marker) %>% + summarise( + across(record_keydate1, min), + across(c(record_keydate2, ger_med), max), + across(c(cij_pattype, hscp2019), last), + across(c(yearstay, cost_total_net), sum) + ) %>% + ungroup() + +# select only admissions with part of their stay in Geriatric Medicine specialty +cij_ger_med <- cij_1920 %>% + filter(ger_med == 1) + +# aggregate up to patient level +# we want to keep eariest admission and latest discharge, keep flag if any episode was in spec AB +# take hscp from the last record and sum beddays & cost +pat_1920 <- cij_ger_med %>% + group_by(anon_chi, hscp2019) %>% + summarise( + across(c(ger_med, yearstay, cost_total_net), sum) + ) %>% + ungroup() + +# produce output +# note patients may be counted in more than one hscp +output_table_9 <- pat_1920 %>% + # match on hscp names + left_join(hscp_lookup) %>% + # group up to hscp level + group_by(hscp2019, hscp_desc) %>% + # sum up measures + summarise( + admissions = sum(ger_med), + beddays = sum(yearstay), + cost = sum(cost_total_net), + patients = n() + ) %>% + ungroup() +``` + diff --git a/vignettes/using-arrow-table.Rmd b/vignettes/using-arrow-table.Rmd new file mode 100644 index 0000000..8ec2aa9 --- /dev/null +++ b/vignettes/using-arrow-table.Rmd @@ -0,0 +1,52 @@ +--- +title: "using-arrow-table" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{using-arrow-table} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +```{r setup} +library(slfhelper) +``` + +## Using Parquet files with the arrow package + +The SLFs are available in parquet format. The {arrow} package gives some extra features which can speed up and reduce memory usage even further. You can read only specific columns `read_parquet(file, col_select = c(var1, var2))`. + +Using arrow’s ‘Arrow Table’ feature, you can speed up analysis efficiently. To do this, specify `as_data_frame = FALSE` when using SLFhelper and `dplyr::collect()` to read the data. + + +#### For example: + +Planned and unplanned beddays in Scotland +```{r chunk1, eval=FALSE, message=FALSE} +# Filter for year of interest +slf_extract <- read_slf_episode(c("1819", "1920"), + # Select recids of interest + recids = c("01B", "GLS", "04B"), + # Select columns + col_select = c( + "year", "anon_chi", "recid", + "yearstay", "age", "cij_pattype" + ), + # return an arrow table + as_data_frame = FALSE +) %>% + # Filter for non-elective and elective episodes + filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + # Group by year and cij_pattype for analysis + group_by(year, cij_pattype) %>% + # summarise bedday totals + summarise(beddays = sum(yearstay)) %>% + # collect the arrow table + dplyr::collect() +``` From 0c54bd434033dacecc3c4fdebe5f7879582d84f5 Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Mon, 29 Jul 2024 17:49:19 +0100 Subject: [PATCH 22/29] add a comparison table to show the efficiency improvement --- vignettes/using-arrow-table.Rmd | 56 ++++++++++++++++++++++++--------- 1 file changed, 42 insertions(+), 14 deletions(-) diff --git a/vignettes/using-arrow-table.Rmd b/vignettes/using-arrow-table.Rmd index 8ec2aa9..6e9d073 100644 --- a/vignettes/using-arrow-table.Rmd +++ b/vignettes/using-arrow-table.Rmd @@ -1,5 +1,5 @@ --- -title: "using-arrow-table" +title: "Using Parquet files with the arrow package" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{using-arrow-table} @@ -14,23 +14,22 @@ knitr::opts_chunk$set( ) ``` -```{r setup} -library(slfhelper) -``` +## Using Parquet files with the arrow package -## Using Parquet files with the arrow package +The SLFs are available in parquet format. The {arrow} package gives some extra features which can speed up and reduce memory usage even further. You can read only specific columns `read_parquet(file, col_select = c(var1, var2))`. -The SLFs are available in parquet format. The {arrow} package gives some extra features which can speed up and reduce memory usage even further. You can read only specific columns `read_parquet(file, col_select = c(var1, var2))`. +Using arrow's 'Arrow Table' feature, you can speed up analysis efficiently. To do this, specify `as_data_frame = FALSE` when using SLFhelper and `dplyr::collect()` to read the data. -Using arrow’s ‘Arrow Table’ feature, you can speed up analysis efficiently. To do this, specify `as_data_frame = FALSE` when using SLFhelper and `dplyr::collect()` to read the data. +#### For example: +Imagine a scenario of analysing planned and unplanned beddays in Scotland, there are two ways to read the episode files and do analysis by setting `as_data_frame` to be `TRUE` or `FALSE` as follows. -#### For example: +```{r arrow, eval=FALSE, message=FALSE} +library(slfhelper) -Planned and unplanned beddays in Scotland -```{r chunk1, eval=FALSE, message=FALSE} +## FAST METHOD # Filter for year of interest -slf_extract <- read_slf_episode(c("1819", "1920"), +slf_extract1 <- read_slf_episode(c("1819", "1920"), # Select recids of interest recids = c("01B", "GLS", "04B"), # Select columns @@ -42,11 +41,40 @@ slf_extract <- read_slf_episode(c("1819", "1920"), as_data_frame = FALSE ) %>% # Filter for non-elective and elective episodes - filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + dplyr::filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% # Group by year and cij_pattype for analysis - group_by(year, cij_pattype) %>% + dplyr::group_by(year, cij_pattype) %>% # summarise bedday totals - summarise(beddays = sum(yearstay)) %>% + dplyr::summarise(beddays = sum(yearstay)) %>% # collect the arrow table dplyr::collect() + +## SLOW and DEFAULT Method +# Filter for year of interest +slf_extract2 <- read_slf_episode(c("1819", "1920"), + # Select recids of interest + recids = c("01B", "GLS", "04B"), + # Select columns + col_select = c( + "year", "anon_chi", "recid", + "yearstay", "age", "cij_pattype" + ), + # return an arrow table + as_data_frame = TRUE # which is default +) %>% + # Filter for non-elective and elective episodes + dplyr::filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + # Group by year and cij_pattype for analysis + dplyr::group_by(year, cij_pattype) %>% + # summarise bedday totals + dplyr::summarise(beddays = sum(yearstay)) ``` + +By specifying `as_data_frame = FALSE` when using reading SLF functions, one enjoys great advantages of `parquet` files. One of the advantages is fast query processing by reading only the necessary columns rather than entire rows. The table below demonstrates the huge impact of those advantages. + +| | Time consumption (seconds) | Memory usage (MiB) | +|-------------------------|:--------------------------:|:------------------:| +| `as_data_frame = TRUE` | 4.46 | 553 | +| `as_data_frame = FALSE` | 1.82 | 0.43 | + +: Comparison of different ways of reading SLF files From 79148e59733f8e620adc572e9461a42fab370232 Mon Sep 17 00:00:00 2001 From: Jennifer Thom Date: Fri, 16 Aug 2024 08:22:51 +0100 Subject: [PATCH 23/29] Update - round memory size --- vignettes/slf-documentation.Rmd | 38 +++++++++++++++++---------------- 1 file changed, 20 insertions(+), 18 deletions(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index 2f0ae58..716e9c0 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -53,26 +53,28 @@ The tables below show the memory usage of each full size SLF. ## Episode File -| Year | Memory usage (MB)| -| ------------- |:----------------:| -| 1718 | 2651.2 | -| 1819 | 3196.5 | -| 1920 | 3145.4 | -| 2021 | 2715.6 | -| 2122 | 2959.3 | -| 2223 | 2995.1 | -| 2324 | 1894.5 | +| Year | Memory usage| +| | (GB) | +| ------------- |:-----------:| +| 1718 | 2.5 | +| 1819 | 3.5 | +| 1920 | 3.5 | +| 2021 | 3 | +| 2122 | 3 | +| 2223 | 3 | +| 2324 | 2 | ## Individual File -| Year | Memory usage (MB)| -| ------------- |:----------------:| -| 1718 | 1055.6 | -| 1819 | 1057.8 | -| 1920 | 1070.7 | -| 2021 | 1067.3 | -| 2122 | 1081.2 | -| 2223 | 1098.7 | -| 2324 | 775.5 | +| Year | Memory usage| +| | (GB) | +| ------------- |:-----------:| +| 1718 | 1.5 | +| 1819 | 1.5 | +| 1920 | 1.5 | +| 2021 | 1.5 | +| 2122 | 1.5 | +| 2223 | 1.5 | +| 2324 | 1 | From fb509d05c27f210d7abdd6e74148ca26f38d1b1e Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Fri, 16 Aug 2024 10:36:58 +0100 Subject: [PATCH 24/29] replace columns by col_select and add tidyselect --- vignettes/variable-packs.Rmd | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/vignettes/variable-packs.Rmd b/vignettes/variable-packs.Rmd index 61e0072..65560e0 100644 --- a/vignettes/variable-packs.Rmd +++ b/vignettes/variable-packs.Rmd @@ -16,10 +16,11 @@ knitr::opts_chunk$set( ## Selecting only specified variables -It is recommended to only choose the variables you need when reading in a Source Linkage File. This can be achieved by specifying a `column` argument to the relevant `read_slf_` function. +It is recommended to only choose the variables you need when reading in a Source Linkage File. This can be achieved by specifying a `col_select` argument to the relevant `read_slf_` function. This will result in the data being read in much faster as well as being easy to work with. The full episode and individual files have 200+ and 100+ variables respectively! + ```{r load-package, include=FALSE} library(slfhelper) ``` @@ -27,9 +28,25 @@ library(slfhelper) ```{r column-example, eval=FALSE} library(slfhelper) -ep_data <- read_slf_episode(year = 1920, columns = c("year", "anon_chi", "recid")) +ep_data <- read_slf_episode(year = 1920, col_select = c("year", "anon_chi", "recid")) + +indiv_data <- read_slf_individual(year = 1920, col_select = c("year", "anon_chi", "nsu")) +``` -indiv_data <- read_slf_individual(year = 1920, columns = c("year", "anon_chi", "nsu")) +## Selecting variables using `tidyselect` functions +It is now allowed to use `tidyselect` functions, such as `contains()` and `start_with()`, to select variables in relevant `read_slf_` function. One can also mix `tidyselect` functions with specified variables when selecting. + +```{r tidyselect, eval=FALSE} +library(slfhelper) +ep_data <- + read_slf_episode(year = 1920, + col_select = !tidyselect::contains("keytime")) + +indiv_data <- + read_slf_individual( + year = 1920, + col_select = c("year", "anon_chi", "nsu", tidyselect::starts_with("sds")) + ) ``` ## Looking up variable names @@ -85,7 +102,7 @@ For example to take some demographic data and LTC flags from the individual file ```{r use-ltc-indiv, eval=FALSE} library(slfhelper) -indiv_ltc_data <- read_slf_individual(year = 1920, columns = c("year", demog_vars, ltc_vars)) +indiv_ltc_data <- read_slf_individual(year = 1920, col_select = c("year", demog_vars, ltc_vars)) ``` @@ -95,7 +112,7 @@ library(slfhelper) acute_beddays <- read_slf_episode( year = 1920, - columns = c("year", "anon_chi", "hbtreatcode", "recid", ep_file_bedday_vars, "cij_pattype"), + col_select = c("year", "anon_chi", "hbtreatcode", "recid", ep_file_bedday_vars, "cij_pattype"), recid = c("01B", "GLS") ) ``` From 2c38d4c68021b9ba6c58a07dac018f34f1032bc6 Mon Sep 17 00:00:00 2001 From: lizihao-anu Date: Fri, 16 Aug 2024 09:41:35 +0000 Subject: [PATCH 25/29] Style package --- vignettes/variable-packs.Rmd | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/vignettes/variable-packs.Rmd b/vignettes/variable-packs.Rmd index 65560e0..78b0033 100644 --- a/vignettes/variable-packs.Rmd +++ b/vignettes/variable-packs.Rmd @@ -39,8 +39,10 @@ It is now allowed to use `tidyselect` functions, such as `contains()` and `start ```{r tidyselect, eval=FALSE} library(slfhelper) ep_data <- - read_slf_episode(year = 1920, - col_select = !tidyselect::contains("keytime")) + read_slf_episode( + year = 1920, + col_select = !tidyselect::contains("keytime") + ) indiv_data <- read_slf_individual( From 9e04b8d75feb5bf64dd237d84d34ec79fd80c49a Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Fri, 16 Aug 2024 16:05:45 +0100 Subject: [PATCH 26/29] update ep_file_vars and indiv_file_vars --- data/ep_file_vars.rda | Bin 1438 -> 4673 bytes data/indiv_file_vars.rda | Bin 1196 -> 4272 bytes 2 files changed, 0 insertions(+), 0 deletions(-) diff --git a/data/ep_file_vars.rda b/data/ep_file_vars.rda index e0279d813e53b81338c7feebc899a6c97dedcc25..d0e91505cb7ac45734c66c27da933d72bcbd5a4f 100644 GIT binary patch literal 4673 zcmZ`-+iu)85Zz?g-uv1~T(>Cl3kB?Tnx^kXKcHxVyb)ZAyJ9Sf3Mtvx{(6C;XC#h9 z$?1b(ox|af9L|mV$8Vn>R-ads$z(d2y?Q;FUf^?b@zHrtAu!T^m z3Hb7)i$wD})5IRaz*L@i&gH2rom8x$w7ya{;nBBABAoW6ExbA4D4CK$a-AM3r)4aB zo#;%S1=hqI>1`22Z%Su!>CU1ZAB^34oVQ5+XLO~^7lEmd-63|w_)5A;kbS}m8)9kQ zh`T1Ploi=WL`PZ>Z_0#^ye2G@JhEe>4wUevX0e7#jE}h`O>GXv=F6;-mFZ4CixjM2 z%CEJ|Om$`x%F>C0P7- z05eO~1HvSc;4qd{_^estbso02_ji3F9#*=Q|&mF6<0cCvD~j-xrm>{^inx|mw1 zr$J_0;kt$rOJ&P}$TrcrE-Euva0?ec=t}!o`Bs;xQKd(A?(^wFr2}=lR3D|uY8WJv zn&$&eS0xg)DUkOE^v->2?iyJi)hSlDJ!IA!(c?8?bd zWV_hf2AZmDM{d&eG!|SDH;|h4UBiM3WK;q-a3}dUvd;eQp3~H2dH~|own)7yzkxX! zUMd-lwur~ByvdXO-Tv-rFPtnM!`W{ln(`0YuZ+{)?QraCZ4gD>=nv6oyUXG1A774- zGz=ud$%Z=K$2wSIq-9aNPl65>P6YgMgdawDALF^oYL9#vrmxZi^?*@TZi`MFUWfpN zrcSe4-$;cBL^C7W%-U3-pl=oxd_+;fJ=82m0uM4%G)gR9q1WPXnN-K{eF+-=642ac zNhFA}?-j@UyaAOBqKgLhQ>pfI>pQ0e2F&Z(a?eQ{V*8O8mv;<3L}CvkF|PR#`w)qJ z7>RKsaE-S8?p`Q#lY{1VEi($mjM~w#zm=~KIplg~d6Zce6C1CyV=7%92`!Y(^VE7S zUJGh9sq0}2QJ)!OYBGlyeL^Hp?PZLqw~R40mNBNTGRD+W#+dra7*jJDW9lR$JVmAE zt>bA_nY?=%Edu7N+DTMx=ggDL^8O*SL}K-}c8UXiLt;(C*E`}099$pjM;<^77EGVJj(MsSJHt(ASy3U$ zdPRj8OUU(%Dku4cYW-VOvcb)gDM3D_;DX z#Nk$Dh8CjGo)_Lwu;B@3qIb`{NPXme7sv;lM=3SFv1Qifk z30vu2scU+*qHA-N=H1>jcZAFeTMDaG?Od0-Zd>+XB_7&g3_Vvk{pOLvO(~qPo3XJL zT53FYd%uf0-ha&TDPoS#vL<~B*|cxN=eV!v3A$F9DNwW{YFk*i23l}~W3u&^xIJ`S zy&meQ>8_)Tz5$gX>%m5CHu3!nb*@rT$ea(q1|0naumXjj6ewG3?saiAC+A?tcpgjxhXnV4W&@Bsed=vQtD9TQ;KDuo1;}z3OzyqQG05;|w$pmkH3)7!D&)FyU(-BwEK^boB6rUcJB#M9!YbZ zzlV+&+)kB=`z!}?QrS3ao$P3=(rwL4moR0fY{fuFiYzD6(8R6|gF@KCMrm?dnF}z- z9X@&>`Y57reSWkpYcO)5A$VTFdf4~bHf1N?ohz(%KYSi`wW(R~O`5`w44g9Ud~xC9 z>5+xWwvw5q_6wF|y*Av0-^MDO01ChY5MiKH0$@-R;}rq`df^QC6FLMj=wPb+{Y*(Z zkqB$VHZx^ZR!~+ERuVTzBOqd0*8k$yBoLrVu4rIm46PwxF#XJH#`5Cg)&5I(%( z843^;Yj!f+wG5gwGxCHzF3vVIAgdil9R}7}_r{6T(k&8tgH~V>X;Mgt)KqEXC5Y2v zDLnFlsE`3F-1@T1Eah=M0lAkT`RmA;R_fVI9qj4~D#~HrMc8mj87S)mKylwuWm3bQ z5ISCyR1%TV3H!JlM#$YT;V3eixlAMUzmlzXly43>yPX4$7v3G#uM5XX7yxDdP#2qN z<}kLduJ=y|id>nTl%=Cbi0dav(*c<5GzXA1$`9*H2i)W9z>(&`R*wPl^R3EkNV!P~MnIqllv z{kt8NEX+^4GQG1n_x_Z|Rx}P=bQ`Qj`cpnh&P>4=odu^(I@qsUs1_?GV(BEUxS(S; zX`F2W*1o4}R%yKT*iC!VGubFqSxD{9%xj8CVK7iYb4>PT4|xX2qaf;bi8X_F(nQfG z!Op&VU{{x3(o(G8`C1k38v~u_7D!6vZ>bEJQLe&H-4iYQe(T zKu~Lc&P3p%U8rNcNJvPWtsahIt*%5G2xMkg5wHu2OJraaf&#|eY@(h#sV#_7Vrm4N z!5%grx-6Y;hZQ~Du1&2oH9`hWBnEW|Q7D_oCxr4S;zU`vcWy9_-vz92(sr?w=Gdn4 zMVy#~Hg6#$BtD7CC*K~LJUAu}wCwe3ShWCnL}t4LtGnC1#PjI@=U&s*)bRLKO`Cc9 z3E6C?b=d2-1|?|xG&^jcv2Qob4C#c9n1r@O=Gabk&U6vllQAp87)58`JQFB^kDc-m z)k$Oy15(_`1h5{eXi0{MuE#Ozh?}_Rd1TEq0~&Lo1%Kk^1Gl^-xR)^9;2wz(V%MxD sgZgGhk)YwEM^s-mdv0Y|hX_(TMvS*RXXoP2KYsh^rPDb-1LITYyfc9R zXR_pSM>aASWJk3D=nDS*>*+o%RY_PTy?wbAB^Nrdl`@toF1+34YTY_km8Q&FbHhkZ ztuU6DdR<6kq$)i(V;N!65K&13QpUprCjTNUlIeia+ zQZNuT?FfrXi=8MfUBlcd_gRJUr*P&r~F!Q zPTxZcUm9BENK zEQ%Y$%BJ3ZhasOEXCF$1m_^JaMZAmIcusRievs}J=Fo7XB1?1a0faoR!91DiT|^{_ z&Lc9;a|v9u$Vu$yl#j!mbQF$~eE{xgO%wSqr+nfeT!u$7XqfRfKk5Q~hasOEXP+$G zB4(2G?qW7h>^nzBlTV_TM-H*TUOf^(DSz*RkLI0>@Nm*89%Y~Akq{bbp(;{}H>OHM zsI}0kY027lHwD%?d&;-38H(X04XtM!oDtTe0x!iB%;Qn|(8VJ!XIE5mBCRNbe1G4I z4^4F7>`WD6=hGTGfbyMW;cmHf2oIOWKA_CZ>Sx^DmCf3_sptGF7H*&`XEL>eL?YQa z!VTOFrhp#a)~1cY%6CmjY3kfU&xbq6$|9sW;&3I!H(3uLv`WqcmdH|OLpjC_jz)sb zId{9p-L79##aqn9o8%g(lQCB<5eJMimdZ~7+TXmeBsDw6u76b@Br=!np53w z@(}=5CeG`G+OMRaj36Bs%igT3CK#vweB;#KRpN7Lc_j14zLRr}5ORS372#&;2Dd>CUD$yQJHrP_E{FmO*65fGOxA3UcDkxnpIT4$`TG!1W)y; z*b(##c#LDY$vRhxuQv*VW4H#OvAQ=*TgxezTX>5@5iol0O75$D*;LzGu(f@m1Krfv zTU{ux|9P>mtT)bGQT^|JsAp^YQU*Iz(c4`pp#M2&Yv+Np_J|D+6${R7u{?0u)X7`@ zP$F;le04%}TNn=vxuEergo!W#PuGy$%s>vB5@Vr!hy6S8Fz6q^ah7dU$DzX_uWrt0 zNH;_!P1es0tB^ymUwptNqlj&+z|D*- z`08=>xLXlT%U2&VM=*TYK|$aEcy1AMf7cQYQOhn7+oT;w-yiddLge9R7Hd3Ujo^`} z^lSY>+{Ap=VE>ZX&xyUlc8}$?vF%%9n62H%Zd7$G2`oeCe1pu}RVcdyvhcOsYpg)- zO|%Of3ud5auR>I$QdJG#)Q1o0sZk8tA$;RPmyw7437_6_C(A9|POx+Q_z$pfrmiYz z&SaxiL2CGbg~yQ#4n6(K`81kuO24-F4dC{@5=3p>3&DVPs3R~1Tyf#eFajICo%X8U VWyVwu_im{yc`emj34d((?|*NZQh5LX literal 1196 zcmV;d1XKG$T4*^jL0KkKS@fq!=Kup8|A_zp|NnpjeP9Iu7C^uM|F8rA00B@0-rUZ@ zXH0Kn$Bvrhpm%Xfy$!2AGh9 z6V#d;QwMr?x;k|TyR zcKCUC{`;H9%hDdZlLvQa8KgK2;;3YmOB*UNq_e$V^u$`HCNt^XHUPyLK3_iN(6Zn} zyCAMRao*;+LcBY<*;rkrh)kX`4yOk}To_RmQAHLtky@sydO>TNDF}u8?z+5F=Aubm zi6nmf)#Zfhs$#GD02^3nkIUt)2y;Ab=cTQy;p-1fvyaI6Zb!!juPi|^*(qvP(^ zNv=!v(r$cTQE^6jZ-G`pK)Zxu`knI@)5KSDY$D-Xi;__h5p0Ena@P(i-!Wc;^w}$m z2tn6&?6(qm(?<)z&_EC&1a(wRV!4rrr){y!`32aBR>YxIwwQ4Bb6wB2{9sGNS`kUN ze4~F$YoD?n{ZT!*;^)qSjKU0}3r=uw@f@U$49OCq3ihZgK!dp}0uVtrpvALIjVmA= zJJbe-YmRIKXt8+6!~KY*WhiKE->YWMUY{MF*H+wyko0jI=nEU#lmd&W(@hz;$_mn_ zo}QY;Y3u2p{x2oQK*r;k7lazi%Hdl?K{r=1?1*>RDGz=J^) zHM5K$1xN&-iHRt}KqP{V6rJlKNxDFx5$6GWHayxEu6>c5Hik;jg7ItXFwM`a4gz{W zO9Zg%!!#+-2$GS7N*Idrm}9l1g`uP)NSKs~!V)CON-$DkAjHhoNkAp&--jby3@(gC z Date: Fri, 16 Aug 2024 16:07:19 +0100 Subject: [PATCH 27/29] add session memory recommendation --- vignettes/slf-documentation.Rmd | 117 +++++++++++++++++++------------- 1 file changed, 68 insertions(+), 49 deletions(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index 716e9c0..0d62dc0 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -14,67 +14,86 @@ knitr::opts_chunk$set( ) ``` - ```{r setup, include = FALSE} library(slfhelper) library(tidyverse) ``` -## SLFhelper -`SLFhelper` contains some easy to use functions designed to make working with the Source Linkage Files (SLFs) as efficient as possible. - -### Filter functions: -- `year` returns financial year of interest. You can also select multiple years using `c("1718", "1819", "1920")` -- `recid` returns recids of interest. Selecting this is beneficial for specific analysis. -- `partnerships` returns partnerships of interest. Selecting certain partnerships will reduce the SLFs size. -- `col_select` returns columns of interest. This is the best way to reduce the SLFs size. - -### Data snippets: -- `ep_file_vars` returns a list of all variables in the episode files. -- `indiv_file_vars` returns a list of all variables in the individual files. -- `partnerships` returns a list of partnership names (HSCP_2018 codes) -- `recid` returns a list of all recids available in the SLFs. -- `ep_file_bedday_vars` returns a list of all bedday related variables in the SLFs. -- `ep_file_cost_vars` returns a list of all cost related variables in the SLFs. - -### Anon CHI -- Use the function `get_chi()` to easily switch `anon_chi` to `chi`. -- Use the function `get_anon_chi()` to easily switch `chi` to `anon_chi`. - +## SLFhelper -### Memory usage in SLFS +`SLFhelper` contains some easy to use functions designed to make working with the Source Linkage Files (SLFs) as efficient as possible. -While working with the Source Linkage Files (SLFs), it is recommended to use the features of the SLFhelper package to maximase the memory usage in posit, see [PHS Data Science Knowledge Base](https://public-health-scotland.github.io/knowledge-base/docs/Posit%20Infrastructure?doc=Memory%20Usage%20in%20SMR01.md) for further guidance on memory usage in posit workbench. - -Reading a full SLF file can be time consuming and take up resources on posit workbench. In the episode file there are `251 variables` and around `12 million rows` compared to the individual file where there are `193 variables` and around `6 million rows` in each file. This can be reduced by using available selections in SLFhelper to help reduce the size of the SLFs for analysis and to free up resources in posit workbench. +### Filter functions: -The tables below show the memory usage of each full size SLF. +- `year` returns financial year of interest. You can also select multiple years using `c("1718", "1819", "1920")` +- `recid` returns recids of interest. Selecting this is beneficial for specific analysis. +- `partnerships` returns partnerships of interest. Selecting certain partnerships will reduce the SLFs size. +- `col_select` returns columns of interest. This is the best way to reduce the SLFs size. +### Data snippets: -## Episode File +- `ep_file_vars` returns a list of all variables in the episode files. +- `indiv_file_vars` returns a list of all variables in the individual files. +- `partnerships` returns a list of partnership names (HSCP_2018 codes) +- `recid` returns a list of all recids available in the SLFs. +- `ep_file_bedday_vars` returns a list of all bedday related variables in the SLFs. +- `ep_file_cost_vars` returns a list of all cost related variables in the SLFs. -| Year | Memory usage| -| | (GB) | -| ------------- |:-----------:| -| 1718 | 2.5 | -| 1819 | 3.5 | -| 1920 | 3.5 | -| 2021 | 3 | -| 2122 | 3 | -| 2223 | 3 | -| 2324 | 2 | +### Anon CHI +- Use the function `get_chi()` to easily switch `anon_chi` to `chi`. +- Use the function `get_anon_chi()` to easily switch `chi` to `anon_chi`. -## Individual File +### Memory usage in SLFS -| Year | Memory usage| -| | (GB) | -| ------------- |:-----------:| -| 1718 | 1.5 | -| 1819 | 1.5 | -| 1920 | 1.5 | -| 2021 | 1.5 | -| 2122 | 1.5 | -| 2223 | 1.5 | -| 2324 | 1 | +While working with the Source Linkage Files (SLFs), it is recommended to use the features of the SLFhelper package to maximase the memory usage in posit, see [PHS Data Science Knowledge Base](https://public-health-scotland.github.io/knowledge-base/docs/Posit%20Infrastructure?doc=Memory%20Usage%20in%20SMR01.md) for further guidance on memory usage in posit workbench. +Reading a full SLF file can be time consuming and take up resources on posit workbench. In the episode file there are `r length(slfhelper::ep_file_vars)` variables and around 12 million rows compared to the individual file where there are `r length(slfhelper::indiv_file_vars)` variables and around 6 million rows in each file. This can be reduced by using available selections in SLFhelper to help reduce the size of the SLFs for analysis and to free up resources in posit workbench. + +The tables below show the memory usage of each full size SLF. + +#### Episode File + +| Year | Memory Usage (GiB) | +|------|:------------------:| +| 1718 | 22 | +| 1819 | 22 | +| 1920 | 22 | +| 2021 | 19 | +| 2122 | 21 | +| 2223 | 21 | +| 2324 | 18 | + +#### Individual File + +| Year | Memory Usage (GiB) | +|------|:------------------:| +| 1718 | 6.8 | +| 1819 | 6.8 | +| 1920 | 7.0 | +| 2021 | 7.0 | +| 2122 | 7.0 | +| 2223 | 7.1 | +| 2324 | 5.1 | + +If one can use selection features in SLFhelper, the session memory requirement can be reduced. There are `r length(slfhelper::ep_file_vars)` columns for a year episode file of size around 20 GiB. Hence, on average, a column with all rows takes around 0.1 GiB, which can give a rough estimate on the session memory one needs. Taking Year 1920 as a demonstration, the following tables present various sizes of extracts from the SLF files, from 5 columns to all columns, along with the amount of memory required to work with the data one reads in. Keep in mind that tables below are just recommendations, and that memory usage depends on how one handles data and optimises data pipeline. + + +#### Episode File +| Column Number | Memory usage (GiB) | Session Memory Recommendation | +|---------------|:------------------:|---------------------------------------------------| +| 5 | 0.5 | 4 GiB (4096 MiB) | +| 10 | 1.4 | between 4 GiB (4096 MiB) and 8 GiB (8192 MiB) | +| 50 | 5.1 | between 8 GiB (8192 MiB) and 16 GiB (16384 MiB) | +| 150 | 13 | between 20 GiB (20480 MiB) and 38 GiB (38912 MiB) | +| 251 | 22 | between 32 GiB (32768 MiB) and 64 GiB (65536 MiB) | + +#### Individual File + +| Column Number | Memory usage (GiB) | Session Memory Recommendation | +|---------------|:------------------:|---------------------------------------------------| +| 5 | 0.7 | 4 GiB (4096 MiB) | +| 10 | 0.8 | 4 GiB (4096 MiB) | +| 50 | 2.2 | between 4 GiB (4096 MiB) and 8 GiB (8192 MiB) | +| 150 | 5.5 | between 8 GiB (8192 MiB) and 16 GiB (16384 MiB) | +| 193 | 7.0 | between 11 GiB (11264 MiB) and 21 GiB (21504 MiB) | From 362789d43bb1f22c35e09da348389903d4aa3026 Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Fri, 16 Aug 2024 16:12:20 +0100 Subject: [PATCH 28/29] Update R-CMD-check.yaml --- .github/workflows/R-CMD-check.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index 35aa114..ce2fba5 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -18,7 +18,7 @@ jobs: strategy: fail-fast: false matrix: - r_version: ['3.6.1', '4.0.2', '4.1.2', 'release', 'devel'] + r_version: ['4.0.2', '4.1.2', 'release'] env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} From 6b9061e397d36696e32e0729e42eef179aabef88 Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Fri, 16 Aug 2024 16:42:29 +0100 Subject: [PATCH 29/29] fix cmd build error --- vignettes/slf-documentation.Rmd | 1 - 1 file changed, 1 deletion(-) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd index 0d62dc0..fcebbbc 100644 --- a/vignettes/slf-documentation.Rmd +++ b/vignettes/slf-documentation.Rmd @@ -16,7 +16,6 @@ knitr::opts_chunk$set( ```{r setup, include = FALSE} library(slfhelper) -library(tidyverse) ``` ## SLFhelper