Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug - speed up get_chi() #68

Merged
merged 18 commits into from
Dec 12, 2023
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .Renviron
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
TESTTHAT_CPUS = 12
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Type: Package
Package: slfhelper
Title: Useful functions for working with the Source Linkage Files
Version: 0.10.0
Version: 0.10.0.9000
Authors@R: c(
person("Public Health Scotland", , , "phs.source@phs.scot", role = "cph"),
person("James", "McMahon", , "james.mcmahon@phs.scot", role = c("aut"),
Expand Down Expand Up @@ -46,6 +46,7 @@ VignetteBuilder:
Remotes:
Public-Health-Scotland/phsmethods
Config/testthat/edition: 3
Config/testthat/parallel: true
Encoding: UTF-8
Language: en-GB
LazyData: true
Expand Down
4 changes: 4 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,12 @@
# slfhelper (development version)

# slfhelper 0.10.0

* [`{glue}`](https://glue.tidyverse.org/) is no longer a dependency as the required functionality can be provided by [`stringr::str_glue()](https://stringr.tidyverse.org/reference/str_glue.html).
* Dependency versions have been updated to the latest.
* `get_chi()` and `get_anon_chi()` now properly match missing (`NA`) and blank (`""`) values.
* slfhelper now defaults to using the `.parquet` file versions, old versions of slfhelper will no longer work.
* There is now a `dev` parameter available when using the `read_slf_*` functions which allows reading the file from the development environment.

# slfhelper 0.9.0

Expand Down
6 changes: 5 additions & 1 deletion R/get_anon_chi.R
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,11 @@ get_anon_chi <- function(chi_cohort, chi_var = "chi", drop = TRUE, check = TRUE)
lookup <- tibble::tibble(
chi = unique(chi_cohort[[chi_var]])
) %>%
dplyr::mutate(anon_chi = convert_chi_to_anon_chi(.data$chi))
dplyr::mutate(
chi = dplyr::if_else(is.na(.data$chi), "", .data$chi),
anon_chi = purrr::map_chr(.data$chi, openssl::base64_encode),
anon_chi = dplyr::if_else(.data$anon_chi == "", NA_character_, .data$anon_chi)
)

chi_cohort <- chi_cohort %>%
dplyr::left_join(
Expand Down
24 changes: 10 additions & 14 deletions R/get_chi.R
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,11 @@ get_chi <- function(data, anon_chi_var = "anon_chi", drop = TRUE) {
lookup <- tibble::tibble(
anon_chi = unique(data[[anon_chi_var]])
) %>%
dplyr::mutate(chi = convert_anon_chi_to_chi(.data$anon_chi))

dplyr::mutate(
anon_chi = dplyr::if_else(is.na(.data$anon_chi), "", .data$anon_chi),
chi = unname(convert_anon_chi_to_chi(.data$anon_chi)),
chi = dplyr::if_else(.data$chi == "", NA_character_, .data$chi)
)
data <- data %>%
dplyr::left_join(
lookup,
Expand All @@ -36,17 +39,10 @@ get_chi <- function(data, anon_chi_var = "anon_chi", drop = TRUE) {
return(data)
}

convert_anon_chi_to_chi <- function(anon_chi) {
chi <- purrr::map_chr(
anon_chi,
~ dplyr::case_match(.x,
NA_character_ ~ NA_character_,
"" ~ "",
.default = openssl::base64_decode(.x) %>%
substr(2, 2) %>%
paste0(collapse = "")
)
)
convert_anon_chi_to_chi <- Vectorize(function(anon_chi) {
chi <- openssl::base64_decode(anon_chi) %>%
substr(2, 2) %>%
paste0(collapse = "")

return(chi)
}
})
27 changes: 19 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,13 +14,24 @@ stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://

The goal of slfhelper is to provide some easy-to-use functions that make
working with the Source Linkage Files as painless and efficient as
possible.
possible. It is only intended for use by PHS employees and will only
work on the PHS R infrastructure.

## Installation

The preferred method of installation is to use the [{`pak`}
package](https://pak.r-lib.org/), which does an excellent job of
handling the errors which sometimes occur.
The simplest way to install to the PHS Posit Workbench environment is to
use the [PHS Package
Manager](https://ppm.publichealthscotland.org/client/#/repos/3/packages/slfhelper),
this will be the default setting and means you can install `slfhelper`
as you would any other package.

``` r
install.packages("slfhelper")
```

If this doesn’t work you can install it directly from GitHub, there are
a number of ways to do this, we recommend the [{`pak`}
package](https://pak.r-lib.org/).

``` r
# Install pak (if needed)
Expand All @@ -37,9 +48,9 @@ pak::pak("Public-Health-Scotland/slfhelper")
**Note:** Reading a full file is quite slow and will use a lot of
memory, we would always recommend doing a column selection to only keep
the variables that you need for your analysis. Just doing this will
dramatically speed up the read-time.
dramatically speed up the read time.

We provide some data snippets to help with the column selection and
We provide some data snippets to help with column selection and
filtering.

``` r
Expand Down Expand Up @@ -97,11 +108,11 @@ ep_1718 <- read_slf_episode(c("1718", "1819", "1920"),
) %>%
get_chi()

# Change chi numbers from data above back to anon_chi
# Change chi numbers from the data above back to anon_chi
ep_1718_anon <- ep_1718 %>%
get_anon_chi(chi_var = "chi")

# Add anon_chi to cohort sample
# Add anon_chi to the cohort sample
chi_cohort <- chi_cohort %>%
get_anon_chi(chi_var = "upi_number")
```
2 changes: 1 addition & 1 deletion man/read_slf.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/read_slf_episode.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/read_slf_individual.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 7 additions & 2 deletions man/slfhelper-package.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 7 additions & 5 deletions tests/testthat/_snaps/get_anon_chi.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,20 +3,22 @@
Code
get_anon_chi(data)
Output
# A tibble: 2 x 1
# A tibble: 3 x 1
anon_chi
<chr>
1 ""
1 <NA>
2 <NA>
3 <NA>

---

Code
get_anon_chi(data, drop = FALSE)
Output
# A tibble: 2 x 2
# A tibble: 3 x 2
chi anon_chi
<chr> <chr>
1 "" ""
2 <NA> <NA>
1 "" <NA>
2 "" <NA>
3 <NA> <NA>

62 changes: 32 additions & 30 deletions tests/testthat/_snaps/get_chi.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,40 +3,42 @@
Code
get_chi(data)
Output
# A tibble: 12 x 1
chi
<chr>
1 "2601211618"
2 "2210680631"
3 "1410920754"
4 "3112358158"
5 "0112418156"
6 "0612732243"
7 "2310474015"
8 "2411063698"
9 "3801112374"
10 "2311161233"
11 ""
12 <NA>
# A tibble: 13 x 1
chi
<chr>
1 2601211618
2 2210680631
3 1410920754
4 3112358158
5 0112418156
6 0612732243
7 2310474015
8 2411063698
9 3801112374
10 2311161233
11 <NA>
12 <NA>
13 <NA>

---

Code
get_chi(data, drop = FALSE)
Output
# A tibble: 12 x 2
anon_chi chi
<chr> <chr>
1 "MjYwMTIxMTYxOA==" "2601211618"
2 "MjIxMDY4MDYzMQ==" "2210680631"
3 "MTQxMDkyMDc1NA==" "1410920754"
4 "MzExMjM1ODE1OA==" "3112358158"
5 "MDExMjQxODE1Ng==" "0112418156"
6 "MDYxMjczMjI0Mw==" "0612732243"
7 "MjMxMDQ3NDAxNQ==" "2310474015"
8 "MjQxMTA2MzY5OA==" "2411063698"
9 "MzgwMTExMjM3NA==" "3801112374"
10 "MjMxMTE2MTIzMw==" "2311161233"
11 "" ""
12 <NA> <NA>
# A tibble: 13 x 2
anon_chi chi
<chr> <chr>
1 "MjYwMTIxMTYxOA==" 2601211618
2 "MjIxMDY4MDYzMQ==" 2210680631
3 "MTQxMDkyMDc1NA==" 1410920754
4 "MzExMjM1ODE1OA==" 3112358158
5 "MDExMjQxODE1Ng==" 0112418156
6 "MDYxMjczMjI0Mw==" 0612732243
7 "MjMxMDQ3NDAxNQ==" 2310474015
8 "MjQxMTA2MzY5OA==" 2411063698
9 "MzgwMTExMjM3NA==" 3801112374
10 "MjMxMTE2MTIzMw==" 2311161233
11 "" <NA>
12 "" <NA>
13 <NA> <NA>

26 changes: 0 additions & 26 deletions tests/testthat/test-files_exist.R

This file was deleted.

28 changes: 28 additions & 0 deletions tests/testthat/test-files_readable.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
skip_on_ci()


test_that("Episode files are readable", {
# Episode files
expect_true(fs::file_access(gen_file_path("1415", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1516", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1617", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1718", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1819", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1920", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("2021", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("2122", "episode"), mode = "read"))
expect_true(fs::file_access(gen_file_path("2223", "episode"), mode = "read"))
})


test_that("Individual files are readable", {
expect_true(fs::file_access(gen_file_path("1415", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1516", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1617", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1718", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1819", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("1920", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("2021", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("2122", "individual"), mode = "read"))
expect_true(fs::file_access(gen_file_path("2223", "individual"), mode = "read"))
})
8 changes: 4 additions & 4 deletions tests/testthat/test-multiple_selections.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,27 +5,27 @@ test_that("select years and recid", {
set.seed(50)

acute_only <- read_slf_episode(c("1718", "1819"),
col_select = c("year", "anon_chi", "recid", "keydate1_dateformat"),
col_select = c("year", "anon_chi", "recid", "record_keydate1"),
recids = "01B"
) %>%
dplyr::slice_sample(n = 200000)

expect_equal(
names(acute_only),
c("year", "anon_chi", "recid", "keydate1_dateformat")
c("year", "anon_chi", "recid", "record_keydate1")
)
expect_equal(unique(acute_only$year), c("1718", "1819"))
expect_equal(unique(acute_only$recid), "01B")

hosp_only <- read_slf_episode(c("1718", "1819"),
col_select = c("year", "anon_chi", "recid", "keydate1_dateformat"),
col_select = c("year", "anon_chi", "recid", "record_keydate1"),
recids = c("01B", "02B", "04B", "GLS")
) %>%
dplyr::slice_sample(n = 200000)

expect_equal(
names(hosp_only),
c("year", "anon_chi", "recid", "keydate1_dateformat")
c("year", "anon_chi", "recid", "record_keydate1")
)
expect_equal(unique(hosp_only$year), c("1718", "1819"))
expect_equal(sort(unique(hosp_only$recid)), c("01B", "02B", "04B", "GLS"))
Expand Down
Loading