Install from GitHub with:
# install.packages("remotes")
remotes::install_github("epicentre-msf/queryr")
Load package and examine example dataset ll
, an epidemiological
“linelist”.
library(queryr)
data(ll)
head(ll)
#> id site age status date_onset date_admit date_lab lab_result date_exit outcome
#> 1 TC143 A 14.000 Not a case 2020-03-30 2020-04-01 2020-04-01 Negative 2024-04-02 Sent home
#> 2 TC345 B 0.667 Suspected 2024-04-03 2020-04-05 <NA> Positive 2020-04-19 <NA>
#> 3 T104 A 29.000 Suspected 2020-03-16 2020-04-03 2020-03-12 <NA> 2020-03-11 <NA>
#> 4 TC623 C 91.000 Confirmed 2020-04-10 2020-04-12 2020-03-12 Unclear 2020-04-30 Cured
#> 5 TC685 C 0.192 Not a case <NA> <NA> 2020-03-25 Negative 2010-03-26 Sent home
#> 6 TC361 B 68.000 Confirmed 2020-03-20 2020-03-25 2020-03-26 Positive <NA> Unknown
E.g. 1. Find observations where date_exit
is earlier than
date_admit
.
query(ll, date_exit < date_admit, cols_base = id:site)
#> id site variable1 value1 variable2 value2
#> 1 T104 A date_exit 2020-03-11 date_admit 2020-04-03
#> 2 TC550 B date_exit 2010-05-13 date_admit 2020-04-30
#> 3 TC457 A date_exit 2020-03-19 date_admit 2020-04-10
Note that by default the columns referenced in the query expression are
pivoted to long-format (variable1
, value1
, variable2
, value2
,
…), to enable stacking multiple queries on different variables. The
optional cols_base
argument can be used to specify additional columns
to retain in the output (via
tidy-selection).
E.g. 2. Find any date value in the future using a .x
selector
within the query expression to represent a set of multiple columns. The
columns represented by .x
are specified separately with argument
cols_dotx
.
query(ll, .x > Sys.Date(), cols_dotx = starts_with("date"), cols_base = id:site)
#> id site variable1 value1
#> 1 TC345 B date_onset 2024-04-03
#> 2 TC457 A date_lab 2040-04-12
#> 3 TC143 A date_exit 2024-04-02
#> 4 TC443 C date_exit 2030-04-05
E.g. 3. Find non-valid values of lab_result
.
lab_result_valid <- c("Positive", "Negative", "Inc.", NA)
query(ll, !lab_result %in% lab_result_valid, cols_base = id:site)
#> id site variable1 value1
#> 1 TC623 C lab_result Unclear
E.g. 4. Stack multiple queries on different variables.
Because we’ll usually want to retain the same set of ‘base columns’
across multiple queries, the argument cols_base
can be set for an
entire session using option “queryr_cols_base”.
# set argument `cols_base` for all subsequent queries (wrap in quote() if using
# non-standard evaluation rather than a character vector)
options(queryr_cols_base = quote(id:site))
# stack multiple queries
dplyr::bind_rows(
.id = "query",
# non-valid patient id
"IDENT_01" = query(ll, !grepl("^TC\\d{3}", id)),
# date in future
"DATES_01" = query(ll, .x > Sys.Date(), cols_dotx = starts_with("date")),
# exit before admission
"DATES_02" = query(ll, date_exit < date_admit),
# non-valid value of age_unit
"FCTRS_01" = query(ll, !lab_result %in% lab_result_valid),
# status 'Confirmed' but lab result not 'Positive'
"LOGIC_01" = query(ll, status == "Confirmed" & !lab_result %in% "Positive")
)
#> query id site variable1 value1 variable2 value2
#> 1 IDENT_01 T104 A id T104 <NA> <NA>
#> 2 IDENT_01 190 A id 190 <NA> <NA>
#> 3 DATES_01 TC345 B date_onset 2024-04-03 <NA> <NA>
#> 4 DATES_01 TC457 A date_lab 2040-04-12 <NA> <NA>
#> 5 DATES_01 TC143 A date_exit 2024-04-02 <NA> <NA>
#> 6 DATES_01 TC443 C date_exit 2030-04-05 <NA> <NA>
#> 7 DATES_02 T104 A date_exit 2020-03-11 date_admit 2020-04-03
#> 8 DATES_02 TC550 B date_exit 2010-05-13 date_admit 2020-04-30
#> 9 DATES_02 TC457 A date_exit 2020-03-19 date_admit 2020-04-10
#> 10 FCTRS_01 TC623 C lab_result Unclear <NA> <NA>
#> 11 LOGIC_01 TC623 C status Confirmed lab_result Unclear
For these we’ll introduce a second example dataset sll
, a ‘summary’
linelist. Whereas the original linelist ll
contains all patients
presenting at treatment centres, the summary linelist sll
contains
only patients with status “Confirmed” or “Probable”.
data(sll)
head(sll)
#> sll_id tc_admit tc_id tc_site sll_age sll_status sll_date_outcome sll_outcome
#> 1 SLL_001 Yes TC932 B 61.00 Confirmed 2020-04-24 Died
#> 2 SLL_002 Yes TC345 <NA> 0.67 Confirmed 2020-04-19 Died
#> 3 SLL_003 Yes TC623 C 91.00 Confirmed 2020-04-30 Cured
#> 4 SLL_004 No <NA> <NA> 0.50 Probable 2020-05-13 Died
#> 5 SLL_005 Yes TC361 B 68.00 Confirmed 2020-04-02 Cured
#> 6 SLL_006 Yes TC550 B 39.00 Probable 2010-05-13 Died
E.g. 1. Find treatment centre IDs that appear in ll
but not sll
,
and the corresponding patient status.
query2(
ll,
sll,
cols_base1 = c(id, site, status),
join_type = "anti",
join_by = c("id" = "tc_id")
)
#> id site status
#> 1 TC143 A Not a case
#> 2 T104 A Suspected
#> 3 TC685 C Not a case
#> 4 190 A Not a case
#> 5 TC443 C <NA>
#> 6 TC206 A Not a case
E.g. 2. Find confirmed/probable patients in ll
with different
outcomes listed in ll
vs sll
.
query2(
ll,
sll,
cond1 = status %in% c("Confirmed", "Probable"),
cols_base1 = id:site,
join_type = "inner",
join_by = c("id" = "tc_id"),
cond3 = outcome != sll_outcome
)
#> id site variable1 value1 variable2 value2 variable3 value3
#> 1 TC361 B status Confirmed outcome Unknown sll_outcome Cured
#> 2 TC550 B status Probable outcome Other sll_outcome Died