queryr: Data validation queries with tidy, stackable output

Installation

Install from GitHub with:

# install.packages("remotes")
remotes::install_github("epicentre-msf/queryr")

Preliminaries

Load package and examine example dataset ll, an epidemiological “linelist”.

library(queryr)
data(ll)
head(ll)
#>      id site    age     status date_onset date_admit   date_lab lab_result  date_exit   outcome
#> 1 TC143    A 14.000 Not a case 2020-03-30 2020-04-01 2020-04-01   Negative 2024-04-02 Sent home
#> 2 TC345    B  0.667  Suspected 2024-04-03 2020-04-05       <NA>   Positive 2020-04-19      <NA>
#> 3  T104    A 29.000  Suspected 2020-03-16 2020-04-03 2020-03-12       <NA> 2020-03-11      <NA>
#> 4 TC623    C 91.000  Confirmed 2020-04-10 2020-04-12 2020-03-12    Unclear 2020-04-30     Cured
#> 5 TC685    C  0.192 Not a case       <NA>       <NA> 2020-03-25   Negative 2010-03-26 Sent home
#> 6 TC361    B 68.000  Confirmed 2020-03-20 2020-03-25 2020-03-26   Positive       <NA>   Unknown

Example queries

E.g. 1. Find observations where date_exit is earlier than date_admit.

query(ll, date_exit < date_admit, cols_base = id:site)
#>      id site variable1     value1  variable2     value2
#> 1  T104    A date_exit 2020-03-11 date_admit 2020-04-03
#> 2 TC550    B date_exit 2010-05-13 date_admit 2020-04-30
#> 3 TC457    A date_exit 2020-03-19 date_admit 2020-04-10

Note that by default the columns referenced in the query expression are pivoted to long-format (variable1, value1, variable2, value2, …), to enable stacking multiple queries on different variables. The optional cols_base argument can be used to specify additional columns to retain in the output (via tidy-selection).

E.g. 2. Find any date value in the future using a .x selector within the query expression to represent a set of multiple columns. The columns represented by .x are specified separately with argument cols_dotx.

query(ll, .x > Sys.Date(), cols_dotx = starts_with("date"), cols_base = id:site)
#>      id site  variable1     value1
#> 1 TC345    B date_onset 2024-04-03
#> 2 TC457    A   date_lab 2040-04-12
#> 3 TC143    A  date_exit 2024-04-02
#> 4 TC443    C  date_exit 2030-04-05

E.g. 3. Find non-valid values of lab_result.

lab_result_valid <- c("Positive", "Negative", "Inc.", NA)
query(ll, !lab_result %in% lab_result_valid, cols_base = id:site)
#>      id site  variable1  value1
#> 1 TC623    C lab_result Unclear

E.g. 4. Stack multiple queries on different variables.

Because we’ll usually want to retain the same set of ‘base columns’ across multiple queries, the argument cols_base can be set for an entire session using option “queryr_cols_base”.

# set argument `cols_base` for all subsequent queries (wrap in quote() if using
# non-standard evaluation rather than a character vector)
options(queryr_cols_base = quote(id:site))

# stack multiple queries
dplyr::bind_rows(
  .id = "query",
  # non-valid patient id
  "IDENT_01" = query(ll, !grepl("^TC\\d{3}", id)),
  # date in future
  "DATES_01" = query(ll, .x > Sys.Date(), cols_dotx = starts_with("date")),
  # exit before admission
  "DATES_02" = query(ll, date_exit < date_admit),
  # non-valid value of age_unit
  "FCTRS_01" = query(ll, !lab_result %in% lab_result_valid),
  # status 'Confirmed' but lab result not 'Positive'
  "LOGIC_01" = query(ll, status == "Confirmed" & !lab_result %in% "Positive")
)
#>       query    id site  variable1     value1  variable2     value2
#> 1  IDENT_01  T104    A         id       T104       <NA>       <NA>
#> 2  IDENT_01   190    A         id        190       <NA>       <NA>
#> 3  DATES_01 TC345    B date_onset 2024-04-03       <NA>       <NA>
#> 4  DATES_01 TC457    A   date_lab 2040-04-12       <NA>       <NA>
#> 5  DATES_01 TC143    A  date_exit 2024-04-02       <NA>       <NA>
#> 6  DATES_01 TC443    C  date_exit 2030-04-05       <NA>       <NA>
#> 7  DATES_02  T104    A  date_exit 2020-03-11 date_admit 2020-04-03
#> 8  DATES_02 TC550    B  date_exit 2010-05-13 date_admit 2020-04-30
#> 9  DATES_02 TC457    A  date_exit 2020-03-19 date_admit 2020-04-10
#> 10 FCTRS_01 TC623    C lab_result    Unclear       <NA>       <NA>
#> 11 LOGIC_01 TC623    C     status  Confirmed lab_result    Unclear

Queries relating to two data frames

For these we’ll introduce a second example dataset sll, a ‘summary’ linelist. Whereas the original linelist ll contains all patients presenting at treatment centres, the summary linelist sll contains only patients with status “Confirmed” or “Probable”.

data(sll)
head(sll)
#>    sll_id tc_admit tc_id tc_site sll_age sll_status sll_date_outcome sll_outcome
#> 1 SLL_001      Yes TC932       B   61.00  Confirmed       2020-04-24        Died
#> 2 SLL_002      Yes TC345    <NA>    0.67  Confirmed       2020-04-19        Died
#> 3 SLL_003      Yes TC623       C   91.00  Confirmed       2020-04-30       Cured
#> 4 SLL_004       No  <NA>    <NA>    0.50   Probable       2020-05-13        Died
#> 5 SLL_005      Yes TC361       B   68.00  Confirmed       2020-04-02       Cured
#> 6 SLL_006      Yes TC550       B   39.00   Probable       2010-05-13        Died

E.g. 1. Find treatment centre IDs that appear in ll but not sll, and the corresponding patient status.

query2(
  ll,
  sll,
  cols_base1 = c(id, site, status),
  join_type = "anti",
  join_by = c("id" = "tc_id")
)
#>      id site     status
#> 1 TC143    A Not a case
#> 2  T104    A  Suspected
#> 3 TC685    C Not a case
#> 4   190    A Not a case
#> 5 TC443    C       <NA>
#> 6 TC206    A Not a case

E.g. 2. Find confirmed/probable patients in ll with different outcomes listed in ll vs sll.

query2(
  ll,
  sll,
  cond1 = status %in% c("Confirmed", "Probable"),
  cols_base1 = id:site,
  join_type = "inner",
  join_by = c("id" = "tc_id"),
  cond3 = outcome != sll_outcome
)
#>      id site variable1    value1 variable2  value2   variable3 value3
#> 1 TC361    B    status Confirmed   outcome Unknown sll_outcome  Cured
#> 2 TC550    B    status  Probable   outcome   Other sll_outcome   Died

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github		.github
R		R
data-raw		data-raw
data		data
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
codecov.yml		codecov.yml
queryr.Rproj		queryr.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

queryr: Data validation queries with tidy, stackable output

Installation

Preliminaries

Example queries

Queries relating to two data frames

About

Releases

Packages

Languages

License

epicentre-msf/queryr

Folders and files

Latest commit

History

Repository files navigation

queryr: Data validation queries with tidy, stackable output

Installation

Preliminaries

Example queries

Queries relating to two data frames

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages