Islbs port (#77)
* documentation for some of the data sets

* data sets added from islbs and devtools::check passed

* updated NEWS.md

* pkgdown.yml updated
npaterno authored Jan 1, 2025
1 parent 9b858c9 commit 3b7ffbd
Showing 168 changed files with 126,893 additions and 1 deletion.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -22,7 +22,7 @@ License: GPL-3
Encoding: UTF-8
LazyData: true
LazyDataCompression: xz
-RoxygenNote: 7.3.1
+RoxygenNote: 7.3.2
Suggests:
broom,
dplyr,
5 changes: 5 additions & 0 deletions NEWS.md
@@ -1,3 +1,8 @@
# Developmental

* Added new datasets:
  * `LEAP`, `arenosa`, `cdc`, `cdc.samp`, `census.2010`, `danish.ed.primary`, `danish.ed.validation`, `dds.discr`, `famuss`, `forest.birds`, `frog`, `hyperuricemia`, `hyperuricemia.samp`, `infant_mortality_2022`, `mcas`, `nhanes.samp`, `nhanes.samp.adult`, `nhanes.samp.adult.500`, `opp_insights_colleges`, `opp_insights_colleges_4year`, `prevend`, `prevend.samp`, `sugar.levels.A`, `sugar.levels.B`, `swim`, `tb.interruption`, `thermometry`, `wdi_2022` ported from ISLBS by [@npaterno](https://github.com/npaterno)

# openintro 2.5.0

* Added new datasets:
44 changes: 44 additions & 0 deletions R/data-LEAP.R
@@ -0,0 +1,44 @@
#' Patient level data on the randomized trial Learning Early About Peanut (LEAP) allergies.
#'
#' This study examined whether early exposure to peanuts increased tolerance and
#' protection from developing a peanut allergy in children who are allergic to
#' eggs or who have severe eczema. Participants between 4 and 11 months old were
#' randomized to either avoid versus consume peanut based products during the
#' first three years of life. The longer title of the study is Induction of
#' Tolerance Through Early Introduction of Peanut in High-Risk Children and can
#' be found in \url{https://clinicaltrials.gov/} as study NCT00329784.
#'
#' More variables are available at the site in the source.
#'
#' @docType data
#' @format A data frame with 640 rows and 7 columns
#' \describe{
#' \item{\code{participant.ID}}{Character vector, unique identifier for each participant.}
#' \item{\code{stratum}}{Factor, outcome of a skin prick test (SPT) conducted
#' before randomization, with levels \code{SPT-Negative}, participant
#' shows no evidence of peanut allergy, and \code{SPT-Positive}, evidence
#' of a peanut allergy. Participants were
#' randomized separately within each stratum. The primary analysis of the
#' study is typically restricted to the SPT-Negative group.}
#' \item{\code{treatment.group}}{Factor, randomized assignment for each participant,
#' with levels \code{Peanut Avoidance} and \code{Peanut Consumption}.}
#' \item{\code{age.months}}{Participant age in months at randomization.}
#' \item{\code{sex}}{Factor, sex of participant with levels \code{Female} and
#' \code{Male}.}
#' \item{\code{primary.ethnicity}}{Factor variable with levels \code{Asian},
#' \code{Black}, \code{Other}, \code{Mixed}, and \code{White}.}
#' \item{\code{overall.V60.outcome}}{Factor, indicating whether, after 5 years,
#' the participant had an allergic reaction in a peanut-based oral food
#' challenge (OFC), with levels \code{FAIL OFC} (allergic reaction) and
#' \code{PASS OFC} (no allergic reaction).}
#' }
#' @source These data are a subset of variables from the file ADSTART0_2015-03-03_14-20-10.txt,
#' available by downloading study files from
#' \url{https://www.immport.org/shared/study/SDY660}
#' @references Du Toit, George, et al. "Randomized trial of peanut consumption in
#' infants at risk for peanut allergy."
#' New England Journal of Medicine 372.9 (2015): 803-813.
#' \doi{10.1056/nejmoa1414850}
#'
"LEAP"
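Reviewer note: a minimal sketch of how the documented `LEAP` columns might be used, assuming the structure described in the roxygen block above. The rows below are invented toy values, not LEAP data; only the column names and factor levels come from the documentation.

```r
# Toy stand-in for LEAP: all values are invented for illustration only.
leap_toy <- data.frame(
  stratum = factor(c("SPT-Negative", "SPT-Negative", "SPT-Negative",
                     "SPT-Negative", "SPT-Positive", "SPT-Positive")),
  treatment.group = factor(c("Peanut Avoidance", "Peanut Consumption",
                             "Peanut Avoidance", "Peanut Consumption",
                             "Peanut Avoidance", "Peanut Consumption")),
  overall.V60.outcome = factor(c("FAIL OFC", "PASS OFC", "PASS OFC",
                                 "PASS OFC", "FAIL OFC", "PASS OFC"))
)

# Restrict to the SPT-Negative stratum, as the primary analysis typically does.
spt_neg <- droplevels(subset(leap_toy, stratum == "SPT-Negative"))

# Cross-tabulate treatment assignment against the oral food challenge outcome.
outcome_tab <- table(spt_neg$treatment.group, spt_neg$overall.V60.outcome)

# Row proportions: share of each treatment arm that failed or passed the OFC.
outcome_prop <- prop.table(outcome_tab, margin = 1)
print(outcome_prop)
```

With the real dataset, the same steps would presumably start from `LEAP` instead of `leap_toy`.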
39 changes: 39 additions & 0 deletions R/data-arenosa.R
@@ -0,0 +1,39 @@
#' arenosa
#'
#' Published results used RNA-Seq to investigate how cold responsiveness differs
#' in two populations of A. arenosa:
#' TBG (collected from Triberg, Germany) and
#' KA (collected from Kasparstein, Austria). Each row corresponds to a gene;
#' the first column contains the gene name; other columns correspond to expression
#' measured in a plant sample. Three plants of each population were exposed
#' to cold (vernalized, denoted by v), and three were not (non-vernalized,
#' denoted by nv). Expression was measured in gene counts
#' (i.e. the number of RNA transcripts present in a sample);
#' the data were then normalized to allow comparison between samples.
#'
#' @name arenosa
#' @docType data
#' @format A tibble with 1088 rows and 13 variables:
#' \describe{
#' \item{\code{gene.name}}{a character vector}
#' \item{\code{ka.nv.1}}{a numeric vector}
#' \item{\code{ka.nv.2}}{a numeric vector}
#' \item{\code{ka.nv.3}}{a numeric vector}
#' \item{\code{ka.v.1}}{a numeric vector}
#' \item{\code{ka.v.2}}{a numeric vector}
#' \item{\code{ka.v.3}}{a numeric vector}
#' \item{\code{tbg.nv.1}}{a numeric vector}
#' \item{\code{tbg.nv.2}}{a numeric vector}
#' \item{\code{tbg.nv.3}}{a numeric vector}
#' \item{\code{tbg.v.1}}{a numeric vector}
#' \item{\code{tbg.v.2}}{a numeric vector}
#' \item{\code{tbg.v.3}}{a numeric vector}
#' }
#' @references Pierre Baduel, Brian Arnold, Cara M. Weisman, Ben Hunter, Kirsten Bomblies,
#' Habitat-Associated Life History and
#' Stress-Tolerance Variation in Arabidopsis arenosa, Plant Physiology,
#' Volume 171, Issue 1, May 2016, Pages 437–451
#' \url{https://doi.org/10.1104/pp.15.01875}
#' @source K Bomblies Harvard University lab.
#'
"arenosa"
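Reviewer note: the column naming scheme documented above (population, condition, replicate) lends itself to pattern-based column selection. The sketch below assumes that scheme and uses two hypothetical genes with invented expression values; it is not taken from the `arenosa` data.

```r
# Toy stand-in for arenosa with two hypothetical genes; values are invented.
arenosa_toy <- data.frame(
  gene.name = c("geneA", "geneB"),
  ka.nv.1 = c(10, 100), ka.nv.2 = c(12, 110), ka.nv.3 = c(14, 90),
  ka.v.1 = c(40, 95), ka.v.2 = c(48, 105), ka.v.3 = c(56, 100),
  stringsAsFactors = FALSE
)

# Select the KA columns for each condition by their naming pattern.
nv_cols <- grep("^ka\\.nv\\.", names(arenosa_toy), value = TRUE)
v_cols <- grep("^ka\\.v\\.", names(arenosa_toy), value = TRUE)

# Mean normalized expression per gene in each condition.
mean_nv <- rowMeans(arenosa_toy[, nv_cols])
mean_v <- rowMeans(arenosa_toy[, v_cols])

# Log2 fold change of vernalized relative to non-vernalized plants.
log2_fc <- log2(mean_v / mean_nv)
names(log2_fc) <- arenosa_toy$gene.name
print(log2_fc)
```

The same pattern should carry over to the `tbg.` columns by swapping the prefix.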
27 changes: 27 additions & 0 deletions R/data-cdc.R
@@ -0,0 +1,27 @@
#' cdc
#'
#' A dataset from the 2000 Behavioral Risk Factor Surveillance System (BRFSS)
#' conducted by the US Centers for Disease Control and Prevention, used to
#' illustrate inference on demographic data.
#'
#' @name cdc
#' @docType data
#' @format A data frame with 20,000 rows and 9 variables:
#' \describe{
#' \item{\code{genhlth}}{Factor with levels \code{excellent}, \code{very good},
#' \code{good}, \code{fair}, \code{poor}}
#' \item{\code{exerany}}{Numeric vector; 1 if the respondent exercised in the
#' past month and 0 otherwise.}
#' \item{\code{hlthplan}}{Numeric; 1 if the respondent has some form
#' of health coverage and 0 otherwise.}
#' \item{\code{smoke100}}{Numeric; 1 if the respondent has smoked at least 100
#' cigarettes in their entire life and 0 otherwise.}
#' \item{\code{height}}{Numeric; respondent's height in inches.}
#' \item{\code{weight}}{Numeric; respondent's weight in pounds.}
#' \item{\code{wtdesire}}{Numeric; respondent's desired weight in pounds.}
#' \item{\code{age}}{Numeric; respondent's age in years.}
#' \item{\code{gender}}{Factor with two levels, \code{m} and \code{f}}
#' }
#' @source \url{https://www.cdc.gov/brfss/index.html}
#'
"cdc"
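Reviewer note: a short sketch of derived variables that the documented units make possible, assuming heights in inches and weights in pounds as stated above. The three rows are invented, and the constant 703 is the standard conversion factor for BMI from pounds and inches; the names `wtdiff` and `bmi` are hypothetical.

```r
# Toy stand-in for cdc: values are invented for illustration only.
cdc_toy <- data.frame(
  height = c(70, 64, 60),      # inches
  weight = c(175, 125, 105),   # pounds
  wtdesire = c(175, 115, 105), # pounds
  exerany = c(0, 1, 1)         # 1 = exercised in the past month
)

# Difference between desired and current weight (negative = wants to lose).
cdc_toy$wtdiff <- cdc_toy$wtdesire - cdc_toy$weight

# Body mass index from imperial units: 703 * pounds / inches^2.
cdc_toy$bmi <- 703 * cdc_toy$weight / cdc_toy$height^2

# Because exerany is coded 0/1, its mean is the proportion who exercised.
prop_exercise <- mean(cdc_toy$exerany)
print(cdc_toy)
```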
26 changes: 26 additions & 0 deletions R/data-cdc.samp.R
@@ -0,0 +1,26 @@
#' cdc.samp
#'
#' A sample of 60 individuals from the 2000 Behavioral Risk Factor Surveillance System
#' (BRFSS) conducted by the US Centers for Disease Control and Prevention.
#'
#' @name cdc.samp
#' @docType data
#' @format A tibble with 60 rows and 9 variables:
#' \describe{
#' \item{\code{genhlth}}{Factor with levels \code{excellent}, \code{very good},
#' \code{good}, \code{fair}, \code{poor}}
#' \item{\code{exerany}}{Numeric vector; 1 if the respondent exercised in the
#' past month and 0 otherwise.}
#' \item{\code{hlthplan}}{Numeric vector; 1 if the respondent has some form
#' of health coverage and 0 otherwise.}
#' \item{\code{smoke100}}{Numeric; 1 if the respondent has smoked at least 100
#' cigarettes in their entire life and 0 otherwise.}
#' \item{\code{height}}{Numeric; respondent's height in inches.}
#' \item{\code{weight}}{Numeric; respondent's weight in pounds.}
#' \item{\code{wtdesire}}{Numeric; respondent's desired weight in pounds.}
#' \item{\code{age}}{Numeric; respondent's age in years.}
#' \item{\code{gender}}{Factor with two levels, \code{m} and \code{f}}
#' }
#' @source \url{http://www.openintro.org/stat/data/cdc.R}
#'
"cdc.samp"
26 changes: 26 additions & 0 deletions R/data-census.2010.R
@@ -0,0 +1,26 @@
#' census.2010
#'
#' United States 2010 infant mortality and number of physicians by state,
#' including the District of Columbia.
#'
#' Data were abstracted from the 2010 Statistical Abstract of the United States.
#' Due to a lag in recording state level data, the infant mortality data is from
#' 2009 and the data on physicians from 2007. Both measurements are subject to
#' change annually, so these data are not current and should not be used for
#' inference about infant mortality. More current data can be found at the US
#' Centers for Disease Control and Prevention
#' (\url{https://www.cdc.gov/nchs/pressroom/sosmap/infant_mortality_rates/infant_mortality.htm}),
#' and in the dataset \code{infant_mortality_2022}.
#'
#' @name census.2010
#' @docType data
#' @format A data frame with 51 rows and 3 columns.
#' \describe{
#' \item{\code{state}}{Character vector, US state, including the District of Columbia}
#' \item{\code{inf.mort}}{Numeric vector, number of deaths per 1000 live births between 1 day
#' and 1 year of age}
#' \item{\code{doctors}}{Numeric vector, active physicians per 100,000 population}
#' }
#' @source \url{https://www.census.gov/library/publications/2009/compendia/statab/129ed/births-deaths-marriages-divorces.html},
#' \url{https://www.census.gov/library/publications/2009/compendia/statab/129ed/health-nutrition.html}
#'
"census.2010"

56 changes: 56 additions & 0 deletions R/data-danish.ed.primary.R
@@ -0,0 +1,56 @@
#' danish.ed.primary
#'
#' Data from a Danish study on triage in an emergency department (ED)
#'
#' Data from a prospective cohort study of triage scoring for an emergency
#' department (ED). The study examined whether the use of patient level
#' measurements would improve an existing triage score. These data are the
#' training data (called primary data in the original manuscript) used for model
#' building. Some variable names have been changed for readability, but the data
#' on 21 variables for the 6,249 participants are otherwise unchanged.
#'
#' @name danish.ed.primary
#' @docType data
#' @format A tibble with 6249 rows and 21 variables:
#' \describe{
#' \item{\code{mort30}}{numeric, 1 if patient died within 30 days of admission, 0
#' otherwise}
#' \item{\code{triage}}{factor, triage score given at arrival to ED.
#' Values \code{green}, \code{yellow}, \code{orange}, \code{red}, from lowest
#' to highest priority
#' for treatment. The value \code{blue} normally denotes severity not
#' warranting admission to the ED, but no participants coded blue
#' are in these data.}
#' \item{\code{age}}{numeric, age in years, rounded to lower integer}
#' \item{\code{sex}}{factor, values \code{female}, \code{male}}
#' \item{\code{albumin}}{numeric, serum albumin, in g/L}
#' \item{\code{creatinine}}{numeric, serum creatinine, in umol/L}
#' \item{\code{hemaglobin}}{numeric, serum hemoglobin, in mmol/L}
#' \item{\code{potassium}}{numeric, serum potassium, in mmol/L}
#' \item{\code{leuk.count}}{numeric, blood leukocyte count, in 10E9/L}
#' \item{\code{sodium}}{numeric, serum sodium, in mmol/L}
#' \item{\code{c.react.protein}}{numeric, serum C-reactive protein}
#' \item{\code{oxygen.sat}}{numeric, peripheral arterial oxygen saturation, as a percent}
#' \item{\code{resp.rate}}{numeric, respiratory rate per minute}
#' \item{\code{heart.rate}}{numeric, heart rate, beats/min}
#' \item{\code{systolic.bp}}{numeric, systolic blood pressure, in mmHg}
#' \item{\code{glasgow.coma.scale}}{numeric, extent
#' of impaired consciousness in patients with acute medical condition or
#' trauma, scored between 3 and 15, 3 being the worst and 15 the best. Score
#' is based on 3 subscales, best eye, verbal and motor responses.}
#' \item{\code{readmit.hosp}}{factor, readmitted to hospital within 30 days,
#' values \code{yes}, \code{no}}
#' \item{\code{days.in.hosp}}{numeric, number of days admitted to hospital}
#' \item{\code{icu.time}}{numeric, number of days in the intensive care unit.
#' The value 99999 indicates that the patient was not admitted to the ICU.}
#' \item{\code{icu.status}}{factor, patient admitted to ICU, values \code{yes},
#' \code{no}}
#' }
#' @references Kristensen, Michael, et al. "Routine blood tests are associated
#' with short term mortality and can improve emergency department triage: a cohort
#' study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and
#' Emergency Medicine 25 (2017): 1-8.
#' \url{https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader}
#' @source \url{https://doi.org/10.5061/dryad.m2bq5}
#'
"danish.ed.primary"
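Reviewer note: because `icu.time` uses 99999 as a sentinel for "not admitted to ICU", summaries computed on the raw column would be badly inflated. A minimal sketch of one way to handle this, assuming the coding documented above; the rows are invented and the new column name `icu.days` is hypothetical.

```r
# Toy stand-in for danish.ed.primary: values are invented for illustration.
ed_toy <- data.frame(
  icu.status = factor(c("no", "yes", "no", "yes")),
  icu.time = c(99999, 2, 99999, 5),
  mort30 = c(0, 1, 0, 0)
)

# Recode the 99999 sentinel as a missing value before summarizing.
ed_toy$icu.days <- ifelse(ed_toy$icu.time == 99999, NA_real_, ed_toy$icu.time)

# Mean ICU stay among patients who were actually admitted to the ICU.
mean_icu_days <- mean(ed_toy$icu.days, na.rm = TRUE)

# Because mort30 is coded 0/1, its mean is the 30-day mortality proportion.
mort_rate <- mean(ed_toy$mort30)
print(c(mean_icu_days = mean_icu_days, mort_rate = mort_rate))
```

Keeping the recoded values in a separate column leaves the original coding available for comparison with `icu.status`.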
50 changes: 50 additions & 0 deletions R/data-danish.ed.validation.R
@@ -0,0 +1,50 @@
#' Data from a Danish study on triage in an emergency department (ED)
#'
#' Data from a prospective cohort study of triage scoring for an emergency
#' department (ED). The study examined whether the use of patient level
#' measurements would improve an existing triage score. These data were used as
#' a test set (called validation in the manuscript) to examine the performance
#' of the model built using the training (primary) cohort. Some variable names
#' have been changed for readability and for consistency with the primary dataset,
#' but the data on 18 variables for the 6,383 participants are otherwise unchanged.
#' Some variables in the primary dataset do not appear in these data.
#'
#' @name danish.ed.validation
#' @docType data
#' @format A tibble with 6383 rows and 18 variables:
#' \describe{
#' \item{\code{mort30}}{numeric, 1 if patient died within 30 days of admission, 0
#' otherwise}
#' \item{\code{triage}}{factor, triage score given at arrival to ED.
#' Values \code{blue}, \code{green}, \code{yellow}, \code{orange}, \code{red},
#' from lowest to highest priority
#' for treatment. The value \code{blue} normally denotes severity not
#' warranting admission to the ED. Participants coded \code{blue}
#' are in these data but not in the primary data.}
#' \item{\code{age}}{numeric, age in years, rounded to lower integer}
#' \item{\code{sex}}{factor, \code{female}, \code{male}}
#' \item{\code{albumin}}{numeric, serum albumin, in g/L}
#' \item{\code{creatinine}}{numeric, serum creatinine, in umol/L}
#' \item{\code{hemaglobin}}{numeric, serum hemoglobin, in mmol/L}
#' \item{\code{potassium}}{numeric, serum potassium, in mmol/L}
#' \item{\code{leuk.count}}{numeric, blood leukocyte count, in 10E9/L}
#' \item{\code{sodium}}{numeric, serum sodium, in mmol/L}
#' \item{\code{c.react.protein}}{numeric, serum C-reactive protein}
#' \item{\code{oxygen.sat}}{numeric, peripheral arterial oxygen saturation, as a percent}
#' \item{\code{resp.rate}}{numeric, respiratory rate per minute}
#' \item{\code{heart.rate}}{numeric, heart rate, beats/min}
#' \item{\code{systolic.bp}}{numeric, systolic blood pressure, in mmHg}
#' \item{\code{readmit.hosp}}{factor, readmitted to hospital within 30 days,
#' with values \code{yes}, \code{no}}
#' \item{\code{days.in.hosp}}{numeric, number of days admitted to hospital}
#' \item{\code{icu.status}}{factor, patient admitted to ICU, with values
#' \code{yes}, \code{no}}
#' }
#' @references Kristensen, Michael, et al. "Routine blood tests are associated
#' with short term mortality and can improve emergency department triage: a cohort
#' study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and
#' Emergency Medicine 25 (2017): 1-8.
#' \url{https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader}
#' @source \url{https://doi.org/10.5061/dryad.m2bq5}
#'
"danish.ed.validation"
36 changes: 36 additions & 0 deletions R/data-dds.dscr.R
@@ -0,0 +1,36 @@
#' A dataset on disbursements from the California Department of Developmental Services (DDS)
#'
#' The dataset represents a sample of 1,000 DDS consumers (out of a total
#' population of approximately 250,000), and includes information about age,
#' gender, ethnicity, and the amount of financial support per consumer provided
#' by the DDS. The dataset is based on recorded attributes of consumers, but has
#' been altered to maintain consumer privacy. From the Taylor and Mickel paper:
#' "The data set originated from DDS’s Client Master File. In order to remain in
#' compliance with California State Legislation, the data have been altered to
#' protect the rights and privacy of specific individual consumers. The provided
#' data set is based on actual attributes of consumers."
#'
#' @name dds.discr
#' @docType data
#' @format A data frame with 1000 rows and 6 variables:
#' \describe{
#' \item{\code{id}}{Numeric, Unique identification code for each resident}
#' \item{\code{age.cohort}}{A factor, \code{0-5} years,
#' \code{6-12} years, \code{13-17} years, \code{18-21} years, \code{22-50} years,
#' and \code{51+} years}
#' \item{\code{age}}{Numeric, Age measured in years}
#' \item{\code{gender}}{A factor, with levels \code{Female} or \code{Male}}
#' \item{\code{expenditures}}{Numeric, Annual amount spent by the
#' State on an individual, measured in USD}
#' \item{\code{ethnicity}}{Factor, Ethnic group, recorded as
#' \code{American Indian}, \code{Asian}, \code{Black}, \code{Hispanic},
#' \code{Multi Race}, \code{Native Hawaiian}, \code{Other},
#' \code{White not Hispanic}}
#' }
#' @references Taylor, Stanley A., and Amy E. Mickel. Simpson's paradox: A data
#' set and discrimination case study exercise. Journal of Statistics Education
#' 22.1 (2014). www.amstat.org/publications/jse/v22n1/mickel.pdf
#' Data contained in supplement B of Taylor and Mickel.
#'
"dds.discr"

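Reviewer note: the reference for `dds.discr` concerns Simpson's paradox, so a sketch of the kind of comparison involved may help readers of the documentation. The six rows below are invented and deliberately constructed so that the aggregate and within-cohort comparisons point in opposite directions; they say nothing about the actual DDS data.

```r
# Toy stand-in for dds.discr: values are invented to illustrate the pattern.
dds_toy <- data.frame(
  ethnicity = c("Hispanic", "Hispanic", "Hispanic",
                "White not Hispanic", "White not Hispanic",
                "White not Hispanic"),
  age.cohort = c("0-5", "0-5", "22-50", "0-5", "22-50", "22-50"),
  expenditures = c(1000, 1200, 40000, 1000, 38000, 40000)
)

# Mean expenditures by ethnicity, ignoring age.
overall <- tapply(dds_toy$expenditures, dds_toy$ethnicity, mean)

# Mean expenditures by ethnicity within each age cohort.
by_cohort <- tapply(dds_toy$expenditures,
                    list(dds_toy$ethnicity, dds_toy$age.cohort), mean)

print(round(overall))
print(by_cohort)
```

In this toy example one group has the lower overall mean yet the higher mean within every age cohort, because the groups differ in their age composition.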
42 changes: 42 additions & 0 deletions R/data-famuss.R
@@ -0,0 +1,42 @@
#' A dataset to examine the relationship between muscle strength and the single nucleotide polymorphism (SNP) actn3.r577x.
#'
#' This dataset is a subset of the larger data set from the Functional SNPs
#' Associated with Muscle Size and Strength (FAMuSS) study by Thompson et al. It
#' contains demographic variables, response variables, and SNP coding for the study participants.
#' Unlike the data in the previous version of the \code{oibiostat} data package,
#' this dataset retains the missing values. The data are also discussed in the
#' Foulkes text. Strength was measured in both dominant and non-dominant arms
#' before and after resistance training. The particular gene of interest was
#' ACTN3, the "sports gene."
#'
#' @name famuss
#' @docType data
#' @format A tibble with 1397 rows and 10 variables
#' \describe{
#' \item{\code{ndrm.ch}}{A numeric vector, the percent change in strength
#' in the non-dominant arm, from before to after training.}
#' \item{\code{drm.ch}}{A numeric vector, percent change in strength in
#' dominant arm.}
#' \item{\code{sex}}{A factor with levels \code{Female} and \code{Male}}
#' \item{\code{age}}{A numeric vector, age in years.}
#' \item{\code{race}}{A factor with levels \code{African Am}, \code{Asian},
#' \code{Caucasian}, \code{Hispanic}, \code{Other}}
#' \item{\code{height}}{A numeric vector,
#' height in inches.}
#' \item{\code{weight}}{A numeric vector, weight in pounds.}
#' \item{\code{actn3.r577x}}{A factor with levels \code{CC}, \code{CT}, \code{TT},
#' that shows the genotype at residue rs540874 (location r577x) within the ACTN3
#' SNP.}
#' \item{\code{bmi}}{A numeric vector, body mass index}
#' }
#' @source Personal communication from A. Foulkes
#' @references Thompson P, Moyna N, Seip R, et al. Medicine and Science in Sports and
#' Exercise, (2004), 1132-1139, 36(7). Clarkson P, et al., Journal of Applied
#' Physiology 99: 154-163, 2005. Pescatello L, et al. Highlights from the
#' functional single nucleotide polymorphisms associated with human muscle
#' size and strength or FAMuSS study, BioMed Research International 2013. Foulkes, Andrea S.
#' Applied Statistical Genetics using R for Population Association Studies.
#' Springer, 2009.
#'
"famuss"

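Reviewer note: since this version of `famuss` retains missing values, group summaries need explicit handling of `NA`. The sketch below assumes the documented column names and genotype levels; the six rows are invented and are not FAMuSS data.

```r
# Toy stand-in for famuss: values are invented for illustration only.
famuss_toy <- data.frame(
  actn3.r577x = factor(c("CC", "CT", "TT", "CC", "TT", NA),
                       levels = c("CC", "CT", "TT")),
  ndrm.ch = c(40, 50, NA, 60, 70, 55)
)

# Count missing values per column before analysis.
n_missing <- colSums(is.na(famuss_toy))

# Mean percent change in non-dominant arm strength by genotype.
# tapply() drops rows with a missing genotype; na.rm = TRUE drops
# missing responses within each genotype group.
mean_by_genotype <- tapply(famuss_toy$ndrm.ch, famuss_toy$actn3.r577x,
                           mean, na.rm = TRUE)

print(n_missing)
print(mean_by_genotype)
```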