Islbs port (#77)

* documentation for some of the data sets
* data sets added from islbs and devtools::check passed
* updated NEWS.md
* pkgdown.yml updated
OpenIntroStat · Jan 1, 2025 · 3b7ffbd
1 parent 9b858c9
commit 3b7ffbd

Showing 168 changed files with 126,893 additions and 1 deletion.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -22,7 +22,7 @@ License: GPL-3
 Encoding: UTF-8
 LazyData: true
 LazyDataCompression: xz
-RoxygenNote: 7.3.1
+RoxygenNote: 7.3.2
 Suggests: 
     broom,
     dplyr,

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,8 @@
+# Developmental
+
+* Added new datasets:
+  * `LEAP`, `arenosa`, `cdc`, `cdc.samp`, `census.2010`, `danish.ed.primary`, `danish.ed.validation`, `dds.discr`, `famuss`, `forest.birds`, `frog`, `hyperuricemia`, `hyperuricemia.samp`, `infant_mortality_2022`, `mcas`, `nhanes.samp`, `nhanes.samp.adult`, `nhanes.samp.adult.500`, `opp_insights_colleges`, `opp_insights_colleges_4year`, `prevend`, `prevend.samp`, `sugar.levels.A`, `sugar.levels.B`, `swim`, `tb.interruption`, `thermometry`, `wdi_2022` ported from ISLBS by [@npaterno](https://github.com/npaterno)
+
 # openintro 2.5.0
 
 * Added new datasets:

diff --git a/R/data-LEAP.R b/R/data-LEAP.R
@@ -0,0 +1,44 @@
+#' Patient level data on the randomized trial Learning Early About Peanut (LEAP) allergies.
+#'
+#' This study examined whether early exposure to peanuts increased tolerance and
+#' protection from developing a peanut allergy in children who are allergic to
+#' eggs or who have severe eczema. Participants between 4 and 11 months old were
+#' randomized to either avoid or consume peanut-based products during the
+#' first three years of life. The longer title of the study is Induction of
+#' Tolerance Through Early Introduction of Peanut in High-Risk Children, and it
+#' can be found at \url{https://clinicaltrials.gov/} as study NCT00329784.
+#'
+#' More variables are available at the site in the source.
+#'
+#' @docType data
+#' @format A data frame with 640 rows and 7 columns
+#' \describe{
+#'      \item{\code{participant.ID}}{Character vector, unique identifier for each participant.}
+#'      \item{\code{stratum}}{Factor, outcome of a skin prick test (SPT) conducted
+#'      before randomization, with levels \code{SPT-Negative}, participant
+#'      shows no evidence of peanut allergy,  and \code{SPT-Positive}, evidence
+#'      of a peanut allergy.  Participants were
+#'      randomized separately within each stratum.  The primary analysis of the
+#'      study is typically restricted to the SPT-Negative group.}
+#'      \item{\code{treatment.group}}{Factor, randomized assignment for each participant,
+#'       with levels \code{Peanut Avoidance} and \code{Peanut Consumption}.}
+#'      \item{\code{age.months}}{Participant age in months at randomization.}
+#'      \item{\code{sex}}{Factor, sex of participant with levels \code{Female} and
+#'      \code{Male}.}
+#'      \item{\code{primary.ethnicity}}{Factor variable with levels \code{Asian},
+#'      \code{Black}, \code{Other}, \code{Mixed}, and \code{White}.}
+#'      \item{\code{overall.V60.outcome}}{Factor, indicating whether the
+#'      participant had an allergic reaction in a peanut-based oral food
+#'      challenge (OFC) after 5 years, with levels \code{FAIL OFC}
+#'      (allergic reaction) and \code{PASS OFC}
+#'      (no allergic reaction).}
+#'   }
+#' @source These data are a subset of variables from the file ADSTART0_2015-03-03_14-20-10.txt,
+#'     available by downloading study files from
+#'     \url{https://www.immport.org/shared/study/SDY660}
+#' @references Du Toit, George, et al. "Randomized trial of peanut consumption in
+#'       infants at risk for peanut allergy."
+#'       New England Journal of Medicine 372.9 (2015): 803-813.
+#'       doi:10.1056/NEJMoa1414850
+#'
+"LEAP"
diff --git a/R/data-arenosa.R b/R/data-arenosa.R
@@ -0,0 +1,39 @@
+#' arenosa
+#'
+#' Published results used RNA-Seq to investigate how cold responsiveness differs
+#'     in two populations of A. arenosa:
+#'     TBG (collected from Triberg, Germany) and
+#'     KA (collected from Kasparstein, Austria). Each row corresponds to a gene;
+#'     the first column contains the gene name; other columns correspond to expression
+#'     measured in a plant sample. Three plants of each population were exposed
+#'     to cold (vernalized, denoted by v), and three were not (non-vernalized,
+#'     denoted by nv). Expression was measured in gene counts
+#'     (i.e. the number of RNA transcripts present in a sample);
+#'     the data were then normalized to allow comparison between samples.
+#'
+#' @name arenosa
+#' @docType data
+#' @format A tibble with 1088 rows and 13 variables:
+#' \describe{
+#'    \item{\code{gene.name}}{a character vector}
+#'    \item{\code{ka.nv.1}}{a numeric vector}
+#'    \item{\code{ka.nv.2}}{a numeric vector}
+#'    \item{\code{ka.nv.3}}{a numeric vector}
+#'    \item{\code{ka.v.1}}{a numeric vector}
+#'    \item{\code{ka.v.2}}{a numeric vector}
+#'    \item{\code{ka.v.3}}{a numeric vector}
+#'    \item{\code{tbg.nv.1}}{a numeric vector}
+#'    \item{\code{tbg.nv.2}}{a numeric vector}
+#'    \item{\code{tbg.nv.3}}{a numeric vector}
+#'    \item{\code{tbg.v.1}}{a numeric vector}
+#'    \item{\code{tbg.v.2}}{a numeric vector}
+#'    \item{\code{tbg.v.3}}{a numeric vector}
+#'    }
+#' @references Pierre Baduel, Brian Arnold, Cara M. Weisman, Ben Hunter, Kirsten Bomblies,
+#'     Habitat-Associated Life History and
+#'     Stress-Tolerance Variation in Arabidopsis arenosa, Plant Physiology,
+#'     Volume 171, Issue 1, May 2016, Pages 437–451,
+#'     \url{https://doi.org/10.1104/pp.15.01875}
+#' @source K. Bomblies lab, Harvard University.
+#'
+"arenosa"
diff --git a/R/data-cdc.R b/R/data-cdc.R
@@ -0,0 +1,27 @@
+#' cdc
+#'
+#' A dataset from the 2000 Behavioral Risk Factor Surveillance System (BRFSS),
+#' conducted by the US Centers for Disease Control and Prevention, used to
+#' illustrate inference on demographic data.
+#'
+#' @name cdc
+#' @docType data
+#' @format A data frame with 20,000 rows and 9 variables:
+#' \describe{
+#'    \item{\code{genhlth}}{Factor with levels \code{excellent}, \code{very good},
+#'     \code{good}, \code{fair}, \code{poor}}
+#'    \item{\code{exerany}}{Numeric vector; 1 if the respondent exercised in the
+#'    past month and 0 otherwise.}
+#'    \item{\code{hlthplan}}{Numeric; 1 if the respondent has some form
+#'    of health coverage and 0 otherwise.}
+#'    \item{\code{smoke100}}{Numeric; 1 if the respondent has smoked at least 100
+#'    cigarettes in their entire life and 0 otherwise.}
+#'    \item{\code{height}}{Numeric; respondent's height in inches.}
+#'    \item{\code{weight}}{Numeric;  respondent's weight in pounds.}
+#'    \item{\code{wtdesire}}{Numeric; respondent's desired weight in pounds.}
+#'    \item{\code{age}}{Numeric;  respondent's age in years.}
+#'    \item{\code{gender}}{Factor with two levels, \code{m} and \code{f}}
+#'    }
+#' @source \url{https://www.cdc.gov/brfss/index.html}
+#'
+"cdc"
diff --git a/R/data-cdc.samp.R b/R/data-cdc.samp.R
@@ -0,0 +1,26 @@
+#' cdc.samp
+#'
+#' A sample of 60 individuals from the 2000 Behavioral Risk Factor Surveillance System
+#' (BRFSS) conducted by the US Centers for Disease Control and Prevention.
+#'
+#' @name cdc.samp
+#' @docType data
+#' @format A tibble with 60 rows and 9 variables:
+#' \describe{
+#'    \item{\code{genhlth}}{Factor with levels \code{excellent}, \code{very good},
+#'    \code{good}, \code{fair}, \code{poor}}
+#'    \item{\code{exerany}}{Numeric vector; 1 if the respondent exercised in the
+#'    past month and 0 otherwise.}
+#'    \item{\code{hlthplan}}{Numeric vector; 1 if the respondent has some form
+#'    of health coverage and 0 otherwise.}
+#'    \item{\code{smoke100}}{Numeric; 1 if the respondent has smoked at least 100
+#'    cigarettes in their entire life and 0 otherwise.}
+#'    \item{\code{height}}{Numeric; respondent's height in inches.}
+#'    \item{\code{weight}}{Numeric;  respondent's weight in pounds.}
+#'    \item{\code{wtdesire}}{Numeric; respondent's desired weight in pounds.}
+#'    \item{\code{age}}{Numeric;  respondent's age in years.}
+#'    \item{\code{gender}}{Factor with two levels, \code{m} and \code{f}}
+#'    }
+#' @source \url{http://www.openintro.org/stat/data/cdc.R}
+#'
+"cdc.samp"
diff --git a/R/data-census.2010.R b/R/data-census.2010.R
@@ -0,0 +1,26 @@
+#' census.2010
+#'
+#' United States 2010 infant mortality and number of physicians by state,
+#' including the District of Columbia.
+#'
+#' Data were abstracted from the 2010 Statistical Abstract of the United States.
+#' Due to a lag in recording state level data, the infant mortality data are from
+#' 2009 and the data on physicians are from 2007. Both measurements are subject to
+#' change annually, so these data are not current and should not be used for
+#' inference about infant mortality. More current data can be found at the US
+#' Centers for Disease Control and Prevention (\url{https://www.cdc.gov/nchs/pressroom/sosmap/infant_mortality_rates/infant_mortality.htm}), and in the dataset \code{infant_mortality_2022}.
+#'
+#' @name census.2010
+#' @docType data
+#' @format A data frame with 51 rows and 3 columns.
+#' \describe{
+#'    \item{\code{state}}{Character vector, US state, including the District of Columbia}
+#'    \item{\code{inf.mort}}{Numeric vector, number of deaths between 1 day
+#'        and 1 year of age per 1,000 live births}
+#'    \item{\code{doctors}}{Numeric vector, active physicians per 100,000 population}
+#'   }
+#' @source \url{https://www.census.gov/library/publications/2009/compendia/statab/129ed/births-deaths-marriages-divorces.html},
+#'    \url{https://www.census.gov/library/publications/2009/compendia/statab/129ed/health-nutrition.html}
+#'
+"census.2010"
+
diff --git a/R/data-danish.ed.primary.R b/R/data-danish.ed.primary.R
@@ -0,0 +1,56 @@
+#' danish.ed.primary
+#'
+#' Data from a Danish study on triage in an emergency department (ED)
+#'
+#' Data from a prospective cohort study of triage scoring for an emergency
+#' department (ED).  The study examined whether the use of patient level
+#' measurements would improve an existing triage score. These data are the
+#' training data (called primary data in the original manuscript) used for model
+#' building. Some variable names have been changed for readability, but the data
+#' on 21 variables for the 6,249 participants are otherwise unchanged.
+#'
+#' @name danish.ed.primary
+#' @docType data
+#' @format A tibble with 6249 rows and 21 variables:
+#' \describe{
+#'    \item{\code{mort30}}{numeric, 1 if patient died within 30 days of admission, 0
+#'    otherwise}
+#'    \item{\code{triage}}{factor, triage score given at arrival to ED.
+#'    Values \code{green}, \code{yellow}, \code{orange}, \code{red}, from lowest
+#'    to highest priority
+#'    for treatment.  The value \code{blue} normally denotes severity not
+#'    warranting admission to the ED, but no participants coded blue
+#'    are in these data.}
+#'    \item{\code{age}}{numeric, age in years, rounded to lower integer}
+#'    \item{\code{sex}}{factor, values \code{female}, \code{male}}
+#'    \item{\code{albumin}}{numeric, serum albumin, in g/L}
+#'    \item{\code{creatinine}}{numeric, serum creatinine, in umol/L}
+#'    \item{\code{hemaglobin}}{numeric, serum hemoglobin, in mmol/L}
+#'    \item{\code{potassium}}{numeric, serum potassium, in mmol/L}
+#'    \item{\code{leuk.count}}{numeric, blood leukocyte count, in 10E9/L}
+#'    \item{\code{sodium}}{numeric, serum sodium, in mmol/L}
+#'    \item{\code{c.react.protein}}{numeric, serum C-reactive protein}
+#'    \item{\code{oxygen.sat}}{numeric, peripheral arterial oxygen saturation, as a percent}
+#'    \item{\code{resp.rate}}{numeric, respiratory rate per minute}
+#'    \item{\code{heart.rate}}{numeric, heart rate, beats/min}
+#'    \item{\code{systolic.bp}}{numeric, systolic blood pressure, in mmHg}
+#'    \item{\code{glasgow.coma.scale}}{numeric, extent
+#'    of impaired consciousness in patients with acute medical condition or
+#'    trauma, scored between 3 and 15, 3 being the worst and 15 the best. Score
+#'    is based on 3 subscales, best eye, verbal and motor responses.}
+#'    \item{\code{readmit.hosp}}{factor, readmitted to hospital within 30 days,
+#'    values \code{yes}, \code{no}}
+#'    \item{\code{days.in.hosp}}{numeric, number of days admitted to hospital}
+#'    \item{\code{icu.time}}{numeric, number of days in the intensive care unit.
+#'    The value 99999 indicates that the patient was not admitted to the ICU}
+#'    \item{\code{icu.status}}{factor, patient admitted to ICU, values \code{yes},
+#'    \code{no}}
+#'    }
+#' @references Kristensen, Michael, et al. "Routine blood tests are associated
+#' with short term mortality and can improve emergency department triage: a cohort
+#'  study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and
+#'  Emergency Medicine 25 (2017): 1-8.
+#' \url{https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader}
+#' @source \url{https://doi.org/10.5061/dryad.m2bq5}
+#'
+"danish.ed.primary"
diff --git a/R/data-danish.ed.validation.R b/R/data-danish.ed.validation.R
@@ -0,0 +1,50 @@
+#' Data from a Danish study on triage in an emergency department (ED)
+#'
+#' Data from a prospective cohort study of triage scoring for an emergency
+#' department (ED).  The study examined whether the use of patient level
+#' measurements would improve an existing triage score. These data were used as
+#' a test set (called validation in the manuscript) to examine the performance
+#' of the model built using the training (primary) cohort. Some variable names
+#' have been changed for readability and for consistency with the primary dataset,
+#' but the data on 18 variables for the 6,383 participants are otherwise unchanged.
+#' Some variables in the primary dataset do not appear in these data.
+#'
+#' @name danish.ed.validation
+#' @docType data
+#' @format A tibble with 6383 rows and 18 variables:
+#' \describe{
+#'    \item{\code{mort30}}{numeric, 1 if patient died within 30 days of admission, 0
+#'    otherwise}
+#'    \item{\code{triage}}{factor, triage score given at arrival to ED.
+#'    Values \code{blue}, \code{green}, \code{yellow}, \code{orange}, \code{red},
+#'    from lowest to highest priority
+#'    for treatment.  The value \code{blue} normally denotes severity not
+#'    warranting admission to the ED.  Participants coded \code{blue}
+#'    are in these data but not in the primary data.}
+#'    \item{\code{age}}{numeric, age in years, rounded to lower integer}
+#'    \item{\code{sex}}{factor, \code{female}, \code{male}}
+#'    \item{\code{albumin}}{numeric, serum albumin, in g/L}
+#'    \item{\code{creatinine}}{numeric, serum creatinine, in umol/L}
+#'    \item{\code{hemaglobin}}{numeric, serum hemoglobin, in mmol/L}
+#'    \item{\code{potassium}}{numeric, serum potassium, in mmol/L}
+#'    \item{\code{leuk.count}}{numeric, blood leukocyte count, in 10E9/L}
+#'    \item{\code{sodium}}{numeric, serum sodium, in mmol/L}
+#'    \item{\code{c.react.protein}}{numeric, serum C-reactive protein}
+#'    \item{\code{oxygen.sat}}{numeric, peripheral arterial oxygen saturation, as a percent}
+#'    \item{\code{resp.rate}}{numeric, respiratory rate per minute}
+#'    \item{\code{heart.rate}}{numeric, heart rate, beats/min}
+#'    \item{\code{systolic.bp}}{numeric, systolic blood pressure, in mmHg}
+#'    \item{\code{readmit.hosp}}{factor, readmitted to hospital within 30 days,
+#'    with values \code{yes}, \code{no}}
+#'    \item{\code{days.in.hosp}}{numeric, number of days admitted to hospital}
+#'    \item{\code{icu.status}}{factor, patient admitted to ICU, with values
+#'    \code{yes}, \code{no}}
+#'    }
+#' @references Kristensen, Michael, et al. "Routine blood tests are associated
+#' with short term mortality and can improve emergency department triage: a cohort
+#'  study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and
+#'  Emergency Medicine 25 (2017): 1-8.
+#' \url{https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader}
+#' @source \url{https://doi.org/10.5061/dryad.m2bq5}
+#'
+"danish.ed.validation"
diff --git a/R/data-dds.dscr.R b/R/data-dds.dscr.R
@@ -0,0 +1,36 @@
+#' A dataset on disbursements from the California Department of Developmental Services (DDS)
+#'
+#' The dataset represents a sample of 1,000 DDS consumers (out of a total
+#' population of approximately 250,000), and includes information about age,
+#' gender, ethnicity, and the amount of financial support per consumer provided
+#' by the DDS. The dataset is based on recorded attributes of consumers, but has
+#' been altered to maintain consumer privacy.  From the Taylor and Mickel paper:
+#' "The data set originated from DDS’s Client Master File. In order to remain in
+#' compliance with California State Legislation, the data have been altered to
+#' protect the rights and privacy of specific individual consumers. The provided
+#' data set is based on actual attributes of consumers."
+#'
+#' @name dds.discr
+#' @docType data
+#' @format A data frame with 1000 rows and 6 variables:
+#'    \describe{
+#'    \item{\code{id}}{Numeric, Unique identification code for each resident}
+#'    \item{\code{age.cohort}}{A factor, \code{0-5} years,
+#'      \code{6-12} years, \code{13-17} years, \code{18-21} years, \code{22-50} years,
+#'      and \code{51+} years}
+#'    \item{\code{age}}{Numeric, Age measured in years}
+#'    \item{\code{gender}}{A factor, with levels  \code{Female} or \code{Male}}
+#'    \item{\code{expenditures}}{Numeric, Annual amount spent by the
+#'       State on an individual, measured in USD}
+#'   \item{\code{ethnicity}}{Factor, Ethnic group, recorded as
+#'       \code{American Indian},  \code{Asian}, \code{Black}, \code{Hispanic},
+#'       \code{Multi Race}, \code{Native Hawaiian}, \code{Other},
+#'       \code{White not Hispanic}}
+#'   }
+#' @references Taylor, Stanley A., and Amy E. Mickel. Simpson's paradox: A data
+#'   set and discrimination case study exercise. Journal of Statistics Education
+#'   22.1 (2014). www.amstat.org/publications/jse/v22n1/mickel.pdf
+#'   Data contained in supplement B of Taylor and Mickel.
+#'
+"dds.discr"
+
diff --git a/R/data-famuss.R b/R/data-famuss.R
@@ -0,0 +1,42 @@
+#' A dataset to examine the relationship between muscle strength and the single nucleotide polymorphism (SNP) actn3.r577x.
+#'
+#' This dataset is a subset of the larger dataset from the Functional SNPs
+#' Associated with Muscle Size and Strength (FAMuSS) study by Thompson et al. It
+#' contains demographic and response variables and the SNP coding for the study participants.
+#' Unlike the data in the previous version of the \code{oibiostat} data package,
+#' this dataset retains the missing values. The data are also discussed in the
+#' Foulkes text. Strength was measured in both dominant and non-dominant arms
+#' before and after resistance training. The particular gene of interest was
+#' ACTN3, the "sports gene."
+#'
+#' @name famuss
+#' @docType data
+#' @format A tibble with 1397 rows and 10 variables
+#' \describe{
+#'    \item{\code{ndrm.ch}}{A numeric vector, the percent change in strength
+#'    in the non-dominant arm, from before to after training.}
+#'    \item{\code{drm.ch}}{A numeric vector, percent change in strength in
+#'     dominant arm.}
+#'    \item{\code{sex}}{A factor with levels \code{Female} and \code{Male}}
+#'    \item{\code{age}}{A numeric vector, age in years.}
+#'    \item{\code{race}}{A factor with levels \code{African Am}, \code{Asian},
+#'       \code{Caucasian}, \code{Hispanic}, and \code{Other}}
+#'    \item{\code{height}}{A numeric vector,
+#'    height in inches.}
+#'    \item{\code{weight}}{A numeric vector, weight in pounds.}
+#'    \item{\code{actn3.r577x}}{A factor with levels \code{CC}, \code{CT}, and \code{TT},
+#'     that shows the genotype at residue rs540874 (location r577x) within the ACTN3
+#'     SNP.}
+#'    \item{\code{bmi}}{A numeric vector, body mass index}
+#'    }
+#' @source Personal communication from A. Foulkes
+#' @references Thompson P, Moyna N, Seip R, et al. Medicine and Science in Sports
+#'     and Exercise 36(7): 1132-1139, 2004. Clarkson P, et al., Journal of Applied
+#'     Physiology 99: 154-163, 2005. Pescatello L, et al. Highlights from the
+#'     functional single nucleotide polymorphisms associated with human muscle
+#'     size and strength or FAMuSS study, BioMed Research International 2013.
+#'     Foulkes, Andrea S. Applied Statistical Genetics using R for Population
+#'     Association Studies. Springer, 2009.
+#'
+"famuss"
+