updating a few datasets (#70)
* updating a few datasets

* fixed the time in the paralympic dataset

* fixing the variable names

* temperature data, 1950 and 2022

* Update paralympic data prep and example

* Document paralympic data

* Document NYC marathon data

* Update 2022 durham pm25 data + align 2011 data

* Update pipe

* Update pipe

* Run document

* Add type variable to docs

* Fix data prep issues and add pkgs ex depends on

* Move US temp data to usdata pkg

* Ungroup data in example

* Add new datasets to pkgdown site

---------

Co-authored-by: Mine Çetinkaya-Rundel <cetinkaya.mine@gmail.com>
hardin47 and mine-cetinkaya-rundel authored Sep 26, 2023
1 parent 6fac392 commit 8ab1ca4
Showing 28 changed files with 889 additions and 152 deletions.
6 changes: 3 additions & 3 deletions R/data-fish_age.R
@@ -27,9 +27,9 @@
#'
#' # Count the number of one-year-old fish at each location.
#'
#' fish_age %>%
#' filter(one_year_old == "yes") %>%
#' count(year, location) %>%
#' fish_age |>
#' filter(one_year_old == "yes") |>
#' count(year, location) |>
#' pivot_wider(names_from = location, values_from = n)
#'
"fish_age"
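The hunk above swaps the magrittr pipe `%>%` for the native pipe `|>`. As a minimal sketch of why the swap should not change results in a chain like this one, the following uses base R only and a made-up data frame (`toy_fish` is hypothetical, not the real `fish_age` data). It assumes R 4.1 or later, where `|>` is available.

```r
# Hypothetical toy data standing in for fish_age (values are made up).
toy_fish <- data.frame(
  year = c(1997, 1997, 1998, 1998),
  location = c("A", "B", "A", "B"),
  one_year_old = c("yes", "no", "yes", "yes")
)

# Nested form: read from the inside out.
nested <- nrow(subset(toy_fish, one_year_old == "yes"))

# Native pipe: `x |> f(y)` is parsed as `f(x, y)`, so this is the same call.
piped <- toy_fish |>
  subset(one_year_old == "yes") |>
  nrow()

identical(nested, piped)
```

Because `|>` is a purely syntactic rewrite, the piped and nested forms evaluate the same expression; no extra package needs to be attached for the pipe itself.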
11 changes: 5 additions & 6 deletions R/data-lecture_learning.R
@@ -39,9 +39,9 @@
#'
#' # Calculate the average memory test proportion by lecture delivery method
#' # and gender.
#' lecture_learning %>%
#' group_by(method, gender) %>%
#' summarize(average_memory = mean(memory), count = n())
#' lecture_learning |>
#' group_by(method, gender) |>
#' summarize(average_memory = mean(memory), count = n(), .groups = "drop")
#'
#' # Compare visually the differences in memory test proportions by delivery
#' # method and gender.
@@ -61,10 +61,9 @@
#'
#' # Calculating the proportion of students who were most motivated to remain
#' # attentive in each delivery method.
#' lecture_learning %>%
#' count(motivation_both) %>%
#' lecture_learning |>
#' count(motivation_both) |>
#' mutate(proportion = n / sum(n))
#'
#' @source [PLOS One](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141587)
#'
"lecture_learning"
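The first hunk above adds `.groups = "drop"` to `summarize()`. A minimal sketch of what that argument changes, assuming dplyr 1.0.0 or later and a made-up tibble (`toy_lecture` is hypothetical, not the real `lecture_learning` data):

```r
library(dplyr)

# Hypothetical toy data (values are made up).
toy_lecture <- tibble(
  method = c("in person", "in person", "video", "video"),
  gender = c("F", "M", "F", "M"),
  memory = c(0.8, 0.6, 0.7, 0.5)
)

# Default behaviour: only the last grouping variable is dropped,
# so the result is still grouped by `method` (dplyr prints a message).
default_result <- toy_lecture |>
  group_by(method, gender) |>
  summarize(average_memory = mean(memory), count = n())

# With .groups = "drop" the result carries no grouping at all.
dropped_result <- toy_lecture |>
  group_by(method, gender) |>
  summarize(average_memory = mean(memory), count = n(), .groups = "drop")

group_vars(default_result)
group_vars(dropped_result)
```

Returning an ungrouped result avoids the informational message in the example output and keeps later verbs from silently operating per group.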
4 changes: 2 additions & 2 deletions R/data-nyc_marathon.R
@@ -1,11 +1,11 @@
#' New York City Marathon Times
#'
#' Marathon times of runners in the Men and Women divisions of the New York
#' City Marathon, 1970 - 2020.
#' City Marathon, 1970 - 2022.
#'
#' @name nyc_marathon
#' @docType data
#' @format A data frame with 102 observations on the following 7 variables.
#' @format A data frame with 106 observations on the following 7 variables.
#' \describe{
#' \item{year}{Year of marathon.}
#' \item{name}{Name of winner.}
53 changes: 53 additions & 0 deletions R/data-paralympic_1500.R
@@ -0,0 +1,53 @@
#' Race time for Olympic and Paralympic 1500m.
#'
#' Compiled gold medal times for the 1500m race in the Olympic Games and the
#' Paralympic Games. The times given for contestants competing in
#' the Paralympic Games are for athletes with different visual impairments;
#' T11 indicates fully blind (with an option to race with a guide-runner)
#' with T12 and T13 as lower levels of visual impairment.
#'
#'
#' @name paralympic_1500
#' @docType data
#' @format A data frame with 83 rows and 10 variables.
#' \describe{
#' \item{year}{Year the games took place.}
#' \item{city}{City of the games.}
#' \item{country_of_games}{Country of the games.}
#' \item{division}{Division: `Men` or `Women`.}
#' \item{type}{Type of race: `Olympic`, or the Paralympic class `T11`, `T12`, or `T13`.}
#' \item{name}{Name of the athlete.}
#' \item{country_of_athlete}{Country of athlete.}
#' \item{time}{Time of gold medal race, in m:s.}
#' \item{time_min}{Time of gold medal race, in decimal minutes (min + sec/60).}
#' }
#' @source [https://www.paralympic.org/](https://www.paralympic.org/) and [https://en.wikipedia.org/wiki/1500_metres_at_the_Olympics](https://en.wikipedia.org/wiki/1500_metres_at_the_Olympics).
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#' library(dplyr)
#'
#' paralympic_1500 |>
#'   mutate(
#'     sight_level = case_when(
#'       type == "T11" ~ "total impairment",
#'       type == "T12" ~ "some impairment",
#'       type == "T13" ~ "some impairment",
#'       type == "Olympic" ~ "no impairment"
#'     )
#'   ) |>
#'   filter(division == "Men", year > 1920) |>
#'   filter(type == "Olympic" | type == "T11") |>
#'   ggplot(aes(x = year, y = time_min, color = sight_level, shape = sight_level)) +
#'   geom_point() +
#'   scale_x_continuous(breaks = seq(1924, 2020, by = 8)) +
#'   labs(
#'     title = "Men's Olympic and Paralympic 1500m race times",
#'     x = "Year",
#'     y = "Time of Race (minutes)",
#'     color = "Sight level",
#'     shape = "Sight level"
#'   )
#'
"paralympic_1500"
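The documentation above describes `time_min` as decimal minutes, computed as min + sec/60. A minimal sketch of that conversion starting from an "m:s" string, using base R only; `ms_to_minutes` is a hypothetical helper and the example times are made up, not taken from the dataset:

```r
# Convert "m:s" time strings to decimal minutes (min + sec / 60).
ms_to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(
    parts,
    function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
    numeric(1)
  )
}

# Made-up example times.
example_minutes <- ms_to_minutes(c("3:30", "4:15.6"))
example_minutes
```

Working in decimal minutes gives a single numeric axis for plotting, which is what the `ggplot()` example above relies on for `y = time_min`.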
41 changes: 22 additions & 19 deletions R/data-pm25_2011_durham.R
@@ -1,7 +1,7 @@
#' Air quality for Durham, NC
#'
#' Daily air quality is measured by the air quality index (AQI) reported by the
#' Environmental Protection Agency.
#' Environmental Protection Agency in 2011.
#'
#'
#' @name pm25_2011_durham
@@ -10,30 +10,33 @@
#' @format A data frame with 449 observations on the following 20 variables.
#' \describe{
#' \item{date}{Date}
#' \item{aqs_site_id}{a factor with levels \code{37-063-0015}}
#' \item{poc}{a numeric vector}
#' \item{daily_mean_pm2_5_concentration}{a numeric vector}
#' \item{units}{a factor with levels \code{ug/m3 LC}}
#' \item{daily_aqi_value}{a numeric vector}
#' \item{daily_obs_count}{a numeric vector}
#' \item{percent_complete}{a numeric vector}
#' \item{aqs_parameter_code}{a numeric vector}
#' \item{aqs_parameter_desc}{a factor with levels \code{Acceptable PM2.5 AQI & Speciation Mass} \code{PM2.5 - Local Conditions}}
#' \item{aqs_site_id}{The numeric site ID.}
#' \item{poc}{A numeric vector, the Parameter Occurrence Code.}
#' \item{daily_mean_pm2_5_concentration}{A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.}
#' \item{units}{A character vector with value \code{ug/m3 LC}.}
#' \item{daily_aqi_value}{A numeric vector with the daily air quality index.}
#' \item{daily_obs_count}{A numeric vector.}
#' \item{percent_complete}{A numeric vector.}
#' \item{aqs_parameter_code}{A numeric vector.}
#' \item{aqs_parameter_desc}{A factor with levels \code{PM2.5 - Local Conditions} and \code{Acceptable PM2.5 AQI & Speciation Mass}.}
#' \item{cbsa_code}{A numeric vector.}
#' \item{cbsa_name}{A character vector with value \code{Durham, NC}.}
#' \item{state_code}{A numeric vector.}
#' \item{state}{A character vector with value \code{North Carolina}.}
#' \item{county_code}{A numeric vector.}
#' \item{county}{A character vector with value \code{Durham}.}
#' \item{site_latitude}{A numeric vector of the latitude.}
#' \item{site_longitude}{A numeric vector of the longitude.}
#' \item{csa_code}{a numeric vector}
#' \item{csa_name}{a factor with levels \code{Raleigh-Durham-Cary, NC}}
#' \item{cbsa_code}{a numeric vector}
#' \item{cbsa_name}{a factor with levels \code{Durham, NC}}
#' \item{state_code}{a numeric vector}
#' \item{state}{a factor with levels \code{North Carolina}}
#' \item{county_code}{a numeric vector}
#' \item{county}{a factor with levels \code{Durham}}
#' \item{site_latitude}{a numeric vector}
#' \item{site_longitude}{a numeric vector}
#' }
#' @source US Environmental Protection Agency, AirData, 2011.
#' \url{http://www3.epa.gov/airdata/ad_data_daily.html}
#' @keywords datasets
#' @examples
#'
#' pm25_2011_durham
#' library(ggplot2)
#'
#' ggplot(pm25_2011_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
#'   geom_line()
"pm25_2011_durham"
40 changes: 40 additions & 0 deletions R/data-pm25_2022_durham.R
@@ -0,0 +1,40 @@
#' Air quality for Durham, NC
#'
#' Daily air quality is measured by the air quality index (AQI) reported by the
#' Environmental Protection Agency in 2022.
#'
#'
#' @name pm25_2022_durham
#' @docType data
#' @format A data frame with 356 observations on the following 20 variables.
#' \describe{
#' \item{date}{Date.}
#' \item{aqs_site_id}{The numeric site ID.}
#' \item{poc}{A numeric vector, the Parameter Occurrence Code.}
#' \item{daily_mean_pm2_5_concentration}{A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.}
#' \item{units}{A character vector with value \code{ug/m3 LC}.}
#' \item{daily_aqi_value}{A numeric vector with the daily air quality index.}
#' \item{daily_obs_count}{A numeric vector.}
#' \item{percent_complete}{A numeric vector.}
#' \item{aqs_parameter_code}{A numeric vector.}
#' \item{aqs_parameter_desc}{A factor vector with level \code{PM2.5 - Local Conditions}.}
#' \item{cbsa_code}{A numeric vector.}
#' \item{cbsa_name}{A character vector with value \code{Durham-Chapel Hill, NC}.}
#' \item{state_code}{A numeric vector.}
#' \item{state}{A character vector with value \code{North Carolina}.}
#' \item{county_code}{A numeric vector.}
#' \item{county}{A character vector with value \code{Durham}.}
#' \item{site_latitude}{A numeric vector of the latitude.}
#' \item{site_longitude}{A numeric vector of the longitude.}
#' \item{site_name}{A character vector with value \code{Durham Armory}.}
#' }
#' @source US Environmental Protection Agency, AirData, 2022.
#' \url{http://www3.epa.gov/airdata/ad_data_daily.html}
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#'
#' ggplot(pm25_2022_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
#'   geom_line()
"pm25_2022_durham"
60 changes: 31 additions & 29 deletions data-raw/nyc_marathon/nyc-marathon-mens.txt
@@ -1,9 +1,9 @@
year | name | country | time | note
1970 | Gary Muhrcke | United States | 2:31:38 | Course record
1971 | Norman Higgins | United States | 2:22:54 | Course record
1972 | Sheldon Karlin | United States | 2:27:52 |
1972 | Sheldon Karlin | United States | 2:27:52 |
1973 | Tom Fleming | United States | 2:21:54 | Course record
1974 | Norbert Sander | United States | 2:26:30 |
1974 | Norbert Sander | United States | 2:26:30 |
1975 | Tom Fleming | United States | 2:19:27 | Course record, second victory
1976 | Bill Rodgers | United States | 2:10:10 | Course record
1977 | Bill Rodgers | United States | 2:11:28 | Second victory
@@ -12,41 +12,43 @@ year | name | country | time | note
1980 | Alberto Salazar | United States | 2:09:41 | Course record
1981 | Alberto Salazar | United States | 2:08:13 | Course record (course measured short), second victory
1982 | Alberto Salazar | United States | 2:09:29 | Third victory
1983 | Rod Dixon | New Zealand | 2:08:59 |
1984 | Orlando Pizzolato | Italy | 2:14:53 |
1983 | Rod Dixon | New Zealand | 2:08:59 |
1984 | Orlando Pizzolato | Italy | 2:14:53 |
1985 | Orlando Pizzolato | Italy | 2:11:34 | Second victory
1986 | Gianni Poli | Italy | 2:11:06 |
1987 | Ibrahim Hussein | Kenya | 2:11:01 |
1988 | Steve Jones | United Kingdom| 2:08:20 |
1986 | Gianni Poli | Italy | 2:11:06 |
1987 | Ibrahim Hussein | Kenya | 2:11:01 |
1988 | Steve Jones | United Kingdom| 2:08:20 |
1989 | Juma Ikangaa | Tanzania | 2:08:01 | Course record
1990 | Douglas Wakiihuri | Kenya | 2:12:39 |
1991 | Salvador García | Mexico | 2:09:28 |
1992 | Willie Mtolo | South Africa | 2:09:29 |
1993 | Andrés Espinosa | Mexico | 2:10:04 |
1994 | Germán Silva | Mexico | 2:11:21 |
1990 | Douglas Wakiihuri | Kenya | 2:12:39 |
1991 | Salvador García | Mexico | 2:09:28 |
1992 | Willie Mtolo | South Africa | 2:09:29 |
1993 | Andrés Espinosa | Mexico | 2:10:04 |
1994 | Germán Silva | Mexico | 2:11:21 |
1995 | Germán Silva | Mexico | 2:11:00 | Second victory
1996 | Giacomo Leone | Italy | 2:09:54 |
1997 | John Kagwe | Kenya | 2:08:12 |
1996 | Giacomo Leone | Italy | 2:09:54 |
1997 | John Kagwe | Kenya | 2:08:12 |
1998 | John Kagwe | Kenya | 2:08:45 | Second victory
1999 | Joseph Chebet | Kenya | 2:09:14 |
2000 | Abdelkader El Mouaziz | Morocco | 2:10:09 |
1999 | Joseph Chebet | Kenya | 2:09:14 |
2000 | Abdelkader El Mouaziz | Morocco | 2:10:09 |
2001 | Tesfaye Jifar | Ethiopia | 2:07:43 | Course record
2002 | Rodgers Rop | Kenya | 2:08:07 |
2003 | Martin Lel | Kenya | 2:10:30 |
2004 | Hendrick Ramaala | South Africa | 2:09:28 |
2005 | Paul Tergat | Kenya | 2:09:30 |
2006 | Marílson Gomes dos Santos| Brazil | 2:09:58 |
2002 | Rodgers Rop | Kenya | 2:08:07 |
2003 | Martin Lel | Kenya | 2:10:30 |
2004 | Hendrick Ramaala | South Africa | 2:09:28 |
2005 | Paul Tergat | Kenya | 2:09:30 |
2006 | Marílson Gomes dos Santos| Brazil | 2:09:58 |
2007 | Martin Lel | Kenya | 2:09:04 | Second victory
2008 | Marílson Gomes dos Santos| Brazil | 2:08:43 | Second victory
2009 | Meb Keflezighi | United States | 2:09:15 |
2010 | Gebregziabher Gebremariam| Ethiopia | 2:08:14 |
2009 | Meb Keflezighi | United States | 2:09:15 |
2010 | Gebregziabher Gebremariam| Ethiopia | 2:08:14 |
2011 | Geoffrey Mutai | Kenya | 2:05:06 | Current course record
2012 | | | | Canceled due to Hurricane Sandy
2013 | Geoffrey Mutai | Kenya | 2:08:24 | Second victory
2014 | Wilson Kipsang | Kenya | 2:10:59 |
2015 | Stanley Biwott | Kenya | 2:10:34 |
2016 | Ghirmay Ghebreslassie | Eritrea | 2:07:51 |
2017 | Geoffrey Kamworor | Kenya | 2:10:53 |
2018 | Lelisa Desisa | Ethiopia | 2:05:59 |
2014 | Wilson Kipsang | Kenya | 2:10:59 |
2015 | Stanley Biwott | Kenya | 2:10:34 |
2016 | Ghirmay Ghebreslassie | Eritrea | 2:07:51 |
2017 | Geoffrey Kamworor | Kenya | 2:10:53 |
2018 | Lelisa Desisa | Ethiopia | 2:05:59 |
2019 | Geoffrey Kamworor | Kenya | 2:08:13 | Second victory
2020 | Kevin Quinn | United Kingdom| 2:23:48 | Virtual event held due to the COVID-19
2020 | Kevin Quinn | United Kingdom| 2:23:48 | Virtual event held due to the COVID-19
2021 | Albert Korir | Kenya | 2:08:22 |
2022 | Evans Chebet | Kenya | 2:08:41 |
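The pipe-delimited layout of the file above can be read directly in base R. The following is a minimal sketch, assuming `read.table()` is used for parsing (the repository's own marathon data-prep script is not part of this diff). `hms_to_hours` is a hypothetical helper, and the rows are a small subset copied from the table above.

```r
# A few rows copied from the men's file, including the canceled 2012 race.
raw <- "year | name | country | time | note
1970 | Gary Muhrcke | United States | 2:31:38 | Course record
1972 | Sheldon Karlin | United States | 2:27:52 |
2012 | | | | Canceled due to Hurricane Sandy
2022 | Evans Chebet | Kenya | 2:08:41 |"

# sep = "|" splits the fields; strip.white = TRUE trims the padding around them.
mens <- read.table(
  text = raw, sep = "|", header = TRUE, strip.white = TRUE,
  quote = "", colClasses = "character"
)
mens$year <- as.integer(mens$year)

# Convert "h:mm:ss" to decimal hours; anything not in three parts becomes NA.
hms_to_hours <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(p) {
    if (length(p) != 3) return(NA_real_)
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600
  }, numeric(1))
}

mens$time_hrs <- hms_to_hours(mens$time)
mens[, c("year", "name", "time", "time_hrs")]
```

Reading every column as character first keeps the empty fields of the canceled year from tripping type detection; the year and time are then converted explicitly.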
50 changes: 26 additions & 24 deletions data-raw/nyc_marathon/nyc-marathon-womens.txt
@@ -1,9 +1,9 @@
year | name | country | time | note
1970 | | | | No woman finishers
1971 | Beth Bonner | United States | 2:55:22 | World record
1972 | Nina Kuscsik | United States | 3:08:41 |
1972 | Nina Kuscsik | United States | 3:08:41 |
1973 | Nina Kuscsik | United States | 2:57:07 | Second victory
1974 | Kathrine Switzer | United States | 3:07:29 |
1974 | Kathrine Switzer | United States | 3:07:29 |
1975 | Kim Merritt | United States | 2:46:14 | Course record
1976 | Miki Gorman | United States | 2:39:11 | Course record
1977 | Miki Gorman | United States | 2:43:10 | Second victory
@@ -16,37 +16,39 @@ year | name | country | time | note
1984 | Grete Waitz | Norway | 2:29:30 | Sixth victory
1985 | Grete Waitz | Norway | 2:28:34 | Seventh victory
1986 | Grete Waitz | Norway | 2:28:06 | Eighth victory
1987 | Priscilla Welch | United Kingdom| 2:30:17 |
1987 | Priscilla Welch | United Kingdom| 2:30:17 |
1988 | Grete Waitz | Norway | 2:28:07 | Ninth victory
1989 | Ingrid Kristiansen | Norway | 2:25:30 |
1990 | Wanda Panfil | Poland | 2:30:45 |
1991 | Liz McColgan | United Kingdom| 2:27:32 |
1989 | Ingrid Kristiansen | Norway | 2:25:30 |
1990 | Wanda Panfil | Poland | 2:30:45 |
1991 | Liz McColgan | United Kingdom| 2:27:32 |
1992 | Lisa Ondieki | Australia | 2:24:40 | Course record
1993 | Uta Pippig | Germany | 2:26:24 |
1994 | Tegla Loroupe | Kenya | 2:27:37 |
1993 | Uta Pippig | Germany | 2:26:24 |
1994 | Tegla Loroupe | Kenya | 2:27:37 |
1995 | Tegla Loroupe | Kenya | 2:28:06 | Second victory
1996 | Anuța Cătună | Romania | 2:28:18 |
1997 | Franziska Rochat-Moser | Switzerland | 2:28:43 |
1998 | Franca Fiacconi | Italy | 2:25:17 |
1999 | Adriana Fernández | Mexico | 2:25:06 |
2000 | Lyudmila Petrova | Russia | 2:25:45 |
1996 | Anuța Cătună | Romania | 2:28:18 |
1997 | Franziska Rochat-Moser | Switzerland | 2:28:43 |
1998 | Franca Fiacconi | Italy | 2:25:17 |
1999 | Adriana Fernández | Mexico | 2:25:06 |
2000 | Lyudmila Petrova | Russia | 2:25:45 |
2001 | Margaret Okayo | Kenya | 2:24:21 | Course record
2002 | Joyce Chepchumba | Kenya | 2:25:56 |
2002 | Joyce Chepchumba | Kenya | 2:25:56 |
2003 | Margaret Okayo | Kenya | 2:22:31 | Current course record, second victory
2004 | Paula Radcliffe | United Kingdom| 2:23:10 |
2005 | Jeļena Prokopčuka | Latvia | 2:24:41 |
2004 | Paula Radcliffe | United Kingdom| 2:23:10 |
2005 | Jeļena Prokopčuka | Latvia | 2:24:41 |
2006 | Jeļena Prokopčuka | Latvia | 2:25:05 | Second victory
2007 | Paula Radcliffe | United Kingdom| 2:23:09 | Second victory
2008 | Paula Radcliffe | United Kingdom| 2:23:56 | Third victory
2009 | Derartu Tulu | Ethiopia | 2:28:52 |
2010 | Edna Kiplagat | Kenya | 2:28:20 |
2011 | Firehiwot Dado | Ethiopia | 2:23:15 |
2009 | Derartu Tulu | Ethiopia | 2:28:52 |
2010 | Edna Kiplagat | Kenya | 2:28:20 |
2011 | Firehiwot Dado | Ethiopia | 2:23:15 |
2012 | | | | Canceled due to Hurricane Sandy
2013 | Priscah Jeptoo | Kenya | 2:25:07 |
2014 | Mary Keitany | Kenya | 2:25:07 |
2013 | Priscah Jeptoo | Kenya | 2:25:07 |
2014 | Mary Keitany | Kenya | 2:25:07 |
2015 | Mary Keitany | Kenya | 2:24:25 | Second victory
2016 | Mary Keitany | Kenya | 2:24:26 | Third victory
2017 | Shalane Flanagan | United States | 2:26:53 |
2017 | Shalane Flanagan | United States | 2:26:53 |
2018 | Mary Keitany | Kenya | 2:22:48 | Fourth victory
2019 | Joyciline Jepkosgei | Kenya | 2:22:38 |
2020 | Stephanie Bruce | United States | 2:35:28 | Virtual event held due to the COVID-19
2019 | Joyciline Jepkosgei | Kenya | 2:22:38 |
2020 | Stephanie Bruce | United States | 2:35:28 | Virtual event held due to the COVID-19
2021 | Peres Jepchirchir | Kenya | 2:22:39 |
2022 | Sharon Lokedi | Kenya | 2:23:23 |
18 changes: 18 additions & 0 deletions data-raw/paralympic_1500/paralympic_1500-dataprep.R
@@ -0,0 +1,18 @@
# load packages ----------------------------------------------------------------

library(tidyverse)

# load data --------------------------------------------------------------------

paralympic_1500_raw <- read_csv(here::here("data-raw/paralympic_1500/paralympic_1500.csv"))

# cleaning ---------------------------------------------------------------------

paralympic_1500 <- paralympic_1500_raw |>
  mutate(time_min = time) |>
  mutate(time = paste(min, sec, sep = ":")) |>
  select(-min, -sec)

# save data --------------------------------------------------------------------

usethis::use_data(paralympic_1500, overwrite = TRUE)
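The cleaning step above builds the `time` string with `paste(min, sec, sep = ":")`. As a hedged sketch of how that behaves, assuming the raw `min` and `sec` columns are numeric (the CSV itself is not shown in this diff), the toy values below compare `paste()` with a zero-padded `sprintf()` alternative. `toy_raw` and its values are made up.

```r
# Hypothetical raw values; the second row has seconds below 10.
toy_raw <- data.frame(min = c(3, 4), sec = c(58.6, 5.2))

# paste() joins the values as-is, so single-digit seconds are not zero-padded.
pasted <- paste(toy_raw$min, toy_raw$sec, sep = ":")

# sprintf() with "%04.1f" pads seconds to two integer digits and one decimal.
padded <- sprintf("%d:%04.1f", as.integer(toy_raw$min), toy_raw$sec)

pasted
padded
```

Whether padding matters depends on how the seconds are stored in the raw file; if they are already character strings with leading zeros, `paste()` reproduces them unchanged.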