Ims issue 4 (#68)

* cars04 data added
* added life_exp data and adjusted documentation for cars04
* comics data added
* data cleaning update for comics
* nyc dataset added
* iowa dataset added
* adjusted iowa documentation, added iran data
* manhattan data added
* gss_wordsum_class added
* twins data added
* LAhomes data added
* partial movies data set
* movies data set complete
* ucb_admit data added
* updated news.md
* fixed documentation for life_exp
OpenIntroStat · May 22, 2023 · 6fac392
1 parent c5e6fb6
commit 6fac392

Showing 65 changed files with 32,564 additions and 1 deletion.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -32,7 +32,8 @@ Suggests:
     scales,
     testthat (>= 3.0.0),
     tidyr,
-    tidytext
+    tidytext,
+    stringr
 Imports: 
     ggplot2 (>= 2.2.1),
     graphics,

diff --git a/NEWS.md b/NEWS.md
@@ -6,6 +6,8 @@
   * `lecture_learning` by [@jonathanaakin](https://github.com/jonathanaakin)
 * Fix HTML version of manual
 * Remove some URLs that no longer work
+* Added new datasets:
+  * `cars04`, `life_exp`, `comics`, `nyc`, `gss_wordsum_class`, `manhattan`, `iran`, `iowa`, `twins`, `LAhomes`, `movies`, `ucb_admit`, `soda` ported from IMS Tutorials by [@npaterno](https://github.com/npaterno)
 
 # openintro 2.3.0
 

diff --git a/R/data-LAhomes.R b/R/data-LAhomes.R
@@ -0,0 +1,33 @@
+#' LAhomes
+#'
+#' Data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010.
+#'
+#' @name LAhomes
+#' @docType data
+#' @format A data frame with 1594 observations on the following 8 variables.
+#' \describe{
+#'   \item{city}{City where the home is located.}
+#'   \item{type}{Type of home with levels `Condo/Twh` - condo or townhouse, `SFR` - single family residence, and `NA`}
+#'   \item{bed}{Number of bedrooms in the home.}
+#'   \item{bath}{Number of bathrooms in the home.}
+#'   \item{garage}{Number of cars that can be parked in the garage. Note that a value of `4` refers to 4 or more garage spaces.}
+#'   \item{sqft}{Square footage of the home.}
+#'   \item{pool}{Indicates if the home has a pool.}
+#'   \item{price}{Listing price of the home.}
+#'   }
+#' @keywords datasets
+#' @examples
+#'
+#' library(ggplot2)
+#'
+#' ggplot(LAhomes, aes(sqft, price)) +
+#'   geom_point(alpha = 0.2) +
+#'   theme_minimal() +
+#'   labs(
+#'     title = "Can we predict list price from square footage?",
+#'     subtitle = "Homes in the Los Angeles area",
+#'     x = "Square feet",
+#'     y = "List price"
+#'   )
+
+"LAhomes"
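
The `garage` note above says that `4` stands for "4 or more" spaces, so the column is top-coded rather than a plain count. A minimal sketch of one way to make that explicit before plotting or tabulating, assuming `garage` is stored as the numbers 0 to 4 (the values below are hypothetical, not taken from `LAhomes`):

```r
# Hypothetical garage values for illustration only; not taken from LAhomes.
garage <- c(0, 1, 2, 4, 3, 4)

# Recode as an ordered factor so the top-coded category is labelled "4+"
# instead of being read as exactly four spaces.
garage_f <- factor(
  garage,
  levels = 0:4,
  labels = c("0", "1", "2", "3", "4+"),
  ordered = TRUE
)

# Count homes per garage category
table(garage_f)
```

Treating the column as an ordered factor keeps the ordering for plots while avoiding arithmetic (such as a mean) that would understate homes with more than four spaces.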
diff --git a/R/data-cars04.R b/R/data-cars04.R
@@ -0,0 +1,45 @@
+#' cars04
+#'
+#' A data frame with 428 rows and 19 columns. This is a record of characteristics on all of the new models of cars for sale in the US in the year 2004.
+#'
+#'
+#' @name cars04
+#' @docType data
+#' @format A data frame with 428 observations on the following 19 variables.
+#' \describe{
+#'   \item{name}{The name of the vehicle including manufacturer and model.}
+#'   \item{sports_car}{Logical variable indicating if the vehicle is a sports car.}
+#'   \item{suv}{Logical variable indicating if the vehicle is an SUV.}
+#'   \item{wagon}{Logical variable indicating if the vehicle is a wagon.}
+#'   \item{minivan}{Logical variable indicating if the vehicle is a minivan.}
+#'   \item{pickup}{Logical variable indicating if the vehicle is a pickup.}
+#'   \item{all_wheel}{Logical variable indicating if the vehicle is all-wheel drive.}
+#'   \item{rear_wheel}{Logical variable indicating if the vehicle is rear-wheel drive.}
+#'   \item{msrp}{Manufacturer suggested retail price of the vehicle.}
+#'   \item{dealer_cost}{Amount of money the dealer paid for the vehicle.}
+#'   \item{eng_size}{Displacement of the engine - the total volume of all the cylinders, measured in liters.}
+#'   \item{ncyl}{Number of cylinders in the engine.}
+#'   \item{horsepwr}{Amount of horsepower produced by the engine.}
+#'   \item{city_mpg}{Gas mileage for city driving, measured in miles per gallon.}
+#'   \item{hwy_mpg}{Gas mileage for highway driving, measured in miles per gallon.}
+#'   \item{weight}{Total weight of the vehicle, measured in pounds.}
+#'   \item{wheel_base}{Distance between the center of the front wheels and the center of the rear wheels, measured in inches.}
+#'   \item{length}{Total length of the vehicle, measured in inches.}
+#'   \item{width}{Total width of the vehicle, measured in inches.}
+#' }
+#' @keywords datasets
+#' @examples
+#'
+#' library(ggplot2)
+#'
+#' # Highway gas mileage
+#' ggplot(cars04, aes(x = hwy_mpg)) +
+#'  geom_histogram(bins = 15, color = "white",
+#'  fill = openintro::IMSCOL["green", "full"]) +
+#'  theme_minimal() +
+#'  labs(
+#'  title = "Highway gas mileage for cars from 2004",
+#'  x = "Gas Mileage (miles per gallon)",
+#'  y = "Number of cars")
+
+"cars04"
diff --git a/R/data-comics.R b/R/data-comics.R
@@ -0,0 +1,45 @@
+#' comics
+#'
+#' A data frame containing information about comic book characters from Marvel Comics and DC Comics.
+#'
+#'
+#' @name comics
+#' @docType data
+#' @format A data frame with 21821 observations on the following 11 variables.
+#' \describe{
+#'   \item{name}{Name of the character. May include: Real name, hero or villain name, alias(es) and/or which universe they live in (e.g. Earth-616 in Marvel's multiverse).}
+#'   \item{id}{Status of the character's identity with levels `Secret`, `Public`, `No Dual` and `Unknown`.}
+#'   \item{align}{Character's alignment with levels `Good`, `Bad`, `Neutral` and `Reformed Criminals`.}
+#'   \item{eye}{Character's eye color.}
+#'   \item{hair}{Character's hair color.}
+#'   \item{gender}{Character's gender.}
+#'   \item{gsm}{Character's classification as a gender or sexual minority.}
+#'   \item{alive}{Is the character dead or alive?}
+#'   \item{appearances}{Number of comic books the character appears in.}
+#'   \item{first_appear}{Date of publication for the comic book the character first appeared in.}
+#'   \item{publisher}{Publisher of the comic with levels `Marvel` and `DC`.}
+#' }
+#' @keywords datasets
+#' @examples
+#'
+#' library(ggplot2)
+#' library(dplyr)
+#'
+#' # Good v Bad
+#'
+#' plot_data <- comics %>%
+#'  filter(align == "Good" | align == "Bad")
+#'
+#' ggplot(plot_data, aes(x = align, fill = align)) +
+#'  geom_bar() +
+#'  facet_wrap(~publisher) +
+#'  scale_fill_manual(values = c(IMSCOL["red", "full"], IMSCOL["blue", "full"])) +
+#'  theme_minimal() +
+#'  labs(
+#'    title = "Is there a balance of power?",
+#'    x = "",
+#'    y = "Number of characters",
+#'    fill = ""
+#'  )
+
+"comics"
diff --git a/R/data-gss_wordsum_class.R b/R/data-gss_wordsum_class.R
@@ -0,0 +1,22 @@
+#' gss_wordsum_class
+#'
+#' A data frame containing data from the General Social Survey.
+#'
+#' @name gss_wordsum_class
+#' @docType data
+#' @format A data frame with 795 observations on the following 2 variables.
+#' \describe{
+#'   \item{wordsum}{A vocabulary score calculated based on a ten question vocabulary test, where a higher score means better vocabulary. Scores range from 1 to 10.}
+#'   \item{class}{Self-identified social class has 4 levels: lower, working, middle, and upper class.}
+#'   }
+#' @keywords datasets
+#' @examples
+#'
+#' library(dplyr)
+#'
+#' gss_wordsum_class %>%
+#'   group_by(class) %>%
+#'   summarize(mean_wordsum = mean(wordsum))
+#'
+
+"gss_wordsum_class"
diff --git a/R/data-iowa.R b/R/data-iowa.R
@@ -0,0 +1,37 @@
+#' iowa
+#'
+#' A data frame containing information about the 2016 US Presidential Election for the state of Iowa.
+#'
+#' @name iowa
+#' @docType data
+#' @format A data frame with 1386 observations on the following 5 variables.
+#' \describe{
+#'   \item{office}{The office that the candidates were running for.}
+#'   \item{candidate}{President/Vice President pairs who were running for office.}
+#'   \item{party}{Political party of the candidate.}
+#'   \item{county}{County in Iowa where the votes were cast.}
+#'   \item{votes}{Number of votes received by the candidate.}
+#'   }
+#' @keywords datasets
+#' @examples
+#'
+#' library(ggplot2)
+#' library(dplyr)
+#'
+#' plot_data <- iowa %>%
+#'   filter(candidate != "Total") %>%
+#'   group_by(candidate) %>%
+#'   summarize(total_votes = sum(votes) / 1000)
+#'
+#' ggplot(plot_data, aes(total_votes, candidate)) +
+#'   geom_col() +
+#'   theme_minimal() +
+#'   labs(
+#'     title = "2016 Presidential Election in Iowa",
+#'     subtitle = "Popular vote",
+#'     y = "",
+#'     x = "Number of Votes (in thousands)"
+#'   )
+
+"iowa"
diff --git a/R/data-iran.R b/R/data-iran.R
@@ -0,0 +1,50 @@
+#' iran
+#'
+#' A data frame containing information about the 2009 Presidential Election in Iran. There were widespread claims of election fraud in this election both internationally and within Iran.
+#'
+#' @name iran
+#' @docType data
+#' @format A data frame with 366 observations on the following 9 variables.
+#' \describe{
+#'   \item{province}{Iranian province where votes were cast.}
+#'   \item{city}{City within province where votes were cast.}
+#'   \item{ahmadinejad}{Number of votes received by Ahmadinejad.}
+#'   \item{rezai}{Number of votes received by Rezai.}
+#'   \item{karrubi}{Number of votes received by Karrubi.}
+#'   \item{mousavi}{Number of votes received by Mousavi.}
+#'   \item{total_votes_cast}{Total number of votes cast.}
+#'   \item{voided_votes}{Number of votes that were not counted.}
+#'   \item{legitimate_votes}{Number of votes that were counted.}
+#'   }
+#' @keywords datasets
+#' @examples
+#'
+#' library(dplyr)
+#' library(ggplot2)
+#' library(tidyr)
+#' library(stringr)
+#'
+#' plot_data <- iran %>%
+#'   summarize(
+#'     ahmadinejad = sum(ahmadinejad) / 1000,
+#'     rezai = sum(rezai) / 1000,
+#'     karrubi = sum(karrubi) / 1000,
+#'     mousavi = sum(mousavi) / 1000
+#'   ) %>%
+#'   pivot_longer(
+#'     cols = c(ahmadinejad, rezai, karrubi, mousavi),
+#'     names_to = "candidate",
+#'     values_to = "votes"
+#'   ) %>%
+#'   mutate(candidate = str_to_title(candidate))
+#'
+#' ggplot(plot_data, aes(votes, candidate)) +
+#'   geom_col() +
+#'   theme_minimal() +
+#'   labs(
+#'     title = "2009 Iranian Presidential Election",
+#'     x = "Number of votes (in thousands)",
+#'     y = ""
+#'   )
+
+"iran"
diff --git a/R/data-life_exp.R b/R/data-life_exp.R
@@ -0,0 +1,29 @@
+#' life_exp
+#'
+#' A data frame with 3142 rows and 4 columns. County level data for life expectancy and median income in the United States.
+#'
+#'
+#' @name life_exp
+#' @docType data
+#' @format A data frame with 3142 observations on the following 4 variables.
+#' \describe{
+#'   \item{state}{Name of the state.}
+#'   \item{county}{Name of the county.}
+#'   \item{expectancy}{Life expectancy in the county.}
+#'   \item{income}{Median income in the county, measured in US $.}
+#' }
+#' @keywords datasets
+#' @examples
+#'
+#' library(ggplot2)
+#'
+#' # Income V Expectancy
+#' ggplot(life_exp, aes(x = income, y = expectancy)) +
+#'  geom_point(color = openintro::IMSCOL["green", "full"], alpha = 0.2) +
+#'  theme_minimal() +
+#'  labs(
+#'  title = "Is there a relationship between median income and life expectancy?",
+#'  x = "Median income (US $)",
+#'  y = "Life Expectancy (years)")
+
+"life_exp"
diff --git a/R/data-manhattan.R b/R/data-manhattan.R
@@ -0,0 +1,26 @@
+#' manhattan
+#'
+#' A data frame containing data on apartment rentals in Manhattan.
+#'
+#' @name manhattan
+#' @docType data
+#' @format A data frame with 20 observations on the following 1 variable.
+#' \describe{
+#'   \item{rent}{Monthly rent for a 1 bedroom apartment listed as "For rent by owner".}
+#'   }
+#' @keywords datasets
+#' @examples
+#'
+#' library(ggplot2)
+#'
+#' ggplot(manhattan, aes(rent)) +
+#'   geom_histogram(color = "white", binwidth = 300) +
+#'   theme_minimal() +
+#'   labs(
+#'     title = "Rent in Manhattan",
+#'     subtitle = "1 Bedroom Apartments",
+#'     x = "Rent (in US$)",
+#'     caption = "Source: Craigslist"
+#'   )
+
+"manhattan"
diff --git a/R/data-movies.R b/R/data-movies.R
@@ -0,0 +1,32 @@
+#' movies
+#'
+#' A data set with information about movies released in 2003.
+#'
+#' @name movies
+#' @docType data
+#' @format A data frame with 140 observations on the following 5 variables.
+#' \describe{
+#'   \item{movie}{Title of the movie.}
+#'   \item{genre}{Genre of the movie.}
+#'   \item{score}{Critics score of the movie on a 0 to 100 scale.}
+#'   \item{rating}{MPAA rating of the film.}
+#'   \item{box_office}{Millions of dollars earned at the box office in the US and Canada.}
+#' }
+#' @keywords datasets
+#' @source [Investigating Statistical Concepts, Applications and Methods](http://www.rossmanchance.com/iscam2/data/movies03.txt)
+#' @examples
+#'
+#' library(ggplot2)
+#'
+#' ggplot(movies, aes(score, box_office, color = genre)) +
+#'   geom_point() +
+#'   theme_minimal() +
+#'   labs(
+#'     title = "Does a critic score predict box office earnings?",
+#'     x = "Critic rating",
+#'     y = "Box office earnings (millions US$)",
+#'     color = "Genre"
+#'   )
+#'
+
+"movies"
diff --git a/R/data-nyc.R b/R/data-nyc.R
@@ -0,0 +1,44 @@
+#' nyc
+#'
+#' Zagat is a public survey where anyone can provide scores to a restaurant. The scores from the general public are then gathered to produce ratings. This data set contains a list of 168 NYC restaurants and their Zagat Ratings.
+#'
+#' For each category the scales are as follows:
+#'
+#' 0 - 9: poor to fair
+#' 10 - 15: fair to good
+#' 16 - 19: good to very good
+#' 20 - 25: very good to excellent
+#' 26 - 30: extraordinary to perfection
+#'
+#' @name nyc
+#' @docType data
+#' @format A data frame with 168 observations on the following 6 variables.
+#' \describe{
+#'   \item{restaurant}{Name of the restaurant.}
+#'   \item{price}{Price of a meal for two, with drinks, in US $.}
+#'   \item{food}{Zagat rating for food.}
+#'   \item{decor}{Zagat rating for decor.}
+#'   \item{service}{Zagat rating for service.}
+#'   \item{east}{Indicator variable for location of the restaurant. `0` = west of 5th Avenue, `1` = east of 5th Avenue}
+#' }
+#' @keywords datasets
+#'
+#' @examples
+#' library(dplyr)
+#' library(ggplot2)
+#'
+#' location_labs <- c("West", "East")
+#' names(location_labs) <- c(0, 1)
+#'
+#' ggplot(nyc, mapping = aes(x = price, group = east, fill = east)) +
+#'   geom_boxplot(alpha = 0.5) +
+#'  facet_grid(east ~ ., labeller = labeller(east = location_labs)) +
+#'  labs(
+#'    title = "Is food more expensive east of 5th Avenue?",
+#'    x = "Price (US$)"
+#'  ) +
+#'  guides(fill = "none") +
+#'  theme_minimal() +
+#'  theme(axis.text.y = element_blank())
+
+"nyc"
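
The rating bands listed in the `nyc` description can be applied to the `food`, `decor`, or `service` columns with `cut()`. This is a rough sketch, not part of the commit: it assumes whole-number scores and that a score of 25 belongs to the "very good to excellent" band, and the scores below are hypothetical rather than taken from `nyc`.

```r
# Hypothetical Zagat scores for illustration only; not taken from nyc.
scores <- c(8, 12, 17, 22, 25, 27)

# Upper bounds of each band; right = TRUE makes intervals closed on the right,
# so 9 falls in the first band, 15 in the second, and so on.
zagat_band <- cut(
  scores,
  breaks = c(-Inf, 9, 15, 19, 25, 30),
  labels = c(
    "poor to fair",
    "fair to good",
    "good to very good",
    "very good to excellent",
    "extraordinary to perfection"
  ),
  right = TRUE
)

# Pair each score with its band
data.frame(score = scores, band = zagat_band)
```

The same call could be used inside `dplyr::mutate()` on the real data, for example `mutate(nyc, food_band = cut(food, ...))`, to summarise prices by band.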