Commit

Ims issue 4 (#68)
* cars04 data added

* added life_exp data and adjusted documentation for cars04

* comics data added

* data cleaning update for comics

* nyc dataset added

* iowa dataset added

* adjusted iowa documentation, added iran data

* manhattan data added

* gss_wordsum_class added

* twins data added

* LAhomes data added

* partial movies data set

* movies data set complete

* ucb_admit data added

* updated news.md

* fixed documentation for life_exp
npaterno authored May 22, 2023
1 parent c5e6fb6 commit 6fac392
Showing 65 changed files with 32,564 additions and 1 deletion.
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -32,7 +32,8 @@ Suggests:
     scales,
     testthat (>= 3.0.0),
     tidyr,
-    tidytext
+    tidytext,
+    stringr
 Imports:
     ggplot2 (>= 2.2.1),
     graphics,
2 changes: 2 additions & 0 deletions NEWS.md
@@ -6,6 +6,8 @@
* `lecture_learning` by [@jonathanaakin](https://github.com/jonathanaakin)
* Fix HTML version of manual
* Remove some URLs that no longer work
* Added new datasets:
* `cars04`, `life_exp`, `comics`, `nyc`, `gss_wordsum_class`, `manhattan`, `iran`, `iowa`, `twins`, `LAhomes`, `movies`, `ucb_admit`, `soda` ported from IMS Tutorials by [@npaterno](https://github.com/npaterno)

# openintro 2.3.0

33 changes: 33 additions & 0 deletions R/data-LAhomes.R
@@ -0,0 +1,33 @@
#' LAhomes
#'
#' Data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010.
#'
#' @name LAhomes
#' @docType data
#' @format A data frame with 1594 observations on the following 8 variables.
#' \describe{
#' \item{city}{City where the home is located.}
#' \item{type}{Type of home with levels `Condo/Twh` - condo or townhouse, `SFR` - single family residence, and `NA`}
#' \item{bed}{Number of bedrooms in the home.}
#' \item{bath}{Number of bathrooms in the home.}
#' \item{garage}{Number of cars that can be parked in the garage. Note that a value of `4` refers to 4 or more garage spaces.}
#' \item{sqft}{Square footage of the home.}
#' \item{pool}{Indicates if the home has a pool.}
#' \item{price}{Listing price of the home.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#'
#' ggplot(LAhomes, aes(sqft, price)) +
#'   geom_point(alpha = 0.2) +
#'   theme_minimal() +
#'   labs(
#'     title = "Can we predict list price from square footage?",
#'     subtitle = "Homes in the Los Angeles area",
#'     x = "Square feet",
#'     y = "List price"
#'   )

"LAhomes"
45 changes: 45 additions & 0 deletions R/data-cars04.R
@@ -0,0 +1,45 @@
#' cars04
#'
#' A data frame with 428 rows and 19 columns. This is a record of characteristics on all of the new models of cars for sale in the US in the year 2004.
#'
#'
#' @name cars04
#' @docType data
#' @format A data frame with 428 observations on the following 19 variables.
#' \describe{
#' \item{name}{The name of the vehicle including manufacturer and model.}
#' \item{sports_car}{Logical variable indicating if the vehicle is a sports car.}
#' \item{suv}{Logical variable indicating if the vehicle is an SUV.}
#' \item{wagon}{Logical variable indicating if the vehicle is a wagon.}
#' \item{minivan}{Logical variable indicating if the vehicle is a minivan.}
#' \item{pickup}{Logical variable indicating if the vehicle is a pickup.}
#' \item{all_wheel}{Logical variable indicating if the vehicle is all-wheel drive.}
#' \item{rear_wheel}{Logical variable indicating if the vehicle is rear-wheel drive.}
#' \item{msrp}{Manufacturer suggested retail price of the vehicle.}
#' \item{dealer_cost}{Amount of money the dealer paid for the vehicle.}
#' \item{eng_size}{Displacement of the engine - the total volume of all the cylinders, measured in liters.}
#' \item{ncyl}{Number of cylinders in the engine.}
#' \item{horsepwr}{Amount of horsepower produced by the engine.}
#' \item{city_mpg}{Gas mileage for city driving, measured in miles per gallon.}
#' \item{hwy_mpg}{Gas mileage for highway driving, measured in miles per gallon.}
#' \item{weight}{Total weight of the vehicle, measured in pounds.}
#' \item{wheel_base}{Distance between the center of the front wheels and the center of the rear wheels, measured in inches.}
#' \item{length}{Total length of the vehicle, measured in inches.}
#' \item{width}{Total width of the vehicle, measured in inches.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#'
#' # Highway gas mileage
#' ggplot(cars04, aes(x = hwy_mpg)) +
#'   geom_histogram(bins = 15, color = "white",
#'                  fill = openintro::IMSCOL["green", "full"]) +
#'   theme_minimal() +
#'   labs(
#'     title = "Highway gas mileage for cars from 2004",
#'     x = "Gas mileage (miles per gallon)",
#'     y = "Number of cars")

"cars04"
45 changes: 45 additions & 0 deletions R/data-comics.R
@@ -0,0 +1,45 @@
#' comics
#'
#' A data frame containing information about comic book characters from Marvel Comics and DC Comics.
#'
#'
#' @name comics
#' @docType data
#' @format A data frame with 21821 observations on the following 11 variables.
#' \describe{
#' \item{name}{Name of the character. May include: Real name, hero or villain name, alias(es) and/or which universe they live in (e.g. Earth-616 in Marvel's multiverse).}
#' \item{id}{Status of the character's identity with levels `Secret`, `Public`, `No Dual` and `Unknown`.}
#' \item{align}{Character's alignment with levels `Good`, `Bad`, `Neutral` and `Reformed Criminals`.}
#' \item{eye}{Character's eye color.}
#' \item{hair}{Character's hair color.}
#' \item{gender}{Character's gender.}
#' \item{gsm}{Character's classification as a gender or sexual minority.}
#' \item{alive}{Is the character dead or alive?}
#' \item{appearances}{Number of comic books the character appears in.}
#' \item{first_appear}{Date of publication for the comic book the character first appeared in.}
#' \item{publisher}{Publisher of the comic with levels `Marvel` and `DC`.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#' library(dplyr)
#'
#' # Good v Bad
#'
#' plot_data <- comics %>%
#'   filter(align == "Good" | align == "Bad")
#'
#' ggplot(plot_data, aes(x = align, fill = align)) +
#'   geom_bar() +
#'   facet_wrap(~publisher) +
#'   scale_fill_manual(values = c(IMSCOL["red", "full"], IMSCOL["blue", "full"])) +
#'   theme_minimal() +
#'   labs(
#'     title = "Is there a balance of power?",
#'     x = "",
#'     y = "Number of characters",
#'     fill = ""
#'   )

"comics"
22 changes: 22 additions & 0 deletions R/data-gss_wordsum_class.R
@@ -0,0 +1,22 @@
#' gss_wordsum_class
#'
#' A data frame containing data from the General Social Survey.
#'
#' @name gss_wordsum_class
#' @docType data
#' @format A data frame with 795 observations on the following 2 variables.
#' \describe{
#' \item{wordsum}{A vocabulary score calculated based on a ten-question vocabulary test, where a higher score means better vocabulary. Scores range from 1 to 10.}
#' \item{class}{Self-identified social class with 4 levels: lower, working, middle, and upper class.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(dplyr)
#'
#' gss_wordsum_class %>%
#'   group_by(class) %>%
#'   summarize(mean_wordsum = mean(wordsum))
#'

"gss_wordsum_class"
37 changes: 37 additions & 0 deletions R/data-iowa.R
@@ -0,0 +1,37 @@
#' iowa
#'
#' A data frame containing information about the 2016 US Presidential Election for the state of Iowa.
#'
#' @name iowa
#' @docType data
#' @format A data frame with 1386 observations on the following 5 variables.
#' \describe{
#' \item{office}{The office that the candidates were running for.}
#' \item{candidate}{President/Vice President pairs who were running for office.}
#' \item{party}{Political party of the candidate.}
#' \item{county}{County in Iowa where the votes were cast.}
#' \item{votes}{Number of votes received by the candidate.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#' library(dplyr)
#'
#' plot_data <- iowa %>%
#'   filter(candidate != "Total") %>%
#'   group_by(candidate) %>%
#'   summarize(total_votes = sum(votes) / 1000)
#'
#' ggplot(plot_data, aes(total_votes, candidate)) +
#'   geom_col() +
#'   theme_minimal() +
#'   labs(
#'     title = "2016 Presidential Election in Iowa",
#'     subtitle = "Popular vote",
#'     y = "",
#'     x = "Number of votes (in thousands)"
#'   )

"iowa"
50 changes: 50 additions & 0 deletions R/data-iran.R
@@ -0,0 +1,50 @@
#' iran
#'
#' A data frame containing information about the 2009 Presidential Election in Iran. There were widespread claims of election fraud in this election both internationally and within Iran.
#'
#' @name iran
#' @docType data
#' @format A data frame with 366 observations on the following 9 variables.
#' \describe{
#' \item{province}{Iranian province where votes were cast.}
#' \item{city}{City within province where votes were cast.}
#' \item{ahmadinejad}{Number of votes received by Ahmadinejad.}
#' \item{rezai}{Number of votes received by Rezai.}
#' \item{karrubi}{Number of votes received by Karrubi.}
#' \item{mousavi}{Number of votes received by Mousavi.}
#' \item{total_votes_cast}{Total number of votes cast.}
#' \item{voided_votes}{Number of votes that were not counted.}
#' \item{legitimate_votes}{Number of votes that were counted.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(dplyr)
#' library(ggplot2)
#' library(tidyr)
#' library(stringr)
#'
#' plot_data <- iran %>%
#'   summarize(
#'     ahmadinejad = sum(ahmadinejad) / 1000,
#'     rezai = sum(rezai) / 1000,
#'     karrubi = sum(karrubi) / 1000,
#'     mousavi = sum(mousavi) / 1000
#'   ) %>%
#'   pivot_longer(
#'     cols = c(ahmadinejad, rezai, karrubi, mousavi),
#'     names_to = "candidate",
#'     values_to = "votes"
#'   ) %>%
#'   mutate(candidate = str_to_title(candidate))
#'
#' ggplot(plot_data, aes(votes, candidate)) +
#'   geom_col() +
#'   theme_minimal() +
#'   labs(
#'     title = "2009 Iranian Presidential Election",
#'     x = "Number of votes (in thousands)",
#'     y = ""
#'   )

"iran"
29 changes: 29 additions & 0 deletions R/data-life_exp.R
@@ -0,0 +1,29 @@
#' life_exp
#'
#' A data frame with 3142 rows and 4 columns. County-level data for life expectancy and median income in the United States.
#'
#'
#' @name life_exp
#' @docType data
#' @format A data frame with 3142 observations on the following 4 variables.
#' \describe{
#' \item{state}{Name of the state.}
#' \item{county}{Name of the county.}
#' \item{expectancy}{Life expectancy in the county.}
#' \item{income}{Median income in the county, measured in US $.}
#' }
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#'
#' # Income V Expectancy
#' ggplot(life_exp, aes(x = income, y = expectancy)) +
#'   geom_point(color = openintro::IMSCOL["green", "full"], alpha = 0.2) +
#'   theme_minimal() +
#'   labs(
#'     title = "Is there a relationship between median income and life expectancy?",
#'     x = "Median income (US $)",
#'     y = "Life expectancy (years)")

"life_exp"
26 changes: 26 additions & 0 deletions R/data-manhattan.R
@@ -0,0 +1,26 @@
#' manhattan
#'
#' A data frame containing data on apartment rentals in Manhattan.
#'
#' @name manhattan
#' @docType data
#' @format A data frame with 20 observations on the following 1 variable.
#' \describe{
#' \item{rent}{Monthly rent for a 1 bedroom apartment listed as "For rent by owner".}
#' }
#' @keywords datasets
#' @examples
#'
#' library(ggplot2)
#'
#' ggplot(manhattan, aes(rent)) +
#'   geom_histogram(color = "white", binwidth = 300) +
#'   theme_minimal() +
#'   labs(
#'     title = "Rent in Manhattan",
#'     subtitle = "1 Bedroom Apartments",
#'     x = "Rent (in US$)",
#'     caption = "Source: Craigslist"
#'   )

"manhattan"
32 changes: 32 additions & 0 deletions R/data-movies.R
@@ -0,0 +1,32 @@
#' movies
#'
#' A data set with information about movies released in 2003.
#'
#' @name movies
#' @docType data
#' @format A data frame with 140 observations on the following 5 variables.
#' \describe{
#' \item{movie}{Title of the movie.}
#' \item{genre}{Genre of the movie.}
#' \item{score}{Critics' score of the movie on a 0 to 100 scale.}
#' \item{rating}{MPAA rating of the film.}
#' \item{box_office}{Millions of dollars earned at the box office in the US and Canada.}
#' }
#' @keywords datasets
#' @source [Investigating Statistical Concepts, Applications and Methods](http://www.rossmanchance.com/iscam2/data/movies03.txt)
#' @examples
#'
#' library(ggplot2)
#'
#' ggplot(movies, aes(score, box_office, color = genre)) +
#'   geom_point() +
#'   theme_minimal() +
#'   labs(
#'     title = "Does the critics' score predict box office earnings?",
#'     x = "Critics' score",
#'     y = "Box office earnings (millions US$)",
#'     color = "Genre"
#'   )
#'

"movies"
44 changes: 44 additions & 0 deletions R/data-nyc.R
@@ -0,0 +1,44 @@
#' nyc
#'
#' Zagat is a public survey where anyone can provide scores to a restaurant. The scores from the general public are then gathered to produce ratings. This data set contains a list of 168 NYC restaurants and their Zagat Ratings.
#'
#' For each category the scales are as follows:
#'
#' \itemize{
#' \item 0 - 9: poor to fair
#' \item 10 - 15: fair to good
#' \item 16 - 19: good to very good
#' \item 20 - 25: very good to excellent
#' \item 26 - 30: extraordinary to perfection
#' }
#'
#' @name nyc
#' @docType data
#' @format A data frame with 168 observations on the following 6 variables.
#' \describe{
#' \item{restaurant}{Name of the restaurant.}
#' \item{price}{Price of a meal for two, with drinks, in US $.}
#' \item{food}{Zagat rating for food.}
#' \item{decor}{Zagat rating for decor.}
#' \item{service}{Zagat rating for service.}
#' \item{east}{Indicator variable for location of the restaurant. `0` = west of 5th Avenue, `1` = east of 5th Avenue.}
#' }
#' @keywords datasets
#'
#' @examples
#' library(dplyr)
#' library(ggplot2)
#'
#' location_labs <- c("West", "East")
#' names(location_labs) <- c(0, 1)
#'
#' ggplot(nyc, mapping = aes(x = price, group = east, fill = east)) +
#'   geom_boxplot(alpha = 0.5) +
#'   facet_grid(east ~ ., labeller = labeller(east = location_labs)) +
#'   labs(
#'     title = "Is food more expensive east of 5th Avenue?",
#'     x = "Price (US$)"
#'   ) +
#'   guides(fill = "none") +
#'   theme_minimal() +
#'   theme(axis.text.y = element_blank())

"nyc"