diff --git a/_freeze/01-data-hello/execute-results/html.json b/_freeze/01-data-hello/execute-results/html.json new file mode 100644 index 00000000..b0705724 --- /dev/null +++ b/_freeze/01-data-hello/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "d45367881d8748c00e927cde0f00a41e", + "result": { + "markdown": "# Hello data {#sec-data-hello}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nScientists seek to answer questions using rigorous methods and careful observations.\nThese observations -- collected from the likes of field notes, surveys, and experiments -- form the backbone of a statistical investigation and are called **data**.\nStatistics is the study of how best to collect, analyze, and draw conclusions from data.\nIn this first chapter, we focus on both the properties of data and on the collection of data.\n:::\n\n\n\n\n\n## Case study: Using stents to prevent strokes {#sec-case-study-stents-strokes}\n\nIn this section we introduce a classic challenge in statistics: evaluating the efficacy of a medical treatment.\nTerms in this section, and indeed much of this chapter, will all be revisited later in the text.\nThe plan for now is simply to get a sense of the role statistics can play in practice.\n\nAn experiment is designed to study the effectiveness of stents in treating patients at risk of stroke [@chimowitz2011stenting].\nStents are small mesh tubes that are placed inside narrow or weak arteries to assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death.\n\nMany doctors have hoped that there would be similar benefits for patients at risk of stroke.\nWe start by writing the principal question the researchers hope to answer:\n\n> Does the use of stents reduce the risk of stroke?\n\nThe researchers who asked this question conducted an experiment with 451 at-risk patients.\nEach volunteer patient was randomly assigned to one of two groups:\n\n- **Treatment group**. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.\n- **Control group**. Patients in the control group received the same medical management as the treatment group, but they did not receive stents.\n\nResearchers randomly assigned 224 patients to the treatment group and 227 to the control group.\nIn this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.\n\n\\clearpage\n\nResearchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment.\nThe results of 5 patients are summarized in @tbl-stentStudyResultsDF.\nPatient outcomes are recorded as `stroke` or `no event`, representing whether the patient had a stroke during that time period.\n\n::: {.data data-latex=\"\"}\nThe [`stent30`](http://openintrostat.github.io/openintro/reference/stent30.html) data and [`stent365`](http://openintrostat.github.io/openintro/reference/stent365.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {#tbl-stentStudyResultsDF .cell tbl-cap='Results for five patients from the stent study.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
patient group 30 days 365 days
1 treatment no event no event
2 treatment stroke stroke
3 treatment no event no event
4 treatment no event no event
5 control no event no event
\n\n`````\n:::\n:::\n\n\nIt would be difficult to answer a question on the impact of stents on the occurrence of strokes for **all** study patients using these *individual* observations.\nThis question is better addressed by performing a statistical data analysis of *all* observations.\n@tbl-stentStudyResultsDFsummary summarizes the raw data in a more helpful way.\nIn this table, we can quickly see what happened over the entire study.\nFor instance, to identify the number of patients in the treatment group who had a stroke within 30 days after the treatment, we look in the leftmost column (30 days), at the intersection of treatment and stroke: 33.\nTo identify the number of control patients who did not have a stroke after 365 days after receiving treatment, we look at the rightmost column (365 days), at the intersection of control and no event: 199.\n\n\n::: {#tbl-stentStudyResultsDFsummary .cell tbl-cap='Descriptive statistics for the stent study.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Group Stroke (30 days) No event (30 days) Stroke (365 days) No event (365 days)
Control 13 214 28 199
Treatment 33 191 45 179
Total 46 405 73 378
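The counts in this table, and the group proportions discussed below, can be reproduced in R. The following is a minimal sketch using the `stent365` data from the **openintro** package; it assumes the variables are named `group` and `outcome`.

```r
# A sketch (not the study authors' code): tabulate 365-day outcomes
# by group using the stent365 data from the openintro package.
library(openintro)

counts <- table(stent365$group, stent365$outcome)
counts

# Row proportions: the share of each group with each outcome.
# The treatment row corresponds to the 45/224 calculation below.
prop.table(counts, margin = 1)
```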
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nOf the 224 patients in the treatment group, 45 had a stroke by the end of the first year.\nUsing these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year.\n(Note: answers to all Guided Practice exercises are provided in footnotes!)[^01-data-hello-1]\n:::\n\n[^01-data-hello-1]: The proportion of the 224 patients who had a stroke within 365 days: $45/224 = 0.20.$\n\nWe can compute summary statistics from the table to give us a better idea of how the impact of the stent treatment differed between the two groups.\nA **summary statistic** is a single number summarizing data from a sample.\nFor instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.\n\n\n\n\n\n- Proportion who had a stroke in the treatment (stent) group: $45/224 = 0.20 = 20\\%.$\n- Proportion who had a stroke in the control group: $28/227 = 0.12 = 12\\%.$\n\nThese two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke!\nThis is important for two reasons.\nFirst, it is contrary to what doctors expected, which was that stents would *reduce* the rate of strokes.\nSecond, it leads to a statistical question: do the data show a \"real\" difference between the groups?\n\nThis second question is subtle.\nSuppose you flip a coin 100 times.\nWhile the chance a coin lands heads in any given coin flip is 50%, we probably won't observe exactly 50 heads.\nThis type of variation is part of almost any type of data generating process.\nIt is possible that the 8% difference in the stent study is due to this natural variation.\nHowever, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance.\nSo, what we are really asking is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?\n\nWhile we do not yet have statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients.\n\n**Be careful:** Do not generalize the results of this study to all patients and all stents.\nThis study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.\nIn addition, there are many types of stents, and this study only considered the self-expanding Wingspan stent (Boston Scientific).\nHowever, this study does leave us with an important lesson: we should keep our eyes open for surprises.\n\n## Data basics {#sec-data-basics}\n\nEffective presentation and description of data is a first step in most analyses.\nThis section introduces one structure for organizing data as well as some terminology that will be used throughout this book.\n\n### Observations, variables, and data matrices\n\n@tbl-loan50-df displays six rows of a dataset for 50 randomly sampled loans offered through Lending Club, which is a peer-to-peer lending company.\nThis dataset will be referred to as `loan50`.\n\n::: {.data data-latex=\"\"}\nThe [`loan50`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the 
[**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nEach row in the table represents a single loan.\nThe formal name for a row is a \\index{case}**case** or \\index{unit of observation}**observational unit**.\nThe columns represent characteristics of each loan, where each column is referred to as a \\index{variable}**variable**.\nFor example, the first row represents a loan of \\$22,000 with an interest rate of 10.90%, where the borrower is based in New Jersey (NJ) and has an income of \\$59,000.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the grade of the first loan in @tbl-loan50-df?\nAnd what is the home ownership status of the borrower for that first loan?\nReminder: for these Guided Practice questions, you can check your answer in the footnote.[^01-data-hello-2]\n:::\n\n[^01-data-hello-2]: The loan's grade is B, and the borrower rents their residence.\n\nIn practice, it is especially important to ask clarifying questions to ensure important aspects of the data are understood.\nFor instance, it is always important to be sure we know what each variable means and its units of measurement.\nDescriptions of the variables in the `loan50` dataset are given in @tbl-loan-50-variables.\n\n\n::: {#tbl-loan50-df .cell tbl-cap='Six observations from the `loan50` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
loan_amount interest_rate term grade state total_income homeownership
1 22,000 10.90 60 B NJ 59,000 rent
2 6,000 9.92 36 B CA 60,000 rent
3 25,000 26.30 36 E SC 75,000 mortgage
4 6,000 9.92 36 B CA 75,000 rent
5 25,000 9.43 60 B OH 254,000 mortgage
6 6,400 9.92 36 B IN 67,000 mortgage
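To follow along in R, the `loan50` data frame can be loaded and inspected directly. Below is a brief sketch assuming only that the **openintro** package is installed; the selected column names are the ones shown in @tbl-loan50-df.

```r
# A sketch: load loan50 from the openintro package and confirm that it
# contains 50 rows (loans) described by several variables (columns).
library(openintro)

dim(loan50)

# Peek at the variables displayed in the table above
head(loan50[, c("loan_amount", "interest_rate", "term", "grade",
                "state", "total_income", "homeownership")])
```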
\n\n`````\n:::\n:::\n\n::: {#tbl-loan-50-variables .cell tbl-cap='Variables and their descriptions for the `loan50` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variable Description
loan_amount Amount of the loan received, in US dollars.
interest_rate Interest rate on the loan, as an annual percentage.
term The length of the loan, which is always set as a whole number of months.
grade Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid.
state US state where the borrower resides.
total_income Borrower's total income, including any second income, in US dollars.
homeownership Indicates whether the person owns, owns but has a mortgage, or rents.
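As a small illustration of the row-per-case, column-per-variable structure discussed next (and of the gradebook example in the guided practice below), here is a hypothetical tidy data frame; the student names and scores are invented for illustration.

```r
# A hypothetical tidy gradebook: each row is one student (the case),
# and each column is one recorded variable.
gradebook <- data.frame(
  name   = c("Ana", "Bo", "Celine"),   # invented names
  hw_1   = c(92, 85, 78),
  quiz_1 = c(18, 20, 15),
  exam_1 = c(88, 91, 74)
)
gradebook
```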
\n\n`````\n:::\n:::\n\n\nThe data in @tbl-loan50-df represent a \\index{data frame}**data frame**, which is a convenient and common way to organize data, especially if collecting data in a spreadsheet.\nA data frame where each row is a unique case (observational unit), each column is a variable, and each cell is a single value is commonly referred to as \\index{tidy data}**tidy data** @wickham2014.\n\n\n\n\n\nWhen recording data, use a tidy data frame unless you have a very good reason to use a different structure.\nThis structure allows new cases to be added as rows or new variables as new columns and facilitates visualization, summarization, and other statistical analyses.\n\n::: {.guidedpractice data-latex=\"\"}\nThe grades for assignments, quizzes, and exams in a course are often recorded in a gradebook that takes the form of a data frame.\nHow might you organize a course's grade data using a data frame?\nDescribe the observational units and variables.[^01-data-hello-3]\n:::\n\n[^01-data-hello-3]: There are multiple strategies that can be followed.\n One common strategy is to have each student represented by a row, and then add a column for each assignment, quiz, or exam.\n Under this setup, it is easy to review a single line to understand the grade history of a student.\n There should also be columns to include student information, such as one column to list student names.\n\n::: {.guidedpractice data-latex=\"\"}\nWe consider data for 3,142 counties in the United States, which includes the name of each county, the state where it resides, its population in 2017, the population change from 2010 to 2017, poverty rate, and nine additional characteristics.\nHow might these data be organized in a data frame?[^01-data-hello-4]\n:::\n\n[^01-data-hello-4]: Each county may be viewed as a case, and there are eleven pieces of information recorded for each case.\n A table with 3,142 rows and 14 columns could hold these data, where each row represents a county and each column represents a particular piece of information.\n\n\\clearpage\n\nThe data described in the Guided Practice above represents the `county` dataset, which is shown as a data frame in @tbl-county-df.\nThe variables as well as the variables in the dataset that did not fit in @tbl-county-df are described in @tbl-county-variables.\n\n\n::: {#tbl-county-df .cell tbl-cap='Six observations and six variables from the `county` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
name state pop2017 pop_change unemployment_rate median_edu
Autauga County Alabama 55,504 1.48 3.86 some_college
Baldwin County Alabama 212,628 9.19 3.99 some_college
Barbour County Alabama 25,270 -6.22 5.90 hs_diploma
Bibb County Alabama 22,668 0.73 4.39 hs_diploma
Blount County Alabama 58,013 0.68 4.02 hs_diploma
Bullock County Alabama 10,309 -2.28 4.93 hs_diploma
\n\n`````\n:::\n:::\n\n::: {#tbl-county-variables .cell tbl-cap='Variables and their descriptions for the `county` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variable Description
name Name of county.
state Name of state.
pop2000 Population in 2000.
pop2010 Population in 2010.
pop2017 Population in 2017.
pop_change Population change from 2010 to 2017 (in percent).
poverty Percent of population in poverty in 2017.
homeownership Homeownership rate, 2006-2010.
multi_unit Multi-unit rate: percent of housing units that are in multi-unit structures, 2006-2010.
unemployment_rate Unemployment rate in 2017.
metro Whether the county contains a metropolitan area, taking one of the values yes or no.
median_edu Median education level (2013-2017), taking one of the values below_hs, hs_diploma, some_college, or bachelors.
per_capita_income Per capita (per person) income (2013-2017).
median_hh_income Median household income.
smoking_ban Describes the type of county-level smoking ban in place in 2010, taking one of the values none, partial, or comprehensive.
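The `county` data can also be explored directly in R. The sketch below assumes the **usdata** package is installed and checks how the four variables discussed in the next section are stored; exact storage types may vary by package version.

```r
# A sketch: inspect how a few county variables are stored
# (numeric versus factor/character) before classifying them.
library(usdata)

str(county[, c("unemployment_rate", "pop2017", "state", "median_edu")])

# median_edu has a natural ordering, so it could be encoded as an
# ordered factor (level names taken from the table above).
ordered_edu <- factor(county$median_edu,
                      levels = c("below_hs", "hs_diploma",
                                 "some_college", "bachelors"),
                      ordered = TRUE)
table(ordered_edu)
```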
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`county`](http://openintrostat.github.io/usdata/reference/county.html) data can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.\n:::\n\n### Types of variables {#variable-types}\n\nExamine the `unemployment_rate`, `pop2017`, `state`, and `median_edu` variables in the `county` dataset.\nEach of these variables is inherently different from the other three, yet some share certain characteristics.\n\nFirst consider `unemployment_rate`, which is said to be a \\index{numerical variable}**numerical** variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.\nOn the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes does not have any clear meaning.\nInstead, we would consider area codes as a categorical variable.\n\n\n\n\n\nThe `pop2017` variable is also numerical, although it seems to be a little different than `unemployment_rate`.\nThis variable of the population count can only take whole non-negative numbers (0, 1, 2, ...).\nFor this reason, the population variable is said to be **discrete** since it can only take numerical values with jumps.\nOn the other hand, the unemployment rate variable is said to be **continuous**.\n\n\n\n\n\nThe variable `state` can take up to 51 values after accounting for Washington, DC: Alabama, Alaska, ..., and Wyoming.\nBecause the responses themselves are categories, `state` is called a **categorical** variable, and the possible values (states) are called the variable's **levels** (e.g., District of Columbia, Alabama, Alaska, etc.) .\n\n\n\n\n\nFinally, consider the `median_edu` variable, which describes the median education level of county residents and takes values `below_hs`, `hs_diploma`, `some_college`, or `bachelors` in each county.\nThis variable seems to be a hybrid: it is a categorical variable, but the levels have a natural ordering.\nA variable with these properties is called an **ordinal** variable, while a regular categorical variable without this type of special ordering is called a **nominal** variable.\nTo simplify analyses, any categorical variable in this book will be treated as a nominal (unordered) categorical variable.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Breakdown of variables into their respective types.](01-data-hello_files/figure-html/variables-1.png){fig-alt='Types of variables are broken down into numerical (which can be discrete or continuous) and categorical (which can be ordinal or nominal).' 
width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nData were collected about students in a statistics course.\nThree variables were recorded for each student: number of siblings, student height, and whether the student had previously taken a statistics course.\nClassify each of the variables as continuous numerical, discrete numerical, or categorical.\n\n------------------------------------------------------------------------\n\nThe number of siblings and student height represent numerical variables.\nBecause the number of siblings is a count, it is discrete.\nHeight varies continuously, so it is a continuous numerical variable.\nThe last variable classifies students into two categories -- those who have and those who have not taken a statistics course -- which makes this variable categorical.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nAn experiment is evaluating the effectiveness of a new drug in treating migraines.\nA `group` variable is used to indicate the experiment group for each patient: treatment or control.\nThe `num_migraines` variable represents the number of migraines the patient experienced during a 3-month period.\nClassify each variable as either numerical or categorical?[^01-data-hello-5]\n:::\n\n[^01-data-hello-5]: The `group` variable can take just one of two group names, making it categorical.\n The `num_migraines` variable describes a count of the number of migraines, which is an outcome where basic arithmetic is sensible, which means this is a numerical outcome; more specifically, since it represents a count, `num_migraines` is a discrete numerical variable.\n\n### Relationships between variables {#variable-relations}\n\nMany analyses are motivated by a researcher looking for a relationship between two or more variables.\nA social scientist may like to answer some of the following questions:\n\n> Does a higher-than-average increase in county population tend to correspond to counties with higher or lower median household incomes?\n\n> If homeownership in one county is lower than the national average, will the percent of housing units that are in multi-unit structures in that county tend to be above or below the national average?\n\n> How much can the median education level explain the median household income for counties in the US?\n\nTo answer these questions, data must be collected, such as the `county` dataset shown in @tbl-county-df.\nExamining \\index{summary statistic}**summary statistics** can provide numerical insights about the specifics of each of these questions.\nAlternatively, graphs can be used to visually explore the data, potentially providing more insight than a summary statistic.\n\n\\index{scatterplot}**Scatterplots** are one type of graph used to study the relationship between two numerical variables.\n@fig-county-multi-unit-homeownership displays the relationship between the variables `homeownership` and `multi_unit`, which is the percent of housing units that are in multi-unit structures (e.g., apartments, condos).\nEach point on the plot represents a single county.\nFor instance, the highlighted dot corresponds to County 413 in the `county` dataset: Chattahoochee County, Georgia, which has 39.4% of housing units that are in multi-unit structures and a homeownership rate of 31.3%.\nThe scatterplot suggests a relationship between the two variables: counties with a higher rate of housing units that are in multi-unit structures tend to have lower homeownership rates.\nWe might brainstorm as to why this relationship exists and 
investigate each idea to determine which are the most reasonable explanations.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot of homeownership versus the percent of housing units that are in multi-unit structures for US counties. The highlighted dot represents Chattahoochee County, Georgia, which has a multi-unit rate of 39.4\\% and a homeownership rate of 31.3\\%.](01-data-hello_files/figure-html/fig-county-multi-unit-homeownership-1.png){#fig-county-multi-unit-homeownership width=90%}\n:::\n:::\n\n\nThe multi-unit and homeownership rates are said to be associated because the plot shows a discernible pattern.\nWhen two variables show some connection with one another, they are called **associated** variables.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nExamine the variables in the `loan50` dataset, which are described in @tbl-loan-50-variables.\nCreate two questions about possible relationships between variables in `loan50` that are of interest to you.[^01-data-hello-6]\n:::\n\n[^01-data-hello-6]: Two example questions: (1) What is the relationship between loan amount and total income?\n (2) If someone's income is above the average, will their interest rate tend to be above or below the average?\n\n::: {.workedexample data-latex=\"\"}\nThis example examines the relationship between the percent change in population from 2010 to 2017 and median household income for counties, which is visualized as a scatterplot in @fig-county-pop-change-med-hh-income.\nAre these variables associated?\n\n------------------------------------------------------------------------\n\nThe larger the median household income for a county, the higher the population growth observed for the county.\nWhile it isn't true that every county with a higher median household income has a higher population growth, the trend in the plot is evident.\nSince there is some relationship between the variables, they are associated.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot showing population change against median household income. 
Owsley County of Kentucky is highlighted, which lost 3.63\\% of its population from 2010 to 2017 and had median household income of \\$22,736.](01-data-hello_files/figure-html/fig-county-pop-change-med-hh-income-1.png){#fig-county-pop-change-med-hh-income width=90%}\n:::\n:::\n\n\nBecause there is a downward trend in @fig-county-multi-unit-homeownership -- counties with more housing units that are in multi-unit structures are associated with lower homeownership -- these variables are said to be **negatively associated**.\nA **positive association** is shown in the relationship between the `median_hh_income` and `pop_change` variables in @fig-county-pop-change-med-hh-income, where counties with higher median household income tend to have higher rates of population growth.\n\n\n\n\n\nIf two variables are not associated, then they are said to be **independent**.\nThat is, two variables are independent if there is no evident relationship between the two.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Associated or independent, not both.**\n\nA pair of variables are either related in some way (associated) or not (independent).\nNo pair of variables is both associated and independent.\n:::\n\n### Explanatory and response variables\n\nWhen we ask questions about the relationship between two variables, we sometimes also want to determine if the change in one variable causes a change in the other.\nConsider the following rephrasing of an earlier question about the `county` dataset:\n\n> If there is an increase in the median household income in a county, does this drive an increase in its population?\n\nIn this question, we are asking whether one variable affects another.\nIf this is our underlying belief, then *median household income* is the **explanatory variable**, and the *population change* is the **response variable** in the hypothesized relationship.[^01-data-hello-7]\n\n[^01-data-hello-7]: In some disciplines, it's customary to refer to the explanatory variable as the **independent variable** and the response variable as the **dependent variable**.\n However, this becomes confusing since a *pair* of variables might be independent or dependent, so we avoid this language.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Explanatory and response variables.**\n\nWhen we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable.\nWe also use the terms **explanatory** and **response** to describe variables where the **response** might be predicted using the **explanatory** even if there is no causal relationship.\n\n
explanatory variable $\\rightarrow$ *might affect* $\\rightarrow$ response variable
\n\n
For many pairs of variables, there is no hypothesized relationship, and these labels would not be applied to either variable in such cases.\n:::\n\nBear in mind that the act of labeling the variables in this way does nothing to guarantee that a causal relationship exists.\nA formal evaluation to check whether one variable causes a change in another requires an experiment.\n\n### Observational studies and experiments\n\nThere are two primary types of data collection: experiments and observational studies.\n\nWhen researchers want to evaluate the effect of particular traits, treatments, or conditions, they conduct an **experiment**.\nFor instance, we may suspect drinking a high-calorie energy drink will improve performance in a race.\nTo check if there really is a causal relationship between the explanatory variable (whether the runner drank an energy drink or not) and the response variable (the race time), researchers identify a sample of individuals and split them into groups.\nThe individuals in each group are *assigned* a treatment.\nWhen individuals are randomly assigned to a group, the experiment is called a **randomized experiment**.\nRandom assignment organizes the participants in a study into groups that are roughly equal on all aspects, thus allowing us to control for any confounding variables that might affect the outcome (e.g., fitness level, racing experience, etc.).\nFor example, each runner in the experiment could be randomly assigned, perhaps by flipping a coin, into one of two groups: the first group receives a **placebo** (fake treatment, in this case a no-calorie drink) and the second group receives the high-calorie energy drink.\nSee the case study in @sec-case-study-stents-strokes for another example of an experiment, though that study did not employ a placebo.\n\n\n\n\n\nResearchers perform an **observational study** when they collect data in a way that does not directly interfere with how the data arise.\nFor instance, researchers may collect information via surveys, review medical or company records, or follow a **cohort** of many similar individuals to form hypotheses about why certain diseases might develop.\nIn each of these situations, researchers merely observe the data that arise.\nIn general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection as they do not offer a mechanism for controlling for confounding variables.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Association** $\\neq$ **Causation.**\n\nIn general, association does not imply causation.\nAn advantage of a randomized experiment is that it is easier to establish causal relationships with such a study.\nThe main reason for this is that observational studies do not control for confounding variables, and hence establishing causal relationships with observational studies requires advanced statistical methods (that are beyond the scope of this book).\nWe will revisit this idea when we discuss experiments later in the book.\n:::\n\n\\vspace{10mm}\n\n## Chapter review {#chp1-review}\n\n### Summary\n\nThis chapter introduced you to the world of data.\nData can be organized in many ways but tidy data, where each row represents an observation and each column represents a variable, lends itself most easily to statistical analysis.\nMany of the ideas from this chapter will be seen as we move on to doing full data analyses.\nIn the next chapter you're going to learn about how we can design studies to collect 
the data we need to make conclusions with the desired scope of inference.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
associated experiment ordinal
case explanatory variable placebo
categorical independent positive association
cohort level randomized experiment
continuous negative association response variable
data nominal summary statistic
data frame numerical tidy data
dependent observational study variable
discrete observational unit
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp1-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-01].\n\n::: {.exercises data-latex=\"\"}\n1. **Marvel Cinematic Universe films.**\nThe data frame below contains information on Marvel Cinematic Universe films through the Infinity saga (a movie storyline spanning from Ironman in 2008 to Endgame in 2019). \nBox office totals are given in millions of US Dollars.\nHow many observations and how many variables does this data frame have?^[The [`mcu_films`](http://openintrostat.github.io/openintro/reference/mcu_films.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Title Length (Hrs) Length (Mins) Release Date Gross: Opening Wknd US Gross: US Gross: World
1 Iron Man 2 6 5/2/2008 98.62 319.03 585.8
2 The Incredible Hulk 1 52 6/12/2008 55.41 134.81 264.77
3 Iron Man 2 2 4 5/7/2010 128.12 312.43 623.93
4 Thor 1 55 5/6/2011 65.72 181.03 449.33
5 Captain America: The First Avenger 2 4 7/22/2011 65.06 176.65 370.57
... ... ... ... ... ... ... ...
23 Spiderman: Far from Home 2 9 7/2/2019 92.58 390.53 1131.93
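    One way to answer the observation/variable count question is to check the data frame's dimensions in R. A sketch using the `mcu_films` data named in the footnote:

    ```r
    # A sketch: number of observations (rows) and variables (columns)
    # in the mcu_films data frame from the openintro package.
    library(openintro)

    dim(mcu_films)   # c(rows, columns)
    ```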
\n \n `````\n :::\n :::\n\n1. **Cherry Blossom Run.**\nThe data frame below contains information on runners in the 2017 Cherry Blossom Run, which is an annual road race that takes place in Washington, DC.\nMost runners participate in a 10-mile run while a smaller fraction take part in a 5k run or walk.\nHow many observations and how many variables does this data frame have?^[The [`run17`](http://openintrostat.github.io/openintro/reference/run17.html) data used in this exercise can be found in the [**cherryblossom**](http://openintrostat.github.io/cherryblossom) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Bib Name Sex Age City / Country Net Time Clock Time Pace Event
1 6 Hiwot G. F 21 Ethiopia 3217 3217 321 10 Mile
2 22 Buze D. F 22 Ethiopia 3232 3232 323 10 Mile
3 16 Gladys K. F 31 Kenya 3276 3276 327 10 Mile
4 4 Mamitu D. F 33 Ethiopia 3285 3285 328 10 Mile
5 20 Karolina N. F 35 Poland 3288 3288 328 10 Mile
... ... ... ... ... ... ... ... ... ...
19961 25153 Andres E. M 33 Woodbridge, VA 5287 5334 1700 5K
\n \n `````\n :::\n :::\n\n1. **Air pollution and birth outcomes, study components.** \nResearchers collected data to examine the relationship between air pollutants and preterm births in Southern California.\nDuring the study air pollution levels were measured by air quality monitoring stations.\nSpecifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM$_{10}$) in $\\mu g/m^3$.\nLength of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth.\nThe analysis suggested that increased ambient PM$_{10}$ and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births. [@Ritz+Yu+Chapa+Fruin:2000]\n\n a. Identify the main research question of the study.\n\n b. Who are the subjects in this study, and how many are included?\n\n c. What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the variable is ordinal.\n\n1. **Cheaters, study components.** \nResearchers studying the relationship between honesty, age and self-control conducted an experiment on 160 children between the ages of 5 and 15. Participants reported their age, sex, and whether they were an only child or not. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet and said they would only reward children who report white. [@Bucciol:2011]\n\n a. Identify the main research question of the study.\n\n b. Who are the subjects in this study, and how many are included?\n\n c. The study's findings can be summarized as follows: *\"Half the students were explicitly told not to cheat, and the others were not given any explicit instructions. In the no instruction group probability of cheating was found to be uniform across groups based on child's characteristics. In the group that was explicitly told to not cheat, girls were less likely to cheat, and while rate of cheating didn't vary by age for boys, it decreased with age for girls.\"* How many variables were recorded for each subject in the study in order to conclude these findings? State the variables and their types.\n\n1. **Gamification and statistics, study components.** \nGamification is the application of game-design elements and game principles in non-game contexts. \nIn educational settings, gamification is often implemented as educational activities to solve problems by using characteristics of game elements.\nResearchers investigating the effects of gamification on learning statistics conducted a study where they split college students in a statistics class into four groups: (1) no reading exercises and no gamification, (2) reading exercises but no gamification, (3) gamification but no reading exercises, and (4) gamification and reading exercises.\nStudents in all groups also attended lectures. \nStudents in the class were from two majors: Electrical and Computer Engineering (n = 279) and Business Administration (n = 86). \nAfter their assigned learning experience, each student took a final evaluation comprised of 30 multiple choice question and their score was measured as the number of questions they answered correctly.\nThe researchers considered students' gender, level of studies (first through fourth year) and academic major. 
\nOther variables considered were expertise in the English language and use of personal computers and games, both of which were measured on a scale of 1 (beginner) to 5 (proficient). \nThe study found that gamification had a positive effect on student learning compared to traditional teaching methods involving lectures and reading exercises.\nThey also found that the effect was larger for females and Engineering students. [@Legaki:2020]\n\n a. Identify the main research question of the study.\n\n b. Who were the subjects in this study, and how many were included?\n\n c. What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the variable is ordinal.\n\n1. **Stealers, study components.** \nIn a study of the relationship between socio-economic class and unethical behavior, 129 University of California undergraduates at Berkeley were asked to identify themselves as having low or high social class by comparing themselves to others with the most (least) money, most (least) education, and most (least) respected jobs.\nThey were also presented with a jar of individually wrapped candies and informed that the candies were for children in a nearby laboratory, but that they could take some if they wanted.\nAfter completing some unrelated tasks, participants reported the number of candies they had taken. [@Piff:2012]\n\n a. Identify the main research question of the study.\n\n b. Who were the subjects in this study, and how many were included?\n\n c. The study found that students who were identified as upper-class took more candy than others. How many variables were recorded for each subject in the study in order to conclude these findings? State the variables and their types. \n \n \\clearpage\n\n1. \"Figure **Migraine and acupuncture.** A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture. \nTo determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 individuals who identified as female diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control. \nForty-three (43) patients in the treatment group received acupuncture that is specifically designed to treat migraines. \nForty-six (46) patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). \nTwenty-four (24) hours after patients received acupuncture, they were asked if they were pain free. \nResults are summarized in the contingency table below. \nAlso provided is a figure from the original paper displaying the appropriate area (M) versus the inappropriate area (S) used in the treatment of migraine attacks.^[The [`migraine`](http://openintrostat.github.io/openintro/reference/migraine.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Allais:2011]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Group Pain free: No Pain free: Yes
Control 44 2
Treatment 33 10
\n \n `````\n :::\n :::\n\n a. What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture?\n\n b. What percent were pain free in the control group?\n\n c. In which group did a higher percent of patients become pain free 24 hours after receiving acupuncture?\n\n d. Your findings so far might suggest that acupuncture is an effective treatment for migraines for all people who suffer from migraines.\n However, this is not the only possible conclusion.\n What is one other possible explanation for the observed difference between the percentages of patients that are pain free 24 hours after receiving acupuncture in the two groups?\n \n e. What are the explanatory and response variables in this study?\n\n1. **Sinusitis and antibiotics.** \nResearchers studying the effect of antibiotic treatment for acute sinusitis compared to symptomatic treatments randomly assigned 166 adults diagnosed with acute sinusitis to one of two groups: treatment or control. \nStudy participants received either a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste. \nThe placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants, etc. \nAt the end of the 10-day period, patients were asked if they experienced improvement in symptoms. \nThe distribution of responses is summarized below.^[The [`sinusitis`](http://openintrostat.github.io/openintro/reference/sinusitis.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Garbutt:2012]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Group Improvement: No Improvement: Yes
Control 16 65
Treatment 19 66
\n \n `````\n :::\n :::\n\n a. What percent of patients in the treatment group experienced improvement in symptoms?\n\n b. What percent experienced improvement in symptoms in the control group?\n\n c. In which group did a higher percentage of patients experience improvement in symptoms?\n\n d. Your findings so far might suggest a real difference in the effectiveness of antibiotic and placebo treatments for improving symptoms of sinusitis. However, this is not the only possible conclusion. What is one other possible explanation for the observed difference between the percentages patients who experienced improvement in symptoms?\n \n e. What are the explanatory and response variables in this study?\n\n1. **Daycare fines, study components.** \nResearchers tested the deterrence hypothesis which predicts that the introduction of a penalty will reduce the occurrence of the behavior subject to the fine, with the condition that the fine leaves everything else unchanged by instituting a fine for late pickup at daycare centers. \nFor this study, they worked with 10 volunteer daycare centers that did not originally impose a fine to parents for picking up their kids late. \nThey randomly selected 6 of these daycare centers and instituted a monetary fine (of a considerable amount) for picking up children late and then removed it. \nIn the remaining 4 daycare centers no fine was introduced. \nThe study period was divided into four: before the fine (weeks 1–4), the first 4 weeks with the fine (weeks 5-8), the last 8 weeks with fine (weeks 9–16), and the after fine period (weeks 17-20).\nThroughout the study, the number of kids who were picked up late was recorded each week for each daycare. \nThe study found that the number of late-coming parents increased significantly when the fine was introduced, and no reduction occurred after the fine was removed.^[The [`daycare_fines`](http://openintrostat.github.io/openintro/reference/daycare_fines.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] 
[@Gneezy:2000]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n
center week group late_pickups study_period
1 1 test 8 before fine
1 2 test 8 before fine
1 3 test 7 before fine
1 4 test 6 before fine
1 5 test 8 first 4 weeks with fine
1 6 test 9 first 4 weeks with fine
1 7 test 9 first 4 weeks with fine
1 8 test 12 first 4 weeks with fine
1 9 test 13 last 8 weeks with fine
1 10 test 13 last 8 weeks with fine
1 11 test 15 last 8 weeks with fine
1 12 test 13 last 8 weeks with fine
1 13 test 14 last 8 weeks with fine
1 14 test 16 last 8 weeks with fine
1 15 test 14 last 8 weeks with fine
1 16 test 15 last 8 weeks with fine
1 17 test 16 after fine
1 18 test 13 after fine
1 19 test 15 after fine
1 20 test 17 after fine
2 1 test 6 before fine
2 2 test 7 before fine
2 3 test 3 before fine
2 4 test 5 before fine
2 5 test 2 first 4 weeks with fine
2 6 test 11 first 4 weeks with fine
2 7 test 14 first 4 weeks with fine
2 8 test 9 first 4 weeks with fine
2 9 test 16 last 8 weeks with fine
2 10 test 12 last 8 weeks with fine
2 11 test 10 last 8 weeks with fine
2 12 test 14 last 8 weeks with fine
2 13 test 14 last 8 weeks with fine
2 14 test 16 last 8 weeks with fine
2 15 test 12 last 8 weeks with fine
2 16 test 17 last 8 weeks with fine
2 17 test 14 after fine
2 18 test 10 after fine
2 19 test 14 after fine
2 20 test 15 after fine
3 1 test 8 before fine
3 2 test 9 before fine
3 3 test 8 before fine
3 4 test 9 before fine
3 5 test 3 first 4 weeks with fine
3 6 test 5 first 4 weeks with fine
3 7 test 15 first 4 weeks with fine
3 8 test 18 first 4 weeks with fine
3 9 test 16 last 8 weeks with fine
3 10 test 14 last 8 weeks with fine
3 11 test 20 last 8 weeks with fine
3 12 test 18 last 8 weeks with fine
3 13 test 25 last 8 weeks with fine
3 14 test 22 last 8 weeks with fine
3 15 test 27 last 8 weeks with fine
3 16 test 19 last 8 weeks with fine
3 17 test 20 after fine
3 18 test 23 after fine
3 19 test 23 after fine
3 20 test 22 after fine
4 1 test 10 before fine
4 2 test 3 before fine
4 3 test 14 before fine
4 4 test 9 before fine
4 5 test 6 first 4 weeks with fine
4 6 test 24 first 4 weeks with fine
4 7 test 8 first 4 weeks with fine
4 8 test 22 first 4 weeks with fine
4 9 test 22 last 8 weeks with fine
4 10 test 19 last 8 weeks with fine
4 11 test 25 last 8 weeks with fine
4 12 test 18 last 8 weeks with fine
4 13 test 23 last 8 weeks with fine
4 14 test 22 last 8 weeks with fine
4 15 test 24 last 8 weeks with fine
4 16 test 17 last 8 weeks with fine
4 17 test 15 after fine
4 18 test 23 after fine
4 19 test 25 after fine
4 20 test 18 after fine
5 1 test 13 before fine
5 2 test 12 before fine
5 3 test 9 before fine
5 4 test 13 before fine
5 5 test 15 first 4 weeks with fine
5 6 test 10 first 4 weeks with fine
5 7 test 27 first 4 weeks with fine
5 8 test 28 first 4 weeks with fine
5 9 test 35 last 8 weeks with fine
5 10 test 10 last 8 weeks with fine
5 11 test 24 last 8 weeks with fine
5 12 test 32 last 8 weeks with fine
5 13 test 29 last 8 weeks with fine
5 14 test 29 last 8 weeks with fine
5 15 test 26 last 8 weeks with fine
5 16 test 31 last 8 weeks with fine
5 17 test 26 after fine
5 18 test 35 after fine
5 19 test 29 after fine
5 20 test 28 after fine
6 1 test 5 before fine
6 2 test 8 before fine
6 3 test 7 before fine
6 4 test 5 before fine
6 5 test 5 first 4 weeks with fine
6 6 test 9 first 4 weeks with fine
6 7 test 12 first 4 weeks with fine
6 8 test 14 first 4 weeks with fine
6 9 test 19 last 8 weeks with fine
6 10 test 17 last 8 weeks with fine
6 11 test 14 last 8 weeks with fine
6 12 test 13 last 8 weeks with fine
6 13 test 10 last 8 weeks with fine
6 14 test 15 last 8 weeks with fine
6 15 test 14 last 8 weeks with fine
6 16 test 16 last 8 weeks with fine
6 17 test 6 after fine
6 18 test 12 after fine
6 19 test 17 after fine
6 20 test 13 after fine
7 1 control 7 before fine
7 2 control 10 before fine
7 3 control 12 before fine
7 4 control 6 before fine
7 5 control 4 first 4 weeks with fine
7 6 control 13 first 4 weeks with fine
7 7 control 7 first 4 weeks with fine
7 8 control 8 first 4 weeks with fine
7 9 control 5 last 8 weeks with fine
7 10 control 12 last 8 weeks with fine
7 11 control 3 last 8 weeks with fine
7 12 control 5 last 8 weeks with fine
7 13 control 6 last 8 weeks with fine
7 14 control 13 last 8 weeks with fine
7 15 control 7 last 8 weeks with fine
7 16 control 4 last 8 weeks with fine
7 17 control 7 after fine
7 18 control 10 after fine
7 19 control 4 after fine
7 20 control 6 after fine
8 1 control 12 before fine
8 2 control 9 before fine
8 3 control 14 before fine
8 4 control 18 before fine
8 5 control 10 first 4 weeks with fine
8 6 control 11 first 4 weeks with fine
8 7 control 6 first 4 weeks with fine
8 8 control 15 first 4 weeks with fine
8 9 control 14 last 8 weeks with fine
8 10 control 13 last 8 weeks with fine
8 11 control 7 last 8 weeks with fine
8 12 control 12 last 8 weeks with fine
8 13 control 9 last 8 weeks with fine
8 14 control 9 last 8 weeks with fine
8 15 control 17 last 8 weeks with fine
8 16 control 8 last 8 weeks with fine
8 17 control 5 after fine
8 18 control 11 after fine
8 19 control 8 after fine
8 20 control 13 after fine
9 1 control 3 before fine
9 2 control 4 before fine
9 3 control 9 before fine
9 4 control 3 before fine
9 5 control 3 first 4 weeks with fine
9 6 control 5 first 4 weeks with fine
9 7 control 9 first 4 weeks with fine
9 8 control 5 first 4 weeks with fine
9 9 control 2 last 8 weeks with fine
9 10 control 7 last 8 weeks with fine
9 11 control 6 last 8 weeks with fine
9 12 control 6 last 8 weeks with fine
9 13 control 9 last 8 weeks with fine
9 14 control 4 last 8 weeks with fine
9 15 control 9 last 8 weeks with fine
9 16 control 2 last 8 weeks with fine
9 17 control 3 after fine
9 18 control 8 after fine
9 19 control 3 after fine
9 20 control 5 after fine
10 1 control 15 before fine
10 2 control 13 before fine
10 3 control 13 before fine
10 4 control 12 before fine
10 5 control 10 first 4 weeks with fine
10 6 control 9 first 4 weeks with fine
10 7 control 15 first 4 weeks with fine
10 8 control 15 first 4 weeks with fine
10 9 control 15 last 8 weeks with fine
10 10 control 10 last 8 weeks with fine
10 11 control 17 last 8 weeks with fine
10 12 control 12 last 8 weeks with fine
10 13 control 13 last 8 weeks with fine
10 14 control 11 last 8 weeks with fine
10 15 control 14 last 8 weeks with fine
10 16 control 17 last 8 weeks with fine
10 17 control 12 after fine
10 18 control 9 after fine
10 19 control 15 after fine
10 20 control 13 after fine
\n \n `````\n :::\n :::\n\n a. Is this an observational study or an experiment? Explain your reasoning.\n\n b. What are the cases in this study and how many are included?\n\n c. What is the response variable in the study and what type of variable is it?\n\n d. What are the explanatory variables in the study and what types of variables are they?\n \n \\vspace{5mm}\n\n1. **Efficacy of COVID-19 vaccine on adolescents, study components.** \nResults of a Phase 3 trial announced in March 2021 show that the Pfizer-BioNTech COVID-19 vaccine demonstrated 100% efficacy and robust antibody responses on 12 to 15 years old adolescents with or without prior evidence of SARS-CoV-2 infection. In this trial 2,260 adolescents were randomly assigned to two groups: one group got the vaccine (n = 1,131) and the other got a placebo (n = 1,129). While 18 cases of COVID-19 were observed in the placebo group, none were observed in the vaccine group.^[The [`biontech_adolescents`](http://openintrostat.github.io/openintro/reference/biontech_adolescents.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Pfizer:2021]\n\n a. Is this an observational study or an experiment? Explain your reasoning.\n\n b. What are the cases in this study and how many are included?\n\n c. What is the response variable in the study and what type of variable is it?\n\n d. What are the explanatory variables in the study and what types of variables are they?\n \n \\clearpage\n\n1. **Palmer penguins.**\nData were collected on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica. \nIn addition to which island each penguin lives on, the data contains information on the species of the penguin (*Adelie*, *Chinstrap*, or *Gentoo*), its bill length, bill depth, and flipper length (measured in millimeters), its body mass (measured in grams), and the sex of the penguin (female or male).^[The [`penguins`](https://allisonhorst.github.io/palmerpenguins/reference/penguins.html) data used in this exercise can be found in the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) R package.] Bill length and depth are measured as shown in the image.^[Artwork by [Allison Horst](https://twitter.com/allison_horst).] [@palmerpenguins]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](exercises/images/culmen_depth.png){fig-alt='Bill length and depth marked on an illustration of a penguin head.' width=40%}\n :::\n :::\n\n a. How many cases were included in the data?\n b. How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.\n c. How many categorical variables are included in the data, and what are they? List the corresponding levels (categories) for each.\n\n \\vspace{5mm}\n\n1. **Smoking habits of UK residents.** \nA survey was conducted to study the smoking habits of 1,691 UK residents. Below is a data frame displaying a portion of the data collected in this survey. \nA blank cell indicates that data for that variable was not available for a given respondent.^[The [`smoking`](http://openintrostat.github.io/openintro/reference/smoking.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] 
\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
|      | sex    | age | marital_status | gross_income     | smoke | amount (weekend) | amount (weekday) |
|------|--------|-----|----------------|------------------|-------|------------------|------------------|
| 1    | Female | 61  | Married        | 2,600 to 5,200   | No    |                  |                  |
| 2    | Female | 61  | Divorced       | 10,400 to 15,600 | Yes   | 5                | 4                |
| 3    | Female | 69  | Widowed        | 5,200 to 10,400  | No    |                  |                  |
| 4    | Female | 50  | Married        | 5,200 to 10,400  | No    |                  |                  |
| 5    | Male   | 31  | Single         | 10,400 to 15,600 | Yes   | 10               | 20               |
| ...  | ...    | ... | ...            | ...              | ...   | ...              | ...              |
| 1691 | Male   | 49  | Divorced       | Above 36,400     | Yes   | 15               | 10               |
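The excerpt above shows only a handful of rows. As a minimal sketch (assuming the **openintro** package is installed, and noting that the variable names stored in the package may differ slightly from the column labels printed here), you could pull up the full data frame in R to help answer the questions that follow:

```r
# Minimal sketch: load the smoking survey data from the openintro package.
# Variable names in the packaged data may differ slightly from the labels
# in the printed excerpt (e.g., the weekend/weekday amount columns).
library(openintro)

dim(smoking)   # number of respondents (rows) and variables (columns)
str(smoking)   # type of each variable: numeric, integer, factor, ...
```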
\n \n `````\n :::\n :::\n\n a. What does each row of the data frame represent?\n\n b. How many participants were included in the survey?\n\n c. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.\n \n \\clearpage\n\n1. **US Airports.** \nThe visualization below shows the geographical distribution of airports in the contiguous United States and Washington, DC. \nThis visualization was constructed based on a dataset where each observation is an airport.^[The [`usairports`](http://openintrostat.github.io/airports/reference/usairports.html) data used in this exercise can be found in the [**airports**](http://openintrostat.github.io/airports/) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-33-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.\n \n \\vspace{5mm}\n\n1. **UN Votes.** \nThe visualization below shows voting patterns in the United States, Canada, and Mexico in the United Nations General Assembly on a variety of issues. \nSpecifically, for a given year between 1946 and 2019, it displays the percentage of roll calls in which the country voted yes for each issue. \nThis visualization was constructed based on a dataset where each observation is a country/year pair.^[The data used in this exercise can be found in the [**unvotes**](https://cran.r-project.org/web/packages/unvotes/index.html) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n\n a. List the variables used in creating this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.\n\n1. **UK baby names.** \nThe visualization below shows the number of baby girls born in the United Kingdom (comprised of England & Wales, Northern Ireland, and Scotland) who were given the name \"Fiona\" over the years.^[The [`ukbabynames`](https://mine-cetinkaya-rundel.github.io/ukbabynames/reference/ukbabynames.html) data used in this exercise can be found in the [**ukbabynames**](https://mine-cetinkaya-rundel.github.io/ukbabynames/) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.\n \n \\vspace{5mm}\n\n1. **Shows on Netflix.** \nThe visualization below shows the distribution of ratings of TV shows on Netflix (a streaming entertainment service) based on the decade they were released in and the country they were produced in. 
In the dataset, each observation is a TV show.^[The [`netflix_titles`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md) data used in this exercise can be found in the [**tidytuesdayR**](https://cran.r-project.org/web/packages/tidytuesdayR/index.html) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-36-1.png){width=90%}\n :::\n :::\n\n a. List the variables you believe were necessary to create this visualization.\n\n b. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.\n \n \\clearpage\n\n1. **Stanford Open Policing.** \nThe Stanford Open Policing project gathers, analyzes, and releases records from traffic stops by law enforcement agencies across the United States. Their goal is to help researchers, journalists, and policy makers investigate and improve interactions between police and the public. The following is an excerpt from a summary table created based off the data collected as part of this project. [@pierson2020large]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
| County         | State | Driver race / ethnicity | Arrest rate | Stops / year | Car search rate |
|----------------|-------|-------------------------|-------------|--------------|-----------------|
| Apache County  | AZ    | Black                   | 0.016       | 266          | 0.077           |
| Apache County  | AZ    | Hispanic                | 0.018       | 1008         | 0.053           |
| Apache County  | AZ    | White                   | 0.006       | 6322         | 0.017           |
| Cochise County | AZ    | Black                   | 0.015       | 1169         | 0.047           |
| Cochise County | AZ    | Hispanic                | 0.01        | 9453         | 0.037           |
| Cochise County | AZ    | White                   | 0.008       | 10826        | 0.024           |
| ...            | ...   | ...                     | ...         | ...          | ...             |
| Wood County    | WI    | Black                   | 0.098       | 16           | 0.244           |
| Wood County    | WI    | Hispanic                | 0.029       | 27           | 0.036           |
| Wood County    | WI    | White                   | 0.029       | 1157         | 0.033           |
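The rows above are not individual traffic stops; they are group-level summaries. As a purely hypothetical sketch (the `stops` data frame and its column names below are invented for illustration and are not the project's actual schema), a summary table like this one could be built from stop-level records with dplyr:

```r
# Hypothetical sketch only: `stops` and its column names are invented for
# illustration; the Stanford Open Policing files use their own schema.
library(dplyr)

stops |>
  group_by(county, state, driver_race) |>
  summarize(
    arrest_rate    = mean(arrest_made, na.rm = TRUE),      # share of stops ending in arrest
    stops_per_year = n() / n_distinct(year),               # average number of stops per year
    search_rate    = mean(search_conducted, na.rm = TRUE), # share of stops with a vehicle search
    .groups = "drop"
  )
```

Thinking about which stop-level columns such a summary would need is one way to approach part (a) below.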
\n \n `````\n :::\n :::\n\n a. What variables were collected on each individual traffic stop in order to create the summary table above?\n\n b. State whether each variable is numerical or categorical. If numerical, state whether it is continuous or discrete. If categorical, state whether it is ordinal or not.\n\n c. Suppose we wanted to evaluate whether vehicle search rates are different for drivers of different races. In this analysis, which variable would be the response variable and which variable would be the explanatory variable?\n \n \\vspace{5mm}\n\n1. **Space launches.** \nThe following summary table shows the number of space launches in the US by the type of launching agency and the outcome of the launch (success or failure).^[The data used in this exercise comes from the [JSR Launch Vehicle Database, 2019 Feb 10 Edition](https://www.openintro.org/go?id=textbook-space-launches-data&referrer=ims0_html).]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
| Agency type | Failure (1957-1999) | Success (1957-1999) | Failure (2000-2018) | Success (2000-2018) |
|-------------|---------------------|---------------------|---------------------|---------------------|
| Private     | 13                  | 295                 | 10                  | 562                 |
| State       | 281                 | 3751                | 33                  | 711                 |
| Startup     | 0                   | 0                   | 5                   | 65                  |
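The counts above are enough to recover the success rates referred to in part (c) below. As a small illustration, the table can be typed in by hand and the rates computed directly (the object and column names here are made up for the example):

```r
# Counts copied from the table above; object and column names are illustrative.
launches <- data.frame(
  agency            = c("Private", "State", "Startup"),
  failure_1957_1999 = c(13, 281, 0),
  success_1957_1999 = c(295, 3751, 0),
  failure_2000_2018 = c(10, 33, 5),
  success_2000_2018 = c(562, 711, 65)
)

# Success rate = successes / (successes + failures) within each era.
# (Startup has no launches before 2000, so its early-era rate is undefined: NaN.)
launches$rate_1957_1999 <- launches$success_1957_1999 /
  (launches$success_1957_1999 + launches$failure_1957_1999)
launches$rate_2000_2018 <- launches$success_2000_2018 /
  (launches$success_2000_2018 + launches$failure_2000_2018)

launches
```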
\n \n `````\n :::\n :::\n\n a. What variables were collected on each launch in order to create the summary table above?\n\n b. State whether each variable is numerical or categorical. If numerical, state whether it is continuous or discrete. If categorical, state whether it is ordinal or not.\n\n c. Suppose we wanted to study how the success rate of launches varies between launching agencies and over time. In this analysis, which variable would be the response variable and which variable would be the explanatory variable?\n \n \\clearpage\n\n1. **Pet names.** \nThe city of Seattle, WA has an open data portal that includes pets registered in the city. For each registered pet, we have information on the pet's name and species. \nThe following visualization plots the proportion of dogs with a given name versus the proportion of cats with the same name. The 20 most common cat and dog names are displayed. \nThe diagonal line on the plot is the $x = y$ line; if a name appeared on this line, the name's popularity would be exactly the same for dogs and cats.^[The [`seattlepets`](http://openintrostat.github.io/openintro/reference/seattlepets.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](01-data-hello_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n :::\n :::\n\n a. Are these data collected as part of an experiment or an observational study?\n\n b. What is the most common dog name? What is the most common cat name?\n\n c. What names are more common for cats than dogs?\n\n d. Is the relationship between the two variables positive or negative? What does this mean in context of the data?\n \n \\vspace{5mm}\n\n1. **Stressed out in an elevator.** \nIn a study evaluating the relationship between stress and muscle cramps, half the subjects are randomly assigned to be exposed to increased stress by being placed into an elevator that falls rapidly and stops abruptly, and the other half are left at no or baseline stress.\n\n a. What type of study is this?\n\n b. 
Can this study be used to conclude a causal relationship between increased stress and muscle cramps?\n\n\n:::\n", + "supporting": [ + "01-data-hello_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/01-data-hello/figure-html/fig-county-multi-unit-homeownership-1.png b/_freeze/01-data-hello/figure-html/fig-county-multi-unit-homeownership-1.png new file mode 100644 index 00000000..e67ded3d Binary files /dev/null and b/_freeze/01-data-hello/figure-html/fig-county-multi-unit-homeownership-1.png differ diff --git a/_freeze/01-data-hello/figure-html/fig-county-pop-change-med-hh-income-1.png b/_freeze/01-data-hello/figure-html/fig-county-pop-change-med-hh-income-1.png new file mode 100644 index 00000000..3bace9a6 Binary files /dev/null and b/_freeze/01-data-hello/figure-html/fig-county-pop-change-med-hh-income-1.png differ diff --git a/_freeze/01-data-hello/figure-html/unnamed-chunk-33-1.png b/_freeze/01-data-hello/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 00000000..c791d2f1 Binary files /dev/null and b/_freeze/01-data-hello/figure-html/unnamed-chunk-33-1.png differ diff --git a/_freeze/01-data-hello/figure-html/unnamed-chunk-34-1.png b/_freeze/01-data-hello/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 00000000..fcd90289 Binary files /dev/null and b/_freeze/01-data-hello/figure-html/unnamed-chunk-34-1.png differ diff --git a/_freeze/01-data-hello/figure-html/unnamed-chunk-35-1.png b/_freeze/01-data-hello/figure-html/unnamed-chunk-35-1.png new file mode 100644 index 00000000..13c00b0c Binary files /dev/null and b/_freeze/01-data-hello/figure-html/unnamed-chunk-35-1.png differ diff --git a/_freeze/01-data-hello/figure-html/unnamed-chunk-36-1.png b/_freeze/01-data-hello/figure-html/unnamed-chunk-36-1.png new file mode 100644 index 00000000..4e41c50a Binary files /dev/null and b/_freeze/01-data-hello/figure-html/unnamed-chunk-36-1.png differ diff --git a/_freeze/01-data-hello/figure-html/unnamed-chunk-39-1.png b/_freeze/01-data-hello/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 00000000..5da30a4a Binary files /dev/null and b/_freeze/01-data-hello/figure-html/unnamed-chunk-39-1.png differ diff --git a/_freeze/01-data-hello/figure-html/variables-1.png b/_freeze/01-data-hello/figure-html/variables-1.png new file mode 100644 index 00000000..05d68013 Binary files /dev/null and b/_freeze/01-data-hello/figure-html/variables-1.png differ diff --git a/_freeze/02-data-design/execute-results/html.json b/_freeze/02-data-design/execute-results/html.json new file mode 100644 index 00000000..6f221255 --- /dev/null +++ b/_freeze/02-data-design/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "b0aeba09f91d2995b3f4ee4a41862916", + "result": { + "markdown": "# Study design {#sec-data-design}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nBefore digging into the details of working with data, we stop to think about how data come to be.\nThat is, if the data are to be used to make broad and complete conclusions, then it is important to understand who or what the data represent.\nOne important aspect of data provenance is sampling.\nKnowing how the observational units were selected from a larger entity will allow for generalizations back to the population from which the data were randomly selected.\nAdditionally, by understanding the structure of the study, causal 
relationships can be separated from those relationships which are only associated.\nA good question to ask oneself before working with the data at all is, \"How were these observations collected?\".\nYou will learn a lot about the data by understanding its source.\n:::\n\n## Sampling principles and strategies {#sec-sampling-principles-strategies}\n\nThe first step in conducting research is to identify topics or questions that are to be investigated.\nA clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important.\nIt is also important to consider *how* data are collected so that the data are reliable and help achieve the research goals.\n\n### Populations and samples\n\nConsider the following three research questions:\n\n1. What is the average mercury content in swordfish in the Atlantic Ocean?\n2. Over the last five years, what is the average time to complete a degree for Duke undergrads?\n3. Does a new drug reduce the number of deaths in patients with severe heart disease?\n\nEach research question refers to a target **population**.\nIn the first question, the target population is all swordfish in the Atlantic Ocean, and each fish represents a case.\nOftentimes, it is not feasible to collect data for every case in a population.\nCollecting data for an entire population is called a **census**.\nA census is difficult because it is too expensive to collect data for the entire population, but it might also be because it is difficult or impossible to identify the entire population of interest!\nInstead, a sample is taken.\nA **sample** is the data you have.\nIdeally, a sample is a small fraction of the population.\nFor instance, 60 swordfish (or some other number) in the population might be selected, and this sample data may be used to provide an estimate of the population average and to answer the research question.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nFor the second and third questions above, identify the target population and what represents an individual case.[^02-data-design-1]\n:::\n\n[^02-data-design-1]: The question *\"Over the last five years, what is the average time to complete a degree for Duke undergrads?\"* is only relevant to students who complete their degree; the average cannot be computed using a student who never finished their degree.\n Thus, only Duke undergrads who graduated in the last five years represent cases in the population under consideration.\n Each such student is an individual case.\n For the question *\"Does a new drug reduce the number of deaths in patients with severe heart disease?\"*, a person with severe heart disease represents a case.\n The population includes all people with severe heart disease.\n\n### Parameters and statistics\n\nIn most statistical analysis procedures, the research question at hand boils down to understanding a numerical summary.\nThe number (or set of numbers) may be a quantity you are already familiar with (like the average) or it may be something you learn through this text (like the slope and intercept from a least squares model, provided in @sec-least-squares-regression).\n\nA numerical summary can be calculated on either the sample of observations or the entire population.\nHowever, measuring every unit in the population is usually prohibitive (so the parameter is very rarely calculated).\nSo, a \"typical\" numerical summary is calculated from a sample.\nYet, we can still conceptualize calculating the average income of all adults in 
Argentina.\n\nWe use specific terms in order to differentiate when a number is being calculated on a sample of data (**statistic**) and when it is being calculated or considered for calculation on the entire population (**parameter**).\nThe terms statistic and parameter are useful for communicating claims and models and will be used extensively in later chapters which delve into making inference on populations.\n\n\n\n\n\n### Anecdotal evidence\n\n\\index{bias}\n\nConsider the following possible responses to the three research questions:\n\n1. A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high.\n2. I met two students who took more than 7 years to graduate from Duke, so it must take longer to graduate at Duke than at many other colleges.\n3. My friend's dad had a heart attack and died after they gave him a new heart disease drug, so the drug must not work.\n\nEach conclusion is based on data.\nHowever, there are two problems.\nFirst, the data only represent one or two cases.\nSecond, and more importantly, it is unclear whether these cases are actually representative of the population.\nData collected in this haphazard fashion are called **anecdotal evidence**.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Anecdotal evidence.**\n\nBe careful of data collected in a haphazard fashion.\nSuch evidence may be true and verifiable, but it may only represent extraordinary cases and therefore not be a good representation of the population.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![In February 2010, some media pundits cited one large snowstorm as evidence against global warming. As comedian Jon Stewart pointed out, \"It is one storm, in one region of one country.\"](images/mn-winter/mn-winter.jpg){fig-alt='Photograph shows a city street covered in snow and ice.' width=35%}\n:::\n:::\n\n\nAnecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics.\nFor instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years.\nInstead, of looking at the most unusual cases, we should examine a sample of many cases that better represent the population.\n\n### Sampling from a population\n\n\\index{random sample} \\index{bias}\n\nWe might try to estimate the time to graduation for Duke undergraduates in the last five years by collecting a sample of graduates.\nAll graduates in the last five years represent the *population*\\index{population}, and graduates who are selected for review are collectively called the *sample*\\index{sample}.\nIn general, we always seek to *randomly* select a sample from a population.\nThe most basic type of random selection is equivalent to how raffles are conducted.\nFor example, in selecting graduates, we could write each graduate's name on a raffle ticket and draw 10 tickets.\nThe selected names would represent a random sample of 10 graduates.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![In this graphic, 10 graduates are randomly selected from the population to be included in the sample.](02-data-design_files/figure-html/pop-to-sample-1.png){fig-alt='A large circle contains many dots which indicate all the graduates. A smaller circle contains a few of the dots (i.e., graduates) which have been randomly selected from the larger circle.' 
width=70%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nSuppose we ask a student who happens to be majoring in nutrition to select several graduates for the study.\nWhat kind of students do you think they might collect?\nDo you think their sample would be representative of all graduates?\n\n------------------------------------------------------------------------\n\nPerhaps they would pick a disproportionate number of graduates from health-related fields.\nOr perhaps their selection would be a good representation of the population.\nWhen selecting samples by hand, we run the risk of picking a **biased** sample, even if our bias is unintended.\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Asked to pick a sample of graduates, a nutrition major might inadvertently pick a disproportionate number of graduates from health-related majors.](02-data-design_files/figure-html/pop-to-sub-sample-graduates-1.png){fig-alt='A large circle contains many dots which indicate all the graduates, but some of the dots have been greyed out where others are dark dots from which the sample is taken. A smaller circle contains a few of the dots (i.e., graduates) which have been selected from the biased group of dark dots in the large circle.' width=70%}\n:::\n:::\n\n\nIf someone was permitted to pick and choose exactly which graduates were included in the sample, it is entirely possible that the sample would overrepresent that person's interests, which may be entirely unintentional.\nThis introduces **bias** into a sample.\nSampling randomly helps address this problem.\nThe most basic random sample is called a **simple random sample** and is equivalent to drawing names out of a hat to select cases.\nThis means that each case in the population has an equal chance of being included and the cases in the sample are not related to each other.\n\n\n\n\n\nThe act of taking a simple random sample helps minimize bias.\nHowever, bias can crop up in other ways.\nEven when people are picked at random, e.g., for surveys, caution must be exercised if the **non-response rate**\\index{non-response rate} is high.\nFor instance, if only 30% of the people randomly sampled for a survey actually respond, then it is unclear whether the results are **representative** of the entire population.\nThis **non-response bias**\\index{non-response bias} can skew results.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Due to the possibility of non-response, survey studies may only reach a certain group within the population. It is difficult, and oftentimes impossible, to completely fix this problem.](02-data-design_files/figure-html/survey-sample-1.png){fig-alt='A large circle contains many dots which indicate the population of interest, but some of the dots have been greyed out where others are dark dots from which the sample is taken (where the grey dots are potentially due to non-response bias). A smaller circle contains a few of the dots which have been selected from the group of dark dots in the large circle who were individuals willing to respond to the survey.' 
width=70%}\n:::\n:::\n\n\nAnother common downfall is a **convenience sample**\\index{convenience sample}, where individuals who are easily accessible are more likely to be included in the sample.\nFor instance, if a political survey is done by stopping people walking in the Bronx, this will not represent all of New York City.\nIt is often difficult to discern what sub-population a convenience sample represents.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nWe can easily access ratings for products, sellers, and companies through websites.\nThese ratings are based only on those people who go out of their way to provide a rating.\nIf 50% of online reviews for a product are negative, do you think this means that 50% of buyers are dissatisfied with the product?\nWhy or why not?[^02-data-design-2]\n:::\n\n[^02-data-design-2]: Answers will vary.\n From our own anecdotal experiences, we believe people tend to rant more about products that fell below expectations than rave about those that perform as expected.\n For this reason, we suspect there is a negative bias in product ratings on sites like Amazon.\n However, since our experiences may not be representative, we also keep an open mind.\n\n\\index{random sample} \\index{bias} \\index{population} \\index{sample}\n\n### Four sampling methods {#sec-samp-methods}\n\nAlmost all statistical methods are based on the notion of implied randomness.\nIf data are not collected in a random framework from a population, these statistical methods -- the estimates and errors associated with the estimates -- are not reliable.\nHere we consider four random sampling techniques: simple, stratified, cluster, and multistage sampling.\nFigures @fig-simple-stratified and @fig-cluster-multistage provide graphical representations of these techniques.\n\n\\index{simple random sampling} \\index{stratified sampling}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Examples of simple random and stratified sampling. In the top panel, simple random sampling was used to randomly select the 18 cases (denoted in red). In the bottom panel, stratified sampling was used: cases were first grouped into strata, then simple random sampling was employed to randomly select 3 cases within each stratum.](02-data-design_files/figure-html/fig-simple-stratified-1.png){#fig-simple-stratified fig-alt='The top box shows a population of dots (i.e., individuals) where a handful of the dots have been sampled randomly. The bottom box shows the same population of dots but grouped in such a way that there are six strata. From each stratum three dots (i.e., individuals) are randomly selected.' 
width=100%}\n:::\n:::\n\n\n**Simple random sampling** is probably the most intuitive form of random sampling.\nConsider the salaries of Major League Baseball (MLB) players, where each player is a member of one of the league's 30 teams.\nTo take a simple random sample of 120 baseball players and their salaries, we could write the names of that season's several hundreds of players onto slips of paper, drop the slips into a bucket, shake the bucket around until we are sure the names are all mixed up, then draw out slips until we have the sample of 120 players.\nIn general, a sample is referred to as \"simple random\" if each case in the population has an equal chance of being included in the final sample *and* knowing that a case is included in a sample does not provide useful information about which other cases are included.\n\n\\index{strata}\n\n**Stratified sampling** is a divide-and-conquer sampling strategy.\nThe population is divided into groups called **strata**.\nThe strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum.\nIn the baseball salary example, each of the 30 teams could represent a stratum, since some teams have a lot more money (up to 4 times as much!).\nThen we might randomly sample 4 players from each team for our sample of 120 players.\n\n\n\n\n\n**Stratified sampling** is especially useful when the cases in each stratum are very similar with respect to the outcome of interest.\nThe downside is that analyzing data from a stratified sample is a more complex task than analyzing data from a simple random sample.\nThe analysis methods introduced in this book would need to be extended to analyze data collected using stratified sampling.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nWhy would it be good for cases within each stratum to be very similar?\n\n------------------------------------------------------------------------\n\nWe might get a more stable estimate for the subpopulation in a stratum if the cases are very similar, leading to more precise estimates within each group.\nWhen we combine these estimates into a single estimate for the full population, that population estimate will tend to be more precise since each individual group estimate is itself more precise.\n:::\n\nIn a **cluster sample**, we break up the population into many groups, called **clusters**.\nThen we sample a fixed number of clusters and include all observations from each of those clusters in the sample.\nA **multistage sample** is like a cluster sample, but rather than keeping all observations in each cluster, we would collect a random sample within each selected cluster.\n\n\n\n\n\n\\index{cluster sampling}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Examples of cluster and multistage sampling. In the top panel, cluster sampling was used: data were binned into nine clusters, three of these clusters were sampled, and all observations within these three clusters were included in the sample. In the bottom panel, multistage sampling was used, which differs from cluster sampling only in that we randomly select a subset of each cluster to be included in the sample rather than measuring every case in each sampled cluster.](02-data-design_files/figure-html/fig-cluster-multistage-1.png){#fig-cluster-multistage fig-alt='In the top figure, dots are grouped into clusters, three clusters are selected, and every dot (i.e., all individuals) from each of the three clusters are sampled. 
In the bottom figure, dots are again grouped into clusters and three clusters are selected. However, random sampling is applied so that a random sample from each of the three selected clusters is taken.' width=100%}\n:::\n:::\n\n\nSometimes cluster or multistage sampling can be more economical than the alternative sampling techniques.\nAlso, unlike stratified sampling, these approaches are most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves do not look very different from one another.\nFor example, if neighborhoods represented clusters, then cluster or multistage sampling work best when the populations inside each neighborhood are very diverse.\nA downside of these methods is that more advanced techniques are typically required to analyze the data, though the methods in this book can be extended to handle such data.\n\n::: {.workedexample data-latex=\"\"}\nSuppose we are interested in estimating the malaria rate in a densely tropical portion of rural Indonesia.\nWe learn that there are 30 villages in that part of the Indonesian jungle, each more or less like the next, but the distances between the villages are substantial.\nOur goal is to test 150 individuals for malaria.\nWhat sampling method should be employed?\n\n------------------------------------------------------------------------\n\nA simple random sample would likely draw individuals from all 30 villages, which could make data collection extremely expensive.\nStratified sampling would be a challenge since it is unclear how we would build strata of similar individuals.\nHowever, cluster sampling or multistage sampling seem like very good ideas.\nIf we decided to use multistage sampling, we might randomly select half of the villages, then randomly select 10 people from each.\nThis would probably reduce our data collection costs substantially in comparison to a simple random sample, and the cluster sample would still give us reliable information, even if we would need to analyze the data with slightly more advanced methods than we discuss in this book.\n:::\n\n\\clearpage\n\n## Experiments {#sec-experiments}\n\nStudies where the researchers assign treatments to cases are called **experiments**.\nWhen this assignment includes randomization, e.g., using a coin flip to decide which treatment a patient receives, it is called a **randomized experiment**.\nRandomized experiments are fundamentally important when trying to show a causal connection between two variables.\n\n\n\n\n\n### Principles of experimental design {#sec-principles-experimental-design}\n\n1. **Controlling.** Researchers assign treatments to cases, and they do their best to **control** any other differences in the groups[^02-data-design-3]. For example, when patients take a drug in pill form, some patients take the pill with only a sip of water while others may have it with an entire glass of water. To control for the effect of water consumption, a doctor may instruct every patient to drink a 12-ounce glass of water with the pill.\n\n[^02-data-design-3]: This is a different concept than a *control group*, which we discuss in the second principle and in @sec-reducing-bias-human-experiments.\n\n\n\n\n\n2. 
**Randomization.** Researchers randomize patients into treatment groups to account for variables that cannot be controlled.\n For example, some patients may be more susceptible to a disease than others due to their dietary habits.\n In this example dietary habit is a **confounding variable**[^02-data-design-4], which is defined as a variable that is associated with both the explanatory and response variables.\n Randomizing patients into the treatment or control group helps even out such differences.\n\n\n \n\n\n3. **Replication.** The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response.\n In a single study, we **replicate** by collecting a sufficiently large sample.\n What is considered sufficiently large varies from experiment to experiment, but at a minimum we want to have multiple subjects (experimental units) per treatment group.\n Another way of achieving replication is replicating an entire study to verify an earlier finding.\n The term **replication crisis** refers to the ongoing methodological crisis in which past findings from scientific studies in several disciplines have failed to be replicated.\n **Pseudoreplication** occurs when individual observations under different treatments are heavily dependent on each other.\n For example, suppose you have 50 subjects in an experiment where you're taking blood pressure measurements at 10 time points throughout the course of the study.\n By the end, you will have 50 $\\times$ 10 = 500 measurements.\n Reporting that you have 500 observations would be considered pseudoreplication, as the blood pressure measurements of a given individual are not independent of each other.\n Pseudoreplication often happens when the wrong entity is replicated, and the reported sample sizes are exaggerated.\n\n[^02-data-design-4]: Also called a **lurking variable**, **confounding factor**, or a **confounder**.\n\n\n\n\n\n4. **Blocking.** Researchers sometimes know or suspect that variables, other than the treatment, influence the response. Under these circumstances, they may first group individuals based on this variable into **blocks** and then randomize cases within each block to the treatment groups. This strategy is often referred to as **blocking**. For instance, if we are looking at the effect of a drug on heart attacks, we might first split patients in the study into low-risk and high-risk blocks, then randomly assign half the patients from each block to the control group and the other half to the treatment group, as shown in @fig-blocking. This strategy ensures that each treatment group has the same number of low-risk patients and the same number of high-risk patients.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Blocking using a variable depicting patient risk. Patients are first divided into low-risk and high-risk blocks, then each block is evenly separated into the treatment groups using randomization. This strategy ensures an equal representation of patients in each treatment group from both the low-risk and high-risk categories.](02-data-design_files/figure-html/fig-blocking-1.png){#fig-blocking fig-alt='Before randomly allocating, the red low risk patients and blue high risk patients are split into two separate groups. Subsequently, half of the red low risk patients are randomly chosen to receive the treatment, and half of the blue high risk patients are randomly chosen to receive the treatment.' 
width=100%}\n:::\n:::\n\n\nIt is important to incorporate the first three experimental design principles into any study, and this book describes applicable methods for analyzing data from such experiments.\nBlocking is a slightly more advanced technique, and statistical methods in this book may be extended to analyze data collected using blocking.\n\n\\clearpage\n\n### Reducing bias in human experiments {#sec-reducing-bias-human-experiments}\n\nRandomized experiments have long been considered to be the gold standard for data collection, but they do not ensure an unbiased perspective into the cause-and-effect relationship in all cases.\nHuman studies are perfect examples where bias can unintentionally arise.\nHere we reconsider a study where a new drug was used to treat heart attack patients.\nIn particular, researchers wanted to know if the drug reduced deaths in patients.\n\nThese researchers designed a randomized experiment because they wanted to draw causal conclusions about the drug's effect.\nStudy volunteers[^02-data-design-5] were randomly placed into two study groups.\nOne group, the **treatment group**, received the drug.\nThe other group, called the **control group**, did not receive any drug treatment.\n\n[^02-data-design-5]: Human subjects are often called **patients**, **volunteers**, or **study participants**.\n\n\n\n\n\nPut yourself in the place of a person in the study.\nIf you are in the treatment group, you are given a fancy new drug that you anticipate will help you.\nOn the other hand, a person in the other group does not receive the drug and sits idly, hoping her participation does not increase her risk of death.\nThese perspectives suggest there are actually two effects in this study: the one of interest is the effectiveness of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify.\n\nResearchers aren't usually interested in the emotional effect, which might bias the study.\nTo circumvent this problem, researchers do not want patients to know which group they are in.\nWhen researchers keep the patients uninformed about their treatment, the study is said to be **blind**.\nBut there is one problem: if a patient does not receive a treatment, they will know they're in the control group.\nA solution to this problem is to give a fake treatment to patients in the control group.\nThis is called a **placebo**, and an effective placebo is the key to making a study truly blind.\nA classic example of a placebo is a sugar pill that is made to look like the actual treatment pill.\nHowever, offering such a fake treatment may not be ethical in certain experiments.\nFor example, in medical experiments, typically the control group must get the current standard of care.\nOftentimes, a placebo results in a slight but real improvement in patients.\nThis effect has been dubbed the **placebo effect**.\n\n\n\n\n\nThe patients are not the only ones who should be blinded: doctors and researchers can unintentionally bias a study.\nWhen a doctor knows a patient has been given the real treatment, they might inadvertently give that patient more attention or care than a patient that they know is on the placebo.\nTo guard against this bias, which again has been found to have a measurable effect in some instances, most modern studies employ a **double-blind** setup where doctors or researchers who interact with patients are, just like the patients, unaware of who is or is not receiving the treatment.[^02-data-design-6]\n\n[^02-data-design-6]: There 
are always some researchers involved in the study who do know which patients are receiving which treatment.\n However, they do not interact with the study's patients and do not tell the blinded health care professionals who is receiving which treatment.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nLook back to the study in @sec-case-study-stents-strokes where researchers were testing whether stents were effective at reducing strokes in at-risk patients.\nIs this an experiment?\nWas the study blinded?\nWas it double-blinded?[^02-data-design-7]\n:::\n\n[^02-data-design-7]: The researchers assigned the patients into their treatment groups, so this study was an experiment.\n However, the patients could distinguish what treatment they received because a stent is a surgical procedure.\n There is no equivalent surgical placebo, so this study was not blind.\n The study could not be double-blind since it was not blind.\n\n::: {.guidedpractice data-latex=\"\"}\nFor the study in @sec-case-study-stents-strokes, could the researchers have employed a placebo?\nIf so, what would that placebo have looked like?[^02-data-design-8]\n:::\n\n[^02-data-design-8]: Ultimately, can we make patients think they got treated from a surgery?\n In fact, we can, and some experiments use a **sham surgery**.\n In a sham surgery, the patient does undergo surgery, but the patient does not receive the full treatment, though they will still get a placebo effect.\n\nYou may have many questions about the ethics of sham surgeries to create a placebo.\nThese questions may have even arisen in your mind when in the general experiment context, where a possibly helpful treatment was withheld from individuals in the control group; the main difference is that a sham surgery tends to create additional risk, while withholding a treatment only maintains a person's risk.\n\nThere are always multiple viewpoints of experiments and placebos, and rarely is it obvious which is ethically \"correct\".\nFor instance, is it ethical to use a sham surgery when it creates a risk to the patient?\nHowever, if we do not use sham surgeries, we may promote the use of a costly treatment that has no real effect; if this happens, money and other resources will be diverted away from other treatments that are known to be helpful.\nUltimately, this is a difficult situation where we cannot perfectly protect both the patients who have volunteered for the study and the patients who may benefit (or not) from the treatment in the future.\n\n## Observational studies {#sec-observational-studies}\n\nData where no treatment has been explicitly applied (or explicitly withheld) is called **observational data**.\nFor instance, the loan data and county data described in @sec-data-basics are both examples of observational data.\n\n\n\n\n\nMaking causal conclusions based on experiments is often reasonable, since we can randomly assign the explanatory variable(s), i.e., the treatments.\nHowever, making the same causal conclusions based on observational data can be treacherous and is not recommended.\nThus, observational studies are generally only sufficient to show associations or form hypotheses that can be later checked with experiments.\n\n::: {.guidedpractice data-latex=\"\"}\nSuppose an observational study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer.\nDoes this mean sunscreen *causes* skin cancer?[^02-data-design-9]\n:::\n\n[^02-data-design-9]: No.\n See the paragraph 
following the question!\n\nSome previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer.\nOne important piece of information that is absent is sun exposure.\nIf someone is out in the sun all day, they are more likely to use sunscreen *and* more likely to get skin cancer.\nExposure to the sun is unaccounted for in the simple observational investigation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](02-data-design_files/figure-html/sun-causes-cancer-1.png){fig-alt='Three boxes are shown in a triangle arrangement representing: sun exposure, using sunscreen, and skin cancer. A solid arrow connects sun exposure as a causal mechanism to using sunscreen; a solid arrow also connects sun exposure as a causal mechanism to skin cancer. A questioning arrow indicates that the causal effect of using sunscreen on skin cancer is unknown.' width=60%}\n:::\n:::\n\n\nIn this example, sun exposure is a confounding variable.\nThe presence of confounding variables is what inhibits the ability for observational studies to make causal claims.\nWhile one method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables, there is no guarantee that all confounding variables can be examined or measured.\n\n::: {.guidedpractice data-latex=\"\"}\n@fig-county-multi-unit-homeownership shows a negative association between the homeownership rate and the percentage of housing units that are in multi-unit structures in a county.\nHowever, it is unreasonable to conclude that there is a causal relationship between the two variables.\nSuggest a variable that might explain the negative relationship.[^02-data-design-10]\n:::\n\n[^02-data-design-10]: Answers will vary.\n Population density may be important.\n If a county is very dense, then this may require a larger percentage of residents to live in housing units that are in multi-unit structures.\n Additionally, the high density may contribute to increases in property value, making homeownership unfeasible for many residents.\n\nObservational studies come in two forms: prospective and retrospective studies.\nA **prospective study** identifies individuals and collects information as events unfold.\nFor instance, medical researchers may identify and follow a group of patients over many years to assess the possible influences of behavior on cancer risk.\nOne example of such a study is The Nurses' Health Study.\nStarted in 1976 and expanded in 1989, the Nurses' Health Study has collected data on over 275,000 nurses and is still enrolling participants.\nThis prospective study recruits registered nurses and then collects data from them using questionnaires.\n**Retrospective studies** collect data after events have taken place, e.g., researchers may review past events in medical records.\nSome datasets may contain both prospectively- and retrospectively collected variables, such as medical studies which gather information on participants' lives before they enter the study and subsequently collect data on participants throughout the study.\n\n\n\n\n\n\\clearpage\n\n## Chapter review {#sec-chp2-review}\n\n### Summary\n\nA strong analyst will have a good sense of the types of data they are working with and how to visualize the data in order to gain a complete understanding of the variables.\nEqually important however, is an understanding of the data source.\nIn this chapter, we have 
discussed randomized experiments and taking good, random, representative samples from a population.\nWhen we discuss inferential methods (starting in @sec-foundations-randomization), the conclusions that can be drawn will be dependent on how the data were collected.\n@fig-randsampValloc summarizes the differences between random assignment of treatments and random samples.[^02-data-design-11]\nRegularly revisiting @fig-randsampValloc will be important when making conclusions from a given data analysis.\n\n[^02-data-design-11]: Derived from similar figures in @ISCAM and @sleuth.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![As we will see, analysis conclusions should be made carefully according to how the data were collected. Note that very few datasets come from the top left box because usually ethics require that random assignment of treatments can only be given to volunteers. Both representative (ideally random) sampling and experiments (random assignment of treatments) are important for how statistical conclusions can be made on populations.](images/randsampValloc.png){#fig-randsampValloc width=100%}\n:::\n:::\n\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| anecdotal evidence   | experiment            | replication            |
| bias                 | multistage sample     | replication crisis     |
| blind                | non-response bias     | representative         |
| blocking             | non-response rate     | retrospective study    |
| census               | observational data    | sample                 |
| cluster              | parameter             | sample bias            |
| cluster sampling     | placebo               | simple random sample   |
| confounding variable | placebo effect        | simple random sampling |
| control              | population            | statistic              |
| control group        | prospective study     | strata                 |
| convenience sample   | pseudoreplication     | stratified sampling    |
| double-blind         | randomized experiment | treatment group        |
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#sec-chp2-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-02].\n\n::: {.exercises data-latex=\"\"}\n1. **Parameters and statistics.** \nIdentify which value represents the sample mean and which value represents the claimed population mean.\n\n a. American households spent an average of about \\$52 in 2007 on Halloween merchandise such as costumes, decorations, and candy. To see if this number had changed, researchers conducted a new survey in 2008 before industry numbers were reported. The survey included 1,500 households and found that average Halloween spending was \\$58 per household.\n\n b. The average GPA of students in 2001 at a private university was 3.37. A survey on a sample of 203 students from this university yielded an average GPA of 3.59 a decade later.\n\n1. **Sleeping in college.** \nA recent article in a college newspaper stated that college students get an average of 5.5 hours of sleep each night. \nA student who was skeptical about this value decided to conduct a survey by randomly sampling 25 students. \nOn average, the sampled students slept 6.25 hours per night. \nIdentify which value represents the sample mean and which value represents the claimed population mean.\n\n1. **Air pollution and birth outcomes, scope of inference.**\nResearchers collected data to examine the relationship between air pollutants and preterm births in Southern California. \nDuring the study air pollution levels were measured by air quality monitoring stations. \nLength of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. [@Ritz+Yu+Chapa+Fruin:2000]\n\n a. Identify the population of interest and the sample in this study.\n\n b. Comment on whether the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.\n\n1. **Cheaters, scope of inference.**\nResearchers studying the relationship between honesty, age and self-control conducted an experiment on 160 children between the ages of 5 and 15. \nThe researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet and said they would only reward children who report white. \nHalf the students were explicitly told not to cheat, and the others were not given any explicit instructions. Differences were observed in the cheating rates in the instruction and no instruction groups, as well as some differences across children's characteristics within each group. [@Bucciol:2011]\n\n a. Identify the population of interest and the sample in this study.\n\n b. Comment on whether the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.\n\n1. **Gamification and statistics, scope of inference.**\nResearchers investigating the effects of gamification (application of game-design elements and game principles in non-game contexts) on learning statistics randomly assigned 365 college students in a statistics course to one of four groups; one of these groups had no reading exercises and no gamification, one group had reading but no gamification, one group had gamification but no reading, and a final group had gamification and reading. \nStudents in all groups also attended lectures. 
\nThe study found that gamification had a positive impact on student learning compared to traditional teaching methods involving reading exercises. [@Legaki:2020]\n\n a. Identify the population of interest and the sample in this study.\n\n b. Comment on whether the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.\n \n \\clearpage\n\n1. **Stealers, scope of inference.** \nIn a study of the relationship between socio-economic class and unethical behavior, 129 University of California undergraduates at Berkeley were asked to identify themselves as having low or high social class by comparing themselves to others with the most (least) money, most (least) education, and most (least) respected jobs. \nThey were also presented with a jar of individually wrapped candies and informed that the candies were for children in a nearby laboratory, but that they could take some if they wanted. \nAfter completing some unrelated tasks, participants reported the number of candies they had taken. \nIt was found that those who were identified as upper-class took more candy than others. [@Piff:2012]\n\n a. Identify the population of interest and the sample in this study.\n\n b. Comment on whether the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.\n\n1. **Relaxing after work.** \nThe General Social Survey asked the question, *\"After an average workday, about how many hours do you have to relax or pursue activities that you enjoy?\"* to a random sample of 1,155 Americans. \nThe average relaxing time was found to be 1.65 hours. \nDetermine which of the following is an observation, a variable, a sample statistic (value calculated based on the observed sample), or a population parameter.^[The data used in this exercise comes from the [General Social Survey, 2018](https://www.openintro.org/go?id=textbook-gss-data&referrer=ims0_html).]\n\n a. An American in the sample.\n\n b. Number of hours spent relaxing after an average workday.\n\n c. 1.65.\n\n d. Average number of hours all Americans spend relaxing after an average workday.\n\n1. **Cats on YouTube.** \nSuppose you want to estimate the percentage of videos on YouTube that are cat videos. \nIt is impossible for you to watch all videos on YouTube, so you use a random video picker to select 1000 videos for you. \nYou find that 2% of these videos are cat videos. \nDetermine which of the following is an observation, a variable, a sample statistic (value calculated based on the observed sample), or a population parameter.\n\n a. Percentage of all videos on YouTube that are cat videos.\n\n b. 2%.\n\n c. A video in your sample.\n\n d. Whether a video is a cat video.\n\n1. **Course satisfaction across sections.** \nA large college class has 160 students. \nAll 160 students attend the lectures together, but the students are divided into 4 groups, each of 40 students, for lab sections administered by different teaching assistants. \nThe professor wants to conduct a survey about how satisfied the students are with the course, and he believes that the lab section a student is in might affect the student's overall satisfaction with the course.\n\n a. What type of study is this?\n\n b. Suggest a sampling strategy for carrying out this study.\n\n1. 
**Housing proposal across dorms.** \nOn a large college campus first-year students and sophomores live in dorms located on the eastern part of the campus and juniors and seniors live in dorms located on the western part of the campus. \nSuppose you want to collect student opinions on a new housing structure the college administration is proposing. and you want to make sure your survey equally represents opinions from students from all years.\n\n a. What type of study is this?\n\n b. Suggest a sampling strategy for carrying out this study.\n\n1. **Internet use and life expectancy.** \nThe following scatterplot was created as part of a study evaluating the relationship between estimated life expectancy at birth (as of 2014) and percentage of internet users (as of 2009) in 208 countries for which such data were available.^[The [`cia_factbook`](http://openintrostat.github.io/openintro/reference/cia_factbook.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](02-data-design_files/figure-html/unnamed-chunk-32-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between life expectancy and percentage of internet users.\n\n b. What type of study is this?\n\n c. State a possible confounding variable that might explain this relationship and describe its potential effect.\n\n1. **Stressed out.** \nA study that surveyed a random sample of otherwise healthy high school students found that they are more likely to get muscle cramps when they are stressed. \nThe study also noted that students drink more coffee and sleep less when they are stressed.\n\n a. What type of study is this?\n\n b. Can this study be used to conclude a causal relationship between increased stress and muscle cramps?\n\n c. State possible confounding variables that might explain the observed relationship between increased stress and muscle cramps.\n\n1. **Evaluate sampling methods.** \nA university wants to determine what fraction of its undergraduate student body support a new $25 annual fee to improve the student union. \nFor each proposed method below, indicate whether the method is reasonable or not.\n\n a. Survey a simple random sample of 500 students.\n\n b. Stratify students by their field of study, then sample 10% of students from each stratum.\n\n c. Cluster students by their ages (e.g., 18 years old in one cluster, 19 years old in one cluster, etc.), then randomly sample three clusters and survey all students in those clusters.\n\n1. **Random digit dialing.** \nThe Gallup Poll uses a procedure called random digit dialing, which creates phone numbers based on a list of all area codes in America in conjunction with the associated number of residential households in each area code. \nGive a possible reason the Gallup Poll chooses to use random digit dialing instead of picking phone numbers from the phone book.\n\n \\clearpage\n\n1. **Haters are gonna hate, study confirms.** \nA study published in the *Journal of Personality and Social Psychology* asked a group of 200 randomly sampled participants recruited online using Amazon's Mechanical Turk to evaluate how they felt about various subjects, such as camping, health care, architecture, taxidermy, crossword puzzles, and Japan in order to measure their attitude towards mostly independent stimuli. \nThen, they presented the participants with information about a new product: a microwave oven. 
\nThis microwave oven does not exist, but the participants didn't know this, and were given three positive and three negative fake reviews. \nPeople who reacted positively to the subjects on the dispositional attitude measurement also tended to react positively to the microwave oven, and those who reacted negatively tended to react negatively to it. \nResearchers concluded that *\"some people tend to like things, whereas others tend to dislike things, and a more thorough understanding of this tendency will lead to a more thorough understanding of the psychology of attitudes.\"* [@Hepler:2013]\n\n a. What are the cases?\n\n b. What is (are) the response variable(s) in this study?\n\n c. What is (are) the explanatory variable(s) in this study?\n\n d. Does the study employ random sampling? Explain. How could they have obtained participants?\n\n e. Is this an observational study or an experiment? Explain your reasoning.\n\n f. Can we establish a causal link between the explanatory and response variables?\n\n g. Can the results of the study be generalized to the population at large?\n\n1. **Family size.** \nSuppose we want to estimate household size, where a *\"household\"* is defined as people living together in the same dwelling and sharing living accommodations. \nIf we select students at random at an elementary school and ask them what their family size is, will this be a good measure of household size? \nOr will our average be biased? \nIf so, will it overestimate or underestimate the true value?\n\n1. **Sampling strategies.** \nA statistics student who is curious about the relationship between the amount of time students spend on social networking sites and their performance at school decides to conduct a survey. \nVarious research strategies for collecting data are described below. In each, name the sampling method proposed and any bias you might expect.\n\n a. They randomly sample 40 students from the study's population, give them the survey, ask them to fill it out and bring it back the next day.\n\n b. They give out the survey only to their friends, making sure each one of them fills out the survey.\n\n c. They post a link to an online survey on Facebook and ask their friends to fill out the survey.\n\n d. They randomly sample 5 classes and asks a random sample of students from those classes to fill out the survey.\n\n1. **Reading the paper.** \nBelow are excerpts from two articles published in the *NY Times*:\n\n a. An excerpt from an article titled *Risks: Smokers Found More Prone to Dementia* is below. Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning. [@news:smokingDementia]\n\n > \"Researchers analyzed data from 23,123 health plan members who participated in a voluntary exam and health behavior survey from 1978 to 1985, when they were 50-60 years old. 23 years later, about 25% of the group had dementia, including 1,136 with Alzheimer's disease and 416 with vascular dementia. After adjusting for other factors, the researchers concluded that pack-a-day smokers were 37% more likely than nonsmokers to develop dementia, and the risks went up with increased smoking; 44% for one to two packs a day; and twice the risk for more than two packs.\"\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part b.*\n :::\n \n \\clearpage\n\n b. An excerpt from an article titled *The School Bully Is Sleepy* is below. 
A friend of yours who read the article says, *\"The study shows that sleep disorders lead to bullying in school children.\"* Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study? [@news:bullySleep]\n\n > \"The University of Michigan study collected survey data from parents on each child's sleep habits and asked both parents and teachers to assess behavioral concerns. About a third of the students studied were identified by parents or teachers as having problems with disruptive behavior or bullying. The researchers found that children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.\"\n\n \n\n1. **Light and exam performance.** \nA study is designed to test the effect of light level on exam performance of students. \nThe researcher believes that light levels might have different effects on people who wear glasses and people who do not, so they want to make sure both groups of people are equally represented in each treatment. \nThe treatments are fluorescent overhead lighting, yellow overhead lighting, no overhead lighting (only desk lamps).\n\n a. What is the response variable?\n\n b. What is the explanatory variable? What are its levels?\n\n c. What is the blocking variable? What are its levels?\n\n1. **Vitamin supplements.** \nTo assess the effectiveness of taking large doses of vitamin C in reducing the duration of the common cold, researchers recruited 400 healthy volunteers from staff and students at a university. \nA quarter of the patients were assigned a placebo, and the rest were evenly divided between 1g Vitamin C, 3g Vitamin C, or 3g Vitamin C plus additives to be taken at onset of a cold for the following two days. \nAll tablets had identical appearance and packaging. \nThe nurses who handed the prescribed pills to the patients knew which patient received which treatment, but the researchers assessing the patients when they were sick did not. \nNo significant differences were observed in any measure of cold duration or severity between the four groups, and the placebo group had the shortest duration of symptoms. [@Audera:2001]\n\n a. Was this an experiment or an observational study? Why?\n\n b. What are the explanatory and response variables in this study?\n\n c. Were the patients blinded to their treatment?\n\n d. Was this study double-blind?\n\n e. Participants are ultimately able to choose whether to use the pills prescribed to them. We might expect that not all of them will adhere and take their pills. Does this introduce a confounding variable to the study? Explain your reasoning.\n\n1. **Light, noise, and exam performance.** \nA study is designed to test the effect of light level and noise level on exam performance of students. \nThe researcher believes that light and noise levels might have different effects on people who wear glasses and people who do not, so they want to make sure both groups of people are equally represented in each treatment. \nThe light treatments considered are fluorescent overhead lighting, yellow overhead lighting, no overhead lighting (only desk lamps). \nThe noise treatments considered are no noise, construction noise, and human chatter noise.\n\n a. What type of study is this?\n\n b. How many factors are considered in this study? Identify them and describe their levels.\n\n c. What is the role of the wearing glasses variable in this study?\n\n1. 
**Music and learning.** \nYou would like to conduct an experiment in class to see if students learn better if they study without any music, with music that has no lyrics (instrumental), or with music that has lyrics. \nBriefly outline a design for this study.\n\n \\clearpage\n\n1. **Soda preference.** \nYou would like to conduct an experiment in class to see if your classmates prefer the taste of regular Coke or Diet Coke. \nBriefly outline a design for this study.\n\n1. **Exercise and mental health.** \nA researcher is interested in the effects of exercise on mental health, and they propose the following study: use stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41-55 year-olds from the population. \nNext, randomly assign half the subjects from each age group to exercise twice a week and instruct the rest not to exercise. \nConduct a mental health exam at the beginning and at the end of the study and compare the results.\n\n a. What type of study is this?\n\n b. What are the treatment and control groups in this study?\n\n c. Does this study make use of blocking? If so, what is the blocking variable?\n\n d. Does this study make use of blinding?\n\n e. Comment on whether the results of the study can be used to establish a causal relationship between exercise and mental health and indicate whether the conclusions can be generalized to the population at large.\n\n f. Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?\n\n1. **Chia seeds and weight loss.** \nChia Pets -- those terra-cotta figurines that sprout fuzzy green hair -- made the chia plant a household name. But chia has gained an entirely new reputation as a diet supplement. \nIn one 2009 study, a team of researchers recruited 38 men and divided them randomly into two groups: treatment or control. \nThey also recruited 38 women, and they randomly placed half of these participants into the treatment group and the other half into the control group. \nOne group was given 25 grams of chia seeds twice a day, and the other was given a placebo. \nThe subjects volunteered to be a part of the study. \nAfter 12 weeks, the scientists found no significant difference between the groups in appetite or weight loss. [@Nieman:2009]\n\n a. What type of study is this?\n\n b. What are the experimental and control treatments in this study?\n\n c. Has blocking been used in this study? If so, what is the blocking variable?\n\n d. Has blinding been used in this study?\n\n e. Comment on whether we can make a causal statement and indicate whether we can generalize the conclusion to the population at large.\n\n1. **City council survey.** \nA city council has requested a household survey be conducted in a suburban area of their city. \nThe area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments, and others a diverse mixture of housing structures. \nFor each part below, identify the sampling methods described, and describe the statistical pros and cons of the method in the city's context.\n\n a. Randomly sample 200 households from the city.\n\n b. Divide the city into 20 neighborhoods, and sample 10 households from each neighborhood.\n\n c. Divide the city into 20 neighborhoods, randomly sample 3 neighborhoods, and then sample all households from those 3 neighborhoods.\n\n d. 
Divide the city into 20 neighborhoods, randomly sample 8 neighborhoods, and then randomly sample 50 households from those neighborhoods.\n\n e. Sample the 200 households closest to the city council offices.\n \n \\clearpage\n\n1. **Flawed reasoning.** \nIdentify the flaw(s) in reasoning in the following scenarios. \nExplain what the individuals in the study should have done differently if they wanted to make such strong conclusions.\n\n a. Students at an elementary school are given a questionnaire that they are asked to return after their parents have completed it. One of the questions asked is, *\"Do you find that your work schedule makes it difficult for you to spend time with your kids after school?\"* Of the parents who replied, 85% said *\"no\"*. Based on these results, the school officials conclude that a great majority of the parents have no difficulty spending time with their kids after school.\n\n b. A survey is conducted on a simple random sample of 1,000 women who recently gave birth, asking them about whether they smoked during pregnancy. A follow-up survey asking if the children have respiratory problems is conducted 3 years later. However, only 567 of these women are reached at the same address. The researcher reports that these 567 women are representative of all mothers.\n\n c. An orthopedist administers a questionnaire to 30 of his patients who do not have any joint problems and finds that 20 of them regularly go running. He concludes that running decreases the risk of joint problems.\n \n \\vspace{5mm}\n\n1. **Income and education in US counties.** \nThe scatterplot below shows the relationship between per capita income (in thousands of dollars) and percent of population with a bachelor's degree in 3,142 counties in the US in 2019.^[The [`county_complete`](http://openintrostat.github.io/openintro/reference/county_complete.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](02-data-design_files/figure-html/unnamed-chunk-33-1.png){width=90%}\n :::\n :::\n\n a. What are the explanatory and response variables?\n\n b. Describe the relationship between the two variables. Make sure to discuss unusual observations, if any.\n\n c. Can we conclude that having a bachelor's degree increases one's income?\n \n \\clearpage\n\n1. **Eat well, feel better.**\nIn a public health study on the effects of consumption of fruits and vegetables on psychological well-being in young adults, participants were randomly assigned to three groups: (1) diet-as-usual, (2) an ecological momentary intervention involving text message reminders to increase their fruits and vegetable consumption plus a voucher to purchase them, or (3) a fruit and vegetable intervention in which participants were given two additional daily servings of fresh fruits and vegetables to consume on top of their normal diet. \nParticipants were asked to take a nightly survey on their smartphones. \nParticipants were student volunteers at the University of Otago, New Zealand. \nAt the end of the 14-day study, only participants in the third group showed improvements to their psychological well-being across the 14-days relative to the other groups. [@conner2017let]\n\n a. What type of study is this?\n\n b. Identify the explanatory and response variables.\n\n c. Comment on whether the results of the study can be generalized to the population.\n\n d. 
Comment on whether the results of the study can be used to establish causal relationships.\n\n e. A newspaper article reporting on the study states, \"The results of this study provide proof that giving young adults fresh fruits and vegetables to eat can have psychological benefits, even over a brief period of time.\" How would you suggest revising this statement so that it can be supported by the study?\n \n \\vspace{5mm}\n\n1. **Screens, teens, and psychological well-being.** \nIn a study of three nationally representative large-scale datasets from Ireland, the United States, and the United Kingdom (n = 17,247), teenagers between the ages of 12 to 15 were asked to keep a diary of their screen time and answer questions about how they felt or acted. \nThe answers to these questions were then used to compute a psychological well-being score. \nAdditional data were collected and included in the analysis, such as each child's sex and age, and on the mother's education, ethnicity, psychological distress, and employment. \nThe study concluded that there is little clear-cut evidence that screen time decreases adolescent well-being. [@orben2018screens]\n\n a. What type of study is this?\n\n b. Identify the explanatory variables.\n\n c. Identify the response variable.\n\n d. Comment on whether the results of the study can be generalized to the population, and why.\n\n e. Comment on whether the results of the study can be used to establish causal relationships.\n\n\n:::\n", + "supporting": [ + "02-data-design_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/02-data-design/figure-html/fig-blocking-1.png b/_freeze/02-data-design/figure-html/fig-blocking-1.png new file mode 100644 index 00000000..207f47e2 Binary files /dev/null and b/_freeze/02-data-design/figure-html/fig-blocking-1.png differ diff --git a/_freeze/02-data-design/figure-html/fig-cluster-multistage-1.png b/_freeze/02-data-design/figure-html/fig-cluster-multistage-1.png new file mode 100644 index 00000000..b0c102ca Binary files /dev/null and b/_freeze/02-data-design/figure-html/fig-cluster-multistage-1.png differ diff --git a/_freeze/02-data-design/figure-html/fig-simple-stratified-1.png b/_freeze/02-data-design/figure-html/fig-simple-stratified-1.png new file mode 100644 index 00000000..d8e0138d Binary files /dev/null and b/_freeze/02-data-design/figure-html/fig-simple-stratified-1.png differ diff --git a/_freeze/02-data-design/figure-html/pop-to-sample-1.png b/_freeze/02-data-design/figure-html/pop-to-sample-1.png new file mode 100644 index 00000000..8b1d563b Binary files /dev/null and b/_freeze/02-data-design/figure-html/pop-to-sample-1.png differ diff --git a/_freeze/02-data-design/figure-html/pop-to-sub-sample-graduates-1.png b/_freeze/02-data-design/figure-html/pop-to-sub-sample-graduates-1.png new file mode 100644 index 00000000..7669044a Binary files /dev/null and b/_freeze/02-data-design/figure-html/pop-to-sub-sample-graduates-1.png differ diff --git a/_freeze/02-data-design/figure-html/sun-causes-cancer-1.png b/_freeze/02-data-design/figure-html/sun-causes-cancer-1.png new file mode 100644 index 00000000..99f72bfe Binary files /dev/null and b/_freeze/02-data-design/figure-html/sun-causes-cancer-1.png differ diff --git a/_freeze/02-data-design/figure-html/survey-sample-1.png b/_freeze/02-data-design/figure-html/survey-sample-1.png new 
file mode 100644 index 00000000..73e79745 Binary files /dev/null and b/_freeze/02-data-design/figure-html/survey-sample-1.png differ diff --git a/_freeze/02-data-design/figure-html/unnamed-chunk-32-1.png b/_freeze/02-data-design/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 00000000..edc22056 Binary files /dev/null and b/_freeze/02-data-design/figure-html/unnamed-chunk-32-1.png differ diff --git a/_freeze/02-data-design/figure-html/unnamed-chunk-33-1.png b/_freeze/02-data-design/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 00000000..2c84e18f Binary files /dev/null and b/_freeze/02-data-design/figure-html/unnamed-chunk-33-1.png differ diff --git a/_freeze/03-data-applications/execute-results/html.json b/_freeze/03-data-applications/execute-results/html.json new file mode 100644 index 00000000..591e16ef --- /dev/null +++ b/_freeze/03-data-applications/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "1011f7566a5c43b3419d3a2e8bfe6adf", + "result": { + "markdown": "# Applications: Data {#sec-data-applications}\n\n\n\n\n\n## Case study: Passwords {#case-study-passwords}\n\nStop for a second and think about how many passwords you've used so far today.\nYou've probably used one to unlock your phone, one to check email, and probably at least one to log on to a social media account.\nMade a debit purchase?\nYou've probably entered a password there too.\n\nIf you're reading this book, and particularly if you're reading it online, chances are you have had to create a password once or twice in your life.\nAnd if you are diligent about your safety and privacy, you've probably chosen passwords that would be hard for others to guess, or *crack*.\n\nIn this case study we introduce a dataset on passwords.\nThe goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset as well as to provide some \"foreshadowing\" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis.\n\n::: {.data data-latex=\"\"}\nThe [`passwords`](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-14/readme.md) data can be found in the [**tidytuesdayR**](https://thebioengineer.github.io/tidytuesdayR/) R package.\n:::\n\n@tbl-passwords-df-head shows the first ten rows from the dataset, which are the ten most common passwords.\nPerhaps unsurprisingly, \"password\" tops the list, followed by \"123456\".\n\n\n::: {#tbl-passwords-df-head .cell tbl-cap='Top ten rows of the `passwords` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| rank | password | category | value | time_unit | offline_crack_sec | strength |
|------|----------|---------------------|-------|-----------|-------------------|----------|
| 1    | password | password-related    | 6.91  | years     | 2.170             | 8        |
| 2    | 123456   | simple-alphanumeric | 18.52 | minutes   | 0.000             | 4        |
| 3    | 12345678 | simple-alphanumeric | 1.29  | days      | 0.001             | 4        |
| 4    | 1234     | simple-alphanumeric | 11.11 | seconds   | 0.000             | 4        |
| 5    | qwerty   | simple-alphanumeric | 3.72  | days      | 0.003             | 8        |
| 6    | 12345    | simple-alphanumeric | 1.85  | minutes   | 0.000             | 4        |
| 7    | dragon   | animal              | 3.72  | days      | 0.003             | 8        |
| 8    | baseball | sport               | 6.91  | years     | 2.170             | 4        |
| 9    | football | sport               | 6.91  | years     | 2.170             | 7        |
| 10   | letmein  | password-related    | 3.19  | months    | 0.084             | 8        |
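As a minimal sketch of how these data might be pulled into R, the code below uses the tidytuesdayR package referenced above; the `tt_load()` call and the `tt$passwords` element name are assumptions about that package's interface, so adjust as needed.

```r
# Sketch: load the 2020-01-14 Tidy Tuesday release and peek at the top rows.
# The tt_load() call and the `passwords` element name are assumptions about
# the tidytuesdayR interface.
library(tidytuesdayR)

tt <- tt_load("2020-01-14")   # download the 2020-01-14 Tidy Tuesday release
passwords <- tt$passwords     # extract the passwords data frame

head(passwords, 10)           # first ten rows: the ten most common passwords
```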
\n\n`````\n:::\n:::\n\n\nWhen you encounter a new dataset, taking a peek at the first few rows as we did in @tbl-passwords-df-head is almost instinctual.\nIt can often be helpful to look at the last few rows of the data as well to get a sense of the size of the data as well as potentially discover any characteristics that may not be apparent in the top few rows.\n@tbl-passwords-df-tail shows the bottom ten rows of the passwords dataset, which reveals that we are looking at a dataset of 500 passwords.\n\n\n::: {#tbl-passwords-df-tail .cell tbl-cap='Bottom ten rows of the `passwords` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| rank | password | category | value | time_unit | offline_crack_sec | strength |
|------|----------|------------------|-------|-----------|-------------------|----------|
| 491  | natasha  | name             | 3.19  | months    | 0.084             | 7        |
| 492  | sniper   | cool-macho       | 3.72  | days      | 0.003             | 8        |
| 493  | chance   | name             | 3.72  | days      | 0.003             | 7        |
| 494  | genesis  | nerdy-pop        | 3.19  | months    | 0.084             | 7        |
| 495  | hotrod   | cool-macho       | 3.72  | days      | 0.003             | 7        |
| 496  | reddog   | cool-macho       | 3.72  | days      | 0.003             | 6        |
| 497  | alexande | name             | 6.91  | years     | 2.170             | 9        |
| 498  | college  | nerdy-pop        | 3.19  | months    | 0.084             | 7        |
| 499  | jester   | name             | 3.72  | days      | 0.003             | 7        |
| 500  | passw0rd | password-related | 92.27 | years     | 29.020            | 28       |
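Assuming `passwords` has been loaded as in the earlier sketch, base R is enough to check the bottom rows and confirm the size of the dataset.

```r
# Assuming `passwords` is the data frame loaded earlier:
tail(passwords, 10)   # bottom ten rows, ranks 491 through 500
nrow(passwords)       # 500 -- the number of passwords in the dataset
```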
\n\n`````\n:::\n:::\n\n\nAt this stage it's also useful to think about how these data were collected, as that will inform the scope of any inference you can make based on your analysis of the data.\n\n::: {.guidedpractice data-latex=\"\"}\nDo these data come from an observational study or an experiment?[^03-data-applications-1]\n:::\n\n[^03-data-applications-1]: This is an observational study.\n Researchers collected data on existing passwords in use and identified most common ones to put together this dataset.\n\n::: {.guidedpractice data-latex=\"\"}\nThere are 500 rows and 7 columns in the dataset.\nWhat does each row and each column represent?[^03-data-applications-2]\n:::\n\n[^03-data-applications-2]: Each row represents a password and each column represents a variable which contains information on each password.\n\nOnce you've identified the rows and columns, it's useful to review the data dictionary to learn about what each column in the dataset represents.\nThis is provided in @tbl-passwords-var-def.\n\n\n::: {#tbl-passwords-var-def .cell tbl-cap='Variables and their descriptions for the `passwords` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| Variable | Description |
|---|---|
| rank | Popularity in the database of released passwords. |
| password | Actual text of the password. |
| category | Category password falls into. |
| value | Time to crack by online guessing. |
| time_unit | Time unit to match with value. |
| offline_crack_sec | Time to crack offline in seconds. |
| strength | Strength of password, relative only to passwords in this dataset. Lower values indicate weaker passwords. |
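A quick way to cross-check this data dictionary against the data itself is to look at the column names and types directly; a minimal sketch, assuming `passwords` has been loaded:

```r
# Column names, types, and a preview of the values (assuming `passwords` is loaded).
dplyr::glimpse(passwords)
```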
\n\n`````\n:::\n:::\n\n\nWe now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.\n\n::: {.workedexample data-latex=\"\"}\nDetermine whether each variable in the passwords dataset is numerical or categorical.\nFor numerical variables, further classify them as continuous or discrete.\nFor categorical variables, determine if the variable is ordinal.\n\n------------------------------------------------------------------------\n\nThe numerical variables in the dataset are `rank` (discrete), `value` (continuous), and `offline_crack_sec` (continuous).\nThe categorical variables are `password`, `time_unit`.\nThe strength variable is trickier to classify -- we can think of it as discrete numerical or as an ordinal variable as it takes on numerical values, however it's used to categorize the passwords on an ordinal scale.\nOne way of approaching this is thinking about whether the values the variable takes vary linearly, e.g., is the difference in strength between passwords with strength levels 8 and 9 the same as the difference with those with strength levels 9 and 10.\nIf this is not necessarily the case, we would classify the variable as ordinal.\nDetermining the classification of this variable requires understanding of how `strength` values were determined, which is a very typical workflow for working with data.\nSometimes the data dictionary (presented in @tbl-passwords-var-def) isn't sufficient, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.\n:::\n\nNext, let's try to get to know each variable a little bit better.\nFor categorical variables, this involves figuring out what their levels are and how commonly represented they are in the data.\n@fig-passwords-cat shows the distributions of the categorical variables in this dataset.\nWe can see that password strengths of 0-10 are more common than higher values.\nThe most common password category is name (e.g. michael, jennifer, jordan, etc.) and the least common is food (e.g., pepper, cheese, coffee, etc.).\nMany passwords can be cracked in the matter of days by online cracking with some taking as little as seconds and some as long as years to break.\nEach of these visualizations is a bar plot, which you will learn more about in @sec-explore-categorical.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Distributions of the categorical variables in the `passwords` dataset. 
Plot A shows the distribution of password strengths, Plot B password categories, and Plot C length of time it takes to crack the passwords by online guessing.](03-data-applications_files/figure-html/fig-passwords-cat-1.png){#fig-passwords-cat width=100%}\n:::\n:::\n\n\nSimilarly, we can examine the distributions of the numerical variables as well.\nWe already know that rank ranges between 1 and 500 in this dataset, based on @tbl-passwords-df-head and @tbl-passwords-df-tail.\nThe value variable is slightly more complicated to consider since the numerical values in that column are meaningless without the time unit that accompanies them.\n@tbl-passwords-online-crack-summary shows the minimum and maximum amount of time it takes to crack a password by online guessing.\nFor example, there are 11 passwords in the dataset that can be broken in a matter of seconds, and each of them take 11.11 seconds to break, since the minimum and the maximum of observations in this group are exactly equal to this value.\nAnd there are 65 passwords that take years to break, ranging from 2.56 years to 92.27 years.\n\n\n::: {#tbl-passwords-online-crack-summary .cell tbl-cap='Minimum and maximum amount of time it takes to crack a password by online guessing as well as the number of observations that fall into each time unit category.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| time_unit | n | min | max |
|-----------|-----|-------|-------|
| seconds | 11 | 11.11 | 11.11 |
| minutes | 51 | 1.85 | 18.52 |
| hours | 43 | 3.09 | 17.28 |
| days | 238 | 1.29 | 3.72 |
| weeks | 5 | 1.84 | 3.70 |
| months | 87 | 3.19 | 3.19 |
| years | 65 | 2.56 | 92.27 |
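A summary like the one above can be computed by grouping on the time unit; the sketch below uses the `value` and `time_unit` columns from the data dictionary and is just one of several ways to do this.

```r
# Number of passwords and the range of online-guessing crack times,
# grouped by the unit the time is reported in.
library(dplyr)

passwords |>
  group_by(time_unit) |>
  summarize(
    n   = n(),
    min = min(value),
    max = max(value)
  )
```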
\n\n`````\n:::\n:::\n\n\nEven though passwords that take a large number of years to crack can seem like good options (see @tbl-passwords-long-crack for a list of them), now that you've seen them here (and the fact that they are in a dataset of 500 most common passwords), you should not use them as secure passwords!\n\n\n::: {#tbl-passwords-long-crack .cell tbl-cap='Passwords that take the longest amount of time to crack by online guessing.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| rank | password | category | value | time_unit | offline_crack_sec | strength |
|------|----------|---------------------|------|-------|------|----|
| 26   | trustno1 | simple-alphanumeric | 92.3 | years | 29.0 | 25 |
| 336  | rush2112 | nerdy-pop           | 92.3 | years | 29.0 | 48 |
| 406  | jordan23 | sport               | 92.3 | years | 29.3 | 34 |
| 500  | passw0rd | password-related    | 92.3 | years | 29.0 | 28 |
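One way to pull out these slowest-to-crack passwords is to filter on the time unit and sort by the crack time; a sketch, assuming `passwords` has been loaded:

```r
# Passwords whose online-guessing crack time is measured in years,
# ordered from longest to shortest crack time.
library(dplyr)

passwords |>
  filter(time_unit == "years") |>
  arrange(desc(value)) |>
  slice_head(n = 4)
```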
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\nThe last numerical variable in the dataset is `offline_crack_sec`.\n@fig-password-offline-crack-hist shows the distribution of this variable, which reveals that all of these passwords can be cracked offline in under 30 seconds, with a large number of them being crackable in just a few seconds.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of the length of time it takes to crack passwords offline.](03-data-applications_files/figure-html/fig-password-offline-crack-hist-1.png){#fig-password-offline-crack-hist width=90%}\n:::\n:::\n\n\nSo far we examined the distributions of each individual variable, but it would be more interesting to explore relationships between multiple variables.\n@fig-password-strength-rank-category shows the relationship between rank and strength of passwords by category, where more common passwords (those with higher rank) are plotted higher on the y-axis than those that are less common in this dataset.\nThe stronger the password, the larger text it's represented with on the plot.\nWhile this visualization reveals some passwords that are less common, and stronger than others, we should reiterate that you should not use any of these passwords.\nAnd if you already do, it's time to go change it!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rank vs. strength of 500 most common passwords by category.](03-data-applications_files/figure-html/fig-password-strength-rank-category-1.png){#fig-password-strength-rank-category width=100%}\n:::\n:::\n\n\nIn this case study, we introduced you to the very first steps a data scientist takes when they start working with a new dataset.\nIn the next few chapters, we will introduce exploratory data analysis and you'll learn more about the various types of data visualizations and summary statistics you can make to get to know your data better.\n\nBefore you move on, we encourage you to think about whether the following questions can be answered with this dataset, and if yes, how you might go about answering them.\nIt's okay if your answer is \"I'm not sure\", we simply want to get your exploratory juices flowing to prime you for what's to come!\n\n1. What characteristics are associated with a strong vs. a weak password?\n2. Do more popular passwords take shorter or longer to crack compared to less popular passwords?\n3. 
Are passwords that start with letters or numbers more common among the list of top 500 most common passwords?\n\n\\clearpage\n\n## Interactive R tutorials {#data-tutorials}\n\nNavigate the concepts you've learned in this chapter in R using the following self-paced tutorials.\nAll you need is your browser to get started!\n\n::: {.alltutorials data-latex=\"\"}\n[Tutorial 1: Introduction to data](https://openintrostat.github.io/ims-tutorials/01-data/)\n\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintrostat.github.io/ims-tutorials/01-data\n:::\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 1 - Lesson 1: Language of data](https://openintro.shinyapps.io/ims-01-data-01/)\n\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-01-data-01\n:::\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 1 - Lesson 2: Types of studies](https://openintro.shinyapps.io/ims-01-data-02/)\n\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-01-data-02\n:::\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 1 - Lesson 3: Sampling strategies and experimental design](https://openintro.shinyapps.io/ims-01-data-03/)\n\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-01-data-03\n:::\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 1 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-01-data-04/)\n\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-01-data-04\n:::\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of tutorials supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).\n:::\n\n## R labs {#data-labs}\n\nFurther apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.\n\n::: {.singlelab data-latex=\"\"}\n[Intro to R - Birth rates](https://www.openintro.org/go?id=ims-r-lab-intro-to-r)\n\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://www.openintro.org/go?i\nd=ims-r-lab-intro-to-r\n:::\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of labs supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).\n:::\n", + "supporting": [ + "03-data-applications_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/03-data-applications/figure-html/fig-password-offline-crack-hist-1.png b/_freeze/03-data-applications/figure-html/fig-password-offline-crack-hist-1.png new file mode 100644 index 00000000..74649c1f Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-password-offline-crack-hist-1.png differ diff --git a/_freeze/03-data-applications/figure-html/fig-password-strength-rank-category-1.png b/_freeze/03-data-applications/figure-html/fig-password-strength-rank-category-1.png new file mode 100644 index 00000000..7d420051 Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-password-strength-rank-category-1.png differ diff --git a/_freeze/03-data-applications/figure-html/fig-passwords-cat-1.png 
b/_freeze/03-data-applications/figure-html/fig-passwords-cat-1.png new file mode 100644 index 00000000..59a5a3cc Binary files /dev/null and b/_freeze/03-data-applications/figure-html/fig-passwords-cat-1.png differ diff --git a/_freeze/04-explore-categorical/execute-results/html.json b/_freeze/04-explore-categorical/execute-results/html.json new file mode 100644 index 00000000..db18eb33 --- /dev/null +++ b/_freeze/04-explore-categorical/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "f468cf6d1a32cfe193f64b206f651d5a", + "result": { + "markdown": "\n\n\n# Exploring categorical data {#sec-explore-categorical}\n\n::: {.chapterintro data-latex=\"\"}\nThis chapter focuses on exploring **categorical** data using summary statistics and visualizations.\nThe summaries and graphs presented in this chapter are created using statistical software; however, since this might be your first exposure to the concepts, we take our time in this chapter to detail how to create them.\nWhere possible, we present multivariate plots; plots that visualize the relationship between multiple variables.\nMastery of the content presented in this chapter will be crucial for understanding the methods and techniques introduced in the rest of the book.\n:::\n\nIn this chapter we will work with data on loans from Lending Club that you've previously seen in Chapter @sec-data-hello.\nThe `loan50` dataset from @sec-data-hello represents a sample from a larger loan dataset called `loans`.\nThis larger dataset contains information on 10,000 loans made through Lending Club.\nWe will examine the relationship between `homeownership`, which for the `loans` data can take a value of `rent`, `mortgage` (owns but has a mortgage), or `own`, and `app_type`, which indicates whether the loan application was made with a partner or whether it was an individual application.\n\n::: {.data data-latex=\"\"}\nThe [`loans_full_schema`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\nBased on the data in this dataset we have modified the `homeownership` and `application_type` variables.\nWe will refer to this modified dataset as `loans`.\n:::\n\n## Contingency tables and bar plots\n\n\n::: {.cell}\n\n:::\n\n\n@tbl-loan-home-app-type-totals summarizes two variables: `application_type` and `homeownership`.\nA table that summarizes data for two categorical variables in this way is called a **contingency table**.\nEach value in the table represents the number of times a particular combination of variable outcomes occurred.\n\nFor example, the value 3496 corresponds to the number of loans in the dataset where the borrower rents their home and the application type was by an individual.\nRow and column totals are also included.\nThe **row totals** provide the total counts across each row and the **column totals** down each column.\nWe can also create a table that shows only the overall percentages or proportions for each combination of categories, or we can create a table for a single variable, such as the one shown in @tbl-loan-homeownership-totals for the `homeownership` variable.\n\n\n\n\n::: {#tbl-loan-home-app-type-totals .cell tbl-cap='A contingency table for application type and homeownership.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
homeownership:

| application_type | rent | mortgage | own  | Total |
|------------------|------|----------|------|-------|
| joint            | 362  | 950      | 183  | 1495  |
| individual       | 3496 | 3839     | 1170 | 8505  |
| Total            | 3858 | 4789     | 1353 | 10000 |
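In R, a contingency table with its margins can be produced along the following lines; this is a minimal base R sketch, where `loans` is the modified dataset described above.

```r
# Two-way contingency table of application type by homeownership,
# with row and column totals appended. `loans` is the modified version
# of loans_full_schema described in the text.
tab <- table(loans$application_type, loans$homeownership)
addmargins(tab)
```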
\n\n`````\n:::\n:::\n\n::: {#tbl-loan-homeownership-totals .cell tbl-cap='A table summarizing the frequencies for each value of the homeownership variable -- mortgage, own, and rent.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| homeownership | Count |
|---------------|-------|
| rent | 3858 |
| mortgage | 4789 |
| own | 1353 |
| Total | 10000 |
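A one-variable frequency table like this, and the bar plots discussed next, might be produced along these lines (a sketch, assuming `loans` is available):

```r
# Counts of loans by homeownership, and a bar plot of the same counts.
library(dplyr)
library(ggplot2)

count(loans, homeownership)

ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```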
\n\n`````\n:::\n:::\n\n\nA bar plot is a common way to display a single categorical variable.\nThe left panel of @fig-loan-homeownership-bar-plot shows a **bar plot** for the `homeownership` variable.\nIn the right panel, the counts are converted into proportions, showing the proportion of observations that are in each level.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two bar plots: the left panel shows the counts, and the right panel shows the proportions of values of the homeownership variable.](04-explore-categorical_files/figure-html/fig-loan-homeownership-bar-plot-1.png){#fig-loan-homeownership-bar-plot width=90%}\n:::\n:::\n\n\n## Visualizing two categorical variables\n\n### Bar plots with two variables\n\nWe can display the distributions of two categorical variables on a bar plot concurrently.\nSuch plots are generally useful for visualizing the relationship between two categorical variables.\n@fig-loan-homeownership-app-type-bar-plot shows three such plots that visualize the relationship between `homeownership` and `application_type` variables.\nPlot A in @fig-loan-homeownership-app-type-bar-plot is a **stacked bar plot**.\nThis plot most clearly displays that loan applicants most commonly live in mortgaged homes.\nIt is difficult to say, based on this plot alone, how different application types vary across the levels of homeownership.\nPlot B is a **dodged bar plot**.\nThis plot most clearly displays that within each level of homeownership, individual applications are more common than joint applications.\nFinally, plot C is a **standardized bar plot** (also known as **filled bar plot**).\nThis plot most clearly displays that joint applications are most common among loans for applicants who live in mortgaged homes, compared to renters and owners.\nThis type of visualization is helpful in understanding the fraction of individual or joint loan applications for borrowers in each level of `homeownership`.\nAdditionally, since the proportions of joint and individual loans vary across the groups, we can conclude that the two variables are associated for this sample.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three bar plots (stacked, dodged, and standardized) displaying homeownership and application type variables.](04-explore-categorical_files/figure-html/fig-loan-homeownership-app-type-bar-plot-1.png){#fig-loan-homeownership-app-type-bar-plot width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nExamine the three bar plots in @fig-loan-homeownership-app-type-bar-plot.\nWhen is the stacked, dodged, or standardized bar plot the most useful?\n\n------------------------------------------------------------------------\n\nThe stacked bar plot is most useful when it's reasonable to assign one variable as the explanatory variable (here `homeownership`) and the other variable as the response (here `application_type`) since we are effectively grouping by one variable first and then breaking it down by the others.\n\nDodged bar plots are more agnostic in their display about which variable, if any, represents the explanatory and which the response variable.\nIt is also easy to discern the number of cases in each of the six different group combinations.\nHowever, one downside is that it tends to require more horizontal space; the narrowness of Plot B compared to the other two in @fig-loan-homeownership-app-type-bar-plot makes the plot feel a bit cramped.\nAdditionally, when two groups are of very different sizes, as we see in the group `own` relative to either of the other 
two groups, it is difficult to discern if there is an association between the variables.\n\nThe standardized stacked bar plot is helpful if the primary variable in the stacked bar plot is relatively imbalanced, e.g., the category has only a third of the observations in the category, making the simple stacked bar plot less useful for checking for an association.\nThe major downside of the standardized version is that we lose all sense of how many cases each of the bars represents.\n:::\n\n### Mosaic plots\n\nA **mosaic plot** is a visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the benefit that we still see the relative group sizes of the primary variable as well.\n\n\n\n\n\nTo get started in creating our first mosaic plot, we'll break a square into columns for each category of the variable, with the result shown in Plot A of @fig-loan-homeownership-type-mosaic-plot.\nEach column represents a level of `homeownership`, and the column widths correspond to the proportion of loans in each of those categories.\nFor instance, there are fewer loans where the borrower is an owner than where the borrower has a mortgage.\nIn general, mosaic plots use box *areas* to represent the number of cases in each category.\n\nPlot B in @fig-loan-homeownership-type-mosaic-plot displays the relationship between homeownership and application type.\nEach column is split proportionally to the number of loans from individual and joint borrowers.\nFor example, the second column represents loans where the borrower has a mortgage, and it was divided into individual loans (upper) and joint loans (lower).\nAs another example, the bottom segment of the third column represents loans where the borrower owns their home and applied jointly, while the upper segment of this column represents borrowers who are homeowners and filed individually.\nWe can again use this plot to see that the `homeownership` and `application_type` variables are associated, since some columns are divided in different vertical locations than others, which was the same technique used for checking an association in the standardized stacked bar plot.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The mosaic plots: one for homeownership alone and the other displaying the relationship between homeownership and application type.](04-explore-categorical_files/figure-html/fig-loan-homeownership-type-mosaic-plot-1.png){#fig-loan-homeownership-type-mosaic-plot width=90%}\n:::\n:::\n\n\nIn @fig-loan-homeownership-type-mosaic-plot, we chose to first split by the homeowner status of the borrower.\nHowever, we could have instead first split by the application type, as in @fig-loan-app-type-mosaic-plot.\nLike with the bar plots, it's common to use the explanatory variable to represent the first split in a mosaic plot, and then for the response to break up each level of the explanatory variable if these labels are reasonable to attach to the variables under consideration.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Mosaic plot where loans are grouped by homeownership after they have been divided into individual and joint application types.](04-explore-categorical_files/figure-html/fig-loan-app-type-mosaic-plot-1.png){#fig-loan-app-type-mosaic-plot width=90%}\n:::\n:::\n\n\n## Row and column proportions\n\nIn the previous sections we inspected visualizations of two categorical variables in bar plots and mosaic plots.\nHowever, we have not discussed how the values in the bar and mosaic plots that show 
proportions are calculated.\nIn this section we will investigate fractional breakdown of one variable in another and we can modify our contingency table to provide such a view.\n@tbl-loan-home-app-type-row-proportions shows **row proportions** for @tbl-loan-home-app-type-totals, which are computed as the counts divided by their row totals.\nThe value 3496 at the intersection of individual and rent is replaced by $3496 / 8505 = 0.411,$ i.e., 3496 divided by its row total, 8505.\nSo, what does 0.411 represent?\nIt corresponds to the proportion of individual applicants who rent.\n\n\n::: {#tbl-loan-home-app-type-row-proportions .cell tbl-cap='A contingency table with row proportions for the application type and\nhomeownership variables.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
homeownership:

| application_type | rent | mortgage | own | Total |
|------------------|-------|-------|-------|---|
| joint            | 0.242 | 0.635 | 0.122 | 1 |
| individual       | 0.411 | 0.451 | 0.138 | 1 |
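Row proportions can be computed from the contingency table by dividing each count by its row total; a minimal base R sketch:

```r
# Row proportions: each count divided by its row total,
# e.g., 3496 / 8505 = 0.411 for individual applicants who rent.
tab <- table(loans$application_type, loans$homeownership)
round(prop.table(tab, margin = 1), 3)
```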
\n\n`````\n:::\n:::\n\n\nA contingency table of the **column proportions** is computed in a similar way, where each is computed as the count divided by the corresponding column total.\n@tbl-loan-home-app-type-column-proportions shows such a table, and here the value 0.906 indicates that 90.6% of renters applied as individuals for the loan.\nThis rate is higher compared to loans from people with mortgages (80.2%) or who own their home (86.5%).\nBecause these rates vary between the three levels of `homeownership` (`rent`, `mortgage`, `own`), this provides evidence that `app_type` and `homeownership` variables may be associated.\n\n\n::: {#tbl-loan-home-app-type-column-proportions .cell tbl-cap='A contingency table with column proportions for the application type and homeownership variables.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
homeownership:

| application_type | rent | mortgage | own |
|------------------|-------|-------|-------|
| joint            | 0.094 | 0.198 | 0.135 |
| individual       | 0.906 | 0.802 | 0.865 |
| Total            | 1.000 | 1.000 | 1.000 |
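Column proportions follow the same pattern, dividing by the column totals instead; a sketch:

```r
# Column proportions: each count divided by its column total,
# e.g., 3496 / 3858 = 0.906 for renters who applied as individuals.
tab <- table(loans$application_type, loans$homeownership)
round(prop.table(tab, margin = 2), 3)
```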
\n\n`````\n:::\n:::\n\n\nRow and column proportions can also be thought of as **conditional proportions** as they tell us about the proportion of observations in a given level of a categorical variable conditional on the level of another categorical variable.\n\n\n\n\n\nWe could also have checked for an association between `application_type` and `homeownership` in @tbl-loan-home-app-type-row-proportions using row proportions.\nWhen comparing these row proportions, we would look down columns to see if the fraction of loans where the borrower rents, has a mortgage, or owns varied across the application types.\n\n::: {.guidedpractice data-latex=\"\"}\nWhat does 0.451 represent in @tbl-loan-home-app-type-row-proportions?\nWhat does 0.802 represent in @tbl-loan-home-app-type-column-proportions?[^04-explore-categorical-1]\n:::\n\n[^04-explore-categorical-1]: 0.451 represents the proportion of individual applicants who have a mortgage.\n 0.802 represents the fraction of applicants with mortgages who applied as individuals.\n\n::: {.guidedpractice data-latex=\"\"}\nWhat does 0.122 represent in @tbl-loan-home-app-type-row-proportions?\nWhat does 0.135 represent in @tbl-loan-home-app-type-column-proportions?[^04-explore-categorical-2]\n:::\n\n[^04-explore-categorical-2]: 0.122 represents the fraction of joint borrowers who own their home.\n 0.135 represents the home-owning borrowers who had a joint application for the loan.\n\n::: {.workedexample data-latex=\"\"}\nData scientists use statistics to build email spam filters.\nBy noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy.\nOne such characteristic is whether the email contains no numbers, small numbers, or big numbers.\nAnother characteristic is the email format, which indicates whether an email has any HTML content, such as bolded text.\nWe'll focus on email format and spam status using the dataset; these variables are summarized in a contingency table in @tbl-email-count-table.\nWhich would be more helpful to someone hoping to classify email as spam or regular email for this table: row or column proportions?\n\n------------------------------------------------------------------------\n\nA data scientist would be interested in how the proportion of spam changes within each email format.\nThis corresponds to column proportions: the proportion of spam in plain text emails and the proportion of spam in HTML emails.\n\nIf we generate the column proportions, we can see that a higher fraction of plain text emails are spam ($209/1195 = 17.5\\%$) than compared to HTML emails ($158/2726 = 5.8\\%$).\nThis information on its own is insufficient to classify an email as spam or not spam, as over 80% of plain text emails are not spam.\nYet, when we carefully combine this information with many other characteristics, we stand a reasonable chance of being able to classify some emails as spam or not spam with confidence.\nThis example points out that row and column proportions are not equivalent.\nBefore settling on one form for a table, it is important to consider each to ensure that the most useful table is constructed.\nHowever, sometimes it simply isn't clear which, if either, is more useful.\n:::\n\n::: {.data data-latex=\"\"}\nThe [email](http://openintrostat.github.io/openintro/reference/email.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {#tbl-email-count-table .cell tbl-cap='A contingency table for 
spam and format.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| spam | HTML | text | Total |
|----------|------|------|------|
| not spam | 2568 | 986  | 3554 |
| spam     | 158  | 209  | 367  |
| Total    | 2726 | 1195 | 3921 |
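The column proportions quoted in the worked example can be computed from the `email` data in the openintro package; the sketch below assumes the `spam` and `format` columns named in the text, though the exact labeling of their levels (for example, 0/1 indicators rather than text labels) should be checked against the data itself.

```r
# Proportion of spam within each email format, using the openintro email data.
# How the spam and format levels are coded (e.g., 0/1 vs. labels) may differ
# from the table in the text; check the data before interpreting.
library(openintro)

tab <- table(email$spam, email$format)
addmargins(tab)                          # counts with row/column totals
round(prop.table(tab, margin = 2), 3)    # spam rate within each format
```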
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nLook back to @tbl-loan-home-app-type-row-proportions and @tbl-loan-home-app-type-column-proportions.\nAre there any obvious scenarios where one might be more useful than the other?\n\n------------------------------------------------------------------------\n\nNone that we think are obvious!\nWhat is distinct about the email example is that the two loan variables do not have a clear explanatory-response variable relationship that we might hypothesize.\nUsually it is most useful to \"condition\" on the explanatory variable.\nFor instance, in the email example, the email format was seen as a possible explanatory variable of whether the message was spam, so we would find it more interesting to compute the relative frequencies (proportions) for each email format.\n:::\n\n## Pie charts\n\nA **pie chart** is shown in @fig-loan-homeownership-pie-chart alongside a bar plot representing the same information.\nPie charts can be useful for giving a high-level overview to show how a set of cases break down.\nHowever, it is also difficult to decipher certain details in a pie chart.\nFor example, it's not immediately obvious that there are more loans where the borrower has a mortgage than rent when looking at the pie chart, while this detail is very obvious in the bar plot.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A pie chart and bar plot of homeownership.](04-explore-categorical_files/figure-html/fig-loan-homeownership-pie-chart-1.png){#fig-loan-homeownership-pie-chart width=90%}\n:::\n:::\n\n\nPie charts can work well when the goal is to visualize a categorical variable with very few levels, and especially if each level represents a simple fraction (e.g., one-half, one-quarter, etc.).\nHowever, they can be quite difficult to read when they are used to visualize a categorical variable with many levels.\nFor example, the pie chart and the bar plot in @fig-loan-grade-pie-chart both represent the distribution of loan grades (A through G).\nIn this case, it is far easier to compare the counts of each loan grade using the bar plot than the pie chart.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A pie chart and bar plot of loan grades.](04-explore-categorical_files/figure-html/fig-loan-grade-pie-chart-1.png){#fig-loan-grade-pie-chart width=90%}\n:::\n:::\n\n\n## Waffle charts\n\nAnother useful technique of visualizing categorical data is a **waffle chart**.\nWaffle charts can be used to communicate the proportion of the data that falls into each level of a categorical variable.\nJust like with pie charts, they work best when the number of levels represented is low.\nHowever, unlike pie charts, they can make it easier to compare proportions that represent non-simple fractions.\n@fig-loan-waffle displays two examples of waffle charts: one for the distribution of homeownership and the other for the distribution of loan status.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot A: Waffle chart of homeownership, with levels rent, mortgage, and own. 
Plot B: Waffle chart of loan status, with levels current, fully paid, in grade period, and late.](04-explore-categorical_files/figure-html/fig-loan-waffle-1.png){#fig-loan-waffle width=90%}\n:::\n:::\n\n\n## Comparing numerical data across groups\n\nSome of the more interesting investigations can be considered by examining numerical data across groups.\nIn this section we will expand on a few methods we have already seen to make plots for numerical data from multiple groups on the same graph as well as introduce a few new methods for comparing numerical data across groups.\n\nWe will revisit the `county` dataset and compare the median household income for counties that gained population from 2010 to 2017 versus counties that had no gain.\nWhile we might like to make a causal connection between income and population growth, remember that these are observational data and so such an interpretation would be, at best, half-baked.\n\n\n::: {.cell}\n\n:::\n\n\nWe have data on 3142 counties in the United States.\nWe are missing 2017 population data from 3 of them, and of the remaining 3139 counties, in 1541 the population increased from 2010 to 2017 and in the remaining 1598 the population decreased.\n@tbl-countyIncomeSplitByPopGainTable shows a sample of 5 observations from each group.\n\n\n::: {#tbl-countyIncomeSplitByPopGainTable .cell tbl-cap='The median household income from a random sample of 5 counties with population gain between 2010 to 2017 and another random sample of 5 counties with no population gain.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
State County Population change (%) Gain / No gain Median household income
Arkansas Izard County 2.13 gain 39135
Georgia Jackson County 10.17 gain 57999
Oregon Hood River County 3.41 gain 57269
Texas Montague County 0.75 gain 46592
Virginia Appomattox County 2.38 gain 54875
Kentucky Ballard County -2.62 no gain 42988
Kentucky Fleming County -0.71 no gain 41095
Kentucky Letcher County -5.13 no gain 30293
Maine Penobscot County -0.73 no gain 47886
Virginia Richmond County -0.19 no gain 47341\n\n`````\n:::\n:::
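\n\nA table like this one can be assembled in a few lines of code. Below is a minimal sketch using the **dplyr** package; it is not the code used to build the table, it assumes the `county` data frame from the **usdata** package, it treats a positive `pop_change` as a population gain, and the column names `name` and `state` are assumptions.\n\n```r\n# Sketch: split counties into gain / no gain groups and draw 5 from each.\n# Assumes the `county` data frame from the usdata package; `name` and\n# `state` are assumed to be the county and state columns.\nlibrary(dplyr)\nlibrary(usdata)\n\ncounty |>\n  filter(!is.na(pop_change)) |>\n  mutate(gain = ifelse(pop_change > 0, \"gain\", \"no gain\")) |>\n  group_by(gain) |>\n  slice_sample(n = 5) |>\n  select(state, name, pop_change, gain, median_hh_income)\n```\n\nBecause `slice_sample()` draws at random, the five counties returned will differ from run to run.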
\n\n\nColor can be used to split histograms (see @sec-histograms for an introduction to histograms) for numerical variables by levels of a categorical variable.\nAn example of this is shown in Plot A of @fig-countyIncomeSplitByPopGain.\nThe **side-by-side box plot** is another traditional tool for comparing across groups.\nAn example is shown in Plot B of @fig-countyIncomeSplitByPopGain, where there are two box plots (see @sec-boxplots for an introduction to box plots), one for each group, placed into one plotting window and drawn on the same scale.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histograms (Plot A) and side-by-side box plots (Plot B) for median household income, where counties are split by whether there was a population gain or not.](04-explore-categorical_files/figure-html/fig-countyIncomeSplitByPopGain-1.png){#fig-countyIncomeSplitByPopGain width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nUse the plots in @fig-countyIncomeSplitByPopGain to compare the incomes for counties across the two groups.\nWhat do you notice about the approximate center of each group?\nWhat do you notice about the variability between groups?\nIs the shape relatively consistent between groups?\nHow many *prominent* modes are there for each group?[^04-explore-categorical-3]\n:::\n\n[^04-explore-categorical-3]: Answers may vary a little.\n    The counties with population gains tend to have higher income (median of about \\$45,000) versus counties without a gain (median of about \\$40,000).\n    The variability is also slightly larger for the population gain group.\n    This is evident in the IQR, which is about 50% bigger in the *gain* group.\n    Both distributions show slight to moderate right skew and are unimodal.\n    The box plots indicate there are many observations far above the median in each group, though we should anticipate that many observations will fall beyond the whiskers when examining any dataset that contains more than a few hundred data points.\n\n::: {.guidedpractice data-latex=\"\"}\nWhat components of each plot in @fig-countyIncomeSplitByPopGain do you find most useful?[^04-explore-categorical-4]\n:::\n\n[^04-explore-categorical-4]: Answers will vary.\n    The side-by-side box plots are especially useful for comparing centers and spreads, while the histograms are more useful for seeing distribution shape, skew, modes, and potential anomalies.\n\nAnother useful visualization for comparing numerical data across groups is a **ridge plot**, which combines density plots (see @sec-histograms for an introduction to density plots) for various groups drawn on the same scale in a single plotting window.\n@fig-countyIncomeSplitByPopGainRidge displays a ridge plot for the distribution of median household income in counties, split by whether there was a population gain or not.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Ridge plot for median household income, where counties are split by whether there was a population gain or not.](04-explore-categorical_files/figure-html/fig-countyIncomeSplitByPopGainRidge-1.png){#fig-countyIncomeSplitByPopGainRidge width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat components of the ridge plot in @fig-countyIncomeSplitByPopGainRidge do you find most useful compared to those in @fig-countyIncomeSplitByPopGain?[^04-explore-categorical-5]\n:::\n\n[^04-explore-categorical-5]: The ridge plot gives us a better sense of the shape, and especially the modality, of the data.\n\nOne last visualization technique we'll 
highlight for comparing numerical data across groups is **faceting**.\nIn this technique we split (facet) the graphical display of the data across plotting windows based on groups.\nPlot A in @fig-countyIncomeSplitByPopGainFacetHist displays the same information as Plot A in @fig-countyIncomeSplitByPopGain, however here the distributions of median household income for counties with and without population gain are faceted across two plotting windows.\nWe preserve the same scale on the x and y axes for easier comparison.\nAn advantage of this approach is that it extends to splitting the data across levels of two categorical variables, which allows for displaying relationships between three variables.\nIn Plot B in @fig-countyIncomeSplitByPopGainFacetHist we have now split the data into four groups using the `pop_change` and `metro` variables:\n\n- top left represents counties that are *not* in a `metro`politan area with population gain,\n- top right represents counties that are in a metropolitan area with population gain,\n- bottom left represents counties that are *not* in a metropolitan area without population gain, and finally\n- bottom right represents counties that are in a metropolitan area without population gain.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Distribution of median income in counties using faceted histograms: Plot A facets by whether there was a population gain or not and Plot B facets by both population gain and whether the county is in a metropolitan area.](04-explore-categorical_files/figure-html/fig-countyIncomeSplitByPopGainFacetHist-1.png){#fig-countyIncomeSplitByPopGainFacetHist width=100%}\n:::\n:::\n\n\nWe can continue building upon this visualization to add one more variable, `median_edu`, which is the median education level in the county.\nIn @fig-countyIncomeRidgeMulti, we represent median education level using color, where pink (solid line) represents counties where the median education level is high school diploma, yellow (dashed line) is some college degree, and red (dotted line) is Bachelor's.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Distribution of median income in counties using a ridge plot, faceted by whether the county had a population gain or not as well as whether the county is in a metropolitan area and colored by the median education level in the county.](04-explore-categorical_files/figure-html/fig-countyIncomeRidgeMulti-1.png){#fig-countyIncomeRidgeMulti width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nBased on @fig-countyIncomeRidgeMulti, what can you say about how median household income in counties vary depending on population gain/no gain, metropolitan area/not, and median degree?[^04-explore-categorical-6]\n:::\n\n[^04-explore-categorical-6]: Regardless of the location (metropolitan or not) or change in population, it seems like there is an increase in median household income from individuals with only a HS diploma, to individuals with some college, to individuals with a Bachelor's degree.\n\n\\vspace{20mm}\n\n## Chapter review {#chp4-review}\n\n### Summary\n\nFluently working with categorical variables is an important skill for data analysts.\nIn this chapter we have introduced different visualizations and numerical summaries applied to categorical variables.\nThe graphical visualizations are even more descriptive when two variables are presented simultaneously.\nWe presented bar plots, mosaic plots, pie charts, and estimations of conditional proportions.\n\n### Terms\n\nWe introduced the following terms 
in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
column proportions faceted plot row totals
column totals filled bar plot side-by-side box plot
conditional proportions mosaic plot stacked bar plot
contingency table ridge plot standardized bar plot
dodged bar plot row proportions
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp4-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-04].\n\n::: {.exercises data-latex=\"\"}\n1. **Antibiotic use in children.** The bar plot and the pie chart below show the distribution of pre-existing medical conditions of children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.[^_04-ex-explore-categorical-1]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](04-explore-categorical_files/figure-html/unnamed-chunk-29-1.png){width=100%}\n :::\n :::\n\n a. What features are apparent in the bar plot but not in the pie chart?\n\n b. What features are apparent in the pie chart but not in the bar plot?\n\n c. Which graph would you prefer to use for displaying these categorical data?\n\n \\vspace{5mm}\n\n2. **Views on immigration.** Nine-hundred and ten (910) randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.\n The results of the survey by political ideology are shown below.[^_04-ex-explore-categorical-2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Response Conservative Liberal Moderate Total
Apply for citizenship 57 101 120 278
Guest worker 121 28 113 262
Leave the country 179 45 126 350
Not sure 15 1 4 20
Total 372 175 363 910
\n \n `````\n :::\n :::\n\n a. What percent of these Tampa, FL voters identify themselves as conservatives?\n\n b. What percent of these Tampa, FL voters are in favor of the citizenship option?\n\n c. What percent of these Tampa, FL voters identify themselves as conservatives and are in favor of the citizenship option?\n\n d. What percent of these Tampa, FL voters who identify themselves as conservatives are also in favor of the citizenship option?\n What percent of moderates share this view?\n What percent of liberals share this view?\n\n e. Do political ideology and views on immigration appear to be associated?\n Explain your reasoning.\n\n f. Conjecture other possible variables that might explain the potential relationship between these two variables.\n\n \\clearpage\n\n3. **Black Lives Matter.** A Washington Post-Schar School poll conducted in the United States in June 2020, among a random national sample of 1,006 adults, asked respondents whether they support or oppose protests following George Floyd's killing that have taken place in cities across the US.\n The survey also collected information on the age of the respondents.\n [@survey:blmWaPoScar:2020] The results are summarized in the stacked bar plot below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](04-explore-categorical_files/figure-html/unnamed-chunk-31-1.png){width=90%}\n :::\n :::\n\n a. Based on the stacked bar plot, do views on the protests and age appear to be associated?\n Explain your reasoning.\n\n b. Conjecture other possible variables that might explain the potential association between these two variables.\n\n \\vspace{5mm}\n\n4. **Raise taxes.** A random sample of registered voters nationally were asked whether they think it's better to raise taxes on the rich or raise taxes on the poor.\n The survey also collected information on the political party affiliation of the respondents.\n [@survey:raiseTaxes:2015]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](04-explore-categorical_files/figure-html/unnamed-chunk-32-1.png){width=90%}\n :::\n :::\n\n a. Based on the stacked bar plot shown above, do views on raising taxes and political affiliation appear to be associated?\n Explain your reasoning.\n\n b. Conjecture other possible variables that might explain the potential association between these two variables.\n\n \\clearpage\n\n5. **Heart transplant data display.** The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan.\n Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart.\n Patients were randomly assigned into treatment and control groups.\n Patients in the treatment group received a transplant, and those in the control group did not.\n The visualization below displays two different versions of the data.[^_04-ex-explore-categorical-3]\n [@Turnbull+Brown+Hu:1974]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](04-explore-categorical_files/figure-html/unnamed-chunk-33-1.png){width=100%}\n :::\n :::\n\n a. Provide one aspect of the two-group comparison that is easier to see from the stacked bar plot (left)?\n\n b. Provide one aspect of the two-group comparison that is easier to see from the standardized bar plot (right)?\n\n c. For the Heart Transplant Study which of those aspects would be more important to display?\n That is, which bar plot would be better as a data visualization?\n\n \\vspace{5mm}\n\n6. 
**Shipping holiday gifts data display.** A local news survey asked 500 randomly sampled Los Angeles residents which shipping carrier they prefer to use for shipping holiday gifts.\n The table below shows the distribution of responses by age group as well as the expected counts for each cell (shown in italics).\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](04-explore-categorical_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n\n a. Which graph (top or bottom) would you use to understand the shipping choices of people of different ages?\n\n b. Which graph (top or bottom) would you use to understand the age distribution across different types of shipping choices?\n\n c. A new shipping company would like to market to people over the age of 55.\n Who will be their biggest competitor?\n\n d. FedEx would like to reach out to grow their market share to balance the age demographics of FedEx users.\n To what age group should FedEx market?\n\n[^_04-ex-explore-categorical-1]: The [`antibiotics`](http://openintrostat.github.io/openintro/reference/antibiotics.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_04-ex-explore-categorical-2]: The [`immigration`](http://openintrostat.github.io/openintro/reference/immigration.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_04-ex-explore-categorical-3]: The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n\n:::\n", + "supporting": [ + "04-explore-categorical_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/04-explore-categorical/figure-html/fig-countyIncomeRidgeMulti-1.png b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeRidgeMulti-1.png new file mode 100644 index 00000000..613951e1 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeRidgeMulti-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGain-1.png b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGain-1.png new file mode 100644 index 00000000..c8ee81a5 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGain-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGainFacetHist-1.png b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGainFacetHist-1.png new file mode 100644 index 00000000..26936422 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGainFacetHist-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGainRidge-1.png b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGainRidge-1.png new file mode 100644 index 00000000..01aaecca Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-countyIncomeSplitByPopGainRidge-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-app-type-mosaic-plot-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-app-type-mosaic-plot-1.png new file mode 100644 
index 00000000..13a0e3a0 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-app-type-mosaic-plot-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-grade-pie-chart-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-grade-pie-chart-1.png new file mode 100644 index 00000000..0346e61e Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-grade-pie-chart-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-app-type-bar-plot-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-app-type-bar-plot-1.png new file mode 100644 index 00000000..37bcf031 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-app-type-bar-plot-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-bar-plot-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-bar-plot-1.png new file mode 100644 index 00000000..b3a18a08 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-bar-plot-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-pie-chart-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-pie-chart-1.png new file mode 100644 index 00000000..70ba6f50 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-pie-chart-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-type-mosaic-plot-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-type-mosaic-plot-1.png new file mode 100644 index 00000000..985705ad Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-homeownership-type-mosaic-plot-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/fig-loan-waffle-1.png b/_freeze/04-explore-categorical/figure-html/fig-loan-waffle-1.png new file mode 100644 index 00000000..2c8d2c85 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/fig-loan-waffle-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/unnamed-chunk-29-1.png b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 00000000..6fdbee94 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/unnamed-chunk-31-1.png b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 00000000..65108e72 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-31-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/unnamed-chunk-32-1.png b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 00000000..33e38464 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-32-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/unnamed-chunk-33-1.png b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 00000000..fdb50e89 Binary files /dev/null and b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-33-1.png differ diff --git a/_freeze/04-explore-categorical/figure-html/unnamed-chunk-34-1.png b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 00000000..b156912e Binary files 
/dev/null and b/_freeze/04-explore-categorical/figure-html/unnamed-chunk-34-1.png differ diff --git a/_freeze/05-explore-numerical/execute-results/html.json b/_freeze/05-explore-numerical/execute-results/html.json new file mode 100644 index 00000000..ed808684 --- /dev/null +++ b/_freeze/05-explore-numerical/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "a5c19f0438cc50436b382fe21c676cc2", + "result": { + "markdown": "# Exploring numerical data {#explore-numerical}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nThis chapter focuses on exploring **numerical** data using summary statistics and visualizations.\nThe summaries and graphs presented in this chapter are created using statistical software; however, since this might be your first exposure to the concepts, we take our time in this chapter to detail how to create them.\nMastery of the content presented in this chapter will be crucial for understanding the methods and techniques introduced in the rest of the book.\n:::\n\nConsider the `loan_amount` variable from the `loan50` dataset, which represents the loan size for each of 50 loans in the dataset.\n\nThis variable is numerical since we can sensibly discuss the numerical difference of the size of two loans.\nOn the other hand, area codes and zip codes are not numerical, but rather they are categorical variables.\n\nThroughout this chapter, we will apply numerical methods using the `loan50` and `county` datasets, which were introduced in @sec-data-basics.\nIf you'd like to review the variables from either dataset, see Tables @tbl-loan-50-variables and @tbl-county-variables.\n\n::: {.data data-latex=\"\"}\nThe [`county`](http://openintrostat.github.io/usdata/reference/county.html) data can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package and the [`loan50`](http://openintrostat.github.io/openintro/reference/loan50.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n## Scatterplots for paired data {#scatterplots}\n\nA **scatterplot** provides a case-by-case view of data for two numerical variables.\nIn @fig-county-multi-unit-homeownership, a scatterplot was used to examine the homeownership rate against the percentage of housing units that are in multi-unit structures (e.g., apartments) in the `county` dataset.\nAnother scatterplot is shown in @fig-loan50-amount-income, comparing the total income of a borrower `total_income` and the amount they borrowed `loan_amount` for the `loan50` dataset.\nIn any scatterplot, each point represents a single case.\nSince there are 50 cases in `loan50`, there are 50 points in @fig-loan50-amount-income.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot of loan amount versus total income for the `loan50`\ndataset.](05-explore-numerical_files/figure-html/fig-loan50-amount-income-1.png){#fig-loan50-amount-income width=90%}\n:::\n:::\n\n\nLooking at @fig-loan50-amount-income, we see that there are many borrowers with income below \\$100,000 on the left side of the graph, while there are a handful of borrowers with income above \\$250,000.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot of the median household income against the poverty\nrate for the `county` dataset. Data are from 2017. 
A statistical model has\nalso been fit to the data and is shown as a dashed line.](05-explore-numerical_files/figure-html/fig-median-hh-income-poverty-1.png){#fig-median-hh-income-poverty width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\n@fig-median-hh-income-poverty shows a plot of median household income against the poverty rate for 3142 counties in the US.\nWhat can be said about the relationship between these variables?\n\n------------------------------------------------------------------------\n\nThe relationship is evidently **nonlinear**, as highlighted by the dashed line.\nThis is different from previous scatterplots we have seen, which indicate very little, if any, curvature in the trend.\n:::\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat do scatterplots reveal about the data, and how are they useful?[^05-explore-numerical-1]\n:::\n\n[^05-explore-numerical-1]: Answers may vary.\n Scatterplots are helpful in quickly spotting associations relating variables, whether those associations come in the form of simple trends or whether those relationships are more complex.\n\n::: {.guidedpractice data-latex=\"\"}\nDescribe two variables that would have a horseshoe-shaped association in a scatterplot $(\\cap$ or $\\frown).$[^05-explore-numerical-2]\n:::\n\n[^05-explore-numerical-2]: Consider the case where your vertical axis represents something \"good\" and your horizontal axis represents something that is only good in moderation.\n Health and water consumption fit this description: we require some water to survive, but consume too much and it becomes toxic and can kill a person.\n\n## Dot plots and the mean {#dotplots}\n\nSometimes we are interested in the distribution of a single variable.\nIn these cases, a dot plot provides the most basic of displays.\nA **dot plot** is a one-variable scatterplot; an example using the interest rate of 50 loans is shown in @fig-loan-int-rate-dotplot.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A dot plot of interest rate for the `loan50` dataset. 
The rates\nhave been rounded to the nearest percent in this plot, and the\ndistribution's mean is shown as a red triangle.](05-explore-numerical_files/figure-html/fig-loan-int-rate-dotplot-1.png){#fig-loan-int-rate-dotplot width=90%}\n:::\n:::\n\n\nThe **mean**, often called the **average** is a common way to measure the center of a **distribution** of data.\nTo compute the mean interest rate, we add up all the interest rates and divide by the number of observations.\n\n\n\n\n\nThe sample mean is often labeled $\\bar{x}.$ The letter $x$ is being used as a generic placeholder for the variable of interest and the bar over the $x$ communicates we are looking at the average interest rate, which for these 50 loans is 11.57%.\nIt's useful to think of the mean as the balancing point of the distribution, and it's shown as a triangle in @fig-loan-int-rate-dotplot.\n\n::: {.important data-latex=\"\"}\n**Mean.**\n\nThe sample mean can be calculated as the sum of the observed values divided by the number of observations:\n\n$$ \\bar{x} = \\frac{x_1 + x_2 + \\cdots + x_n}{n} $$\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nExamine the equation for the mean.\nWhat does $x_1$ correspond to?\nAnd $x_2$?\nCan you infer a general meaning to what $x_i$ might represent?[^05-explore-numerical-3]\n:::\n\n[^05-explore-numerical-3]: $x_1$ corresponds to the interest rate for the first loan in the sample, $x_2$ to the second loan's interest rate, and $x_i$ corresponds to the interest rate for the $i^{th}$ loan in the dataset.\n For example, if $i = 4,$ then we are examining $x_4,$ which refers to the fourth observation in the dataset.\n\n::: {.guidedpractice data-latex=\"\"}\nWhat was $n$ in this sample of loans?[^05-explore-numerical-4]\n:::\n\n[^05-explore-numerical-4]: The sample size was $n = 50.$\n\nThe `loan50` dataset represents a sample from a larger population of loans made through Lending Club.\nWe could compute a mean for the entire population in the same way as the sample mean.\nHowever, the population mean has a special label: $\\mu.$ The symbol $\\mu$ is the Greek letter *mu* and represents the average of all observations in the population.\nSometimes a subscript, such as $_x,$ is used to represent which variable the population mean refers to, e.g., $\\mu_x.$ Oftentimes it is too expensive to measure the population mean precisely, so we often estimate $\\mu$ using the sample mean, $\\bar{x}.$\n\n::: {.pronunciation data-latex=\"\"}\nThe Greek letter $\\mu$ is pronounced *mu*, listen to the pronunciation [here](https://youtu.be/PStgY5AcEIw?t=47).\n:::\n\n::: {.workedexample data-latex=\"\"}\nAlthough we do not have an ability to *calculate* the average interest rate across all loans in the populations, we can *estimate* the population value using the sample data.\nBased on the sample of 50 loans, what would be a reasonable estimate of $\\mu_x,$ the mean interest rate for all loans in the full dataset?\n\n------------------------------------------------------------------------\n\nThe sample mean, 11.57, provides a rough estimate of $\\mu_x.$ While it is not perfect, this is our single best guess **point estimate**\\index{point estimate} of the average interest rate of all the loans in the population under study.\nIn @sec-foundations-randomization and beyond, we will develop tools to characterize the accuracy of point estimates, like the sample mean.\nAs you might have guessed, point estimates based on larger samples tend to be more accurate than those based on smaller samples.\n:::\n\n\n\n\n\nThe mean 
is useful because it allows us to rescale or standardize a metric into something more easily interpretable and comparable.\nSuppose we would like to understand if a new drug is more effective at treating asthma attacks than the standard drug.\nA trial of 1,500 adults is set up, where 500 receive the new drug, and 1000 receive a standard drug in the control group.\nResults of this trial are summarized in @tbl-drug-asthma-results.\n\n\n::: {#tbl-drug-asthma-results .cell tbl-cap='Results of a trial of 1500 adults that suffer from asthma.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n\n
New drug Standard drug
Number of patients 500 1000
Total asthma attacks 200 300
\n\n`````\n:::\n:::\n\n\nComparing the raw counts of 200 to 300 asthma attacks would make it appear that the new drug is better, but this is an artifact of the imbalanced group sizes.\nInstead, we should look at the average number of asthma attacks per patient in each group:\n\n- New drug: $200 / 500 = 0.4$ asthma attacks per patient\n- Standard drug: $300 / 1000 = 0.3$ asthma attacks per patient\n\nThe standard drug has a lower average number of asthma attacks per patient than the average in the treatment group.\n\n::: {.workedexample data-latex=\"\"}\nCome up with another example where the mean is useful for making comparisons.\n\n------------------------------------------------------------------------\n\nEmilio opened a food truck last year where he sells burritos, and his business has stabilized over the last 3 months.\nOver that 3-month period, he has made \\$11,000 while working 625 hours.\nEmilio's average hourly earnings provides a useful statistic for evaluating whether his venture is, at least from a financial perspective, worth it:\n\n$$ \\frac{\\$11000}{625\\text{ hours}} = \\$17.60\\text{ per hour} $$\n\nBy knowing his average hourly wage, Emilio now has put his earnings into a standard unit that is easier to compare with many other jobs that he might consider.\n:::\n\n::: {.workedexample data-latex=\"\"}\nSuppose we want to compute the average income per person in the US.\nTo do so, we might first think to take the mean of the per capita incomes across the 3,142 counties in the `county` dataset.\nWhat would be a better approach?\n\n------------------------------------------------------------------------\n\nThe `county` dataset is special in that each county actually represents many individual people.\nIf we were to simply average across the `income` variable, we would be treating counties with 5,000 and 5,000,000 residents equally in the calculations.\nInstead, we should compute the total income for each county, add up all the counties' totals, and then divide by the number of people in all the counties.\nIf we completed these steps with the `county` data, we would find that the per capita income for the US is \\$30,861.\nHad we computed the *simple* mean of per capita income across counties, the result would have been just \\$26,093!\n\nThis example used what is called a **weighted mean**.\nFor more information on this topic, check out the following online supplement regarding [weighted means](https://www.openintro.org/go/?id=stat_extra_weighted_mean).\n:::\n\n\n\n\n\n## Histograms and shape {#sec-histograms}\n\nDot plots show the exact value for each observation.\nThey are useful for small datasets but can become hard to read with larger samples.\nRather than showing the value of each observation, we prefer to think of the value as belonging to a *bin*.\nFor example, in the `loan50` dataset, we created a table of counts for the number of loans with interest rates between 5.0% and 7.5%, then the number of loans with rates between 7.5% and 10.0%, and so on.\nObservations that fall on the boundary of a bin (e.g., 10.00%) are allocated to the lower bin.\nThe tabulation is shown in @tbl-binnedIntRateAmountTable, and the binned counts are plotted as bars in @fig-loan50IntRateHist into what is called a **histogram**.\nNote that the histogram resembles a more heavily binned version of the stacked dot plot shown in @fig-loan-int-rate-dotplot.\n\n\n\n\n::: {#tbl-binnedIntRateAmountTable .cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n 
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Counts for the binned interest rate data.
Interest rate Count
(5% - 7.5%] 11
(7.5% - 10%] 15
(10% - 12.5%] 8
(12.5% - 15%] 4
(15% - 17.5%] 5
(17.5% - 20%] 4
(20% - 22.5%] 1
(22.5% - 25%] 1
(25% - 27.5%] 1
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![A histogram of interest rate. This distribution is strongly skewed\nto the right.](05-explore-numerical_files/figure-html/fig-loan50IntRateHist-1.png){#fig-loan50IntRateHist width=90%}\n:::\n:::\n\n\nHistograms provide a view of the **data density**.\nHigher bars represent where the data are relatively more common.\nFor instance, there are many more loans with rates between 5% and 10% than loans with rates between 20% and 25% in the dataset.\nThe bars make it easy to see how the density of the data changes relative to the interest rate.\n\n\n\n\n\nHistograms are especially convenient for understanding the shape of the data distribution.\n@fig-loan50IntRateHist suggests that most loans have rates under 15%, while only a handful of loans have rates above 20%.\nWhen the distribution of a variable trails off to the right in this way and has a longer right **tail**, the shape is said to be **right skewed**.[^05-explore-numerical-5]\n\n[^05-explore-numerical-5]: Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed to the positive end.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A density plot of interest rate. Again, the distribution is strongly skewed\nto the right.](05-explore-numerical_files/figure-html/fig-loan50IntRateDensity-1.png){#fig-loan50IntRateDensity width=90%}\n:::\n:::\n\n\n@fig-loan50IntRateDensity shows a **density plot** which is a smoothed out histogram.\nThe technical details for how to draw density plots (precisely how to smooth out the histogram) are beyond the scope of this text, but you will note that the shape, scale, and spread of the observations are displayed similarly in a histogram as in a density plot.\n\n\n\n\n\nVariables with the reverse characteristic -- a long, thinner tail to the left -- are said to be **left skewed**.\nWe also say that such a distribution has a long left tail.\nVariables that show roughly equal trailing off in both directions are called **symmetric**.\n\n\n\n\n\n::: {.important data-latex=\"\"}\nWhen data trail off in one direction, the distribution has a **long tail**.\nIf a distribution has a long left tail, it is left skewed.\nIf a distribution has a long right tail, it is right skewed.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nBesides the mean (since it was labeled), what can you see in the dot plot in @fig-loan-int-rate-dotplot that you cannot see in the histogram in @fig-loan50IntRateHist?[^05-explore-numerical-6]\n:::\n\n[^05-explore-numerical-6]: The interest rates for individual loans.\n\nIn addition to looking at whether a distribution is skewed or symmetric, histograms can be used to identify modes.\nA **mode** is represented by a prominent peak in the distribution.\nThere is only one prominent peak in the histogram of `interest_rate`.\n\nA definition of *mode* sometimes taught in math classes is the value with the most occurrences in the dataset.\nHowever, for many real-world datasets, it is common to have *no* observations with the same value in a dataset, making this definition impractical in data analysis.\n\n@fig-singleBiMultiModalPlots shows histograms that have one, two, or three prominent peaks.\nSuch distributions are called **unimodal**, **bimodal**, and **multimodal**, respectively.\nAny distribution with more than two prominent peaks is called multimodal.\nNotice that there was one prominent peak in the unimodal distribution with a second less prominent peak that was not counted since it only 
differs from its neighboring bins by a few observations.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Counting only prominent peaks, the distributions are (left to right)\nunimodal, bimodal, and multimodal. Note that the left plot is unimodal\nbecause we are counting prominent peaks, not just any peak.\n](05-explore-numerical_files/figure-html/fig-singleBiMultiModalPlots-1.png){#fig-singleBiMultiModalPlots width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\n@fig-loan50IntRateHist reveals only one prominent mode in the interest rate.\nIs the distribution unimodal, bimodal, or multimodal?[^05-explore-numerical-7]\n:::\n\n[^05-explore-numerical-7]: Remember that *uni* stands for 1 (think *uni*cycles), and *bi* stands for 2 (think *bi*cycles).\n\n::: {.guidedpractice data-latex=\"\"}\nHeight measurements of young students and adult teachers at a K-3 elementary school were taken.\nHow many modes would you expect in this height dataset?[^05-explore-numerical-8]\n:::\n\n[^05-explore-numerical-8]: There might be two height groups visible in the dataset: one of the students and one of the adults.\n That is, the data are probably bimodal.\n\nLooking for modes isn't about finding a clear and correct answer about the number of modes in a distribution, which is why *prominent*\\index{prominent} is not rigorously defined in this book.\nThe most important part of this examination is to better understand your data.\n\n## Variance and standard deviation {#variance-sd}\n\nThe mean was introduced as a method to describe the center of a variable, and **variability**\\index{variability} in the data is also important.\nHere, we introduce two measures of variability: the variance and the standard deviation.\nBoth of these are very useful in data analysis, even though their formulas are a bit tedious to calculate by hand.\nThe standard deviation is the easier of the two to comprehend, as it roughly describes how far away the typical observation is from the mean.\n\n\n\n\n\nWe call the distance of an observation from its mean its **deviation**.\nBelow are the deviations for the $1^{st},$ $2^{nd},$ $3^{rd},$ and $50^{th}$ observations in the `interest_rate` variable:\n\n\n\n\n\n\n\n$$\n\\begin{aligned}\nx_1 - \\bar{x} &= 10.9 - 11.57 = -0.67 \\\\\nx_2 - \\bar{x} &= 9.92 - 11.57 = -1.65 \\\\\nx_3 - \\bar{x} &= 26.3 - 11.57 = 14.73 \\\\\n&\\vdots \\\\\nx_{50} - \\bar{x} &= 6.08 - 11.57 = -5.49 \\\\\n\\end{aligned}\n$$\n\nIf we square these deviations and then take an average, the result is equal to the sample **variance**, denoted by $s^2$:\n\n\n\n\n\n$$\n\\begin{aligned}\ns^2 &= \\frac{(-0.67)^2 + (-1.65)^2 + (14.73)^2 + \\cdots + (-5.49)^2}{50 - 1} \\\\\n&= \\frac{0.45 + 2.72 + \\cdots + 30.14}{49} \\\\\n&= 25.52\n\\end{aligned}\n$$\n\nWe divide by $n - 1,$ rather than dividing by $n,$ when computing a sample's variance.\nThere's some mathematical nuance here, but the end result is that doing this makes this statistic slightly more reliable and useful.\n\nNotice that squaring the deviations does two things.\nFirst, it makes large values relatively much larger.\nSecond, it gets rid of any negative signs.\n\n::: {.important data-latex=\"\"}\n**Standard deviation.**\n\nThe sample standard deviation can be calculated as the square root of the sum of the squared distance of each value from the mean divided by the number of observations minus one:\n\n$$s = \\sqrt{\\frac{\\sum_{i=1}^n (x_i - \\bar{x})^2}{n-1}}$$\n:::\n\nThe **standard deviation** is defined as the square root of the variance:\n\n$$s = 
\\sqrt{25.52} = 5.05$$\n\nWhile often omitted, a subscript of $_x$ may be added to the variance and standard deviation, i.e., $s_x^2$ and $s_x^{},$ if it is useful as a reminder that these are the variance and standard deviation of the observations represented by $x_1,$ $x_2,$ ..., $x_n.$\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Variance and standard deviation.**\n\nThe variance is the average squared distance from the mean.\nThe standard deviation is the square root of the variance.\nThe standard deviation is useful when considering how far the data are distributed from the mean.\n\nThe standard deviation represents the typical deviation of observations from the mean.\nOften about 68% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations.\nHowever, these percentages are not strict rules.\n:::\n\nLike the mean, the population values for variance and standard deviation have special symbols: $\\sigma^2$ for the variance and $\\sigma$ for the standard deviation.\n\n::: {.pronunciation data-latex=\"\"}\nThe Greek letter $\\sigma$ is pronounced *sigma*, listen to the pronunciation [here](https://youtu.be/PStgY5AcEIw?t=72).\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For the interest rate variable, 34 of the 50 loans (68%) had\ninterest rates within 1 standard deviation of the mean, and 48 of the 50\nloans (96%) had rates within 2 standard deviations. Usually about 68% of\nthe data are within 1 standard deviation of the mean and 95% within 2\nstandard deviations, though this is far from a hard rule.](05-explore-numerical_files/figure-html/sdRuleForIntRate-1.png){width=90%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Three very different population distributions with the same mean (0)\nand standard deviation (1).](05-explore-numerical_files/figure-html/fig-severalDiffDistWithSdOf1-1.png){#fig-severalDiffDistWithSdOf1 width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nA good description of the shape of a distribution should include modality and whether the distribution is symmetric or skewed to one side.\nUsing @fig-severalDiffDistWithSdOf1 as an example, explain why such a description is important.[^05-explore-numerical-9]\n:::\n\n[^05-explore-numerical-9]: @fig-severalDiffDistWithSdOf1 shows three distributions that look quite different, but all have the same mean, variance, and standard deviation.\n Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal).\n Using skewness, we can distinguish between the last plot (right skewed) and the first two.\n While a picture, like a histogram, tells a more complete story, we can use modality and shape (symmetry/skew) to characterize basic information about a distribution.\n\n::: {.workedexample data-latex=\"\"}\nDescribe the distribution of the `interest_rate` variable using the histogram in @fig-loan50IntRateHist.\nThe description should incorporate the center, variability, and shape of the distribution, and it should also be placed in context.\nAlso note any especially unusual cases.\n\n------------------------------------------------------------------------\n\nThe distribution of interest rates is unimodal and skewed to the high end.\nMany of the rates fall near the mean at 11.57%, and most fall within one standard deviation (5.05%) of the mean.\nThere are a few exceptionally large interest rates in the sample that are above 20%.\n:::\n\nIn practice, the variance and standard deviation are sometimes used as a 
means to an end, where the \"end\" is being able to accurately estimate the uncertainty associated with a sample statistic.\nFor example, in @sec-foundations-mathematical the standard deviation is used in calculations that help us understand how much a sample mean varies from one sample to the next.\n\n## Box plots, quartiles, and the median {#sec-boxplots}\n\nA **box plot** summarizes a dataset using five statistics while also identifying unusual observations.\n@fig-loan-int-rate-boxplot-dotplot provides a dot plot alongside a box plot of the `interest_rate` variable from the `loan50` dataset.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot A shows a dot plot and Plot B shows a box plot of the\ndistribution of interest rates from the `loan50` dataset.\n](05-explore-numerical_files/figure-html/fig-loan-int-rate-boxplot-dotplot-1.png){#fig-loan-int-rate-boxplot-dotplot width=90%}\n:::\n:::\n\n\nThe dark line inside the box represents the **median**, which splits the data in half.\n50% of the data fall below this value and 50% fall above it.\nSince in the `loan50` dataset there are 50 observations (an even number), the median is defined as the average of the two observations closest to the $50^{th}$ percentile.\n@tbl-loan50-int-rate-sorted shows all interest rates, arranged in ascending order.\nWe can see that the $25^{th}$ and the $26^{th}$ values are both 9.93, which corresponds to the dark line in the box plot in @fig-loan-int-rate-boxplot-dotplot.\n\n\n::: {#tbl-loan50-int-rate-sorted .cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Interest rates from the `loan50` dataset, arranged in ascending order.
1 2 3 4 5 6 7 8 9 10
1 5.31 5.31 5.32 6.08 6.08 6.08 6.71 6.71 7.34 7.35
11 7.35 7.96 7.96 7.96 7.97 9.43 9.43 9.44 9.44 9.44
21 9.92 9.92 9.92 9.92 9.93 9.93 10.42 10.42 10.90 10.90
31 10.91 10.91 10.91 11.98 12.62 12.62 12.62 14.08 15.04 16.02
41 17.09 17.09 17.09 18.06 18.45 19.42 20.00 21.45 24.85 26.30\n\n`````\n:::\n:::
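\n\nAs a quick check on the median described above, the short sketch below (not the book's own code) sorts the 50 interest rates and averages the two middle values; it assumes the `loan50` data frame from the **openintro** package.\n\n```r\n# The median of 50 values is the average of the 25th and 26th sorted values.\nlibrary(openintro)\n\nrates <- sort(loan50$interest_rate)\nmean(rates[25:26])             # (9.93 + 9.93) / 2 = 9.93\nmedian(loan50$interest_rate)   # returns the same value\n```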
\n\n\nWhen there are an odd number of observations, there will be exactly one observation that splits the data into two halves, and in such a case that observation is the median (no average needed).\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Median: the number in the middle.**\n\nIf the data are ordered from smallest to largest, the **median** is the observation right in the middle.\nIf there are an even number of observations, there will be two values in the middle, and the median is taken as their average.\n:::\n\nThe second step in building a box plot is drawing a rectangle to represent the middle 50% of the data.\nThe length of the box is called the **interquartile range**, or **IQR** for short.\nIt, like the standard deviation, is a measure of \\index{variability}variability in data.\nThe more variable the data, the larger the standard deviation and IQR tend to be.\nThe two boundaries of the box are called the **first quartile** (the $25^{th}$ percentile, i.e., 25% of the data fall below this value) and the **third quartile** (the $75^{th}$ percentile, i.e., 75% of the data fall below this value), and these are often labeled $Q_1$ and $Q_3,$ respectively.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Interquartile range (IQR).**\n\nThe interquartile range (IQR) is the length of the box in a box plot.\nIt is computed as $IQR = Q_3 - Q_1,$ where $Q_1$ and $Q_3$ are the $25^{th}$ and $75^{th}$ percentiles, respectively.\n\nAn $\\alpha$ **percentile** is a number with $\\alpha$% of the observations below and $100-\\alpha$% of the observations above.\nFor example, the $90^{th}$ percentile of SAT scores is the value of the SAT score with 90% of students below that value and 10% of students above that value.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nWhat percent of the data fall between $Q_1$ and the median?\nWhat percent is between the median and $Q_3$?[^05-explore-numerical-10]\n:::\n\n[^05-explore-numerical-10]: Since $Q_1$ and $Q_3$ capture the middle 50% of the data and the median splits the data in the middle, 25% of the data fall between $Q_1$ and the median, and another 25% falls between the median and $Q_3.$\n\nExtending out from the box, the **whiskers** attempt to capture the data outside of the box.\nThe whiskers of a box plot reach to the minimum and the maximum values in the data, unless there are points that are considered unusually high or unusually low, which are identified as potential **outliers** by the box plot.\nThese are labeled with a dot on the box plot.\nThe purpose of labeling the outlying points -- instead of extending the whiskers to the minimum and maximum observed values -- is to help identify any observations that appear to be unusually distant from the rest of the data.\nThere are a variety of formulas for determining whether a particular data point is considered an outlier, and different statistical software use different formulas.\nA commonly used formula is that any observation beyond $1.5\\times IQR$ away from the first or the third quartile is considered an outlier.\nIn a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data, up to the outliers.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Outliers are extreme.**\n\nAn **outlier** is an observation that appears extreme relative to the rest of the data.\nExamining data for outliers serves many useful purposes, including\n\n- identifying strong skew \\index{strong skew} in the distribution,\n- identifying possible 
data collection or data entry errors, and\n- providing insight into interesting properties of the data.\n\nKeep in mind, however, that some datasets have a naturally long skew and outlying points do **not** represent any sort of problem in the dataset.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nUsing the box plot in @fig-loan-int-rate-boxplot-dotplot, estimate the values of the $Q_1,$ $Q_3,$ and IQR for `interest_rate` in the `loan50` dataset.[^05-explore-numerical-11]\n:::\n\n[^05-explore-numerical-11]: These visual estimates will vary a little from one person to the next: $Q_1 \\approx$ 8%, $Q_3 \\approx$ 14%, IQR $\\approx$ 14 - 8 = 6%.\n\n## Robust statistics\n\nHow are the **sample statistics** \\index{sample statistic} of the `interest_rate` dataset affected by the observation, 26.3%?\nWhat would have happened if this loan had instead been only 15%?\nWhat would happen to these summary statistics \\index{summary statistic} if the observation at 26.3% had been even larger, say 35%?\nThe three conjectured scenarios are plotted alongside the original data in @fig-loan-int-rate-robust-ex, and sample statistics are computed under each scenario in @tbl-robustOrNotTable.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Dot plots of the original interest rate data and two modified datasets.](05-explore-numerical_files/figure-html/fig-loan-int-rate-robust-ex-1.png){#fig-loan-int-rate-robust-ex width=90%}\n:::\n:::\n\n::: {#tbl-robustOrNotTable .cell tbl-cap='A comparison of how the median, IQR, mean, and standard deviation change as the value of an extereme observation from the original interest data changes.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Robust
Not robust
Scenario Median IQR Mean SD
Original data 9.93 5.75 11.6 5.05
Move 26.3% to 15% 9.93 5.75 11.3 4.61
Move 26.3% to 35% 9.93 5.75 11.7 5.68\n\n`````\n:::\n:::
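\n\nThe comparison in this table is straightforward to recreate. The following minimal sketch (not the book's own code) recomputes the four statistics after moving the largest rate, 26.3%, down to 15% and up to 35%; it assumes the `loan50` data frame from the **openintro** package.\n\n```r\n# Recompute median, IQR, mean, and SD under the three scenarios.\nlibrary(openintro)\n\nx   <- loan50$interest_rate\nx15 <- replace(x, which.max(x), 15)   # move 26.3% down to 15%\nx35 <- replace(x, which.max(x), 35)   # move 26.3% up to 35%\n\nsapply(list(original = x, to_15 = x15, to_35 = x35),\n       function(v) c(median = median(v), IQR = IQR(v), mean = mean(v), sd = sd(v)))\n```\n\nThe output mirrors the rows of the table above.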
\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhich is more affected by extreme observations, the mean or median?\nIs the standard deviation or IQR more affected by extreme observations?[^05-explore-numerical-12]\n:::\n\n[^05-explore-numerical-12]: The mean is affected more than the median.\n    The standard deviation is affected more than the IQR.\n\nThe median and IQR are called **robust statistics** because extreme observations have little effect on their values: moving the most extreme value generally has little influence on these statistics.\nOn the other hand, the mean and standard deviation are more heavily influenced by changes in extreme observations, which can be important in some situations.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nThe median and IQR did not change under the three scenarios in @tbl-robustOrNotTable.\nWhy might this be the case?\n\n------------------------------------------------------------------------\n\nThe median and IQR are only sensitive to numbers near $Q_1,$ the median, and $Q_3.$ Since values in these regions are stable in the three datasets, the median and IQR estimates are also stable.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nThe distribution of loan amounts in the `loan50` dataset is right skewed, with a few large loans lingering out into the right tail.\nIf you wanted to understand the typical loan size, should you be more interested in the mean or median?[^05-explore-numerical-13]\n:::\n\n[^05-explore-numerical-13]: If we are looking to simply understand what a typical individual loan looks like, the median is probably more useful.\n    However, if the goal is to understand something that scales well, such as the total amount of money we might need to have on hand if we were to offer 1,000 loans, then the mean would be more useful.\n\n## Transforming data {#sec-transforming-data}\n\nWhen data are very strongly skewed, we sometimes transform them so they are easier to model.\n@fig-county-unemployed-pop-transform shows two right skewed distributions: the distribution of the percentage of unemployed people and the distribution of the population in all counties in the United States.\nThe distribution of population is more strongly skewed than the distribution of the unemployment percentage, hence the log transformation results in a much bigger change in the shape of the distribution.
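\n\nA transformation like this takes only a line of code. The sketch below (an assumed approach, not the book's own code) draws the raw and log$_{10}$-transformed population histograms with **ggplot2**; it assumes the `county` data frame from the **usdata** package and that the 2017 population column is named `pop2017`.\n\n```r\n# Raw versus log10-transformed county populations.\n# `pop2017` is an assumed name for the 2017 population column.\nlibrary(ggplot2)\nlibrary(usdata)\n\nggplot(county, aes(x = pop2017)) +\n  geom_histogram()              # extremely right skewed\n\nggplot(county, aes(x = log10(pop2017))) +\n  geom_histogram()              # much closer to symmetric\n```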
\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot A: A histogram of the percentage of unemployed in all US counties.\nPlot B: A histogram of log$_{10}$-transformed unemployed percentages.\nPlot C: A histogram of population in all US counties.\nPlot D: A histogram of log$_{10}$-transformed populations.\nFor Plots B and D, the x-value corresponds to the power of 10, e.g.,\n1 on the x-axis corresponds to $10^1 =$ 10 and 5 on the x-axis corresponds\nto $10^5=$ 100,000. Data are from 2017.\n](05-explore-numerical_files/figure-html/fig-county-unemployed-pop-transform-1.png){#fig-county-unemployed-pop-transform width=100%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nConsider the histogram of county populations shown in Plot C of @fig-county-unemployed-pop-transform, which shows extreme skew.\nWhat characteristics of the plot keep it from being useful?\n\n------------------------------------------------------------------------\n\nNearly all of the data fall into the left-most bin, and the extreme skew obscures many of the potentially interesting details at the low values.\n:::\n\nThere are some standard transformations that may be useful for strongly right skewed data where much of the data is positive but clustered near zero.\nA **transformation** is a rescaling of the data using a function.\nFor instance, a plot of the logarithm (base 10) of unemployment rates and county populations results in the new histograms on the right in @fig-county-unemployed-pop-transform.\nThe transformed data are symmetric, and any potential outliers appear much less extreme than in the original dataset.\nBy reining in the outliers and extreme skew, transformations often make it easier to build statistical models for the data.\n\n\n\n\n\nTransformations can also be applied to one or both variables in a scatterplot.\nA scatterplot of the population change from 2010 to 2017 against the population in 2010 is shown in @fig-county-pop-change-transform.\nIn this first scatterplot, it's hard to decipher any interesting patterns because the population variable is so strongly skewed (left plot).\nHowever, if we apply a log$_{10}$ transformation to the population variable, as shown in @fig-county-pop-change-transform, a positive association between the variables is revealed (right plot).\nIn fact, we may be interested in fitting a trend line to the data when we explore methods around fitting regression lines in @sec-model-slr.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot A: Scatterplot of population change against the population before the change. 
\n::: {.cell}\n::: {.cell-output-display}\n![Plot A: Scatterplot of population change against the population before the change. Plot B: A scatterplot of the same data but where the population size has been log-transformed.](05-explore-numerical_files/figure-html/fig-county-pop-change-transform-1.png){#fig-county-pop-change-transform width=90%}\n:::\n:::\n\n\nTransformations other than the logarithm can be useful, too.\nFor instance, the square root $(\\sqrt{\\text{original observation}})$ and inverse $\\bigg ( \\frac{1}{\\text{original observation}} \\bigg )$ are commonly used by data scientists.\nCommon goals in transforming data are to see the data structure differently, reduce skew, assist in modeling, or straighten a nonlinear relationship in a scatterplot.\n\n## Mapping data\n\n\\index{intensity map}\n\nThe `county` dataset offers many numerical variables that we could plot using dot plots, scatterplots, or box plots, but these plots can miss the geographic nature of the data.\nWhen we encounter geographic data, we should create an **intensity map**, where colors are used to show higher and lower values of a variable.\nFigures @fig-county-intensity-map-poverty-unemp and @fig-county-intensity-map-howownership-median-income show intensity maps for poverty rate in percent (`poverty`), unemployment rate in percent (`unemployment_rate`), homeownership rate in percent (`homeownership`), and median household income in \\$1000s (`median_hh_income`).\nThe color key indicates which colors correspond to which values.\nThe intensity maps are not generally very helpful for getting precise values in any given county, but they are very helpful for seeing geographic trends and generating interesting research questions or hypotheses.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nWhat interesting features are evident in the poverty and unemployment rate intensity maps?\n\n------------------------------------------------------------------------\n\nPoverty rates are evidently higher in a few locations.\nNotably, the Deep South shows higher poverty rates, as does much of Arizona and New Mexico.\nHigh poverty rates are evident in the Mississippi flood plains a little north of New Orleans and in a large section of Kentucky.\n\nThe unemployment rate follows similar trends, and we can see correspondence between the two variables.\nIn fact, it makes sense for higher rates of unemployment to be closely related to poverty rates.\nOne observation that stands out when comparing the two maps: the poverty rate is much higher than the unemployment rate, meaning while many people may be working, they are not making enough to break out of poverty.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nWhat interesting features are evident in the median household income intensity map in @fig-county-intensity-map-howownership-median-income?[^05-explore-numerical-14]\n:::\n\n[^05-explore-numerical-14]: Answers will vary.\n    There is some correspondence between high earning and metropolitan areas, where we can see darker spots (higher median household income), though there are several exceptions.\n    You might look for large cities you are familiar with and try to spot them on the map as dark spots.\n\n\n::: {.cell}\n\n:::\n\n\n\\clearpage\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot A: Intensity map of poverty rate (percent). Plot B: Intensity map of the unemployment rate (percent).](05-explore-numerical_files/figure-html/fig-county-intensity-map-poverty-unemp-1.png){#fig-county-intensity-map-poverty-unemp width=90%}\n:::\n:::\n\n\n\\clearpage\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot A: Intensity map of homeownership rate (percent). 
Plot B: Intensity map of median household income (in thousands of USD).](05-explore-numerical_files/figure-html/fig-county-intensity-map-howownership-median-income-1.png){#fig-county-intensity-map-howownership-median-income width=90%}\n:::\n:::\n\n\n\\clearpage\n\n## Chapter review {#chp5-review}\n\n### Summary\n\nFluently working with numerical variables is an important skill for data analysts.\nIn this chapter we have introduced different visualizations and numerical summaries applied to numeric variables.\nThe graphical visualizations are even more descriptive when two variables are presented simultaneously.\nWe presented scatterplots, dot plots, histograms, and box plots.\nNumerical variables can be summarized using the mean, median, quartiles, standard deviation, and variance.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
average IQR standard deviation
bimodal left skewed symmetric
box plot mean tail
data density median third quartile
density plot multimodal transformation
deviation nonlinear unimodal
distribution outlier variability
dot plot percentile variance
first quartile point estimate weighted mean
histogram right skewed whiskers
intensity map robust statistics
interquartile range scatterplot
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp5-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-05].\n\n::: {.exercises data-latex=\"\"}\n1. **Mammal life spans.** Data were collected on life spans (in years) and gestation lengths (in days) for 62 mammals.\n A scatterplot of life span versus length of gestation is shown below.[^_05-ex-explore-numerical-1]\n [@Allison+Cicchetti:1975]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-45-1.png){width=90%}\n :::\n :::\n\n a. What type of an association is apparent between life span and length of gestation?\n\n b. What type of an association would you expect to see if the axes of the plot were reversed, i.e., if we plotted length of gestation versus life span?\n\n c. Are life span and length of gestation independent?\n Explain your reasoning.\n\n2. **Associations.** Indicate which of the plots show (a) a positive association, (b) a negative association, or (c) no association.\n Also determine if the positive and negative associations are linear or nonlinear.\n Each part may refer to more than one plot.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-46-1.png){width=90%}\n :::\n :::\n\n3. **Reproducing bacteria.** Suppose that there is only sufficient space and nutrients to support one million bacterial cells in a petri dish.\n You place a few bacterial cells in this petri dish, allow them to reproduce freely, and record the number of bacterial cells in the dish over time.\n Sketch a plot representing the relationship between number of bacterial cells and time.\n\n4. **Office productivity.** Office productivity is relatively low when the employees feel no stress about their work or job security.\n However, high levels of stress can also lead to reduced employee productivity.\n Sketch a plot to represent the relationship between stress and productivity.\n\n5. **Make-up exam.** In a class of 25 students, 24 of them took an exam in class and 1 student took a make-up exam the following day.\n The professor graded the first batch of 24 exams and found an average score of 74 points with a standard deviation of 8.9 points.\n The student who took the make-up the following day scored 64 points on the exam.\n\n a. Does the new student's score increase or decrease the average score?\n\n b. What is the new average?\n\n c. Does the new student's score increase or decrease the standard deviation of the scores?\n\n6. **Infant mortality.** The infant mortality rate is defined as the number of infant deaths per 1,000 live births.\n This rate is often used as an indicator of the level of health in a country.\n The relative frequency histogram below shows the distribution of estimated infant death rates for 224 countries for which such data were available in 2014.[^_05-ex-explore-numerical-2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-47-1.png){width=90%}\n :::\n :::\n\n a. Estimate Q1, the median, and Q3 from the histogram.\n\n b. Would you expect the mean of this dataset to be smaller or larger than the median?\n Explain your reasoning.\n\n7. 
**Days off at a mining plant.** Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average.\n The manager of this plant is under pressure from a local union to increase the amount of paid time off.\n However, he does not want to give more days off to the workers because that would be costly.\n Instead, he decides he should fire 10 employees in such a way as to raise the average number of days off that are reported by his employees.\n In order to achieve this goal, should he fire employees who have the most number of days off, least number of days off, or those who have about the average number of days off?\n\n8. **Medians and IQRs.** For each part, compare distributions A and B based on their medians and IQRs.\n You do not need to calculate these statistics; simply state how the medians and IQRs compare.\n Make sure to explain your reasoning.\n *Hint:* It may be useful to sketch dot plots of the distributions.\n\n a. **A:** 3, 5, 6, 7, 9; **B:** 3, 5, 6, 7, 20\n\n b. **A:** 3, 5, 6, 7, 9; **B:** 3, 5, 7, 8, 9\n\n c. **A:** 1, 2, 3, 4, 5; **B:** 6, 7, 8, 9, 10\n\n d. **A:** 0, 10, 50, 60, 100; **B:** 0, 100, 500, 600, 1000\n\n \\clearpage\n\n9. **Means and SDs.** For each part, compare distributions A and B based on their means and standard deviations.\n You do not need to calculate these statistics; simply state how the means and the standard deviations compare.\n Make sure to explain your reasoning.\n *Hint:* It may be useful to sketch dot plots of the distributions.\n\n a. **A:** 3, 5, 5, 5, 8, 11, 11, 11, 13; **B:** 3, 5, 5, 5, 8, 11, 11, 11, 20\n\n b. **A:** -20, 0, 0, 0, 15, 25, 30, 30; **B:** -40, 0, 0, 0, 15, 25, 30, 30\n\n c. **A:** 0, 2, 4, 6, 8, 10; **B:** 20, 22, 24, 26, 28, 30\n\n d. **A:** 100, 200, 300, 400, 500; **B:** 0, 50, 300, 550, 600\n\n10. **Histograms and box plots.** Describe (in words) the distribution in the histograms below and match them to the box plots.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-48-1.png){width=100%}\n :::\n :::\n\n11. **Air quality.** Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency.\n This index reports the pollution level and what associated health effects might be a concern.\n The index is calculated for five major air pollutants regulated by the Clean Air Act and takes values from 0 to 300, where a higher value indicates lower air quality.\n AQI was reported for a sample of 91 days in 2011 in Durham, NC. The histogram below shows the distribution of the AQI values on these days.[^_05-ex-explore-numerical-3]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-49-1.png){width=90%}\n :::\n :::\n\n a. Estimate the median AQI value of this sample.\n\n b. Would you expect the mean AQI value of this sample to be higher or lower than the median?\n Explain your reasoning.\n\n c. Estimate Q1, Q3, and IQR for the distribution.\n\n d. Would any of the days in this sample be considered to have an unusually low or high AQI?\n Explain your reasoning.\n\n \\clearpage\n\n12. **Median vs. mean.** Estimate the median for the 400 observations shown in the histogram and note whether you expect the mean to be higher or lower than the median.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-50-1.png){width=90%}\n :::\n :::\n\n13. **Histograms vs. 
box plots.** Compare the two plots below.\n What characteristics of the distribution are apparent in the histogram and not in the box plot?\n What characteristics are apparent in the box plot but not in the histogram?\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-51-1.png){width=90%}\n :::\n :::\n\n14. **Facebook friends.** Facebook data indicate that 50% of Facebook users have 100 or more friends, and that the average friend count of users is 190.\n What do these findings suggest about the shape of the distribution of number of friends of Facebook users?\n [@Backstrom:2011]\n\n \\clearpage\n\n15. **Distributions and appropriate statistics.** For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed.\n Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR.\n Explain your reasoning.\n\n a. Number of pets per household.\n\n b. Distance to work, i.e., number of miles between work and home.\n\n c. Heights of adult males.\n\n d. Age at death.\n\n e. Exam grade on an easy test.\n\n16. **Distributions and appropriate statistics.** For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed.\n Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR.\n Explain your reasoning.\n\n a. Housing prices in a country where 25% of the houses cost below \\$350,000, 50% of the houses cost below \\$450,000, 75% of the houses cost below \\$1,000,000, and there are a meaningful number of houses that cost more than \\$6,000,000.\n\n b. Housing prices in a country where 25% of the houses cost below \\$300,000, 50% of the houses cost below \\$600,000, 75% of the houses cost below \\$900,000, and very few houses that cost more than \\$1,200,000.\n\n c. Number of alcoholic drinks consumed by college students in a given week.\n Assume that most of these students do not drink since they are under 21 years old, and only a few drink excessively.\n\n d. Annual salaries of the employees at a Fortune 500 company where only a few high-level executives earn much higher salaries than all the other employees.\n\n e. Gestation time in humans where 25% of the babies are born by 38 weeks of gestation, 50% of the babies are born by 39 weeks, 75% of the babies are born by 40 weeks, and the maximum gestation length is 46 weeks.\n\n17. **TV watchers.** College students in a statistics class were asked how many hours of television they watch per week, including online streaming services.\n This sample yielded an average of 8.28 hours, with a standard deviation of 7.18 hours.\n Is the distribution of number of hours students watch television weekly symmetric?\n If not, what shape would you expect this distribution to have?\n Explain your reasoning.\n\n18. **Exam scores.** The average on a history exam (scored out of 100 points) was 85, with a standard deviation of 15.\n Is the distribution of the scores on this exam symmetric?\n If not, what shape would you expect this distribution to have?\n Explain your reasoning.\n\n19. 
**Midrange.** The *midrange* of a distribution is defined as the average of the maximum and the minimum of that distribution.\n Is this statistic robust to outliers and extreme skew?\n Explain your reasoning.\n\n \\clearpage\n\n20. **Oscar winners.** The first Oscar awards for best actor and best actress were given out in 1929.\n The histograms below show the age distribution for all best actor and best actress winners from 1929 to 2019.\n Summary statistics for these distributions are also provided.\n Compare the distributions of ages of best actor and actress winners.[^_05-ex-explore-numerical-4]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-52-1.png){width=90%}\n :::\n :::\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Mean SD n
Best actor 43.8 8.8 92
Best actress 36.2 11.9 92
\n \n `````\n :::\n :::\n\n21. **Stats scores.** The final exam scores of twenty introductory statistics students, arranged in ascending order, as follows: 57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94.\n Suppose students who score above the 75th percentile on the final exam get an A in the class.\n How many students will get an A in this class?\n\n22. **Income at the coffee shop.** The first histogram below shows the distribution of the yearly incomes of 40 patrons at a college coffee shop.\n Suppose two new people walk into the coffee shop: one making \\$225,000 and the other \\$250,000.\n The second histogram shows the new income distribution.\n Summary statistics are also provided, rounded to the nearest whole number.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-54-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
n Min Q1 Median Mean Max SD
Before 40 $60,679 $60,818 $65,238 $65,089 $69,885 $2,122
After 42 $60,679 $60,838 $65,352 $73,299 $250,000 $37,321
\n \n `````\n :::\n :::\n\n a. Would the mean or the median best represent what we might think of as a typical income for the 42 patrons at this coffee shop?\n What does this say about the robustness of the two measures?\n\n b. Would the standard deviation or the IQR best represent the amount of variability in the incomes of the 42 patrons at this coffee shop?\n What does this say about the robustness of the two measures?\n\n23. **A new statistic.** The statistic $\\frac{\\bar{x}}{median}$ can be used as a measure of skewness.\n Suppose we have a distribution where all observations are greater than 0, $x_i > 0$.\n What is the expected shape of the distribution under the following conditions?\n Explain your reasoning.\n\n a. $\\frac{\\bar{x}}{median} = 1$\n\n b. $\\frac{\\bar{x}}{median} < 1$\n\n c. $\\frac{\\bar{x}}{median} > 1$\n\n24. **Commute times.** The US census collects data on the time it takes Americans to commute to work, among many other variables.\n The histogram below shows the distribution of average commute times in 3,142 US counties in 2017.\n Also shown below is a spatial intensity map of the same data.[^_05-ex-explore-numerical-5]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-55-1.png){width=90%}\n :::\n :::\n\n a. Describe the numerical distribution and comment on whether a log transformation may be advisable for these data.\n\n b. Describe the spatial distribution of commuting times using the map.\n\n \\clearpage\n\n25. **Hispanic population.** The US census collects data on race and ethnicity of Americans, among many other variables.\n The histogram below shows the distribution of the percentage of the population that is Hispanic in 3,142 counties in the US in 2010.\n Also shown is a histogram of logs of these values.[^_05-ex-explore-numerical-6]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-56-1.png){width=80%}\n :::\n \n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-56-2.png){width=80%}\n :::\n :::\n\n a. Describe the numerical distribution and comment on why we might want to use log-transformed values in analyzing or modeling these data.\n\n b. What features of the distribution of the Hispanic population in US counties are apparent in the map but not in the histogram?\n What features are apparent in the histogram but not the map?\n\n c. Is one visualization more appropriate or helpful than the other?\n Explain your reasoning.\n\n \\clearpage\n\n26. **NYC marathon winners.** The histogram and box plots below show the distribution of finishing times for male and female winners of the New York City Marathon between 1970 and 2020.[^_05-ex-explore-numerical-7]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-57-1.png){width=90%}\n :::\n :::\n\n a. What features of the distribution are apparent in the histogram and not the box plot?\n What features are apparent in the box plot but not in the histogram?\n\n b. What may be the reason for the bimodal distribution?\n Explain.\n\n c. Compare the distribution of marathon times for men and women based on the box plot shown below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-58-1.png){width=90%}\n :::\n :::\n\n d. The time series plot shown below is another way to look at these data. 
Describe what is visible in this plot but not in the others.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](05-explore-numerical_files/figure-html/unnamed-chunk-59-1.png){width=90%}\n :::\n :::\n\n[^_05-ex-explore-numerical-1]: The [`mammals`](http://openintrostat.github.io/openintro/reference/mammals.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_05-ex-explore-numerical-2]: The [`cia_factbook`](http://openintrostat.github.io/openintro/reference/cia_factbook.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_05-ex-explore-numerical-3]: The [`pm25_2011_durham`](http://openintrostat.github.io/openintro/reference/pm25_2011_durham.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_05-ex-explore-numerical-4]: The [`oscars`](http://openintrostat.github.io/openintro/reference/oscars.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_05-ex-explore-numerical-5]: The [`county_complete`](http://openintrostat.github.io/openintro/reference/county_complete.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.\n\n[^_05-ex-explore-numerical-6]: The [`county_complete`](http://openintrostat.github.io/openintro/reference/county_complete.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.\n\n[^_05-ex-explore-numerical-7]: The [`nyc_marathon`](http://openintrostat.github.io/openintro/reference/nyc_marathon.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n\n:::\n", + "supporting": [ + "05-explore-numerical_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/05-explore-numerical/figure-html/fig-county-intensity-map-howownership-median-income-1.png b/_freeze/05-explore-numerical/figure-html/fig-county-intensity-map-howownership-median-income-1.png new file mode 100644 index 00000000..f7cebcc6 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-county-intensity-map-howownership-median-income-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-county-intensity-map-poverty-unemp-1.png b/_freeze/05-explore-numerical/figure-html/fig-county-intensity-map-poverty-unemp-1.png new file mode 100644 index 00000000..705b9518 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-county-intensity-map-poverty-unemp-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-county-pop-change-transform-1.png b/_freeze/05-explore-numerical/figure-html/fig-county-pop-change-transform-1.png new file mode 100644 index 00000000..7bc14ff8 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-county-pop-change-transform-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-county-unemployed-pop-transform-1.png b/_freeze/05-explore-numerical/figure-html/fig-county-unemployed-pop-transform-1.png new file mode 100644 index 00000000..931e0a6f Binary files /dev/null and 
b/_freeze/05-explore-numerical/figure-html/fig-county-unemployed-pop-transform-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-boxplot-dotplot-1.png b/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-boxplot-dotplot-1.png new file mode 100644 index 00000000..cad2c9c7 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-boxplot-dotplot-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-dotplot-1.png b/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-dotplot-1.png new file mode 100644 index 00000000..588e368e Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-dotplot-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-robust-ex-1.png b/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-robust-ex-1.png new file mode 100644 index 00000000..228daedd Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-loan-int-rate-robust-ex-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-loan50-amount-income-1.png b/_freeze/05-explore-numerical/figure-html/fig-loan50-amount-income-1.png new file mode 100644 index 00000000..4500ef1b Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-loan50-amount-income-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-loan50IntRateDensity-1.png b/_freeze/05-explore-numerical/figure-html/fig-loan50IntRateDensity-1.png new file mode 100644 index 00000000..d6ec9dfb Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-loan50IntRateDensity-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-loan50IntRateHist-1.png b/_freeze/05-explore-numerical/figure-html/fig-loan50IntRateHist-1.png new file mode 100644 index 00000000..e9a02530 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-loan50IntRateHist-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-median-hh-income-poverty-1.png b/_freeze/05-explore-numerical/figure-html/fig-median-hh-income-poverty-1.png new file mode 100644 index 00000000..3e589dea Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-median-hh-income-poverty-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-severalDiffDistWithSdOf1-1.png b/_freeze/05-explore-numerical/figure-html/fig-severalDiffDistWithSdOf1-1.png new file mode 100644 index 00000000..d43db5c8 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-severalDiffDistWithSdOf1-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/fig-singleBiMultiModalPlots-1.png b/_freeze/05-explore-numerical/figure-html/fig-singleBiMultiModalPlots-1.png new file mode 100644 index 00000000..97dcb737 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/fig-singleBiMultiModalPlots-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/sdRuleForIntRate-1.png b/_freeze/05-explore-numerical/figure-html/sdRuleForIntRate-1.png new file mode 100644 index 00000000..490f6c70 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/sdRuleForIntRate-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-45-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-45-1.png new file mode 100644 index 00000000..b57ef735 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-45-1.png differ diff 
--git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-46-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-46-1.png new file mode 100644 index 00000000..5fdefb61 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-46-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-47-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-47-1.png new file mode 100644 index 00000000..baa10eea Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-47-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-48-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-48-1.png new file mode 100644 index 00000000..bc14ea86 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-48-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-49-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-49-1.png new file mode 100644 index 00000000..b6e37ba6 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-49-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-50-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-50-1.png new file mode 100644 index 00000000..8bbe2f4e Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-50-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-51-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-51-1.png new file mode 100644 index 00000000..c4af7456 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-51-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-52-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-52-1.png new file mode 100644 index 00000000..4a3a90bb Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-52-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-54-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-54-1.png new file mode 100644 index 00000000..fe9f88c5 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-54-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-55-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-55-1.png new file mode 100644 index 00000000..ee11c892 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-55-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-56-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-56-1.png new file mode 100644 index 00000000..964e77e3 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-56-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-56-2.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-56-2.png new file mode 100644 index 00000000..4bbd676c Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-56-2.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-57-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-57-1.png new file mode 100644 index 00000000..b32d7a05 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-57-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-58-1.png 
b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-58-1.png new file mode 100644 index 00000000..1e96d302 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-58-1.png differ diff --git a/_freeze/05-explore-numerical/figure-html/unnamed-chunk-59-1.png b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-59-1.png new file mode 100644 index 00000000..0e0ca862 Binary files /dev/null and b/_freeze/05-explore-numerical/figure-html/unnamed-chunk-59-1.png differ diff --git a/_freeze/06-explore-applications/execute-results/html.json b/_freeze/06-explore-applications/execute-results/html.json new file mode 100644 index 00000000..aaf2fb3b --- /dev/null +++ b/_freeze/06-explore-applications/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "e7ffcce1c929e7411a0f0723e5df9813", + "result": { + "markdown": "# Applications: Explore {#sec-explore-applications}\n\n\n\n\n\n## Case study: Effective communication of exploratory results {#case-study-effective-comms}\n\nGraphs can powerfully communicate ideas directly and quickly.\nWe all know, after all, that \"a picture is worth 1000 words.\" Unfortunately, however, there are times when an image conveys a message which is inaccurate or misleading.\n\nThis chapter focuses on how graphs can best be utilized to present data accurately and effectively.\nAlong with data modeling, creative visualization is somewhat of an art.\nHowever, even with an art, there are recommended guiding principles.\nWe provide a few best practices for creating data visualizations.\n\n### Keep it simple\n\nWhen creating a graphic, keep in mind what it is that you'd like your reader to see.\nColors should be used to group items or differentiate levels in meaningful ways.\nColors can be distracting when they are only used to brighten up the plot.\n\nConsider a manufacturing company that has summarized their costs into five different categories.\nIn the two graphics provided in Figure @fig-pie-to-bar, notice that the magnitudes in the pie chart are difficult for the eye to compare.\nThat is, can your eye tell how different \"Buildings and administration\" is from \"Workplace materials\" when looking at the slices of pie?\nAdditionally, the colors in the pie chart do not mean anything and are therefore distracting.\nLastly, the three-dimensional aspect of the image does not improve the reader's ability to understand the data presented.\n\nAs an alternative, a bar plot has been provided.\nNotice how much easier it is to identify the magnitude of the differences across categories while not being distracted by other aspects of the image.\nTypically, a bar plot will be easier for the reader to digest than a pie chart, especially if the categorical data being plotted has more than just a few levels.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![A pie chart (with added irrelevant features) as compared to a simple bar plot.](images/pie-3d.jpg){#fig-pie-to-bar-1 width=50%}\n:::\n\n::: {.cell-output-display}\n![A pie chart (with added irrelevant features) as compared to a simple bar plot.](06-explore-applications_files/figure-html/fig-pie-to-bar-2.png){#fig-pie-to-bar-2 width=50%}\n:::\n:::\n\n\n### Use color to draw attention\n\nThere are many reasons why you might choose to add **color** to your plots.\nAn important principle to keep in mind is to use color to draw attention.\nOf course, you should still think about how visually pleasing your visualization is, and if you're adding color for making it visually 
pleasing without drawing attention to a particular feature, that might be fine.\nHowever, you should be critical of default coloring and explicitly decide whether to include color and how.\nNotice that in Plot B in Figure @fig-red-bar the coloring is done in such a way as to draw the reader's attention to one particular piece of information.\nThe default coloring in Plot A can be distracting and makes the reader question, for example, is there something similar about the red and purple bars?\nAlso note that not everyone sees color the same way; it's often useful to add color and one more feature (e.g., pattern) so that you can refer to the features you're drawing attention to in multiple ways.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The default coloring in the first bar plot does nothing for the understanding of the data. In the second plot, the color draws attention directly to the bar on Buildings and Administration.](06-explore-applications_files/figure-html/fig-red-bar-1.png){#fig-red-bar width=90%}\n:::\n:::\n\n\n\n\n### Tell a story\n\nFor many graphs, an important aspect is the inclusion of information which is not provided in the dataset that is being plotted.\nThe external information serves to contextualize the data and helps communicate the narrative of the research.\nIn Figure @fig-duke-hires, the graph on the right is **annotated** with information about the start of the university's fiscal year, which contextualizes the information provided by the data.\nSometimes the additional information may be a diagonal line given by $y = x$: points above the line quickly show the reader which values have a $y$ coordinate larger than the $x$ coordinate; points below the line show the opposite.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Credit: Angela Zoss and Eric Monson, Duke Data Visualization Services](images/time-series-story.png){#fig-duke-hires width=100%}\n:::\n:::\n\n\n### Order matters\n\nMost software programs have built-in methods for some of the plot details.\nFor example, the default option for the software program used in this text, R, is to order the bars in a bar plot alphabetically.\nAs seen in Figure @fig-brexit-bars, the alphabetical ordering isn't particularly meaningful for describing the data.\nSometimes it makes sense to **order** the bars from tallest to shortest (or vice versa).\nBut in this case, the best ordering is probably the one in which the questions were asked.\nAn ordering which does not make sense in the context of the problem (e.g., alphabetically here) can mislead the reader, who might take a quick glance at the axes and not read the bar labels carefully.\n\n\n\n\nIn September 2019, a YouGov survey asked 1,639 adults in Great Britain the following question[^06-explore-applications-1]:\n\n[^06-explore-applications-1]: Source: [YouGov Survey Results](https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x0msmggx08/YouGov%20-%20Brexit%20and%202019%20election.pdf), retrieved Oct 7, 2019.\n\n> How well or badly do you think the government are doing at handling Britain's exit from the European Union?\n>\n> - Very well\n> - Fairly well\n> - Fairly badly\n> - Very badly\n> - Don't know\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Bar plot three different ways. 
Plot A: Alphabetic ordering of levels, Plot B: Bars ordered in descending order of frequency, Plot C: Bars ordered in the same order as they were presented in the survey question.](06-explore-applications_files/figure-html/fig-brexit-bars-1.png){#fig-brexit-bars width=100%}\n:::\n:::\n\n\n### Make the labels as easy to read as possible\n\nThe Brexit survey results were additionally broken down by region in Great Britain.\nThe stacked bar plot allows for comparison of Brexit opinion across the five regions.\nIn Figure @fig-brexit-region the bars are vertical in Plot A and horizontal in Plot B. While the quantitative information in the two graphics is identical, flipping the graph and creating horizontal bars provides more space for the **axis labels**.\nThe easier the categories are to read, the more the reader will learn from the visualization.\nRemember, the goal is to convey as much information as possible in a succinct and clear manner.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Stacked bar plots vertically and horizontally. The horizontal orientation makes the region labels easier to read.](06-explore-applications_files/figure-html/fig-brexit-region-1.png){#fig-brexit-region width=100%}\n:::\n:::\n\n\n\n\n### Pick a purpose\n\nEvery graphical decision should be made with a **purpose**.\nAs previously mentioned, sticking with default options is not always best for conveying the narrative of your data story.\nStacked bar plots tell one part of a story.\nDepending on your research question, they may not tell the part of the story most important to the research.\nFigure @fig-seg-three-ways provides three different ways of representing the same information.\nIf the most important comparison across regions is proportion, you might prefer Plot A. If the most important comparison across regions also considers the total number of individuals in the region, you might prefer Plot B. If a separate bar plot for each region makes the point you'd like, use Plot C, which has been **faceted** by region.\n\n\n\n\n\nPlot C in Figure @fig-seg-three-ways also provides full titles and a succinct URL with the data source.\nOther deliberate decisions to consider include using informative labels and avoiding redundancy.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three different representations of the two variables including survey opinion and region. Use the graphic that best conveys the data narrative at hand.](06-explore-applications_files/figure-html/fig-seg-three-ways-1.png){#fig-seg-three-ways width=90%}\n:::\n:::\n\n\n\n\n### Select meaningful colors\n\n\n\nOne last consideration for building graphs is to consider color choices.\nDefault or rainbow colors are not always the choice which will best distinguish the level of your variables.\nMuch research has been done to find color combinations which are distinct and which are clear for differently sighted individuals.\nThe cividis scale works well with ordinal data.\n[@Nunez:2018] Figure @fig-brexit-viridis shows the same plot with two different colorings.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Identical bar plots with two different coloring options. 
Plot A uses a default color scale, Plot B uses colors from the cividis scale.](06-explore-applications_files/figure-html/fig-brexit-viridis-1.png){#fig-brexit-viridis width=90%}\n:::\n:::\n\n\nIn this chapter different representations are contrasted to demonstrate best practices in creating graphs.\nThe fundamental principle is that your graph should provide maximal information succinctly and clearly.\nLabels should be clear and oriented horizontally for the reader.\nDon't forget titles and, if possible, include the source of the data.\n\n\\clearpage\n\n## Interactive R tutorials {#explore-tutorials}\n\nNavigate the concepts you've learned in this chapter in R using the following self-paced tutorials.\nAll you need is your browser to get started!\n\n::: {.alltutorials data-latex=\"\"}\n[Tutorial 2: Exploratory data analysis](https://openintrostat.github.io/ims-tutorials/02-explore/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintrostat.github.io/ims-tutorials/02-explore\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 2 - Lesson 1: Visualizing categorical data](https://openintro.shinyapps.io/ims-02-explore-01/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-02-explore-01\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 2 - Lesson 2: Visualizing numerical data](https://openintro.shinyapps.io/ims-02-explore-02/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-02-explore-02\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 2 - Lesson 3: Summarizing with statistics](https://openintro.shinyapps.io/ims-02-explore-03/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-02-explore-03\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 2 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-02-explore-04/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-02-explore-04\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of tutorials supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).\n:::\n\n## R labs {#explore-labs}\n\nFurther apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.\n\n::: {.singlelab data-latex=\"\"}\n[Intro to data - Flight delays](https://www.openintro.org/go?id=ims-r-lab-intro-to-data)\\\n::: {.content-hidden unless-format=\"pdf\"} https://www.openintro.org/go?i\nd=ims-r-lab-intro-to-data\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of labs supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).\n:::\n", + "supporting": [ + "06-explore-applications_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/06-explore-applications/figure-html/fig-brexit-bars-1.png b/_freeze/06-explore-applications/figure-html/fig-brexit-bars-1.png new file mode 100644 index 00000000..f82b017a Binary files /dev/null and b/_freeze/06-explore-applications/figure-html/fig-brexit-bars-1.png differ diff --git 
a/_freeze/06-explore-applications/figure-html/fig-brexit-region-1.png b/_freeze/06-explore-applications/figure-html/fig-brexit-region-1.png new file mode 100644 index 00000000..d4754515 Binary files /dev/null and b/_freeze/06-explore-applications/figure-html/fig-brexit-region-1.png differ diff --git a/_freeze/06-explore-applications/figure-html/fig-brexit-viridis-1.png b/_freeze/06-explore-applications/figure-html/fig-brexit-viridis-1.png new file mode 100644 index 00000000..4005cfa7 Binary files /dev/null and b/_freeze/06-explore-applications/figure-html/fig-brexit-viridis-1.png differ diff --git a/_freeze/06-explore-applications/figure-html/fig-pie-to-bar-2.png b/_freeze/06-explore-applications/figure-html/fig-pie-to-bar-2.png new file mode 100644 index 00000000..db3864f7 Binary files /dev/null and b/_freeze/06-explore-applications/figure-html/fig-pie-to-bar-2.png differ diff --git a/_freeze/06-explore-applications/figure-html/fig-red-bar-1.png b/_freeze/06-explore-applications/figure-html/fig-red-bar-1.png new file mode 100644 index 00000000..822517bd Binary files /dev/null and b/_freeze/06-explore-applications/figure-html/fig-red-bar-1.png differ diff --git a/_freeze/06-explore-applications/figure-html/fig-seg-three-ways-1.png b/_freeze/06-explore-applications/figure-html/fig-seg-three-ways-1.png new file mode 100644 index 00000000..e8c3eb69 Binary files /dev/null and b/_freeze/06-explore-applications/figure-html/fig-seg-three-ways-1.png differ diff --git a/_freeze/07-model-slr/execute-results/html.json b/_freeze/07-model-slr/execute-results/html.json new file mode 100644 index 00000000..6af96386 --- /dev/null +++ b/_freeze/07-model-slr/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "dd5f0e3eeba32ed3e9c083fe2130ee5b", + "result": { + "markdown": "---\noutput: html_document\neditor_options: \n chunk_output_type: console\n---\n\n\n\n\n# Linear regression with a single predictor {#sec-model-slr}\n\n::: {.chapterintro data-latex=\"\"}\nLinear regression is a very powerful statistical technique.\nMany people have some familiarity with regression models just from reading the news, where straight lines are overlaid on scatterplots.\nLinear models can be used for prediction or to evaluate whether there is a linear relationship between a numerical variable on the horizontal axis and the average of the numerical variable on the vertical axis.\n:::\n\n## Fitting a line, residuals, and correlation {#fit-line-res-cor}\n\nWhen considering linear regression, it's helpful to think deeply about the line fitting process.\nIn this section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce a new statistic called *correlation*.\n\n### Fitting a line to data\n\n@fig-perfLinearModel shows two variables whose relationship can be modeled perfectly with a straight line.\nThe equation for the line is $y = 5 + 64.96 x.$ Consider what a perfect linear relationship means: we know the exact value of $y$ just by knowing the value of $x.$ A perfect linear relationship is unrealistic in almost any natural process.\nFor example, if we took family income ($x$), this value would provide some useful information about how much financial support a college may offer a prospective student ($y$).\nHowever, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's finances.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Requests from twelve separate buyers were simultaneously placed with a trading\ncompany to purchase 
Target Corporation stock (ticker TGT, December 28th, 2018),\nand the total cost of the shares was reported. Because the cost is computed using\na linear formula, the linear fit is perfect.](07-model-slr_files/figure-html/fig-perfLinearModel-1.png){#fig-perfLinearModel width=90%}\n:::\n:::\n\n\nLinear regression is the statistical method for fitting a line to data where the relationship between two variables, $x$ and $y,$ can be modeled by a straight line with some error:\n\n$$\ny = b_0 + b_1 \\ x + e\n$$\n\nThe values $b_0$ and $b_1$ represent the model's intercept and slope, respectively, and the error is represented by $e$.\nThese values are calculated based on the data, i.e., they are sample statistics.\nIf the observed data are a random sample from a target population that we are interested in making inferences about, these values are considered to be point estimates for the population parameters $\\beta_0$ and $\\beta_1$.\nWe will discuss how to make inferences about parameters of a linear model based on sample statistics in @sec-inf-model-slr.\n\n::: {.pronunciation data-latex=\"\"}\nThe Greek letter $\\beta$ is pronounced *beta*; listen to the pronunciation [here](https://youtu.be/PStgY5AcEIw?t=7).\n:::\n\nWhen we use $x$ to predict $y,$ we usually call $x$ the **predictor** variable and we call $y$ the **outcome**.\nWe also often drop the $e$ term when writing down the model since our main focus is often on the prediction of the average outcome.\n\n\n\n\nIt is rare for all of the data to fall perfectly on a straight line.\nInstead, it's more common for data to appear as a *cloud of points*, such as those examples shown in @fig-imperfLinearModel.\nIn each case, the data fall around a straight line, even if none of the observations fall exactly on the line.\nThe first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between $x$ and $y.$ The second plot shows an upward trend that, while evident, is not as strong as the first.\nThe last plot shows a very weak downward trend in the data, so slight we can hardly notice it.\nIn each of these examples, we will have some uncertainty regarding our estimates of the model parameters, $\\beta_0$ and $\\beta_1.$ For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less?\nAs we move forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three datasets where a linear model may be useful even though the data do\nnot all fall exactly on the line.\n](07-model-slr_files/figure-html/fig-imperfLinearModel-1.png){#fig-imperfLinearModel width=100%}\n:::\n:::\n\n\nThere are also cases where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful.\nOne such case is shown in @fig-notGoodAtAllForALinearModel where there is a very clear relationship between the variables even though the trend is not linear.\nWe discuss nonlinear trends in this chapter and the next, but details of fitting nonlinear models are saved for a later course.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The best fitting line for these data is flat, which is not a useful way to\ndescribe the non-linear relationship. 
These data are from a physics experiment.](07-model-slr_files/figure-html/fig-notGoodAtAllForALinearModel-1.png){#fig-notGoodAtAllForALinearModel width=90%}\n:::\n:::\n\n\n### Using linear regression to predict possum head lengths\n\nBrushtail possums are marsupials that live in Australia, and a photo of one is shown in @fig-brushtail-possum.\nResearchers captured 104 of these animals and took body measurements before releasing the animals back into the wild.\nWe consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum's head.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The common brushtail possum of Australia. Photo by Greg Schecter,\n[flic.kr/p/9BAFbR](https://flic.kr/p/9BAFbR), CC BY 2.0 license.\n](images/brushtail-possum/brushtail-possum.jpg){#fig-brushtail-possum fig-alt='Photograph of a common brushtail possum of Australia.' width=50%}\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n@fig-scattHeadLTotalL shows a scatterplot for the head length (mm) and total length (cm) of the possums.\nEach point represents a single possum from the data.\nThe head and total length variables are associated: possums with an above average total length also tend to have above average head lengths.\nWhile the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot showing head length against total length for 104 brushtail\npossums. A point representing a possum with head length 86.7 mm and total\nlength 84 cm is highlighted.](07-model-slr_files/figure-html/fig-scattHeadLTotalL-1.png){#fig-scattHeadLTotalL width=90%}\n:::\n:::\n\n\nWe want to describe the relationship between the head length and total length variables in the possum dataset using a line.\nIn this example, we will use the total length as the predictor variable, $x,$ to predict a possum's head length, $y.$ We could fit the linear relationship by eye, as in @fig-scattHeadLTotalLLine.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A reasonable linear model was fit to represent the relationship between\nhead length and total length.](07-model-slr_files/figure-html/fig-scattHeadLTotalLLine-1.png){#fig-scattHeadLTotalLLine width=90%}\n:::\n:::\n\n\nThe equation for this line is\n\n$$\n\\hat{y} = 41 + 0.59x\n$$\n\nA \"hat\" on $y$ is used to signify that this is an estimate.\nWe can use this line to discuss properties of possums.\nFor instance, the equation predicts a possum with a total length of 80 cm will have a head length of\n\n$$\n\\hat{y} = 41 + 0.59 \\times 80 = 88.2\n$$\n\nThe estimate may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.2 mm.\nAbsent further information about an 80 cm possum, the prediction for head length that uses the average is a reasonable estimate.\n\nThere may be other variables that could help us predict the head length of a possum besides its length.\nPerhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region.\nPlot A in @fig-scattHeadLTotalL-sex-age shows the relationship between total length and head length of brushtail possums, 
taking into consideration their sex.\nMale possums (represented by blue triangles) seem to be larger in terms of total length and head length than female possums (represented by red circles).\nPlot B in @fig-scattHeadLTotalL-sex-age shows the same relationship, taking into consideration their age.\nIt's harder to tell if age changes the relationship between total length and head length for these possums.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between total length and head length of brushtail possums,\ntaking into consideration their sex (Plot A) or age (Plot B).\n](07-model-slr_files/figure-html/fig-scattHeadLTotalL-sex-age-1.png){#fig-scattHeadLTotalL-sex-age width=90%}\n:::\n:::\n\n\nIn @sec-model-mlr, we'll learn about how we can include more than one predictor in our model.\nBefore we get there, we first need to better understand how to best build a linear model with one predictor.\n\n### Residuals {#resids}\n\n**Residuals** are the leftover variation in the data after accounting for the model fit:\n\n$$\n\\text{Data} = \\text{Fit} + \\text{Residual}\n$$\n\nEach observation will have a residual, and three of the residuals for the linear model we fit for the possum data are shown in @fig-scattHeadLTotalLLine-highlighted.\nIf an observation is above the regression line, then its residual, the vertical distance from the observation to the line, is positive.\nObservations below the line have negative residuals.\nOne goal in picking the right linear model is for these residuals to be as small as possible.\n\n\n\n\n\n@fig-scattHeadLTotalLLine-highlighted is almost a replica of @fig-scattHeadLTotalLLine, with three points from the data highlighted.\nThe observation marked by a red circle has a small, negative residual of about -1; the observation marked by a gray diamond has a large positive residual of about +7; and the observation marked by a pink triangle has a moderate negative residual of about -4.\nThe size of a residual is usually discussed in terms of its absolute value.\nFor example, the residual for the observation marked by a pink triangle is larger than that of the observation marked by a red circle because $|-4|$ is larger than $|-1|.$\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A reasonable linear model was fit to represent the relationship between\nhead length and total length, with three points highlighted.](07-model-slr_files/figure-html/fig-scattHeadLTotalLLine-highlighted-1.png){#fig-scattHeadLTotalLLine-highlighted width=90%}\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Residual: Difference between observed and expected.**\n\nThe residual of the $i^{th}$ observation $(x_i, y_i)$ is the difference of the observed outcome ($y_i$) and the outcome we would predict based on the model fit ($\\hat{y}_i$):\n\n$$\ne_i = y_i - \\hat{y}_i\n$$\n\nWe typically identify $\\hat{y}_i$ by plugging $x_i$ into the model.\n:::\n\n::: {.workedexample data-latex=\"\"}\nThe linear fit shown in @fig-scattHeadLTotalLLine-highlighted is given as $\\hat{y} = 41 + 0.59x.$ Based on this line, formally compute the residual of the observation $(76.0, 85.1).$ This observation is marked by a red circle in @fig-scattHeadLTotalLLine-highlighted.\nCheck it against the earlier visual estimate, -1.\n\n------------------------------------------------------------------------\n\nWe first compute the predicted value of the observation marked by a red circle based on the model:\n\n$$\n\\hat{y} = 41+0.59x = 41+0.59\\times 76.0 = 85.84\n$$\n\nNext we compute the difference of the 
actual head length and the predicted head length:\n\n$$\ne = y - \\hat{y} = 85.1 - 85.84 = -0.74\n$$\n\nThe model's error is $e = -0.74$ mm, which is very close to the visual estimate of -1 mm.\nThe negative residual indicates that the linear model overpredicted head length for this particular possum.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf a model underestimates an observation, will the residual be positive or negative?\nWhat about if it overestimates the observation?[^07-model-slr-1]\n:::\n\n[^07-model-slr-1]: If a model underestimates an observation, then the model estimate is below the actual.\n The residual, which is the actual observation value minus the model estimate, must then be positive.\n The opposite is true when the model overestimates the observation: the residual is negative.\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the residuals for the observation marked by a gray diamond, $(85.0, 98.6),$ and the observation marked by a pink triangle, $(95.5, 94.0),$ in the figure using the linear relationship $\\hat{y} = 41 + 0.59x.$[^07-model-slr-2]\n:::\n\n[^07-model-slr-2]: Gray diamond: $\\hat{y} = 41+0.59x = 41+0.59\\times 85.0 = 91.15 \\rightarrow e = y - \\hat{y} = 98.6-91.15=7.45.$ This is close to the earlier estimate of 7.\n Pink triangle: $\\hat{y} = 41+0.59x = 97.3 \\rightarrow e = -3.3.$ This is also close to the estimate of -4.\n\nResiduals are helpful in evaluating how well a linear model fits a dataset.\nWe often display them in a scatterplot such as the one shown in @fig-scattHeadLTotalLResidualPlot for the regression line in @fig-scattHeadLTotalLLine-highlighted.\nThe residuals are plotted with their predicted outcome variable value as the horizontal coordinate, and the vertical coordinate as the residual.\nFor instance, the point $(85.0, 98.6)$ (marked by the gray diamond) had a predicted value of 91.15 mm and a residual of 7.45 mm, so in the residual plot it is placed at $(91.15, 7.45).$ Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal, as indicated by the dashed line.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Residual plot for the model predicting head length from total length for\nbrushtail possums.](07-model-slr_files/figure-html/fig-scattHeadLTotalLResidualPlot-1.png){#fig-scattHeadLTotalLResidualPlot width=90%}\n:::\n:::\n\n\n
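If you are following along in R, a residual plot like the one in @fig-scattHeadLTotalLResidualPlot can be sketched in a few lines.\nThe code below is a minimal sketch that assumes the `possum` data from the **openintro** package, with total length stored in a column called `total_l` and head length in `head_l`.\n\n```r\nlibrary(openintro)\n\n# predicted head lengths from the line y-hat = 41 + 0.59 x\npred <- 41 + 0.59 * possum$total_l\n\n# residuals: observed head length minus predicted head length\nres <- possum$head_l - pred\n\n# residual plot: predictions on the x-axis, residuals on the y-axis\nplot(pred, res, xlab = \"Predicted head length (mm)\", ylab = \"Residual (mm)\")\nabline(h = 0, lty = 2)  # dashed reference line at zero\n```\n\n\n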
::: {.workedexample data-latex=\"\"}\nOne purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model.\n@fig-sampleLinesAndResPlots shows three scatterplots with linear models in the first row and residual plots in the second row.\nCan you identify any patterns remaining in the residuals?\n\n------------------------------------------------------------------------\n\nIn the first dataset (first column), the residuals show no obvious patterns.\nThe residuals appear to be scattered randomly around the dashed line that represents 0.\n\nThe second dataset shows a pattern in the residuals.\nThere is some curvature in the scatterplot, which is more obvious in the residual plot.\nWe should not use a straight line to model these data.\nInstead, a more advanced technique should be used to model the curved relationship, such as the variable transformations discussed in @sec-transforming-data.\n\nThe last plot shows very little upward trend, and the residuals also show no obvious patterns.\nIt is reasonable to try to fit a linear model to the data.\nHowever, it is unclear whether there is evidence that the slope parameter is different from zero.\nThe point estimate of the slope parameter, labeled $b_1,$ is not zero, but we might wonder if this could just be due to chance.\nWe will address this sort of scenario in @sec-inf-model-slr.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample data with their best fitting lines (top row) and their corresponding\nresidual plots (bottom row).](07-model-slr_files/figure-html/fig-sampleLinesAndResPlots-1.png){#fig-sampleLinesAndResPlots width=90%}\n:::\n:::\n\n\n\\clearpage\n\n### Describing linear relationships with correlation\n\nWe've seen plots with strong linear relationships and others with very weak linear relationships.\nIt would be useful if we could quantify the strength of these linear relationships with a statistic.\n\n::: {.important data-latex=\"\"}\n**Correlation: strength of a linear relationship.**\n\n**Correlation**, which always takes values between -1 and 1, describes the strength and direction of the linear relationship between two variables.\nWe denote the correlation by $r.$\n\nThe correlation value has no units and will not be affected by a linear change in the units (e.g., going from inches to centimeters).\n:::\n\n\n\n\n\nWe can compute the correlation using a formula, just as we did with the sample mean and standard deviation.\nThe formula for correlation, however, is rather complex[^07-model-slr-3], and like with other statistics, we generally perform the calculations on a computer or calculator.\n\n[^07-model-slr-3]: Formally, we can compute the correlation for observations $(x_1, y_1),$ $(x_2, y_2),$ ..., $(x_n, y_n)$ using the formula\n\n$$\nr = \\frac{1}{n-1} \\sum_{i=1}^{n} \\frac{x_i-\\bar{x}}{s_x}\\frac{y_i-\\bar{y}}{s_y}\n$$\n\nwhere $\\bar{x},$ $\\bar{y},$ $s_x,$ and $s_y$ are the sample means and standard deviations for each variable.\n\n@fig-posNegCorPlots shows eight plots and their corresponding correlations.\nOnly when the relationship is perfectly linear is the correlation either -1 or 1.\nIf the relationship is strong and positive, the correlation will be near +1.\nIf it is strong and negative, it will be near -1.\nIf there is no apparent linear relationship between the variables, then the correlation will be near zero.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample scatterplots and their correlations. The first row shows variables\nwith a positive relationship, represented by the trend up and to the right.\nThe second row shows variables with a negative trend, where a large value\nin one variable is associated with a lower value in the other.\n](07-model-slr_files/figure-html/fig-posNegCorPlots-1.png){#fig-posNegCorPlots width=100%}\n:::\n:::\n\n\n
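In practice, we rarely apply the formula by hand; a single function call does the work.\nThe code below is a small sketch (again assuming the **openintro** `possum` data with columns `total_l` and `head_l`) showing that the built-in `cor()` function and the footnote formula give the same value.\n\n```r\nlibrary(openintro)\n\nx <- possum$total_l\ny <- possum$head_l\n\n# built-in correlation\ncor(x, y)\n\n# the same value, computed directly from the formula in the footnote\nsum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (length(x) - 1)\n```\n\n\n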
The correlation is intended to quantify the strength of a linear trend.\nNonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in @fig-corForNonLinearPlots.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample scatterplots and their correlations. In each case, there is a strong\nrelationship between the variables. However, because the relationship is\nnot linear, the correlation is relatively weak.\n](07-model-slr_files/figure-html/fig-corForNonLinearPlots-1.png){#fig-corForNonLinearPlots width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nNo straight line is a good fit for any of the datasets represented in @fig-corForNonLinearPlots.\nTry drawing nonlinear curves on each plot.\nOnce you create a curve for each, describe what is important in your fit.[^07-model-slr-4]\n:::\n\n[^07-model-slr-4]: We'll leave it to you to draw the lines.\n In general, the lines you draw should be close to most points and reflect overall trends in the data.\n\n::: {.workedexample data-latex=\"\"}\n@fig-crop-yields-af displays the relationships between various crop yields across countries.\nIn the plots, each point represents a different country.\nThe x and y variables represent the proportion of total yield in the last 50 years that is due to that crop type.\n\nOrder the six scatterplots from strongest negative to strongest positive linear relationship.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationships between various crop yields in over 200 countries.\n](07-model-slr_files/figure-html/fig-crop-yields-af-1.png){#fig-crop-yields-af width=100%}\n:::\n:::\n\n\n------------------------------------------------------------------------\n\nThe order of most negative correlation to most positive correlation is:\n\n$$\nA \\rightarrow C \\rightarrow B \\rightarrow F \\rightarrow D \\rightarrow E\n$$\n\n- Plot A - bananas vs. potatoes: -0.62\n- Plot B - cassava vs. soybeans: -0.21\n- Plot C - cassava vs. maize: -0.26\n- Plot D - cocoa vs. bananas: 0.22\n- Plot E - peas vs. barley: 0.31\n- Plot F - wheat vs. barley: 0.21\n:::\n\nOne important aspect of the correlation is that it's *unitless*.\nThat is, unlike a measurement of the slope of a line (see the next section) which provides an increase in the y-coordinate for a one unit increase in the x-coordinate (in units of the x and y variable), there are no units associated with the correlation of x and y.\n@fig-bdims-units shows the relationship between weights and heights of 507 physically active individuals.\nIn Plot A, weight is measured in kilograms (kg) and height in centimeters (cm).\nIn Plot B, weight has been converted to pounds (lbs) and height to inches (in).\nThe correlation coefficient ($r = 0.72$) is also noted on both plots.\nWe can see that the shape of the relationship has not changed, and neither has the correlation coefficient.\nThe only visual change to the plot is the axis *labeling*.\n\n
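This invariance is easy to verify with software.\nA minimal sketch, assuming the **openintro** `bdims` data with weight stored in `wgt` (kg) and height in `hgt` (cm):\n\n```r\nlibrary(openintro)\n\n# correlation with weight in kilograms and height in centimeters\ncor(bdims$wgt, bdims$hgt)\n\n# converting the units is a linear change, so the correlation is unchanged\ncor(bdims$wgt * 2.205, bdims$hgt / 2.54)  # pounds and inches\n```\n\n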
\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots, both displaying the relationship between weights and\nheights of 507 physically healthy adults. In Plot A, the units are\nkilograms and centimeters. In Plot B, the units are pounds and inches.\nAlso noted on both plots is the correlation coefficient, $r = 0.72$.\n](07-model-slr_files/figure-html/fig-bdims-units-1.png){#fig-bdims-units width=90%}\n:::\n:::\n\n\n## Least squares regression {#sec-least-squares-regression}\n\nFitting linear models by eye is open to criticism since it is based on an individual's preference.\nIn this section, we use *least squares regression* as a more rigorous approach to fitting a line to a scatterplot.\n\n### Gift aid for freshman at Elmhurst College\n\nThis section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois.\nGift aid is financial aid that does not need to be paid back, as opposed to a loan.\nA scatterplot of these data is shown in @fig-elmhurstScatterWLine along with a linear fit.\nThe line follows a negative trend in the data; students with higher family incomes tended to receive lower gift aid from the university.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Gift aid and family income for a random sample of 50 freshman students from\nElmhurst College.](07-model-slr_files/figure-html/fig-elmhurstScatterWLine-1.png){#fig-elmhurstScatterWLine width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIs the correlation positive or negative in @fig-elmhurstScatterWLine?[^07-model-slr-5]\n:::\n\n[^07-model-slr-5]: Larger family incomes are associated with lower amounts of aid, so the correlation will be negative.\n Using a computer, the correlation can be computed: -0.499.\n\n### An objective measure for finding the best line\n\nWe begin by thinking about what we mean by the \"best\" line.\nMathematically, we want a line that has small residuals.\nBut beyond the mathematical reasons, hopefully it also makes sense intuitively that whatever line we fit, the residuals should be small (i.e., the points should be close to the line).\nThe first option that may come to mind is to minimize the sum of the residual magnitudes:\n\n$$\n|e_1| + |e_2| + \\dots + |e_n|\n$$\n\nwhich we could accomplish with a computer program.\nThe resulting dashed line shown in @fig-elmhurstScatterW2Lines demonstrates this fit can be quite reasonable.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Gift aid and family income for a random sample of 50 freshman students from\nElmhurst College. The dashed line represents the line that minimizes the sum of\nthe absolute value of residuals, and the solid line represents the line that minimizes\nthe sum of squared residuals, i.e., the least squares line.](07-model-slr_files/figure-html/fig-elmhurstScatterW2Lines-1.png){#fig-elmhurstScatterW2Lines width=90%}\n:::\n:::\n\n\nHowever, a more common practice is to choose the line that minimizes the sum of the squared residuals:\n\n$$\ne_{1}^2 + e_{2}^2 + \\dots + e_{n}^2\n$$\n\nThe line that minimizes this least squares criterion is represented as the solid line in @fig-elmhurstScatterW2Lines and is commonly called the **least squares line**.\nThe following are four possible reasons to choose the least squares option instead of trying to minimize the sum of residual magnitudes without any squaring:\n\n\n\n\n\n1. It is the most commonly used method.\n2. Computing the least squares line is widely supported in statistical software.\n3. In many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2.\n
Squaring the residuals accounts for this discrepancy.\n4. The analyses which link the model to inference about a population are most straightforward when the line is fit through least squares.\n\nThe first two reasons are largely for tradition and convenience; the third and fourth reasons explain why the least squares criterion is typically most helpful when working with real data.[^07-model-slr-6]\n\n[^07-model-slr-6]: There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other criteria we might consider.\n However, this book only applies the least squares criterion.\n\n### Finding and interpreting the least squares line\n\nFor the Elmhurst data, we could write the equation of the least squares regression line as\n\n$$\n\\widehat{\\texttt{aid}} = \\beta_0 + \\beta_{1}\\times \\texttt{family_income}\n$$\n\nHere the equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst.\nThese two values, $\\beta_0$ and $\\beta_1,$ are the parameters of the regression line.\n\nThe parameters are estimated using the observed data.\nIn practice, this estimation is done using a computer in the same way that other estimates, like a sample mean, can be estimated using a computer or calculator.\n\nThe dataset where these data are stored is called `elmhurst`.\nThe first 5 rows of this dataset are given in @tbl-elmhurst-data.\n\n\n::: {#tbl-elmhurst-data .cell tbl-cap='First five rows of the `elmhurst` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
family_income gift_aid price_paid
92.92 21.7 14.28
0.25 27.5 8.53
53.09 27.8 14.25
50.20 27.2 8.78
137.61 18.0 24.00
\n\n`````\n:::\n:::\n\n\nWe can see that family income is recorded in a variable called `family_income` and gift aid from the university is recorded in a variable called `gift_aid`.\nFor now, we won't worry about the `price_paid` variable.\nWe should also note that these data are from the 2011-2012 academic year, and all monetary amounts are given in \\$1,000s, i.e., the family income of the first student in the data shown in @tbl-elmhurst-data is \\$92,920 and they received gift aid of \\$21,700.\n(The data source states that all numbers have been rounded to the nearest whole dollar.)\n\nStatistical software is usually used to compute the least squares line, and typical output from fitting a regression model looks like the one shown in @tbl-rOutputForIncomeAidLSRLine.\nFor now we will focus on the first column of the output, which lists ${b}_0$ and ${b}_1.$ In @sec-inf-model-slr we will dive deeper into the remaining columns, which describe how accurately and precisely the intercept and slope calculated from this sample of 50 students estimate the population intercept and slope for *all* students.\n\n\n::: {.cell}\n\n:::\n\n::: {#tbl-rOutputForIncomeAidLSRLine .cell tbl-cap='Summary of least squares fit for the Elmhurst data.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
term estimate std.error statistic p.value
(Intercept) 24.32 1.29 18.83 <0.0001
family_income -0.04 0.01 -3.98 2e-04
\n\n`````\n:::\n:::\n\n\nThe model output tells us that the intercept is approximately 24.319 and the slope on `family_income` is approximately -0.043.\n\nBut what do these values mean?\nInterpreting parameters in a regression model is often one of the most important steps in the analysis.\n\n::: {.workedexample data-latex=\"\"}\nThe intercept and slope estimates for the Elmhurst data are $b_0$ = 24.319 and $b_1$ = -0.043.\nWhat do these numbers really mean?\n\n------------------------------------------------------------------------\n\nInterpreting the slope parameter is helpful in almost any application.\nFor each additional \\$1,000 of family income, we would expect a student to receive a net difference of 1,000 $\\times$ (-0.0431) = -\\$43.10 in aid on average, i.e., \\$43.10 *less*.\nNote that a higher family income corresponds to less aid because the coefficient of family income is negative in the model.\nWe must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational.\nThat is, increasing a particular student's family income may not cause the student's aid to drop.\n(Although it would be reasonable to contact the college and ask if the relationship is causal, i.e., if Elmhurst College's aid decisions are partially based on students' family income.)\n\nThe estimated intercept $b_0$ = 24.319 describes the average aid if a student's family had no income, \\$24,319.\nThe meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \\$0.\nIn other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero.\n:::\n\n::: {.important data-latex=\"\"}\n**Interpreting parameters estimated by least squares.**\n\nThe slope describes the estimated difference in the predicted average outcome of $y$ if the predictor variable $x$ happened to be one unit larger.\nThe intercept describes the average outcome of $y$ if $x = 0$ *and* the linear model is valid all the way to $x = 0$ (values of $x = 0$ are not observed or relevant in many applications).\n:::\n\nIf you would like to learn more about using R to fit linear models, see @sec-model-tutorials for the interactive R tutorials.\n
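\nAs a quick preview, output like that in @tbl-rOutputForIncomeAidLSRLine can be produced with a short sketch along the following lines, assuming the `elmhurst` data from the **openintro** package and the **broom** package for tidying the fit.\n\n```r\nlibrary(openintro)\nlibrary(broom)\n\n# least squares fit of gift aid on family income (both in $1,000s)\nfit <- lm(gift_aid ~ family_income, data = elmhurst)\n\n# one row per term: estimate, std.error, statistic, and p.value\ntidy(fit)\n```\n\n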
An alternative way of calculating the intercept and slope of a least squares line is manual calculation using formulas.\nWhile manual calculations are not commonly used by practicing statisticians and data scientists, it is useful to work through them the first time you're learning about the least squares line and modeling in general.\nCalculating the values by hand leverages two properties of the least squares line:\n\n1. The slope of the least squares line can be estimated by\n\n$$\nb_1 = \\frac{s_y}{s_x} r\n$$\n\nwhere $r$ is the correlation between the two variables, and $s_x$ and $s_y$ are the sample standard deviations of the predictor and outcome, respectively.\n\n2. If $\\bar{x}$ is the sample mean of the predictor variable and $\\bar{y}$ is the sample mean of the outcome variable, then the point $(\\bar{x}, \\bar{y})$ falls on the least squares line.\n\n@tbl-summaryStatsElmhurstRegr shows the sample means for the family income and gift aid as \\$101,780 and \\$19,940, respectively.\nWe could plot the point $(102, 19.9)$ on @fig-elmhurstScatterWLine to verify it falls on the least squares line (the solid line).\n\n\n::: {#tbl-summaryStatsElmhurstRegr .cell tbl-cap='Summary statistics for family income and gift aid.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n\n
Family income, x
Gift aid, y
mean sd mean sd r
102 63.2 19.9 5.46 -0.499
\n\n`````\n:::\n:::\n\n\nNext, we formally find the point estimates $b_0$ and $b_1$ of the parameters $\\beta_0$ and $\\beta_1.$\n\n::: {.workedexample data-latex=\"\"}\nUsing the summary statistics in @tbl-summaryStatsElmhurstRegr, compute the slope for the regression line of gift aid against family income.\n\n------------------------------------------------------------------------\n\nCompute the slope using the summary statistics from @tbl-summaryStatsElmhurstRegr:\n\n$$\nb_1 = \\frac{s_y}{s_x} r = \\frac{5.46}{63.2}(-0.499) = -0.0431\n$$\n:::\n\nYou might recall the form of a line from math class, which we can use to find the model fit, including the estimate of $b_0.$ Given the slope of a line and a point on the line, $(x_0, y_0),$ the equation for the line can be written as\n\n$$\ny - y_0 = \\text{slope} \\times (x - x_0)\n$$\n\n::: {.important data-latex=\"\"}\n**Identifying the least squares line from summary statistics.**\n\nTo identify the least squares line from summary statistics:\n\n- Estimate the slope parameter, $b_1 = (s_y / s_x) r.$\n- Noting that the point $(\\bar{x}, \\bar{y})$ is on the least squares line, use $x_0 = \\bar{x}$ and $y_0 = \\bar{y}$ with the point-slope equation: $y - \\bar{y} = b_1 (x - \\bar{x}).$\n- Simplifying the equation, we get $y = \\bar{y} - b_1 \\bar{x} + b_1 x,$ which reveals that $b_0 = \\bar{y} - b_1 \\bar{x}.$\n:::\n\n::: {.workedexample data-latex=\"\"}\nUsing the point (102, 19.9) from the sample means and the slope estimate $b_1 = -0.0431,$ find the least squares line for predicting aid based on family income.\n\n------------------------------------------------------------------------\n\nApply the point-slope equation using $(102, 19.9)$ and the slope $b_1 = -0.0431$:\n\n$$\n\\begin{aligned}\ny - y_0 &= b_1 (x - x_0) \\\\\ny - 19.9 &= -0.0431 (x - 102)\n\\end{aligned}\n$$\n\nExpanding the right side and then adding 19.9 to each side, the equation simplifies:\n\n$$\n\\widehat{\\texttt{aid}} = 24.3 - 0.0431 \\times \\texttt{family_income}\n$$\n\nHere we have replaced $y$ with $\\widehat{\\texttt{aid}}$ and $x$ with $\\texttt{family_income}$ to put the equation in context.\nThe final least squares equation should always include a \"hat\" on the variable being predicted, whether it is a generic $``y\"$ or a named variable like $``aid\"$.\n:::\n\n::: {.workedexample data-latex=\"\"}\nSuppose a high school senior is considering Elmhurst College.\nCan they simply use the linear equation that we have estimated to calculate their financial aid from the university?\n\n------------------------------------------------------------------------\n\nThey may use it as an estimate, though some qualifiers on this approach are important.\nFirst, the data all come from one freshman class, and the way aid is determined by the university may change from year to year.\nSecond, the equation will provide an imperfect estimate.\nWhile the linear equation is good at capturing the trend in the data, no individual student's aid will be perfectly predicted (as can be seen from the individual data points in the cloud around the line).\n:::\n\n### Extrapolation is treacherous\n\n> *When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen.\n
Consider this: On February 6 it was 10 degrees.* *Today it hit almost 80.* *At this rate, by August it will be 220 degrees.* *So clearly folks the climate debate rages on.*[^07-model-slr-7]\n>\n> Stephen Colbert April 6th, 2010\n\n[^07-model-slr-7]: \n\nLinear models can be used to approximate the relationship between two variables.\nHowever, like any model, they have real limitations.\nLinear regression is simply a modeling framework.\nThe truth is almost always much more complex than a simple line.\nFor example, we do not know how the data outside of our limited window will behave.\n\n::: {.workedexample data-latex=\"\"}\nUse the model $\\widehat{\\texttt{aid}} = 24.3 - 0.0431 \\times \\texttt{family_income}$ to estimate the aid of another freshman student whose family had income of \\$1 million.\n\n------------------------------------------------------------------------\n\nWe want to calculate the aid for a family with \\$1 million income.\nNote that in our model this will be represented as 1,000 since the data are in \\$1,000s.\n\n$$\n24.3 - 0.0431 \\times 1000 = -18.8\n$$\n\nThe model predicts this student will have -\\$18,800 in aid (!).\nHowever, Elmhurst College does not offer *negative aid* where they select some students to pay extra on top of tuition to attend.\n:::\n\nApplying a model estimate to values outside of the realm of the original data is called **extrapolation**.\nGenerally, a linear model is only an approximation of the real relationship between two variables.\nIf we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.\n\n\n\n\n\n### Describing the strength of a fit {#r-squared}\n\nWe evaluated the strength of the linear relationship between two variables earlier using the correlation, $r.$ However, it is more common to explain the strength of a linear fit using $R^2,$ called **R-squared**.\nIf provided with a linear model, we might like to describe how closely the data cluster around the linear fit.\n\n\n\n\n\nThe $R^2$ of a linear model describes the amount of variation in the outcome variable that is explained by the least squares line.\nFor example, consider the Elmhurst data, shown in @fig-elmhurstScatterWLine).\nThe variance of the outcome variable, aid received, is about $s_{aid}^2 \\approx 29.8$ million (calculated from the data, some of which is shown in @tbl-elmhurst-data).\nHowever, if we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student's family income.\nThe variability in the residuals describes how much variation remains after using the model: $s_{_{RES}}^2 \\approx 22.4$ million.\nIn short, there was a reduction of\n\n$$\n\\frac{s_{aid}^2 - s_{_{RES}}^2}{s_{aid}^2}\n = \\frac{29800 - 22400}{29800}\n = \\frac{7500}{29800}\n \\approx 0.25,\n$$\n\nor about 25%, of the outcome variable's variation by using information about family income for predicting aid using a linear model.\nIt turns out that $R^2$ corresponds exactly to the squared value of the correlation:\n\n$$\nr = -0.499 \\rightarrow R^2 = 0.25\n$$\n\n::: {.guidedpractice data-latex=\"\"}\nIf a linear model has a very strong negative relationship with a correlation of -0.97, how much of the variation in the outcome is explained by the predictor?[^07-model-slr-8]\n:::\n\n[^07-model-slr-8]: About $R^2 = (-0.97)^2 = 0.94$ or 94% of the variation in the outcome variable is explained by the linear model.\n\n$R^2$ is also called the **coefficient of 
determination**.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Coefficient of determination: proportion of variability in the outcome variable explained by the model.**\n\nSince $r$ is always between -1 and 1, $R^2$ will always be between 0 and 1.\nThis statistic is called the **coefficient of determination**, and it measures the proportion of variation in the outcome variable, $y,$ that can be explained by the linear model with predictor $x.$\n:::\n\nMore generally, $R^2$ can be calculated as a ratio of a measure of variability around the line divided by a measure of total variability.\n\n::: {.important data-latex=\"\"}\n**Sums of squares to measure variability in** $y.$\n\nWe can measure the variability in the $y$ values by how far they tend to fall from their mean, $\\bar{y}.$ We define this value as the **total sum of squares**, calculated using the formula below, where $y_i$ represents each $y$ value in the sample, and $\\bar{y}$ represents the mean of the $y$ values in the sample.\n\n$$\nSST = (y_1 - \\bar{y})^2 + (y_2 - \\bar{y})^2 + \\cdots + (y_n - \\bar{y})^2.\n$$\n\nLeft-over variability in the $y$ values if we know $x$ can be measured by the **sum of squared errors**, or sum of squared residuals, calculated using the formula below, where $\\hat{y}_i$ represents the predicted value of $y_i$ based on the least squares regression.[^07-model-slr-9],\n\n$$\n\\begin{aligned}\nSSE &= (y_1 - \\hat{y}_1)^2 + (y_2 - \\hat{y}_2)^2 + \\cdots + (y_n - \\hat{y}_n)^2\\\\\n&= e_{1}^2 + e_{2}^2 + \\dots + e_{n}^2\n\\end{aligned}\n$$\n\nThe coefficient of determination can then be calculated as\n\n$$\nR^2 = \\frac{SST - SSE}{SST} = 1 - \\frac{SSE}{SST}\n$$\n:::\n\n[^07-model-slr-9]: The difference $SST - SSE$ is called the **regression sum of squares**, $SSR,$ and can also be calculated as $SSR = (\\hat{y}_1 - \\bar{y})^2 + (\\hat{y}_2 - \\bar{y})^2 + \\cdots + (\\hat{y}_n - \\bar{y})^2.$ $SSR$ represents the variation in $y$ that was accounted for in our model.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nAmong 50 students in the `elmhurst` dataset, the total variability in gift aid is $SST = 1461$.[^07-model-slr-10]\nThe sum of squared residuals is $SSE = 1098.$ Find $R^2.$\n\n------------------------------------------------------------------------\n\nSince we know $SSE$ and $SST,$ we can calculate $R^2$ as\n\n$$\nR^2 = 1 - \\frac{SSE}{SST} = 1 - \\frac{1098}{1461} = 0.25,\n$$\n\nthe same value we found when we squared the correlation: $R^2 = (-0.499)^2 = 0.25.$\n:::\n\n[^07-model-slr-10]: $SST$ can be calculated by finding the sample variance of the outcome variable, $s^2$ and multiplying by $n-1.$\n\n### Categorical predictors with two levels {#categorical-predictor-two-levels}\n\nCategorical variables are also useful in predicting outcomes.\nHere we consider a categorical predictor with two levels (recall that a *level* is the same as a *category*).\nWe'll consider Ebay auctions for a video game, *Mario Kart* for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded.\nHere we want to predict total price based on game condition, which takes values `used` and `new`.\n\n::: {.data data-latex=\"\"}\nThe [`mariokart`](http://openintrostat.github.io/openintro/reference/mariokart.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n\n:::\n\n\nA plot of the auction data is shown in @fig-marioKartNewUsed.\nNote that the original dataset contains some Mario Kart games being 
sold at prices above \\$100 but for this analysis we have limited our focus to the 141 Mario Kart games that were sold below \\$100.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total auction prices for the video game Mario Kart, divided into used\n($x = 0$) and new ($x = 1$) condition games. The least squares regression\nline is also shown.](07-model-slr_files/figure-html/fig-marioKartNewUsed-1.png){#fig-marioKartNewUsed width=90%}\n:::\n:::\n\n\nTo incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form.\nWe will do so using an **indicator variable** called `condnew`, which takes value 1 when the game is new and 0 when the game is used.\nUsing this indicator variable, the linear model may be written as\n\n$$\n\\widehat{\\texttt{price}} = b_0 + b_1 \\times \\texttt{condnew}\n$$\n\nThe parameter estimates are given in @tbl-marioKartNewUsedRegrSummary.\n\n\n\n\n::: {#tbl-marioKartNewUsedRegrSummary .cell tbl-cap='Least squares regression summary for the final auction price against the\ncondition of the game.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
term estimate std.error statistic p.value
(Intercept) 42.9 0.81 52.67 <0.0001
condnew 10.9 1.26 8.66 <0.0001
\n\n`````\n:::\n:::\n\n\nUsing values from @tbl-marioKartNewUsedRegrSummary, the model equation can be summarized as\n\n$$\n\\widehat{\\texttt{price}} = 42.87 + 10.9 \\times \\texttt{condnew}\n$$\n\n::: {.workedexample data-latex=\"\"}\nInterpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.\n\n------------------------------------------------------------------------\n\nThe intercept is the estimated price when `condnew` has a value of 0, i.e., when the game is in used condition.\nThat is, the average selling price of a used version of the game is \\$42.90.\nThe slope indicates that, on average, new games sell for about \\$10.90 more than used games.\n:::\n\n::: {.important data-latex=\"\"}\n**Interpreting model estimates for categorical predictors.**\n\nThe estimated intercept is the value of the outcome variable for the first category (i.e., the category corresponding to an indicator value of 0).\nThe estimated slope is the average change in the outcome variable between the two categories.\n:::\n\nNote that, fundamentally, the intercept and slope interpretations do not change when modeling categorical variables with two levels.\nHowever, when the predictor variable is binary, the coefficient estimates ($b_0$ and $b_1$) are directly interpretable with respect to the dataset at hand.\n\nWe'll elaborate further on modeling categorical predictors in @sec-model-mlr, where we examine the influence of many predictor variables simultaneously using multiple regression.\n
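\nThe sketch below shows how such a model might be fit in R, assuming the **openintro** `mariokart` data with the total auction price stored in `total_pr` and the game condition in `cond`.\n\n```r\nlibrary(openintro)\n\n# restrict to the 141 auctions that closed below $100, as described above\nmk <- subset(mariokart, total_pr < 100)\n\n# build the 0/1 indicator used in the text: 1 for new games, 0 for used games\nmk$condnew <- ifelse(mk$cond == \"new\", 1, 0)\n\nfit <- lm(total_pr ~ condnew, data = mk)\ncoef(fit)\n```\n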
\n## Outliers in linear regression {#outliers-in-regression}\n\nIn this section, we identify criteria for determining which outliers are important and influential.\nOutliers in regression are observations that fall far from the cloud of points.\nThese points are especially important because they can have a strong influence on the least squares line.\n\n::: {.workedexample data-latex=\"\"}\nThere are three plots shown in @fig-outlier-plots-1 along with the corresponding least squares line and residual plots.\nFor each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.\nRecall that an outlier is any point that does not appear to belong with the vast majority of the other points.\n\n------------------------------------------------------------------------\n\n- A: There is one outlier far from the other points, though it only appears to slightly influence the line.\n\n- B: There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.\n\n- C: There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud does not appear to fit very well.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three plots, each with a least squares line and corresponding residual plot.\nEach dataset has at least one outlier.\n](07-model-slr_files/figure-html/fig-outlier-plots-1-1.png){#fig-outlier-plots-1 width=100%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nThere are three plots shown in @fig-outlier-plots-2 along with the least squares line and residual plots.\nAs you did in the previous exercise, for each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.\nRecall that an outlier is any point that does not appear to belong with the vast majority of the other points.\n\n------------------------------------------------------------------------\n\n- D: There is a primary cloud and then a small secondary cloud of four outliers.\n The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere.\n There might be an interesting explanation for the dual clouds, which is something that could be investigated.\n\n- E: There is no obvious trend in the main cloud of points and the outlier on the right appears to largely (and problematically) control the slope of the least squares line.\n\n- F: There is one outlier far from the cloud.\n However, it falls quite close to the least squares line and does not appear to be very influential.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three plots, each with a least squares line and residual plot.\nAll datasets have at least one outlier.\n](07-model-slr_files/figure-html/fig-outlier-plots-2-1.png){#fig-outlier-plots-2 width=100%}\n:::\n:::\n\n\nExamine the residual plots in @fig-outlier-plots-1 and @fig-outlier-plots-2.\nIn Plots C, D, and E, you will probably find that there are a few observations which are both away from the remaining points along the x-axis and not in the trajectory of the trend in the rest of the data.\nIn these cases, the outliers influenced the slope of the least squares lines.\nIn Plot E, the bulk of the data show no clear trend, but if we fit a line to these data, we impose a trend where there isn't really one.\n\n::: {.important data-latex=\"\"}\n**Leverage.**\n\nPoints that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with **high leverage** or **leverage points**.\n:::\n\nPoints that fall horizontally far from the line are points of high leverage; these points can strongly influence the slope of the least squares line.\nIf one of these high leverage points does appear to actually invoke its influence on the slope of the line -- as in Plots C, D, and E of @fig-outlier-plots-1 and @fig-outlier-plots-2 -- then we call it an **influential point**.\nUsually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Types of outliers.**\n\nA point (or a group of points) that stands out from the rest of the data is called an outlier.\nOutliers that fall horizontally away from the center of the cloud of points are called leverage points.\nOutliers that influence the slope of the line are called influential points.\n:::\n\nIt is tempting to remove outliers.\nDon't do this without a very good reason.\nModels that ignore exceptional (and interesting) cases often perform poorly.\nFor instance, if a financial firm ignored the largest market swings -- the \"outliers\" -- it would soon go bankrupt by making poorly thought-out investments.\n\n\\clearpage\n\n## Chapter review {#chp7-review}\n\n### Summary\n\nThroughout this chapter, the nuances of the linear model have been described.\nYou have learned how to create a linear model with explanatory variables that are numerical (e.g., total possum length) and those that are categorical (e.g., whether a video game was new).\nThe residuals in a linear model are an important metric used to understand how well a model fits; high leverage points, influential points, and other types of outliers can impact the fit of a model.\nCorrelation is a measure of the strength and\n
direction of the linear relationship of two variables, without specifying which variable is the explanatory and which is the outcome.\nFuture chapters will focus on generalizing the linear model from the sample of data to claims about the population of interest.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
coefficient of determination influential point predictor
correlation least squares line R-squared
extrapolation leverage point residuals
high leverage outcome sum of squared error
indicator variable outlier total sum of squares
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp7-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-07].\n\n::: {.exercises data-latex=\"\"}\n1. **Visualize the residuals.** \nThe scatterplots shown below each have a superimposed regression line. If we were to construct a residual plot (residuals versus $x$) for each, describe in words what those plots would look like.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-38-1.png){width=90%}\n :::\n :::\n \n1. **Trends in the residuals.** \nShown below are two plots of residuals remaining after fitting a linear model to two different sets of data. \nFor each plot, describe important features and determine if a linear model would be appropriate for these data. \nExplain your reasoning.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n :::\n :::\n \n1. **Identify relationships, I.** \nFor each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-40-1.png){width=90%}\n :::\n :::\n\n1. **Identify relationships, II.** \nFor each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-41-1.png){width=90%}\n :::\n :::\n\n1. **Midterms and final.** \nThe two scatterplots below show the relationship between the overall course average and two midterm exams (Exam 1 and Exam 2) recorded for 233 students during several years for a statistics course at a university.^[The [`exam_grades`](http://openintrostat.github.io/openintro/reference/exam_grades.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-42-1.png){width=90%}\n :::\n :::\n\n a. Based on these graphs, which of the two exams has the strongest correlation with the course grade? Explain.\n\n b. Can you think of a reason why the correlation between the exam you chose in part (a) and the course grade is higher?\n \n \\clearpage\n\n1. **Partners' ages and heights.** \nThe Great Britain Office of Population Census and Surveys collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the partners. The scatterplot on the left shows the heights of the partners plotted against each other and the plot on the right shows the ages of the partners plotted against each other.^[The [`husbands_wives`](http://openintrostat.github.io/openintro/reference/husbands_wives.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-43-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between partners' ages.\n\n b. Describe the relationship between partners' heights.\n\n c. Which plot shows a stronger correlation? Explain your reasoning.\n\n d. Data on heights were originally collected in centimeters, and then converted to inches. 
Does this conversion affect the correlation between partners' heights?\n\n1. **Match the correlation, I.** \nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-44-1.png){width=90%}\n :::\n :::\n\n a. $r = -0.7$\n\n b. $r = 0.45$\n\n c. $r = 0.06$\n\n d. $r = 0.92$\n\n1. **Match the correlation, II.** \nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-45-1.png){width=90%}\n :::\n :::\n\n a. $r = 0.49$\n\n b. $r = -0.48$\n\n c. $r = -0.03$\n\n d. $r = -0.85$\n\n1. **Body measurements, correlation.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nThe scatterplot below shows the relationship between height and shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-46-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between shoulder girth and height.\n\n b. How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?\n\n1. **Compare correlations.** \nEduardo and Rosie are both collecting data on number of rainy days in a year and the total rainfall for the year. Eduardo records rainfall in inches and Rosie in centimeters. How will their correlation coefficients compare?\n\n1. **The Coast Starlight, correlation.** \nThe Coast Starlight Amtrak train runs from Seattle to Los Angeles. \nThe scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).^[The [`coast_starlight`](http://openintrostat.github.io/openintro/reference/coast_starlight.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-47-1.png){width=70%}\n :::\n :::\n\n a. Describe the relationship between distance and travel time.\n\n b. How would the relationship change if travel time was instead measured in hours, and distance was instead measured in kilometers?\n\n c. Correlation between travel time (in miles) and distance (in minutes) is $r = 0.636$. What is the correlation between travel time (in kilometers) and distance (in hours)?\n\n1. **Crawling babies, correlation.** \nA study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. 
\nInfants born during the study year were split into twelve groups, one for each birth month. \nWe consider the average crawling age of babies in each group against the average temperature when the babies are six months old (that's when babies often begin trying to crawl). \nTemperature is measured in degrees Fahrenheit (F) and age is measured in weeks.^[The [`babies_crawl`](http://openintrostat.github.io/openintro/reference/babies_crawl.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Benson:1993]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-48-1.png){width=70%}\n :::\n :::\n\n a. Describe the relationship between temperature and crawling age.\n\n b. How would the relationship change if temperature was measured in degrees Celsius (C) and age was measured in months?\n\n c. The correlation between temperature in F and age in weeks was $r=-0.70$. If we converted the temperature to C and age to months, what would the correlation be?\n\n1. **Partners' ages.** \nWhat would be the correlation between the ages of partners if people always dated others who are \n\n a. 3 years younger than themselves?\n\n b. 2 years older than themselves?\n\n c. half as old as themselves?\n\n1. **Graduate degrees and salaries.** \nWhat would be the correlation between the annual salaries of people with and without a graduate degree at a company if for a certain type of position someone with a graduate degree always made \n\n a. \\$5,000 more than those without a graduate degree?\n\n b. 25% more than those without a graduate degree?\n\n c. 15% less than those without a graduate degree?\n\n1. **Units of regression.** \nConsider a regression predicting the number of calories (cal) from width (cm) for a sample of square shaped chocolate brownies. What are the units of the correlation coefficient, the intercept, and the slope?\n\n1. **Which is higher?**\nDetermine if (I) or (II) is higher or if they are equal: *\"For a regression line, the uncertainty associated with the slope estimate, $b_1$, is higher when (I) there is a lot of scatter around the regression line or (II) there is very little scatter around the regression line.\"* Explain your reasoning.\n\n1. **Over-under, I.** \nSuppose we fit a regression line to predict the shelf life of an apple based on its weight. For a particular apple, we predict the shelf life to be 4.6 days. The apple's residual is -0.6 days. Did we over or under estimate the shelf-life of the apple? Explain your reasoning.\n\n1. **Over-under, II.** \nSuppose we fit a regression line to predict the number of incidents of skin cancer per 1,000 people from the number of sunny days in a year. \nFor a particular year, we predict the incidence of skin cancer to be 1.5 per 1,000 people, and the residual for this year is 0.5. \nDid we over or under estimate the incidence of skin cancer? Explain your reasoning.\n\n1. **Starbucks, calories, and protein.** \nThe scatterplot below shows the relationship between the number of calories and amount of protein (in grams) Starbucks food menu items contain. 
Since Starbucks only lists the number of calories on the display items, we might be interested in predicting the amount of protein a menu item has based on its calorie content.^[The [`starbucks`](http://openintrostat.github.io/openintro/reference/starbucks.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-49-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between number of calories and amount of protein (in grams) that Starbucks food menu items contain.\n\n b. In this scenario, what are the predictor and outcome variables?\n\n c. Why might we want to fit a regression line to these data?\n\n d. What does the residuals vs. predicted plot tell us about the variability in our prediction errors based on this model for items with lower vs. higher predicted protein?\n\n1. **Starbucks, calories, and carbs.** \nThe scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we might be interested in predicting the amount of carbs a menu item has based on its calorie content.^[The [`starbucks`](http://openintrostat.github.io/openintro/reference/starbucks.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-50-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.\n\n b. In this scenario, what are the predictor and outcome variables?\n\n c. Why might we want to fit a regression line to these data?\n\n d. What does the residuals vs. predicted plot tell us about the variability in our prediction errors based on this model for items with lower vs. higher predicted carbs?\n\n1. **The Coast Starlight, regression.** \nThe Coast Starlight Amtrak train runs from Seattle to Los Angeles. \nThe scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).\nThe mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. \nThe correlation between travel time and distance is 0.636.^[The [`coast_starlight`](http://openintrostat.github.io/openintro/reference/coast_starlight.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-51-1.png){width=90%}\n :::\n :::\n\n a. Write the equation of the regression line for predicting travel time.\n\n b. Interpret the slope and the intercept in this context.\n\n c. Calculate $R^2$ of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret $R^2$ in the context of the application.\n\n d. The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.\n\n e. 
It actually takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.\n\n f. Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?\n \n \\clearpage\n\n1. **Body measurements, regression.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nThe scatterplot below shows the relationship between height and shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.\nThe mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. \nThe mean height is 171.14 cm with a standard deviation of 9.41 cm. \nThe correlation between height and shoulder girth is 0.67.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-52-1.png){width=90%}\n :::\n :::\n\n a. Write the equation of the regression line for predicting height.\n\n b. Interpret the slope and the intercept in this context.\n\n c. Calculate $R^2$ of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.\n\n d. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.\n\n e. The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.\n\n f. A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?\n \n \\clearpage\n\n1. **Poverty and unemployment.** \nThe following scatterplot shows the relationship between percent of population below the poverty level (`poverty`) from unemployment rate among those ages 20-64 (`unemployment_rate`) in counties in the US, as provided by data from the 2019 American Community Survey. \nThe regression output for the model for predicting `poverty` from `unemployment_rate` is also provided.^[The [`\ncounty_2019`](http://openintrostat.github.io/usdata/reference/\ncounty_2019.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-53-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 4.60 0.349 13.2 <0.0001
unemployment_rate 2.05 0.062 33.1 <0.0001
\n \n `````\n :::\n :::\n\n a. Write out the linear model.\n\n b. Interpret the intercept.\n\n c. Interpret the slope.\n\n d. For this model $R^2$ is 46%. Interpret this value.\n\n e. Calculate the correlation coefficient.\n \n \\clearpage\n\n1. **Cats weights.** \nThe following regression output is for predicting the heart weight (`Hwt`, in g) of cats from their body weight (`Bwt`, in kg). The coefficients are estimated using a dataset of 144 domestic cats.^[The [`cats`](https://cran.r-project.org/web/packages/MASS/MASS.pdf) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-54-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.357 0.692 -0.515 0.6072
Bwt 4.034 0.250 16.119 <0.0001
\n    \n    `````\n    :::\n    :::\n\n    a. Write out the linear model.\n\n    b. Interpret the intercept.\n\n    c. Interpret the slope.\n\n    d. The $R^2$ of this model is 65%. Interpret $R^2$.\n\n    e. Calculate the correlation coefficient.\n\n1. **Outliers, I.** \nIdentify the outliers in the scatterplots shown below, and determine what type of outliers they are. Explain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-55-1.png){width=100%}\n    :::\n    :::\n    \n    \\clearpage\n\n1. **Outliers, II.** \nIdentify the outliers in the scatterplots shown below and determine what type of outliers they are. Explain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-56-1.png){width=100%}\n    :::\n    :::\n\n1. **Urban homeowners, outliers.** \nThe scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas. \nThere are 52 observations, each corresponding to a US state, the District of Columbia, or Puerto Rico.^[The [`urban_owner`](http://openintrostat.github.io/openintro/reference/urban_owner.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-57-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between the percent of families who own their home and the percent of the population living in urban areas.\n\n    b. The outlier at the bottom right corner is District of Columbia, where 100% of the population is considered urban. What type of outlier is this observation?\n    \n    \\pagebreak\n\n1. **Crawling babies, outliers.**\nA study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. \nThe plot below shows the relationship between average crawling age of babies born in each month and the average temperature in the month when the babies are six months old.\nThe plot reveals a potential outlying month when the average temperature is about 53°F and average crawling age is about 28.5 weeks. \nDoes this point have high leverage? Is it an influential point?^[The [`babies_crawl`](http://openintrostat.github.io/openintro/reference/babies_crawl.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Benson:1993]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-58-1.png){width=90%}\n    :::\n    :::\n\n1. **True / False.** \nDetermine if the following statements are true or false. \nIf false, explain why.\n\n    a. A correlation coefficient of -0.90 indicates a stronger linear relationship than a correlation of 0.5.\n\n    b. Correlation is a measure of the association between any two variables.\n\n1. **Cherry trees.** \nThe scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. 
\nThe diameter of the tree is measured 4.5 feet above the ground.^[The [`trees`](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/trees.html) data used in this exercise can be found in the [**datasets**](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-59-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between volume and height of these trees.\n\n b. Describe the relationship between volume and diameter of these trees.\n\n c. Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.\n\n1. **Match the correlation, III.**\nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-60-1.png){width=100%}\n :::\n :::\n \n a. r = 0.69\n\n b. r = 0.09\n\n c. r = -0.91\n\n d. r = 0.97\n\n1. **Helmets and lunches.** \nThe scatterplot shows the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (`lunch`) and the percentage of bike riders in the neighborhood wearing helmets (`helmet`). \nThe average percentage of children receiving reduced-fee lunches is 30.833% with a standard deviation of 26.724% and the average percentage of bike riders wearing helmets is 30.883% with a standard deviation of 16.948%.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-61-1.png){width=90%}\n :::\n :::\n\n a. If the $R^2$ for the least-squares regression line for these data is 72%, what is the correlation between `lunch` and `helmet`?\n\n b. Calculate the slope and intercept for the least-squares regression line for these data.\n\n c. Interpret the intercept of the least-squares regression line in the context of the application.\n\n d. Interpret the slope of the least-squares regression line in the context of the application.\n\n e. What would the value of the residual be for a neighborhood where 40% of the children receive reduced-fee lunches and 40% of the bike riders wear helmets? 
Interpret the meaning of this residual in the context of the application.\n\n\n:::\n", + "supporting": [ + "07-model-slr_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/07-model-slr/figure-html/fig-bdims-units-1.png b/_freeze/07-model-slr/figure-html/fig-bdims-units-1.png new file mode 100644 index 00000000..4ed2e781 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-bdims-units-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-corForNonLinearPlots-1.png b/_freeze/07-model-slr/figure-html/fig-corForNonLinearPlots-1.png new file mode 100644 index 00000000..92036c99 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-corForNonLinearPlots-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-crop-yields-af-1.png b/_freeze/07-model-slr/figure-html/fig-crop-yields-af-1.png new file mode 100644 index 00000000..52b2d463 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-crop-yields-af-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-elmhurstScatterW2Lines-1.png b/_freeze/07-model-slr/figure-html/fig-elmhurstScatterW2Lines-1.png new file mode 100644 index 00000000..ae803c34 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-elmhurstScatterW2Lines-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-elmhurstScatterWLine-1.png b/_freeze/07-model-slr/figure-html/fig-elmhurstScatterWLine-1.png new file mode 100644 index 00000000..8e82eaad Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-elmhurstScatterWLine-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-imperfLinearModel-1.png b/_freeze/07-model-slr/figure-html/fig-imperfLinearModel-1.png new file mode 100644 index 00000000..a83933e2 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-imperfLinearModel-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-marioKartNewUsed-1.png b/_freeze/07-model-slr/figure-html/fig-marioKartNewUsed-1.png new file mode 100644 index 00000000..923e8d3a Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-marioKartNewUsed-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-notGoodAtAllForALinearModel-1.png b/_freeze/07-model-slr/figure-html/fig-notGoodAtAllForALinearModel-1.png new file mode 100644 index 00000000..727ebe35 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-notGoodAtAllForALinearModel-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-outlier-plots-1-1.png b/_freeze/07-model-slr/figure-html/fig-outlier-plots-1-1.png new file mode 100644 index 00000000..f507a8b4 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-outlier-plots-1-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-outlier-plots-2-1.png b/_freeze/07-model-slr/figure-html/fig-outlier-plots-2-1.png new file mode 100644 index 00000000..a99c8148 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-outlier-plots-2-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-perfLinearModel-1.png b/_freeze/07-model-slr/figure-html/fig-perfLinearModel-1.png new file mode 100644 index 00000000..cab05603 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-perfLinearModel-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-posNegCorPlots-1.png 
b/_freeze/07-model-slr/figure-html/fig-posNegCorPlots-1.png new file mode 100644 index 00000000..3ab921e2 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-posNegCorPlots-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-sampleLinesAndResPlots-1.png b/_freeze/07-model-slr/figure-html/fig-sampleLinesAndResPlots-1.png new file mode 100644 index 00000000..e556ca36 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-sampleLinesAndResPlots-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalL-1.png b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalL-1.png new file mode 100644 index 00000000..ec120d27 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalL-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalL-sex-age-1.png b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalL-sex-age-1.png new file mode 100644 index 00000000..1dbd2181 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalL-sex-age-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLLine-1.png b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLLine-1.png new file mode 100644 index 00000000..1ce41e58 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLLine-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLLine-highlighted-1.png b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLLine-highlighted-1.png new file mode 100644 index 00000000..b6c5c83a Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLLine-highlighted-1.png differ diff --git a/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLResidualPlot-1.png b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLResidualPlot-1.png new file mode 100644 index 00000000..a2ec44e1 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/fig-scattHeadLTotalLResidualPlot-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-38-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-38-1.png new file mode 100644 index 00000000..55d89841 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-38-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-39-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 00000000..2b0b2126 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-39-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-40-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-40-1.png new file mode 100644 index 00000000..970489a4 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-40-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-41-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-41-1.png new file mode 100644 index 00000000..dbed84c3 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-41-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-42-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-42-1.png new file mode 100644 index 00000000..80f0e724 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-42-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-43-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-43-1.png new file mode 100644 index 00000000..377653de Binary files /dev/null and 
b/_freeze/07-model-slr/figure-html/unnamed-chunk-43-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-44-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-44-1.png new file mode 100644 index 00000000..94cb5f53 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-44-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-45-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-45-1.png new file mode 100644 index 00000000..52bcf6d7 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-45-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-46-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-46-1.png new file mode 100644 index 00000000..b0e92249 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-46-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-47-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-47-1.png new file mode 100644 index 00000000..23381b6e Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-47-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-48-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-48-1.png new file mode 100644 index 00000000..cc651ac2 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-48-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-49-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-49-1.png new file mode 100644 index 00000000..f1c1fc6c Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-49-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-50-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-50-1.png new file mode 100644 index 00000000..ff3517c9 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-50-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-51-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-51-1.png new file mode 100644 index 00000000..23381b6e Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-51-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-52-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-52-1.png new file mode 100644 index 00000000..b0e92249 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-52-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-53-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-53-1.png new file mode 100644 index 00000000..0f8eeaaf Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-53-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-54-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-54-1.png new file mode 100644 index 00000000..cf50d4c2 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-54-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-55-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-55-1.png new file mode 100644 index 00000000..96cc8043 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-55-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-56-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-56-1.png new file mode 100644 index 00000000..24986661 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-56-1.png differ diff --git 
a/_freeze/07-model-slr/figure-html/unnamed-chunk-57-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-57-1.png new file mode 100644 index 00000000..6104c48d Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-57-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-58-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-58-1.png new file mode 100644 index 00000000..cc651ac2 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-58-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-59-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-59-1.png new file mode 100644 index 00000000..bf60eb60 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-59-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-60-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-60-1.png new file mode 100644 index 00000000..a5254beb Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-60-1.png differ diff --git a/_freeze/07-model-slr/figure-html/unnamed-chunk-61-1.png b/_freeze/07-model-slr/figure-html/unnamed-chunk-61-1.png new file mode 100644 index 00000000..381db9d6 Binary files /dev/null and b/_freeze/07-model-slr/figure-html/unnamed-chunk-61-1.png differ diff --git a/_freeze/08-model-mlr/execute-results/html.json b/_freeze/08-model-mlr/execute-results/html.json new file mode 100644 index 00000000..ca29f9cc --- /dev/null +++ b/_freeze/08-model-mlr/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "35e6c89f75c5bcf77ce650d0caecab22", + "result": { + "markdown": "# Linear regression with multiple predictors {#sec-model-mlr}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nBuilding on the ideas of one predictor variable in a linear regression model (from Chapter \\@ref(model-slr)), a multiple linear regression model is now fit to two or more predictor variables.\nBy considering how different explanatory variables interact, we can uncover complicated relationships between the predictor variables and the response variable.\nOne challenge to working with multiple variables is that it is sometimes difficult to know which variables are most important to include in the model.\nModel building is an extensive topic, and we scratch the surface here by defining and utilizing the adjusted $R^2$ value.\n:::\n\nMultiple regression extends single predictor variable regression to the case that still has one response but many predictors (denoted $x_1$, $x_2$, $x_3$, ...).\nThe method is motivated by scenarios where many variables may be simultaneously connected to an output.\n\nWe will consider data about loans from the peer-to-peer lender, Lending Club, which is a dataset we first encountered in Chapter \\@ref(data-hello).\nThe loan data includes terms of the loan as well as information about the borrower.\nThe outcome variable we would like to better understand is the interest rate assigned to the loan.\nFor instance, all other characteristics held constant, does it matter how much debt someone already has?\nDoes it matter if their income has been verified?\nMultiple regression will help us answer these and other questions.\n\nThe dataset includes results from 10,000 loans, and we'll be looking at a subset of the available variables, some of which will be new from those we saw in earlier chapters.\nThe first six observations in the dataset are shown in Table \\@ref(tab:loans-data-matrix), and descriptions for each variable are shown in Table 
\\@ref(tab:loans-variables).\nNotice that the past bankruptcy variable (`bankruptcy`) is an indicator variable, where it takes the value 1 if the borrower had a past bankruptcy in their record and 0 if not.\nUsing an indicator variable in place of a category name allows for these variables to be directly used in regression.\nTwo of the other variables are categorical (`verified_income` and `issue_month`), each of which can take one of a few different non-numerical values; we'll discuss how these are handled in the model in Section \\@ref(ind-and-cat-predictors).\n\n::: {.data data-latex=\"\"}\nThe [`loans_full_schema`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\nBased on the data in this dataset we have created two new variables: `credit_util` which is calculated as the total credit utilized divided by the total credit limit and `bankruptcy` which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcy).\nWe will refer to this modified dataset as `loans`.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
First six rows of the `loans` dataset.
interest_rate verified_income debt_to_income credit_util bankruptcy term credit_checks issue_month
14.07 Verified 18.01 0.548 0 60 6 Mar-2018
12.61 Not Verified 5.04 0.150 1 36 1 Feb-2018
17.09 Source Verified 21.15 0.661 0 36 4 Feb-2018
6.72 Not Verified 10.16 0.197 0 36 0 Jan-2018
14.07 Verified 57.96 0.755 0 36 7 Mar-2018
6.72 Not Verified 6.46 0.093 0 36 6 Jan-2018
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variables and their descriptions for the `loans` dataset.
Variable Description
interest_rate Interest rate on the loan, in an annual percentage.
verified_income Categorical variable describing whether the borrower's income source and amount have been verified, with levels `Verified`, `Source Verified`, and `Not Verified`.
debt_to_income Debt-to-income ratio, which is the percentage of total debt of the borrower divided by their total income.
credit_util Of all the credit available to the borrower, what fraction are they utilizing. For example, the credit utilization on a credit card would be the card's balance divided by the card's credit limit.
bankruptcy An indicator variable for whether the borrower has a past bankruptcy in their record. This variable takes a value of `1` if the answer is *yes* and `0` if the answer is *no*.
term The length of the loan, in months.
issue_month The month and year the loan was issued, which for these loans is always during the first quarter of 2018.
credit_checks Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check.
\n\n`````\n:::\n:::\n\n\n## Indicator and categorical predictors {#ind-and-cat-predictors}\n\nLet's start by fitting a linear regression model for interest rate with a single predictor indicating whether a person has a bankruptcy in their record:\n\n$$\\widehat{\\texttt{interest_rate}} = 12.34 + 0.74 \\times \\texttt{bankruptcy}$$\n\nResults of this model are shown in Table \\@ref(tab:int-rate-bankruptcy).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a linear model for predicting `interest_rate` based on whether the borrower has a bankruptcy in their record. Degrees of freedom for this model is 9998.
term estimate std.error statistic p.value
(Intercept) 12.34 0.05 231.49 <0.0001
bankruptcy1 0.74 0.15 4.82 <0.0001
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nInterpret the coefficient for the past bankruptcy variable in the model.\n\n------------------------------------------------------------------------\n\nThe variable takes one of two values: 1 when the borrower has a bankruptcy in their history and 0 otherwise.\nA slope of 0.74 means that the model predicts a 0.74% higher interest rate for those borrowers with a bankruptcy in their record.\n(See Section \\@ref(categorical-predictor-two-levels) for a review of the interpretation for two-level categorical predictor variables.)\n:::\n\nSuppose we had fit a model using a 3-level categorical variable, such as `verified_income`.\nThe output from software is shown in Table \\@ref(tab:int-rate-ver-income).\nThis regression output provides multiple rows for the variable.\nEach row represents the relative difference for each level of `verified_income`.\nHowever, we are missing one of the levels: `Not Verified`.\nThe missing level is called the **reference level** and it represents the default level that other levels are measured against.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a linear model for predicting `interest_rate` based on whether the borrower’s income source and amount has been verified. This predictor has three levels, which results in 2 rows in the regression output.
term estimate std.error statistic p.value
(Intercept) 11.10 0.08 137.2 <0.0001
verified_incomeSource Verified 1.42 0.11 12.8 <0.0001
verified_incomeVerified 3.25 0.13 25.1 <0.0001
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nHow would we write an equation for this regression model?\n\n------------------------------------------------------------------------\n\nThe equation for the regression model may be written as a model with two predictors:\n\n$$\n\\begin{aligned}\n\\widehat{\\texttt{interest_rate}} &= 11.10 \\\\\n&+ 1.42 \\times \\texttt{verified_income}_{\\texttt{Source Verified}} \\\\\n&+ 3.25 \\times \\texttt{verified_income}_{\\texttt{Verified}}\n\\end{aligned}\n$$\n\nWe use the notation $\\texttt{variable}_{\\texttt{level}}$ to represent indicator variables for when the categorical variable takes a particular value.\nFor example, $\\texttt{verified_income}_{\\texttt{Source Verified}}$ would take a value of 1 if it was for a borrower that was source verified, and it would take a value of 0 otherwise.\nLikewise, $\\texttt{verified_income}_{\\texttt{Verified}}$ would take a value of 1 if it was for a borrower that was verified, and 0 if it took any other value.\n:::\n\nThe notation $\\texttt{variable}_{\\texttt{level}}$ may feel a bit confusing.\nLet's figure out how to use the equation for each level of the `verified_income` variable.\n\n::: {.workedexample data-latex=\"\"}\nUsing the model for predicting interest rate from income verification type, compute the average interest rate for borrowers whose income source and amount are both *unverified*.\n\n------------------------------------------------------------------------\n\nWhen `verified_income` takes a value of `Not Verified`, then both indicator functions in the equation for the linear model are set to 0:\n\n$$\\widehat{\\texttt{interest_rate}} = 11.10 + 1.42 \\times 0 + 3.25 \\times 0 = 11.10$$\n\nThe average interest rate for these borrowers is 11.1%.\nBecause the level does not have its own coefficient and it is the reference value, the indicators for the other levels for this variable all drop out.\n:::\n\n::: {.workedexample data-latex=\"\"}\nUsing the model for predicting interest rate from income verification type, compute the average interest rate for borrowers whose income source and amount are both *source verified*.\n\n------------------------------------------------------------------------\n\nWhen `verified_income` takes a value of `Source Verified`, then the corresponding variable takes a value of 1 while the other is 0:\n\n$$\\widehat{\\texttt{interest_rate}} = 11.10 + 1.42 \\times 1 + 3.25 \\times 0 = 12.52$$\n\nThe average interest rate for these borrowers is 12.52%.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the average interest rate for borrowers whose income source and amount are both verified.[^08-model-mlr-1]\n:::\n\n[^08-model-mlr-1]: When `verified_income` takes a value of `Verified`, then the corresponding variable takes a value of 1 while the other is 0: $11.10 + 1.42 \\times 0 + 3.25 \\times 1 = 14.35.$ The average interest rate for these borrowers is 14.35%.\n\n::: {.important data-latex=\"\"}\n**Predictors with several categories.**\n\nWhen fitting a regression model with a categorical variable that has $k$ levels where $k > 2$, software will provide a coefficient for $k - 1$ of those levels.\nFor the last level that does not receive a coefficient, this is the reference level, and the coefficients listed for the other levels are all considered relative to this reference level.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nInterpret the coefficients from the model above.[^08-model-mlr-2]\n:::\n\n[^08-model-mlr-2]: Each of the coefficients gives 
the incremental interest rate for the corresponding level relative to the `Not Verified` level, which is the reference level.\n For example, for a borrower whose income source and amount have been verified, the model predicts that they will have a 3.25% higher interest rate than a borrower who has not had their income source or amount verified.\n\nThe higher interest rate for borrowers who have verified their income source or amount is surprising.\nIntuitively, we would think that a loan would look *less* risky if the borrower's income has been verified.\nHowever, note that the situation may be more complex, and there may be confounding variables that we didn't account for.\nFor example, perhaps lenders require borrowers with poor credit to verify their income.\nThat is, verifying income in our dataset might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan.\nFor this reason, the borrower could be deemed higher risk, resulting in a higher interest rate.\n(What other confounding variables might explain this counter-intuitive relationship suggested by the model?)\n\n::: {.guidedpractice data-latex=\"\"}\nHow much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?[^08-model-mlr-3]\n:::\n\n[^08-model-mlr-3]: Relative to the `Not Verified` category, the `Verified` category has an interest rate of 3.25% higher, while the `Source Verified` category is only 1.42% higher.\n Thus, `Verified` borrowers will tend to get an interest rate about $3.25% - 1.42% = 1.83%$ higher than `Source Verified` borrowers.\n\n## Many predictors in a model\n\nThe world is complex, and it can be helpful to consider many factors at once in statistical modeling.\nFor example, we might like to use the full context of borrowers to predict the interest rate they receive rather than using a single variable.\nThis is the strategy used in **multiple regression**.\nWhile we remain cautious about making any causal interpretations using multiple regression on observational data, such models are a common first step in gaining insights or providing some evidence of a causal connection.\n\n\n\n\n\nWe want to construct a model that accounts not only for any past bankruptcy or whether the borrower had their income source or amount verified, but simultaneously accounts for all the variables in the `loans` dataset: `verified_income`, `debt_to_income`, `credit_util`, `bankruptcy`, `term`, `issue_month`, and `credit_checks`.\n\n$$\\begin{aligned}\n\\widehat{\\texttt{interest_rate}} &= b_0 \\\\\n&+ b_1 \\times \\texttt{verified_income}_{\\texttt{Source Verified}} \\\\\n&+ b_2 \\times \\texttt{verified_income}_{\\texttt{Verified}} \\\\\n&+ b_3 \\times \\texttt{debt_to_income} \\\\\n&+ b_4 \\times \\texttt{credit_util} \\\\\n&+ b_5 \\times \\texttt{bankruptcy} \\\\\n&+ b_6 \\times \\texttt{term} \\\\\n&+ b_9 \\times \\texttt{credit_checks} \\\\\n&+ b_7 \\times \\texttt{issue_month}_{\\texttt{Jan-2018}} \\\\\n&+ b_8 \\times \\texttt{issue_month}_{\\texttt{Mar-2018}}\n\\end{aligned}$$\n\nThis equation represents a holistic approach for modeling all of the variables simultaneously.\nNotice that there are two coefficients for `verified_income` and two coefficients for `issue_month`, since both are 3-level categorical variables.\n\nWe calculate $b_0$, $b_1$, $b_2$, $\\cdots$, $b_9$ the same way as we did in the case of a model with a single predictor -- we select values 
that minimize the sum of the squared residuals:\n\n$$SSE = e_1^2 + e_2^2 + \\dots + e_{10000}^2 = \\sum_{i=1}^{10000} e_i^2 = \\sum_{i=1}^{10000} \\left(y_i - \\hat{y}_i\\right)^2$$\n\nwhere $y_i$ and $\\hat{y}_i$ represent the observed interest rates and their estimated values according to the model, respectively.\n10,000 residuals are calculated, one for each observation.\nNote that these values are sample statistics and in the case where the observed data is a random sample from a target population that we are interested in making inferences about, they are estimates of the population parameters $\\beta_0$, $\\beta_1$, $\\beta_2$, $\\cdots$, $\\beta_9$.\nWe will discuss inference based on linear models in Chapter \\@ref(inf-model-mlr), for now we will focus on calculating sample statistics $b_i$.\n\nWe typically use a computer to minimize the sum of squares and compute point estimates, as shown in the sample output in Table \\@ref(tab:loans-full).\nUsing this output, we identify $b_i,$ just as we did in the one-predictor case.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Output for the regression model, where interest rate is the outcome and the variables listed are the predictors. Degrees of freedom for this model is 9990.
term estimate std.error statistic p.value
(Intercept) 1.89 0.21 9.01 <0.0001
verified_incomeSource Verified 1.00 0.10 10.06 <0.0001
verified_incomeVerified 2.56 0.12 21.87 <0.0001
debt_to_income 0.02 0.00 7.43 <0.0001
credit_util 4.90 0.16 30.25 <0.0001
bankruptcy1 0.39 0.13 2.96 0.0031
term 0.15 0.00 38.89 <0.0001
credit_checks 0.23 0.02 12.52 <0.0001
issue_monthJan-2018 0.05 0.11 0.42 0.6736
issue_monthMar-2018 -0.04 0.11 -0.39 0.696
\n\n`````\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Multiple regression model.**\n\nA multiple regression model is a linear model with many predictors.\nIn general, we write the model as\n\n$$\\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \\cdots + b_k x_k$$\n\nwhen there are $k$ predictors.\nWe always calculate $b_i$ using statistical software.\n:::\n\n::: {.workedexample data-latex=\"\"}\nWrite out the regression model using the regression output from Table \\@ref(tab:loans-full).\nHow many predictors are there in this model?\n\n------------------------------------------------------------------------\n\nThe fitted model for the interest rate is given by:\n\n$$\n\\begin{aligned}\n\\widehat{\\texttt{interest_rate}} &= 1.89 \\\\\n&+ 1.00 \\times \\texttt{verified_income}_{\\texttt{Source Verified}} \\\\\n&+ 2.56 \\times \\texttt{verified_income}_{\\texttt{Verified}} \\\\\n&+ 0.02 \\times \\texttt{debt_to_income} \\\\\n&+ 4.90 \\times \\texttt{credit_util} \\\\\n&+ 0.39 \\times \\texttt{bankruptcy} \\\\\n&+ 0.15 \\times \\texttt{term} \\\\\n&+ 0.23 \\times \\texttt{credit_checks} \\\\\n&+ 0.05 \\times \\texttt{issue_month}_{\\texttt{Jan-2018}} \\\\\n&- 0.04 \\times \\texttt{issue_month}_{\\texttt{Mar-2018}}\n\\end{aligned}\n$$\n\nIf we count up the number of predictor coefficients, we get the *effective* number of predictors in the model; there are nine of those.\nNotice that the categorical predictor counts as two, once for each of the two levels shown in the model.\nIn general, a categorical predictor with $p$ different levels will be represented by $p - 1$ terms in a multiple regression model.\nA total of seven variables were used as predictors to fit this model: `verified_income`, `debt_to_income`, `credit_util`, `bankruptcy`, `term`, `credit_checks`, `issue_month`.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nInterpret the coefficient of the variable `credit_checks`.[^08-model-mlr-4]\n:::\n\n[^08-model-mlr-4]: All else held constant, for each additional inquiry into the applicant's credit during the last 12 months, we would expect the interest rate for the loan to be higher, on average, by 0.23 points.\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the residual of the first observation in Table \\@ref(tab:loans-data-matrix) using the full model.[^08-model-mlr-5]\n:::\n\n[^08-model-mlr-5]: To compute the residual, we first need the predicted value, which we compute by plugging values into the equation from earlier.\n For example, $\\texttt{verified_income}_{\\texttt{Source Verified}}$ takes a value of 0, $\\texttt{verified_income}_{\\texttt{Verified}}$ takes a value of 1 (since the borrower's income source and amount were verified), $\\texttt{debt_to_income}$ was 18.01, and so on.\n This leads to a prediction of $\\widehat{\\texttt{interest_rate}}_1 = 17.84$.\n The observed interest rate was 14.07%, which leads to a residual of $e_1 = 14.07 - 17.84 = -3.77$.\n\n::: {.workedexample data-latex=\"\"}\nWe calculated a slope coefficient of 0.74 for `bankruptcy` in Section \\@ref(ind-and-cat-predictors) while the coefficient is 0.39 here.\nWhy is there a difference between the coefficient values between the models with single and multiple predictors?\n\n------------------------------------------------------------------------\n\nIf we examined the data carefully, we would see that some predictors are correlated.\nFor instance, when we modeled the relationship of the outcome `interest_rate` and predictor `bankruptcy` using linear regression, we were unable to control for other variables like 
whether the borrower had their income verified, the borrower's debt-to-income ratio, and other variables.\nThat original model was constructed in a vacuum and did not consider the full context of everything that is considered when an interest rate is decided.\nWhen we include all of the variables, underlying and unintentional bias that was missed by not including these other variables is reduced or eliminated.\nOf course, bias can still exist from other confounding variables.\n:::\n\nThe previous example describes a common issue in multiple regression: correlation among predictor variables.\nWe say the two predictor variables are collinear (pronounced as *co-linear*) when they are correlated, and this **multicollinearity** complicates model estimation.\nWhile it is impossible to prevent multicollinearity from arising in observational data, experiments are usually designed to prevent predictors from being multicollinear.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nThe estimated value of the intercept is 1.89, and one might be tempted to make some interpretation of this coefficient, such as: it is the model's predicted interest rate when each of the variables takes a value of zero: income source is not verified, the borrower has no debt (debt-to-income and credit utilization are zero), and so on.\nIs this reasonable?\nIs there any value gained by making this interpretation?[^08-model-mlr-6]\n:::\n\n[^08-model-mlr-6]: Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable.\n    However, one variable never takes a value of zero: `term`, which describes the length of the loan, in months.\n    If `term` is set to zero, then the loan must be paid back immediately; the borrower must give the money back as soon as they receive it, which means it is not a real loan.\n    Ultimately, the interpretation of the intercept in this setting is not insightful.\n\n## Adjusted R-squared\n\nWe first used $R^2$ in Section \\@ref(r-squared) to determine the amount of variability in the response that was explained by the model: $$\nR^2 = 1 - \\frac{\\text{variability in residuals}}{\\text{variability in the outcome}}\n    = 1 - \\frac{Var(e_i)}{Var(y_i)}\n$$\nwhere $e_i$ represents the residuals of the model and $y_i$ the outcomes.\nThis equation remains valid in the multiple regression framework, but a small enhancement can make it even more informative when comparing models.\n\n::: {.guidedpractice data-latex=\"\"}\nThe variance of the residuals for the model given in the earlier Guided Practice is 18.53, and the variance of the interest rate across all of the loans is 25.01.\nCalculate $R^2$ for this model.[^08-model-mlr-7]\n:::\n\n[^08-model-mlr-7]: $R^2 = 1 - \\frac{18.53}{25.01} = 0.2591$.\n\nThis strategy for estimating $R^2$ is acceptable when there is just a single variable.\nHowever, it becomes less helpful when there are many variables.\nThe regular $R^2$ is a biased estimate of the amount of variability explained by the model when applied to a model with more than one predictor.\nTo get a better estimate, we use the adjusted $R^2$.\n\n::: {.important data-latex=\"\"}\n**Adjusted R-squared as a tool for model assessment.**\n\nThe **adjusted R-squared** is computed as\n\n$$\n\\begin{aligned}\n  R_{adj}^{2}\n    &= 1 - \\frac{s_{\\text{residuals}}^2 / (n-k-1)}\n              {s_{\\text{outcome}}^2 / (n-1)} \\\\\n    &= 1 - \\frac{s_{\\text{residuals}}^2}{s_{\\text{outcome}}^2}\n        \\times \\frac{n-1}{n-k-1}\n\\end{aligned}\n$$\n\nwhere $n$ is the number of observations used to fit the model and $k$ is the number of predictor variables in the model.\nRemember that a categorical predictor with $p$ levels will contribute $p - 1$ to the number of variables in the model.\n:::
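\n\nAs a quick check, both calculations take only a couple of lines of R. The sketch below is a minimal illustration that plugs the residual and outcome variances quoted in the Guided Practice above into these formulas, using the 10,000 loans and $k = 9$ predictor terms of the full interest rate model; the results match the values worked out in the footnotes.\n\n```r\nvar_resid   <- 18.53    # variance of the residuals of the full model\nvar_outcome <- 25.01    # variance of the interest rates across the loans\nn <- 10000              # number of observations\nk <- 9                  # predictor terms, counting a categorical predictor as p - 1 terms\n\nr_sq     <- 1 - var_resid / var_outcome\nr_sq_adj <- 1 - (var_resid / var_outcome) * (n - 1) / (n - k - 1)\n\nc(r_sq, r_sq_adj)       # approximately 0.2591 and 0.2584\n```\n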
\n\nBecause $k$ is never negative, the adjusted $R^2$ will be smaller -- often just a little smaller -- than the unadjusted $R^2$.\nThe reasoning behind the adjusted $R^2$ lies in the **degrees of freedom** associated with each variance, which is equal to $n - k - 1$ in the multiple regression context.\nIf we were to make predictions for *new data* using our current model, we would find that the unadjusted $R^2$ would tend to be slightly overly optimistic, while the adjusted $R^2$ formula helps correct this bias.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nThere were n = 10,000 loans in the dataset and $k=9$ predictor variables in the model.\nUse $n$, $k$, and the variances from the earlier Guided Practice to calculate $R_{adj}^2$ for the interest rate model.[^08-model-mlr-8]\n:::\n\n[^08-model-mlr-8]: $R_{adj}^2 = 1 - \\frac{18.53}{25.01}\\times \\frac{10000-1}{10000-9-1} = 0.2584$.\n    While the difference is very small, it will be important when we fine-tune the model in the next section.\n\n::: {.guidedpractice data-latex=\"\"}\nSuppose you added another predictor to the model, but the variance of the errors $Var(e_i)$ didn't go down.\nWhat would happen to the $R^2$?\nWhat would happen to the adjusted $R^2$?[^08-model-mlr-9]\n:::\n\n[^08-model-mlr-9]: The unadjusted $R^2$ would stay the same and the adjusted $R^2$ would go down.\n\nAdjusted $R^2$ could also have been used in Chapter \\@ref(model-slr) where we introduced regression models with a single predictor.\nHowever, when there is only $k = 1$ predictor, adjusted $R^2$ is very close to regular $R^2$, so this nuance isn't typically important when the model has only one predictor.\n\n## Model selection {#model-selection}\n\nThe best model is not always the most complicated.\nSometimes including variables that are not evidently important can actually reduce the accuracy of predictions.\nIn this section, we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important.\nIt's common (and hip, at least in the statistical world) to refer to models that have undergone such variable pruning as **parsimonious**.\n\n\n\n\n\nIn practice, the model that includes all available predictors is often referred to as the **full model**.\nThe full model may not be the best model, and if it isn't, we want to identify a smaller model that is preferable.\n\n\n\n\n\n### Stepwise selection\n\nTwo common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection.\nThese techniques are often referred to as **stepwise selection** strategies, because they add or delete one variable at a time as they \"step\" through the candidate predictors.\n\n\n\n\n\n**Backward elimination** starts with the full model (the model that includes all potential predictor variables). 
Variables are eliminated one-at-a-time from the model until we cannot improve the model any further.\n\n**Forward selection** is the reverse of the backward elimination technique.\nInstead, of eliminating variables one-at-a-time, we add variables one-at-a-time until we cannot find any variables that improve the model any further.\n\n\n\n\n\nAn important consideration in implementing either of these stepwise selection strategies is the criterion used to decide whether to eliminate or add a variable.\nOne commonly used decision criterion is adjusted $R^2$.\nWhen using adjusted $R^2$ as the decision criterion, we seek to eliminate or add variables depending on whether they lead to the largest improvement in adjusted $R^2$ and we stop when adding or elimination of another variable does not lead to further improvement in adjusted $R^2$.\n\nAdjusted $R^2$ describes the strength of a model fit, and it is a useful tool for evaluating which predictors are adding value to the model, where *adding value* means they are (likely) improving the accuracy in predicting future outcomes.\n\nLet's consider two models, which are shown in Table \\@ref(tab:loans-full-for-model-selection) and Table \\@ref(tab:loans-full-except-issue-month).\nThe first table summarizes the full model since it includes all predictors, while the second does not include the `issue_month` variable.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
The fit for the full regression model, including the adjusted $R^2$.
term estimate std.error statistic p.value
(Intercept) 1.89 0.21 9.01 <0.0001
verified_incomeSource Verified 1.00 0.10 10.06 <0.0001
verified_incomeVerified 2.56 0.12 21.87 <0.0001
debt_to_income 0.02 0.00 7.43 <0.0001
credit_util 4.90 0.16 30.25 <0.0001
bankruptcy1 0.39 0.13 2.96 0.0031
term 0.15 0.00 38.89 <0.0001
credit_checks 0.23 0.02 12.52 <0.0001
issue_monthJan-2018 0.05 0.11 0.42 0.6736
issue_monthMar-2018 -0.04 0.11 -0.39 0.696
Adjusted R-sq = 0.2597
df = 9964
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
The fit for the regression model after dropping issue month, including the adjusted $R^2$.
term estimate std.error statistic p.value
(Intercept) 1.90 0.20 9.56 <0.0001
verified_incomeSource Verified 1.00 0.10 10.05 <0.0001
verified_incomeVerified 2.56 0.12 21.86 <0.0001
debt_to_income 0.02 0.00 7.44 <0.0001
credit_util 4.90 0.16 30.25 <0.0001
bankruptcy1 0.39 0.13 2.96 0.0031
term 0.15 0.00 38.89 <0.0001
credit_checks 0.23 0.02 12.52 <0.0001
Adjusted R-sq = 0.2598
df = 9966
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nWhich of the two models is better?\n\n------------------------------------------------------------------------\n\nWe compare the adjusted $R^2$ of each model to determine which to choose.\nSince the second model has a higher $R^2_{adj}$ compared to the first model, we prefer the second model to the first.\n:::\n\nWill the model without `issue_month` be better than the model with `issue_month`?\nWe cannot know for sure, but based on the adjusted $R^2$, this is our best assessment.\n\n::: {.workedexample data-latex=\"\"}\nResults corresponding to the full model for the `loans` data are shown in Table \\@ref(tab:loans-full-for-model-selection).\nHow should we proceed under the backward elimination strategy?\n\n------------------------------------------------------------------------\n\nOur baseline adjusted $R^2$ from the full model is 0.2597, and we need to determine whether dropping a predictor will improve the adjusted $R^2$.\nTo check, we fit models that each drop a different predictor, and we record the adjusted $R^2$:\n\n- Excluding `verified_income`: 0.2238\n- Excluding `debt_to_income`: 0.2557\n- Excluding `credit_util`: 0.1916\n- Excluding `bankruptcy`: 0.2589\n- Excluding `term`: 0.1468\n- Excluding `credit_checks`: 0.2484\n- Excluding `issue_month`: 0.2598\n\nThe model without `issue_month` has the highest adjusted $R^2$ of 0.2598, higher than the adjusted $R^2$ for the full model.\nBecause eliminating `issue_month` leads to a model with a higher adjusted $R^2$, we drop `issue_month` from the model.\n\nSince we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors.\nOur baseline adjusted $R^2$ is now $R^2_{adj} = 0.2598$.\nWe now fit new models, which consider eliminating each of the remaining predictors in addition to `issue_month`:\n\n- Excluding `issue_month` and `verified_income`: 0.22395\n- Excluding `issue_month` and `debt_to_income`: 0.25579\n- Excluding `issue_month` and `credit_util`: 0.19174\n- Excluding `issue_month` and `bankruptcy`: 0.25898\n- Excluding `issue_month` and `term`: 0.14692\n- Excluding `issue_month` and `credit_checks`: 0.24801\n\nNone of these models lead to an improvement in adjusted $R^2$, so we do not eliminate any of the remaining predictors.\nThat is, after backward elimination, we are left with the model that keeps all predictors except `issue_month`, which we can summarize using the coefficients from Table \\@ref(tab:loans-full-except-issue-month).\n\n$$\n\\begin{aligned}\n\\widehat{\\texttt{interest_rate}} &= 1.90 \\\\\n&+ 1.00 \\times \\texttt{verified_income}_\\texttt{Source only} \\\\\n&+ 2.56 \\times \\texttt{verified_income}_\\texttt{Verified} \\\\\n&+ 0.02 \\times \\texttt{debt_to_income} \\\\\n&+ 4.90 \\times \\texttt{credit_util} \\\\\n&+ 0.39 \\times \\texttt{bankruptcy} \\\\\n&+ 0.15 \\times \\texttt{term} \\\\\n&+ 0.23 \\times \\texttt{credit_check}\n\\end{aligned}\n$$\n:::\n\n::: {.workedexample data-latex=\"\"}\nConstruct a model for predicting `interest_rate` from the `loans` data using forward selection.\n\n------------------------------------------------------------------------\n\nWe start with the model that includes no predictors.\nThen we fit each of the possible models with just one predictor.\nThen we examine the adjusted $R^2$ for each of these models:\n\n- Including `verified_income`: 0.05926\n- Including `debt_to_income`: 0.01946\n- Including `credit_util`: 0.06452\n- Including `bankruptcy`: 
0.00222\n- Including `term`: 0.12855\n- Including `credit_checks`: -0.0001\n- Including `issue_month`: 0.01711\n\nIn this first step, we compare the adjusted $R^2$ against a baseline model that has no predictors.\nThe no-predictors model always has $R_{adj}^2 = 0$.\nThe model with one predictor that has the largest adjusted $R^2$ is the model with the `term` predictor, and because this adjusted $R^2$ is larger than the adjusted $R^2$ from the model with no predictors ($R_{adj}^2 = 0$), we will add this variable to our model.\n\nWe repeat the process again, this time considering 2-predictor models where one of the predictors is `term` and with a new baseline of $R^2_{adj} = 0.12855$:\n\n- Including `term` and `verified_income`: 0.16851\n- Including `term` and `debt_to_income`: 0.14368\n- Including `term` and `credit_util`: 0.20046\n- Including `term` and `bankruptcy`: 0.13070\n- Including `term` and `credit_checks`: 0.12840\n- Including `term` and `issue_month`: 0.14294\n\nAdding `credit_util` yields the highest increase in adjusted $R^2$ and has a higher adjusted $R^2$ (0.20046) than the baseline (0.12855).\nThus, we will also add `credit_util` to the model as a predictor.\n\nSince we have again added a predictor to the model, we have a new baseline adjusted $R^2$ of 0.20046.\nWe can continue on and see whether it would be beneficial to add a third predictor:\n\n- Including `term`, `credit_util`, and `verified_income`: 0.24183\n- Including `term`, `credit_util`, and `debt_to_income`: 0.20810\n- Including `term`, `credit_util`, and `bankruptcy`: 0.20169\n- Including `term`, `credit_util`, and `credit_checks`: 0.20031\n- Including `term`, `credit_util`, and `issue_month`: 0.21629\n\nThe model including `verified_income` has the largest increase in adjusted $R^2$ (0.24183) from the baseline (0.20046), so we add `verified_income` to the model as a predictor as well.\n\nWe continue on in this way, next adding `debt_to_income`, then `credit_checks`, and `bankruptcy`.\nAt this point, we come again to the `issue_month` variable: adding this as a predictor leads to $R_{adj}^2 = 0.25843$, while keeping all the other predictors but excluding `issue_month` has a higher $R_{adj}^2 = 0.25854$.\nThis means we do not add `issue_month` to the model as a predictor.\nIn this example, we have arrived at the same model that we identified from backward elimination.\n:::\n
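\nThe comparisons in these worked examples can be reproduced with a few lines of R. Below is a minimal sketch of a single backward-elimination step, assuming `loans` is the modified data frame described at the start of the chapter; repeating the same comparison for each candidate predictor gives the lists of adjusted $R^2$ values shown above.\n\n```r\n# Fit the full model for interest rate\nfull <- lm(interest_rate ~ verified_income + debt_to_income + credit_util +\n             bankruptcy + term + credit_checks + issue_month, data = loans)\nsummary(full)$adj.r.squared           # baseline; the text reports 0.2597\n\n# Candidate step: drop issue_month and recompute adjusted R-squared\nno_issue_month <- update(full, . ~ . - issue_month)\nsummary(no_issue_month)$adj.r.squared # 0.2598 in the text, so issue_month is dropped\n```\n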
\n::: {.important data-latex=\"\"}\n**Stepwise selection strategies.**\n\nBackward elimination begins with the model having the largest number of predictors and eliminates variables one-by-one until we are satisfied that all remaining variables are important to the model.\nForward selection starts with no variables included in the model, then it adds in variables according to their importance until no other important variables are found.\nNotice that, for both methods, we have always chosen to retain the model with the largest adjusted $R^2$ value, even if the difference is less than half a percent (e.g., 0.2597 versus 0.2598).\nOne could argue that the difference between these two models is negligible, as they both explain nearly the same amount of variability in the `interest_rate`.\nThese negligible differences are an important aspect of model selection.\nIt is highly advised that *before* you begin the model selection process, you decide what a \"meaningful\" difference in adjusted $R^2$ is for the context of your data.\nMaybe this difference is 1% or maybe it is 5%.\nThis \"threshold\" is what you will then use to decide if one model is \"better\" than another model.\nUsing meaningful thresholds in model selection requires more critical thinking about what the adjusted $R^2$ values mean.\n\nAdditionally, backward elimination and forward selection sometimes arrive at different final models.\nThis is because the decision for whether to include a given variable or not depends on the other variables that are already in the model.\nWith forward selection, you start with a model that includes no variables and add variables one at a time.\nIn backward elimination, you start with a model that includes all of the variables and remove variables one at a time.\nHow much a given variable changes the percentage of the variability in the outcome that is explained by the model depends on what other variables are in the model.\nThis is especially important if the predictor variables are correlated with each other.\n\nThere is no \"one size fits all\" model selection strategy, which is why there are so many different model selection methods.\nWe hope you walk away from this exploration understanding how stepwise selection is carried out and the considerations that should be made when using stepwise selection with regression models.\n:::\n\n### Other model selection strategies\n\nStepwise selection using adjusted $R^2$ as the decision criterion is one of many commonly used model selection strategies.\nStepwise selection can also be carried out with decision criteria other than adjusted $R^2$, such as p-values, which you'll learn about in @sec-inf-model-slr onward, or AIC (Akaike information criterion) or BIC (Bayesian information criterion), which you might learn about in more advanced courses.\n\nAlternatively, one could choose to include or exclude variables from a model based on expert opinion or due to research focus.\nIn fact, many statisticians discourage the use of stepwise regression alone for model selection and advocate, instead, for a more thoughtful approach that carefully considers the research focus and features of the data.\n\n\\clearpage\n\n## Chapter review {#chp8-review}\n\n### Summary\n\nWith real data, there is often a need to describe how multiple variables can be modeled together.\nIn this chapter, we have presented one approach using multiple linear regression.\nEach coefficient represents the expected change in the response variable for a one unit increase in that predictor variable, *given* the rest of the predictor variables in the model.\nWorking with and interpreting multivariable models can be tricky, especially when the predictor variables show multicollinearity.\nThere is often no perfect or \"right\" final model, but using the adjusted $R^2$ value is one way to identify important predictor variables for a final regression model.\nIn later chapters we will generalize multiple linear regression models to a larger population of interest from which the dataset was generated.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n
adjusted R-squared full model reference level
backward elimination multicollinearity stepwise selection
degrees of freedom multiple regression
forward selection parsimonious
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp8-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-08].\n\n::: {.exercises data-latex=\"\"}\n1. **High correlation, good or bad?**\nTwo friends, Frances and Annika, are in disagreement about whether high correlation values are *always* good in the context of regression.\nFrances claims that it's desirable for all variables in the dataset to be highly correlated to each other when building linear models.\nAnnika claims that while it's desirable for each of the predictors to be highly correlated with the outcome, it is not desirable for the predictors to be highly correlated with each other.\nWho is right: Frances, Annika, both, or neither? \nExplain your reasoning using appropriate terminology.\n\n1. **Dealing with categorical predictors.**\nTwo friends, Elliott and Adrian, want to build a model predicting typing speed (average number of words typed per minute) from whether the person wears glasses or not.\nBefore building the model they want to conduct some exploratory analysis to evaluate the strength of the association between these two variables, but they're in disagreement about how to evaluate how strongly a categorical predictor is associated with a numerical outcome.\nElliott claims that it is not possible to calculate a correlation coefficient to summarize the relationship between a categorical predictor and a numerical outcome, however they're not sure what a better alternative is.\nAdrian claims that you can recode a binary predictor as a 0/1 variable (assign one level to be 0 and the other to be 1), thus converting it to a numerical variable.\nAccording to Adrian, you can then calculate the correlation coefficient between the predictor and the outcome.\nWho is right: Elliott or Adrian?\nIf you pick Elliott, can you suggest a better alternative for evaluating the association between the categorical predictor and the numerical outcome?\n\n1. **Training for the 5K.**\nNico signs up for a 5K (a 5,000 metre running race) 30 days prior to the race. \nThey decide to run a 5K every day to train for it, and each day they record the following information: `days_since_start` (number of days since starting training), `days_till_race` (number of days left until the race), `mood` (poor, good, awesome), `tiredness` (1-not tired to 10-very tired), and `time` (time it takes to run 5K, recorded as mm:ss).\nTop few rows of the data they collect is shown below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
days_since_start days_till_race mood tiredness time
1 29 good 3 25:45
2 28 poor 5 27:13
3 27 awesome 4 24:13
... ... ... ... ...
\n \n `````\n :::\n :::\n \n Using these data Nico wants to build a model predicting `time` from the other variables. Should they include all variables shown above in their model? Why or why not?\n\n1. **Multiple regression fact checking.**\nDetermine which of the following statements are true and false.\nFor each statement that is false, explain why it is false.\n\n a. If predictors are collinear, then removing one variable will have no influence on the point estimate of another variable's coefficient.\n\n b. Suppose a numerical variable $x$ has a coefficient of $b_1 = 2.5$ in the multiple regression model. Suppose also that the first observation has $x_1 = 7.2$, the second observation has a value of $x_1 = 8.2$, and these two observations have the same values for all other predictors. Then the predicted value of the second observation will be 2.5 higher than the prediction of the first observation based on the multiple regression model.\n\n c. If a regression model's first variable has a coefficient of $b_1 = 5.7$, then if we are able to influence the data so that an observation will have its $x_1$ be 1 larger than it would otherwise, the value $y_1$ for this observation would increase by 5.7.\n \n \\clearpage\n\n1. **Baby weights and smoking.** \nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1,000 births from 2014.\nHere, we study the relationship between smoking and weight of the baby. \nThe variable `smoke` is coded 1 if the mother is a smoker, and 0 if not. \nThe summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in pounds, based on the smoking status of the mother.^[The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 7.270 0.0435 167.22 <0.0001
habitsmoker -0.593 0.1275 -4.65 <0.0001
\n \n `````\n :::\n :::\n\n a. Write the equation of the regression model.\n\n b. Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.\n\n1. **Baby weights and mature moms.**\nThe following is a model for predicting baby weight from whether the mom is classified as a `mature` mom (35 years or older at the time of pregnancy). [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 7.354 0.103 71.02 <0.0001
matureyounger mom -0.185 0.113 -1.64 0.102
\n \n `````\n :::\n :::\n\n a. Write the equation of the regression model.\n\n b. Interpret the slope in this context, and calculate the predicted birth weight of babies born to mature and younger mothers.\n \n1. **Movie returns, prediction.**\nA model was fit to predict return-on-investment (ROI) on movies based on release year and genre (Adventure, Action, Drama, Horror, and Comedy). The model output is shown below. [@webpage:horrormovies]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -156.04 169.15 -0.92 0.3565
release_year 0.08 0.08 0.94 0.348
genreAdventure 0.30 0.74 0.40 0.6914
genreComedy 0.57 0.69 0.83 0.4091
genreDrama 0.37 0.62 0.61 0.5438
genreHorror 8.61 0.86 9.97 <0.0001
\n \n `````\n :::\n :::\n \n a. For a given release year, which genre of movies are predicted, on average, to have the highest predicted return on investment?\n \n b. The adjusted $R^2$ of this model is 10.71%. Adding the production budget of the movie to the model increases the adjusted $R^2$ to 10.84%. Should production budget be added to the model?\n \n \\clearpage\n\n1. **Movie returns by genre.**\nA model was fit to predict return-on-investment (ROI) on movies based on release year and genre (Adventure, Action, Drama, Horror, and Comedy). \nThe plots below show the predicted ROI vs. actual ROI for each of the genres separately. Do these figures support the comment in the FiveThirtyEight.com article that states, \"The return-on-investment potential for horror movies is absurd.\" Note that the x-axis range varies for each plot. [@webpage:horrormovies]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](08-model-mlr_files/figure-html/unnamed-chunk-23-1.png){width=90%}\n :::\n :::\n\n1. **Predicting baby weights.**\nA more realistic approach to modeling baby weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in weeks (`weeks`), mother's age in years (`mage`), the sex of the baby (`sex`), smoking status of the mother (`habit`), and the number of hospital (`visits`) visits during pregnancy. Below are three observations from this data set.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
weight weeks mage sex visits habit
6.96 37 34 male 14 nonsmoker
8.86 41 31 female 12 nonsmoker
7.51 37 36 female 10 nonsmoker
\n \n `````\n :::\n :::\n\n The summary table below shows the results of a regression model for predicting the average birth weight of babies based on all of the variables presented above.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -3.82 0.57 -6.73 <0.0001
weeks 0.26 0.01 18.93 <0.0001
mage 0.02 0.01 2.53 0.0115
sexmale 0.37 0.07 5.30 <0.0001
visits 0.02 0.01 2.09 0.0373
habitsmoker -0.43 0.13 -3.41 7e-04
\n \n `````\n :::\n :::\n\n a. Write the equation of the regression model that includes all of the variables.\n\n b. Interpret the slopes of `weeks` and `habit` in this context.\n\n c. If we fit a model predicting baby weight from only `habit` (whether the mom smokes), we observe a difference in the slope coefficient for `habit` in this small model and the slope coefficient for `habit` in the larger model. Why might there be a difference?\n\n d. Calculate the residual for the first observation in the data set.\n \n \\clearpage\n\n1. **Palmer penguins, predicting body mass.**\nResearchers studying a community of Antarctic penguins collected body measurement (bill length, bill depth, and flipper length measured in millimeters and body mass, measured in grams), species (*Adelie*, *Chinstrap*, or *Gentoo*), and sex (female or male) data on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica.^[The [`penguins`](https://allisonhorst.github.io/palmerpenguins/reference/penguins.html) data used in this exercise can be found in the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) R package.] The summary table below shows the results of a linear regression model for predicting body mass (which is more difficult to measure) from the other variables in the dataset. [@palmerpenguins] \n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -1461.0 571.3 -2.6 0.011
bill_length_mm 18.2 7.1 2.6 0.0109
bill_depth_mm 67.2 19.7 3.4 7e-04
flipper_length_mm 16.0 2.9 5.5 <0.0001
sexmale 389.9 47.8 8.1 <0.0001
speciesChinstrap -251.5 81.1 -3.1 0.0021
speciesGentoo 1014.6 129.6 7.8 <0.0001
\n \n `````\n :::\n :::\n\n a. Write the equation of the regression model.\n\n a. Interpret each one of the slopes in this context.\n\n b. Calculate the residual for a male Adelie penguin that weighs 3750 grams with the following body measurements: `bill_length_mm` = 39.1, `bill_depth_mm` = 18.7, `flipper_length_mm` = 181. Does the model overpredict or underpredict this penguin's weight?\n \n c. The $R^2$ of this model is 87.5%. Interpret this value in context of the data and the model.\n \n \\vspace{5mm}\n\n1. **Baby weights, backwards elimination.**\nLet's consider a model that predicts `weight` of newborns using several predictors: whether the mother is considered `mature`, number of `weeks` of gestation, number of hospital `visits` during pregnancy, weight `gained` by the mother during pregnancy, `sex` of the baby, and whether the mother smoke cigarettes during pregnancy (`habit`). [@data:births14]\n\n ::: {.cell}\n \n :::\n \n The adjusted $R^2$ of the full model is 0.326.\n We remove each variable one by one, refit the model, and record the adjusted $R^2$.\n Which, if any, variable should be removed from the model?\n \n ::: {.cell}\n \n :::\n \n - Drop `mature`: 0.321\n - Drop `weeks`: 0.061\n - Drop `visits`: 0.326\n - Drop `gained`: 0.327\n - Drop `sex`: 0.301\n \n \\clearpage\n\n1. **Palmer penguins, backwards elimination.**\nThe following full model is built to predict the weights of three species (*Adelie*, *Chinstrap*, or *Gentoo*) of penguins living in the Palmer Archipelago, Antarctica. [@palmerpenguins]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -1461.0 571.3 -2.6 0.011
bill_length_mm 18.2 7.1 2.6 0.0109
bill_depth_mm 67.2 19.7 3.4 7e-04
flipper_length_mm 16.0 2.9 5.5 <0.0001
sexmale 389.9 47.8 8.1 <0.0001
speciesChinstrap -251.5 81.1 -3.1 0.0021
speciesGentoo 1014.6 129.6 7.8 <0.0001
\n \n `````\n :::\n :::\n \n The adjusted $R^2$ of the full model is 0.9. In order to evaluate whether any of the predictors can be dropped from the model without losing predictive performance of the model, the researchers dropped one variable at a time, refit the model, and recorded the adjusted $R^2$ of the smaller model. These values are given below.\n \n ::: {.cell}\n \n :::\n \n - Drop `bill_length_mm`: 0.87\n - Drop `bill_depth_mm`: 0.869\n - Drop `flipper_length_mm`: 0.861\n - Drop `sex`: 0.845\n - Drop `species`: 0.821\n\n Which, if any, variable should be removed from the model first?\n\n1. **Baby weights, forward selection.**\nUsing information on the mother and the sex of the baby (which can be determined prior to birth), we want to build a model that predicts the birth weight of babies.\nIn order to do so, we will evaluate six candidate predictors: whether the mother is considered `mature`, number of `weeks` of gestation, number of hospital `visits` during pregnancy, weight `gained` by the mother during pregnancy, `sex` of the baby, and whether the mother smoke cigarettes during pregnancy (`habit`).\nAnd we will make a decision about including them in the model using forward selection and adjusted $R^2$. \nBelow are the six models we evaluate and their adjusted $R^2$ values. [@data:births14]\n\n ::: {.cell}\n \n :::\n \n - Predict `weight` from `mature`: 0.002\n - Predict `weight` from `weeks`: 0.3\n - Predict `weight` from `visits`: 0.034\n - Predict `weight` from `gained`: 0.021\n - Predict `weight` from `sex`: 0.018\n - Predict `weight` from `habit`: 0.021\n\n Which variable should be added to the model first?\n\n1. **Palmer penguins, forward selection.**\nUsing body measurement and other relevant data on three species (*Adelie*, *Chinstrap*, or *Gentoo*) of penguins living in the Palmer Archipelago, Antarctica, we want to predict their body mass. In order to do so, we will evaluate five candidate predictors and make a decision about including them in the model using forward selection and adjusted $R^2$. 
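    As an aside, the single-predictor adjusted $R^2$ values listed below can, in principle, be reproduced with a short loop. The following is only a sketch, not part of the original exercise, and it assumes the `penguins` data from the **palmerpenguins** R package.

    ``` r
    # Sketch only: adjusted R^2 for each candidate one-predictor model of body
    # mass, i.e., the first step of forward selection (palmerpenguins assumed).
    library(palmerpenguins)

    candidates <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm",
                    "sex", "species")

    adj_r2 <- sapply(candidates, function(x) {
      fit <- lm(reformulate(x, response = "body_mass_g"), data = penguins)
      summary(fit)$adj.r.squared
    })

    round(adj_r2, 3)
    ```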
Below are the five models we evaluate and their adjusted $R^2$ values:\n\n ::: {.cell}\n \n :::\n \n - Predict body mass from `bill_length_mm`: 0.352\n - Predict body mass from `bill_depth_mm`: 0.22\n - Predict body mass from `flipper_length_mm`: 0.758\n - Predict body mass from `sex`: 0.178\n - Predict body mass from `species`: 0.668\n\n Which variable should be added to the model first?\n\n\n:::\n", + "supporting": [ + "08-model-mlr_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/08-model-mlr/figure-html/unnamed-chunk-23-1.png b/_freeze/08-model-mlr/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 00000000..3803b79f Binary files /dev/null and b/_freeze/08-model-mlr/figure-html/unnamed-chunk-23-1.png differ diff --git a/_freeze/09-model-logistic/execute-results/html.json b/_freeze/09-model-logistic/execute-results/html.json new file mode 100644 index 00000000..4dcc47b9 --- /dev/null +++ b/_freeze/09-model-logistic/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "70ff8ff8003c9302372b460989be7ad9", + "result": { + "markdown": "# Logistic regression {#model-logistic}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn this chapter we introduce **logistic regression** as a tool for building models when there is a categorical response variable with two levels, e.g., yes and no.\nLogistic regression is a type of **generalized linear model (GLM)** for response variables where regular multiple regression does not work very well.\nGLMs can be thought of as a two-stage modeling approach.\nWe first model the response variable using a probability distribution, such as the binomial or Poisson distribution.\nSecond, we model the parameter of the distribution using a collection of predictors and a special form of multiple regression.\nUltimately, the application of a GLM will feel very similar to multiple regression, even if some of the details are different.\n:::\n\n\n\n\n\n## Discrimination in hiring\n\nWe will consider experiment data from a study that sought to understand the effect of race and sex on job application callback rates [@bertrand2003].\nTo evaluate which factors were important, job postings were identified in Boston and Chicago for the study, and researchers created many fake resumes to send off to these jobs to see which would elicit a callback.[^09-model-logistic-1]\nThe researchers enumerated important characteristics, such as years of experience and education details, and they used these characteristics to randomly generate fake resumes.\nFinally, they randomly assigned a name to each resume, where the name would imply the applicant's sex and race.\n\n[^09-model-logistic-1]: We did omit discussion of some structure in the data for the analysis presented: the experiment design included blocking, where typically four resumes were sent to each job: one for each inferred race/sex combination (as inferred based on the first name).\n We did not worry about this blocking aspect, since accounting for the blocking would *reduce* the standard error without notably changing the point estimates for the `race` and `sex` variables versus the analysis performed in the section.\n That is, the most interesting conclusions in the study are unaffected even when completing a more sophisticated analysis.\n\n::: {.data data-latex=\"\"}\nThe 
[`resume`](http://openintrostat.github.io/openintro/reference/resume.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe first names that were used and randomly assigned in this experiment were selected so that they would predominantly be recognized as belonging to Black or White individuals; other races were not considered in this study.\nWhile no name would definitively be inferred as pertaining to a Black individual or to a White individual, the researchers conducted a survey to check for racial association of the names; names that did not pass this survey check were excluded from usage in the experiment.\nYou can find the full set of names that did pass the survey test and were ultimately used in the study in Table \\@ref(tab:resume-names).\nFor example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted to be associated with a White male.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
List of all 36 unique names along with the commonly inferred race and sex associated with these names.
first_name race sex first_name race sex first_name race sex
Aisha Black female Hakim Black male Laurie White female
Allison White female Jamal Black male Leroy Black male
Anne White female Jay White male Matthew White male
Brad White male Jermaine Black male Meredith White female
Brendan White male Jill White female Neil White male
Brett White male Kareem Black male Rasheed Black male
Carrie White female Keisha Black female Sarah White female
Darnell Black male Kenya Black female Tamika Black female
Ebony Black female Kristen White female Tanisha Black female
Emily White female Lakisha Black female Todd White male
Geoffrey White male Latonya Black female Tremayne Black male
Greg White male Latoya Black female Tyrone Black male
\n\n`````\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nThe response variable of interest is whether there was a callback from the employer for the applicant, and there were 8 attributes that were randomly assigned that we'll consider, with special interest in the race and sex variables.\nRace and sex are protected classes in the United States, meaning they are not legally permitted factors for hiring or employment decisions.\nThe full set of attributes considered is provided in Table \\@ref(tab:resume-variables).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Descriptions of nine variables from the `resume` dataset. Many of the variables are indicator variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.
variable description
received_callback Specifies whether the employer called the applicant following submission of the application for the job.
job_city City where the job was located: Boston or Chicago.
college_degree An indicator for whether the resume listed a college degree.
years_experience Number of years of experience listed on the resume.
honors Indicator for the resume listing some sort of honors, e.g. employee of the month.
military Indicator for if the resume listed any military experience.
has_email_address Indicator for if the resume listed an email address for the applicant.
race Race of the applicant, implied by their first name listed on the resume.
sex Sex of the applicant (limited to only and in this study), implied by the first name listed on the resume.
\n\n`````\n:::\n:::\n\n\nAll of the attributes listed on each resume were randomly assigned.\nThis means that no attributes that might be favorable or detrimental to employment would favor one demographic over another on these resumes.\nImportantly, due to the experimental nature of this study, we can infer causation between these variables and the callback rate, if substantial differences are found.\nOur analysis will allow us to compare the practical importance of each of the variables relative to each other.\n\n## Modelling the probability of an event {#modelingTheProbabilityOfAnEvent}\n\nLogistic regression is a generalized linear model where the outcome is a two-level categorical variable.\nThe outcome, $Y_i$, takes the value 1 (in our application, this represents a callback for the resume) with probability $p_i$ and the value 0 with probability $1 - p_i$.\nBecause each observation has a slightly different context, e.g., different education level or a different number of years of experience, the probability $p_i$ will differ for each observation.\nUltimately, it is this **probability** that we model in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.\n\n::: {.important data-latex=\"\"}\n**Notation for a logistic regression model.**\n\nThe outcome variable for a GLM is denoted by $Y_i$, where the index $i$ is used to represent observation $i$.\nIn the resume application, $Y_i$ will be used to represent whether resume $i$ received a callback ($Y_i=1$) or not ($Y_i=0$).\n:::\n\n\n\n\n\nThe predictor variables are represented as follows: $x_{1,i}$ is the value of variable 1 for observation $i$, $x_{2,i}$ is the value of variable 2 for observation $i$, and so on.\n\n$$\ntransformation(p_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \\cdots + b_k x_{k,i}\n$$\n\nWe want to choose a **transformation** in the equation that makes practical and mathematical sense.\nFor example, we want a transformation that makes the range of possibilities on the left hand side of the equation equal to the range of possibilities for the right hand side; if there was no transformation for this equation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range.\n\n\n\n\n\nA common transformation for $p_i$ is the **logit transformation**\\index{logit transformation}, which may be written as\n\n$$\nlogit(p_i) = \\log_{e}\\left( \\frac{p_i}{1-p_i} \\right)\n$$\n\nThe logit transformation is shown in Figure \\@ref(fig:logit-transformation).\nBelow, we rewrite the equation relating $Y_i$ to its predictors using the logit transformation of $p_i$:\n\n\n\n\n\n$$\n\\log_{e}\\left( \\frac{p_i}{1-p_i} \\right) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \\cdots + b_k x_{k,i}\n$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Values of $p_i$ against values of $logit(p_i)$.](09-model-logistic_files/figure-html/logit-transformation-1.png){width=100%}\n:::\n:::\n\n\nIn our resume example, there are 8 predictor variables, so $k = 8$.\nWhile the precise choice of a logit function isn't intuitive, it is based on theory that underpins generalized linear models, which is beyond the scope of this book.\nFortunately, once we fit a model using software, it will start to feel like we are back in the multiple regression context, even if the interpretation of the coefficients is more complex.\n\nTo convert from values on the logistic regression scale to the probability scale, we need to back transform and then 
solve for $p_i$:\n\n$$\n\\begin{aligned}\n\\log_{e}\\left( \\frac{p_i}{1-p_i} \\right) &= b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i} \\\\\n\\frac{p_i}{1-p_i} &= e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} \\\\\np_i &= \\left( 1 - p_i \\right) e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} \\\\\np_i &= e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} - p_i \\times e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} \\\\\np_i + p_i \\text{ } e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} &= e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} \\\\\np_i(1 + e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}}) &= e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}} \\\\\np_i &= \\frac{e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}}}{1 + e^{b_0 + b_1 x_{1,i} + \\cdots + b_k x_{k,i}}}\n\\end{aligned}\n$$\n\nAs with most applied data problems, we substitute the point estimates for the parameters (the $b_i$) so that we can make use of this formula.\n\n::: {.workedexample data-latex=\"\"}\nWe start by fitting a model with a single predictor: `honors`.\nThis variable indicates whether the applicant had any type of honors listed on their resume, such as employee of the month.\nA logistic regression model was fit using statistical software and the following model was found:\n\n$$\\log_e \\left( \\frac{\\widehat{p}_i}{1-\\widehat{p}_i} \\right) = -2.4998 + 0.8668 \\times {\\texttt{honors}}$$\n\na. If a resume is randomly selected from the study and it does not have any honors listed, what is the probability it resulted in a callback?\nb. What would the probability be if the resume did list some honors?\n\n------------------------------------------------------------------------\n\na. If a randomly chosen resume from those sent out is considered, and it does not list honors, then `honors` takes the value of 0 and the right side of the model equation equals -2.4998. Solving for $p_i$: $\\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076$. Just as we labeled a fitted value of $y_i$ with a \"hat\" in single-variable and multiple regression, we do the same for this probability: $\\hat{p}_i = 0.076{}$.\nb. If the resume had listed some honors, then the right side of the model equation is $-2.4998 + 0.8668 \\times 1 = -1.6330$, which corresponds to a probability $\\hat{p}_i = 0.163$. Notice that we could examine -2.4998 and -1.6330 in Figure \\@ref(fig:logit-transformation) to estimate the probability before formally calculating the value.\n:::\n\nWhile knowing whether a resume listed honors provides some signal when predicting whether the employer would call, we would like to account for many different variables at once to understand how each of the different resume characteristics affected the chance of a callback.\n\n## Logistic model with many variables\n\nWe used statistical software to fit the logistic regression model with all 8 predictors described in Table \\@ref(tab:resume-variables).\nLike multiple regression, the result may be presented in a summary table, which is shown in Table \\@ref(tab:resume-full-fit).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary table for the full logistic regression model for the resume callback example.
term estimate std.error statistic p.value
(Intercept) -2.66 0.18 -14.64 <0.0001
job_cityChicago -0.44 0.11 -3.85 1e-04
college_degree1 -0.07 0.12 -0.55 0.5821
years_experience 0.02 0.01 1.96 0.0503
honors1 0.77 0.19 4.14 <0.0001
military1 -0.34 0.22 -1.59 0.1127
has_email_address1 0.22 0.11 1.93 0.0541
raceWhite 0.44 0.11 4.10 <0.0001
sexman -0.18 0.14 -1.32 0.1863
\n\n`````\n:::\n:::\n\n\nJust like multiple regression, we could trim some variables from the model.\nHere we'll use a statistic called **Akaike information criterion (AIC)**\\index{AIC}, which is analogous to how we used adjusted $R^2$ in multiple regression.\nAIC is a popular model selection method used in many disciplines, and is praised for its emphasis on model uncertainty and parsimony.\nAIC selects a \"best\" model by ranking models from best to worst according to their AIC values.\nIn the calculation of a model's AIC, a penalty is given for including additional variables.\nThis penalty for added model complexity attempts to strike a balance between underfitting (too few variables in the model) and overfitting (too many variables in the model).\nWhen using AIC for model selection, models with a lower AIC value are considered to be \"better.\" Remember that when using adjusted $R^2$ we select models with higher values instead.\nIt is important to note that AIC provides information about the quality of a model relative to other models, but does not provide information about the overall quality of a model.\n\nWe will look for models with a lower AIC using a backward elimination strategy.\nAfter using this criteria, the variable `college_degree` is eliminated, giving the smaller model summarized in Table \\@ref(tab:resume-fit), which is what we'll rely on for the remainder of this section.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary table for the logistic regression model for the resume callback example, where variable selection has been performed using AIC.
term estimate std.error statistic p.value
(Intercept) -2.72 0.16 -17.51 <0.0001
job_cityChicago -0.44 0.11 -3.83 1e-04
years_experience 0.02 0.01 2.02 0.043
honors1 0.76 0.19 4.12 <0.0001
military1 -0.34 0.22 -1.60 0.1105
has_email_address1 0.22 0.11 1.97 0.0494
raceWhite 0.44 0.11 4.10 <0.0001
sexman -0.20 0.14 -1.45 0.1473
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nThe `race` variable had taken only two levels: `Black` and `White`.\nBased on the model results, what does the coefficient of this variable say about callback decisions?\n\n------------------------------------------------------------------------\n\nThe coefficient shown corresponds to the level of `White`, and it is positive.\nThis positive coefficient reflects a positive gain in callback rate for resumes where the candidate's first name implied they were White.\nThe model results suggest that prospective employers favor resumes where the first name is typically interpreted to be White.\n:::\n\nThe coefficient of $\\texttt{race}_{\\texttt{White}}$ in the full model in Table \\@ref(tab:resume-full-fit), is nearly identical to the model shown in Table \\@ref(tab:resume-fit).\nThe predictors in this experiment were thoughtfully laid out so that the coefficient estimates would typically not be much influenced by which other predictors were in the model, which aligned with the motivation of the study to tease out which effects were important to getting a callback.\nIn most observational data, it's common for point estimates to change a little, and sometimes a lot, depending on which other variables are included in the model.\n\n::: {.workedexample data-latex=\"\"}\nUse the model summarized in Table \\@ref(tab:resume-fit) to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male.\n\n------------------------------------------------------------------------\n\nWe can start by writing out the equation using the coefficients from the model:\n\n$$\n\\begin{aligned}\n&log_e \\left(\\frac{p}{1 - p}\\right) \\\\\n&= - 2.7162 - 0.4364 \\times \\texttt{job_city}_{\\texttt{Chicago}} \\\\\n& \\quad \\quad + 0.0206 \\times \\texttt{years_experience} \\\\\n& \\quad \\quad + 0.7634 \\times \\texttt{honors} - 0.3443 \\times \\texttt{military} + 0.2221 \\times \\texttt{email} \\\\\n& \\quad \\quad + 0.4429 \\times \\texttt{race}_{\\texttt{White}} - 0.1959 \\times \\texttt{sex}_{\\texttt{man}} \n\\end{aligned}\n$$\n\nNow we can add in the corresponding values of each variable for this individual:\n\n$$\n\\begin{aligned}\n&log_e \\left(\\frac{\\widehat{p}}{1 - \\widehat{p}}\\right) \\\\\n&\\quad= - 2.7162 - 0.4364 \\times 1 + 0.0206 \\times 14 \\\\\n&\\quad \\quad + 0.7634 \\times 0 - 0.3443 \\times 0 + 0.2221 \\times 1 \\\\\n&\\quad \\quad + 0.4429 \\times 1 - 0.1959 \\times 1 = - 2.3955 \n\\end{aligned}\n$$\n\nWe can now back-solve for $\\widehat{p}$: the chance such an individual will receive a callback is about $\\frac{e^{-2.3955}}{1 + e^{-2.3955}} = 8.35\\%$.\n:::\n\n\\vspace{-4mm}\n\n::: {.workedexample data-latex=\"\"}\nCompute the probability of a callback for an individual with a name commonly inferred to be from a Black male but who otherwise has the same characteristics as the one described in the previous example.\n\n------------------------------------------------------------------------\n\nWe can complete the same steps for an individual with the same characteristics who is Black, where the only difference in the calculation is that the indicator variable $\\texttt{race}_{\\texttt{White}}$ will take a value of 0.\nDoing so yields a probability of 0.0553.\nLet's compare the results with those of the previous example..\n\nIn practical terms, an individual perceived 
as White based on their first name would need to apply to $\\frac{1}{0.0835} \\approx 12$ jobs on average to receive a callback, while an individual perceived as Black based on their first name would need to apply to $\\frac{1}{0.0553} \\approx 18$ jobs on average to receive a callback.\nThat is, applicants who are perceived as Black need to apply to 50% more employers to receive a callback than someone who is perceived as White based on their first name for jobs like those in the study.\n:::\n\n\\vspace{-4mm}\n\nWhat we have quantified in this section is alarming and disturbing.\nHowever, one aspect that makes this racism so difficult to address is that the experiment, as well-designed as it is, cannot send us much signal about which employers are discriminating.\nIt is only possible to say that discrimination is happening, even if we cannot say which particular callbacks --- or non-callbacks --- represent discrimination.\nFinding strong evidence of racism for individual cases is a persistent challenge in enforcing anti-discrimination laws.\n\n## Groups of different sizes\n\nAny form of discrimination is concerning, and this is why we decided it was so important to discuss this topic using data.\nThe resume study also only examined discrimination in a single aspect: whether a prospective employer would call a candidate who submitted their resume.\nThere was a 50% higher barrier for resumes simply when the candidate had a first name that was perceived to be from a Black individual.\nIt's unlikely that discrimination would stop there.\n\n::: {.workedexample data-latex=\"\"}\nLet's consider a sex-imbalanced company that consists of 20% women and 80% men, and we'll suppose that the company is very large, consisting of perhaps 20,000 employees.\n(A more thoughtful example would include more inclusive gender identities.) 
Suppose when someone goes up for promotion at this company, 5 of their colleagues are randomly chosen to provide feedback on their work.\n\nNow let's imagine that 10% of the people in the company are prejudiced against the other sex.\nThat is, 10% of men are prejudiced against women, and similarly, 10% of women are prejudiced against men.\n\nWho is discriminated against more at the company, men or women?\n\n------------------------------------------------------------------------\n\nLet's suppose we took 100 men who have gone up for promotion in the past few years.\nFor these men, $5 \\times 100 = 500$ random colleagues will be tapped for their feedback, of which about 20% will be women (100 women).\nOf these 100 women, 10 are expected to be biased against the man they are reviewing.\nThen, of the 500 colleagues reviewing them, men will experience discrimination by about 2% of their colleagues when they go up for promotion.\n\nLet's do a similar calculation for 100 women who have gone up for promotion in the last few years.\nThey will also have 500 random colleagues providing feedback, of which about 400 (80%) will be men.\nOf these 400 men, about 40 (10%) hold a bias against women.\nOf the 500 colleagues providing feedback on the promotion packet for these women, 8% of the colleagues hold a bias against the women.\n:::\n\nThis example highlights something profound: even in a hypothetical setting where each demographic has the same degree of prejudice against the other demographic, the smaller group experiences the negative effects more frequently.\nAdditionally, if we were to complete a handful of examples like the one above with different numbers, we would learn that the greater the imbalance in the population groups, the more the smaller group is disproportionately impacted.[^09-model-logistic-2]\n\n[^09-model-logistic-2]: If a proportion $p$ of a company are women and the rest of the company consists of men, then under the hypothetical situation the ratio of rates of discrimination against women versus men would be given by $(1 - p) / p$; this ratio is always greater than 1 when $p < 0.5$.\n\nOf course, there are other considerable real-world omissions from the hypothetical example.\nFor example, studies have found instances where people from an oppressed group also discriminate against others within their own oppressed group.\nAs another example, there are also instances where a majority group can be oppressed, with apartheid in South Africa being one such historic example.\nUltimately, discrimination is complex, and there are many factors at play beyond the mathematical property we observed in the previous example.\n\nWe close this chapter on this serious topic, and we hope it inspires you to think about the power of reasoning with data.\nWhether it is with a formal statistical model or by using critical thinking skills to structure a problem, we hope the ideas you have learned will help you do more and do better in life.\n\n\\clearpage\n\n## Chapter review {#chp9-review}\n\n### Summary\n\nLogistic and linear regression models have many similarities.\nThe strongest of these is the linear combination of the explanatory variables used to form predictions related to the response variable.\nHowever, with logistic regression, the response variable is binary, and therefore a prediction is given as the probability of a successful event.\nLogistic model fit and variable selection can be carried out in ways similar to those used for multiple linear regression.\n\n### Terms\n\nWe introduced the following 
terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n
Akaike information criterion logistic regression probability of an event
generalized linear model logit transformation transformation
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp09-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-09].\n\n::: {.exercises data-latex=\"\"}\n1. **True / False.**\nDetermine which of the following statements are true and false. For each statement that is false, explain why it is false.\n\n a. In logistic regression we fit a line to model the relationship between the predictor(s) and the binary outcome.\n \n b. In logistic regression, we expect the residuals to be even scattered on either side of zero, just like with linear regression.\n \n c. In logistic regression, the outcome variable is binary but the predictor variable(s) can be either binary or continuous.\n \n \\vspace{5mm}\n\n1. **Logistic regression fact checking.**\nDetermine which of the following statements are true and false. For each statement that is false, explain why it is false.\n\n a. Suppose we consider the first two observations based on a logistic regression model, where the first variable in observation 1 takes a value of $x_1 = 6$ and observation 2 has $x_1 = 4$. Suppose we realized we made an error for these two observations, and the first observation was actually $x_1 = 7$ (instead of 6) and the second observation actually had $x_1 = 5$ (instead of 4). Then the predicted probability from the logistic regression model would increase the same amount for each observation after we correct these variables.\n\n b. When using a logistic regression model, it is impossible for the model to predict a probability that is negative or a probability that is greater than 1.\n\n c. Because logistic regression predicts probabilities of outcomes, observations used to build a logistic regression model need not be independent.\n\n d. When fitting logistic regression, we typically complete model selection using adjusted $R^2$.\n \n \\clearpage\n\n1. **Possum classification, model selection.**\nThe common brushtail possum of the Australia region is a bit cuter than its distant cousin, the American opossum (see Figure \\@ref(fig:brushtail-possum). We consider 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population. The first region is Victoria, which is in the eastern half of Australia and traverses the southern coast. The second region consists of New South Wales and Queensland, which make up eastern and northeastern Australia.^[The [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] \n \n We use logistic regression to differentiate between possums in these two regions. The outcome variable, called `pop`, takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland. We consider five predictors: `sex` (an indicator for a possum being male), `head_l` (head length), `skull_w` (skull width), `total_l` (total length), and `tail_l` (tail length). Each variable is summarized in a histogram. 
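    As an aside, a minimal sketch of how the full model described in this exercise could be fit in R is shown below; it is not part of the original exercise and assumes the `possum` data from the **openintro** R package, with possums from Victoria recorded as the `"Vic"` level of the `pop` variable.

    ``` r
    # Sketch only: logistic regression for whether a possum is from Victoria,
    # using the five predictors described above (openintro::possum assumed).
    library(openintro)

    possum_full <- glm(
      I(pop == "Vic") ~ sex + head_l + skull_w + total_l + tail_l,
      data   = possum,
      family = binomial
    )

    summary(possum_full)
    ```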
The full logistic regression model and a reduced model after variable selection are summarized in the tables below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](09-model-logistic_files/figure-html/unnamed-chunk-14-1.png){width=100%}\n :::\n :::\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 39.23 11.54 3.40 7e-04
sexmale -1.24 0.67 -1.86 0.0632
head_l -0.16 0.14 -1.16 0.248
skull_w -0.20 0.13 -1.52 0.1294
total_l 0.65 0.15 4.24 <0.0001
tail_l -1.87 0.37 -5.00 <0.0001
\n \n `````\n :::\n :::\n \n \\vspace{-6mm}\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 33.51 9.91 3.38 7e-04
sexmale -1.42 0.65 -2.20 0.0278
skull_w -0.28 0.12 -2.27 0.0231
total_l 0.57 0.13 4.30 <0.0001
tail_l -1.81 0.36 -5.02 <0.0001
\n \n `````\n :::\n :::\n\n \\vspace{-2mm}\n\n a. Examine each of the predictors. Are there any outliers that are likely to have a very large influence on the logistic regression model?\n\n b. The summary table for the full model indicates that at least one variable should be eliminated when using the p-value approach for variable selection: `head_l`. The second component of the table summarizes the reduced model following variable selection. Explain why the remaining estimates change between the two models.\n \n \\clearpage\n\n1. **Challenger disaster and model building.** \nOn January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch. `temperature` gives the temperature in Fahrenheit, `damaged` represents the number of damaged O-rings, and `undamaged` represents the number of O-rings that were not damaged.^[The [`orings`](http://openintrostat.github.io/openintro/reference/orings.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] \n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
mission 1 2 3 4 5 6 7 8 9 10 11 12
temperature 53 57 58 63 66 67 67 67 68 69 70 70
damaged 5 1 1 1 0 0 0 0 0 0 1 0
undamaged 1 5 5 5 6 6 6 6 6 6 5 6
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
mission 13 14 15 16 17 18 19 20 21 22 23
temperature 70 70 72 73 75 75 76 76 78 79 81
damaged 1 0 0 0 0 1 0 0 0 0 0
undamaged 5 6 6 6 6 5 6 6 6 6 6
\n \n `````\n :::\n :::\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 11.66 3.30 3.54 4e-04
temperature -0.22 0.05 -4.07 <0.0001
\n \n `````\n :::\n :::\n\n a. Each column of the table above represents a different shuttle mission. Examine these data and describe what you observe with respect to the relationship between temperatures and damaged O-rings.\n\n b. Failures have been coded as 1 for a damaged O-ring and 0 for an undamaged O-ring, and a logistic regression model was fit to these data. The regression output for this model is given above. Describe the key components of the output in words.\n\n c. Write out the logistic model using the point estimates of the model parameters.\n\n d. Based on the model, do you think concerns regarding O-rings are justified? Explain.\n\n1. **Possum classification, prediction.**\nA logistic regression model was proposed for classifying common brushtail possums into their two regions. The outcome variable took value 1 if the possum was from Victoria and 0 otherwise.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 33.51 9.91 3.38 7e-04
sexmale -1.42 0.65 -2.20 0.0278
skull_w -0.28 0.12 -2.27 0.0231
total_l 0.57 0.13 4.30 <0.0001
tail_l -1.81 0.36 -5.02 <0.0001
\n \n `````\n :::\n :::\n\n a. Write out the form of the model. Also identify which of the variables are positively associated with the outcome of living in Victoria, when controlling for other variables.\n\n b. Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had been captured in the wild in Australia, but it does not say which part of Australia. However, the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37 cm long, and its total length is 83 cm. What is the reduced model's computed probability that this possum is from Victoria? How confident are you in the model's accuracy of this probability calculation?\n\n1. **Challenger disaster and prediction.**\nOn January 28, 1986, a routine launch was anticipated for the Challenger space shuttle.\nSeventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. \nAn investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. \nThe investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. \n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](09-model-logistic_files/figure-html/unnamed-chunk-20-1.png){width=90%}\n :::\n :::\n\n a. The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as\n \n $\\log\\left( \\frac{\\hat{p}}{1 - \\hat{p}} \\right) = 11.6630 - 0.2162\\times \\texttt{temperature}$\n \n where $\\hat{p}$ is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature: \n \n $$\n \\begin{aligned}\n &\\hat{p}_{57} = 0.341\n && \\hat{p}_{59} = 0.251\n && \\hat{p}_{61} = 0.179\n && \\hat{p}_{63} = 0.124 \\\\\n &\\hat{p}_{65} = 0.084\n && \\hat{p}_{67} = 0.056\n && \\hat{p}_{69} = 0.037\n && \\hat{p}_{71} = 0.024\n \\end{aligned}\n $$\n\n b. Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.\n\n c. Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model's validity.\n \n \\clearpage\n\n1. **Spam filtering, model selection.** \nSpam filters are built on principles similar to those used in logistic regression. Using characteristics of individual emails, we fit a probability that each message is spam or not spam. We have several email variables for this problem, and we won't describe what each variable means here for the sake of brevity, but each is either a numerical or indicator variable.^[The [`email`](http://openintrostat.github.io/openintro/reference/email.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] 
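    As an aside, a model like the one summarized below could be fit with a single `glm()` call. The following is only a sketch, not part of the original exercise, and it assumes the `email` data from the **openintro** R package with the variable names shown in the output table.

    ``` r
    # Sketch only: logistic regression for spam using the predictors that
    # appear in the summary table below (openintro::email assumed).
    library(openintro)

    spam_fit <- glm(
      spam ~ to_multiple + cc + attach + dollar + winner + inherit +
        password + format + re_subj + exclaim_subj + sent_email,
      data   = email,
      family = binomial
    )

    summary(spam_fit)  # coefficient estimates
    AIC(spam_fit)      # the criterion used in the selection steps that follow
    ```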
\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.69 0.09 -7.42 <0.0001
to_multiple1 -2.82 0.31 -9.05 <0.0001
cc 0.03 0.02 1.41 0.1585
attach 0.28 0.08 3.44 6e-04
dollar -0.08 0.02 -3.45 6e-04
winneryes 1.72 0.34 5.09 <0.0001
inherit 0.32 0.15 2.10 0.0355
password -0.79 0.30 -2.64 0.0083
format1 -1.50 0.13 -12.01 <0.0001
re_subj1 -1.92 0.38 -5.10 <0.0001
exclaim_subj 0.26 0.23 1.14 0.2531
sent_email1 -16.67 293.19 -0.06 0.9547
\n \n `````\n :::\n :::\n \n \\vspace{-4mm}\n \n The AIC of the full model is 1863.5. We remove each variable one by one, refit the model, and record the updated AIC.\n \n ::: {.cell}\n \n :::\n \n a. For variable selection, we fit the full model, which includes all variables, and then we also fit each model where we have dropped exactly one of the variables. In each of these reduced models, the AIC value for the model is reported below. Based on these results, which variable, if any, should we drop as part of model selection? Explain.\n\n - None Dropped: 1863.5\n - Drop `to_multiple`: 2023.5\n - Drop `cc`: 1863.2\n - Drop `attach`: 1871.9\n - Drop `dollar`: 1879.7\n - Drop `winner`: 1885\n - Drop `inherit`: 1865.5\n - Drop `password`: 1879.3\n - Drop `format`: 2008.9\n - Drop `re_subj`: 1904.6\n - Drop `exclaim_subj`: 1862.8\n - Drop `sent_email`: 1958.2\n\n b. Consider the subsequent model selection stage (where the variable from part (a) has been removed, and we are considering removal of a second variable). Here again we have computed the AIC for each leave-one-variable-out model. Based on the results, which variable, if any, should we drop as part of model selection? Explain.\n\n ::: {.cell}\n \n :::\n\n - None Dropped: 1862.8\n - Drop `to_multiple`: 2021.5\n - Drop `cc`: 1862.4\n - Drop `attach`: 1871.2\n - Drop `dollar`: 1877.8\n - Drop `winner`: 1885.2\n - Drop `inherit`: 1864.8\n - Drop `password`: 1878.4\n - Drop `format`: 2007\n - Drop `re_subj`: 1904.3\n - Drop `sent_email`: 1957.3\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part c.*\n :::\n \n \\clearpage\n\n c. Consider one more step in the process. Here again we have computed the AIC for each leave-one-variable-out model. Based on the results, which variable, if any, should we drop as part of model selection? Explain.\n\n ::: {.cell}\n \n :::\n\n - None Dropped: 1862.4\n - Drop `to_multiple`: 2019.6\n - Drop `attach`: 1871.2\n - Drop `dollar`: 1877.7\n - Drop `winner`: 1885\n - Drop `inherit`: 1864.5\n - Drop `password`: 1878.2\n - Drop `format`: 2007.4\n - Drop `re_subj`: 1902.9\n - Drop `sent_email`: 1957.6\n\n1. **Spam filtering, prediction.**\nRecall running a logistic regression to aid in spam classification for individual emails. In this exercise, we have taken a small set of the variables and fit a logistic model with the following output:\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.81 0.09 -9.34 <0.0001
to_multiple1 -2.64 0.30 -8.68 <0.0001
winneryes 1.63 0.32 5.11 <0.0001
format1 -1.59 0.12 -13.28 <0.0001
re_subj1 -3.05 0.36 -8.40 <0.0001
\n \n `````\n :::\n :::\n \n a. Write down the model using the coefficients from the model fit.\n\n b. Suppose we have an observation where $\\texttt{to_multiple} = 0$, $\\texttt{winner}= 1$, $\\texttt{format} = 0$, and $\\texttt{re_subj} = 0$. What is the predicted probability that this message is spam?\n\n c. Put yourself in the shoes of a data scientist working on a spam filter. For a given message, how high must the probability a message is spam be before you think it would be reasonable to put it in a *spambox* (which the user is unlikely to check)? What tradeoffs might you consider? Any ideas about how you might make your spam-filtering system even better from the perspective of someone using your email service?\n\n\n:::\n", + "supporting": [ + "09-model-logistic_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/09-model-logistic/figure-html/logit-transformation-1.png b/_freeze/09-model-logistic/figure-html/logit-transformation-1.png new file mode 100644 index 00000000..e3477bbb Binary files /dev/null and b/_freeze/09-model-logistic/figure-html/logit-transformation-1.png differ diff --git a/_freeze/09-model-logistic/figure-html/unnamed-chunk-14-1.png b/_freeze/09-model-logistic/figure-html/unnamed-chunk-14-1.png new file mode 100644 index 00000000..af7605de Binary files /dev/null and b/_freeze/09-model-logistic/figure-html/unnamed-chunk-14-1.png differ diff --git a/_freeze/09-model-logistic/figure-html/unnamed-chunk-20-1.png b/_freeze/09-model-logistic/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 00000000..09309b6e Binary files /dev/null and b/_freeze/09-model-logistic/figure-html/unnamed-chunk-20-1.png differ diff --git a/_freeze/10-model-applications/execute-results/html.json b/_freeze/10-model-applications/execute-results/html.json new file mode 100644 index 00000000..80726e68 --- /dev/null +++ b/_freeze/10-model-applications/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "2433e66b0bfe2c87bcb7c0fb4b0c3d56", + "result": { + "markdown": "# Applications: Model {#model-application}\n\n\n\n\n\n## Case study: Houses for sale {#model-case-study}\n\nTake a walk around your neighborhood and you'll probably see a few houses for sale.\nIf you find a house for sale, you can probably go online and look up its price.\nYou'll quickly note that the prices seem a bit arbitrary -- the homeowners get to decide what the amount they want to list their house for, and many criteria factor into this decision, e.g., what do comparable houses (\"comps\" in real estate speak) sell for, how quickly they need to sell the house, etc.\n\nIn this case study we'll formalize the process of figuring out how much to list a house for by using data on current home sales In November of 2020, information on 98 houses in the Duke Forest neighborhood of Durham, NC were scraped from [Zillow](https://www.zillow.com).\nThe homes were all recently sold at the time of data collection, and the goal of the project was to build a model for predicting the sale price based on a particular home's characteristics.\nThe first four homes are shown in Table \\@ref(tab:duke-data-frame), and descriptions for each variable are shown in Table \\@ref(tab:duke-variables).\n\n::: {.data data-latex=\"\"}\nThe [`duke_forest`](http://openintrostat.github.io/openintro/reference/duke_forest.html) data can be found in the 
[**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\\vspace{-4mm}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Top four rows of the data describing homes for sale in the Duke Forest neighborhood of Durham, NC.
price bed bath area year_built cooling lot
1,520,000 3 4 6,040 1972 central 0.97
1,030,000 5 4 4,475 1969 central 1.38
420,000 2 3 1,745 1959 central 0.51
680,000 4 3 2,091 1961 central 0.84
\n\n`````\n:::\n:::\n\n\n\\vspace{-4mm}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variables and their descriptions for the `duke_forest` dataset.
Variable Description
price Sale price, in USD
bed Number of bedrooms
bath Number of bathrooms
area Area of home, in square feet
year_built Year the home was built
cooling Cooling system: central or other (other is baseline)
lot Area of the entire property, in acres
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n### Correlating with `price`\n\nAs mentioned, the goal of the data collection was to build a model for the sale price of homes.\nWhile using multiple predictor variables is likely preferable to using only one variable, we start by learning about the variables themselves and their relationship to price.\nFigure \\@ref(fig:single-scatter) shows scatterplots describing price as a function of each of the predictor variables.\nAll of the variables seem to be positively associated with price (higher values of the variable are matched with higher price values).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Scatter plots describing six different predictor variables' relationship with the price of a home.](10-model-applications_files/figure-html/single-scatter-1.png){width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIn Figure \\@ref(fig:single-scatter) there does not appear to be a correlation value calculated for the predictor variable, `cooling`.\nWhy not?\nCan the variable still be used in the linear model?\nExplain.[^10-model-applications-1]\n:::\n\n[^10-model-applications-1]: The correlation coefficient can only be calculated to describe the relationship between two numerical variables.\n The predictor variable `cooling` is categorical, not numerical.\n It *can*, however, be used in the linear model as a binary indicator variable coded, for example, with a `1` for central and `0` for other.\n\n::: {.workedexample data-latex=\"\"}\nIn Figure \\@ref(fig:single-scatter) which variable seems to be most informative for predicting house price?\nProvide two reasons for your answer.\n\n------------------------------------------------------------------------\n\nThe `area` of the home is the variable which is most highly correlated with `price`.\nAdditionally, the scatterplot for `price` vs. `area` seems to show a strong linear relationship between the two variables.\nNote that the correlation coefficient and the scatterplot linearity will often give the same conclusion.\nHowever, recall that the correlation coefficient is very sensitive to outliers, so it is always wise to look at the scatterplot even when the variables are highly correlated.\n:::\n\n### Modeling `price` with `area`\n\nA linear model was fit to predict `price` from `area`.\nThe resulting model information is given in Table \\@ref(tab:price-slr).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of least squares fit for price on area.
term estimate std.error statistic p.value
(Intercept) 116,652 53,302 2.19 0.0311
area 159 18 8.78 <0.0001
Adjusted R-sq = 0.4394
df = 96
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nInterpret the value of $b_1$ = 159 in the context of the problem.[^10-model-applications-2]\n:::\n\n[^10-model-applications-2]: For each additional square foot of house, we would expect such houses to cost, on average, \\$159 more.\n\n::: {.guidedpractice data-latex=\"\"}\nUsing the output in Table \\@ref(tab:price-slr), write out the model for predicting `price` from `area`.[^10-model-applications-3]\n:::\n\n[^10-model-applications-3]: $\\widehat{\\texttt{price}} = 116,652 + 159 \\times \\texttt{area}$\n\nThe residuals from the linear model can be used to assess whether a linear model is appropriate.\nFigure \\@ref(fig:price-resid-slr) plots the residuals $e_i = y_i - \\hat{y}_i$ on the $y$-axis and the fitted (or predicted) values $\\hat{y}_i$ on the $x$-axis.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Residuals versus predicted values for the model predicting sale price from area of home.](10-model-applications_files/figure-html/price-resid-slr-1.png){width=70%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat aspect(s) of the residual plot indicate that a linear model is appropriate?\nWhat aspect(s) of the residual plot seem concerning when fitting a linear model?[^10-model-applications-4]\n:::\n\n[^10-model-applications-4]: The residual plot shows that the relationship between `area` and the average `price` of a home is indeed linear.\n However, the residuals are quite large for expensive homes.\n The large residuals indicate potential outliers or increasing variability, either of which could warrant more involved modeling techniques than are presented in this text.\n\n### Modeling `price` with multiple variables\n\nIt seems as though the predictions of home price might be more accurate if more than one predictor variable was used in the linear model.\nTable \\@ref(tab:price-mlr) displays the output from a linear model of `price` regressed on `area`, `bed`, `bath`, `year_built`, `cooling`, and `lot`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of least squares fit for price on multiple predictor variables.
term estimate std.error statistic p.value
(Intercept) -2,910,715 1,787,934 -1.63 0.107
area 102 23 4.42 <0.0001
bed -13,692 25,928 -0.53 0.5987
bath 41,076 24,662 1.67 0.0993
year_built 1,459 914 1.60 0.1139
coolingcentral 84,065 30,338 2.77 0.0068
lot 356,141 75,940 4.69 <0.0001
Adjusted R-sq = 0.5896
df = 90
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nUsing Table \\@ref(tab:price-mlr), write out the linear model of price on the six predictor variables.\n\n------------------------------------------------------------------------\n\n$$\n\\begin{aligned}\n\\widehat{\\texttt{price}} &= -2,910,715 \\\\\n&+ 102 \\times \\texttt{area} - 13,692 \\times \\texttt{bed} \\\\\n&+ 41,076 \\times \\texttt{bath} + 1,459 \\times \\texttt{year_built}\\\\\n&+ 84,065 \\times \\texttt{cooling}_{\\texttt{central}} + 356,141 \\times \\texttt{lot}\n\\end{aligned}\n$$\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nThe value of the estimated coefficient on $\\texttt{cooling}_{\\texttt{central}}$ is $b_5 = 84,065.$ Interpret the value of $b_5$ in the context of the problem.[^10-model-applications-5]\n:::\n\n[^10-model-applications-5]: The coefficient indicates that if all the other variables are kept constant, homes with central air conditioning cost \\$84,065 more, on average.\n\nA friend suggests that maybe you do not need all six variables to have a good model for `price`.\nYou consider taking a variable out, but you aren't sure which one to remove.\n\n\n::: {.cell}\n\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nResults corresponding to the full model for the housing data are shown in Table \\@ref(tab:price-mlr).\nHow should we proceed under the backward elimination strategy?\n\n------------------------------------------------------------------------\n\nOur baseline adjusted $R^2$ from the full model is 0.59, and we need to determine whether dropping a predictor will improve the adjusted $R^2$.\nTo check, we fit models that each drop a different predictor, and we record the adjusted $R^2$:\n\n- Excluding `area`: 0.506\n- Excluding `bed`: 0.593\n- Excluding `bath`: 0.582\n- Excluding `year_built`: 0.583\n- Excluding `cooling`: 0.559\n- Excluding `lot`: 0.489\n\nThe model without `bed` has the highest adjusted $R^2$ of 0.593, higher than the adjusted $R^2$ for the full model.\nBecause eliminating `bed` leads to a model with a higher adjusted $R^2$ than the full model, we drop `bed` from the model.\n\nIt might seem counter-intuitive to exclude information on number of bedrooms from the model.\nAfter all, we would expect homes with more bedrooms to cost more, and we can see a clear relationship between number of bedrooms and sale price in Figure \\@ref(fig:single-scatter).\nHowever, note that `area` is still in the model, and it's quite likely that the area of the home and the number of bedrooms are highly associated.\nTherefore, the model already has information on \"how much space is available in the house\" with the inclusion of `area`.\n\nSince we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors.\nOur baseline adjusted $R^2$ is now 0.593.\nWe fit another set of new models, which consider eliminating each of the remaining predictors in addition to `bed`:\n\n\n::: {.cell}\n\n:::\n\n\n- Excluding `bed` and `area`: 0.51\n- Excluding `bed` and `bath`: 0.586\n- Excluding `bed` and `year_built`: 0.586\n- Excluding `bed` and `cooling`: 0.563\n- Excluding `bed` and `lot`: 0.493\n\nNone of these models lead to an improvement in adjusted $R^2$, so we do not eliminate any of the remaining predictors.\n:::\n\n\\clearpage\n\nThat is, after backward elimination, we are left with the model that keeps all predictors except `bed`, which we can summarize using the coefficients from Table \\@ref(tab:price-full-except-bed).\n\n\n::: {.cell}\n::: 
{.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of least squares fit for price on multiple predictor variables, excluding number of bedrooms.
term estimate std.error statistic p.value
(Intercept) -2,952,641 1,779,079 -1.66 0.1004
area 99 22 4.44 <0.0001
bath 36,228 22,799 1.59 0.1155
year_built 1,466 910 1.61 0.1107
coolingcentral 83,856 30,215 2.78 0.0067
lot 357,119 75,617 4.72 <0.0001
Adjusted R-sq = 0.5929
df = 91
\n\n`````\n:::\n:::\n\n\n\\vspace{-4mm}\n\nThen, the linear model for predicting sale price based on this model is as follows:\n\n$$ \n\\begin{aligned}\n\\widehat{\\texttt{price}} &= -2,952,641 + 99 \\times \\texttt{area}\\\\ \n&+ 36,228 \\times \\texttt{bath} + 1,466 \\times \\texttt{year_built}\\\\\n&+ 83,856 \\times \\texttt{cooling}_{\\texttt{central}} + 357,119 \\times \\texttt{lot}\n\\end{aligned}\n$$\n\n::: {.workedexample data-latex=\"\"}\nThe residual plot for the model with all of the predictor variables except `bed` is given in Figure \\@ref(fig:price-resid-mlr-nobed).\nHow do the residuals in Figure \\@ref(fig:price-resid-mlr-nobed) compare to the residuals in Figure \\@ref(fig:price-resid-slr)?\n\n------------------------------------------------------------------------\n\nThe residuals, for the most part, are randomly scattered around 0.\nHowever, there is one extreme outlier with a residual of -\\$750,000, a house whose actual sale price is a lot lower than its predicted price.\nAlso, we observe again that the residuals are quite large for expensive homes.\n:::\n\n\\vspace{-4mm}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Residuals versus predicted values for the model predicting sale price from all predictors except for number of bedrooms.](10-model-applications_files/figure-html/price-resid-mlr-nobed-1.png){width=70%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nConsider a house with 1,803 square feet, 2.5 bathrooms, 0.145 acres, built in 1941, that has central air conditioning.\nWhat is the predicted price of the home?[^10-model-applications-6]\n:::\n\n[^10-model-applications-6]: $\\widehat{\\texttt{price}} = -2,952,641 + 99 \\times 1803\\\\ + 36,228 \\times 2.5 + 1,466 \\times 1941\\\\ + 83,856 \\times 1 + 357,119 \\times 0.145\\\\ = \\$297,570.$\n\n::: {.guidedpractice data-latex=\"\"}\nIf you later learned that the house (with a predicted price of \\$297,570) had recently sold for \\$804,133, would you think the model was terrible?\nWhat if you learned that the house was in California?[^10-model-applications-7]\n:::\n\n[^10-model-applications-7]: A residual of \\$506,563 is reasonably big.\n Note that the large residuals (except a few homes) in Figure \\@ref(fig:price-resid-mlr-nobed) are closer to \\$250,000 (about half as big).\n After we learn that the house is in California, we realize that the model shouldn't be applied to the new home at all!\n The original data are from Durham, NC, and models based on the Durham, NC data should be used only to explore patterns in prices for homes in Durham, NC.\n\n\\clearpage\n\n## Interactive R tutorials {#sec-model-tutorials}\n\nNavigate the concepts you've learned in this chapter in R using the following self-paced tutorials.\nAll you need is your browser to get started!\n\n::: {.alltutorials data-latex=\"\"}\n[Tutorial 3: Regression modeling](https://openintrostat.github.io/ims-tutorials/03-model/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintrostat.github.io/ims-tutorials/03-model\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 1: Visualizing two variables](https://openintro.shinyapps.io/ims-03-model-01/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-01\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 2: Correlation](https://openintro.shinyapps.io/ims-03-model-02/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-02\n:::\n\n:::\n\n::: 
{.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 3: Simple linear regression](https://openintro.shinyapps.io/ims-03-model-03/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-03\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 4: Interpreting regression models](https://openintro.shinyapps.io/ims-03-model-04/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-04\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 5: Model fit](https://openintro.shinyapps.io/ims-03-model-05/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-05\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 6: Parallel slopes](https://openintro.shinyapps.io/ims-03-model-06/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-06\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 7: Evaluating and extending parallel slopes model](https://openintro.shinyapps.io/ims-03-model-07/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-07\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 8: Multiple regression](https://openintro.shinyapps.io/ims-03-model-08/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-08\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 9: Logistic regression](https://openintro.shinyapps.io/ims-03-model-09/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-09\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 3 - Lesson 10: Case study: Italian restaurants in NYC](https://openintro.shinyapps.io/ims-03-model-10/)\\\n::: {.content-hidden unless-format=\"pdf\"} https://openintro.shinyapps.io/ims-03-model-10\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of tutorials supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).\n:::\n\n## R labs {#model-labs}\n\nFurther apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.\n\n::: {.singlelab data-latex=\"\"}\n[Introduction to linear regression - Human Freedom Index](https://www.openintro.org/go?id=ims-r-lab-model)\\\n::: {.content-hidden unless-format=\"pdf\"} https://www.openintro.org/go?i\nd=ims-r-lab-model\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of labs supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).\n:::\n", + "supporting": [ + "10-model-applications_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/10-model-applications/figure-html/price-resid-mlr-nobed-1.png b/_freeze/10-model-applications/figure-html/price-resid-mlr-nobed-1.png new file mode 100644 index 00000000..562563dd Binary files /dev/null and 
b/_freeze/10-model-applications/figure-html/price-resid-mlr-nobed-1.png differ diff --git a/_freeze/10-model-applications/figure-html/price-resid-slr-1.png b/_freeze/10-model-applications/figure-html/price-resid-slr-1.png new file mode 100644 index 00000000..a377d955 Binary files /dev/null and b/_freeze/10-model-applications/figure-html/price-resid-slr-1.png differ diff --git a/_freeze/10-model-applications/figure-html/single-scatter-1.png b/_freeze/10-model-applications/figure-html/single-scatter-1.png new file mode 100644 index 00000000..4d1fa443 Binary files /dev/null and b/_freeze/10-model-applications/figure-html/single-scatter-1.png differ diff --git a/_freeze/11-foundations-randomization/execute-results/html.json b/_freeze/11-foundations-randomization/execute-results/html.json new file mode 100644 index 00000000..9e825fdf --- /dev/null +++ b/_freeze/11-foundations-randomization/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "8a948542f664841e39f9a3f40bbae52c", + "result": { + "markdown": "\n\n\n# Hypothesis testing with randomization {#sec-foundations-randomization}\n\n::: {.chapterintro data-latex=\"\"}\nStatistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates.\nWhile the equations and details change depending on the setting, the foundations for inference are the same throughout all of statistics.\n\nWe start with two case studies designed to motivate the process of making decisions about research claims.\nWe formalize the process through the introduction of the **hypothesis testing framework**\\index{hypothesis test}, which allows us to formally evaluate claims about the population.\n:::\n\n\n\n\n\nThroughout the book so far, you have worked with data in a variety of contexts.\nYou have learned how to summarize and visualize the data as well as how to model multiple variables at the same time.\nSometimes the dataset at hand represents the entire research question.\nBut more often than not, the data have been collected to answer a research question about a larger group of which the data are a (hopefully) representative subset.\n\nYou may agree that there is almost always variability in data -- one dataset will not be identical to a second dataset even if they are both collected from the same population using the same methods.\nHowever, quantifying the variability in the data is neither obvious nor easy to do, i.e., answering the question \"*how* different is one dataset from another?\" is not trivial.\n\nFirst, a note on notation.\nWe generally use $p$ to denote a population proportion and $\\hat{p}$ to a sample proportion.\nSimilarly, we generally use $\\mu$ to denote a population mean and $\\bar{x}$ to denote a sample mean.\n\n::: {.workedexample data-latex=\"\"}\nSuppose your professor splits the students in your class into two groups: students who sit on the left side of the classroom and students who sit on the right side of the classroom.\nIf $\\hat{p}_{L}$ represents the proportion of students who prefer to read books on screen who sit on the left side of the classroom and $\\hat{p}_{R}$ represents the proportion of students who prefer to read books on screen who sit on the right side of the classroom, would you be surprised if $\\hat{p}_{L}$ did not *exactly* equal $\\hat{p}_{R}$?\n\n------------------------------------------------------------------------\n\nWhile the proportions $\\hat{p}_{L}$ and $\\hat{p}_{R}$ would probably be close to each other, it would be unusual for them to be exactly the 
same.\nWe would probably observe a small difference due to *chance*.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf we do not think the side of the room a person sits on in class is related to whether they prefer to read books on screen, what assumption are we making about the relationship between these two variables?[^11-foundations-randomization-1]\n:::\n\n[^11-foundations-randomization-1]: We would be assuming that these two variables are **independent**\\index{independent}.\n\n\n\n\n\nStudying randomness of this form is a key focus of statistics.\nThroughout this chapter, and those that follow, we provide three different approaches for quantifying the variability inherent in data: randomization, bootstrapping, and mathematical models.\nUsing the methods provided in this chapter, we will be able to draw conclusions beyond the dataset at hand to research questions about larger populations that the samples come from.\n\nThe first type of variability we will explore comes from experiments where the explanatory variable (or treatment) is randomly assigned to the observational units.\nAs you learned in Chapter \\@ref(data-hello), a randomized experiment can be used to assess whether one variable (the explanatory variable) causes changes in a second variable (the response variable).\nEvery dataset has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison.\nThat is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level.\nBy reassigning the treatments many many times, we can compare the actual experiment to the sham experiment.\nIf the actual experiment has more extreme results than any of the sham experiments, we are led to believe that it is the explanatory variable which is causing the result and not just variability inherent to the data.\nUsing a few different case studies, let's look more carefully at this idea of a **randomization test**\\index{randomization test}.\n\n\n\n\n\n## Sex discrimination case study {#caseStudySexDiscrimination}\n\nWe consider a study investigating sex discrimination in the 1970s, which is set in the context of personnel decisions within a bank.\nThe research question we hope to answer is, \"Are individuals who identify as female discriminated against in promotion decisions made by their managers who identify as male?\" [@Rosen:1974]\n\n::: {.data data-latex=\"\"}\nThe [`sex_discrimination`](http://openintrostat.github.io/openintro/reference/sex_discrimination.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThis study considered sex roles, and only allowed for options of \"male\" and \"female\".\nWe should note that the identities being considered are not gender identities and that the study allowed only for a binary classification of sex.\n\n### Observed data\n\nThe participants in this study were 48 bank supervisors who identified as male, attending a management institute at the University of North Carolina in 1972.\nThey were asked to assume the role of the personnel director of a bank and were given a personnel file to judge whether the person should be promoted to a branch manager position.\nThe files given to the participants were identical, except that half of them indicated the candidate identified as male and 
the other half indicated the candidate identified as female.\nThese files were randomly assigned to the bank managers.\n\n::: {.guidedpractice data-latex=\"\"}\nIs this an observational study or an experiment?\nHow does the type of study impact what can be inferred from the results?[^11-foundations-randomization-2]\n:::\n\n[^11-foundations-randomization-2]: The study is an experiment, as subjects were randomly assigned a \"male\" file or a \"female\" file (remember, all the files were actually identical in content).\n Since this is an experiment, the results can be used to evaluate a causal relationship between the sex of a candidate and the promotion decision.\n\n\n::: {.cell}\n\n:::\n\n\nFor each supervisor both the sex associated with the assigned file and the promotion decision were recorded.\nUsing the results of the study summarized in Table \\@ref(tab:sex-discrimination-obs), we would like to evaluate if individuals who identify as female are unfairly discriminated against in promotion decisions.\nIn this study, a smaller proportion of female identifying applications were promoted than males (0.583 versus 0.875), but it is unclear whether the difference provides *convincing evidence* that individuals who identify as female are unfairly discriminated against.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary results for the sex discrimination study.
decision
sex promoted not promoted Total
male 21 3 24
female 14 10 24
Total 35 13 48
\n\n`````\n:::\n:::\n\n\nThe data are visualized in Figure \\@ref(fig:sex-rand-obs) as a set of cards.\nNote that each card denotes a personnel file (an observation from our dataset) and the colors indicate the decision: red for promoted and white for not promoted.\nAdditionally, the observations are broken up into groups of male and female identifying groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The sex discrimination study can be thought of as 48 red and white cards.](images/sex-rand-01-obs.png){fig-alt='48 cards are laid out; 24 indicating male files, 24 indicated female files. Of the 24 male files 3 of the cards are colored white, and 21 of the cards are colored red. Of the female files, 10 of the cards are colored white, and 14 of the cards are colored red.' width=40%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nStatisticians are sometimes called upon to evaluate the strength of evidence.\nWhen looking at the rates of promotion in this study, why might we be tempted to immediately conclude that individuals identifying as female are being discriminated against?\n\n------------------------------------------------------------------------\n\nThe large difference in promotion rates (58.3% for female personnel versus 87.5% for male personnel) suggest there might be discrimination against women in promotion decisions.\nHowever, we cannot yet be sure if the observed difference represents discrimination or is just due to random chance when there is no discrimination occurring.\nSince we wouldn't expect the sample proportions to be *exactly* equal, even if the truth was that the promotion decisions were independent of sex, we can't rule out random chance as a possible explanation when simply comparing the sample proportions.\n:::\n\nThe previous example is a reminder that the observed outcomes in the sample may not perfectly reflect the true relationships between variables in the underlying population.\nTable \\@ref(tab:sex-discrimination-obs) shows there were 7 fewer promotions for female identifying personnel than for the male personnel, a difference in promotion rates of 29.2% $\\left( \\frac{21}{24} - \\frac{14}{24} = 0.292 \\right).$ This observed difference is what we call a **point estimate**\\index{point estimate} of the true difference.\nThe point estimate of the difference in promotion rate is large, but the sample size for the study is small, making it unclear if this observed difference represents discrimination or whether it is simply due to chance when there is no discrimination occurring.\nChance can be thought of as the claim due to natural variability; discrimination can be thought of as the claim the researchers set out to demonstrate.\nWe label these two competing claims, $H_0$ and $H_A:$\n\n\n\n\n\n\\vspace{-2mm}\n\n- $H_0:$ **Null hypothesis**\\index{null hypothesis}. The variables `sex` and `decision` are independent. They have no relationship, and the observed difference between the proportion of males and females who were promoted, 29.2%, was due to the natural variability inherent in the population.\n- $H_A:$ **Alternative hypothesis**\\index{alternative hypothesis}. The variables `sex` and `decision` are *not* independent. 
The difference in promotion rates of 29.2% was not due to natural variability, and equally qualified female personnel are less likely to be promoted than male personnel.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Hypothesis testing.**\n\nThese hypotheses are part of what is called a **hypothesis test**\\index{hypothesis test}.\nA hypothesis test is a statistical technique used to evaluate competing claims using data.\nOften times, the null hypothesis takes a stance of *no difference* or *no effect*.\nThis hypothesis assumes that any differences seen are due to the variability inherent in the population and could have occurred by random chance.\n\nIf the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.\n\nThere are many nuances to hypothesis testing, so do not worry if you aren't a master of hypothesis testing at the end of this section.\nWe'll discuss these ideas and details many times in this chapter as well as in the chapters that follow.\n:::\n\n\n\n\n\nWhat would it mean if the null hypothesis, which says the variables `sex` and `decision` are unrelated, was true?\nIt would mean each banker would decide whether to promote the candidate without regard to the sex indicated on the personnel file.\nThat is, the difference in the promotion percentages would be due to the natural variability in how the files were randomly allocated to different bankers, and this randomization just happened to give rise to a relatively large difference of 29.2%.\n\nConsider the alternative hypothesis: bankers were influenced by which sex was listed on the personnel file.\nIf this was true, and especially if this influence was substantial, we would expect to see some difference in the promotion rates of male and female candidates.\nIf this sex bias was against female candidates, we would expect a smaller fraction of promotion recommendations for female personnel relative to the male personnel.\n\nWe will choose between the two competing claims by assessing if the data conflict so much with $H_0$ that the null hypothesis cannot be deemed reasonable.\nIf data and the null claim seem to be at odds with one another, and the data seem to support $H_A,$ then we will reject the notion of independence and conclude that the data provide evidence of discrimination.\n\n\\vspace{-2mm}\n\n### Variability of the statistic\n\nTable \\@ref(tab:sex-discrimination-obs) shows that 35 bank supervisors recommended promotion and 13 did not.\nNow, suppose the bankers' decisions were independent of the sex of the candidate.\nThen, if we conducted the experiment again with a different random assignment of sex to the files, differences in promotion rates would be based only on random fluctuation in promotion decisions.\nWe can actually perform this **randomization**, which simulates what would have happened if the bankers' decisions had been independent of `sex` but we had distributed the file sexes differently.[^11-foundations-randomization-3]\n\n[^11-foundations-randomization-3]: The test procedure we employ in this section is sometimes referred to as a **randomization test**.\n If the explanatory variable had not been randomly assigned, as in an observational study, the procedure would be referred to as a **permutation test**.\n Permutation tests are used for observational studies, where the explanatory variable was not randomly assigned.\\index{permutation test}.\n\n\n\n\n\nIn the **simulation**\\index{simulation}, we thoroughly shuffle the 48 
personnel files, 35 labelled `promoted` and 13 labelled `not promoted`, together and we deal files into two new stacks.\nNote that by keeping 35 promoted and 13 not promoted, we are assuming that 35 of the bank managers would have promoted the individual whose content is contained in the file **independent** of the sex indicated on their file.\nWe will deal 24 files into the first stack, which will represent the 24 \"female\" files.\nThe second stack will also have 24 files, and it will represent the 24 \"male\" files.\nFigure \\@ref(fig:sex-rand-shuffle-1) highlights both the shuffle and the reallocation to the sham sex groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The sex discrimination data is shuffled and reallocated to new groups of male and female files.](images/sex-rand-02-shuffle-1.png){fig-alt='The 48 red and white cards which denote the original data are shuffled and reassigned, 24 to each group indicating 24 male files and 24 female files.' width=80%}\n:::\n:::\n\n\nThen, as we did with the original data, we tabulate the results and determine the fraction of personnel files designated as \"male\" and \"female\" who were promoted.\n\n\n\n\n\nSince the randomization of files in this simulation is independent of the promotion decisions, any difference in promotion rates is due to chance.\nTable \\@ref(tab:sex-discrimination-rand-1) show the results of one such simulation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Simulation results, where the difference in promotion rates between male and female is purely due to random chance.
decision
sex promoted not promoted Total
male 18 6 24
female 17 7 24
Total 35 13 48
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the difference in promotion rates between the two simulated groups in Table \\@ref(tab:sex-discrimination-rand-1) ?\nHow does this compare to the observed difference 29.2% from the actual study?[^11-foundations-randomization-4]\n:::\n\n[^11-foundations-randomization-4]: $18/24 - 17/24=0.042$ or about 4.2% in favor of the male personnel.\n This difference due to chance is much smaller than the difference observed in the actual groups.\n\nFigure \\@ref(fig:sex-rand-shuffle-1-sort) shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 \\> 0.042).\nThe quantity of interest throughout this case study has been the difference in promotion rates.\nWe call the summary value the **statistic** of interest (or often the **test statistic**).\nWhen we encounter different data structures, the statistic is likely to change (e.g., we might calculate an average instead of a proportion), but we will always want to understand how the statistic varies from sample to sample.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![We summarize the randomized data to produce one estimate of the difference in proportions given no sex discrimination. Note that the sort step is only used to make it easier to visually calculate the simulated sample proportions.](images/sex-rand-03-shuffle-1-sort.png){fig-alt='The 48 red and white cards are show in three panels. The first panel represents the original data and original allocation of the male and female files (in the original data there are 3 white cards in the male group and 10 white cards in the female group). The second panel represents the shuffled red and white cards that are randomly assigned as male and female files. The third panel has the cards sorted according to the random assignment of female or male. In the third panel there are 6 white cards in the male group and 7 white cards in the female group.' width=100%}\n:::\n:::\n\n\n### Observed statistic vs. null statistics\n\nWe computed one possible difference under the null hypothesis in Guided Practice, which represents one difference due to chance when the null hypothesis is assumed to be true.\nWhile in this first simulation, we physically dealt out files, it is much more efficient to perform this simulation using a computer.\nRepeating the simulation on a computer, we get another difference due to chance under the same assumption: -0.042.\nAnd another: 0.208.\nAnd so on until we repeat the simulation enough times that we have a good idea of the shape of the *distribution of differences* under the null hypothesis.\nFigure \\@ref(fig:sex-rand-dot-plot) shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:sex-rand-dot-plot-cap)](11-foundations-randomization_files/figure-html/sex-rand-dot-plot-1.png){width=100%}\n:::\n:::\n\n\n(ref:sex-rand-dot-plot-cap) A stacked dot plot of differences from 100 simulations produced under the null hypothesis, $H_0,$ where the simulated sex and decision are independent. 
Two of the 100 simulations had a difference of at least 29.2%, the difference observed in the study, and are shown as solid blue dots.\n\nNote that the distribution of these simulated differences in proportions is centered around 0.\nUnder the null hypothesis our simulations made no distinction between male and female personnel files.\nThus, a center of 0 makes sense: we should expect differences from chance alone to fall around zero with some random fluctuation for each simulation.\n\n::: {.workedexample data-latex=\"\"}\nHow often would you observe a difference of at least 29.2% (0.292) according to Figure \\@ref(fig:sex-rand-dot-plot)?\nOften, sometimes, rarely, or never?\n\n------------------------------------------------------------------------\n\nIt appears that a difference of at least 29.2% under the null hypothesis would only happen about 2% of the time according to Figure \\@ref(fig:sex-rand-dot-plot).\nSuch a low probability indicates that observing such a large difference from chance alone is rare.\n:::\n\nThe difference of 29.2% is a rare event if there really is no impact from listing sex in the candidates' files, which provides us with two possible interpretations of the study results:\n\n- If $H_0,$ the **Null hypothesis** is true: Sex has no effect on promotion decision, and we observed a difference that is so large that it would only happen rarely.\n\n- If $H_A,$ the **Alternative hypothesis** is true: Sex has an effect on promotion decision, and what we observed was actually due to equally qualified female candidates being discriminated against in promotion decisions, which explains the large difference of 29.2%.\n\nWhen we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.[^11-foundations-randomization-5]\nIn our analysis, we determined that there was only a $\\approx$ 2% probability of obtaining a sample where $\\geq$ 29.2% more male candidates than female candidates get promoted under the null hypothesis, so we conclude that the data provide strong evidence of sex discrimination against female candidates by the male supervisors.\nIn this case, we reject the null hypothesis in favor of the alternative.\n\n[^11-foundations-randomization-5]: This reasoning does not generally extend to anecdotal observations.\n Each of us observes incredibly rare events every day, events we could not possibly hope to predict.\n However, in the non-rigorous setting of anecdotal evidence, almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous.\n For example, we might look at the lottery: there was only a 1 in 176 million chance that the Mega Millions numbers for the largest jackpot in history (October 23, 2018) would be (5, 28, 62, 65, 70) with a Mega ball of (5), but nonetheless those numbers came up!\n However, no matter what numbers had turned up, they would have had the same incredibly rare odds.\n That is, *any set of numbers we could have observed would ultimately be incredibly rare*.\n This type of situation is typical of our daily lives: each possible event in itself seems incredibly rare, but if we consider every alternative, those outcomes are also incredibly rare.\n We should be cautious not to misinterpret such anecdotal evidence.\n\n**Statistical inference** is the practice of making decisions and conclusions from data in the context of uncertainty.\nErrors do occur, just like rare events, and the 
dataset at hand might lead us to the wrong conclusion.\nWhile a given dataset may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur.\nBefore getting into the nuances of hypothesis testing, let's work through another case study.\n\n\n\n\n\n## Opportunity cost case study {#caseStudyOpportunityCost}\n\nHow rational and consistent is the behavior of the typical American college student?\nIn this section, we'll explore whether college student consumers always consider the following: money not spent now can be spent later.\n\nIn particular, we are interested in whether reminding students about this well-known fact about money causes them to be a little thriftier.\nA skeptic might think that such a reminder would have no impact.\nWe can summarize the two different perspectives using the null and alternative hypothesis framework.\n\n- $H_0:$ **Null hypothesis**. Reminding students that they can save money for later purchases will not have any impact on students' spending decisions.\n- $H_A:$ **Alternative hypothesis**. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase.\n\nIn this section, we'll explore an experiment conducted by researchers that investigates this very question for students at a university in the southwestern United States.\n[@Frederick:2009]\n\n### Observed data\n\nOne-hundred and fifty students were recruited for the study, and each was given the following statement:\n\n> *Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of \\$14.99. What would you do in this situation? Please circle one of the options below.*[^11-foundations-randomization-6]\n\n[^11-foundations-randomization-6]: This context might feel strange if physical video stores predate you.\n If you're curious about what those were like, look up \"Blockbuster\".\n\nHalf of the 150 students were randomized into a control group and were given the following two options:\n\n> (A) Buy this entertaining video.\n\n> (B) Not buy this entertaining video.\n\nThe remaining 75 students were placed in the treatment group, and they saw a slightly modified option (B):\n\n> (A) Buy this entertaining video.\n\n> (B) Not buy this entertaining video. Keep the \\$14.99 for other purchases.\n\nWould the extra statement reminding students of an obvious fact impact the purchasing decision?\nTable \\@ref(tab:opportunity-cost-obs) summarizes the study results.\n\n::: {.data data-latex=\"\"}\nThe [`opportunity_cost`](http://openintrostat.github.io/openintro/reference/opportunity_cost.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary results of the opportunity cost study.
decision
group buy video not buy video Total
control 56 19 75
treatment 41 34 75
Total 97 53 150
\n\n`````\n:::\n:::\n\n\nIt might be a little easier to review the results using a visualization.\nFigure \\@ref(fig:opportunity-cost-obs-bar) shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Stacked bar plot of results of the opportunity cost study.](11-foundations-randomization_files/figure-html/opportunity-cost-obs-bar-1.png){width=100%}\n:::\n:::\n\n\nAnother useful way to review the results from Table \\@ref(tab:opportunity-cost-obs) is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video.\nThese summaries are given in Table \\@ref(tab:opportunity-cost-obs-row-prop).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n\n
The opportunity cost data are summarized using row proportions. Row proportions are particularly useful here since we can view the proportion of *buy* and *not buy* decisions in each group.
decision
group buy video not buy video Total
control 0.747 0.253 1
treatment 0.547 0.453 1
\n\n`````\n:::\n:::\n\n\nWe will define a **success**\\index{success} in this study as a student who chooses not to buy the video.[^11-foundations-randomization-7]\nThen, the value of interest is the change in video purchase rates that results by reminding students that not spending money now means they can spend the money later.\n\n[^11-foundations-randomization-7]: Success is often defined in a study as the outcome of interest, and a \"success\" may or may not actually be a positive outcome.\n For example, researchers working on a study on COVID prevalence might define a \"success\" in the statistical sense as a patient who has COVID-19.\n A more complete discussion of the term **success** will be given in Chapter \\@ref(inference-one-prop).\n\n\n\n\n\nWe can construct a point estimate for this difference as ($T$ for treatment and $C$ for control):\n\n$$\\hat{p}_{T} - \\hat{p}_{C} = \\frac{34}{75} - \\frac{19}{75} = 0.453 - 0.253 = 0.200$$\n\nThe proportion of students who chose not to buy the video was 20 percentage points higher in the treatment group than the control group.\nIs this 20% difference between the two groups so prominent that it is unlikely to have occurred from chance alone, if there is no difference between the spending habits of the two groups?\n\n### Variability of the statistic\n\nThe primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students.\nBecause this is an experiment, we'll use the same procedure we applied in Section \\@ref(caseStudySexDiscrimination): randomization.\n\nLet's think about the data in the context of the hypotheses.\nIf the null hypothesis $(H_0)$ was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to random chance.\nIf, on the other hand, the alternative hypothesis $(H_A)$ is true, then the difference indicates that reminding students about saving for later purchases actually impacts their buying decisions.\n\n### Observed statistic vs. 
null statistics\n\nJust like with the sex discrimination study, we can perform a statistical analysis.\nUsing the same randomization technique from the last section, let's see what happens when we simulate the experiment under the scenario where there is no effect from the treatment.\n\nWhile we would in reality do this simulation on a computer, it might be useful to think about how we would go about carrying out the simulation without a computer.\nWe start with 150 index cards and label each card to indicate the distribution of our response variable: `decision`.\nThat is, 53 cards will be labeled \"not buy video\" to represent the 53 students who opted not to buy, and 97 will be labeled \"buy video\" for the other 97 students.\nThen we shuffle these cards thoroughly and divide them into two stacks of size 75, representing the simulated treatment and control groups.\nBecause we have shuffled the cards from both groups together, assuming no difference in their purchasing behavior, any observed difference between the proportions of \"not buy video\" cards (what we earlier defined as *success*) can be attributed entirely to chance.\n\n::: {.workedexample data-latex=\"\"}\nIf we are randomly assigning the cards into the simulated treatment and control groups, how many \"not buy video\" cards would we expect to end up in each simulated group?\nWhat would be the expected difference between the proportions of \"not buy video\" cards in each group?\n\n------------------------------------------------------------------------\n\nSince the simulated groups are of equal size, we would expect $53 / 2 = 26.5,$ i.e., 26 or 27, \"not buy video\" cards in each simulated group, yielding a simulated point estimate of the difference in proportions of 0% .\nHowever, due to random chance, we might also expect to sometimes observe a number a little above or below 26 and 27.\n:::\n\nThe results of a single randomization is shown in Table \\@ref(tab:opportunity-cost-obs-simulated).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of student choices against their simulated groups. The group assignment had no connection to the student decisions, so any difference between the two groups is due to chance.
decision
group buy video not buy video Total
control 46 29 75
treatment 51 24 75
Total 97 53 150
\n\n`````\n:::\n:::\n\n\nFrom this table, we can compute a difference that occurred from the first shuffle of the data (i.e., from chance alone):\n\n$$\\hat{p}_{T, shfl1} - \\hat{p}_{C, shfl1} = \\frac{24}{75} - \\frac{29}{75} = 0.32 - 0.387 = - 0.067$$\n\nJust one simulation will not be enough to get a sense of what sorts of differences would happen from chance alone.\n\n\n::: {.cell}\n\n:::\n\n\nWe'll simulate another set of simulated groups and compute the new difference: 0.04.\n\nAnd again: 0.12.\n\nAnd again: -0.013.\n\nWe'll do this 1,000 times.\n\nThe results are summarized in a dot plot in Figure \\@ref(fig:opportunity-cost-rand-dot-plot), where each point represents the difference from one randomization.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:opportunity-cost-rand-dot-plot-cap)](11-foundations-randomization_files/figure-html/opportunity-cost-rand-dot-plot-1.png){width=90%}\n:::\n:::\n\n\n(ref:opportunity-cost-rand-dot-plot-cap) A stacked dot plot of 1,000 simulated (null) differences produced under the null hypothesis, $H_0.$ Six of the 1,000 simulations had a difference of at least 20% , which was the difference observed in the study.\n\nSince there are so many points and it is difficult to discern one point from the other, it is more convenient to summarize the results in a histogram such as the one in Figure \\@ref(fig:opportunity-cost-rand-hist), where the height of each histogram bar represents the number of simulations resulting in an outcome of that magnitude.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A histogram of 1,000 chance differences produced under the null hypothesis. Histograms like this one are a convenient representation of data or results when there are a large number of simulations.](11-foundations-randomization_files/figure-html/opportunity-cost-rand-hist-1.png){width=90%}\n:::\n:::\n\n\nUnder the null hypothesis (no treatment effect), we would observe a difference of at least +20% about 0.6% of the time.\nThat is really rare!\nInstead, we will conclude the data provide strong evidence there is a treatment effect: reminding students before a purchase that they could instead spend the money later on something else lowers the chance that they will continue with the purchase.\nNotice that we are able to make a causal statement for this study since the study is an experiment, although we do not know why the reminder induces a lower purchase rate.\n\n## Hypothesis testing {#HypothesisTesting}\n\nIn the last two sections, we utilized a **hypothesis test**\\index{hypothesis test}, which is a formal technique for evaluating two competing possibilities.\nIn each scenario, we described a **null hypothesis**\\index{null hypothesis}, which represented either a skeptical perspective or a perspective of no difference.\nWe also laid out an **alternative hypothesis**\\index{alternative hypothesis}, which represented a new perspective such as the possibility of a relationship between two variables or a treatment effect in an experiment.\nThe alternative hypothesis is usually the reason the scientists set out to do the research in the first place.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Null and alternative hypotheses.**\n\nThe **null hypothesis** $(H_0)$ often represents either a skeptical perspective or a claim of \"no difference\" to be tested.\n\nThe **alternative hypothesis** $(H_A)$ represents an alternative claim under consideration and is often represented by a range of possible values for the value of interest.\n:::\n\nIf a person makes a 
somewhat unbelievable claim, we are initially skeptical.\nHowever, if there is sufficient evidence that supports the claim, we set aside our skepticism.\nThe hallmarks of hypothesis testing are also found in the US court system.\n\n### The US court system\n\nIn the US court system, jurors evaluate the evidence to see whether it convincingly shows a defendant is guilty.\nDefendants are considered to be innocent until proven otherwise.\n\n::: {.workedexample data-latex=\"\"}\nThe US court considers two possible claims about a defendant: they are either innocent or guilty.\n\nIf we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?\n\n------------------------------------------------------------------------\n\nThe jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person's guilt.\nThat is, the skeptical perspective (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis).\n:::\n\nJurors examine the evidence to see whether it convincingly shows a defendant is guilty.\nNotice that if a jury finds a defendant *not guilty*, this does not necessarily mean the jury is confident in the person's innocence.\nThey are simply not convinced of the alternative, that the person is guilty.\nThis is also the case with hypothesis testing: *even if we fail to reject the null hypothesis, we do not accept the null hypothesis as truth*.\n\nFailing to find evidence in favor of the alternative hypothesis is not equivalent to finding evidence that the null hypothesis is true.\nWe will see this idea in greater detail in Section \\@ref(decerr).\n\n### p-value and statistical significance\n\nIn Section \\@ref(caseStudySexDiscrimination) we encountered a study from the 1970's that explored whether there was strong evidence that female candidates were less likely to be promoted than male candidates.\nThe research question -- are female candidates discriminated against in promotion decisions?\n-- was framed in the context of hypotheses:\n\n- $H_0:$ Sex has no effect on promotion decisions.\n\n- $H_A:$ Female candidates are discriminated against in promotion decisions.\n\nThe null hypothesis $(H_0)$ was a perspective of no difference in promotion.\nThe data on sex discrimination provided a point estimate of a 29.2% difference in recommended promotion rates between male and female candidates.\nWe determined that such a difference from chance alone, assuming the null hypothesis was true, would be rare: it would only happen about 2 in 100 times.\nWhen results like these are inconsistent with $H_0,$ we reject $H_0$ in favor of $H_A.$ Here, we concluded there was discrimination against female candidates.\n\nThe 2-in-100 chance is what we call a **p-value**, which is a probability quantifying the strength of the evidence against the null hypothesis, given the observed data.\n\n::: {.important data-latex=\"\"}\n**p-value.**\n\nThe **p-value**\\index{hypothesis testing!p-value} is the probability of observing data at least as favorable to the alternative hypothesis as our current dataset, if the null hypothesis were true.\nWe typically use a summary statistic of the data, such as a difference in proportions, to help compute the p-value and evaluate the hypotheses.\nThis summary value that is used to compute the p-value is often called the **test statistic**\\index{test statistic}.\n:::\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nIn the sex discrimination study, the difference in discrimination rates was our test statistic.\nWhat was the test statistic in the opportunity cost study covered in Section \\@ref(caseStudyOpportunityCost)?\n\n------------------------------------------------------------------------\n\nThe test statistic in the opportunity cost study was the difference in the proportion of students who decided against the video purchase in the treatment and control groups.\nIn each of these examples, the **point estimate** of the difference in proportions was used as the test statistic.\n:::\n\nWhen the p-value is small, i.e., less than a previously set threshold, we say the results are **statistically significant**\\index{statistically significant}.\nThis means the data provide such strong evidence against $H_0$ that we reject the null hypothesis in favor of the alternative hypothesis.\nThe threshold is called the **significance level**\\index{hypothesis testing!significance level}\\index{significance level} and often represented by $\\alpha$ (the Greek letter *alpha*).\nThe value of $\\alpha$ represents how rare an event needs to be in order for the null hypothesis to be rejected.\nHistorically, many fields have set $\\alpha = 0.05,$ meaning that the results need to occur less than 5% of the time, if the null hypothesis is to be rejected.\nThe value of $\\alpha$ can vary depending on the field or the application.\n\n\n\n\n\nAlthough in everyday language \"significant\" would indicate that a difference is large or meaningful, that is not necessarily the case here.\nThe term \"statistically significant\" only indicates that the p-value from a study fell below the chosen significance level.\nFor example, in the sex discrimination study, the p-value was found to be approximately 0.02.\nUsing a significance level of $\\alpha = 0.05,$ we would say that the data provided statistically significant evidence against the null hypothesis.\nHowever, this conclusion gives us no information regarding the size of the difference in promotion rates!\n\n::: {.important data-latex=\"\"}\n**Statistical significance.**\n\nWe say that the data provide **statistically significant**\\index{hypothesis testing!statistically significant} evidence against the null hypothesis if the p-value is less than some predetermined threshold (e.g., 0.01, 0.05, 0.1).\n:::\n\n::: {.workedexample data-latex=\"\"}\nIn the opportunity cost study in Section \\@ref(caseStudyOpportunityCost), we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future.\nWe determined that such a large difference would only occur 6-in-1,000 times if the reminder actually had no influence on student decision-making.\nWhat is the p-value in this study?\nWould you classify the result as \"statistically significant\"?\n\n------------------------------------------------------------------------\n\nThe p-value was 0.006.\nSince the p-value is less than 0.05, the data provide statistically significant evidence that US college students were actually influenced by the reminder.\n:::\n
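If you want to see the simulation behind numbers like these, the entire randomization test can be sketched in a few lines of R code. The sketch below is not taken from the original study's analysis; it reconstructs the opportunity cost data from the counts reported earlier in the chapter (75 students per group, with 34 and 19 students in the treatment and control groups, respectively, deciding not to buy the video -- values implied by the totals and the 20% observed difference), and the object names (`purchase`, `group`, `null_diffs`) are chosen purely for illustration.\n\n```r\n# Reconstruct the opportunity cost data from the reported counts:\n# treatment group: 34 of 75 did not buy; control group: 19 of 75 did not buy\npurchase <- c(rep(\"not buy\", 34), rep(\"buy\", 41),\n              rep(\"not buy\", 19), rep(\"buy\", 56))\ngroup <- c(rep(\"treatment\", 75), rep(\"control\", 75))\n\n# Observed difference in the proportion who did not buy (treatment - control)\nobs_diff <- mean(purchase[group == \"treatment\"] == \"not buy\") -\n  mean(purchase[group == \"control\"] == \"not buy\")\n\n# Build the null distribution by shuffling the group labels 1,000 times\nset.seed(25)\nnull_diffs <- replicate(1000, {\n  shuffled <- sample(group)\n  mean(purchase[shuffled == \"treatment\"] == \"not buy\") -\n    mean(purchase[shuffled == \"control\"] == \"not buy\")\n})\n\n# p-value: proportion of shuffled differences at least as large as observed\nmean(null_diffs >= obs_diff)\n```\n\nThe estimated p-value will wobble a little from seed to seed, but it should land near the 0.006 used in the worked example above.\n\n::: {.important data-latex=\"\"}\n**What's so special about 0.05?**\n\nWe often use a threshold of 0.05 to determine whether a result is statistically significant.\nBut why 0.05?\nMaybe we should use a bigger number, or maybe a smaller number.\nIf you're a little puzzled, that probably means you're reading 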
with a critical eye -- good job!\nWe've made a video to help clarify *why 0.05*:\n\n\n\nSometimes it's also a good idea to deviate from the standard.\nWe'll discuss when to choose a threshold different than 0.05 in Section \\@ref(decerr).\n:::\n\n\\clearpage\n\n## Chapter review {#chp11-review}\n\n### Summary\n\nFigure \\@ref(fig:fullrand) provides a visual summary of the randomization testing procedure.\n\n\\index{randomization test}\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An example of one simulation of the full randomization procedure from a hypothetical dataset as visualized in the first panel. We repeat the steps hundreds or thousands of times.](images/fullrand.png){fig-alt='48 red and white cards are show in three panels. The first panel represents original data and original allocation of Group 1 and Group 2 (in the original data there are 7 white cards in Group 1 and 10 white cards in Group 2). The second panel represents the shuffled red and white cards that are randomly assigned as Group 1 and Group 2. The third panel has the cards sorted according to the random assignment of Group 1 and Group 2. In the third panel there are 8 white cards in the Group 1 and 9 white cards in Group 2.' width=100%}\n:::\n:::\n\n\nWe can summarize the randomization test procedure as follows:\n\n- **Frame the research question in terms of hypotheses.** Hypothesis tests are appropriate for research questions that can be summarized in two competing hypotheses. The null hypothesis $(H_0)$ usually represents a skeptical perspective or a perspective of no relationship between the variables. The alternative hypothesis $(H_A)$ usually represents a new view or the existance of a relationship between the variables.\n- **Collect data with an observational study or experiment.** If a research question can be formed into two hypotheses, we can collect data to run a hypothesis test. If the research question focuses on associations between variables but does not concern causation, we would use an observational study. If the research question seeks a causal connection between two or more variables, then an experiment should be used.\n- **Model the randomness that would occur if the null hypothesis was true.** In the examples above, the variability has been modeled as if the treatment (e.g., sexual identity, opportunity) allocation was independent of the outcome of the study. The computer generated null distribution is the result of many different randomizations and quantifies the variability that would be expected if the null hypothesis was true.\n- **Analyze the data.** Choose an analysis technique appropriate for the data and identify the p-value. So far, we have only seen one analysis technique: randomization. Throughout the rest of this textbook, we'll encounter several new methods suitable for many other contexts.\n- **Form a conclusion.** Using the p-value from the analysis, determine whether the data provide evidence against the null hypothesis. Also, be sure to write the conclusion in plain language so casual readers can understand the results.\n\nTable \\@ref(tab:chp11-summary) is another look at the randomization test summary.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of randomization as an inferential statistical method.
Question Answer
What does it do? Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment
What is the random process described? Randomized experiment
What other random processes can be approximated? Can also be used to describe random sampling in an observational model
What is it best for? Hypothesis testing (can also be used for confidence intervals, but not covered in this text).
What physical object represents the simulation process? Shuffling cards
\n\n`````\n:::\n:::\n\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
alternative hypothesis permutation test statistical inference
confidence interval point estimate statistically significant
hypothesis test randomization test success
independent significance level test statistic
null hypothesis simulation
p-value statistic
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp11-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-11].\n\n::: {.exercises data-latex=\"\"}\n1. **Identify the parameter, I**\nFor each of the following situations, state whether the parameter of interest is a mean or a proportion. It may be helpful to examine whether individual responses are numerical or categorical.\n\n a. In a survey, one hundred college students are asked how many hours per week they spend on the Internet.\n\n b. In a survey, one hundred college students are asked: \"What percentage of the time you spend on the Internet is part of your course work?\"\n\n c. In a survey, one hundred college students are asked whether they cited information from Wikipedia in their papers.\n\n d. In a survey, one hundred college students are asked what percentage of their total weekly spending is on alcoholic beverages.\n\n e. In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.\n\n1. **Identify the parameter, II.**\nFor each of the following situations, state whether the parameter of interest is a mean or a proportion.\n\n a. A poll shows that 64% of Americans personally worry a great deal about federal spending and the budget deficit.\n\n b. A survey reports that local TV news has shown a 17% increase in revenue within a two year period while newspaper revenues decreased by 6.4% during this time period.\n\n c. In a survey, high school and college students are asked whether they use geolocation services on their smart phones.\n\n d. In a survey, smart phone users are asked whether they use a web-based taxi service.\n\n e. In a survey, smart phone users are asked how many times they used a web-based taxi service over the last year.\n\n1. **Hypotheses.**\nFor each of the research statements below, note whether it represents a null hypothesis claim or an alternative hypothesis claim.\n\n a. The number of hours that grade-school children spend doing homework predicts their future success on standardized tests.\n \n b. King cheetahs on average run the same speed as standard spotted cheetahs.\n \n c. For a particular student, the probability of correctly answering a 5-option multiple choice test is larger than 0.2 (i.e., better than guessing).\n \n d. The mean length of African elephant tusks has changed over the last 100 years.\n \n e. The risk of facial clefts is equal for babies born to mothers who take folic acid supplements compared with those from mothers who do not.\n \n f. Caffeine intake during pregnancy affects mean birth weight.\n \n g. The probability of getting in a car accident is the same if using a cell phone than if not using a cell phone.\n \n \\clearpage\n\n1. **True null hypothesis.**\nUnbeknownst to you, let's say that the null hypothesis is actually true in the population. You plan to run a study anyway.\n\n a. If the level of significance you choose (i.e., the cutoff for your p-value) is 0.05, how likely is it that you will mistakenly reject the null hypothesis?\n \n b. If the level of significance you choose (i.e., the cutoff for your p-value) is 0.01, how likely is it that you will mistakenly reject the null hypothesis?\n \n c. If the level of significance you choose (i.e., the cutoff for your p-value) is 0.10, how likely is it that you will mistakenly reject the null hypothesis?\n\n1. 
**Identify hypotheses, I.**\nWrite the null and alternative hypotheses in words and then symbols for each of the following situations.\n\n a. New York is known as \"the city that never sleeps\". A random sample of 25 New Yorkers were asked how much sleep they get per night. Do these data provide convincing evidence that New Yorkers on average sleep less than 8 hours a night?\n\n b. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity. They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc. They also collect data on how much company time employees spend on such non- business activities during March Madness. They want to determine if these data provide convincing evidence that employee productivity decreases during March Madness.\n\n1. **Identify hypotheses, II.**\nWrite the null and alternative hypotheses in words and using symbols for each of the following situations.\n\n a. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item. Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories. After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners. Do these data provide convincing evidence of a difference in the average calorie intake of a diners at this restaurant?\n\n b. Based on the performance of those who took the GRE exam between July 1, 2004 and June 30, 2007, the average Verbal Reasoning score was calculated to be 462. In 2021 the average verbal score was slightly higher. Do these data provide convincing evidence that the average GRE Verbal Reasoning score has changed since 2021?\n \n \\clearpage\n\n1. **Side effects of Avandia.** \nRosiglitazone is the active ingredient in the controversial type 2 diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular problems such as stroke, heart failure, and death. A common alternative treatment is Pioglitazone, the active ingredient in a diabetes medicine called Actos. In a nationwide retrospective observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that 2,593 of the 67,593 patients using Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had serious cardiovascular problems. These data are summarized in the contingency table below.^[The [`avandia`](http://openintrostat.github.io/openintro/reference/avandia.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Graham:2010]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
treatment No Yes Total
Pioglitazone 154,592 5,386 159,978
Rosiglitazone 65,000 2,593 67,593
Total 219,592 7,979 227,571
\n \n `````\n :::\n :::\n\n a. Determine if each of the following statements is true or false. If false, explain why. *Be careful:* The reasoning may be wrong even if the statement's conclusion is correct. In such cases, the statement should be considered false.\n \n i. Since more patients on Pioglitazone had cardiovascular problems (5,386 vs. 2,593), we can conclude that the rate of cardiovascular problems for those on a Pioglitazone treatment is higher.\n \n ii. The data suggest that diabetic patients who are taking Rosiglitazone are more likely to have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038) 3.8\\% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4\\% for patients on Pioglitazone. \n \n iii. The fact that the rate of incidence is higher for the Rosiglitazone group proves that Rosiglitazone causes serious cardiovascular problems. \n \n iv. Based on the information provided so far, we cannot tell if the difference between the rates of incidences is due to a relationship between the two variables or due to chance.\n \n b. What proportion of all patients had cardiovascular problems?\n\n c. If the type of treatment and having cardiovascular problems were independent, about how many patients in the Rosiglitazone group would we expect to have had cardiovascular problems?\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part d.*\n :::\n \n \\clearpage\n\n d. We can investigate the relationship between outcome and treatment in this study using a randomization technique. While in reality we would carry out the simulations required for randomization using statistical software, suppose we actually simulate using index cards. In order to simulate from the independence model, which states that the outcomes were independent of the treatment, we write whether each patient had a cardiovascular problem on cards, shuffled all the cards together, then deal them into two groups of size 67,593 and 159,978. We repeat this simulation 100 times and each time record the difference between the proportions of cards that say \"Yes\" in the Rosiglitazone and Pioglitazone groups. Use the histogram of these differences in proportions to answer the following questions.\n \n i. What are the claims being tested? \n \n ii. Compared to the number calculated in part (b), which would provide more support for the alternative hypothesis, *higher* or *lower* proportion of patients with cardiovascular problems in the Rosiglitazone group? \n \n iii. What do the simulation results suggest about the relationship between taking Rosiglitazone and having cardiovascular problems in diabetic patients?\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n \n \\clearpage\n\n1. **Heart transplants.** \nThe Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was designated an official heart transplant candidate, meaning that they were gravely ill and would most likely benefit from a new heart. Some patients got a transplant and some did not. The variable `transplant` indicates which group the patients were in; patients in the treatment group got a transplant and those in the control group did not. Of the 34 patients in the control group, 30 died. Of the 69 people in the treatment group, 45 died. 
Another variable called `survived` was used to indicate whether the patient was alive at the end of the study.^[The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Turnbull+Brown+Hu:1974]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-36-1.png){width=90%}\n :::\n :::\n\n a. Does the stacked bar plot indicate that survival is independent of whether the patient got a transplant? Explain your reasoning.\n\n b. What do the box plots above suggest about the efficacy (effectiveness) of the heart transplant treatment.\n\n c. What proportion of patients in the treatment group and what proportion of patients in the control group died?\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for part d.*\n :::\n \n \\clearpage\n\n d. One approach for investigating whether the treatment is effective is to use a randomization technique.\n \n i. What are the claims being tested?\n \n ii. The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.\n \n > We write *alive* on $\\rule{2cm}{0.5pt}$ cards representing patients who were alive at the end of the study, and *deceased* on $\\rule{2cm}{0.5pt}$ cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size $\\rule{2cm}{0.5pt}$ representing treatment, and another group of size $\\rule{2cm}{0.5pt}$ representing control. We calculate the difference between the proportion of \\textit{deceased} cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at $\\rule{2cm}{0.5pt}$. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are $\\rule{2cm}{0.5pt}$. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.\n\n iii. 
What do the simulation results shown below suggest about the effectiveness of the transplant program?\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](11-foundations-randomization_files/figure-html/unnamed-chunk-37-1.png){width=90%}\n :::\n :::\n\n\n:::\n", + "supporting": [ + "11-foundations-randomization_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/11-foundations-randomization/figure-html/opportunity-cost-obs-bar-1.png b/_freeze/11-foundations-randomization/figure-html/opportunity-cost-obs-bar-1.png new file mode 100644 index 00000000..3e5e2bee Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/opportunity-cost-obs-bar-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/opportunity-cost-rand-dot-plot-1.png b/_freeze/11-foundations-randomization/figure-html/opportunity-cost-rand-dot-plot-1.png new file mode 100644 index 00000000..3c7e67b9 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/opportunity-cost-rand-dot-plot-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/opportunity-cost-rand-hist-1.png b/_freeze/11-foundations-randomization/figure-html/opportunity-cost-rand-hist-1.png new file mode 100644 index 00000000..5b3aea80 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/opportunity-cost-rand-hist-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/sex-rand-dot-plot-1.png b/_freeze/11-foundations-randomization/figure-html/sex-rand-dot-plot-1.png new file mode 100644 index 00000000..828978c4 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/sex-rand-dot-plot-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-35-1.png b/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-35-1.png new file mode 100644 index 00000000..532ff8a1 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-35-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-36-1.png b/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-36-1.png new file mode 100644 index 00000000..0fac7cf2 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-36-1.png differ diff --git a/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-37-1.png b/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-37-1.png new file mode 100644 index 00000000..962f1b06 Binary files /dev/null and b/_freeze/11-foundations-randomization/figure-html/unnamed-chunk-37-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/execute-results/html.json b/_freeze/12-foundations-bootstrapping/execute-results/html.json new file mode 100644 index 00000000..1375bd17 --- /dev/null +++ b/_freeze/12-foundations-bootstrapping/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "e83b1f442bb4288f9b5ef7ff61b34778", + "result": { + "markdown": "# Confidence intervals with bootstrapping {#foundations-bootstrapping}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn this chapter, we expand on the familiar idea of using a sample proportion to estimate a population proportion.\nThat is, we create what is called a **confidence interval**\\index{confidence interval}, which is a range of plausible values 
where we may find the true population value.\nThe process for creating a confidence interval is based on understanding how a statistic (here the sample proportion) *varies* around the parameter (here the population proportion) when many different statistics are calculated from many different samples.\n\nIf we could, we would measure the variability of the statistics by repeatedly taking sample data from the population and computing the sample proportion.\nThen we could do it again.\nAnd again.\nAnd so on until we have a good sense of the variability of our original estimate.\n\nWhen the variability across the samples is large, we would assume that the original statistic is possibly far from the true population parameter of interest (and the interval estimate will be wide).\nWhen the variability across the samples is small, we expect the sample statistic to be close to the true parameter of interest (and the interval estimate will be narrow).\n\nAn ideal world where sampling data is free or extremely cheap almost never exists, and taking repeated samples from a population is usually impossible.\n\nSo, instead of using a \"resample from the population\" approach, bootstrapping uses a \"resample from the sample\" approach.\nIn this chapter we provide examples and details about the bootstrapping process.\n:::\n\nAs seen in Chapter \\@ref(foundations-randomization), randomization is a statistical technique suitable for evaluating whether a difference in sample proportions is due to chance.\n\nRandomization tests are best suited for modeling experiments where the treatment (explanatory variable) has been randomly assigned to the observational units and there is an attempt to answer a simple yes/no research question.\n\nFor example, consider the following research questions that can be well assessed with a randomization test:\n\n- Does this vaccine make it less likely that a person will get malaria?\n- Does drinking caffeine affect how quickly a person can tap their finger?\n- Can we predict whether candidate A will win the upcoming election?\n\nIn this chapter, however, we are interested in a different approach to understanding population parameters.\nInstead of testing a claim, the goal now is to estimate the unknown value of a population parameter.\n\nFor example,\n\n- How much less likely am I to get malaria if I get the vaccine?\n- How much faster (or slower) can a person tap their finger, on average, if they drink caffeine first?\n- What proportion of the vote will go to candidate A?\n\nHere, we explore the situation where the focus is on a single proportion, and we introduce a new simulation method: **bootstrapping**.\n\n\n\n\n\nBootstrapping is best suited for modeling studies where the data have been generated through random sampling from a population.\n\nAs with randomization tests, our goal with bootstrapping is to understand the variability of a statistic.\n\nUnlike randomization tests (which modeled how the statistic would change if the treatment had been allocated differently), the bootstrap will model how a statistic varies from one sample to another taken from the population.\nThis will provide information about how different the statistic is from the parameter of interest.\n\nQuantifying the variability of a statistic from sample to sample is a hard problem.\n\nFortunately, sometimes the mathematical theory for how a statistic varies (across different samples) is well-known; this is the case for the sample proportion as seen in @sec-foundations-mathematical.\n\nHowever, some statistics do not have simple theory for how they vary, and bootstrapping provides a computational approach for constructing interval estimates for almost any population parameter.\nIn this chapter we will focus on bootstrapping to estimate a single proportion, and we will revisit bootstrapping in Chapters \\@ref(inference-one-mean) through \\@ref(inference-paired-means), so you'll get plenty of practice as well as exposure to bootstrapping in many different data settings.\n\nOur goal with bootstrapping will be to produce an interval estimate (a range of plausible values) for the population parameter.\n\n## Medical consultant case study {#case-study-med-consult}\n\nPeople providing an organ for donation sometimes seek the help of a special medical consultant.\nThese consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery.\nPatients might choose a consultant based in part on the historical complication rate of the consultant's clients.\n\n### Observed data\n\nOne consultant tried to attract patients by noting the average complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 3 complications in the 62 liver donor surgeries she has facilitated.\nShe claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).\n\n::: {.workedexample data-latex=\"\"}\nWe will let $p$ represent the true complication rate for liver donors working with this consultant.\n(The \"true\" complication rate will be referred to as the **parameter**.) We estimate $p$ using the data, and label the estimate $\\hat{p}.$\n\n------------------------------------------------------------------------\n\nThe sample proportion for the complication rate is 3 complications divided by the 62 surgeries the consultant has worked on: $\\hat{p} = 3/62 = 0.048.$\n:::\n\n::: {.workedexample data-latex=\"\"}\nIs it possible to assess the consultant's claim (that the reduction in complications is due to her work) using the data?\n\n------------------------------------------------------------------------\n\nNo.\n\nThe claim is that there is a causal connection, but the data are observational, so we must be on the lookout for confounding variables.\n\nFor example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.\n\nWhile it is not possible to assess the causal claim, it is still possible to understand the consultant's true rate of complications.\n:::\n\n::: {.important data-latex=\"\"}\n**Parameter.**\\index{parameter}\n\nA **parameter** is the \"true\" value of interest.\n\nWe typically estimate the parameter using a point estimate\\index{point estimate} from a sample of data.\nThe point estimate is also known as the **statistic**\\index{statistic}.\n\nFor example, we estimate the probability $p$ of a complication for a client of the medical consultant by examining the past complication rates of her clients:\n\n$$\\hat{p} = 3 / 62 = 0.048~\\text{is used to estimate}~p$$\n:::\n\n\n\n\n\n### Variability of the statistic\n\nIn the medical consultant case study, the parameter is $p,$ the true probability of a complication for a client of the medical consultant.\nThere is no reason to believe that $p$ is exactly $\\hat{p} = 3/62,$ but there is also no reason to believe that $p$ is particularly far from $\\hat{p} = 3/62.$ By sampling with 
replacement from the dataset (a process called bootstrapping\\index{bootstrapping}), the variability of the possible $\\hat{p}$ values can be approximated.\n\nMost of the inferential procedures covered in this text are grounded in quantifying how one dataset would differ from another when they are both taken from the same population.\nIt does not make sense to take repeated samples from the same population because if you have the means to take more samples, a larger sample size will benefit you more than separately evaluating two sample of the exact same size.\nInstead, we measure how the samples behave under an estimate of the population.\n\nFigure \\@ref(fig:boot1) shows how the unknown original population can be estimated by using the sample to approximate the proportion of successes and failures (in our case, the proportion of complications and no complications for the medical consultant).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The unknown population is estimated using the observed sample data. Note that we can use the sample to create an estimated or bootstrapped population from which to sample. The observed data include three red and four white marbles, so the estimated population contains 3/7 red marbles and 4/7 white marbles.](images/boot1prop1.png){fig-alt='A small sample of 3 red marbles and 4 white marbles is taken from a large population with predominately unknown individual values. The sample is then replicated infinitely many times to create a proxy population where the values are known to be 3/7 red and 4/7 white.' width=75%}\n:::\n:::\n\n\nBy taking repeated samples from the estimated population, the variability from sample to sample can be observed.\nIn Figure \\@ref(fig:boot2) the repeated bootstrap samples are obviously different both from each other and from the original population.\nRecall that the bootstrap samples were taken from the same (estimated) population, and so the differences are due entirely to natural variability in the sampling procedure.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Bootstrap sampling provides a measure of the sample to sample variability. Note that we are taking samples from the estimated population that was created from the observed data.](images/boot1prop2.png){fig-alt='The same unknown large population is given with a sample of 3 red and 4 white marbles. After the proxy population is created (infinite replicates of the sample), new resamples of size 7 can be taken from the proxy population. Three resamples of size 7 are shown: Resample 1 has 2/7 red; Resample 2 has 4/7 red; and Resample k has 5/7 red.' width=75%}\n:::\n:::\n\n\nBy summarizing each of the bootstrap samples (here, using the sample proportion), we see, directly, the variability of the sample proportion, $\\hat{p},$ from sample to sample.\nThe distribution of $\\hat{p}_{boot}$ for the example scenario is shown in Figure \\@ref(fig:boot3), and the full bootstrap distribution for the medical consultant data is shown in Figure \\@ref(fig:MedConsBSSim).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The bootstrapped proportion is estimated for each bootstrap sample. The resulting bootstrap distribution (dotplot) provides a measure for how the proportions vary from sample to sample](images/boot1prop3.png){fig-alt='The same unknown large population with a sample of 3 red and 4 white marbles; the proxy population which is infinite with 3/7 red marbles; and the k Resamples of size 7 are shown. 
From each of the resamples the bootstrapped proportion of red is calculated (shown as 2/7, 4/7, and 5/7). Many many resamples are taken and summarized in a dotplot of the bootstrapped proportions. The proportions range from 0/7 to 7/7 in a bell shape with the majority of bootstrapped proportions falling between 1/7 and 6/7.' width=95%}\n:::\n:::\n\n\nIt turns out that in practice, it is very difficult for computers to work with an infinite population (with the same proportional breakdown as in the sample).\nHowever, there is a physical and computational method which produces an equivalent bootstrap distribution of the sample proportion in a computationally efficient manner.\n\nConsider the observed data to be a bag of marbles 3 of which are success (red) and 4 of which are failures (white).\nBy drawing the marbles out of the bag with replacement, we depict the exact same sampling **process** as was done with the infinitely large estimated population.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Taking repeated resamples from the sample data is the same process as creating an infinitely large estimate of the population. It is computationally more feasible to take resamples directly from the sample. Note that the resampling is now done with replacement (that is, the original sample does not ever change) so that the original sample and estimated hypothetical population are equivalent.](images/boot1prop4.png){fig-alt='Shown is the unknown large population with a sample of 3 red and 4 white marbles. Without creating the infinitely large proxy population, resamples are taken from the original sample (by sampling with replacement from the sample). Three resamples of size 7 are shown: Resample 1 has 2/7 red; Resample 2 has 4/7 red; and Resample k has 5/7 red.' width=75%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![A comparison of the process of sampling from the estimate infinite population and resampling with replacement from the original sample. Note that the dotplot of bootstrapped proportions is the same because the process by which the statistics were estimated is equivalent.](images/boot1propboth.png){fig-alt='Top image includes the steps of (1) a large unknown population, (2) observed sample of size 7 (with 3 red and 4 white), (3) creation of an infinitely large proxy population, and (4) three resamples. (5) Many resamples are considered with a dotplot of bootstrapped proportions. The bottom image follows the same process without the infinitely large proxy population. That is, in the bottom image a (1) single sample is taken from the original population and (2) the three resamples are taken directly from the observed data (using sampling with replacement). (3) Again, many resamples are considered with a dotplot of bootstrapped proportions.' 
width=95%}\n:::\n:::\n\n\nIf we apply the bootstrap sampling process to the medical consultant example, we consider each client to be one of the marbles in the bag.\nThere will be 59 white marbles (no complication) and 3 red marbles (complication).\nIf we choose 62 marbles out of the bag (one at a time with replacement) and compute the proportion of simulated patients with complications, $\\hat{p}_{boot},$ then this \"bootstrap\" proportion represents a single simulated proportion from the \"resample from the sample\" approach.\n\n::: {.guidedpractice data-latex=\"\"}\nIn a simulation of 62 patients, about how many would we expect to have had a complication?[^12-foundations-bootstrapping-1]\n:::\n\n[^12-foundations-bootstrapping-1]: About 4.8% of the patients (3 on average) in the simulation will have a complication, as this is what was seen in the sample.\n We will, however, see a little variation from one simulation to the next.\n\nOne simulation isn't enough to get a sense of the variability from one bootstrap proportion to another bootstrap proportion, so we repeat the simulation 10,000 times using a computer.\n\nFigure \\@ref(fig:MedConsBSSim) shows the distribution from the 10,000 bootstrap simulations.\nThe bootstrapped proportions vary from about zero to 11.3%.\nThe variability in the bootstrapped proportions leads us to believe that the true probability of complication (the parameter, $p$) is likely to fall somewhere between 0 and 11.3%, as these numbers capture 95% of the bootstrap resampled values.\n\nThe range of values for the true proportion is called a **bootstrap percentile confidence interval**, and we will see it again throughout the next few sections and chapters.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:MedConsBSSim-cap)](12-foundations-bootstrapping_files/figure-html/MedConsBSSim-1.png){width=90%}\n:::\n:::\n\n\n(ref:MedConsBSSim-cap) The original medical consultant data is bootstrapped 10,000 times. Each simulation creates a sample from the original data where the probability of a complication is $\\hat{p} = 3/62.$ The bootstrap 2.5 percentile proportion is 0 and the 97.5 percentile is 0.113. 
The result is: we are confident that, in the population, the true probability of a complication is between 0% and 11.3%.\n\n::: {.workedexample data-latex=\"\"}\nThe original claim was that the consultant's true rate of complication was under the national rate of 10%.\nDoes the interval estimate of 0 to 11.3% for the true probability of complication indicate that the surgical consultant has a lower rate of complications than the national average?\nExplain.\n\n------------------------------------------------------------------------\n\nNo.\nBecause the interval overlaps 10%, it might be that the consultant's work is associated with a lower risk of complications, or it might be that the consultant's work is associated with a higher risk (i.e., greater than 10%) of complications!\nAdditionally, as previously mentioned, because this is an observational study, even if an association can be measured, there is no evidence that the consultant's work is the cause of the complication rate (being higher or lower).\n:::\n\n\\clearpage\n\n## Tappers and listeners case study {#tapperscasestudy}\n\nHere's a game you can try with your friends or family: pick a simple, well-known song, tap that tune on your desk, and see if the other person can guess the song.\nIn this simple game, you are the tapper, and the other person is the listener.\n\n### Observed data\n\nA Stanford University graduate student named Elizabeth Newton conducted an experiment using the tapper-listener game.[^12-foundations-bootstrapping-2]\nIn her study, she recruited 120 tappers and 120 listeners.\nAbout 50% of the tappers expected that the listener would be able to guess the song.\nNewton wondered, is 50% a reasonable expectation?\n\n[^12-foundations-bootstrapping-2]: This case study is described in [Made to Stick](https://en.wikipedia.org/wiki/Made_to_Stick) by Chip and Dan Heath.\n    Little known fact: the teaching principles behind many OpenIntro resources are based on *Made to Stick*.\n\nIn Newton's study, only 3 out of 120 listeners ($\\hat{p} = 0.025$) were able to guess the tune!\nThat seems like quite a low number, which leads the researcher to ask: what is the true proportion of people who can guess the tune?\n\n### Variability of the statistic\n\nTo answer the question, we will again use a simulation.\nTo simulate 120 games, this time we use a bag of 120 marbles: 3 are red (for those who guessed correctly) and 117 are white (for those who could not guess the song).\nSampling from the bag 120 times (remembering to replace the marble back into the bag each time to keep the population proportion of red constant) produces one bootstrap sample.\n\nFor example, we can start by simulating 5 tapper-listener pairs by sampling 5 marbles from the bag of 3 red and 117 white marbles.\n\n| W | W | W | R | W |\n|:-----:|:-----:|:-----:|:-------:|:-----:|\n| Wrong | Wrong | Wrong | Correct | Wrong |\n\nAfter selecting 120 marbles, we counted 2 red for $\\hat{p}_{boot1} = 0.0167.$ As we did with the randomization technique, seeing what would happen with one simulation isn't enough.\nIn order to understand how far the observed proportion of 0.025 might be from the true parameter, we should generate more simulations.\nHere we have repeated the entire simulation ten times:\n\n$$0.0417 \\quad 0.025 \\quad 0.025 \\quad 0.0083 \\quad 0.05 \\quad 0.0333 \\quad 0.025 \\quad 0 \\quad 0.0083 \\quad 0$$ \n
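Ten repetitions still only hint at the variability, so the rest of the work is handed to a computer. The sketch below shows one way such a simulation could be coded in R; it is not Newton's original analysis, and the object names (`listener_sample`, `boot_props`) are made up for illustration -- the only inputs taken from the study are the 3 correct guesses out of 120 listeners.\n\n```r\n# Observed sample: 3 correct guesses out of 120 listeners\nlistener_sample <- c(rep(\"correct\", 3), rep(\"wrong\", 117))\n\n# Each bootstrap sample draws 120 observations *with replacement* and\n# records the proportion of correct guesses; repeat 10,000 times\nset.seed(470)\nboot_props <- replicate(10000, {\n  resample <- sample(listener_sample, size = 120, replace = TRUE)\n  mean(resample == \"correct\")\n})\n\n# 95% bootstrap percentile interval: the middle 95% of bootstrapped proportions\nquantile(boot_props, probs = c(0.025, 0.975))\n```\n\nUp to simulation error, the two reported percentiles should essentially reproduce the interval described below.\n\nAs before, we'll run a total of 10,000 simulations using a computer.\nAs seen in Figure \\@ref(fig:tappers-bs-sim), the range of 95% of 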
the resampled values of $\\hat{p}_{boot}$ is 0.000 to 0.0583.\nThat is, we expect that between 0% and 5.83% of people are truly able to guess the tapper's tune.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:tappers-bs-sim-cap)](12-foundations-bootstrapping_files/figure-html/tappers-bs-sim-1.png){width=90%}\n:::\n:::\n\n\n(ref:tappers-bs-sim-cap) The original listener-tapper data is bootstrapped 10,000 times. Each simulation creates a sample where the probability of being correct is $\\hat{p} = 3/120.$ The 2.5 percentile proportion is 0 and the 97.5 percentile is 0.0583. The result is that we are confident that, in the population, the true percent of people who can guess correctly is between 0% and 5.83%.\n\n::: {.guidedpractice data-latex=\"\"}\nDo the data provide convincing evidence against the claim that 50% of listeners can guess the tapper's tune?[^12-foundations-bootstrapping-3]\n:::\n\n[^12-foundations-bootstrapping-3]: Because 50% is not in the interval estimate for the true parameter, we can say that there is convincing evidence against the hypothesis that 50% of listeners can guess the tune.\n Moreover, 50% is a substantial distance from the largest resample statistic, suggesting that there is **very** convincing evidence against this hypothesis.\n\n## Confidence intervals {#ConfidenceIntervals}\n\n\\index{confidence interval}\n\nA point estimate provides a single plausible value for a parameter.\nHowever, a point estimate is rarely perfect; usually there is some error in the estimate.\nIn addition to supplying a point estimate of a parameter, a next logical step would be to provide a plausible *range of values* for the parameter.\n\n### Plausible range of values for the population parameter\n\nA plausible range of values for the population parameter is called a **confidence interval**.\nUsing only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net.\nWe can throw a spear where we saw a fish, but we will probably miss.\nOn the other hand, if we toss a net in that area, we have a good chance of catching the fish.\n\nIf we report a point estimate, we probably will not hit the exact population parameter.\nOn the other hand, if we report a range of plausible values -- a confidence interval -- we have a good shot at capturing the parameter.\n\n::: {.guidedpractice data-latex=\"\"}\nIf we want to be very certain we capture the population parameter, should we use a wider interval (e.g., 99%) or a smaller interval (e.g., 80%)?[^12-foundations-bootstrapping-4]\n:::\n\n[^12-foundations-bootstrapping-4]: If we want to be more certain we will capture the fish, we might use a wider net.\n Likewise, we use a wider confidence interval if we want to be more certain that we capture the parameter.\n\n### Bootstrap confidence interval\n\nAs we saw above, a **bootstrap sample**\\index{bootstrap sample} is a sample of the original sample.\nIn the case of the medical complications data, we proceed as follows:\n\n- Randomly sample one observation from the 62 patients (replace the marble back into the bag so as to keep the population constant).\n- Randomly sample a second observation from the 62 patients. 
Because we sample with replacement (i.e., we do not actually remove the marbles from the bag), there is a 1-in-62 chance that the second observation will be the same one sampled in the first step!\n- Keep going one sampled observation at a time ...\n- Randomly sample the 62nd observation from the 62 patients.\n\n\n\n\n\nBootstrap sampling is often called **sampling with replacement**.\n\nA bootstrap sample behaves similarly to how an actual sample from a population would behave, and we compute the point estimate of interest (here, compute $\\hat{p}_{boot}$).\n\nDue to theory that is beyond this text, we know that the bootstrap proportions $\\hat{p}_{boot}$ vary around $\\hat{p}$ in a similar way to how different sample proportions (i.e., values of $\\hat{p}$) vary around the true parameter $p.$\n\nTherefore, an interval estimate for $p$ can be produced using the $\\hat{p}_{boot}$ values themselves.\n\n::: {.important data-latex=\"\"}\n**95% Bootstrap percentile confidence interval for a parameter** $p.$\n\nThe 95% bootstrap confidence interval for the parameter $p$ can be obtained directly using the ordered $\\hat{p}_{boot}$ values.\n\nConsider the sorted $\\hat{p}_{boot}$ values.\nCall the 2.5% bootstrapped proportion value \"lower\", and call the 97.5% bootstrapped proportion value \"upper\".\n\nThe 95% confidence interval is given by: (lower, upper)\n:::\n\nIn Section \\@ref(one-prop-null-boot) we will discuss different percentages for the confidence interval (e.g., 90% confidence interval or 99% confidence interval).\n\nSection \\@ref(one-prop-null-boot) also provides a longer discussion on what \"95% confidence\" actually means.\n\n\\clearpage\n\n## Chapter review {#chp12-review}\n\n### Summary\n\nFigure \\@ref(fig:bootboth) provides a visual summary of creating bootstrap confidence intervals.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![We will use sampling with replacement to measure the variability of the statistic of interest (here the proportion). Sampling with replacement is a computational tool which is equivalent to using the sample as a way of estimating an infinitely large population from which to sample.](images/boot1propboth.png){width=95%}\n:::\n:::\n\n\nWe can summarize the bootstrap process as follows:\n\n- **Frame the research question in terms of a parameter to estimate.** Confidence Intervals are appropriate for research questions that aim to estimate a number from the population (called a parameter).\n- **Collect data with an observational study or experiment.** If a research question can be formed as a query about the parameter, we can collect data to calculate a statistic which is the best guess we have for the value of the parameter. However, we know that the statistic won't be exactly equal to the parameter due to natural variability.\n- **Model the randomness by using the data values as a proxy for the population.** In order to assess how far the statistic might be from the parameter, we take repeated resamples from the dataset to measure the variability in bootstrapped statistics. 
The variability of the bootstrapped statistics around the observed statistic (a quantity which can be measured through computational technique) should be approximately the same as the variability of many observed sample statistics around the parameter (a quantity which is very difficult to measure because in real life we only get exactly one sample).\n- **Create the interval.** After choosing a particular confidence level, use the variability of the bootstrapped statistics to create an interval estimate which will hope to capture the true parameter. While the interval estimate associated with the particular sample at hand may or may not capture the parameter, the researcher knows that over their lifetime, the confidence level will determine the percentage of their research confidence intervals that do capture the true parameter.\n- **Form a conclusion.** Using the confidence interval from the analysis, report on the interval estimate for the parameter of interest. Also, be sure to write the conclusion in plain language so casual readers can understand the results.\n\nTable \\@ref(tab:chp12-summary) is another look at the Bootstrap process summary.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of bootstrapping as an inferential statistical method.
Question Answer
What does it do? Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population
What is the random process described? Random sampling from a population
What other random processes can be approximated? Can also be used to describe random allocation in an experiment
What is it best for? Confidence intervals (can also be used for bootstrap hypothesis testing for one proportion as well).
What physical object represents the simulation process? Pulling marbles from a bag with replacement
\n\n`````\n:::\n:::\n\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
bootstrap percentile confidence interval parameter statistic
bootstrap sample point estimate
bootstrapping sampling with replacement
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp12-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-12].\n\n::: {.exercises data-latex=\"\"}\n1. **Outside YouTube videos.**\nLet's say that you want to estimate the proportion of YouTube videos which take place outside (define \"outside\" to be if any part of the video takes place outdoors).\nYou take a random sample of 128 YouTube videos^[There are many choices for implementing a random selection of YouTube videos, but it isn't clear how \"random\" they are.] and determine that 37 of them take place outside.\nYou'd like to estimate the proportion of all YouTube videos which take place outside, so you decide to create a bootstrap interval from the original sample of 128 videos.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](12-foundations-bootstrapping_files/figure-html/unnamed-chunk-15-1.png){width=90%}\n :::\n :::\n\n a. Describe in words the relevant statistic and parameter for this problem. If you know the numerical value for either one, provide it. If you do not know the numerical value, explain why the value is unknown.\n \n b. What notation is used to describe, respectively, the statistic and the parameter?\n \n c. If using software to bootstrap the original dataset, what is the statistic calculated on each bootstrap sample?\n \n d. When creating a bootstrap sampling distribution (histogram) of the bootstrapped sample proportions, where should the center of the histogram lie?\n \n e. The histogram provides a bootstrap sampling distribution for the sample proportion (with 1000 bootstrap repetitions). Using the histogram, estimate a 90% confidence interval for the proportion of YouTube videos which take place outdoors.\n \n f. In words of the problem, interpret the confidence interval which was estimated in the previous part.\n \n \\clearpage\n\n1. **Chronic illness.**\nIn 2012 the Pew Research Foundation reported that \"45% of US adults report that they live with one or more chronic conditions.\" However, this value was based on a sample, so it may not be a perfect estimate for the population parameter of interest on its own. The study was based on a sample of 3014 adults. Below is a distribution of 1000 bootstrapped sample proportions from the Pew dataset. [@data:pewdiagnosis:2013]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](12-foundations-bootstrapping_files/figure-html/unnamed-chunk-16-1.png){width=90%}\n :::\n :::\n\n Using the distribution of 1000 bootstrapped proportions, approximate a 92% confidence interval for the true proportion of US adults who live with one or more chronic conditions. Interpret the interval in the context of the problem.\n\n1. **Twitter users and news.**\nA poll conducted in 2013 found that 52% of all US adult Twitter users get at least some news on Twitter. However, this value was based on a sample, so it may not be a perfect estimate for the population parameter of interest on its own. The study was based on a sample of 736 adults. Below is a distribution of 1000 bootstrapped sample proportions from the Pew dataset. [@data:pewtwitternews:2013]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](12-foundations-bootstrapping_files/figure-html/unnamed-chunk-17-1.png){width=90%}\n :::\n :::\n \n Using the distribution of 1000 bootstrapped proportions, approximate a 98% confidence interval for the true proportion of US adult Twitter users (in 2013) who get at least some of their news from Twitter. 
Interpret the interval in the context of the problem.\n \n \\clearpage\n\n1. **Bootstrap distributions of $\\hat{p}$, I.**\nEach of the following four distributions was created using a different dataset.\nEach dataset was based on $n=23$ observations.\nThe original datasets had the following proportions of successes: $$\\hat{p} = 0.13 \\ \\ \\hat{p} = 0.22 \\ \\ \\hat{p} = 0.30 \\ \\ \\hat{p} = 0.43.$$ \nMatch each histogram with the original data proportion of success.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](12-foundations-bootstrapping_files/figure-html/unnamed-chunk-18-1.png){width=90%}\n :::\n :::\n \n \\clearpage\n\n1. **Bootstrap distributions of $\\hat{p}$, II.**\nEach of the following four distributions was created using a different dataset.\nEach dataset was based on $n=23$ observations. \n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](12-foundations-bootstrapping_files/figure-html/unnamed-chunk-19-1.png){width=90%}\n :::\n :::\n \n Consider each of the following values for the true population $p$ (proportion of success). Datasets A, B, C, D were bootstrapped 1000 times, with bootstrap proportions as given in the histograms provided. For each parameter value, list the datasets which could plausibly have come from that population. (Hint: there may be more than one dataset for each parameter value.)\n \n a. $p = 0.05$\n \n b. $p = 0.25$\n \n c. $p = 0.45$\n \n d. $p = 0.55$\n \n e. $p = 0.75$\n \n \\clearpage\n\n1. **Bootstrap distributions of $\\hat{p}$, III.**\nEach of the following four distributions was created using a different dataset.\nEach dataset had the same proportion of successes $(\\hat{p} = 0.4)$ but a different sample size. The four datasets were given by $n = 10, 100, 500$, and $1000$.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](12-foundations-bootstrapping_files/figure-html/unnamed-chunk-20-1.png){width=90%}\n :::\n :::\n\n Consider each of the following values for the true population $p$ (proportion of success). Datasets A, B, C, D were bootstrapped 1000 times, with bootstrap proportions as given in the histograms provided. For each parameter value, list the datasets which could plausibly have come from that population. (Hint: there may be more than one dataset for each parameter value.)\n \n a. $p = 0.05$\n \n b. $p = 0.25$\n \n c. $p = 0.45$\n \n d. $p = 0.55$\n \n e. $p = 0.75$\n\n1. **Cyberbullying rates.** \nTeens were surveyed about cyberbullying, and 54% to 64% reported experiencing cyberbullying (95% confidence interval). Answer the following questions based on this interval. [@pewcyberbully2018]\n\n a. A newspaper claims that a majority of teens have experienced cyberbullying. Is this claim supported by the confidence interval? Explain your reasoning.\n\n b. A researcher conjectured that 70% of teens have experienced cyberbullying. Is this claim supported by the confidence interval? Explain your reasoning.\n\n c. Without actually calculating the interval, determine if the claim of the researcher from part (b) would be supported based on a 90% confidence interval?\n\n1. **Waiting at an ER.**\nA 95% confidence interval for the mean waiting time at an emergency room (ER) is (128 minutes, 147 minutes). Answer the following questions based on this interval.\n\n a. A local newspaper claims that the average waiting time at this ER exceeds 3 hours. Is this claim supported by the confidence interval? Explain your reasoning.\n\n b. The Dean of Medicine at this hospital claims the average wait time is 2.2 hours. 
Is this claim supported by the confidence interval? Explain your reasoning.\n\n c. Without actually calculating the interval, determine if the claim of the Dean from part (b) would be supported based on a 99% confidence interval?\n\n\n:::\n", + "supporting": [ + "12-foundations-bootstrapping_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/12-foundations-bootstrapping/figure-html/MedConsBSSim-1.png b/_freeze/12-foundations-bootstrapping/figure-html/MedConsBSSim-1.png new file mode 100644 index 00000000..1a68a27c Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/MedConsBSSim-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/tappers-bs-sim-1.png b/_freeze/12-foundations-bootstrapping/figure-html/tappers-bs-sim-1.png new file mode 100644 index 00000000..63f4ca2b Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/tappers-bs-sim-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-15-1.png b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 00000000..f0ae2cda Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-15-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-16-1.png b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 00000000..f3eb7ec5 Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-16-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-17-1.png b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 00000000..b9aff8fd Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-18-1.png b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 00000000..fb89e182 Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-18-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-19-1.png b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-19-1.png new file mode 100644 index 00000000..fb89e182 Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-19-1.png differ diff --git a/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-20-1.png b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 00000000..32cd40b1 Binary files /dev/null and b/_freeze/12-foundations-bootstrapping/figure-html/unnamed-chunk-20-1.png differ diff --git a/_freeze/13-foundations-mathematical/execute-results/html.json b/_freeze/13-foundations-mathematical/execute-results/html.json new file mode 100644 index 00000000..ce93527a --- /dev/null +++ b/_freeze/13-foundations-mathematical/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "2bd44009b9157676782b5db0b2f51003", + "result": { + "markdown": "# Inference with mathematical models {#sec-foundations-mathematical}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn @sec-foundations-randomization and 
@sec-foundations-bootstrapping, questions about population parameters were addressed using computational techniques.\nWith randomization tests, the data were permuted assuming the null hypothesis.\nWith bootstrapping, the data were resampled in order to measure the variability.\nIn many cases (indeed, with sample proportions), the variability of the statistic can be described by the computational method (as in previous chapters) or by a mathematical formula (as in this chapter).\n\nThe normal distribution is presented here to describe the variability associated with sample proportions which are taken from either repeated samples or repeated experiments.\nThe normal distribution is quite powerful in that it describes the variability of many different statistics, and we will encounter the normal distribution throughout the remainder of the book.\n\nFor now, however, the focus is on the parallels between how data can provide insight about a research question either through computational methods or through mathematical models.\n:::\n\n## Central Limit Theorem {#CLTsection}\n\nIn recent chapters, we have encountered four case studies.\nWhile they differ in the settings, in their outcomes, and in the technique we have used to analyze the data, they all have something in common: the general shape of the distribution of the statistics (called the **sampling distribution**).\\index{sampling distribution} You may have noticed that the distributions were symmetric and bell-shaped.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Sampling distribution.**\n\nA sampling distribution is the distribution of all possible values of a *sample statistic* from samples of a given sample size from a given population.\nWe can think about the sampling distribution as describing how sample statistics (e.g., the sample proportion $\\hat{p}$ or the sample mean $\\bar{x}$) vary from one study to another.\nA sampling distribution is contrasted with a data distribution, which shows the variability of the *observed* data values.\nThe data distribution can be visualized from the observations themselves.\nHowever, because a sampling distribution describes sample statistics computed from many studies, it cannot be visualized directly from a single dataset.\nInstead, we use either computational or mathematical structures to estimate the sampling distribution and hence to describe the expected variability of the sample statistic in repeated studies.\n:::\n\nFigure \\@ref(fig:FourCaseStudies) shows the null distributions in each of the four case studies where we ran 10,000 simulations.\nNote that the **null distribution**\\index{null distribution} is the sampling distribution of the statistic created under the setting where the null hypothesis is true.\nTherefore, the null distribution will always be centered at the value of the parameter given by the null hypothesis.\nIn the case of the opportunity cost study, which originally had just 1,000 simulations, we have included an additional 9,000 simulations.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The null distribution for each of the four case studies presented previously. 
Note that the center of each distribution is given by the value of the parameter set in the null hypothesis.](13-foundations-mathematical_files/figure-html/FourCaseStudies-1.png){width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nDescribe the shape of the distributions and note anything that you find interesting.[^13-foundations-mathematical-1]\n:::\n\n[^13-foundations-mathematical-1]: In general, the distributions are reasonably symmetric.\n The case study for the medical consultant is the only distribution with any evident skew (the distribution is skewed right).\n\nThe case study for the medical consultant is the only distribution with any evident skew.\nAs we observed in Chapter \\@ref(data-hello), it's common for distributions to be skewed or contain outliers.\nHowever, the null distributions we have so far encountered have all looked somewhat similar and, for the most part, symmetric.\nThey all resemble a bell-shaped curve.\nThe bell-shaped curve similarity is not a coincidence, but rather, is guaranteed by mathematical theory.\n\n::: {.important data-latex=\"\"}\n**Central Limit Theorem for proportions.**\\index{Central Limit Theorem}\n\nIf we look at a proportion (or difference in proportions) and the scenario satisfies certain conditions, then the sample proportion (or difference in proportions) will appear to follow a bell-shaped curve called the *normal distribution*.\n:::\n\n\n\n\n\nAn example of a perfect normal distribution is shown in Figure \\@ref(fig:simpleNormal).\nImagine laying a normal curve over each of the four null distributions in Figure \\@ref(fig:FourCaseStudies).\nWhile the mean (center) and standard deviation (width or spread) may change for each plot, the general shape remains roughly intact.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A normal curve.](13-foundations-mathematical_files/figure-html/simpleNormal-1.png){width=60%}\n:::\n:::\n\n\nMathematical theory guarantees that if repeated samples are taken a sample proportion or a difference in sample proportions will follow something that resembles a normal distribution when certain conditions are met.\n(Note: we typically only take **one** sample, but the mathematical model lets us know what to expect if we *had* taken repeated samples.) These conditions fall into two general categories describing the independence between observations and the need to take a sufficiently large sample size.\n\n1. Observations in the sample are **independent**.\n Independence is guaranteed when we take a random sample from a population.\n Independence can also be guaranteed if we randomly divide individuals into treatment and control groups.\n\n2. 
The sample is **large enough**.\n The sample size cannot be too small.\n What qualifies as \"small\" differs from one context to the next, and we'll provide suitable guidelines for proportions in Chapter \\@ref(inference-one-prop).\n\nSo far we have had no need for the normal distribution.\nWe've been able to answer our questions somewhat easily using simulation techniques.\nHowever, soon this will change.\nSimulating data can be non-trivial.\nFor example, some of the scenarios encountered in Chapter \\@ref(model-mlr), where we introduced regression models with multiple predictors, would require complex simulations in order to make inferential conclusions.\nInstead, the normal distribution and other distributions like it offer a general framework for statistical inference that applies to a very large number of settings.\n\n::: {.important data-latex=\"\"}\n**Technical Conditions.**\n\nIn order for the normal approximation to describe the sampling distribution of the sample proportion as it varies from sample to sample, two conditions must hold.\nIf these conditions do not hold, it is unwise to use the normal distribution (and related concepts like Z scores, probabilities from the normal curve, etc.) for inferential analyses.\n\n1. **Independent observations**\n2. **Large enough sample:** For proportions, at least 10 expected successes and 10 expected failures in the sample.\n:::\n\n## Normal Distribution {#normalDist}\n\nAmong all the distributions we see in statistics, one is overwhelmingly the most common.\nThe symmetric, unimodal, bell curve is ubiquitous throughout statistics.\nIt is so common that people know it by a variety of names, including the **normal curve**\\index{normal curve}, **normal model**\\index{normal model}, or **normal distribution**\\index{normal distribution}.[^13-foundations-mathematical-2]\nUnder certain conditions, sample proportions, sample means, and sample differences can be modeled using the normal distribution.\nAdditionally, some variables such as SAT scores and heights of US adult males closely follow the normal distribution.\n\n[^13-foundations-mathematical-2]: It is also introduced as the Gaussian distribution after Carl Friedrich Gauss, the first person to formalize its mathematical expression.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Normal distribution facts.**\n\nMany summary statistics and variables are nearly normal, but none are exactly normal.\nThus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems.\nWe will use it in data exploration and to solve important problems in statistics.\n:::\n\nIn this section, we will discuss the normal distribution in the context of data to become familiar with normal distribution techniques.\nIn the following sections and beyond, we'll move our discussion to focus on applying the normal distribution and other related distributions to model point estimates for hypothesis tests and for constructing confidence intervals.\n\n### Normal distribution model\n\nThe normal distribution always describes a symmetric, unimodal, bell-shaped curve.\nHowever, normal curves can look different depending on the details of the model.\nSpecifically, the normal model can be adjusted using two parameters: mean and standard deviation.\nAs you can probably guess, changing the mean shifts the bell curve to the left or right, while changing the standard deviation stretches or constricts the curve.\nFigure \\@ref(fig:twoSampleNormals) shows the normal distribution with mean $0$ and 
standard deviation $1$ (which is commonly referred to as the **standard normal distribution**\\index{standard normal distribution}) on the left.\nA normal distribution with mean $19$ and standard deviation $4$ is shown on the right.\nFigure \\@ref(fig:twoSampleNormalsStacked) shows the same two normal distributions on the same axis.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Both curves represent the normal distribution; however, they differ in their center and spread. The normal distribution with mean 0 and standard deviation 1 (blue solid line, on the left) is called the **standard normal distribution**. The other distribution (green dashed line, on the right) has mean 19 and standard deviation 4.](13-foundations-mathematical_files/figure-html/twoSampleNormals-1.png){width=100%}\n:::\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:twoSampleNormalsStacked-cap)](13-foundations-mathematical_files/figure-html/twoSampleNormalsStacked-1.png){width=100%}\n:::\n:::\n\n\n(ref:twoSampleNormalsStacked-cap) The two normal models shown in Figure \\@ref(fig:twoSampleNormals) but plotted together on the same scale.\n\nIf a normal distribution has mean $\\mu$ and standard deviation $\\sigma,$ we may write the distribution as $N(\\mu, \\sigma).$ The two distributions in Figure \\@ref(fig:twoSampleNormalsStacked) can be written as\n\n$$ N(\\mu = 0, \\sigma = 1)\\quad\\text{and}\\quad N(\\mu = 19, \\sigma = 4) $$\n\nBecause the mean and standard deviation describe a normal distribution exactly, they are called the distribution's **parameters**\\index{parameter}.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nWrite down the short-hand for a normal distribution with the following parameters.\n\na. mean 5 and standard deviation 3\nb. mean -100 and standard deviation 10\nc. mean 2 and standard deviation 9\n\n------------------------------------------------------------------------\n\na. $N(\\mu = 5,\\sigma = 3)$\nb. $N(\\mu = -100, \\sigma = 10)$\nc. 
$N(\\mu = 2, \\sigma = 9)$\n:::\n\n### Standardizing with Z scores\n\n::: {.guidedpractice data-latex=\"\"}\nSAT scores follow a nearly normal distribution with a mean of 1500 points and a standard deviation of 300 points.\nACT scores also follow a nearly normal distribution with a mean of 21 points and a standard deviation of 5 points.\nSuppose Nel scored 1800 points on their SAT and Sian scored 24 points on their ACT.\nWho performed better?[^13-foundations-mathematical-3]\n:::\n\n[^13-foundations-mathematical-3]: We use the standard deviation as a guide.\n Nel is 1 standard deviation above average on the SAT: $1500 + 300 = 1800.$ Sian is 0.6 standard deviations above the mean on the ACT: $21+0.6 \\times 5 = 24.$ In Figure \\@ref(fig:satActNormals), we can see that Nel did better compared to other test takers than Sian did, so their score was better.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Nel's and Sian's scores shown with the distributions of SAT and ACT scores.](13-foundations-mathematical_files/figure-html/satActNormals-1.png){width=90%}\n:::\n:::\n\n\nThe solution to the previous example relies on a standardization technique called a Z score, a method most commonly employed for nearly normal observations (but that may be used with any distribution).\nThe **Z score**\\index{Z score} of an observation is defined as the number of standard deviations it falls above or below the mean.\nIf the observation is one standard deviation above the mean, its Z score is 1.\nIf it is 1.5 standard deviations *below* the mean, then its Z score is -1.5.\nIf $x$ is an observation from a distribution $N(\\mu, \\sigma),$ we define the Z score mathematically as\n\n\n\n\n\n$$ Z = \\frac{x-\\mu}{\\sigma} $$\n\nUsing $\\mu_{SAT}=1500,$ $\\sigma_{SAT}=300,$ and $x_{Nel}=1800,$ we find Nel's Z score:\n\n$$ Z_{Nel} = \\frac{x_{Nel} - \\mu_{SAT}}{\\sigma_{SAT}} = \\frac{1800-1500}{300} = 1 $$\n\n::: {.important data-latex=\"\"}\n**The Z score.**\n\nThe Z score of an observation is the number of standard deviations it falls above or below the mean.\nWe compute the Z score for an observation $x$ that follows a distribution with mean $\\mu$ and standard deviation $\\sigma$ using\n\n$$Z = \\frac{x-\\mu}{\\sigma}$$\n\nIf the observation $x$ comes from a *normal* distribution centered at $\\mu$ with a standard deviation of $\\sigma$, then the Z score will be distributed according to a *normal* distribution with a center of 0 and a standard deviation of 1.\nThat is, the normality remains when transforming from $x$ to $Z$; only the center and the spread change.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nUse Sian's ACT score, 24, along with the ACT mean and standard deviation to compute their Z score.[^13-foundations-mathematical-4]\n:::\n\n[^13-foundations-mathematical-4]: $Z_{Sian} = \\frac{x_{Sian} - \\mu_{ACT}}{\\sigma_{ACT}} = \\frac{24 - 21}{5} = 0.6$\n\nObservations above the mean always have positive Z scores while those below the mean have negative Z scores.\nIf an observation is equal to the mean (e.g., SAT score of 1500), then the Z score is $0.$\n\n::: {.workedexample data-latex=\"\"}\nLet $X$ represent a random variable from $N(\\mu=3, \\sigma=2),$ and suppose we observe $x=5.19.$ Find the Z score of $x.$ Then, use the Z score to determine how many standard deviations above or below the mean $x$ falls.\n\n------------------------------------------------------------------------\n\nIts Z score is given by $Z = \\frac{x-\\mu}{\\sigma} = \\frac{5.19 - 3}{2} = 2.19/2 = 1.095.$ The 
observation $x$ is 1.095 standard deviations *above* the mean.\nWe know it must be above the mean since $Z$ is positive.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nHead lengths of brushtail possums follow a nearly normal distribution with mean 92.6 mm and standard deviation 3.6 mm.\nCompute the Z scores for possums with head lengths of 95.4 mm and 85.8 mm.[^13-foundations-mathematical-5]\n:::\n\n[^13-foundations-mathematical-5]: For $x_1=95.4$ mm: $Z_1 = \\frac{x_1 - \\mu}{\\sigma} = \\frac{95.4 - 92.6}{3.6} = 0.78.$ For $x_2=85.8$ mm: $Z_2 = \\frac{85.8 - 92.6}{3.6} = -1.89.$\n\nWe can use Z scores to roughly identify which observations are more unusual than others.\nOne observation $x_1$ is said to be more unusual than another observation $x_2$ if the absolute value of its Z score is larger than the absolute value of the other observation's Z score: $|Z_1| > |Z_2|.$ This technique is especially insightful when a distribution is symmetric.\n\n::: {.guidedpractice data-latex=\"\"}\nWhich of the two brushtail possum observations in the previous guided practice is more *unusual*?[^13-foundations-mathematical-6]\n:::\n\n[^13-foundations-mathematical-6]: Because the *absolute value* of Z score for the second observation is larger than that of the first, the second observation has a more unusual head length.\n\n### Normal probability calculations\n\n::: {.workedexample data-latex=\"\"}\nNel from the SAT Guided Practice earned a score of 1800 on their SAT with a corresponding $Z=1.$ They would like to know what percentile they fall in among all SAT test-takers.\n\n------------------------------------------------------------------------\n\nNel's **percentile**\\index{percentile} is the percentage of people who earned a lower SAT score than Nel.\nWe shade the area representing those individuals in Figure \\@ref(fig:satBelow1800).\nThe total area under the normal curve is always equal to 1, and the proportion of people who scored below Nel on the SAT is equal to the *area* shaded in Figure \\@ref(fig:satBelow1800): 0.8413.\nIn other words, Nel is in the $84^{th}$ percentile of SAT takers.\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The normal model for SAT scores, shading the area of those individuals who scored below Nel.](13-foundations-mathematical_files/figure-html/satBelow1800-1.png){width=60%}\n:::\n:::\n\n\nWe can use the normal model to find percentiles or probabilities.\nA **normal probability table**\\index{normal probability table}, which lists Z scores and corresponding percentiles, can be used to identify a percentile based on the Z score (and vice versa).\nStatistical software can also be used.\n\n\n\n\n\nNormal probabilities are most commonly found using statistical software which we will show here using R.\nWe use the software to identify the percentile corresponding to any particular Z score.\nFor instance, the percentile of $Z=0.43$ is 0.6664, or the $66.64^{th}$ percentile.\nThe `pnorm()` function is available in default R and will provide the percentile associated with any cutoff on a normal curve.\nThe `normTail()` function is available in the [**openintro**](http://openintrostat.github.io/openintro/) R package and will draw the associated normal distribution curve.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npnorm(0.43, mean = 0, sd = 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.666\n```\n\n\n:::\n\n```{.r .cell-code}\nnormTail(m = 0, s = 1, L = 0.43)\n```\n\n::: 
{.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-17-1.png){width=60%}\n:::\n:::\n\n\nWe can also find the Z score associated with a percentile.\nFor example, to identify Z for the $80^{th}$ percentile, we use `qnorm()` which identifies the **quantile** for a given percentage.\nThe quantile represents the cutoff value.\n(To remember the function `qnorm()` as providing a cutoff, notice that both `qnorm()` and \"cutoff\" start with the sound \"kuh\".\nTo remember the `pnorm()` function as providing a probability from a given cutoff, notice that both `pnorm()` and probability start with the sound \"puh\".) We determine the Z score for the $80^{th}$ percentile using `qnorm()`: 0.84.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.80, mean = 0, sd = 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.842\n```\n\n\n:::\n\n```{.r .cell-code}\nopenintro::normTail(m = 0, s = 1, L = 0.842)\n```\n\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-18-1.png){width=60%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nDetermine the proportion of SAT test takers who scored better than Nel on the SAT.[^13-foundations-mathematical-7]\n:::\n\n[^13-foundations-mathematical-7]: If 84% had lower scores than Nel, the number of people who had better scores must be 16%.\n (Generally ties are ignored when the normal model, or any other continuous distribution, is used.)\n\n\\clearpage\n\n### Normal probability examples\n\nCumulative SAT scores are approximated well by a normal model, $N(\\mu=1500, \\sigma=300).$\n\n::: {.workedexample data-latex=\"\"}\nShannon is a randomly selected SAT taker, and nothing is known about Shannon's SAT aptitude.\nWhat is the probability that Shannon scores at least 1630 on their SATs?\n\n------------------------------------------------------------------------\n\nFirst, always draw and label a picture of the normal distribution.\n(Drawings need not be exact to be useful.) 
We are interested in the chance they score above 1630, so we shade the upper tail.\nSee the normal curve below.\n\nThe $x$-axis identifies the mean and the values at 2 standard deviations above and below the mean.\nThe simplest way to find the shaded area under the curve makes use of the Z score of the cutoff value.\nWith $\\mu=1500,$ $\\sigma=300,$ and the cutoff value $x=1630,$ the Z score is computed as\n\n$$ Z = \\frac{x - \\mu}{\\sigma} = \\frac{1630 - 1500}{300} = \\frac{130}{300} = 0.43 $$\n\nWe use software to find the percentile of $Z=0.43,$ which yields 0.6664.\nHowever, the percentile describes those who had a Z score *lower* than 0.43.\nTo find the area *above* $Z=0.43,$ we compute one minus the area of the lower tail, as seen below.\n\nThe probability Shannon scores at least 1630 on the SAT is 0.3336.\nThis calculation is visualized in Figure \\@ref(fig:subtractingArea).\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Visual calculation of the probability that Shannon scores at least 1630 on the SAT.](13-foundations-mathematical_files/figure-html/subtractingArea-1.png){width=90%}\n:::\n:::\n\n\n::: {.tip data-latex=\"\"}\n**Always draw a picture first, and find the Z score second.**\n\nFor any normal probability situation, *always always always* draw and label the normal curve and shade the area of interest first.\nThe picture will provide an estimate of the probability.\n\nAfter drawing a figure to represent the situation, identify the Z score for the observation of interest.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf the probability of Shannon scoring at least 1630 is 0.3336, then what is the probability they score less than 1630?\nDraw the normal curve representing this exercise, shading the lower region instead of the upper one.[^13-foundations-mathematical-8]\n:::\n\n[^13-foundations-mathematical-8]: We found the probability to be 0.6664.\n A picture for this exercise is represented by the shaded area below \"0.6664\".\n\n::: {.workedexample data-latex=\"\"}\nEdward earned a 1400 on their SAT.\nWhat is their percentile?\n\n------------------------------------------------------------------------\n\nFirst, a picture is needed.\nEdward's percentile is the proportion of people who do not get as high as a 1400.\nThese are the scores to the left of 1400.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-20-1.png){width=60%}\n:::\n:::\n\n\nThe mean $\\mu=1500,$ the standard deviation $\\sigma=300,$ and the cutoff for the tail area $x=1400$ are used to compute the Z score:\n\n$$ Z = \\frac{x - \\mu}{\\sigma} = \\frac{1400 - 1500}{300} = -0.33$$\n\nStatistical software can be used to find the proportion of the $N(0,1)$ curve to the left of $-0.33$ which is 0.3707.\nEdward is at the $37^{th}$ percentile.\n:::\n\n::: {.workedexample data-latex=\"\"}\nUse the results of the previous example to compute the proportion of SAT takers who did better than Edward.\nAlso draw a new picture.\n\n------------------------------------------------------------------------\n\nIf Edward did better than 37% of SAT takers, then about 63% must have done better than them.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-21-1.png){width=60%}\n:::\n:::\n\n:::\n\n::: {.tip data-latex=\"\"}\n**Areas to the right.**\n\nMost statistical software, as well as normal probability tables in most books, give the area to the left.\nIf you would like the area to the right, first find the 
area to the left and then subtract the amount from one.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nStuart earned an SAT score of 2100.\nDraw a picture for each part.\n(a) What is their percentile?\n(b) What percent of SAT takers did better than Stuart?[^13-foundations-mathematical-9]\n:::\n\n[^13-foundations-mathematical-9]: Numerical answers: (a) 0.9772.\n (b) 0.0228.\n\nBased on a sample of 100 men,[^13-foundations-mathematical-10] the heights of adults who identify as male, between the ages of 20 and 62 in the US, are nearly normal with mean 70.0'' and standard deviation 3.3''.\n\n[^13-foundations-mathematical-10]: This sample was taken from the USDA Food Commodity Intake Database.\n\n::: {.workedexample data-latex=\"\"}\nKamron is 5'7'' (67 inches) and Adrian is 6'4'' (76 inches).\n(a) What is Kamron's height percentile?\n(b) What is Adrian's height percentile?\nAlso draw one picture for each part.\n\n------------------------------------------------------------------------\n\nNumerical answers, calculated using statistical software (e.g., `pnorm()` in R): (a) 18.17th percentile.\n(b) 96.55th percentile.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-22-1.png){width=60%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-23-1.png){width=60%}\n:::\n:::\n\n:::\n\nThe last several problems have focused on finding the probability or percentile for a particular observation.\nWhat if you would like to know the observation corresponding to a particular percentile?\n\n::: {.workedexample data-latex=\"\"}\nYousef's height is at the $40^{th}$ percentile.\nHow tall are they?\n\n------------------------------------------------------------------------\n\nAs always, first draw the picture.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-24-1.png){width=60%}\n:::\n:::\n\n\nIn this case, the lower tail probability is known (0.40), which can be shaded on the diagram.\nWe want to find the observation that corresponds to the known probability of 0.4.\nAs a first step in this direction, we determine the Z score associated with the $40^{th}$ percentile.\n\nBecause the percentile is below 50%, we know $Z$ will be negative.\nStatistical software provides the $Z$ value to be $-0.25.$\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.4, mean = 0, sd = 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] -0.253\n```\n\n\n:::\n:::\n\n\nKnowing $Z_{Yousef}=-0.25$ and the population parameters $\\mu=70$ and $\\sigma=3.3$ inches, the Z score formula can be set up to determine Yousef's unknown height, labeled $x_{Yousef}$:\n\n$$ -0.25 = Z_{Yousef} = \\frac{x_{Yousef} - \\mu}{\\sigma} = \\frac{x_{Yousef} - 70}{3.3} $$\n\nSolving for $x_{Yousef}$ yields the height 69.18 inches.\nThat is, Yousef is about 5'9'' (this is notation for 5-feet, 9-inches).\n:::
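\n\nThe two steps above can also be combined in R by rearranging the Z score formula as $x = \\mu + Z \\times \\sigma$; a minimal sketch (the small difference from 69.18 comes from the worked example rounding the Z score to $-0.25$):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Height at the 40th percentile: mean + Z * sd, with Z taken from qnorm()\n70 + qnorm(0.40, mean = 0, sd = 1) * 3.3\n# Equivalently, qnorm() can work on the height scale directly\nqnorm(0.40, mean = 70, sd = 3.3)\n```\n\n:::\n\n::: {.workedexample data-latex=\"\"}\nWhat is the adult male height at the $82^{nd}$ percentile?\n\n------------------------------------------------------------------------\n\nAgain, we draw the figure first.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/height82Perc-1.png){width=60%}\n:::\n:::\n\n\nAnd calculate the Z value associated with the $82^{nd}$ percentile:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nqnorm(0.82, m = 0, s = 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 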
0.915\n```\n\n\n:::\n:::\n\n\nNext, we want to find the Z score at the $82^{nd}$ percentile, which will be a positive value (because the percentile is bigger than 50%).\nUsing `qnorm()`, the $82^{nd}$ percentile corresponds to $Z=0.92.$ Finally, the height $x$ is found using the Z score formula with the known mean $\\mu,$ standard deviation $\\sigma,$ and Z score $Z=0.92$:\n\n$$ 0.92 = Z = \\frac{x-\\mu}{\\sigma} = \\frac{x - 70}{3.3} $$\n\nThis yields 73.04 inches or about 6'1'' as the height at the $82^{nd}$ percentile.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\n(a) What is the $95^{th}$ percentile for SAT scores?\\\n(b) What is the $97.5^{th}$ percentile of the male heights? As always with normal probability problems, first draw a picture.[^13-foundations-mathematical-11]\n:::\n\n[^13-foundations-mathematical-11]: Remember: draw a picture first, then find the Z score.\n (We leave the pictures to you.) The Z score can be found by using the percentiles and the normal probability table.\n (a) We look for 0.95 in the probability portion (middle part) of the normal probability table, which leads us to row 1.6 and (about) column 0.05, i.e., $Z_{95}=1.65.$ Knowing $Z_{95}=1.65,$ $\\mu = 1500,$ and $\\sigma = 300,$ we setup the Z score formula: $1.65 = \\frac{x_{95} - 1500}{300}.$ We solve for $x_{95}$: $x_{95} = 1995.$ (b) Similarly, we find $Z_{97.5} = 1.96,$ again setup the Z score formula for the heights, and calculate $x_{97.5} = 76.5.$\n\n::: {.guidedpractice data-latex=\"\"}\n(a) What is the probability that a randomly selected male adult is at least 6'2'' (74 inches)?\\\n(b) What is the probability that a male adult is shorter than 5'9'' (69 inches)?[^13-foundations-mathematical-12]\n:::\n\n[^13-foundations-mathematical-12]: Numerical answers: (a) 0.1131.\n (b) 0.3821.\n\n::: {.workedexample data-latex=\"\"}\nWhat is the probability that a randomly selected adult male is between 5'9'' and 6'2''?\n\n------------------------------------------------------------------------\n\nThese heights correspond to 69 inches and 74 inches.\nFirst, draw the figure.\nThe area of interest is no longer an upper or lower tail.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-28-1.png){width=60%}\n:::\n:::\n\n\nThe total area under the curve is 1.\nIf we find the area of the two tails that are not shaded (from the previous Guided Practice, these areas are $0.3821$ and $0.1131$), then we can find the middle area:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/unnamed-chunk-29-1.png){width=90%}\n:::\n:::\n\n\nThat is, the probability of being between 5'9'' and 6'2'' is 0.5048.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nWhat percent of SAT takers get between 1500 and 2000?[^13-foundations-mathematical-13]\n:::\n\n[^13-foundations-mathematical-13]: This is an abbreviated solution.\n (Be sure to draw a figure!) 
First find the percent who get below 1500 and the percent that get above 2000: $Z_{1500} = 0.00 \\to 0.5000$ (area below), $Z_{2000} = 1.67 \\to 0.0475$ (area above).\n Final answer: $1.0000 - 0.5000 - 0.0475 = 0.4525.$\n\n::: {.guidedpractice data-latex=\"\"}\nWhat percent of adult males are between 5'5'' and 5'7''?[^13-foundations-mathematical-14]\n:::\n\n[^13-foundations-mathematical-14]: 5'5'' is 65 inches.\n 5'7'' is 67 inches.\n Numerical solution: $1.000 - 0.0649 - 0.8183 = 0.1168,$ i.e., 11.68%.\n\n## Quantifying the variability of a statistic\n\nAs seen in later chapters, it turns out that many of the statistics used to summarize data (e.g., the sample proportion, the sample mean, differences in two sample proportions, differences in two sample means, the sample slope from a linear model, etc.) vary according to the normal distribution seen above.\nThe mathematical models are derived from the normal theory, but even the computational methods (and the intuitive thinking behind both approaches) use the general bell-shaped variability seen in most of the distributions constructed so far.\n\n### 68-95-99.7 rule\n\nHere, we present a useful general rule for the probability of falling within 1, 2, and 3 standard deviations of the mean in the normal distribution.\nThe rule will be useful in a wide range of practical settings, especially when trying to make a quick estimate without a calculator or Z table.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Probabilities for falling within 1, 2, and 3 standard deviations of the mean in a normal distribution.](13-foundations-mathematical_files/figure-html/er6895997-1.png){width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nUse `pnorm()` (or a Z table) to confirm that about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean in the normal distribution, respectively.\nFor instance, first find the area that falls between $Z=-1$ and $Z=1,$ which should have an area of about 0.68.\nSimilarly, there should be an area of about 0.95 between $Z=-2$ and $Z=2.$[^13-foundations-mathematical-15]\n:::\n\n[^13-foundations-mathematical-15]: First draw the pictures.\n To find the area between $Z=-1$ and $Z=1,$ use `pnorm()` or the normal probability table to determine the areas below $Z=-1$ and above $Z=1.$ Next verify the area between $Z=-1$ and $Z=1$ is about 0.68.\n Repeat this for $Z=-2$ to $Z=2$ and for $Z=-3$ to $Z=3.$\n\nIt is possible for a normal random variable to fall 4, 5, or even more standard deviations from the mean.\nHowever, these occurrences are very rare if the data are nearly normal.\nThe probability of being further than 4 standard deviations from the mean is about 1-in-30,000.\nFor 5 and 6 standard deviations, it is about 1-in-3.5 million and 1-in-1 billion, respectively.\n\n::: {.guidedpractice data-latex=\"\"}\nSAT scores closely follow the normal model with mean $\\mu = 1500$ and standard deviation $\\sigma = 300.$ About what percent of test takers score 900 to 2100?\nWhat percent score between 1500 and 2100?[^13-foundations-mathematical-16]\n:::\n\n[^13-foundations-mathematical-16]: 900 and 2100 represent two standard deviations above and below the mean, which means about 95% of test takers will score between 900 and 2100.\n Since the normal model is symmetric, half of the test takers from part (a) ($\\frac{95\\%}{2} = 47.5\\%$ of all test takers) will score 900 to 1500 while 47.5% score between 1500 and 2100.
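\n\nThe 68-95-99.7 values can also be checked directly with `pnorm()`; a minimal sketch, where both lines below should return approximately 0.95:\n\n::: {.cell}\n\n```{.r .cell-code}\n# Area within two standard deviations of the mean on the standard normal scale\npnorm(2) - pnorm(-2)\n# The same area on the SAT scale, N(1500, 300), for scores between 900 and 2100\npnorm(2100, mean = 1500, sd = 300) - pnorm(900, mean = 1500, sd = 300)\n```\n\n:::\n\n### Standard error {#se}\n\nPoint estimates vary from sample to 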
sample, and we quantify this variability with what is called the **standard error (SE)**\\index{standard error}.\nThe standard error is equal to the standard deviation associated with the statistic.\nSo, for example, to quantify the variability of a point estimate from one sample to the next, the variability is called the standard error of the point estimate.\nAlmost always, the standard error is itself an estimate, calculated from the sample of data.\n\n\n\n\n\nThe way we determine the standard error varies from one situation to the next.\nHowever, typically it is determined using a formula based on the Central Limit Theorem.\n\n### Margin of error {#moe}\n\nClosely related to the standard error is the **margin of error**\\index{margin of error}.\nThe margin of error describes how far away observations are from their mean.\\\nFor example, to describe where most (i.e., 95%) observations lie, we say that the margin of error is approximately $2 \\times SE$.\nThat is, 95% of the observations are within one margin of error of the mean.\n\n::: {.important data-latex=\"\"}\n**Margin of error for sample proportions.**\n\nThe distance given by $z^\\star \\times SE$ is called the **margin of error**.\n\n$z^\\star$ is the cutoff value found on the normal distribution.\nThe most common value of $z^\\star$ is 1.96 (often approximated to be 2), indicating that the margin of error describes the variability associated with 95% of the sampled statistics.\n:::\n\nNotice that if the spread of the observations goes from some lower bound to some upper bound, a rough approximation of the SE is to divide the range by 4.\nThat is, if you notice the sample proportions go from 0.1 to 0.4, the SE can be approximated to be 0.075.\n\n\n\n\n\n## Case study (test): Opportunity cost {#caseopp}\n\nThe approach for using the normal model in the context of inference is very similar to the practice of applying the model to individual observations that are nearly normal.\nWe will replace the null distributions we previously obtained using the randomization or simulation techniques and verify the results once again using the normal model.\nWhen the sample size is sufficiently large, the normal approximation generally provides us with the same conclusions as the simulation model.\n\n### Observed data\n\nIn Section \\@ref(caseStudyOpportunityCost) we were introduced to the opportunity cost study, which found that students became thriftier when they were reminded that not spending money now means the money can be spent on other things in the future.\nLet's re-analyze the data in the context of the normal distribution and compare the results.\n\n::: {.data data-latex=\"\"}\nThe [`opportunity_cost`](http://openintrostat.github.io/openintro/reference/opportunity_cost.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n### Variability of the statistic\n\nFigure \\@ref(fig:OpportunityCostDiffs-w-normal) summarizes the null distribution as determined using the randomization method.\nThe best-fitting normal distribution for the null distribution has a mean of 0.\nWe can calculate the standard error of this distribution by borrowing a formula that we will become familiar with in Chapter \\@ref(inference-two-props), but for now let's just take the value $SE = 0.078$ as a given.\nRecall that the point estimate of the difference was 0.20, as shown in Figure \\@ref(fig:OpportunityCostDiffs-w-normal).\nNext, we'll use the normal distribution approach to compute the p-value.
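\n\nAs a quick preview of that calculation, the upper tail area beyond the observed difference of 0.20 can be computed directly from a normal model with mean 0 and standard deviation 0.078 using `pnorm()`; a minimal sketch (the next part of this section walks through the same calculation with a Z score):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Tail area above the observed difference of 0.20 under the N(0, 0.078) null model.\n# The result is roughly 0.005, in line with the Z = 2.56 calculation that follows.\n1 - pnorm(0.20, mean = 0, sd = 0.078)\n```\n\n:::\n\n\n::: 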
{.cell}\n::: {.cell-output-display}\n![Null distribution of differences with an overlaid normal curve for the opportunity cost study. 10,000 simulations were run for this figure.](13-foundations-mathematical_files/figure-html/OpportunityCostDiffs-w-normal-1.png){width=90%}\n:::\n:::\n\n\n### Observed statistic vs. null statistics\n\nAs we learned in Section \\@ref(normalDist), it is helpful to draw and shade a picture of the normal distribution so we know precisely what we want to calculate.\nHere we want to find the area of the tail beyond 0.2, representing the p-value.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/OpportunityCostDiffs_normal_only-1.png){width=60%}\n:::\n:::\n\n\nNext, we can calculate the Z score using the observed difference, 0.20, and the two model parameters.\nThe standard error, $SE = 0.078,$ is the equivalent of the model's standard deviation.\n\n$$Z = \\frac{\\text{observed difference} - 0}{SE} = \\frac{0.20 - 0}{0.078} = 2.56$$\n\nWe can either use statistical software or look up $Z = 2.56$ in the normal probability table to determine the right tail area: 0.0052, which is about the same as what we got for the right tail using the randomization approach (0.006).\nUsing this area as the p-value, we see that the p-value is less than 0.05, and we conclude that the treatment did indeed impact students' spending.\n\n::: {.important data-latex=\"\"}\n**Z score in a hypothesis test.**\n\nIn the context of a hypothesis test, the Z score for a point estimate is\n\n$$Z = \\frac{\\text{point estimate} - \\text{null value}}{SE}$$\n\nThe standard error in this case is the equivalent of the standard deviation of the point estimate, and the null value comes from the claim made in the null hypothesis.\n:::\n\nWe have confirmed that the randomization approach we used earlier and the normal distribution approach provide almost identical p-values and conclusions in the opportunity cost case study.\nNext, let's turn our attention to the medical consultant case study.\n\n## Case study (test): Medical consultant {#casemed}\n\n### Observed data\n\nIn Section \\@ref(case-study-med-consult) we learned about a medical consultant who reported that only 3 of their 62 clients who underwent a liver transplant had complications, which is less than the more common complication rate of 0.10.\nIn that work, we did not model a null scenario, but we will discuss a simulation method for a one proportion null distribution in Section \\@ref(one-prop-null-boot); such a distribution is provided in Figure \\@ref(fig:MedConsNullSim-w-normal).\nWe have added the best-fitting normal curve to the figure, which has a mean of 0.10.\nBorrowing a formula that we'll encounter in Chapter \\@ref(inference-one-prop), the standard error of this distribution was also computed: $SE = 0.038.$\n\n### Variability of the statistic\n\nBefore we begin, we want to point out a simple detail that is easy to overlook: the null distribution we generated from the simulation is slightly skewed, and the histogram is not particularly smooth.\nIn fact, the normal distribution only sort-of fits this model.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The null distribution for the sample proportion, created from 10,000 simulated studies, along with the best-fitting normal model.](13-foundations-mathematical_files/figure-html/MedConsNullSim-w-normal-1.png){width=90%}\n:::\n:::\n\n\n### Observed statistic vs. 
null statistics\n\nAs always, we'll draw a picture before finding the normal probabilities.\nBelow is a normal distribution centered at 0.10 with a standard error of 0.038.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-foundations-mathematical_files/figure-html/MedConsNullSim-normal-only-1.png){width=60%}\n:::\n:::\n\n\nNext, we can calculate the Z score using the observed complication rate, $\\hat{p} = 0.048$ along with the mean and standard deviation of the normal model.\nHere again, we use the standard error for the standard deviation.\n\n$$Z = \\frac{\\hat{p} - p_0}{SE_{\\hat{p}}} = \\frac{0.048 - 0.10}{0.038} = -1.37$$\n\nIdentifying $Z = -1.37$ using statistical software or in the normal probability table, we can determine that the left tail area is 0.0853 which is the estimated p-value for the hypothesis test.\nThere is a small problem: the p-value of 0.0853 is almost 30% smaller than the simulation p-value of 0.1222 which will be calculated in Section \\@ref(one-prop-null-boot).\n\nThe discrepancy is explained by the normal model's poor representation of the null distribution in Figure \\@ref(fig:MedConsNullSim-w-normal).\nAs noted earlier, the null distribution from the simulations is not very smooth, and the distribution itself is slightly skewed.\nThat's the bad news.\nThe good news is that we can foresee these problems using some simple checks.\nWe'll learn more about these checks in the following chapters.\n\nIn Section \\@ref(CLTsection) we noted that the two common requirements to apply the Central Limit Theorem are (1) the observations in the sample must be independent, and (2) the sample must be sufficiently large.\nThe guidelines for this particular situation -- which we will learn in Chapter \\@ref(inference-one-prop) -- would have alerted us that the normal model was a poor approximation.\n\n### Conditions for applying the normal model\n\nThe success story in this section was the application of the normal model in the context of the opportunity cost data.\nHowever, the biggest lesson comes from the less successful attempt to use the normal approximation in the medical consultant case study.\n\nStatistical techniques are like a carpenter's tools.\nWhen used responsibly, they can produce amazing and precise results.\nHowever, if the tools are applied irresponsibly or under inappropriate conditions, they will produce unreliable results.\nFor this reason, with every statistical method that we introduce in future chapters, we will carefully outline conditions when the method can reasonably be used.\nThese conditions should be checked in each application of the technique.\n\nAfter covering the introductory topics in this course, advanced study may lead to working with complex models which, for example, bring together many variables with different variability structure.\nWorking with data that come from normal populations makes higher-order models easier to estimate and interpret.\nThere are times when simulation, randomization, or bootstrapping are unwieldy in either structure or computational demand.\nNormality can often lead to excellent approximations of the data using straightforward modeling techniques.\n\n## Case study (interval): Stents {#casestent}\n\nA point estimate is our best guess for the value of the parameter, so it makes sense to build the confidence interval around that value.\nThe standard error, which is a measure of the uncertainty associated with the point estimate, provides a guide for how large we should make the confidence interval.\nThe 
68-95-99.7 rule tells us that, in general, 95% of observations are within 2 standard errors of the mean.\nHere, we use the value 1.96 to be slightly more precise.\n\n::: {.important data-latex=\"\"}\n**Constructing a 95% confidence interval.**\n\nWhen the sampling distribution of a point estimate can reasonably be modeled as normal, the point estimate we observe will be within 1.96 standard errors of the true value of interest about 95% of the time.\nThus, a **95% confidence interval** for such a point estimate can be constructed:\n\n$$\\text{point estimate} \\pm 1.96 \\times SE$$\n\nWe can be **95% confident** this interval captures the true value.\n:::\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the area between -1.96 and 1.96 for a normal distribution with mean 0 and standard deviation 1.[^13-foundations-mathematical-17]\n:::\n\n[^13-foundations-mathematical-17]: We will leave it to you to draw a picture.\n The Z scores are $Z_{left} = -1.96$ and $Z_{right} = 1.96.$ The area between these two Z scores is $0.9750 - 0.0250 = 0.9500.$ This is where \"1.96\" comes from in the 95% confidence interval formula.\n\n::: {.workedexample data-latex=\"\"}\nThe point estimate from the opportunity cost study was that 20% fewer students would buy a video if they were reminded that money not spent now could be spent later on something else.\nThe point estimate from this study can reasonably be modeled with a normal distribution, and a proper standard error for this point estimate is $SE = 0.078.$ Construct a 95% confidence interval.\n\n------------------------------------------------------------------------\n\nSince the conditions for the normal approximation have already been verified, we can move forward with the construction of the 95% confidence interval:\n\n$$\\text{point estimate} \\pm 1.96 \\times SE = 0.20 \\pm 1.96 \\times 0.078 = (0.047, 0.353)$$\n\nWe are 95% confident that the video purchase rate resulting from the treatment is between 4.7% and 35.3% lower than in the control group.\nSince this confidence interval does not contain 0, it is consistent with our earlier result where we rejected the notion of \"no difference\" using a hypothesis test.\n\nNote that we have used SE = 0.078 from the last section.\nHowever, it would more generally be appropriate to recompute the SE slightly differently for this confidence interval using sample proportions.\nDon't worry about this detail for now since the two resulting standard errors are, in this case, almost identical.\n:::\n\n### Observed data\n\nConsider an experiment that examined whether implanting a stent in the brain of a patient at risk for a stroke helps reduce the risk of a stroke.\nThe results from the first 30 days of this study, which included 451 patients, are summarized in Table \\@ref(tab:stentStudyResultsCIsection).\nThese results are surprising!\nThe point estimate suggests that patients who received stents may have a *higher* risk of stroke: $p_{trmt} - p_{ctrl} = 0.090.$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Descriptive statistics for 30-day results for the stent study.
Group No event Stroke Total
control 214 13 227
treatment 191 33 224
Total 405 46 451
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`stent30`](http://openintrostat.github.io/openintro/reference/stent30.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n### Variability of the statistic\n\n::: {.workedexample data-latex=\"\"}\nConsider the stent study and results.\nThe conditions necessary to ensure the point estimate $p_{trmt} - p_{ctrl} = 0.090$ is nearly normal have been verified for you, and the estimate's standard error is $SE = 0.028.$ Construct a 95% confidence interval for the change in 30-day stroke rates from usage of the stent.\n\n------------------------------------------------------------------------\n\nThe conditions for applying the normal model have already been verified, so we can proceed to the construction of the confidence interval:\n\n$$\\text{point estimate} \\pm 1.96 \\times SE = 0.090 \\pm 1.96 \\times 0.028 = (0.035, 0.145)$$\n\nWe are 95% confident that implanting a stent in a stroke patient's brain increased the risk of stroke within 30 days by a rate of 0.035 to 0.145.\nThis confidence interval can also be used in a way analogous to a hypothesis test: since the interval does not contain 0 (is completely above 0), it means the data provide convincing evidence that the stent used in the study changed the risk of stroke within 30 days.\n:::\n\nAs with hypothesis tests, confidence intervals are imperfect.\nAbout 1-in-20 properly constructed 95% confidence intervals will fail to capture the parameter of interest, simply due to natural variability in the observed data.\nFigure \\@ref(fig:95PercentConfidenceInterval) shows 25 confidence intervals for a proportion that were constructed from 25 different datasets that all came from the same population where the true proportion was $p = 0.3.$ However, 1 of these 25 confidence intervals happened not to include the true value.\nThe interval which does not capture $p=0.3$ is not due to bad science.\nInstead, it is due to natural variability, and we should *expect* some of our intervals to miss the parameter of interest.\nIndeed, over a lifetime of creating 95% intervals, you should expect 5% of your reported intervals to miss the parameter of interest (unfortunately, you will never know which of your reported intervals captured the parameter and which missed the parameter).
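\n\nA rough simulation sketch of the idea behind the figure below is shown here; this is not the authors' code, and it borrows the proportion standard error formula $\\sqrt{\\hat{p}(1-\\hat{p})/n}$ that is introduced in a later chapter:\n\n::: {.cell}\n\n```{.r .cell-code}\n# Draw 25 samples of size 300 from a population with p = 0.3, build a rough\n# 95% confidence interval from each, and count how many capture p = 0.3.\nset.seed(25)\ncaptured <- replicate(25, {\n  x <- rbinom(1, size = 300, prob = 0.3)   # number of successes in one sample\n  p_hat <- x / 300\n  se <- sqrt(p_hat * (1 - p_hat) / 300)\n  lower <- p_hat - 1.96 * se\n  upper <- p_hat + 1.96 * se\n  lower <= 0.3 & 0.3 <= upper\n})\nsum(captured)   # count of intervals that capture p = 0.3, usually 23 to 25 of 25\n```\n\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Twenty-five samples of size $n=300$ were collected from a population with $p = 0.30.$ For each sample, a confidence interval was created to try to capture the true proportion $p.$ However, 1 of these 25 intervals did not capture $p = 0.30.$](13-foundations-mathematical_files/figure-html/95PercentConfidenceInterval-1.png){fig-alt='A series of 25 horizontal lines are drawn, representing each of 25 different samples. Each vertical line starts at the value of the lower bound of the confidence interval and ends at the value of the upper bound of the confidence interval which was created from that particular sample. In the center of the line is a solid dot at the observed proportion of successes for that particular sample. A dashed vertical line runs through the horizontal lines at p = 0.3 (which is the true value of the population proportion). 24 of the 25 horizontal lines cross the vertical line at 0.3, but one of the horizontal lines is completely lower than 0.3. The line that does not cross 0.3 is colored red because the confidence interval from that particular sample would not have captured the true population proportion.' 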
\n\n\n::: {.guidedpractice data-latex=\"\"}\nIn Figure \\@ref(fig:95PercentConfidenceInterval), one interval does not contain the true proportion, $p = 0.3.$ Does this imply that there was a problem with the datasets that were selected?[^13-foundations-mathematical-18]\n:::\n\n[^13-foundations-mathematical-18]: No.\n    Just as some observations occur more than 1.96 standard deviations from the mean, some point estimates will be more than 1.96 standard errors from the parameter.\n    A confidence interval only provides a plausible range of values for a parameter.\n    While we might say other values are implausible based on the data, this does not mean they are impossible.\n\n### Interpreting confidence intervals\n\nA careful eye might have observed the somewhat awkward language used to describe confidence intervals.\n\n::: {.important data-latex=\"\"}\n**Correct confidence interval interpretation.**\n\nWe are XX% confident that the population parameter is between *lower* and *upper* (where *lower* and *upper* are both numerical values).\n\n**Incorrect** language might try to describe the confidence interval as capturing the population parameter with a certain probability.\n\nThis is one of the most common errors: while it might be useful to think of it as a probability, the confidence level only quantifies how plausible it is that the parameter is in the interval.\n:::\n\nAnother especially important consideration of confidence intervals is that they *only try to capture the population parameter*.\nOur intervals say nothing about the confidence of capturing individual observations, a proportion of the observations, or about capturing point estimates.\nConfidence intervals provide an interval estimate for and attempt to capture **population parameters**.\n\n\\index{confidence interval}\n\n\\clearpage\n\n## Chapter review {#chp13-review}\n\n### Summary\n\nWe can summarize the process of using the normal model as follows:\n\n- **Frame the research question.** The mathematical model can be applied to both the hypothesis testing and the confidence interval framework. Make sure that your research question is being addressed by the most appropriate inference procedure.\n- **Collect data with an observational study or experiment.** To address the research question, collect data on the variables of interest. Note that your data may be a random sample from a population or may be part of a randomized experiment.\n- **Model the randomness of the statistic.** In many cases, the normal distribution will be an excellent model for the randomness associated with the statistic of interest. The Central Limit Theorem tells us that if the sample size is large enough, sample averages (which can be calculated as either a proportion or a sample mean) will be approximately normally distributed when describing how the statistics change from sample to sample.\n- **Calculate the variability of the statistic.** Using formulas, come up with the standard deviation (or more typically, an estimate of the standard deviation called the standard error) of the statistic. 
The SE of the statistic will give information on how far the observed statistic is from the null hypothesized value (if performing a hypothesis test) or from the unknown population parameter (if creating a confidence interval).\n- **Use the normal distribution to quantify the variability.** The normal distribution will provide a probability which measures how likely it is for your observed and hypothesized (or observed and unknown) parameter to differ by the amount measured. The unusualness (or not) of the discrepancy will form the conclusion to the research question.\n- **Form a conclusion.** Using the p-value or the confidence interval from the analysis, report on the research question of interest. Also, be sure to write the conclusion in plain language so casual readers can understand the results.\n\nTable \\@ref(tab:chp13-summary) is another look at the mathematical model approach to inference.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of mathematical models as an inferential statistical method.
Question Answer
What does it do? Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples
What is the random process described? Randomized experiment or random sampling
What other random processes can be approximated? Randomized experiment or random sampling
What is it best for? Quick analyses through, for example, calculating a Z score.
What physical object represents the simulation process? Not applicable
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
95% confidence interval normal distribution percentile
95% confident normal model sampling distribution
Central Limit Theorem normal probability table standard error
margin of error null distribution standard normal distribution
normal curve parameter Z score
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp13-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-13].\n\n::: {.exercises data-latex=\"\"}\n1. **Chronic illness.**\nIn 2013, the Pew Research Foundation reported that \"45% of U.S. adults report that they live with one or more chronic conditions\". However, this value was based on a sample, so it may not be a perfect estimate for the population parameter of interest on its own. The study reported a standard error of about 1.2%, and a normal model may reasonably be used in this setting. \n\n a. Create a 95% confidence interval for the proportion of U.S. adults who live with one or more chronic conditions. Also interpret the confidence interval in the context of the study. [@data:pewdiagnosis:2013]\n \n b. Identify each of the following statements as true or false. Provide an explanation to justify each of your answers.\n \n i. We can say with certainty that the confidence interval from part (a) contains the true percentage of U.S. adults who suffer from a chronic illness.\n\n ii. If we repeated this study 1,000 times and constructed a 95% confidence interval for each study, then approximately 950 of those confidence intervals would contain the true fraction of U.S. adults who suffer from chronic illnesses.\n\n iii. The poll provides statistically significant evidence (at the $\\alpha = 0.05$ level) that the percentage of U.S. adults who suffer from chronic illnesses is below 50%.\n\n iv. Since the standard error is 1.2%, only 1.2% of people in the study communicated uncertainty about their answer.\n\n1. **Twitter users and news.**\nA poll conducted in 2013 found that 52% of U.S. adult Twitter users get at least some news on Twitter. The standard error for this estimate was 2.4%, and a normal distribution may be used to model the sample proportion. [@data:pewtwitternews:2013]\n \n a. Construct a 99% confidence interval for the fraction of U.S. adult Twitter users who get some news on Twitter, and interpret the confidence interval in context.\n \n b. Identify each of the following statements as true or false. Provide an explanation to justify each of your answers.\n \n i. The data provide statistically significant evidence that more than half of U.S. adult Twitter users get some news through Twitter. Use a significance level of $\\alpha = 0.01$.\n \n ii. Since the standard error is 2.4%, we can conclude that 97.6% of all U.S. adult Twitter users were included in the study.\n \n iii. If we want to reduce the standard error of the estimate, we should collect less data.\n \n iv. If we construct a 90% confidence interval for the percentage of U.S. adults Twitter users who get some news through Twitter, this confidence interval will be wider than a corresponding 99% confidence interval.\n \n \\clearpage\n\n1. **Interpreting a Z score from a sample proportion.**\nSuppose that you conduct a hypothesis test about a population proportion and calculate the Z score to be 0.47. Which of the following is the best interpretation of this value? For the problems which are not a good interpretation, indicate the statistical idea being described.^[This exercise was inspired by discussion on Dr. Allan Rossman's blog [Ask Good Questions](https://askgoodquestions.blog/).]\n\n a. The probability is 0.47 that the null hypothesis is true. \n \n b. 
If the null hypothesis were true, the probability would be 0.47 of obtaining a sample proportion as far as observed from the hypothesized value of the population proportion. \n    \n    c. The sample proportion is 0.47 standard errors greater than the hypothesized value of the population proportion. \n    \n    d. The sample proportion is equal to 0.47 times the standard error.\n    \n    e. The sample proportion is 0.47 away from the hypothesized value of the population.\n    \n    f. The sample proportion is 0.47.\n    \n    \\vspace{-2mm}\n\n1. **Mental health.**\nThe General Social Survey asked the question: \"For how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?\\\" Based on responses from 1,151 US residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010.\n\n    a. Interpret this interval in context of the data.\n\n    b. What does \"95% confident\\\" mean? Explain in the context of the application.\n\n    c. Suppose the researchers think a 99% confidence level would be more appropriate for this interval. Will this new interval be smaller or wider than the 95% confidence interval?\n\n    d. If a new survey were to be done with 500 Americans, do you think the standard error of the estimate would be larger, smaller, or about the same?\n    \n    \\vspace{-2mm}\n\n1. **Repeated water samples.**\nA nonprofit wants to understand the fraction of households that have elevated levels of lead in their drinking water. They expect at least 5% of homes will have elevated levels of lead, but not more than about 30%. They randomly sample 800 homes and work with the owners to retrieve water samples, and they compute the fraction of these homes with elevated lead levels. They repeat this 1,000 times and build a distribution of sample proportions.\n\n    a. What is this distribution called?\n\n    b. Would you expect the shape of this distribution to be symmetric, right skewed, or left skewed? Explain your reasoning.\n\n    c. What is the name of the variability of this distribution?\n\n    d. Suppose the researchers' budget is reduced, and they are only able to collect 250 observations per sample, but they can still collect 1,000 samples. They build a new distribution of sample proportions. How will the variability of this new distribution compare to the variability of the distribution when each sample contained 800 observations?\n    \n    \\vspace{-2mm}\n\n1. **Repeated student samples.**\nOf all freshmen at a large college, 16% made the dean's list in the current year. As part of a class project, students randomly sample 40 students and check if those students made the list. They repeat this 1,000 times and build a distribution of sample proportions.\n\n    a. What is this distribution called?\n\n    b. Would you expect the shape of this distribution to be symmetric, right skewed, or left skewed? Explain your reasoning.\n\n    c. What is the name of the variability of this distribution?\n\n    d. Suppose the students decide to sample again, this time collecting 90 students per sample, and they again collect 1,000 samples. They build a new distribution of sample proportions. 
How will the variability of this new distribution compare to the variability of the distribution when each sample contained 40 observations?\n\n\n:::\n", + "supporting": [ + "13-foundations-mathematical_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/13-foundations-mathematical/figure-html/95PercentConfidenceInterval-1.png b/_freeze/13-foundations-mathematical/figure-html/95PercentConfidenceInterval-1.png new file mode 100644 index 00000000..6240f254 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/95PercentConfidenceInterval-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/FourCaseStudies-1.png b/_freeze/13-foundations-mathematical/figure-html/FourCaseStudies-1.png new file mode 100644 index 00000000..44c05de7 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/FourCaseStudies-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/MedConsNullSim-normal-only-1.png b/_freeze/13-foundations-mathematical/figure-html/MedConsNullSim-normal-only-1.png new file mode 100644 index 00000000..a5fa1ec5 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/MedConsNullSim-normal-only-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/MedConsNullSim-w-normal-1.png b/_freeze/13-foundations-mathematical/figure-html/MedConsNullSim-w-normal-1.png new file mode 100644 index 00000000..d62f837b Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/MedConsNullSim-w-normal-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/OpportunityCostDiffs-w-normal-1.png b/_freeze/13-foundations-mathematical/figure-html/OpportunityCostDiffs-w-normal-1.png new file mode 100644 index 00000000..63cb758c Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/OpportunityCostDiffs-w-normal-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/OpportunityCostDiffs_normal_only-1.png b/_freeze/13-foundations-mathematical/figure-html/OpportunityCostDiffs_normal_only-1.png new file mode 100644 index 00000000..93d8d418 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/OpportunityCostDiffs_normal_only-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/er6895997-1.png b/_freeze/13-foundations-mathematical/figure-html/er6895997-1.png new file mode 100644 index 00000000..6e9d12c6 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/er6895997-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/height82Perc-1.png b/_freeze/13-foundations-mathematical/figure-html/height82Perc-1.png new file mode 100644 index 00000000..f9b5a37a Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/height82Perc-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/satActNormals-1.png b/_freeze/13-foundations-mathematical/figure-html/satActNormals-1.png new file mode 100644 index 00000000..4d645110 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/satActNormals-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/satBelow1800-1.png b/_freeze/13-foundations-mathematical/figure-html/satBelow1800-1.png new file mode 100644 index 00000000..bf15ad84 Binary files 
/dev/null and b/_freeze/13-foundations-mathematical/figure-html/satBelow1800-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/simpleNormal-1.png b/_freeze/13-foundations-mathematical/figure-html/simpleNormal-1.png new file mode 100644 index 00000000..793be518 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/simpleNormal-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/subtractingArea-1.png b/_freeze/13-foundations-mathematical/figure-html/subtractingArea-1.png new file mode 100644 index 00000000..8b806bc0 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/subtractingArea-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/twoSampleNormals-1.png b/_freeze/13-foundations-mathematical/figure-html/twoSampleNormals-1.png new file mode 100644 index 00000000..c3ebcd6d Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/twoSampleNormals-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/twoSampleNormalsStacked-1.png b/_freeze/13-foundations-mathematical/figure-html/twoSampleNormalsStacked-1.png new file mode 100644 index 00000000..32ac6910 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/twoSampleNormalsStacked-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-17-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 00000000..199c4ede Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-18-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 00000000..a791fbd1 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-18-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-20-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 00000000..8af62dd4 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-20-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-21-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 00000000..bc5875c0 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-21-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-22-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 00000000..07e4133d Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-22-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-23-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 00000000..cd8faf7b Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-23-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-24-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..6df8f2bd Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-24-1.png differ diff --git 
a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-28-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 00000000..4b99cc01 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-29-1.png b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 00000000..96ec3b02 Binary files /dev/null and b/_freeze/13-foundations-mathematical/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/14-foundations-errors/execute-results/html.json b/_freeze/14-foundations-errors/execute-results/html.json new file mode 100644 index 00000000..ee51366c --- /dev/null +++ b/_freeze/14-foundations-errors/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "4b0754b7229805b53305759254ff29dd", + "result": { + "markdown": "# Decision Errors {#decerr}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nUsing data to make inferential decisions about larger populations is not a perfect process.\nAs seen in Chapter \\@ref(foundations-randomization), a small p-value typically leads the researcher to a decision to reject the null claim or hypothesis.\nSometimes, however, data can produce a small p-value when the null hypothesis is actually true and the data are just inherently variable.\nHere we describe the errors which can arise in hypothesis testing, how to define and quantify the different errors, and suggestions for mitigating errors if possible.\n:::\n\n\\index{decision errors}\n\nHypothesis tests are not flawless.\nJust think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free.\nSimilarly, data can point to the wrong conclusion.\nHowever, what distinguishes statistical hypothesis tests from a court system is that our framework allows us to quantify and control how often the data lead us to the incorrect conclusion.\n\nIn a hypothesis test, there are two competing hypotheses: the null and the alternative.\nWe make a statement about which one might be true, but we might choose incorrectly.\nThere are four possible scenarios in a hypothesis test, which are summarized in Table \\@ref(tab:fourHTScenarios).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n\n
Four different scenarios for hypothesis tests.
Test conclusion
Truth Reject null hypothesis Fail to reject null hypothesis
Null hypothesis is true Type 1 Error Good decision
Alternative hypothesis is true Good decision Type 2 Error
\n\n`````\n:::\n:::\n\n\nA **Type 1 Error**\\index{Type 1 Error} is rejecting the null hypothesis when $H_0$ is actually true.\nSince we rejected the null hypothesis in the sex discrimination and opportunity cost studies, it is possible that we made a Type 1 Error in one or both of those studies.\nA **Type 2 Error**\\index{Type 2 Error} is failing to reject the null hypothesis when the alternative is actually true.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nIn a US court, the defendant is either innocent $(H_0)$ or guilty $(H_A).$ What does a Type 1 Error represent in this context?\nWhat does a Type 2 Error represent?\nTable \\@ref(tab:fourHTScenarios) may be useful.\n\n------------------------------------------------------------------------\n\nIf the court makes a Type 1 Error, this means the defendant is innocent $(H_0$ true) but wrongly convicted.\nA Type 2 Error means the court failed to reject $H_0$ (i.e., failed to convict the person) when they were in fact guilty $(H_A$ true).\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nConsider the opportunity cost study where we concluded students were less likely to make a DVD purchase if they were reminded that money not spent now could be spent later.\nWhat would a Type 1 Error represent in this context?[^14-foundations-errors-1]\n:::\n\n[^14-foundations-errors-1]: Making a Type 1 Error in this context would mean that reminding students that money not spent now can be spent later does not affect their buying habits, despite the strong evidence (the data suggesting otherwise) found in the experiment.\n Notice that this does *not* necessarily mean something was wrong with the data or that we made a computational mistake.\n Sometimes data simply point us to the wrong conclusion, which is why scientific studies are often repeated to check initial findings.\n\n::: {.workedexample data-latex=\"\"}\nHow could we reduce the Type 1 Error rate in US courts?\nWhat influence would this have on the Type 2 Error rate?\n\n------------------------------------------------------------------------\n\nTo lower the Type 1 Error rate, we might raise our standard for conviction from \"beyond a reasonable doubt\" to \"beyond a conceivable doubt\" so fewer people would be wrongly convicted.\nHowever, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nHow could we reduce the Type 2 Error rate in US courts?\nWhat influence would this have on the Type 1 Error rate?[^14-foundations-errors-2]\n:::\n\n[^14-foundations-errors-2]: To lower the Type 2 Error rate, we want to convict more guilty people.\n We could lower the standards for conviction from \"beyond a reasonable doubt\" to \"beyond a little doubt\".\n Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.\n\n\\index{decision errors}\n\nThe example and guided practice above provide an important lesson: if we reduce how often we make one type of error, we generally make more of the other type.\n\n\\clearpage\n\n## Significance level\n\n\\index{significance level}\n\nThe **significance level** provides the cutoff for the p-value which will lead to a decision of \"reject the null hypothesis.\" Choosing a significance level for a test is important in many contexts, and the traditional level is 0.05.\nHowever, it is sometimes helpful to adjust the significance level based on the application.\nWe may select a level that is smaller or larger than 
0.05 depending on the consequences of any conclusions reached from the test.\n\nIf making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g., 0.01 or 0.001).\nIf we want to be very cautious about rejecting the null hypothesis, we demand very strong evidence favoring the alternative $H_A$ before we would reject $H_0.$\n\nIf a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g., 0.10).\nHere we want to be cautious about failing to reject $H_0$ when the null is actually false.\n\n\n\n\n\n::: {.tip data-latex=\"\"}\n**Significance levels should reflect consequences of errors.**\n\nThe significance level selected for a test should reflect the real-world consequences associated with making a Type 1 or Type 2 Error.\n:::\n\n## Two-sided hypotheses\n\n\\index{hypothesis testing}\n\nIn Chapter \\@ref(foundations-randomization) we explored whether women were discriminated against and whether a simple trick could make students a little thriftier.\nIn these two case studies, we have actually ignored some possibilities:\n\n- What if *men* are actually discriminated against?\n- What if the money trick actually makes students *spend more*?\n\nThese possibilities weren't considered in our original hypotheses or analyses.\nThe disregard of the extra alternatives may have seemed natural since the data pointed in the directions in which we framed the problems.\nHowever, there are two dangers if we ignore possibilities that disagree with our data or that conflict with our world view:\n\n1. Framing an alternative hypothesis simply to match the direction that the data point will generally inflate the Type 1 Error rate.\n After all the work we have done (and will continue to do) to rigorously control the error rates in hypothesis tests, careless construction of the alternative hypotheses can disrupt that hard work.\n\n2. 
If we only use alternative hypotheses that agree with our worldview, then we are going to be subjecting ourselves to **confirmation bias**\\index{confirmation bias}, which means we are looking for data that supports our ideas.\n That's not very scientific, and we can do better!\n\nThe original hypotheses we have seen are called **one-sided hypothesis tests**\\index{one-sided hypothesis test} because they only explored one direction of possibilities.\nSuch hypotheses are appropriate when we are exclusively interested in the single direction, but usually we want to consider all possibilities.\nTo do so, let's learn about **two-sided hypothesis tests**\\index{two-sided hypothesis test} in the context of a new study that examines the impact of using blood thinners on patients who have undergone CPR.\n\n\n\n\n\nCardiopulmonary resuscitation (CPR) is a procedure used on individuals suffering a heart attack when other emergency resources are unavailable.\nThis procedure is helpful in providing some blood circulation to keep a person alive, but CPR chest compression can also cause internal injuries.\nInternal bleeding and other injuries that can result from CPR complicate additional treatment efforts.\nFor instance, blood thinners may be used to help release a clot that is causing the heart attack once a patient arrives in the hospital.\nHowever, blood thinners negatively affect internal injuries.\n\nHere we consider an experiment with patients who underwent CPR for a heart attack and were subsequently admitted to a hospital.\nEach patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group).\nThe outcome variable of interest was whether the patient survived for at least 24 hours.\n[@Bottiger:2001]\n\n::: {.data data-latex=\"\"}\nThe [`cpr`](http://openintrostat.github.io/openintro/reference/cpr.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n::: {.workedexample data-latex=\"\"}\nForm hypotheses for this study in plain and statistical language.\nLet $p_C$ represent the true survival rate of people who do not receive a blood thinner (corresponding to the control group) and $p_T$ represent the survival rate for people receiving a blood thinner (corresponding to the treatment group).\n\n------------------------------------------------------------------------\n\nWe want to understand whether blood thinners are helpful or harmful.\nWe'll consider both of these possibilities using a two-sided hypothesis test.\n\n- $H_0:$ Blood thinners do not have an overall survival effect, i.e., the survival proportions are the same in each group.\n $p_T - p_C = 0.$\n\n- $H_A:$ Blood thinners have an impact on survival, either positive or negative, but not zero.\n $p_T - p_C \\neq 0.$\n\nNote that if we had done a one-sided hypothesis test, the resulting hypotheses would have been:\n\n- $H_0:$ Blood thinners do not have a positive overall survival effect, i.e., the survival proportions for the blood thinner group is the same or lower than the control group.\n $p_T - p_C \\leq 0.$\n\n- $H_A:$ Blood thinners have a positive impact on survival.\n $p_T - p_C > 0.$\n:::\n\nThere were 50 patients in the experiment who did not receive a blood thinner and 40 patients who did.\nThe study results are shown in Table \\@ref(tab:cpr-summary).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Results for the CPR study. Patients in the treatment group were given a blood thinner, and patients in the control group were not.
Group Died Survived Total
Control 39 11 50
Treatment 26 14 40
Total 65 25 90
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat is the observed survival rate in the control group?\nAnd in the treatment group?\nAlso, provide a point estimate $(\\hat{p}_T - \\hat{p}_C)$ for the true difference in population survival proportions across the two groups: $p_T - p_C.$[^14-foundations-errors-3]\n:::\n\n[^14-foundations-errors-3]: Observed control survival rate: $\\hat{p}_C = \\frac{11}{50} = 0.22.$ Treatment survival rate: $\\hat{p}_T = \\frac{14}{40} = 0.35.$ Observed difference: $\\hat{p}_T - \\hat{p}_C = 0.35 - 0.22 = 0.13.$\n\nAccording to the point estimate, for patients who have undergone CPR outside of the hospital, an additional 13% of these patients survive when they are treated with blood thinners.\nHowever, we wonder if this difference could be easily explainable by chance, if the treatment has no effect on survival.\n\nAs we did in past studies, we will simulate what type of differences we might see from chance alone under the null hypothesis.\nBy randomly assigning each patient's file to a \"simulated treatment\" or \"simulated control\" allocation, we get a new grouping.\nIf we repeat this simulation 1,000 times, we can build a **null distribution**\\index{null distribution} of the differences shown in Figure \\@ref(fig:CPR-study-right-tail).\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Null distribution of the point estimate for the difference in proportions, $\\hat{p}_T - \\hat{p}_C.$ The shaded right tail shows observations that are at least as large as the observed difference, 0.13.](14-foundations-errors_files/figure-html/CPR-study-right-tail-1.png){width=90%}\n:::\n:::\n\n\nThe right tail area is 0.135.\n(Note: it is only a coincidence that we also have $\\hat{p}_T - \\hat{p}_C = 0.13$.) However, contrary to how we calculated the p-value in previous studies, the p-value of this test is not actually the tail area we calculated, i.e., it's not 0.135!\n\nThe p-value is defined as the probability we observe a result at least as favorable to the alternative hypothesis as the result (i.e., the difference) we observe.\nIn this case, any differences less than or equal to -0.13 would also provide equally strong evidence favoring the alternative hypothesis as a difference of +0.13 did.\nA difference of -0.13 would correspond to a 13% higher survival rate in the control group than the treatment group.\nIn Figure \\@ref(fig:CPR-study-p-value) we have also shaded these differences in the left tail of the distribution.\nThese two shaded tails provide a visual representation of the p-value for a two-sided test.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Null distribution of the point estimate for the difference in proportions, $\\hat{p}_T - \\hat{p}_C.$ All values that are at least as extreme as +0.13 but in either direction away from 0 are shaded.](14-foundations-errors_files/figure-html/CPR-study-p-value-1.png){width=90%}\n:::\n:::\n\n\nFor a two-sided test, take the single tail (in this case, 0.131) and double it to get the p-value: 0.262.\nSince this p-value is larger than 0.05, we do not reject the null hypothesis.\nThat is, we do not find convincing evidence that the blood thinner has any influence on survival of patients who undergo CPR prior to arriving at the hospital.
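\nThe two-sided p-value just described can be recreated with a short simulation. The sketch below is not the code used to produce the figures above; it rebuilds the data from the counts in the CPR table, uses a small helper function of our own (`diff_in_props()`), and re-randomizes the group labels, so its tail area will differ slightly from the 0.131 quoted above.\n\n```r\n# Sketch of the randomization approach for the CPR study, using only the\n# table counts: treatment (14 survived, 26 died), control (11 survived, 39 died).\nset.seed(2001)\noutcome <- c(rep(\"survived\", 14), rep(\"died\", 26),\n             rep(\"survived\", 11), rep(\"died\", 39))\ngroup   <- c(rep(\"treatment\", 40), rep(\"control\", 50))\n\ndiff_in_props <- function(g) {\n  mean(outcome[g == \"treatment\"] == \"survived\") -\n    mean(outcome[g == \"control\"] == \"survived\")\n}\nobs_diff <- diff_in_props(group)                  # 0.35 - 0.22 = 0.13\n\n# Shuffle the group labels 1,000 times to build a null distribution.\nnull_diffs <- replicate(1000, diff_in_props(sample(group)))\n\none_tail <- mean(null_diffs >= obs_diff)          # single (right) tail area\nmin(2 * one_tail, 1)                              # two-sided p-value by doubling\n```\n\nWith this many shuffles the doubled value should land near the 0.262 reported above, though the exact number will vary from run to run.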
\n\n::: {.important data-latex=\"\"}\n**Default to a two-sided test.**\n\nWe want to be rigorous and keep an open mind when we analyze data and evidence.\nUse a one-sided hypothesis test only if you truly have interest in only one direction.\n:::\n\n::: {.important data-latex=\"\"}\n**Computing a p-value for a two-sided test.**\n\nFirst compute the p-value for one tail of the distribution, then double that value to get the two-sided p-value.\nThat's it!\n:::\n\n::: {.workedexample data-latex=\"\"}\nConsider the situation of the medical consultant.\nNow that you know about one-sided and two-sided tests, which type of test do you think is more appropriate?\n\n------------------------------------------------------------------------\n\nThe setting has been framed in the context of the consultant being helpful (which is what led us to a one-sided test originally), but what if the consultant actually performed *worse* than the average?\nWould we care?\nMore than ever!\nSince it turns out that we care about a finding in either direction, we should run a two-sided test.\nThe p-value for the two-sided test is double that of the one-sided test; here, the simulated p-value would be 0.2444.\n:::\n\nGenerally, to find a two-sided p-value we double the single tail area, which remains a reasonable approach even when the distribution is asymmetric.\nHowever, the approach can result in p-values larger than 1 when the point estimate is very near the mean in the null distribution; in such cases, we write that the p-value is 1.\nVery large p-values computed in this way (e.g., 0.85) may also be slightly inflated.\nTypically, we do not worry too much about the precision of very large p-values because they lead to the same analysis conclusion, even if the value is slightly off.\n\n\\clearpage\n\n## Controlling the Type 1 Error rate\n\nNow that we understand the difference between one-sided and two-sided tests, we must recognize when to use each type of test.\nBecause doing so inflates the Type 1 Error rate, it is never okay to change two-sided tests to one-sided tests after observing the data.\nWe explore the consequences of ignoring this advice in the next example.\n\n::: {.workedexample data-latex=\"\"}\nUsing $\\alpha=0.05,$ we show that freely switching from two-sided tests to one-sided tests will lead us to make twice as many Type 1 Errors as intended.\n\n------------------------------------------------------------------------\n\nSuppose we are interested in finding any difference from 0.\nWe've created a smooth-looking **null distribution** representing differences due to chance in Figure \\@ref(fig:type1ErrorDoublingExampleFigure).\n\nSuppose the sample difference was larger than 0.\nThen if we flip to a one-sided test, we would use $H_A:$ difference $> 0.$ Now if we obtain any observation in the upper 5% of the distribution, we would reject $H_0$ since the p-value would just be the single tail area.\nThus, if the null hypothesis is true, we incorrectly reject the null hypothesis about 5% of the time when the sample difference is above the null value, as shown in Figure \\@ref(fig:type1ErrorDoublingExampleFigure).\n\nSuppose the sample difference was smaller than 0.\nThen if we change to a one-sided test, we would use $H_A:$ difference $< 0.$ If the observed difference falls in the lower 5% of the figure, we would reject $H_0.$ That is, if the null hypothesis is true, then we would observe this situation about 5% of the time.\n\nBy examining these two scenarios, we can determine that we will make a Type 1 Error $5\\%+5\\%=10\\%$ of the time if we are allowed to swap to the \"best\" one-sided test for the data.\nThis is twice the error rate we prescribed with our significance level: $\\alpha=0.05$ (!).\n:::
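\nThe doubling of the error rate can also be checked numerically. The sketch below is not from the text; it simulates the Z statistic under a true null hypothesis and compares an honest, pre-specified two-sided test with the \"pick the side after seeing the data\" strategy from the worked example.\n\n```r\n# Sketch: simulate the test statistic under a true null hypothesis and\n# compare the two strategies at alpha = 0.05.\nset.seed(5)\nz <- rnorm(100000)                          # null distribution of the Z statistic\n\np_two_sided     <- 2 * (1 - pnorm(abs(z)))  # side chosen before seeing the data\np_side_switched <- 1 - pnorm(abs(z))        # always use the \"favorable\" tail\n\nmean(p_two_sided < 0.05)                    # close to 0.05, as intended\nmean(p_side_switched < 0.05)                # close to 0.10, about twice the intended rate\n```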
\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The shaded regions represent areas where we would reject $H_0$ under the bad practices considered here when $\\alpha = 0.05.$](14-foundations-errors_files/figure-html/type1ErrorDoublingExampleFigure-1.png){width=90%}\n:::\n:::\n\n\n::: caution\n**Hypothesis tests should be set up *before* seeing the data.**\n\nAfter observing data, it is tempting to turn a two-sided test into a one-sided test.\nAvoid this temptation.\nHypotheses should be set up *before* observing the data.\n:::\n\n\\index{hypothesis testing}\n\n\\clearpage\n\n## Power {#pow}\n\nAlthough we won't go into extensive detail here, power is an important topic for follow-up consideration after understanding the basics of hypothesis testing.\nA good power analysis is a vital preliminary step to any study as it will inform whether the data you collect are sufficient for being able to conclude your research broadly.\n\nOftentimes in experiment planning, there are two competing considerations:\n\n- We want to collect enough data that we can detect important effects.\n- Collecting data can be expensive, and, in experiments involving people, there may be some risk to patients.\n\nWhen planning a study, we want to know how likely we are to detect an effect we care about.\nIn other words, if there is a real effect, and that effect is large enough that it has practical value, then what is the probability that we detect that effect?\nThis probability is called the **power**\\index{power}, and we can compute it for different sample sizes or different effect sizes.\n\n::: {.important data-latex=\"\"}\n**Power.**\n\nThe power of the test is the probability of rejecting the null claim when the alternative claim is true.\n\nHow easy it is to detect the effect depends on both how big the effect is (e.g., how good the medical treatment is) as well as the sample size.\n:::\n\nWe think of power as the probability that you will become rich and famous from your science.\nIn order for your science to make a splash, you need to have good ideas!\nThat is, you won't become famous if you happen to make a single Type 1 Error which rejects the null hypothesis.\nInstead, you'll become famous if your science is very good and important (that is, if the alternative hypothesis is true).\nThe better your science is (i.e., the better the medical treatment), the larger the *effect size* and the easier it will be for you to convince people of your work.\n\nNot only does your science need to be solid, but you also need to have evidence (i.e., data) that shows the effect.\nA few observations (e.g., $n = 2$) are unlikely to be convincing because of well-known ideas of natural variability.\nIndeed, the larger the dataset which provides evidence for your scientific claim, the more likely you are to convince the community that your idea is correct.\n\n\n\n\n\n\\clearpage\n\n## Chapter review {#chp15-review}\n\n### Summary\n\nAlthough hypothesis testing provides a strong framework for making decisions based on data, as the analyst, you need to understand how and when the process can go wrong.\nThat is, always keep in mind that the conclusion to a hypothesis test may not be right!\nSometimes when the null hypothesis is true, we will accidentally reject it and commit a type 1 error; sometimes when the alternative hypothesis is true, we will fail to reject the null hypothesis and commit a type 2 error.\nThe power of the test quantifies how likely it is to obtain data which will reject the null hypothesis when indeed the alternative is true; the power 
of the test is increased when larger sample sizes are taken.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
confirmation bias power type 1 error
null distribution significance level type 2 error
one-sided hypothesis test two-sided hypothesis test
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp14-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-14].\n\n::: {.exercises data-latex=\"\"}\n1. **Testing for Fibromyalgia.**\nA patient named Diana was diagnosed with Fibromyalgia, a long-term syndrome of body pain, and was prescribed anti-depressants. Being the skeptic that she is, Diana didn't initially believe that anti-depressants would help her symptoms. However, after a couple months of being on the medication she decides that the anti-depressants are working, because she feels like her symptoms are in fact getting better.\n\n a. Write the hypotheses in words for Diana's skeptical position when she started taking the anti-depressants.\n\n b. What is a Type 1 Error in this context?\n\n c. What is a Type 2 Error in this context?\n\n1. **Which is higher?**\nIn each part below, there is a value of interest and two scenarios: (i) and (ii). For each part, report if the value of interest is larger under scenario (i), scenario (ii), or whether the value is equal under the scenarios.\n\n a. The standard error of $\\hat{p}$ when (i) $n = 125$ or (ii) $n = 500$.\n\n b. The margin of error of a confidence interval when the confidence level is (i) 90% or (ii) 80%.\n\n c. The p-value for a Z-statistic of 2.5 calculated based on a (i) sample with $n = 500$ or based on a (ii) sample with $n = 1000$.\n\n d. The probability of making a Type 2 Error when the alternative hypothesis is true and the significance level is (i) 0.05 or (ii) 0.10.\n\n1. **Testing for food safety.**\nA food safety inspector is called upon to investigate a restaurant with a few customer reports of poor sanitation practices. The food safety inspector uses a hypothesis testing framework to evaluate whether regulations are not being met. If he decides the restaurant is in gross violation, its license to serve food will be revoked.\n\n a. Write the hypotheses in words.\n\n b. What is a Type 1 Error in this context?\n\n c. What is a Type 2 Error in this context?\n\n d. Which error is more problematic for the restaurant owner? Why?\n\n e. Which error is more problematic for the diners? Why?\n\n f. As a diner, would you prefer that the food safety inspector requires strong evidence or very strong evidence of health concerns before revoking a restaurant's license? Explain your reasoning.\n\n1. **True or false.**\nDetermine if the following statements are true or false, and explain your reasoning. If false, state how it could be corrected.\n\n a. If a given value (for example, the null hypothesized value of a parameter) is within a 95% confidence interval, it will also be within a 99% confidence interval.\n\n b. Decreasing the significance level ($\\alpha$) will increase the probability of making a Type 1 Error.\n\n c. Suppose the null hypothesis is $p = 0.5$ and we fail to reject $H_0$. Under this scenario, the true population proportion is 0.5.\n\n d. With large sample sizes, even small differences between the null value and the observed point estimate, a difference often called the effect size, will be identified as statistically significant.\n \n \\clearpage\n\n1. **Online communication.**\nA study suggests that 60% of college student spend 10 or more hours per week communicating with others online. You believe that this is incorrect and decide to collect your own sample for a hypothesis test. You randomly sample 160 students from your dorm and find that 70% spent 10 or more hours a week communicating with others online. 
A friend of yours, who offers to help you with the hypothesis test, comes up with the following set of hypotheses. Indicate any errors you see. \n\n $$H_0: \\hat{p} < 0.6 \\quad \\quad H_A: \\hat{p} > 0.7$$\n\n1. **Same observation, different sample size.**\nSuppose you conduct a hypothesis test based on a sample where the sample size is $n = 50$, and arrive at a p-value of 0.08. You then refer back to your notes and discover that you made a careless mistake, the sample size should have been $n = 500$. Will your p-value increase, decrease, or stay the same? Explain.\n\n\n:::\n", + "supporting": [ + "14-foundations-errors_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/14-foundations-errors/figure-html/CPR-study-p-value-1.png b/_freeze/14-foundations-errors/figure-html/CPR-study-p-value-1.png new file mode 100644 index 00000000..c6838ff7 Binary files /dev/null and b/_freeze/14-foundations-errors/figure-html/CPR-study-p-value-1.png differ diff --git a/_freeze/14-foundations-errors/figure-html/CPR-study-right-tail-1.png b/_freeze/14-foundations-errors/figure-html/CPR-study-right-tail-1.png new file mode 100644 index 00000000..a8dfff87 Binary files /dev/null and b/_freeze/14-foundations-errors/figure-html/CPR-study-right-tail-1.png differ diff --git a/_freeze/14-foundations-errors/figure-html/type1ErrorDoublingExampleFigure-1.png b/_freeze/14-foundations-errors/figure-html/type1ErrorDoublingExampleFigure-1.png new file mode 100644 index 00000000..7f85d00c Binary files /dev/null and b/_freeze/14-foundations-errors/figure-html/type1ErrorDoublingExampleFigure-1.png differ diff --git a/_freeze/15-foundations-applications/execute-results/html.json b/_freeze/15-foundations-applications/execute-results/html.json new file mode 100644 index 00000000..f92703f8 --- /dev/null +++ b/_freeze/15-foundations-applications/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "ae32db499301f6e89b2ccb670b939dd5", + "result": { + "markdown": "# Applications: Foundations {#foundations-applications}\n\n\n\n\n\n## Recap: Foundations {#foundations-sec-summary}\n\nIn the Foundations of inference chapters, we have provided three different methods for statistical inference.\nWe will continue to build on all three of the methods throughout the text, and by the end, you should have an understanding of the similarities and differences between them.\nMeanwhile, it is important to note that the methods are designed to mimic variability with data, and we know that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure \\@ref(fig:randsampValloc)).\nIn Table \\@ref(tab:foundations-summary), we have summarized some of the ways the inferential procedures feature specific sources of variability.\nWe hope that you refer back to the table often as you dive more deeply into inferential ideas in future chapters.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary and comparison of randomization, bootstrapping, and mathematical models as inferential statistical methods.
Answer
Question Randomization Bootstrapping Mathematical models
What does it do? Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples
What is the random process described? Randomized experiment Random sampling from a population Randomized experiment or random sampling
What other random processes can be approximated? Can also be used to describe random sampling in an observational model Can also be used to describe random allocation in an experiment Randomized experiment or random sampling
What is it best for? Hypothesis testing (can also be used for confidence intervals, but not covered in this text). Confidence intervals (can also be used for bootstrap hypothesis testing for one proportion as well). Quick analyses through, for example, calculating a Z score.
What physical object represents the simulation process? Shuffling cards Pulling marbles from a bag with replacement Not applicable
\n\n`````\n:::\n:::\n\n\nYou might have noticed that the word *distribution* is used throughout this part (and will continue to be used in future chapters).\nA distribution always describes variability, but sometimes it is worth reflecting on *what* is varying.\nTypically the distribution either describes how the observations vary or how a statistic varies.\nBut even when describing how a statistic varies, there is a further consideration with respect to the study design, e.g., does the statistic vary from random sample to random sample or does it vary from random allocation to random allocation?\nThe methods presented in this text (and used in science generally) are typically used interchangeably across ideas of random samples or random allocations of the treatment.\nOften, the two different analysis methods will give equivalent conclusions.\nThe most important thing to consider is how to contextualize the conclusion in terms of the problem.\nSee Figure \\@ref(fig:randsampValloc) to confirm that your conclusions are appropriate.\n\nBelow, we synthesize the different types of distributions discussed throughout the text.\nReading through the different definitions and solidifying your understanding will help as you come across these distributions in future chapters and you can always return back here to refresh your understanding of the differences between the various distributions.\n\n::: {.important data-latex=\"\"}\n**Distributions.**\n\n- A **data distribution** describes the shape, center, and variability of the **observed data**.\n\n This can also be referred to as the **sample distribution** but we'll avoid that phrase as it sounds too much like sampling distribution, which is different.\n\n- A **population distribution** describes the shape, center, and variability of the entire **population of data**.\n\n Except in very rare circumstances of very small, very well-defined populations, this is never observed.\n\n- A **sampling distribution** describes the shape, center, and variability of all possible values of a **sample statistic** from samples of a given sample size from a given population.\n\n Since the population is never observed, it's never possible to observe the true sampling distribution either.\n However, when certain conditions hold, the Central Limit Theorem tells us what the sampling distribution is.\n\n- A **randomization distribution** describes the shape, center, and variability of all possible values of a **sample statistic** from random allocations of the treatment variable.\n\n We computationally generate the randomization distribution, though usually, it's not feasible to generate the full distribution of all possible values of the sample statistic, so we instead generate a large number of them.\n Almost always, by randomly allocating the treatment variable, the randomization distribution describes the null hypothesis, i.e., it is centered at the null hypothesized value of the parameter.\n\n- A **bootstrap distribution** describes the shape, center, and variability of all possible values of a **sample statistic** from resamples of the observed data.\n\n We computationally generate the bootstrap distribution, though usually, it's not feasible to generate all possible resamples of the observed data, so we instead generate a large number of them.\n Since bootstrap distributions are generated by randomly resampling from the observed data, they are centered at the sample statistic.\n Bootstrap distributions are most often used for estimation, i.e., we base confidence 
intervals off of them.\n:::\n\n\\clearpage\n\n## Case study: Malaria vaccine {#case-study-malaria-vaccine}\n\nIn this case study, we consider a new malaria vaccine called PfSPZ.\nIn the malaria study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine.\nNineteen weeks later, all 20 patients were exposed to a drug-sensitive strain of the malaria parasite; the motivation of using a drug-sensitive strain here is for ethical considerations, allowing any infections to be treated effectively.\n\n::: {.data data-latex=\"\"}\nThe [`malaria`](http://openintrostat.github.io/openintro/reference/malaria.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe results are summarized in Table \\@ref(tab:malaria-vaccine-20-ex-summary), where 9 of the 14 treatment patients remained free of signs of infection while all of the 6 patients in the control group showed some baseline signs of infection.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary results for the malaria vaccine experiment.
treatment infection no infection Total
placebo 6 0 6
vaccine 5 9 14
Total 11 9 20
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIs this an observational study or an experiment?\nWhat implications does the study type have on what can be inferred from the results?[^15-foundations-applications-1]\n:::\n\n[^15-foundations-applications-1]: The study is an experiment, as patients were randomly assigned an experiment group.\n Since this is an experiment, the results can be used to evaluate a causal relationship between the malaria vaccine and whether patients showed signs of an infection.\n\n\\vspace{-5mm}\n\n### Variability within data\n\nIn this study, a smaller proportion of patients who received the vaccine showed signs of an infection (35.7% versus 100%).\nHowever, the sample is very small, and it is unclear whether the difference provides *convincing evidence* that the vaccine is effective.\n\n::: {.workedexample data-latex=\"\"}\nStatisticians and data scientists are sometimes called upon to evaluate the strength of evidence.\nWhen looking at the rates of infection for patients in the two groups in this study, what comes to mind as we try to determine whether the data show convincing evidence of a real difference?\n\n------------------------------------------------------------------------\n\nThe observed infection rates (35.7% for the treatment group versus 100% for the control group) suggest the vaccine may be effective.\nHowever, we cannot be sure if the observed difference represents the vaccine's efficacy or if there is no treatment effect and the observed difference is just from random chance.\nGenerally there is a little bit of fluctuation in sample data, and we wouldn't expect the sample proportions to be *exactly* equal, even if the truth was that the infection rates were independent of getting the vaccine.\nAdditionally, with such small samples, perhaps it's common to observe such large differences when we randomly split a group due to chance alone!\n:::\n\nThis example is a reminder that the observed outcomes in the data sample may not perfectly reflect the true relationships between variables since there is **random noise**.\nWhile the observed difference in rates of infection is large, the sample size for the study is small, making it unclear if this observed difference represents efficacy of the vaccine or whether it is simply due to chance.\nWe label these two competing claims, $H_0$ and $H_A$:\n\n- $H_0$: **Independence model.** The variables are independent.\n They have no relationship, and the observed difference between the proportion of patients who developed an infection in the two groups, 64.3%, was due to chance.\n\n- $H_A$: **Alternative model.** The variables are *not* independent.\n The difference in infection rates of 64.3% was not due to chance.\n Here (because an experiment was done), if the difference in infection rate is not due to chance, it was the vaccine that affected the rate of infection.\n\nWhat would it mean if the independence model, which says the vaccine had no influence on the rate of infection, is true?\nIt would mean 11 patients were going to develop an infection *no matter which group they were randomized into*, and 9 patients would not develop an infection *no matter which group they were randomized into*.\nThat is, if the vaccine did not affect the rate of infection, the difference in the infection rates was due to chance alone in how the patients were randomized.\n\nNow consider the alternative model: infection rates were influenced by whether a patient received the vaccine or not.\nIf this was true, and 
especially if this influence was substantial, we would expect to see some difference in the infection rates of patients in the groups.\n\nWe choose between these two competing claims by assessing if the data conflict so much with $H_0$ that the independence model cannot be deemed reasonable.\nIf this is the case, and the data support $H_A,$ then we will reject the notion of independence and conclude the vaccine was effective.\n\n### Simulating the study\n\nWe're going to implement **simulation** under the setting where we will pretend we know that the malaria vaccine being tested does *not* work.\nUltimately, we want to understand if the large difference we observed in the data is common in these simulations that represent independence.\nIf it is common, then maybe the difference we observed was purely due to chance.\nIf it is very uncommon, then the possibility that the vaccine was helpful seems more plausible.\n\nTable \\@ref(tab:malaria-vaccine-20-ex-summary) shows that 11 patients developed infections and 9 did not.\nFor our simulation, we will suppose the infections were independent of the vaccine and we were able to *rewind* back to when the researchers randomized the patients in the study.\nIf we happened to randomize the patients differently, we may get a different result in this hypothetical world where the vaccine does not influence the infection.\nLet's complete another **randomization** using a simulation.\n\nIn this **simulation**, we take 20 notecards to represent the 20 patients, where we write down \"infection\" on 11 cards and \"no infection\" on 9 cards.\nIn this hypothetical world, we believe each patient that got an infection was going to get it regardless of which group they were in, so let's see what happens if we randomly assign the patients to the treatment and control groups again.\nWe thoroughly shuffle the notecards and deal 14 into a pile and 6 into a pile.\nFinally, we tabulate the results, which are shown in Table \\@ref(tab:malaria-vaccine-20-ex-summary-rand-1).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Simulation results, where any difference in infection rates is purely due to chance.

                    treatment
                placebo   vaccine   Total
infection          4          7       11
no infection       2          7        9
Total              6         14       20
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nHow does this compare to the observed 64.3% difference in the actual data?[^15-foundations-applications-2]\n:::\n\n[^15-foundations-applications-2]: $4 / 6 - 7 / 14 = 0.167$ or about 16.7% in favor of the vaccine.\n This difference due to chance is much smaller than the difference observed in the actual groups.\n\n### Independence between treatment and outcome\n\nWe computed one possible difference under the independence model in the previous Guided Practice, which represents one difference due to chance, assuming there is no vaccine effect.\nWhile in this first simulation, we physically dealt out notecards to represent the patients, it is more efficient to perform the simulation using a computer.\n\nRepeating the simulation on a computer, we get another difference due to chance: $$ \\frac{2}{6{}} - \\frac{9}{14} = -0.310 $$\n\nAnd another: $$ \\frac{3}{6{}} - \\frac{8}{14} = -0.071$$\n\nAnd so on until we repeat the simulation enough times to create a *distribution of differences that could have occurred if the null hypothesis was true*.\n\nFigure \\@ref(fig:malaria-rand-dot-plot) shows a stacked plot of the differences found from 100 simulations, where each dot represents a simulated difference between the infection rates (control rate minus treatment rate).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:malaria-rand-dot-plot-cap)](15-foundations-applications_files/figure-html/malaria-rand-dot-plot-1.png){width=90%}\n:::\n:::\n\n\n(ref:malaria-rand-dot-plot-cap) A stacked dot plot of differences from 100 simulations produced under the independence mode, $H_0,$ where in these simulations infections are unaffected by the vaccine. Two of the 100 simulations had a difference of at least 64.3%, the difference observed in the study.\n\nNote that the distribution of these simulated differences is centered around 0.\nWe simulated these differences assuming that the independence model was true, and under this condition, we expect the difference to be near zero with some random fluctuation, where *near* is pretty generous in this case since the sample sizes are so small in this study.\n\n::: {.workedexample data-latex=\"\"}\nHow often would you observe a difference of at least 64.3% (0.643) according to Figure \\@ref(fig:malaria-rand-dot-plot)?\nOften, sometimes, rarely, or never?\n\n------------------------------------------------------------------------\n\nIt appears that a difference of at least 64.3% due to chance alone would only happen about 2% of the time according to Figure \\@ref(fig:malaria-rand-dot-plot).\nSuch a low probability indicates a rare event.\n:::\n\nThe difference of 64.3% being a rare event suggests two possible interpretations of the results of the study:\n\n- $H_0$: **Independence model.** The vaccine has no effect on infection rate, and we just happened to observe a difference that would only occur on a rare occasion.\n\n- $H_A$: **Alternative model.** The vaccine has an effect on infection rate, and the difference we observed was actually due to the vaccine being effective at combating malaria, which explains the large difference of 64.3%.\n\nBased on the simulations, we have two options.\n(1) We conclude that the study results do not provide strong evidence against the independence model.\nThat is, we do not have sufficiently strong evidence to conclude the vaccine had an effect in this clinical setting.\n(2) We conclude the evidence is sufficiently strong to reject $H_0$ and assert that the vaccine 
was useful.\nWhen we conduct formal studies, usually we reject the notion that we just happened to observe a rare event.\nSo in the vaccine case, we reject the independence model in favor of the alternative.\nThat is, we are concluding the data provide strong evidence that the vaccine provides some protection against malaria in this clinical setting.\n\nOne field of statistics, statistical inference, is built on evaluating whether such differences are due to chance.\nIn statistical inference, data scientists evaluate which model is most reasonable given the data.\nErrors do occur, just like rare events, and we might choose the wrong model.\nWhile we do not always choose correctly, statistical inference gives us tools to control and evaluate how often decision errors occur.\n\n\\clearpage\n\n## Interactive R tutorials {#foundations-tutorials}\n\nNavigate the concepts you've learned in this chapter in R using the following self-paced tutorials.\nAll you need is your browser to get started!\n\n::: {.alltutorials data-latex=\"\"}\n[Tutorial 4: Foundations of inference](https://openintrostat.github.io/ims-tutorials/04-foundations/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintrostat.github.io/ims-tutorials/04-foundations\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 4 - Lesson 1: Sampling variability](https://openintro.shinyapps.io/ims-04-foundations-01/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-04-foundations-01\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 4 - Lesson 2: Randomization test](https://openintro.shinyapps.io/ims-04-foundations-02/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-04-foundations-02\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 4 - Lesson 3: Errors in hypothesis testing](https://openintro.shinyapps.io/ims-04-foundations-03/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-04-foundations-03\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 4 - Lesson 4: Parameters and confidence intervals](https://openintro.shinyapps.io/ims-04-foundations-04/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-04-foundations-04\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of tutorials supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).\n:::\n\n## R labs {#foundations-labs}\n\nFurther apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.\n\n::: {.singlelab data-latex=\"\"}\n[Sampling distributions - Does science benefit you?](https://www.openintro.org/go?id=ims-r-lab-foundations-1)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://www.openintro.org/go?id=ims-r-lab-foundations-1\n:::\n\n:::\n\n::: {.singlelab data-latex=\"\"}\n[Confidence intervals - Climate change](https://www.openintro.org/go?id=ims-r-lab-foundations-2)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://www.openintro.org/go?id=ims-r-lab-foundations-2\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of labs supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of labs supporting this book 
[here](https://www.openintro.org/go?id=ims-r-labs).\n:::\n", + "supporting": [ + "15-foundations-applications_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/15-foundations-applications/figure-html/malaria-rand-dot-plot-1.png b/_freeze/15-foundations-applications/figure-html/malaria-rand-dot-plot-1.png new file mode 100644 index 00000000..2c203720 Binary files /dev/null and b/_freeze/15-foundations-applications/figure-html/malaria-rand-dot-plot-1.png differ diff --git a/_freeze/16-inference-one-prop/execute-results/html.json b/_freeze/16-inference-one-prop/execute-results/html.json new file mode 100644 index 00000000..d9eb5a73 --- /dev/null +++ b/_freeze/16-inference-one-prop/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "27448690d58df7f1a82bae55a533b049", + "result": { + "markdown": "\n\n\n# Inference for a single proportion {#inference-one-prop}\n\n::: {.chapterintro data-latex=\"\"}\nFocusing now on statistical inference for categorical data, we will revisit many of the foundational aspects of hypothesis testing from Chapter \\@ref(foundations-randomization).\n\nThe three data structures we detail are one binary variable, summarized using a single proportion; two binary variables, summarized using a difference of two proportions; and two categorical variables, summarized using a two-way table.\nWhen appropriate, each of the data structures will be analyzed using the three methods from Chapters \\@ref(foundations-randomization), \\@ref(foundations-bootstrapping), and \\@ref(foundations-mathematical): randomization test, bootstrapping, and mathematical models, respectively.\n\nAs we build on the inferential ideas, we will visit new foundational concepts in statistical inference.\nFor example, we will cover the conditions for when a normal model is appropriate; the two different error rates in hypothesis testing; and choosing the confidence level for a confidence interval.\n:::\n\nWe encountered inference methods for a single proportion in Chapter \\@ref(foundations-bootstrapping), exploring point estimates and confidence intervals.\nIn this section, we'll do a review of these topics and how to choose an appropriate sample size when collecting data for single proportion contexts.\n\nNote that there is only one variable being measured in a study which focuses on one proportion.\nFor each observational unit, the single variable is measured as either a success or failure (e.g., \"surgical complication\" vs. 
\"no surgical complication\").\nBecause the nature of the research question at hand focuses on only a single variable, there is not a way to randomize the variable across a different (explanatory) variable.\nFor this reason, we will not use randomization as an analysis tool when focusing on a single proportion.\nInstead, we will apply bootstrapping techniques to test a given hypothesis, and we will also revisit the associated mathematical models.\n\n\\vspace{-4mm}\n\n## Bootstrap test for a proportion {#one-prop-null-boot}\n\nThe bootstrap simulation concept when $H_0$ is true is similar to the ideas used in the case studies presented in Chapter \\@ref(foundations-bootstrapping) where we bootstrapped without an assumption about $H_0.$ Because we will be testing a hypothesized value of $p$ (referred to as $p_0),$ the bootstrap simulation for hypothesis testing has a fantastic advantage that it can be used for any sample size (a huge benefit for small samples, a nice alternative for large samples).\n\nWe expand on the medical consultant example, see Section \\@ref(case-study-med-consult), where instead of finding an interval estimate for the true complication rate, we work to test a specific research claim.\n\n\\clearpage\n\n### Observed data\n\nRecall the set-up for the example:\n\nPeople providing an organ for donation sometimes seek the help of a special \"medical consultant\".\nThese consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery.\nPatients might choose a consultant based in part on the historical complication rate of the consultant's clients.\nOne consultant tried to attract patients by noting the average complication rate for liver donor surgeries in the US is about 10%, but her clients have only had 3 complications in the 62 liver donor surgeries she has facilitated.\nShe claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).\n\n::: {.workedexample data-latex=\"\"}\nUsing the data, is it possible to assess the consultant's claim that her complication rate is less than 10%?\n\n------------------------------------------------------------------------\n\nNo.\nThe claim is that there is a causal connection, but the data are observational.\nPatients who hire this medical consultant may have lower complication rates for other reasons.\n\nWhile it is not possible to assess this causal claim, it is still possible to test for an association using these data.\nFor this question we ask, could the low complication rate of $\\hat{p} = 0.048$ have simply occurred by chance, if her complication rate does not differ from the US standard rate?\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nWrite out hypotheses in both plain and statistical language to test for the association between the consultant's work and the true complication rate, $p,$ for the consultant's clients.[^16-inference-one-prop-1]\n:::\n\n[^16-inference-one-prop-1]: $H_0:$ There is no association between the consultant's contributions and the clients' complication rate.\n In statistical language, $p = 0.10.$ $H_A:$ Patients who work with the consultant tend to have a complication rate lower than 10%, i.e., $p < 0.10.$\n\nBecause, as it turns out, the conditions of working with the normal distribution are not met (see Section \\@ref(one-prop-norm)), the uncertainty associated with the sample proportion should not be modeled using the normal 
distribution, as doing so would underestimate the uncertainty associated with the sample statistic.\nHowever, we would still like to assess the hypotheses from the previous Guided Practice in absence of the normal framework.\nTo do so, we need to evaluate the possibility of a sample value $(\\hat{p})$ as far below the null value, $p_0 = 0.10$ as what was observed.\nThe deviation of the sample value from the hypothesized parameter is usually quantified with a p-value.\n\nThe p-value is computed based on the null distribution, which is the distribution of the test statistic if the null hypothesis is true.\nSupposing the null hypothesis is true, we can compute the p-value by identifying the probability of observing a test statistic that favors the alternative hypothesis at least as strongly as the observed test statistic.\nHere we will use a bootstrap simulation to calculate the p-value.\n\n\\clearpage\n\n### Variability of the statistic\n\nWe want to identify the sampling distribution of the test statistic $(\\hat{p})$ if the null hypothesis was true.\nIn other words, we want to see the variability we can expect from sample proportions if the null hypothesis was true.\nThen we plan to use this information to decide whether there is enough evidence to reject the null hypothesis.\n\nUnder the null hypothesis, 10% of liver donors have complications during or after surgery.\nSuppose this rate was really no different for the consultant's clients (for *all* the consultant's clients, not just the 62 previously measured).\nIf this was the case, we could *simulate* 62 clients to get a sample proportion for the complication rate from the null distribution.\nSimulating observations using a hypothesized null parameter value is often called a **parametric bootstrap simulation**\\index{parametric bootstrap}.\n\n\n\n\n\nSimilar to the process described in Chapter \\@ref(foundations-bootstrapping), each client can be simulated using a bag of marbles with 10% red marbles and 90% white marbles.\nSampling a marble from the bag (with 10% red marbles) is one way of simulating whether a patient has a complication *if the true complication rate is 10%*.\nIf we select 62 marbles and then compute the proportion of patients with complications in the simulation, $\\hat{p}_{sim1},$ then the resulting sample proportion is a sample from the null distribution.\n\nThere were 5 simulated cases with a complication and 57 simulated cases without a complication, i.e., $\\hat{p}_{sim1} = 5/62 = 0.081.$\n\n::: {.workedexample data-latex=\"\"}\nIs this one simulation enough to determine whether we should reject the null hypothesis?\n\n------------------------------------------------------------------------\n\nNo.\nTo assess the hypotheses, we need to see a distribution of many values of $\\hat{p}_{sim},$ not just a *single* draw from this sampling distribution.\n:::\n\n### Observed statistic vs. 
null statistics\n\nOne simulation isn't enough to get a sense of the null distribution; many simulation studies are needed.\nRoughly 10,000 seems sufficient.\nHowever, paying someone to simulate 10,000 studies by hand is a waste of time and money.\nInstead, simulations are typically programmed into a computer, which is much more efficient.\n\n\n\n\n\nFigure \\@ref(fig:nullDistForPHatIfLiverTransplantConsultantIsNotHelpful) shows the results of 10,000 simulated studies.\nThe proportions that are equal to or less than $\\hat{p} = 0.048$ are shaded.\nThe shaded areas represent sample proportions under the null distribution that provide at least as much evidence as $\\hat{p}$ favoring the alternative hypothesis.\nThere were 420 simulated sample proportions with $\\hat{p}_{sim} \\leq 0.048.$ We use these to construct the null distribution's left-tail area and find the p-value:\n\n$$\\text{left tail area} = \\frac{\\text{Number of observed simulations with }\\hat{p}_{sim} \\leq \\text{ 0.048}}{10000}$$\n\nOf the 10,000 simulated $\\hat{p}_{sim},$ 420 were equal to or smaller than $\\hat{p}.$ Since the hypothesis test is one-sided, the estimated p-value is equal to this tail area: 0.042.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-cap)](16-inference-one-prop_files/figure-html/nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-1.png){width=90%}\n:::\n:::\n\n\n(ref:nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-cap) The null distribution for $\\hat{p},$ created from 10,000 simulated studies. The left tail, representing the p-value for the hypothesis test, contains 4.2% of the simulations.\n\n::: {.guidedpractice data-latex=\"\"}\nBecause the estimated p-value is 0.042, which is smaller than the significance level 0.05, we reject the null hypothesis.\nExplain what this means in plain language in the context of the problem.[^16-inference-one-prop-2]\n:::\n\n[^16-inference-one-prop-2]: There is sufficiently strong evidence to reject the null hypothesis in favor of the alternative hypothesis.\n We would conclude that there is evidence that the consultant's surgery complication rate is lower than the US standard rate of 10%.\n\n::: {.guidedpractice data-latex=\"\"}\nDoes the conclusion in the previous Guided Practice imply the consultant is good at their job?\nExplain.[^16-inference-one-prop-3]\n:::\n\n[^16-inference-one-prop-3]: No.\n Not necessarily.\n The evidence supports the alternative hypothesis that the consultant's complication rate is lower, but it's not a measurement of their performance.\n\n::: {.important data-latex=\"\"}\n**Null distribution of** $\\hat{p}$ **with bootstrap simulation.**\n\nRegardless of the statistical method chosen, the p-value is always derived by analyzing the null distribution of the test statistic.\nThe normal model poorly approximates the null distribution for $\\hat{p}$ when the success-failure condition is not satisfied.\nAs a substitute, we can generate the null distribution using simulated sample proportions and use this distribution to compute the tail area, i.e., the p-value.\n:::\n\nIn the previous Guided Practice, the p-value is *estimated*.\nIt is not exact because the simulated null distribution itself is only a close approximation of the sampling distribution of the sample statistic.\nAn exact p-value can be generated using the binomial distribution, but that method will not be covered in this text.\n\n\\clearpage\n\n## Mathematical model for a proportion {#one-prop-norm}\n\n### 
Conditions\n\nIn Section \\@ref(normalDist), we introduced the normal distribution and showed how it can be used as a mathematical model to describe the variability of a statistic.\nThere are conditions under which a sample proportion $\\hat{p}$ is well modeled using a normal distribution.\nWhen the sample observations are independent and the sample size is sufficiently large, the normal model will describe the sampling distribution of the sample proportion quite well; when the observations violate the conditions, the normal model can be inaccurate.\nParticularly, it can underestimate the variability of the sample proportion.\n\n::: {.important data-latex=\"\"}\n**Sampling distribution of** $\\hat{p}.$\n\nThe sampling distribution for $\\hat{p}$ based on a sample of size $n$ from a population with a true proportion $p$ is nearly normal when:\n\n1. The sample's observations are independent, e.g., are from a simple random sample.\n2. We expected to see at least 10 successes and 10 failures in the sample, i.e., $np\\geq10$ and $n(1-p)\\geq10.$ This is called the **success-failure condition**.\n\nWhen these conditions are met, then the sampling distribution of $\\hat{p}$ is nearly normal with mean $p$ and standard error of $\\hat{p}$ as $SE = \\sqrt{\\frac{\\ \\hat{p}(1-\\hat{p})\\ }{n}}.$\n:::\n\nRecall that the margin of error is defined by the standard error.\nThe margin of error for $\\hat{p}$ can be directly obtained from $SE(\\hat{p}).$\n\n::: {.important data-latex=\"\"}\n**Margin of error for** $\\hat{p}.$\n\nThe margin of error is $z^\\star \\times \\sqrt{\\frac{\\ \\hat{p}(1-\\hat{p})\\ }{n}}$ where $z^\\star$ is calculated from a specified percentile on the normal distribution.\n:::\n\n\\index{success-failure condition} \\index{standard error (SE)!single proportion}\n\n\n\n\n\nTypically we do not know the true proportion $p,$ so we substitute some value to check conditions and estimate the standard error.\nFor confidence intervals, the sample proportion $\\hat{p}$ is used to check the success-failure condition and compute the standard error.\nFor hypothesis tests, typically the null value -- that is, the proportion claimed in the null hypothesis -- is used in place of $p.$\n\nThe independence condition is a more nuanced requirement.\nWhen it isn't met, it is important to understand how and why it is violated.\nFor example, there exist no statistical methods available to truly correct the inherent biases of data from a convenience sample.\nOn the other hand, if we took a cluster sample (see Section \\@ref(samp-methods)), the observations wouldn't be independent, but suitable statistical methods are available for analyzing the data (but they are beyond the scope of even most second or third courses in statistics).\n\n::: {.workedexample data-latex=\"\"}\nIn the examples based on large sample theory, we modeled $\\hat{p}$ using the normal distribution.\nWhy is this not appropriate for the case study on the medical consultant?\n\n------------------------------------------------------------------------\n\nThe independence assumption may be reasonable if each of the surgeries is from a different surgical team.\nHowever, the success-failure condition is not satisfied.\nUnder the null hypothesis, we would anticipate seeing $62 \\times 0.10 = 6.2$ complications, not the 10 required for the normal approximation.\n:::\n\nWhile this book is scoped to well-constrained statistical problems, do remember that this is just the first book in what is a large library of statistical methods that are 
suitable for a very wide range of data and contexts.\n\n### Confidence interval for a proportion\n\n\\index{point estimate!single proportion}\n\nA confidence interval provides a range of plausible values for the parameter $p,$ and when $\\hat{p}$ can be modeled using a normal distribution, the confidence interval for $p$ takes the form $\\hat{p} \\pm z^{\\star} \\times SE.$ We have seen $\\hat{p}$ to be the sample proportion.\nThe value $z^{\\star}$ determines the confidence level (previously set to be 1.96) and will be discussed in detail in the examples following.\nThe value of the standard error, $SE,$ depends heavily on the sample size.\n\n::: {.important data-latex=\"\"}\n**Standard error of one proportion,** $\\hat{p}.$\n\nWhen the conditions are met so that the distribution of $\\hat{p}$ is nearly normal, the **variability** of a single proportion, $\\hat{p}$ is well described by:\n\n$$SE(\\hat{p}) = \\sqrt{\\frac{p(1-p)}{n}}$$\n\nNote that we almost never know the true value of $p.$ A more helpful formula to use is:\n\n$$SE(\\hat{p}) \\approx \\sqrt{\\frac{(\\mbox{best guess of }p)(1 - \\mbox{best guess of }p)}{n}}$$\n\nFor hypothesis testing, we often use $p_0$ as the best guess of $p.$ For confidence intervals, we typically use $\\hat{p}$ as the best guess of $p.$\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nConsider taking many polls of registered voters (i.e., random samples) of size 300 asking them if they support legalized marijuana.\nIt is suspected that about 2/3 of all voters support legalized marijuana.\nTo understand how the sample proportion $(\\hat{p})$ would vary across the samples, calculate the standard error of $\\hat{p}.$[^16-inference-one-prop-4]\n:::\n\n[^16-inference-one-prop-4]: Because the $p$ is unknown but expected to be around 2/3, we will use 2/3 in place of $p$ in the formula for the standard error.\\\n $SE = \\sqrt{\\frac{p(1-p)}{n}} \\approx \\sqrt{\\frac{2/3 (1 - 2/3)} {300}} = 0.027.$\n\n\\clearpage\n\n### Variability of the sample proportion\n\n::: {.workedexample data-latex=\"\"}\nA simple random sample of 826 payday loan borrowers was surveyed to better understand their interests around regulation and costs.\n70% of the responses supported new regulations on payday lenders.\n\n1. Is it reasonable to model the variability of $\\hat{p}$ from sample to sample using a normal distribution?\n\n2. Estimate the standard error of $\\hat{p}.$\n\n3. Construct a 95% confidence interval for $p,$ the proportion of payday borrowers who support increased regulation for payday lenders.\n\n------------------------------------------------------------------------\n\n1. The data are a random sample, so it is reasonable to assume that the observations are independent and representative of the population of interest.\n\nWe also must check the success-failure condition, which we do using $\\hat{p}$ in place of $p$ when computing a confidence interval:\n\n$$\n\\begin{aligned}\n \\text{Support: }\n n p &\n \\approx 826 \\times 0.70\n = 578\\\\\n \\text{Not: }\n n (1 - p) &\n \\approx 826 \\times (1 - 0.70)\n = 248\n\\end{aligned}\n$$\n\nSince both values are at least 10, we can use the normal distribution to model $\\hat{p}.$\n\n2. Because $p$ is unknown and the standard error is for a confidence interval, use $\\hat{p}$ in place of $p$ in the formula.\n\n$$SE = \\sqrt{\\frac{p(1-p)}{n}} \\approx \\sqrt{\\frac{0.70 (1 - 0.70)} {826}} = 0.016.$$\n\n3. 
Using the point estimate 0.70, $z^{\\star} = 1.96$ for a 95% confidence interval, and the standard error $SE = 0.016$ from the previous Guided Practice, the confidence interval is\n\n$$ \n\\begin{aligned}\n\\text{point estimate} \\ &\\pm \\ z^{\\star} \\times \\ SE \\\\\n0.70 \\ &\\pm \\ 1.96 \\ \\times \\ 0.016 \\\\ \n(0.669 \\ &, \\ 0.731)\n\\end{aligned}\n$$\n\nWe are 95% confident that the true proportion of payday borrowers who supported regulation at the time of the poll was between 0.669 and 0.731.\n:::\n\n::: {.important data-latex=\"\"}\n**Constructing a confidence interval for a single proportion.**\n\nThere are three steps to constructing a confidence interval for $p.$\n\n1. Check if it seems reasonable to assume the observations are independent and check the success-failure condition using $\\hat{p}.$ If the conditions are met, the sampling distribution of $\\hat{p}$ may be well-approximated by the normal model.\n2. Construct the standard error using $\\hat{p}$ in place of $p$ in the standard error formula.\n3. Apply the general confidence interval formula.\n:::\n\nFor additional one-proportion confidence interval examples, see Section \\@ref(ConfidenceIntervals).\n\n### Changing the confidence level\n\n\\index{confidence level}\n\nSuppose we want to consider confidence intervals where the confidence level is somewhat higher than 95%: perhaps we would like a confidence level of 99%.\nThink back to the analogy about trying to catch a fish: if we want to be more sure that we will catch the fish, we should use a wider net.\nTo create a 99% confidence level, we must also widen our 95% interval.\nOn the other hand, if we want an interval with lower confidence, such as 90%, we could make our original 95% interval slightly slimmer.\n\nThe 95% confidence interval structure provides guidance in how to make intervals with new confidence levels.\nBelow is a general 95% confidence interval for a point estimate that comes from a nearly normal distribution:\n\n$$\\text{point estimate} \\ \\pm \\ 1.96 \\ \\times \\ SE$$\n\nThere are three components to this interval: the point estimate, \"1.96\", and the standard error.\nThe choice of $1.96 \\times SE$ was based on capturing 95% of the data since the estimate is within 1.96 standard errors of the true value about 95% of the time.\nThe choice of 1.96 corresponds to a 95% confidence level.\n\n::: {.guidedpractice data-latex=\"\"}\nIf $X$ is a normally distributed random variable, how often will $X$ be within 2.58 standard deviations of the mean?[^16-inference-one-prop-5]\n:::\n\n[^16-inference-one-prop-5]: This is equivalent to asking how often the $Z$ score will be larger than -2.58 but less than 2.58.\n (For a picture, see Figure \\@ref(fig:choosingZForCI).) To determine this probability, look up -2.58 and 2.58 in the normal probability table (0.0049 and 0.9951).\n Thus, there is a $0.9951-0.0049 \\approx 0.99$ probability that the unobserved random variable $X$ will be within 2.58 standard deviations of the mean.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:choosingZForCI-cap)](16-inference-one-prop_files/figure-html/choosingZForCI-1.png){width=90%}\n:::\n:::\n\n\n(ref:choosingZForCI-cap) The area between -$z^{\\star}$ and $z^{\\star}$ increases as $|z^{\\star}|$ becomes larger. 
If the confidence level is 99%, we choose $z^{\\star}$ such that 99% of the normal curve is between -$z^{\\star}$ and $z^{\\star},$ which corresponds to 0.5% in the lower tail and 0.5% in the upper tail: $z^{\\star}=2.58.$\n\n\\index{confidence interval}\n\nTo create a 99% confidence interval, change 1.96 in the 95% confidence interval formula to be $2.58.$ The previous Guided Practice highlights that 99% of the time a normal random variable will be within 2.58 standard deviations of its mean.\nThis approach -- using the Z scores in the normal model to compute confidence levels -- is appropriate when the point estimate is associated with a normal distribution and we can properly compute the standard error.\nThus, the formula for a 99% confidence interval is:\n\n$$\\text{point estimate} \\ \\pm \\ 2.58 \\ \\times \\ SE$$\n\nThe normal approximation is crucial to the precision of the $z^\\star$ confidence intervals (in contrast to the bootstrap percentile confidence intervals).\nWhen the normal model is not a good fit, we will use alternative distributions that better characterize the sampling distribution or we will use bootstrapping procedures.\n\n::: {.guidedpractice data-latex=\"\"}\nCreate a 99% confidence interval for the impact of the stent on the risk of stroke using the data from Section \\@ref(case-study-stents-strokes).\nThe point estimate is 0.090, and the standard error is $SE = 0.028.$ It has been verified for you that the point estimate can reasonably be modeled by a normal distribution.[^16-inference-one-prop-6]\n:::\n\n[^16-inference-one-prop-6]: Since the necessary conditions for applying the normal model have already been checked for us, we can go straight to the construction of the confidence interval: $\\text{point estimate} \\pm 2.58 \\times SE$ Which gives an interval of (0.018, 0.162).\\$ We are 99% confident that implanting a stent in the brain of a patient who is at risk of stroke increases the risk of stroke within 30 days by a rate of 0.018 to 0.162 (assuming the patients are representative of the population).\n\n::: {.important data-latex=\"\"}\n**Mathematical model confidence interval for any confidence level.**\n\nIf the point estimate follows the normal model with standard error $SE,$ then a confidence interval for the population parameter is\n\n$$\\text{point estimate} \\ \\pm \\ z^{\\star} \\ \\times \\ SE$$\n\nwhere $z^{\\star}$ corresponds to the confidence level selected.\n:::\n\nFigure \\@ref(fig:choosingZForCI) provides a picture of how to identify $z^{\\star}$ based on a confidence level.\nWe select $z^{\\star}$ so that the area between -$z^{\\star}$ and $z^{\\star}$ in the normal model corresponds to the confidence level.\n\n::: {.guidedpractice data-latex=\"\"}\nPreviously, we found that implanting a stent in the brain of a patient at risk for a stroke *increased* the risk of a stroke.\nThe study estimated a 9% increase in the number of patients who had a stroke, and the standard error of this estimate was about $SE = 2.8%.$ Compute a 90% confidence interval for the effect.[^16-inference-one-prop-7]\n:::\n\n[^16-inference-one-prop-7]: We must find $z^{\\star}$ such that 90% of the distribution falls between -$z^{\\star}$ and $z^{\\star}$ in the standard normal model, $N(\\mu=0, \\sigma=1).$ We can look up -$z^{\\star}$ in the normal probability table by looking for a lower tail of 5% (the other 5% is in the upper tail), thus $z^{\\star} = 1.65.$ The 90% confidence interval can then be computed as $\\text{point estimate} \\pm 1.65 \\times SE \\to 
(4.4\\%, 13.6\\%).$ (Note: the conditions for normality had earlier been confirmed for us.) That is, we are 90% confident that implanting a stent in a stroke patient's brain increased the risk of stroke within 30 days by 4.4% to 13.6%.\\\n Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such as 95% or 99%).\n A lower degree of confidence increases potential for error, but it also produces a more narrow interval.\n\n### Hypothesis test for a proportion\n\n::: {.important data-latex=\"\"}\n**The test statistic for assessing a single proportion is a Z.**\n\nThe **Z score** is a ratio of how the sample proportion differs from the hypothesized proportion as compared to the expected variability of the $\\hat{p}$ values.\n\n$$Z = \\frac{\\hat{p} - p_0}{\\sqrt{p_0(1 - p_0)/n}}$$\n\nWhen the null hypothesis is true and the conditions are met, Z has a standard normal distribution.\n\nConditions:\n\n- independent observations\\\n- large samples $(n p_0 \\geq 10$ and $n (1-p_0) \\geq 10)$\\\n:::\n\n\n\n\n\nOne possible regulation for payday lenders is that they would be required to do a credit check and evaluate debt payments against the borrower's finances.\nWe would like to know: would borrowers support this form of regulation?\n\n::: {.guidedpractice data-latex=\"\"}\nSet up hypotheses to evaluate whether borrowers have a majority support for this type of regulation.[^16-inference-one-prop-8]\n:::\n\n[^16-inference-one-prop-8]: $H_0:$ there is not support for the regulation; $H_0:$ $p \\leq 0.50.$ $H_A:$ the majority of borrowers support the regulation; $H_A:$ $p > 0.50.$\n\nTo apply the normal distribution framework in the context of a hypothesis test for a proportion, the independence and success-failure conditions must be satisfied.\nIn a hypothesis test, the success-failure condition is checked using the null proportion: we verify $np_0$ and $n(1-p_0)$ are at least 10, where $p_0$ is the null value.\n\n::: {.guidedpractice data-latex=\"\"}\nDo payday loan borrowers support a regulation that would require lenders to pull their credit report and evaluate their debt payments?\nFrom a random sample of 826 borrowers, 51% said they would support such a regulation.\nIs it reasonable to use a normal distribution to model $\\hat{p}$ for a hypothesis test here?[^16-inference-one-prop-9]\n:::\n\n[^16-inference-one-prop-9]: Independence holds since the poll is based on a random sample.\n The success-failure condition also holds, which is checked using the null value $(p_0 = 0.5)$ from $H_0:$ $np_0 = 826 \\times 0.5 = 413,$ $n(1 - p_0) = 826 \\times 0.5 = 413.$ Recall that here, the best guess for $p$ is $p_0$ which comes from the null hypothesis (because we assume the null hypothesis is true when performing the testing procedure steps).\n $H_0:$ there is not support for the regulation; $H_0:$ $p \\leq 0.50.$ $H_A:$ the majority of borrowers support the regulation; $H_A:$ $p > 0.50.$\n\n::: {.workedexample data-latex=\"\"}\nUsing the hypotheses and data from the previous Guided Practices, evaluate whether the poll on lending regulations provides convincing evidence that a majority of payday loan borrowers support a new regulation that would require lenders to pull credit reports and evaluate debt payments.\n\n------------------------------------------------------------------------\n\nWith hypotheses already set up and conditions checked, we can move onto calculations.\nThe standard error in the context of a one-proportion hypothesis test is computed 
using the null value, $p_0:$\n\n$$SE = \\sqrt{\\frac{p_0 (1 - p_0)}{n}} = \\sqrt{\\frac{0.5 (1 - 0.5)}{826}} = 0.017$$\n\nA picture of the normal model is shown with the p-value represented by the shaded region.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](16-inference-one-prop_files/figure-html/unnamed-chunk-8-1.png){width=90%}\n:::\n:::\n\n\nBased on the normal model, the test statistic can be computed as the Z score of the point estimate:\n\n$$\n\\begin{aligned}\nZ &= \\frac{\\text{point estimate} - \\text{null value}}{SE} \\\\\n &= \\frac{0.51 - 0.50}{0.017} \\\\\n &= 0.59\n\\end{aligned} \n$$\n\nThe single tail area which represents the p-value is 0.2776.\nBecause the p-value is larger than 0.05, we do not reject $H_0.$ The poll does not provide convincing evidence that a majority of payday loan borrowers support regulations around credit checks and evaluation of debt payments.\n\nIn Section \\@ref(two-prop-errors) we discuss two-sided hypothesis tests of which the payday example may have been better structured.\nThat is, we might have wanted to ask whether the borrows **support or oppose** the regulations (to study opinion in either direction away from the 50% benchmark).\nIn that case, the p-value would have been doubled to 0.5552 (again, we would not reject $H_0).$ In the two-sided hypothesis setting, the appropriate conclusion would be to claim that the poll does not provide convincing evidence that a majority of payday loan borrowers support or oppose regulations around credit checks and evaluation of debt payments.\n\nIn both the one-sided or two-sided setting, the conclusion is somewhat unsatisfactory because there is no conclusion.\nThat is, there is no resolution one way or the other about public opinion.\nWe cannot claim that exactly 50% of people support the regulation, but we cannot claim a majority in either direction.\n:::\n\n::: {.important data-latex=\"\"}\n**Mathematical model hypothesis test for a proportion.**\n\nSet up hypotheses and verify the conditions using the null value, $p_0,$ to ensure $\\hat{p}$ is nearly normal under $H_0.$ If the conditions hold, construct the standard error, again using $p_0,$ and show the p-value in a drawing.\nLastly, compute the p-value and evaluate the hypotheses.\n:::\n\nFor additional one-proportion hypothesis test examples, see Section \\@ref(HypothesisTesting).\n\n### Violating conditions\n\nWe've spent a lot of time discussing conditions for when $\\hat{p}$ can be reasonably modeled by a normal distribution.\nWhat happens when the success-failure condition fails?\nWhat about when the independence condition fails?\nIn either case, the general ideas of confidence intervals and hypothesis tests remain the same, but the strategy or technique used to generate the interval or p-value change.\n\nWhen the success-failure condition isn't met for a hypothesis test, we can simulate the null distribution of $\\hat{p}$ using the null value, $p_0,$ as seen in Section \\@ref(one-prop-null-boot).\nUnfortunately, methods for dealing with observations which are not independent (e.g., repeated measurements on subject, such as in studies where measurements from the same subjects are taken pre and post study) are outside the scope of this book.\n\n\\vspace{10mm}\n\n## Chapter review {#chp16-review}\n\n### Summary\n\nBuilding on the foundational ideas from the previous few ideas, this chapter focused exclusively on the single population proportion as the parameter of interest.\nNote that it is not possible to do a randomization test with only 
one variable, so to do computational hypothesis testing, we applied a bootstrapping framework.\nThe bootstrap confidence interval and the mathematical framework for both hypothesis testing and confidence intervals are similar to those applied to other data structures and parameters.\nWhen using the mathematical model, keep in mind the success-failure conditions.\nAdditionally, know that bootstrapping is always more accurate with larger samples.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n
parametric bootstrap       success-failure condition
SE single proportion       Z score
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp16-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-16].\n\n::: {.exercises data-latex=\"\"}\n1. **Do aliens exist?**\nIn May 2021, YouGov asked 4,839 adult Great Britain residents whether they think aliens exist, and if so, if they have or have not visited Earth.\nYou want to evaluate if more than a quarter (25\\%) of Great Britain adults think aliens do not exist.\nIn the survey 22\\% responded \"I think they exist, and have visited Earth\", 28\\% responded \"I think they exist, but have not visited Earth\", 29% responded \"I do not think they exist\", and 22\\% responded \"Don't know\".\nA friend of yours offers to help you with setting up the hypothesis test and comes up with the following hypotheses.\nIndicate any errors you see.\n\n $H_0: \\hat{p} = 0.29 \\quad \\quad H_A: \\hat{p} > 0.29$\n \n \\vspace{5mm}\n\n1. **Married at 25.**\nA study suggests that the 25% of 25 year olds have gotten married.\nYou believe that this is incorrect and decide to collect your own sample for a hypothesis test.\nFrom a random sample of 25 year olds in census data with size 776, you find that 24% of them are married.\nA friend of yours offers to help you with setting up the hypothesis test and comes up with the following hypotheses. Indicate any errors you see.\n\n $H_0: \\hat{p} = 0.24 \\quad \\quad H_A: \\hat{p} \\neq 0.24$\n \n \\vspace{5mm}\n\n1. **Defund the police.**\nA Survey USA poll conducted in Seattle, WA in May 2021 reports that of the 650 respondents (adults living in this area), 159 support proposals to defund police departments. [@data:defundpolice]\n\n a. A journalist writing a news story on the poll results wants to use the headline \"More than 1 in 5 adults living in Seattle support proposals to defund police departments.\" You caution the journalist that they should first conduct a hypothesis test to see if the poll data provide convincing evidence for this claim. Write the hypotheses for this test.\n \n b. Calculate the proportion of Seattle adults in this sample who support proposals to defund police departments.\n \n c. Describe a setup for a simulation that would be appropriate in this situation and how the p-value can be calculated using the simulation results.\n \n d. Below is a histogram showing the distribution of $\\hat{p}_{sim}$ in 1,000 simulations under the null hypothesis. Estimate the p-value using the plot and use it to evaluate the hypotheses.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-10-1.png){width=90%}\n :::\n :::\n \n \\clearpage\n\n1. **Assisted reproduction.**\nAssisted Reproductive Technology (ART) is a collection of techniques that help facilitate pregnancy (e.g., in vitro fertilization). The 2018 ART Fertility Clinic Success Rates Report published by the Centers for Disease Control and Prevention reports that ART has been successful in leading to a live birth in 48.8% of cases where the patient is under 35 years old. [@web:art2018] A new fertility clinic claims that their success rate is higher than average for this age group. A random sample of 30 of their patients yielded a success rate of 60%. A consumer watchdog group would like to determine if this provides strong evidence to support the company's claim.\n\n a. Write the hypotheses to test if the success rate for ART at this clinic is significantly higher than the success rate reported by the CDC.\n\n b. 
Describe a setup for a simulation that would be appropriate in this situation and how the p-value can be calculated using the simulation results.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-11-1.png){width=90%}\n :::\n :::\n\n c. Below is a histogram showing the distribution of $\\hat{p}_{sim}$ in 1,000 simulations under the null hypothesis. Estimate the p-value using the plot and use it to evaluate the hypotheses.\n\n d. After performing this analysis, the consumer group releases the following news headline: \"Infertility clinic falsely advertises better success rates\". Comment on the appropriateness of this statement.\n \n \\clearpage\n\n1. **If I fits, I sits, bootstrap test.**\nA citizen science project on which type of enclosed spaces cats are most likely to sit in compared (among other options) two different spaces taped to the ground. The first was a square, and the second was a shape known as [Kanizsa square illusion](https://en.wikipedia.org/wiki/Illusory_contours#Kanizsa_figures). When comparing the two options given to 7 cats, 5 chose the square, and 2 chose the Kanizsa square illusion. We are interested to know whether these data provide convincing evidence that cats prefer one of the shapes over the other. [@Smith:2021]\n \n a. What are the null and alternative hypotheses for evaluating whether these data provide convincing evidence that cats have preference for one of the shapes\n \n b. A parametric bootstrap simulation (with 1,000 bootstrap samples) was run and the resulting null distribution is displayed in the histogram below.Find the p-value using this distribution and conclude the hypothesis test in the context of the problem.\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-12-1.png){width=90%}\n :::\n :::\n\n1. **Legalization of marijuana, bootstrap test.**\nThe 2018 General Social Survey asked a random sample of 1,563 US adults: \"Do you think the use of marijuana should be made legal, or not?\" 60% of the respondents said it should be made legal. [@data:gssgrass] Consider a scenario where, in order to become legal, 55% (or more) of voters must approve.\n \n a. What are the null and alternative hypotheses for evaluating whether these data provide convincing evidence that, if voted on, marijuana would be legalized in the US.\n \n b. A parametric bootstrap simulation (with 1,000 bootstrap samples) was run and the resulting null distribution is displayed in the histogram below. Find the p-value using this distribution and conclude the hypothesis test in the context of the problem.\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-13-1.png){width=90%}\n :::\n :::\n \n \\clearpage\n\n1. **If I fits, I sits, standard errors.**\nThe results of a study on the type of enclosed spaces cats are most likely to sit in show that 5 out of 7 cats chose a square taped to the ground over a shape known as [Kanizsa square illusion](https://en.wikipedia.org/wiki/Illusory_contours#Kanizsa_figures), which was preferred by the remaining 2 cats. To evaluate whether these data provide convincing evidence that cats prefer one of the shapes over the other, we set $H_0: p = 0.5$, where $p$ is the population proportion of cats who prefer square over the Kanizsa square illusion and $H_A: p \\neq 0.5$, which suggests some preference, without specifying which shape is more preferred. [@Smith:2021]\n\n a. 
Using the mathematical model, calculate the standard error of the sample proportion in repeated samples of size 7.\n \n b. A parametric bootstrap simulation (with 1,000 bootstrap samples) was run and the resulting null distribution is displayed in the histogram below. This distribution shows the variability of the sample proportion in samples of size 7 when 50% of cats prefer the square shape over the Kanizsa square illusion. What is the approximate standard error of the sample proportion based on this distribution?\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-14-1.png){width=90%}\n :::\n :::\n \n c. Do the mathematical model and parametric bootstrap give similar standard errors?\n \n d. In order to approach the problem using the mathematical model, is the success-failure condition met for this study?Explain.\n \n e. What about the null distribution shown above (generated using the parametric bootstrap) tells us that the mathematical model should probably not be used?\n \n \\clearpage\n\n1. **Legalization of marijuana, standard errors.**\nAccording to the 2018 General Social Survey, in a random sample of 1,563 US adults, 60% think marijuana should be made legal. [@data:gssgrass] Consider a scenario where, in order to become legal, 55% (or more) of voters must approve.\n\n a. Calculate the standard error of the sample proportion using the mathematical model.\n\n b. A parametric bootstrap simulation (with 1,000 bootstrap samples) was run and the resulting null distribution is displayed in the histogram below. This distribution shows the variability of the sample proportion in samples of size 1,563 when 55% of voters approve legalizing marijuana. What is the approximate standard error of the sample proportion based on this distribution?\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-15-1.png){width=90%}\n :::\n :::\n \n c. Do the mathematical model and parametric bootstrap give similar standard errors?\n \n d. In this setting (to test whether the true underlying population proportion is greater than 0.55), would there be a strong reason to choose the mathematical model over the parametric bootstrap (or vice versa)?\n \n \\clearpage\n\n1. **Statistics and employment, describe the bootstrap.**\nA large university knows that about 70% of the full-time students are employed at least 5 hours per week. The members of the Statistics Department wonder if the same proportion of their students work at least 5 hours per week. They randomly sample 25 majors and find that 15 of the students work 5 or more hours each week.\n\n Two bootstrap sampling distributions are created to describe the variability in the proportion of statistics majors who work at least 5 hours per week. The parametric bootstrap imposes a true population proportion of $p = 0.7$ while the data bootstrap resamples from the actual data (which has 60% of the observations who work at least 5 hours per week).\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-16-1.png){width=100%}\n :::\n :::\n \n a. The bootstrap sampling was done under two different settings to generate each of the distributions shown above. Describe the two different settings.\n\n b. Where are each of the two distributions centered? Are they centered at roughly the same place?\n \n c. Estimate the standard error of the simulated proportions based on each distribution. 
Are the two standard errors you estimate roughly equal?\n \n d. Describe the shapes of the two distributions. Are they roughly the same?\n \n \\clearpage\n\n1. **National Health Plan, parametric bootstrap.**\nA Kaiser Family Foundation poll for a random sample of US adults in 2019 found that 79% of Democrats, 55% of Independents, and 24% of Republicans supported a generic \"National Health Plan\". \nThere were 347 Democrats, 298 Republicans, and 617 Independents surveyed. [@data:KFF2019nathealthplan]\n\n A political pundit on TV claims that a majority of Independents support a National Health Plan. Do these data provide strong evidence to support this type of statement? One approach to assessing the question of whether a majority of Independents support a National Health Plan is to simulate 1,000 parametric bootstrap samples with $p = 0.5$ as the proportion of Independents in support.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-17-1.png){width=90%}\n :::\n :::\n \n a. The histogram above displays 1000 values of what? \n\n b. Is the observed proportion of Independents consistent with the parametric bootstrap proportions under the setting where $p=0.5?$\n \n c. In order to test the claim that \"a majority of Independents support a National Health Plan\" what are the null and alternative hypotheses?\n \n d. Using the parametric bootstrap distribution, find the p-value and conclude the hypothesis test in the context of the problem.\n \n \\clearpage\n\n1. **Statistics and employment, use the bootstrap.**\nIn a large university where 70% of the full-time students are employed at least 5 hours per week, the members of the Statistics Department wonder if the same proportion of their students work at least 5 hours per week. They randomly sample 25 majors and find that 15 of the students work 5 or more hours each week.\n\n Two bootstrap sampling distributions are created to describe the variability in the proportion of statistics majors who work at least 5 hours per week. The parametric bootstrap imposes a true population proportion of $p=0.7$ while the data bootstrap resamples from the actual data (which has 60% of the observations who work at least 5 hours per week).\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](16-inference-one-prop_files/figure-html/unnamed-chunk-18-1.png){width=100%}\n :::\n :::\n \n a. Which bootstrap distribution should be used to test whether the proportion of all statistics majors who work at least 5 hours per week is 70%? And which bootstrap distribution should be used to find a confidence interval for the true poportion of statistics majors who work at least 5 hours per week?\n \n b. Using the appropriate histogram, test the claim that 70% of statistics majors, like their peers, work at least 5 hours per week. State the null and alternative hypotheses, find the p-value, and conclude the test in the context of the problem.\n \n c. Using the appropriate histogram, find a 98% bootstrap percentile confidence interval for the true proportion of statistics majors who work at least 5 hours per week. Interpret the confidence interval in the context of the problem.\n \n d. Using the appropriate historgram, find a 98% bootstrap SE confidence interval for the true proportion of statistics majors who work at least 5 hours per week. Interpret the confidence interval in the context of the problem.\n \n \\vspace{5mm}\n\n1. 
**CLT for proportions.**\nDefine the term \"sampling distribution\" of the sample proportion, and describe how the shape, center, and spread of the sampling distribution change as the sample size increases when $p = 0.1$.\n\n \\clearpage\n\n1. **Vegetarian college students.**\nSuppose that 8% of college students are vegetarians. Determine if the following statements are true or false, and explain your reasoning.\n\n a. The distribution of the sample proportions of vegetarians in random samples of size 60 is approximately normal since $n \\ge 30$.\n\n b. The distribution of the sample proportions of vegetarian college students in random samples of size 50 is right skewed.\n\n c. A random sample of 125 college students where 12% are vegetarians would be considered unusual.\n\n d. A random sample of 250 college students where 12% are vegetarians would be considered unusual.\n\n e. The standard error would be reduced by one-half if we increased the sample size from 125 to 250.\n\n1. **Young Americans, American dream.** \nAbout 77% of young adults think they can achieve the American dream. \nDetermine if the following statements are true or false, and explain your reasoning. [@news:youngAmericans1]\n\n a. The distribution of sample proportions of young Americans who think they can achieve the American dream in random samples of size 20 is left skewed.\n\n b. The distribution of sample proportions of young Americans who think they can achieve the American dream in random samples of size 40 is approximately normal since $n \\ge 30$.\n\n c. A random sample of 60 young Americans where 85% think they can achieve the American dream would be considered unusual.\n\n d. A random sample of 120 young Americans where 85% think they can achieve the American dream would be considered unusual.\n\n1. **Orange tabbies.** \nSuppose that 90% of orange tabby cats are male.\nDetermine if the following statements are true or false, and explain your reasoning.\n\n a. The distribution of sample proportions of random samples of size 30 is left skewed.\n\n b. Using a sample size that is 4 times as large will reduce the standard error of the sample proportion by one-half.\n\n c. The distribution of sample proportions of random samples of size 140 is approximately normal.\n\n d. The distribution of sample proportions of random samples of size 280 is approximately normal.\n\n1. **Young Americans, starting a family.**\nAbout 25% of young Americans have delayed starting a family due to the continued economic slump.\nDetermine if the following statements are true or false, and explain your reasoning. [@news:youngAmericans2]\n\n a. The distribution of sample proportions of young Americans who have delayed starting a family due to the continued economic slump in random samples of size 12 is right skewed.\n\n b. In order for the distribution of sample proportions of young Americans who have delayed starting a family due to the continued economic slump to be approximately normal, we need random samples where the sample size is at least 40.\n\n c. A random sample of 50 young Americans where 20% have delayed starting a family due to the continued economic slump would be considered unusual.\n\n d. A random sample of 150 young Americans where 20% have delayed starting a family due to the continued economic slump would be considered unusual.\n\n e. Tripling the sample size will reduce the standard error of the sample proportion by one-third.\n \n \\clearpage\n\n1. 
**Sex equality.**\nThe General Social Survey asked a random sample of 1,390 Americans the following question: \"On the whole, do you think it should or should not be the government's responsibility to promote equality between men and women?\" 82% of the respondents said it \"should be\". At a 95% confidence level, this sample has 2% margin of error. Based on this information, determine if the following statements are true or false, and explain your reasoning. [@data:gsssexeq]\n\n a. We are 95% confident that between 80% and 84% of Americans in this sample think it's the government's responsibility to promote equality between men and women.\n\n b. We are 95% confident that between 80% and 84% of all Americans think it's the government's responsibility to promote equality between men and women.\n\n c. If we considered many random samples of 1,390 Americans, and we calculated 95% confidence intervals for each, 95% of these intervals would include the true population proportion of Americans who think it's the government's responsibility to promote equality between men and women.\n\n d. In order to decrease the margin of error to 1%, we would need to quadruple (multiply by 4) the sample size.\n\n e. Based on this confidence interval, there is sufficient evidence to conclude that a majority of Americans think it's the government's responsibility to promote equality between men and women.\n \n \\vspace{3mm}\n\n1. **Elderly drivers.** \nThe Marist Poll published a report stating that 66% of adults nationally think licensed drivers should be required to retake their road test once they reach 65 years of age. It was also reported that interviews were conducted on a random sample of 1,018 American adults, and that the margin of error was 3% using a 95% confidence level. [@data:elderlyDriving]\n\n a. Verify the margin of error reported by The Marist Poll using a mathematical model.\n\n b. Based on a 95% confidence interval, does the poll provide convincing evidence that *more than* two thirds of the population think that licensed drivers should be required to retake their road test once they turn 65?\n \n \\vspace{3mm}\n\n1. **Fireworks on July 4$^{\\text{th}}$.** \nA local news outlet reported that 56% of 600 randomly sampled Kansas residents planned to set off fireworks on July $4^{th}$. \nDetermine the margin of error for the 56% point estimate using a 95% confidence level using a mathematical model. [@data:july4]\n\n \\vspace{3mm}\n\n1. **Proof of COVID-19 vaccination.**\nA Gallup poll surveyed 3,731 randomly sampled US in April 2021, asking how they felt about requiring proof of COVID-19 vaccination for travel by airplane. \nThe poll found that 57% said they would favor it. [@data:gallupcovidvaccine]\n\n a. Describe the population parameter of interest. What is the value of the point estimate of this parameter?\n\n b. Check if the conditions required for constructing a confidence interval using a mathematical model based on these data are met.\n\n c. Construct a 95% confidence interval for the proportion of US adults who favor requiring proof of COVID-19 vaccination for travel by airplane.\n\n d. Without doing any calculations, describe what would happen to the confidence interval if we decided to use a higher confidence level.\n\n e. Without doing any calculations, describe what would happen to the confidence interval if we used a larger sample.\n \n \\clearpage\n\n1. 
**Study abroad.** \nA survey on 1,509 high school seniors who took the SAT and who completed an optional web survey shows that 55% of high school seniors are fairly certain that they will participate in a study abroad program in college. [@data:studyAbroad]\n\n a. Is this sample a representative sample from the population of all high school seniors in the US? Explain your reasoning.\n\n b. Let's suppose the conditions for inference are met. Even if your answer to part (a) indicated that this approach would not be reliable, this analysis may still be interesting to carry out (though not report). Using a mathematical model, construct a 90% confidence interval for the proportion of high school seniors (of those who took the SAT) who are fairly certain they will participate in a study abroad program in college, and interpret this interval in context.\n\n c. What does \"90% confidence\" mean?\n\n d. Based on this interval, would it be appropriate to claim that the majority of high school seniors are fairly certain that they will participate in a study abroad program in college?\n\n1. **Legalization of marijuana, mathematical interval.**\nThe General Social Survey asked a random sample of 1,563 US adults: \"Do you think the use of marijuana should be made legal, or not?\" 60% of the respondents said it should be made legal. [@data:gssgrass]\n\n a. Is 60% a sample statistic or a population parameter? Explain.\n\n b. Using a mathematical model, construct a 95% confidence interval for the proportion of US adults who think marijuana should be made legal, and interpret it.\n\n c. A critic points out that this 95% confidence interval is only accurate if the statistic follows a normal distribution, or if the normal model is a good approximation. Is this true for these data? Explain.\n\n d. A news piece on this survey's findings states, \"Majority of US adults think marijuana should be legalized.\" Based on your confidence interval, is this statement justified?\n\n1. **National Health Plan, mathematical inference.**\nA Kaiser Family Foundation poll for a random sample of US adults in 2019 found that 79% of Democrats, 55% of Independents, and 24% of Republicans supported a generic \"National Health Plan\". \nThere were 347 Democrats, 298 Republicans, and 617 Independents surveyed. [@data:KFF2019nathealthplan]\n\n a. A political pundit on TV claims that a majority of Independents support a National Health Plan. Do these data provide strong evidence to support this type of statement? Your response should use a mathematical model.\n\n b. Would you expect a confidence interval for the proportion of Independents who oppose the public option plan to include 0.5? Explain.\n\n1. **Is college worth it?**\nAmong a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school. [@data:collegeWorthIt]\n\n a. A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test to determine if these data provide strong evidence supporting this statement.\n\n b. Would you expect a confidence interval for the proportion of American adults who decide not to go to college because they cannot afford it to include 0.5? Explain.\n \n \\clearpage\n\n1. 
**Taste test.**\nSome people claim that they can tell the difference between a diet soda and a regular soda in the first sip.\nA researcher wanting to test this claim randomly sampled 80 such people.\nHe then filled 80 plain white cups with soda, half diet and half regular through random assignment, and asked each person to take one sip from their cup and identify the soda as diet or regular.\n53 participants correctly identified the soda.\n\n a. Do these data provide strong evidence that these people are able to detect the difference between diet and regular soda, in other words, are the results significantly better than just random guessing? Your response should use a mathematical model.\n\n b. Interpret the p-value in this context.\n \n \\vspace{5mm}\n\n1. **Will the coronavirus bring the world closer together?**\nAn April 2021 YouGov poll asked 4,265 UK adults whether they think the coronavirus bring the world closer together or leave us further apart. \n12% of the respondents said it will bring the world closer together. 37% said it would leave us further apart, 39% said it won't make a difference and the remainder didn't have an opinion on the matter. [@data:yougovcovid]\n\n a. Calculate, using a mathematical model, a 90% confidence interval for the proportion of UK adults who think the coronavirus will bring the world closer together, and interpret the interval in context.\n\n b. Suppose we wanted the margin of error for the 90% confidence level to be about 0.5%. How large of a sample size would you recommend for the poll?\n \n \\vspace{5mm}\n\n1. **Quality control.**\nAs part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. \nShe finds that 27 of the chips are defective.\n\n a. What population is under consideration in the data set?\n\n b. What parameter is being estimated?\n\n c. What is the point estimate for the parameter?\n\n d. What is the name of the statistic that can be used to measure the uncertainty of the point estimate?\n\n e. Compute the value of the statistic from part (d) using a mathematical model.\n\n f. The historical rate of defects is 10%. Should the engineer be surprised by the observed rate of defects during the current week?\n\n g. Suppose the true population value was found to be 10%. If we use this proportion to recompute the value in part (d) using $p = 0.1$ instead of $\\hat{p}$, how much does the resulting value of the statistic change?\n \n \\vspace{5mm}\n\n1. **Nearsighted children.**\nNearsightedness (myopia) is a common vision condition in which you can see objects near to you clearly, but objects farther away are blurry. \nIt is believed that nearsightedness affects about 8% of all children. \nIn a random sample of 194 children, 21 are nearsighted. \nUsing a mathematical model, conduct a hypothesis test for the following question: do these data provide evidence that the 8% value is inaccurate?\n\n \\clearpage\n\n1. **Website registration.**\nA website is trying to increase registration for first-time visitors, exposing 1% of these visitors to a new site design. \nOf 752 randomly sampled visitors over a month who saw the new design, 64 registered.\n\n a. Check the conditions for constructing a confidence interval using a mathematical model.\n\n b. Compute the standard error which would describe the variability associated with repeated samples of size 752.\n\n c. 
Construct and interpret a 90% confidence interval for the fraction of first-time visitors of the site who would register under the new design (assuming stable behaviors by new visitors over time).\n \n \\vspace{5mm}\n\n1. **Coupons driving visits.**\nA store randomly samples 603 shoppers over the course of a year and finds that 142 of them made their visit because of a coupon they'd received in the mail.\nUsing a mathematical model, construct a 95% confidence interval for the fraction of all shoppers during the year whose visit was because of a coupon they'd received in the mail.\n\n\n:::\n", + "supporting": [ + "16-inference-one-prop_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/16-inference-one-prop/figure-html/choosingZForCI-1.png b/_freeze/16-inference-one-prop/figure-html/choosingZForCI-1.png new file mode 100644 index 00000000..cfc32f0e Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/choosingZForCI-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-1.png b/_freeze/16-inference-one-prop/figure-html/nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-1.png new file mode 100644 index 00000000..101be280 Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/nullDistForPHatIfLiverTransplantConsultantIsNotHelpful-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-10-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 00000000..7d39862d Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-10-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-11-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 00000000..d297836b Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-11-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-12-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 00000000..6c16cc6e Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-12-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-13-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-13-1.png new file mode 100644 index 00000000..be5f2b72 Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-13-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-14-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-14-1.png new file mode 100644 index 00000000..edcb6b0e Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-14-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-15-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 00000000..be5f2b72 Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-15-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-16-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 00000000..3302a230 Binary files /dev/null and 
b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-16-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-17-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 00000000..8ec2c722 Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-17-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-18-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 00000000..3302a230 Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-18-1.png differ diff --git a/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-8-1.png b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 00000000..40aad958 Binary files /dev/null and b/_freeze/16-inference-one-prop/figure-html/unnamed-chunk-8-1.png differ diff --git a/_freeze/17-inference-two-props/execute-results/html.json b/_freeze/17-inference-two-props/execute-results/html.json new file mode 100644 index 00000000..e8bfa33d --- /dev/null +++ b/_freeze/17-inference-two-props/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "ed8ed82846a67fba1472c7c554ca7f19", + "result": { + "markdown": "# Inference for comparing two proportions {#inference-two-props}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nWe now extend the methods from Chapter \\@ref(inference-one-prop) to apply confidence intervals and hypothesis tests to differences in population proportions that come from two groups, Group 1 and Group 2: $p_1 - p_2.$\n\nIn our investigations, we'll identify a reasonable point estimate of $p_1 - p_2$ based on the sample, and you may have already guessed its form: $\\hat{p}_1 - \\hat{p}_2.$ \\index{point estimate!difference of proportions} Then we'll look at the inferential analysis in three different ways: using a randomization test, applying bootstrapping for interval estimates, and, if we verify that the point estimate can be modeled using a normal distribution, we compute the estimate's standard error, and we apply the mathematical framework.\n:::\n\n\n\n\n\n## Randomization test for the difference in proportions {#two-prop-errors}\n\n### Observed data\n\nLet's take another look at the cardiopulmonary resuscitation (CPR) study we introduced in Chapter \\@ref(two-sided-hypotheses).\nThe experiment consisted of two treatments on patients who underwent CPR for a heart attack and were subsequently admitted to a hospital.\nEach patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group).\nThe outcome variable of interest was whether the patient survived for at least 24 hours.\n[@Bottiger:2001]\n\n::: {.data data-latex=\"\"}\nThe [`cpr`](http://openintrostat.github.io/openintro/reference/cpr.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe results are summarized in Table \\@ref(tab:cpr-summary-again) (which is a replica of Table \\@ref(tab:cpr-summary)).\n11 out of the 50 patients in the control group and 14 out of the 40 patients in the treatment group survived.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Results for the CPR study. Patients in the treatment group were given a blood thinner, and patients in the control group were not.
| Group     | Died | Survived | Total |
|-----------|------|----------|-------|
| Control   | 39   | 11       | 50    |
| Treatment | 26   | 14       | 40    |
| Total     | 65   | 25       | 90    |
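The re-randomization described in the next section can be sketched in a few lines of R. This is a minimal sketch rather than the text's own code: it assumes the `cpr` data frame from the **openintro** package has columns named `group` (control/treatment) and `outcome` (died/survived); check `names(cpr)` and adjust the names or level labels if your copy of the data differs.

```r
# Minimal sketch of a randomization (permutation) test for the difference
# in survival proportions, assuming columns `group` and `outcome` exist.
library(openintro)

table(cpr$group, cpr$outcome)  # should reproduce the counts in the table above

obs_diff <- mean(cpr$outcome[cpr$group == "treatment"] == "survived") -
  mean(cpr$outcome[cpr$group == "control"] == "survived")

set.seed(1234)
null_diffs <- replicate(100, {
  shuffled <- sample(cpr$group)  # re-randomize the group labels under H0
  mean(cpr$outcome[shuffled == "treatment"] == "survived") -
    mean(cpr$outcome[shuffled == "control"] == "survived")
})

# proportion of shuffled differences at least as large as the observed 0.13
mean(null_diffs >= obs_diff)
```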
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIs this an observational study or an experiment?\nWhat implications does the study type have on what can be inferred from the results?[^17-inference-two-props-1]\n:::\n\n[^17-inference-two-props-1]: The study is an experiment, as patients were randomly assigned an experiment group.\n    Since this is an experiment, the results can be used to evaluate a causal relationship between blood thinner use after CPR and whether patients survived.\n\nIn this study, a larger proportion of patients who received blood thinner after CPR, $\hat{p}_T = \frac{14}{40} = 0.35,$ survived compared to those who did not receive blood thinner, $\hat{p}_C = \frac{11}{50} = 0.22.$ However, based on these observed proportions alone, we cannot determine whether the difference ($\hat{p}_T - \hat{p}_C = 0.35 - 0.22 = 0.13$) provides *convincing evidence* that blood thinner usage after CPR is effective.\n\nAs we saw in Chapter \@ref(foundations-randomization), we can re-randomize the responses (`survived` or `died`) to the treatment conditions assuming the null hypothesis is true and compute possible differences in proportions.\nThe process by which we randomize observations to two groups is summarized and visualized in Figure \@ref(fig:fullrand).\n\n### Variability of the statistic\n\nFigure \@ref(fig:cpr-rand-dot-plot) shows a stacked plot of the differences found from 100 randomization simulations (i.e., repeated iterations as described in Figure \@ref(fig:fullrand)), where each dot represents a simulated difference between the survival rates (treatment rate minus control rate).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:cpr-rand-dot-plot-cap)](17-inference-two-props_files/figure-html/cpr-rand-dot-plot-1.png){width=90%}\n:::\n:::\n\n\n(ref:cpr-rand-dot-plot-cap) A stacked dot plot of differences from 100 simulations produced under the independence model $H_0,$ where in these simulations survival is unaffected by the treatment. Twelve of the 100 simulations had a difference of at least 13%, the difference observed in the study.\n\n### Observed statistic vs null statistics\n\nNote that the distribution of these simulated differences is centered around 0.\nWe simulated the differences assuming that the independence model was true, that blood thinners after CPR have no effect on survival.\nUnder the null hypothesis, we expect the difference to be near zero with some random fluctuation, where *near* is pretty generous in this case since the sample sizes are so small in this study.\n\n::: {.workedexample data-latex=\"\"}\nHow often would you observe a difference of at least 13% (0.13) according to Figure \@ref(fig:cpr-rand-dot-plot)?\nIs this a rare event?\n\n------------------------------------------------------------------------\n\nIt appears that a difference of at least 13% would happen about 12% of the time due to chance alone if the null hypothesis were true, according to Figure \@ref(fig:cpr-rand-dot-plot).\nThis is not a very rare event.\n:::\n\nThe difference of 13% not being a rare event suggests two possible interpretations of the results of the study:\n\n- $H_0$ Independence model. Blood thinners after CPR have no effect on survival, and we just happened to observe a difference that would only occur on a rare occasion.\n- $H_A$ Alternative model. 
Blood thinners after CPR increase the chance of survival, and the difference we observed was actually due to the blood thinners after CPR being effective at increasing the chance of survival, which explains the difference of 13%.\n\nSince we determined that the outcome is not that rare (12% chance of observing a difference of 13% or more under the assumption that blood thinners after CPR have no effect on survival), we fail to reject $H_0$, and conclude that the study results do not provide strong evidence against the independence model.\nThis does not mean that we have proved that blood thinners are not effective; it just means that this study does not provide convincing evidence that they are effective in this setting.\n\nStatistical inference is built on evaluating how likely such differences are to occur due to chance if in fact the null hypothesis is true.\nIn statistical inference, data scientists evaluate which model is most reasonable given the data.\nErrors do occur, just like rare events, and we might choose the wrong model.\nWhile we do not always choose correctly, statistical inference gives us tools to control and evaluate how often these errors occur.\n\n## Bootstrap confidence interval for the difference in proportions {#two-prop-boot-ci}\n\n\chaptermark{Bootstrap CI for the difference in proportions}\n\nIn Section \@ref(two-prop-errors), we worked with the randomization distribution to understand the distribution of $\hat{p}_1 - \hat{p}_2$ when the null hypothesis $H_0: p_1 - p_2 = 0$ is true.\nNow, through bootstrapping, we study the variability of $\hat{p}_1 - \hat{p}_2$ without assuming the null hypothesis is true.\n\n### Observed data\n\nReconsider the CPR data from Section \@ref(two-prop-errors) which is provided in Table \@ref(tab:cpr-summary).\nAgain, we use the difference in sample proportions as the observed statistic of interest.\nHere, the value of the statistic is: $\hat{p}_T - \hat{p}_C = 0.35 - 0.22 = 0.13.$\n\n### Variability of the difference in sample proportions\n\nThe bootstrap method applied to two samples is an extension of the method described in Chapter \@ref(foundations-bootstrapping).\nNow, we have two samples, so each sample estimates the population from which it came.\nIn the CPR setting, the `treatment` sample estimates the population of all individuals who have gotten (or will get) the treatment; the `control` sample estimates the population of all individuals who do not get the treatment and are controls.\nFigure \@ref(fig:boot2proppops) extends Figure \@ref(fig:boot1) to show the bootstrapping process from two samples simultaneously.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Creating two populations from which to take each of the bootstrap samples.](images/boot2proppops.png){fig-alt='Sample 1 is taken from Population 1 (3 colored marbles out of 7); Sample 2 is taken from Population 2 (5 colored marbles out of 9). Each of the two samples is used to create separate infinitely large proxy populations. Proxy population 1 has 3/7 colored marbles; proxy population 2 has 4/9 colored marbles.' 
width=100%}\n:::\n:::\n\n\nAs before, once the population is estimated, we can randomly resample observations to create bootstrap samples, as seen in Figure \\@ref(fig:boot2propresamps).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Taking each bootstrap sample from the estimated population.](images/boot2propresamps.png){fig-alt='Sample 1 is taken from Population 1 (3 colored marbles out of 7); Sample 2 is taken from Population 2 (5 colored marbles out of 9). Each of the two samples is used to create separate infinitely large proxy populations. Proxy population 1 has 3/7 colored marbles; proxy population 2 has 4/9 colored marbles. Resamples are taken from each of the proxy populations. The three resamples from proxy population 1 have 2/7, 4/7 and 5/7 colored marbles, respectively. The three resamples from proxy population 2 have 5/9, 4/9, and 7/9 colored marbles, respectively.' width=100%}\n:::\n:::\n\n\nThe variability of the statistic (the difference in sample proportions) can be calculated by taking one bootstrap resample from Sample 1 and one bootstrap resample from Sample 2 and calculating the difference in the bootstrap proportions.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For example, the first bootstrap resamples from Sample 1 and Sample 2 provide resample proportions of 2/7 and 5/9, respectively.](images/boot2prop3.png){fig-alt='The first resamples from each of the two proxy populations are compared. Resample 1 from proxy population 1 has 2/7 colored marbles; resample 1 from proxy population 2 has 5/9 colored marbles. The difference in bootstrap proportions is taken as 2/7 minus 5/9.' width=60%}\n:::\n:::\n\n\n\\clearpage\n\nAs always, the variability of the difference in proportions can only be estimated by repeated simulations, in this case, repeated bootstrap resamples.\nFigure \\@ref(fig:boot2samp2) shows multiple bootstrap differences calculated for each of the repeated bootstrap samples.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For each pair of bootstrap samples, we calculate the difference in sample proportions](images/boot2prop2.png){fig-alt='Shown are the two infinitely large proxy populations (created from sample 1 and sample 2). From each proxy population, three resamples are shown. For each pair of resamples, the difference in bootstrap proportions is taken. The first pair of resamples gives a difference in bootstrapped proportions of 2/7 minus 5/9; the second pair of resamples gives a difference in bootstrapped proportions of 4/7 minus 4/9; the last pair of resamples gives a difference in bootstrapped proportions of 5/7 minus 7/9.' 
width=100%}\n:::\n:::\n\n\nRepeated bootstrap simulations lead to a bootstrap sampling distribution of the statistic of interest, here the difference in sample proportions.\nFigure \@ref(fig:boot2samp1) visualizes the process and Figure \@ref(fig:bootCPR1000) shows 1,000 bootstrap differences in proportions for the CPR data.\nNote that the CPR data includes 40 and 50 people in the respective groups, and the illustrated example includes 7 and 9 people in the two groups.\nAccordingly, the variability in the distribution of sample proportions is higher for the illustrated example.\nAs you will see in the mathematical models discussed in Section \@ref(math-2prop), large sample sizes lead to smaller standard errors for a difference in proportions.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The differences in each bootstrapped pair of proportions are combined to create the sampling distribution of the differences in proportions.](images/boot2prop1.png){fig-alt='Shown are the two infinitely large proxy populations (created from sample 1 and sample 2). From each proxy population, three resamples are shown. For each pair of resamples, the difference in bootstrap proportions is taken. A dotplot displays many differences in bootstrap proportions. The differences range from roughly -0.6 to +0.3.' width=100%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![A histogram of differences in proportions from 1000 bootstrap simulations of the CPR data. Note that because the CPR data has a larger sample size than the illustrated example, the variability of the difference in proportions is much smaller with the CPR histogram.](17-inference-two-props_files/figure-html/bootCPR1000-1.png){width=90%}\n:::\n:::\n\n\n### Bootstrap percentile vs. SE confidence intervals\n\nFigure \@ref(fig:bootCPR1000) provides an estimate for the variability of the difference in survival proportions from sample to sample.\nThe values in the histogram can be used in two different ways to create a confidence interval for the parameter of interest: $p_1 - p_2.$\n\n**Bootstrap percentile confidence interval**\n\n\n::: {.cell}\n\n:::\n\n\nAs in Chapter \@ref(foundations-bootstrapping), the bootstrap confidence interval can be calculated directly from the bootstrapped differences in Figure \@ref(fig:bootCPR1000).\nThe interval created from the percentiles of the distribution is called the **percentile interval**.\nNote that here we calculate the 90% confidence interval by finding the $5^{th}$ and $95^{th}$ percentile values from the bootstrapped differences.\nThe bootstrap $5^{th}$ percentile value is -0.032 and the $95^{th}$ percentile value is 0.284.\nThe result is: we are 90% confident that, in the population, the true probability of survival for individuals receiving blood thinners after CPR is between 0.032 lower and 0.284 higher than for those who did not receive blood thinners.\nThe interval shows that we do not have much definitive evidence of the effect of blood thinners, one way or another.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The CPR data is bootstrapped 1,000 times. 
Each simulation creates a sample from the original data where the probability of survival in the treatment group is $\hat{p}_{T} = 14/40$ and the probability of survival in the control group is $\hat{p}_{C} = 11/50.$ ](17-inference-two-props_files/figure-html/bootCPR1000CI-1.png){width=80%}\n:::\n:::\n\n\n**Bootstrap SE confidence interval**\n\n\n::: {.cell}\n\n:::\n\n\nAlternatively, we can use the variability in the bootstrapped differences to calculate a standard error of the difference.\nThe resulting interval is called the **SE interval**.\nSection \@ref(math-2prop) details the mathematical model for the standard error of the difference in sample proportions, but the bootstrap distribution typically does an excellent job of estimating the variability of the sampling distribution of the sample statistic.\n\n\n\n\n\n$$SE(\hat{p}_T - \hat{p}_C) \approx SE(\hat{p}_{T, boot} - \hat{p}_{C, boot}) = 0.098$$\n\nThe variability of the difference in proportions was calculated in R using the `sd()` function, but any statistical software will calculate the standard deviation of the differences, here, the exact quantity we hope to approximate.\n\nNote that we do not know the true distribution of $\hat{p}_T - \hat{p}_C,$ so we will use a rough approximation to find a confidence interval for $p_T - p_C.$ As seen in the bootstrap histograms, the shape of the distribution is roughly symmetric and bell-shaped.\nSo for a rough approximation, we will apply the 68-95-99.7 rule which tells us that 95% of observed differences should be roughly no farther than 2 SE from the true parameter (difference in proportions).\nA 95% confidence interval for $p_T - p_C$ is given by:\n\n$$\hat{p}_T - \hat{p}_C \pm 2 \cdot SE \ \ \ \rightarrow \ \ \ 14/40 - 11/50 \pm 2 \cdot 0.098 \ \ \ \rightarrow \ \ \ (-0.067, 0.327)$$\n\nWe are 95% confident that the true value of $p_T - p_C$ is between -0.067 and 0.327.\nAgain, the wide confidence interval that contains zero indicates that the study provides very little evidence about the effectiveness of blood thinners.\nFor other percentages, e.g., a 90% bootstrap SE confidence interval, we will use quantiles given by the standard normal distribution, as seen in Section \@ref(normalDist) and Figure \@ref(fig:er6895997).\n\n### What does 95% mean?\n\nRecall that the goal of a confidence interval is to find a plausible range of values for a *parameter* of interest.\nThe estimated statistic is not the value of interest, but it is typically the best guess for the unknown parameter.\nThe confidence level (often 95%) is a number that takes a while to get used to.\nSurprisingly, the percentage does not describe the dataset at hand; it describes many possible datasets.\nOne way to understand a confidence interval is to think about all the confidence intervals that you have ever made or that you will ever make as a scientist; the confidence level describes **those** intervals.\n\nFigure \@ref(fig:ci25ints) demonstrates a hypothetical situation in which 25 different studies are performed on the exact same population (with the same goal of estimating the true parameter value of $p_1 - p_2 = 0.47).$ The study at hand represents one point estimate (a dot) and a corresponding interval.\nIt is not possible to know whether the interval at hand is to the right of the unknown true parameter value (the black line) or to the left of that line.\nIt is also impossible to know whether the interval captures the true parameter (is blue) or does not (is red).\nIf we 
are making 95% intervals, then about 5% of the intervals we create over our lifetime will *not* capture the parameter of interest (e.g., will be red as in Figure \@ref(fig:ci25ints)).\nWhat we know is that over our lifetimes as scientists, about 95% of the intervals created and reported on will capture the parameter value of interest: thus the language \"95% confident.\"\n\n\clearpage\n\n\n::: {.cell}\n::: {.cell-output-display}\n![One hypothetical population, parameter value of: $p_1 - p_2 = 0.47.$ Twenty-five different studies, all of which led to a different point estimate, SE, and confidence interval. The study at hand is one of the horizontal lines (hopefully a blue line!).](17-inference-two-props_files/figure-html/ci25ints-1.png){fig-alt='A series of 25 horizontal lines are drawn, representing each of 25 different studies (where a study represents two samples, one from each of population 1 and population 2). Each horizontal line starts at the value of the lower bound of the confidence interval and ends at the value of the upper bound of the confidence interval which was created from that particular sample. In the center of the line is a solid dot at the observed difference in proportion of successes for sample 1 minus sample 2. A dashed vertical line runs through the horizontal lines at p = 0.47 (which is the true value of the difference in population proportions). 24 of the 25 horizontal lines cross the vertical line at 0.47, but one of the horizontal lines is completely above 0.47. The line that does not cross 0.47 is colored red because the confidence interval from that particular sample would not have captured the true difference in population proportions.' width=85%}\n:::\n:::\n\n\nThe choice of 95% or 90% or even 99% as a confidence level is admittedly somewhat arbitrary; however, it is related to the logic we used when deciding that a p-value should be declared as \"significant\" if it is lower than 0.05 (or 0.10 or 0.01, respectively).\nIndeed, one can show mathematically that a 95% confidence interval and a two-sided hypothesis test at a cutoff of 0.05 will provide the same conclusion when the same data and mathematical tools are applied for the analysis.\nA full derivation of the explicit connection between confidence intervals and hypothesis tests is beyond the scope of this text.\n\n\clearpage\n\n## Mathematical model for the difference in proportions {#math-2prop}\n\n### Variability of the difference between two proportions\n\nLike with $\hat{p},$ the difference of two sample proportions $\hat{p}_1 - \hat{p}_2$ can be modeled using a normal distribution when certain conditions are met.\nFirst, we require a broader independence condition, and secondly, the success-failure condition must be met by both groups.\n\n::: {.important data-latex=\"\"}\n**Conditions for the sampling distribution of** $\hat{p}_1 -\hat{p}_2$ **to be normal.**\n\nThe difference $\hat{p}_1 - \hat{p}_2$ can be modeled using a normal distribution when\n\n1. *Independence* (extended). The data are independent within and between the two groups. Generally this is satisfied if the data come from two independent random samples or if the data come from a randomized experiment.\n2. *Success-failure condition.* The success-failure condition holds for both groups, where we check successes and failures in each group separately. 
That is, we should have at least 10 successes and 10 failures in each of the two groups.\n\nWhen these conditions are satisfied, the standard error of $\hat{p}_1 - \hat{p}_2$ is:\n\n$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$\n\nwhere $p_1$ and $p_2$ represent the population proportions, and $n_1$ and $n_2$ represent the sample sizes.\n\nNote that in most cases, the standard error is approximated using the observed data:\n\n$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$\n\nwhere $\hat{p}_1$ and $\hat{p}_2$ represent the observed sample proportions, and $n_1$ and $n_2$ represent the sample sizes.\n:::\n\nRecall that the margin of error is defined by the standard error.\nThe margin of error for $\hat{p}_1 - \hat{p}_2$ can be directly obtained from $SE(\hat{p}_1 - \hat{p}_2).$\n\n::: {.important data-latex=\"\"}\n**Margin of error for** $\hat{p}_1 - \hat{p}_2.$\n\nThe margin of error is $z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ where $z^\star$ is calculated from a specified percentile on the normal distribution.\n:::\n\n\index{standard error (SE)!difference in proportions}\n\n\n\n\n\n\clearpage\n\n### Confidence interval for the difference between two proportions\n\nWe can apply the generic confidence interval formula for a difference of two proportions, where we use $\hat{p}_1 - \hat{p}_2$ as the point estimate and substitute the $SE$ formula:\n\n$$\n\begin{aligned}\n\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\\n(\hat{p}_1 - \hat{p}_2) \ &\pm \ z^{\star} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\n\end{aligned}\n$$\n\n::: {.important data-latex=\"\"}\n**Standard error of the difference in two proportions,** $\hat{p}_1 -\hat{p}_2.$\n\nWhen the conditions for the normal model are met, the **variability** of the difference in proportions, $\hat{p}_1 -\hat{p}_2,$ is well described by:\n\n$$SE(\hat{p}_1 -\hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$\n:::\n\n::: {.workedexample data-latex=\"\"}\nWe reconsider the experiment for patients who underwent cardiopulmonary resuscitation (CPR) for a heart attack and were subsequently admitted to a hospital.\nThese patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner.\nThe outcome variable of interest was whether the patients survived for at least 24 hours.\nThe results are shown in Table \@ref(tab:cpr-summary).\nCheck whether we can model the difference in sample proportions using the normal distribution.\n\n------------------------------------------------------------------------\n\nWe first check for independence: since this is a randomized experiment, it seems reasonable to assume that the observations are independent.\n\nNext, we check the success-failure condition for each group.\nWe have at least 10 successes and 10 failures in each experiment arm (11, 14, 39, 26), so this condition is also satisfied.\n\nWith both conditions satisfied, the difference in sample proportions can be reasonably modeled using a normal distribution for these data.\n:::\n\n::: {.workedexample data-latex=\"\"}\nCreate and interpret a 90% confidence interval for the difference in survival rates in the CPR 
study.\n\n------------------------------------------------------------------------\n\nWe'll use $p_T$ for the survival rate in the treatment group and $p_C$ for the control group:\n\n$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$\n\nWe use the standard error formula previously provided.\nAs with the one-sample proportion case, we use the sample estimates of each proportion in the formula in the confidence interval context:\n\n$$SE \approx \sqrt{\frac{0.35 (1 - 0.35)}{40} + \frac{0.22 (1 - 0.22)}{50}} = 0.095$$\n\nFor a 90% confidence interval, we use $z^{\star} = 1.65:$\n\n$$\n\begin{aligned}\n\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\\n0.13 \ &\pm \ 1.65 \ \times \ 0.095 \\\n(-0.027 \ &, \ 0.287)\n\end{aligned}\n$$\n\nWe are 90% confident that individuals receiving blood thinners have between a 2.7% lower chance of survival and a 28.7% greater chance of survival than those in the control group.\nBecause 0% is contained in the interval, we do not have enough information to say whether blood thinners help or harm heart attack patients who have been admitted after they have undergone CPR.\n\nNote that the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such as 95% or 99%).\nA lower degree of confidence increases potential for error, but it also produces a narrower interval.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nA 5-year experiment was conducted to evaluate the effectiveness of fish oils on reducing cardiovascular events, where each subject was randomized into one of two treatment groups [@Manson:2019].\nWe'll consider heart attack outcomes in the patients listed in Table \@ref(tab:fish-oil-data).\n\nCreate a 95% confidence interval for the effect of fish oils on heart attacks for patients who are well-represented by those in the study.\nAlso interpret the interval in the context of the study.[^17-inference-two-props-2]\n:::\n\n[^17-inference-two-props-2]: Because the patients were randomized, the subjects are independent, both within and between the two groups.\n    The success-failure condition is also met for both groups as all counts are at least 10.\n    This satisfies the conditions necessary to model the difference in proportions using a normal distribution.\n    Compute the sample proportions $(\hat{p}_{\text{fish oil}} = 0.0112,$ $\hat{p}_{\text{placebo}} = 0.0155),$ point estimate of the difference $(0.0112 - 0.0155 = -0.0043),$ and standard error $SE = \sqrt{\frac{0.0112 \times 0.9888}{12933} + \frac{0.0155 \times 0.9845}{12938}},$ $SE = 0.00145.$ Next, plug the values into the general formula for a confidence interval, where we'll use a 95% confidence level with $z^{\star} = 1.96:$ $-0.0043 \pm 1.96 \times 0.00145 = (-0.0071, -0.0015).$ We are 95% confident that fish oil decreases heart attacks by 0.15 to 0.71 percentage points (off of a baseline of about 1.55%) over a 5-year period for subjects who are similar to those in the study.\n    Because the interval is entirely below 0 and the treatment was randomly assigned, the data provide strong evidence that fish oil supplements reduce heart attacks in patients like those in the study.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n
Results for the study on n-3 fatty acid supplement and related health benefits.
|          | heart attack | no event | Total |
|----------|--------------|----------|-------|
| fish oil | 145          | 12788    | 12933 |
| placebo  | 200          | 12738    | 12938 |
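For readers following along in R, the interval reported in the footnote above can be reproduced (up to rounding) directly from the counts in this table. This is a minimal base R sketch, not code from the text, and the hand-rounded values in the footnote will differ slightly.

```r
# 95% CI for the difference in heart attack proportions, fish oil vs placebo,
# using the normal (mathematical) model and the counts from the table above.
p_fish    <- 145 / 12933   # heart attack rate in the fish oil group
p_placebo <- 200 / 12938   # heart attack rate in the placebo group

pt_est <- p_fish - p_placebo
se <- sqrt(p_fish * (1 - p_fish) / 12933 + p_placebo * (1 - p_placebo) / 12938)

z_star <- qnorm(0.975)     # roughly 1.96 for 95% confidence
pt_est + c(-1, 1) * z_star * se
# roughly (-0.0071, -0.0015), matching the footnote solution
```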
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`fish_oil_18`](http://openintrostat.github.io/openintro/reference/fish_oil.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n### Hypothesis test for the difference between two proportions\n\nThe details for calculating a SE and for checking technical conditions are very similar to that of confidence intervals.\nHowever, when the null hypothesis is that $p_1 - p_2 = 0,$ we use a special proportion called the **pooled proportion**\\index{pooled proportion} to estimate the SE and to check the success-failure condition.\n\n::: {.important data-latex=\"\"}\n**Use the pooled proportion when** $H_0$ is $p_1 - p_2 = 0.$\n\nWhen the null hypothesis is that the proportions are equal, use the pooled proportion $(\\hat{p}_{\\textit{pool}})$ of successes to verify the success-failure condition and estimate the standard error:\n\n$$\\hat{p}_{\\textit{pool}} = \\frac{\\text{number of successes}}{\\text{number of cases}} = \\frac{\\hat{p}_1 n_1 + \\hat{p}_2 n_2}{n_1 + n_2}$$\n\nHere $\\hat{p}_1 n_1$ represents the number of successes in sample 1 because $\\hat{p}_1 = \\frac{\\text{number of successes in sample 1}}{n_1}.$\n\nSimilarly, $\\hat{p}_2 n_2$ represents the number of successes in sample 2.\n:::\n\n::: {.important data-latex=\"\"}\n**The test statistic for assessing two proportions is a Z.**\n\nThe Z score is a ratio of how the two sample proportions differ as compared to the expected variability of difference between the proportions.\n\n$$Z = \\frac{(\\hat{p}_1 - \\hat{p}_2) - 0}{\\sqrt{\\hat{p}_{pool}(1-\\hat{p}_{pool}) \\bigg(\\frac{1}{n_1} + \\frac{1}{n_2} \\bigg)}}$$\n\nWhen the null hypothesis is true and the conditions are met, Z has a standard normal distribution.\nSee the box below for calculation of the pooled proportion of successes.\n\nConditions:\n\n- Independent observations\n- Large samples: $(n_1 p_1 \\geq 10$ and $n_1 (1-p_1) \\geq 10$ and $n_2 p_2 \\geq 10$ and $n_2 (1-p_2) \\geq 10)$\n- Check conditions using: $(n_1 \\hat{p}_{\\textit{pool}} \\geq 10$ and $n_1 (1-\\hat{p}_{\\textit{pool}}) \\geq 10$ and $n_2 \\hat{p}_{\\textit{pool}}\\geq 10$ and $n_2 (1-\\hat{p}_{\\textit{pool}}) \\geq 10)$\n:::\n\n\n\n\n\nA mammogram is an X-ray procedure used to check for breast cancer.\nWhether mammograms should be used is part of a controversial discussion, and it's the topic of our next example where we learn about 2-proportion hypothesis tests when $H_0$ is $p_1 - p_2 = 0$ (or equivalently, $p_1 = p_2).$\n\nA 30-year study was conducted with nearly 90,000 participants who identified as female.\nDuring a 5-year screening period, each participant was randomized to one of two groups: in the first group, participants received regular mammograms to screen for breast cancer, and in the second group, participants received regular non-mammogram breast cancer exams.\nNo intervention was made during the following 25 years of the study, and we'll consider death resulting from breast cancer over the full 30-year period.\nResults from the study are summarized in Figure \\@ref(tab:mammogramStudySummaryTable).\n\n::: {.data data-latex=\"\"}\nThe [`mammogram`](http://openintrostat.github.io/openintro/reference/mammogram.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nIf mammograms are much more effective than non-mammogram breast cancer exams, then we would expect to see additional deaths from breast cancer in the control 
group.\nOn the other hand, if mammograms are not as effective as regular breast cancer exams, we would expect to see an increase in breast cancer deaths in the mammogram group.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n\n
Summary results for breast cancer study.
Death from breast cancer?

| Treatment | Yes | No     |
|-----------|-----|--------|
| control   | 505 | 44,405 |
| mammogram | 500 | 44,425 |
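The pooled two-proportion z-test worked through in the examples that follow can also be sketched in base R from the counts in this table. This is a rough companion sketch, not the text's own code; the values printed in the text use intermediate rounding, so expect small differences.

```r
# Pooled-proportion z-test for a difference in breast cancer death rates,
# using only the summary counts (mammogram vs control).
deaths <- c(mammogram = 500, control = 505)
n      <- c(mammogram = 500 + 44425, control = 505 + 44405)

p_hat  <- deaths / n
p_pool <- sum(deaths) / sum(n)    # about 0.0112

se <- sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))
z  <- (p_hat[1] - p_hat[2]) / se  # about -0.16 (the text's rounded values give -0.17)
2 * pnorm(-abs(z))                # two-sided p-value, about 0.87

# prop.test(deaths, n, correct = FALSE) gives the equivalent chi-squared test
```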
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIs this study an experiment or an observational study?[^17-inference-two-props-3]\n:::\n\n[^17-inference-two-props-3]: This is an experiment.\n Patients were randomized to receive mammograms or a standard breast cancer exam.\n We will be able to make causal conclusions based on this study.\n\n::: {.guidedpractice data-latex=\"\"}\nSet up hypotheses to test whether there was a difference in breast cancer deaths in the mammogram and control groups.[^17-inference-two-props-4]\n:::\n\n[^17-inference-two-props-4]: $H_0:$ the breast cancer death rate for patients screened using mammograms is the same as the breast cancer death rate for patients in the control, $p_{MGM} - p_{C} = 0.$ $H_A:$ the breast cancer death rate for patients screened using mammograms is different than the breast cancer death rate for patients in the control, $p_{MGM} - p_{C} \\neq 0.$\n\nThe research question describing mammograms is set up to address specific hypotheses (in contrast to a confidence interval for a parameter).\nIn order to fully take advantage of the hypothesis testing structure, we asses the randomness under the condition that the null hypothesis is true (as we always do for hypothesis testing).\nUsing the data from Table \\@ref(tab:mammogramStudySummaryTable), we will check the conditions for using a normal distribution to analyze the results of the study using a hypothesis test.\n\n$$\n\\begin{aligned}\n\\hat{p}_{\\textit{pool}}\n &= \\frac\n {\\text{number of patients who died from breast cancer in the entire study}}\n {\\text{number of patients in the entire study}} \\\\\n &= \\frac{500 + 505}{500 + \\text{44,425} + 505 + \\text{44,405}} \\\\\n &= 0.0112\n\\end{aligned} \n$$\n\nThis proportion is an estimate of the breast cancer death rate across the entire study, and it's our best estimate of the proportions $p_{MGM}$ and $p_{C}$ *if the null hypothesis is true that* $p_{MGM} = p_{C}.$ We will also use this pooled proportion when computing the standard error.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nIs it reasonable to model the difference in proportions using a normal distribution in this study?\n\n------------------------------------------------------------------------\n\nBecause the patients were randomized, observations can be assumed to be independent, both within each group and between treatment groups.\nWe also must check the success-failure condition for each group.\nUnder the null hypothesis, the proportions $p_{MGM}$ and $p_{C}$ are equal, so we check the success-failure condition with our best estimate of these values under $H_0,$ the pooled proportion from the two samples, $\\hat{p}_{\\textit{pool}} = 0.0112:$\n\n$$\n\\begin{aligned}\n\\hat{p}_{\\textit{pool}} \\times n_{MGM}\n &= 0.0112 \\times \\text{44,925} = 503\\\\\n (1 - \\hat{p}_{\\textit{pool}}) \\times n_{MGM}\n &= 0.9888 \\times \\text{44,925} = \\text{44,422} \\\\\n \\hat{p}_{\\textit{pool}} \\times n_{C}\n &= 0.0112 \\times \\text{44,910} = 503\\\\\n (1 - \\hat{p}_{\\textit{pool}}) \\times n_{C}\n &= 0.9888 \\times \\text{44,910} = \\text{44,407}\n\\end{aligned}\n$$\n\nThe success-failure condition is satisfied since all values are at least 10.\nWith both conditions satisfied, we can safely model the difference in proportions using a normal distribution.\n:::\n\nIn the previous example, the pooled proportion was used to check the success-failure condition[^17-inference-two-props-5].\nIn the next example, we see an additional place where the pooled 
proportion comes into play: the standard error calculation.\n\n[^17-inference-two-props-5]: For an example of a two-proportion hypothesis test that does not require the success-failure condition to be met, see Section \\@ref(two-prop-errors).\n\n::: {.workedexample data-latex=\"\"}\nCompute the point estimate of the difference in breast cancer death rates in the two groups, and use the pooled proportion $\\hat{p}_{\\textit{pool}} = 0.0112$ to calculate the standard error.\n\n------------------------------------------------------------------------\n\nThe point estimate of the difference in breast cancer death rates is\n\n$$\n\\begin{aligned}\n\\hat{p}_{MGM} - \\hat{p}_{C}\n &= \\frac{500}{500 + 44,425} - \\frac{505}{505 + 44,405} \\\\\n &= 0.01113 - 0.01125 \\\\\n &= -0.00012\n\\end{aligned} \n$$\n\nThe breast cancer death rate in the mammogram group was 0.012% less than in the control group.\nNext, the standard error is calculated *using the pooled proportion,* $\\hat{p}_{\\textit{pool}}:$\n\n$$SE = \\sqrt{\\frac{\\hat{p}_{\\textit{pool}}(1-\\hat{p}_{\\textit{pool}})}{n_{MGM}} + \\frac{\\hat{p}_{\\textit{pool}}(1-\\hat{p}_{\\textit{pool}})}{n_{C}}}= 0.00070$$\n:::\n\n::: {.workedexample data-latex=\"\"}\nUsing the point estimate $\\hat{p}_{MGM} - \\hat{p}_{C} = -0.00012$ and standard error $SE = 0.00070,$ calculate a p-value for the hypothesis test and write a conclusion.\n\n------------------------------------------------------------------------\n\nJust like in past tests, we first compute a test statistic and draw a picture:\n\n$$Z = \\frac{\\text{point estimate} - \\text{null value}}{SE} = \\frac{-0.00012 - 0}{0.00070} = -0.17$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](17-inference-two-props_files/figure-html/unnamed-chunk-22-1.png){width=60%}\n:::\n:::\n\n\nThe lower tail area is 0.4325, which we double to get the p-value: 0.8650.\nBecause this p-value is larger than 0.05, we do not reject the null hypothesis.\nThat is, the difference in breast cancer death rates is likely to have occurred just by chance, if the null hypothesis is true.\nThus, we do not observe benefits or harm from mammograms relative to a regular breast exam.\n:::\n\nCan we conclude that mammograms have no benefits or harm?\nHere are a few considerations to keep in mind when reviewing the mammogram study as well as any other medical study:\n\n- We do not reject the null hypothesis, which means we do not have sufficient evidence to conclude that mammograms reduce or increase breast cancer deaths.\n- If mammograms are helpful or harmful, the data suggest the effect isn't very large.\n- Are mammograms more or less expensive than a non-mammogram breast exam? If one option is much more expensive than the other and does not offer clear benefits, then we should lean towards the less expensive option.\n- The study's authors also found that mammograms led to over-diagnosis of breast cancer, which means some breast cancers were found (or thought to be found) but that these cancers would not cause symptoms during patients' lifetimes. That is, something else would kill the patient before breast cancer symptoms appeared. This means some patients may have been treated for breast cancer unnecessarily, and this treatment is another cost to consider. 
It is also important to recognize that over-diagnosis can cause unnecessary physical or emotional harm to patients.\n\nThese considerations highlight the complexity around medical care and treatment recommendations.\nExperts and medical boards who study medical treatments use considerations like those above to provide their best recommendation based on the current evidence.\n\n\\clearpage\n\n## Chapter review {#chp17-review}\n\n### Summary\n\nWhen the parameter of interest is the difference in population proportions across two groups, randomization tests, bootstrapping, and mathematical modeling can be applied.\nFor confidence intervals, bootstrapping from each group separately will provide a sampling distribution for the difference in sample proportions; the mathematical model shows a similar distributional shape as long as the sample size is large enough to fulfill the success-failure conditions and so that the data are representative of the entire population.\nKeep in mind that some datasets will produce a confidence interval which does not capture the true parameter, this is the nature of variability!\nOver your lifetime, about 95% of the confidence intervals you create will capture the parameter of interest, and about 5% won't.\nFor hypothesis testing, repeated randomization of the explanatory variable creates a null distribution of differences in sample proportions that could have occurred under the null hypothesis.\nRandomization and the mathematical model will have similar null distributions, as long as the sample size is large enough to fulfill the success-failure conditions.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n
percentile interval pooled proportion SE interval
point estimate SE difference in proportions Z score two proportions
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp17-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-17].\n\n::: {.exercises data-latex=\"\"}\n1. **Disaggregating Asian American tobacco use, hypothesis testing.**\nUnderstanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups including Asian-Indian (n = 4,373), Chinese (n = 4,736), and Filipino (n = 4,912), in comparison to non-Hispanic Whites (n = 275,025). The number of current smokers in each group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and non-Hispanic Whites (n = 50,880). [@Rao:2021]\n\n To determine whether the proportion of Asian-Indian Americans who are current smokers is different from the proportion of Chinese Americans who are smokers, a randomization simulation was performed.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](17-inference-two-props_files/figure-html/unnamed-chunk-24-1.png){width=90%}\n :::\n :::\n\n a. In both words and symbols provide the parameter and statistic of interest for this study. Do you know the numerical value of either the parameter or statisic of interest? If so, provide the numerical value.\n \n b. The histogram above provides the sampling distribution (under randomization) for $\\hat{p}_{Asian-Indian} - \\hat{p}_{Chinese}$ under repeated null randomizations ($\\hat{p}$ is the proportion in the sample who are current smokers). Estimate the standard error of $\\hat{p}_{Asian-Indian} - \\hat{p}_{Chinese}$ based on the randomization histogram.\n \n c. Consider the hypothesis test to determine if there is a difference in proportion of Asian-Indian Americans as compared to Chinese Americans who are current smokers. Write out the null and alternative hypotheses, estimate a p-value using the randomization histogram, and conclude the test in the context of the problem.\n \n \\clearpage\n\n1. **Malaria vaccine effectiveness, hypothesis test.**\nWith no currently licensed vaccines to inhibit malaria, good news was welcomed with a recent study reporting long-awaited vaccine success for children in Burkina Faso. With 450 children randomized to either one of two different doses of the malaria vaccine or a control vaccine, 89 of 292 malaria vaccine and 106 out of 147 control vaccine children contracted malaria within 12 months after the treatment. [@Datoo:2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](17-inference-two-props_files/figure-html/unnamed-chunk-25-1.png){width=90%}\n :::\n :::\n\n a. In both words and symbols provide the parameter and statistic of interest for this study. Do you know the numerical value of either the parameter or statisic of interest? If so, provide the numerical value.\n \n b. The histogram above provides the sampling distribution (under randomization) for $\\hat{p}_{malaria} - \\hat{p}_{control}$ under repeated null randomizations ($\\hat{p}$ is the proportion of children in the sample who contracted malaria). Estimate the standard error of $\\hat{p}_{malaria} - \\hat{p}_{control}$ based on the randomization histogram.\n \n c. Consider the hypothesis test constructed to show a lower proportion of children contracting malaria on the malaria vaccine as compared to the control vaccine. 
Write out the null and alternative hypotheses, estimate a p-value using the randomization histogram, and conclude the test in the context of the problem.\n \n \\clearpage\n\n1. **Disaggregating Asian American tobacco use, confidence interval.**\nBased on a study on the degree to which smoking practices differ across ethnic groups, a confidence interval for the difference in current smoking status for Filipino versus Chinese Americans is desired. [@Rao:2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](17-inference-two-props_files/figure-html/unnamed-chunk-26-1.png){width=90%}\n :::\n :::\n\n a. Consider the bootstrap distribution of difference in sample proportions of current smokers (Filipino Americans minus Chinese Americans) in 1,000 bootstrap repetitions as above. Estimate the standard error of the difference in sample proportions, as seen in the histogram.\n \n b. Using the standard error from the bootstrap distribution, find a 95% bootstrap SE confidence interval for the true difference in proportion of current smokers (Filipino Americans minus Chinese Americans) in the population. Interpret the interval in the context of the problem.\n \n c. Using the entire bootstrap distribution, find a 95% bootstrap percentile confidence interval for the true difference in proportion of current smokers (Filipino Americans minus Chinese Americans) in the population. Interpret the interval in the context of the problem.\n \n \\clearpage\n\n1. **Malaria vaccine effectiveness, confidence interval.**\nWith no currently licensed vaccines to inhibit malaria, good news was welcomed with a recent study reporting long-awaited vaccine success for children in Burkina Faso. With 450 children randomized to either one of two different doses of the malaria vaccine or a control vaccine, 89 of 292 malaria vaccine and 106 out of 147 control vaccine children contracted malaria within 12 months after the treatment. [@Datoo:2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](17-inference-two-props_files/figure-html/unnamed-chunk-27-1.png){width=90%}\n :::\n :::\n\n a. Consider the bootstrap distribution of difference in sample proportions of children who contracted malaria (malaria vaccine minus control vaccine) in 1000 bootstrap repetitions as above. Estimate the standard error of the difference in sample proportions, as seen in the histogram.\n \n b. Using the standard error from the bootstrap distribution, find a 95% bootstrap SE confidence interval for the true difference in proportion of children who contract malaria (malaria vaccine minus control vaccine) in the population. Interpret the interval in the context of the problem.\n \n c. Using the entire bootstrap distribution, find a 95% bootstrap percentile confidence interval for the true difference in proportion of children who contract malaria (malaria vaccine minus control vaccine) in the population. Interpret the interval in the context of the problem.\n \n \\clearpage\n\n1. **COVID-19 and degree completion.**\nA 2021 Gallup poll surveyed 3,941 students pursuing a bachelor's degree and 2,064 students pursuing an associate degree (students were not randomly selected but were weighted so as to represent a random selection of currently enrolled US college students). The poll found that 51% of the bachelor's degree students and 44% of associate degree students said that the COVID-19 pandemic will negatively impact their ability to complete the degree. 
[@data:gallupcollegeimpact]\n\n    Below are two histograms which represent different computational approaches (both use 1,000 repetitions) to research questions which could be asked from the Gallup data which was provided. One of the histograms can be used to do a randomization test on whether the proportions of bachelor's and associate students who think the COVID-19 pandemic will negatively impact their ability to complete the degree are different. The other histogram is a bootstrap distribution used to quantify the difference in the proportions of bachelor's and associate's students who feel this way.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](17-inference-two-props_files/figure-html/unnamed-chunk-28-1.png){width=90%}\n    :::\n    :::\n\n    a. Are the center and standard error of the two graphs approximately the same? Explain.\n    \n    b. Write a research question which could be addressed using this histogram with computational method A.\n    \n    c. Write a research question which could be addressed using this histogram with computational method B.\n    \n    \\clearpage\n\n1. **Renewable energy.**\nA 2021 Gallup poll surveyed 5,447 randomly sampled US adults who are Republican (or Republican leaning) and 7,962 who are Democrats (or Democrat leaning). 31% of Republicans and 81% of Democrats said \"government regulations are necessary to encourage businesses and consumers to rely more on renewable energy sources\". [@data:gallupcollegeimpact]\n\n    Below are two histograms which represent different computational approaches (both use 1,000 repetitions) to research questions which could be asked from the Gallup data which was provided. One of the histograms can be used to do a randomization test on whether the proportions of Republicans and Democrats who think government regulations are necessary to encourage businesses and consumers to rely more on renewable energy sources are different. The other histogram is a bootstrap distribution used to quantify the difference in the proportions of Republicans and Democrats who agree with this statement.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](17-inference-two-props_files/figure-html/unnamed-chunk-29-1.png){width=90%}\n    :::\n    :::\n\n    a. Are the center and standard error of the two graphs approximately the same? Explain.\n    \n    b. Write a research question which could be addressed using this histogram with computational method A.\n    \n    c. Write a research question which could be addressed using this histogram with computational method B.\n\n1. **HIV in sub-Saharan Africa.**\nIn July 2008 the US National Institutes of Health announced that it was stopping a clinical study early because of unexpected results. The study population consisted of HIV-infected women in sub-Saharan Africa who had been given single dose Nevaripine (a treatment for HIV) while giving birth, to prevent transmission of HIV to the infant. The study was a randomized comparison of continued treatment of a woman (after successful childbirth) with Nevaripine vs Lopinavir, a second drug used to treat HIV. 240 women participated in the study; 120 were randomized to each of the two treatments. Twenty-four weeks after starting the study treatment, each woman was tested to determine if the HIV infection was becoming worse (an outcome called *virologic failure*). Twenty-six of the 120 women treated with Nevaripine experienced virologic failure, while 10 of the 120 women treated with the other drug experienced virologic failure. [@Lockman:2007]\n\n    a. 
Create a two-way table presenting the results of this study.\n\n b. State appropriate hypotheses to test for difference in virologic failure rates between treatment groups.\n\n c. Complete the hypothesis test and state an appropriate conclusion. (Reminder: Verify any necessary conditions for the test.)\n \n \\clearpage\n\n1. **Supercommuters.**\nThe fraction of workers who are considered \"supercommuters\", because they commute more than 90 minutes to get to work, varies by state. Suppose the 1% of Nebraska residents and 6% of New York residents are supercommuters. Now suppose that we plan a study to survey 1000 people from each state, and we will compute the sample proportions $\\hat{p}_{NE}$ for Nebraska and $\\hat{p}_{NY}$ for New York.\n\n a. What is the associated mean and standard deviation of $\\hat{p}_{NE}$ in repeated samples of size 1000?\n\n b. What is the associated mean and standard deviation of $\\hat{p}_{NY}$ in repeated samples of size 1000?\n\n c. Calculate and interpret the mean and standard deviation associated with the difference in sample proportions for the two groups, $\\hat{p}_{NY} - \\hat{p}_{NE}$ in repeated samples of 1000 in each group.\n\n d. How are the standard deviations from parts (a), (b), and (c) related?\n\n1. **National Health Plan.**\nA Kaiser Family Foundation poll for US adults in 2019 found that 79% of Democrats, 55% of Independents, and 24% of Republicans supported a generic “National Health Plan”. There were 347 Democrats, 298 Republicans, and 617 Independents surveyed. 79% of 347 Democrats and 55% of 617 Independents support a National Health Plan. [@data:KFF2019nathealthplan]\n\n a. Calculate a 95% confidence interval for the difference between the proportion of Democrats and Independents who support a National Health Plan $(p_{D} - p_{I})$, and interpret it in this context. We have already checked conditions for you.\n\n b. True or false: If we had picked a random Democrat and a random Independent at the time of this poll, it is more likely that the Democrat would support the National Health Plan than the Independent.\n\n1. **Sleep deprivation, CA vs. OR, confidence interval.**\nAccording to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These data are based on simple random samples of 11,545 California and 4,691 Oregon residents. Calculate a 95% confidence interval for the difference between the proportions of Californians and Oregonians who are sleep deprived and interpret it in context of the data. [@data:sleepCAandOR]\n\n1. **Gender pay gap in medicine.**\nA study examined the average pay for men and women entering the workforce as doctors for 21 different positions. [@LoSassoMedicineGenderPayGap]\n\n a. If each gender was equally paid, then we would expect about half of those positions to have men paid more than women and women would be paid more than men in the other half of positions. Write appropriate hypotheses to test this scenario.\n\n b. Men were, on average, paid more in 19 of those 21 positions. Complete a hypothesis test using your hypotheses from part (a).\n\n1. **Sleep deprivation, CA vs. 
OR, hypothesis test.**\nA CDC report on sleep deprivation rates shows that the proportion of California residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These data are based on simple random samples of 11,545 California and 4,691 Oregon residents.\n\n a. Conduct a hypothesis test to determine if these data provide strong evidence that the rate of sleep deprivation is different for the two states. (Reminder: Check conditions)\n\n b. It is possible the conclusion of the test in part (a) is incorrect. If this is the case, what type of error was made?\n \n \\clearpage\n\n1. **Is yawning contagious?**\nAn experiment conducted by the MythBusters, a science entertainment TV program on the Discovery Channel, tested if a person can be subconsciously influenced into yawning if another person near them yawns. 50 people were randomly assigned to two groups: 34 to a group where a person near them yawned (treatment) and 16 to a group where there wasn't a person yawning near them (control). The visualization below displays how many participants yawned in each group.^[The [`yawn`](http://openintrostat.github.io/openintro/reference/yawn.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](17-inference-two-props_files/figure-html/unnamed-chunk-30-1.png){width=90%}\n :::\n :::\n \n Suppose we are interested in estimating the difference in yawning rates between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?\n \n \\vspace{5mm}\n\n1. **Heart transplant success.**\nThe Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The visualization below displays how many patients survived and died in each group.^[The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Turnbull+Brown+Hu:1974]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](17-inference-two-props_files/figure-html/unnamed-chunk-31-1.png){width=90%}\n :::\n :::\n \n Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?\n \n \\clearpage\n\n1. **Government shutdown.**\nThe United States federal government shutdown of 2018--2019 occurred from December 22, 2018 until January 25, 2019, a span of 35 days. 
A Survey USA poll of 614 randomly sampled Americans during this time period reported that 48% of those who make less than \$40,000 per year and 55% of those who make \$40,000 or more per year said the government shutdown has not at all affected them personally. A 95% confidence interval for $(p_\text{$<$40K} - p_\text{$\ge$40K})$, where $p$ is the proportion of those who said the government shutdown has not at all affected them personally, is (-0.16, 0.02). Based on this information, determine if the following statements are true or false, and explain your reasoning if you identify the statement as false. [@data:govtshuthown]\n\n    a. At the 5% significance level, the data provide convincing evidence of a real difference in the proportion who are not affected personally between Americans who make less than \$40,000 annually and Americans who make \$40,000 or more annually.\n\n    b. We are 95% confident that 16% more to 2% fewer Americans who make less than \$40,000 per year are not at all personally affected by the government shutdown compared to those who make \$40,000 or more per year.\n\n    c. A 90% confidence interval for $(p_\text{$<$40K} - p_\text{$\ge$40K})$ would be wider than the $(-0.16, 0.02)$ interval.\n\n    d. A 95% confidence interval for $(p_\text{$\ge$40K} - p_\text{$<$40K})$ is (-0.02, 0.16).\n\n1. **Online harassment.**\nA Pew Research poll asked US adults aged 18-29 and 30-49 whether they have personally experienced harassment online.\nA 95% confidence interval for the difference between the proportions of 18-29 year olds and 30-49 year olds who have personally experienced harassment online $(p_{18-29} - p_{30-49})$ was calculated to be (0.115, 0.185).\nBased on this information, determine if the following statements are true or false, and explain your reasoning for each statement you identify as false. [@onlineharassment2021]\n\n    a. We are 95% confident that the true proportion of 18-29 year olds who have personally experienced harassment online is 11.5% to 18.5% lower than the true proportion of 30-49 year olds who have personally experienced harassment online.\n\n    b. We are 95% confident that the true proportion of 18-29 year olds who have personally experienced harassment online is 11.5% to 18.5% higher than the true proportion of 30-49 year olds who have personally experienced harassment online.\n\n    c. 95% of random samples will produce 95% confidence intervals that include the true difference between the population proportions of 18-29 year olds and 30-49 year olds who have personally experienced harassment online.\n\n    d. We can conclude that the observed difference between the proportions of 18-29 year olds and 30-49 year olds who have personally experienced harassment online is too large to plausibly be due to chance, if in fact there is no difference between the two proportions.\n\n    e. The 90% confidence interval for $(p_{18-29} - p_{30-49})$ cannot be calculated with only the information given in this exercise.\n\n\n\n\n1. **Decision errors and comparing proportions, I.**\nIn the following research studies, conclusions were made based on the data provided. It is always possible that the analysis conclusion could be wrong, although we will almost never actually know if an error has been made or not. For each study conclusion, specify which of a Type 1 or Type 2 error could have been made, and state the error in the context of the problem.\n\n    a. 
The malaria vaccine was seen to be effective at lowering the rate of contracting malaria (when compared to the control vaccine).\n    \n    b. In the US population, Asian-Indian Americans and Chinese Americans are not observed to have different proportions of current smokers.\n    \n    c. There is no evidence to claim a difference in the proportion of Americans who are not affected personally by a government shutdown when comparing Americans who make less than \$40,000 annually and Americans who make \$40,000 or more annually.\n    \n    \\clearpage\n\n1. **Decision errors and comparing proportions, II.**\nIn the following research studies, conclusions were made based on the data provided. It is always possible that the analysis conclusion could be wrong, although we will almost never actually know if an error has been made or not. For each study conclusion, specify which of a Type 1 or Type 2 error could have been made, and state the error in the context of the problem.\n    \n    a. Of registered voters in California, the proportion who report not knowing enough to voice an opinion on whether they support offshore drilling is different across those who have a college degree and those who do not.\n    \n    b. In comparing Californians and Oregonians, there is no evidence to support a difference in the proportion of each who are sleep deprived.\n    \n    \\vspace{5mm}\n\n1. **Active learning.**\nA teacher wanting to increase the active learning component of her course is concerned about student reactions to changes she is planning to make. She conducts a survey in her class, asking students whether they believe more active learning in the classroom (hands-on exercises) instead of traditional lecture will help improve their learning. She does this at the beginning and end of the semester and wants to evaluate whether students' opinions have changed over the semester. Can she use the methods we learned in this chapter for this analysis? Explain your reasoning.\n\n    \\vspace{5mm}\n\n1. **An apple a day keeps the doctor away.**\nA physical education teacher at a high school wanting to increase awareness on issues of nutrition and health asked her students at the beginning of the semester whether they believed the expression \"an apple a day keeps the doctor away\", and 40% of the students responded yes. Throughout the semester she started each class with a brief discussion of a study highlighting positive effects of eating more fruits and vegetables. She conducted the same apple-a-day survey at the end of the semester, and this time 60% of the students responded yes. Can she use a two-proportion method from this section for this analysis? Explain your reasoning.\n\n    \\vspace{5mm}\n\n1. **Malaria vaccine effectiveness, effect size.**\nA randomized controlled trial on malaria vaccine effectiveness randomly assigned 450 children into either one of two different doses of the malaria vaccine or a control vaccine. 89 of 292 malaria vaccine and 106 out of 147 control vaccine children contracted malaria within 12 months after the treatment. 
[@Datoo:2021]\n\n Recall that in order to reject the null hypothesis that the two vaccines (malaria and control) are equivalent, we would need the sample proportion to be about 2 standard errors below the hypothesized value of zero.\n\n Say that the true difference (in the population) is given as $\\delta,$ the sample sizes are the same in both groups $(n_{malaria} = n_{control}),$ and the true proportion who contract malaria on the control vaccine is $p_{control} = 0.7.$ If you ran your own study (in the future), how likely is it that you would get a difference in sample proportions that was sufficiently far from zero that you could reject under each of the conditions below. (*Hint:* Use the mathematical model.)\n\n a. $\\delta = -0.1$ and $n_{malaria} = n_{control} = 20$\n \n b. $\\delta = -0.4$ and $n_{malaria} = n_{control} = 20$\n \n c. $\\delta = -0.1$ and $n_{malaria} = n_{control} = 100$\n \n d. $\\delta = -0.4$ and $n_{malaria} = n_{control} = 100$\n \n e. What can you conclude about values of $\\delta$ and the sample size?\n \n \\clearpage\n\n1. **Diabetes and unemployment.**\nA Gallup poll surveyed Americans about their employment status and whether they have diabetes. The survey results indicate that 1.5% of the 47,774 employed (full or part time) and 2.5% of the 5,855 unemployed 18-29 year olds have diabetes. [@data:employmentDiabetes]\n\n a. Create a two-way table presenting the results of this study.\n\n b. State appropriate hypotheses to test for difference in proportions of diabetes between employed and unemployed Americans.\n\n c. The sample difference is about 1%. If we completed the hypothesis test, we would find that the p-value is very small (about 0), meaning the difference is statistically significant. Use this result to explain the difference between statistically significant and practically significant findings.\n\n\n:::\n", + "supporting": [ + "17-inference-two-props_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/17-inference-two-props/figure-html/bootCPR1000-1.png b/_freeze/17-inference-two-props/figure-html/bootCPR1000-1.png new file mode 100644 index 00000000..3caaa353 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/bootCPR1000-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/bootCPR1000CI-1.png b/_freeze/17-inference-two-props/figure-html/bootCPR1000CI-1.png new file mode 100644 index 00000000..0a6ff4da Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/bootCPR1000CI-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/ci25ints-1.png b/_freeze/17-inference-two-props/figure-html/ci25ints-1.png new file mode 100644 index 00000000..1dbb835b Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/ci25ints-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/cpr-rand-dot-plot-1.png b/_freeze/17-inference-two-props/figure-html/cpr-rand-dot-plot-1.png new file mode 100644 index 00000000..bfa8862a Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/cpr-rand-dot-plot-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-22-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 00000000..5868ad0d Binary files /dev/null and 
b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-22-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-24-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..98244e21 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-24-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-25-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-25-1.png new file mode 100644 index 00000000..53a4b808 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-25-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-26-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 00000000..7de0b073 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-26-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-27-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 00000000..3cfdbf43 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-27-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-28-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 00000000..c4767895 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-29-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 00000000..c4d54d2a Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-30-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 00000000..fa8ca4c8 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/17-inference-two-props/figure-html/unnamed-chunk-31-1.png b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 00000000..c0ea1e18 Binary files /dev/null and b/_freeze/17-inference-two-props/figure-html/unnamed-chunk-31-1.png differ diff --git a/_freeze/18-inference-tables/execute-results/html.json b/_freeze/18-inference-tables/execute-results/html.json new file mode 100644 index 00000000..c5e35631 --- /dev/null +++ b/_freeze/18-inference-tables/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "33738fdc62456a283d63de4f5bc64b16", + "result": { + "markdown": "# Inference for two-way tables {#inference-tables}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn Section \\@ref(inference-two-props) our focus was on the difference in proportions, a statistic calculated from finding the success proportions (from the binary response variable) measured across two groups (the binary explanatory variable).\nAs we will see in the examples below, sometimes the explanatory or response variables have more than two possible options.\nIn that setting, a difference across two groups is not sufficient, and the proportion of \"success\" is not well defined if there are 3 or 4 or more possible response levels.\nThe primary way to summarize categorical data where the explanatory and response variables both have 2 or more levels is through a 
two-way table as in Table \\@ref(tab:ipod-ask-data-summary).\n\nNote that with two-way tables, there is not an obvious single parameter of interest.\nInstead, research questions usually focus on how the proportions of the response variable changes (or not) across the different levels of the explanatory variable.\nBecause there is not a population parameter to estimate, bootstrapping to find the standard error of the estimate is not meaningful.\nAs such, for two-way tables, we will focus on the randomization test and corresponding mathematical approximation (and not bootstrapping).\n:::\n\n## Randomization test of independence\n\nWe all buy used products -- cars, computers, textbooks, and so on -- and we sometimes assume the sellers of those products will be forthright about any underlying problems with what they're selling.\nThis is not something we should take for granted.\nResearchers recruited 219 participants in a study where they would sell a used iPod[^18-inference-tables-1] that was known to have frozen twice in the past.\nThe participants were incentivized to get as much money as they could for the iPod since they would receive a 5% cut of the sale on top of \\$10 for participating.\nThe researchers wanted to understand what types of questions would elicit the seller to disclose the freezing issue.\n\n[^18-inference-tables-1]: For readers not as old as the authors, an iPod is basically an iPhone without any cellular service, assuming it was one of the later generations.\n Earlier generations were more basic.\n\nUnbeknownst to the participants who were the sellers in the study, the buyers were collaborating with the researchers to evaluate the influence of different questions on the likelihood of getting the sellers to disclose the past issues with the iPod.\nThe scripted buyers started with \"Okay, I guess I'm supposed to go first. So you've had the iPod for 2 years ...\" and ended with one of three questions:\n\n- General: What can you tell me about it?\n- Positive Assumption: It does not have any problems, does it?\n- Negative Assumption: What problems does it have?\n\nThe question is the treatment given to the sellers, and the response is whether the question prompted them to disclose the freezing issue with the iPod.\nThe results are shown in Table \\@ref(tab:ipod-ask-data-summary), and the data suggest that asking the, *What problems does it have?*, was the most effective at getting the seller to disclose the past freezing issues.\nHowever, you should also be asking yourself: could we see these results due to chance alone if there really is no difference in the question asked, or is this in fact evidence that some questions are more effective for getting at the truth?\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of the iPod study, where a question was posed to the study participant who acted.
Question Disclose problem Hide problem Total
General 2 71 73
Positive assumption 23 50 73
Negative assumption 36 37 73
Total 61 158 219
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`ask`](http://openintrostat.github.io/openintro/reference/ask.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe hypothesis test for the iPod experiment is really about assessing whether there is convincing evidence that there was a difference in the success rates that each question had on getting the participant to disclose the problem with the iPod.\nIn other words, the goal is to check whether the buyer's question was independent of whether the seller disclosed a problem.\n\n\n\n\n\n### Expected counts in two-way tables\n\nWhile we would not expect the number of disclosures to be exactly the same across the three question classes, the rate of disclosure seems substantially different across the three groups.\nIn order to investigate whether the differences in rates is due to natural variability in people's honesty or due to a treatment effect (i.e., the question causing the differences), we need to compute estimated counts for each cell in a two-way table.\n\n::: {.workedexample data-latex=\"\"}\nFrom the experiment, we can compute the proportion of all sellers who disclosed the freezing problem as $61/219 = 0.2785.$ If there really is no difference among the questions and 27.85% of sellers were going to disclose the freezing problem no matter the question they were asked, how many of the 73 people in the `General` group would we have expected to disclose the freezing problem?\n\n------------------------------------------------------------------------\n\nWe would predict that $0.2785 \\times 73 = 20.33$ sellers would disclose the problem.\nObviously we observed fewer than this, though it is not yet clear if that is due to chance variation or whether that is because the questions vary in how effective they are at getting to the truth.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf the questions were actually equally effective, meaning about 27.85% of respondents would disclose the freezing issue regardless of what question they were asked, about how many sellers would we expect to *hide* the freezing problem from the Positive Assumption group?[^18-inference-tables-2]\n:::\n\n[^18-inference-tables-2]: We would expect $(1 - 0.2785) \\times 73 = 52.67.$ It is okay that this result, like the result from Example \\ref{iPodExComputeExpAA}, is a fraction.\n\nWe can compute the expected number of sellers who we would expect to disclose or hide the freezing issue for all groups, if the questions had no impact on what they disclosed, using the same strategies employed in the previous Example and Guided Practice to compute expected counts.\nThese expected counts were used to construct Table \\@ref(tab:ipod-ask-data-summary-expected), which is the same as Table \\@ref(tab:ipod-ask-data-summary), except now the expected counts have been added in parentheses.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n\n\n\n\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
The observed counts and the expected counts for the iPod experiment.
Disclose problem
Hide problem
Total
General 2 (20.33) 71 (52.67) 73
Positive assumption 23 (20.33) 50 (52.67) 73
Negative assumption 36 (20.33) 37 (52.67) 73
Total 61 158 219
\n\n`````\n:::\n:::\n\n\nThe examples and exercises above provided some help in computing expected counts.\nIn general, expected counts for a two-way table may be computed using the row totals, column totals, and the table total.\nFor instance, if there was no difference between the groups, then about 27.85% of each row should be in the first column:\n\n$$\n\\begin{aligned}\n0.2785\\times (\\text{row 1 total}) &= 20.33 \\\\\n0.2785\\times (\\text{row 2 total}) &= 20.33 \\\\\n0.2785\\times (\\text{row 3 total}) &= 20.33\n\\end{aligned} \n$$\n\nLooking back to how 0.2785 was computed -- as the fraction of sellers who disclosed the freezing issue $(61/219)$ -- these three expected counts could have been computed as\n\n$$\n\\begin{aligned}\n\\left(\\frac{\\text{row 1 total}}{\\text{table total}}\\right)\n \\text{(column 1 total)} &= 20.33 \\\\\n\\left(\\frac{\\text{row 2 total}}{\\text{table total}}\\right)\n \\text{(column 1 total)} &= 20.33 \\\\\n\\left(\\frac{\\text{row 3 total}}{\\text{table total}}\\right)\n \\text{(column 1 total)} &= 20.33\n\\end{aligned} \n$$\n\nThis leads us to a general formula for computing expected counts in a two-way table when we would like to test whether there is strong evidence of an association between the column variable and row variable.\n\n::: {.important data-latex=\"\"}\n**Computing expected counts in a two-way table.**\n\nTo calculate the expected count for the $i^{th}$ row and $j^{th}$ column, compute\n\n$$\\text{Expected Count}_{\\text{row }i,\\text{ col }j} = \\frac{(\\text{row $i$ total}) \\times (\\text{column $j$ total})}{\\text{table total}}$$\n:::\n\n\n\n\n\n### The observed chi-squared statistic\n\nThe chi-squared test statistic for a two-way table is found by computing the ratio of how far the observed counts are from the expected counts, as compared to the expected counts, for every cell in the table.\nFor each table count, compute:\n\n$$\n\\begin{aligned}\n&\\text{General formula} &&\n \\frac{(\\text{observed count } - \\text{expected count})^2}\n {\\text{expected count}} \\\\\n&\\text{Row 1, Col 1} &&\n \\frac{(2 - 20.33)^2}{20.33} = 16.53 \\\\\n&\\text{Row 2, Col 1} &&\n \\frac{(23 - 20.33)^2}{20.33} = 0.35 \\\\\n& \\hspace{9mm}\\vdots &&\n \\hspace{13mm}\\vdots \\\\\n&\\text{Row 3, Col 2} &&\n \\frac{(37 - 52.67)^2}{52.67} = 4.66\n\\end{aligned}\n$$\n\nAdding the computed value for each cell gives the chi-squared test statistic $X^2:$\n\n$$X^2 = 16.53 + 0.35 + \\dots + 4.66 = 40.13$$\n\nIs 40.13 a big number?\nThat is, does it indicate that the observed and expected values are really different?\nOr is 40.13 a value of the statistic that we would expect to see just due to natural variability?\nPreviously, we applied the randomization test to the setting where the research question investigated a difference in proportions.\nThe same idea of shuffling the data under the null hypothesis can be used in the setting of the two-way table.\n
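\nTo see the formulas above in action, here is a short R sketch (our own illustration, not code from the **openintro** package) that stores the observed counts from Table \@ref(tab:ipod-ask-data-summary) in a matrix and computes both the expected counts and the chi-squared statistic.\n\n```r\n# Observed counts from the iPod experiment\n# rows: General, Positive assumption, Negative assumption\n# columns: Disclose problem, Hide problem\nipod_counts <- matrix(c( 2, 71,\n                        23, 50,\n                        36, 37), nrow = 3, byrow = TRUE)\n\n# Expected counts: (row i total) x (column j total) / (table total)\nexpected <- outer(rowSums(ipod_counts), colSums(ipod_counts)) / sum(ipod_counts)\nexpected   # about 20.33 in the first column and 52.67 in the second, for every row\n\n# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected\nx2_obs <- sum((ipod_counts - expected)^2 / expected)\nx2_obs     # about 40.13, matching the hand calculation above\n```\n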
\n### Variability of the statistic\n\nAssuming that the individuals would disclose or hide the problems **regardless** of the question they are given (i.e., that the null hypothesis is true), we can randomize the data by reassigning the 61 disclosed problems and 158 hidden problems to the three groups at random.\nTable \@ref(tab:ipod-ask-data-summary-rand) shows a possible randomization of the observed data under the condition that the null hypothesis is true (in contrast to the original observed data in Table \@ref(tab:ipod-ask-data-summary)).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n  \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of the iPod study, where a question was posed to the study participant who acted.
Question Disclose problem Hide problem Total
General 29 44 73
Positive assumption 15 58 73
Negative assumption 17 56 73
Total 61 158 219
\n\n`````\n:::\n:::\n\n\nAs before, the randomized data is used to find a single value for the test statistic (here a chi-squared statistic).\nThe chi-squared statistic for the randomized two-way table is found by comparing the observed and expected counts for each cell in the *randomized* table.\nFor each cell, compute:\n\n$$\n\\begin{aligned}\n&\\text{General formula} &&\n \\frac{(\\text{observed count } - \\text{expected count})^2}\n {\\text{expected count}} \\\\\n&\\text{Row 1, Col 1} &&\n \\frac{(29 - 20.33)^2}{20.33} = 3.7 \\\\\n&\\text{Row 2, Col 1} &&\n \\frac{(15 - 20.33)^2}{20.33} = 1.4 \\\\\n& \\hspace{9mm}\\vdots &&\n \\hspace{13mm}\\vdots \\\\\n&\\text{Row 3, Col 2} &&\n \\frac{(56 - 52.67)^2}{52.67} = 0.211\n\\end{aligned} \n$$\n\nAdding the computed value for each cell gives the chi-squared test statistic $X^2:$\n\n$$X^2 = 3.7 + 1.4 + \\dots + 0.211 = 8$$\n\n\n\n\n\n### Observed statistic vs. null chi-squared statistics\n\nAs before, one randomization will not be sufficient for understanding if the observed data are particularly different from the expected chi-squared statistics when $H_0$ is true.\nTo investigate whether 40.13 is large enough to indicate the observed and expected counts are substantially different, we need to understand the variability in the values of the chi-squared statistic we would expect to see if the null hypothesis was true.\nFigure \\@ref(fig:ipodRandDotPlot) plots 1,000 chi-squared statistics generated under the null hypothesis.\nWe can see that the observed value is so far from the null statistics that the simulated p-value is zero.\nThat is, the probability of seeing the observed statistic when the null hypothesis is true is virtually zero.\nIn this case we can conclude that the decision of whether to disclose the iPod's problem is changed by the question asked.\nWe use the causal language of \"changed\" because the study was an experiment.\nNote that with a chi-squared test, we only know that the two variables (`question_class` and `response`) are related (i.e., not independent).\nWe are not able to claim which type of question causes which type of response.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A histogram of chi-squared statisics from 1,000 simulations produced under the null hypothesis, $H_0,$ where the question is independent of the response. The observed statistic of 40.13 is marked by the red line. None of the 1,000 simulations had a chi-squared value of at least 40.13. 
In fact, none of the simulated chi-squared statistics came anywhere close to the observed statistic!](18-inference-tables_files/figure-html/ipodRandDotPlot-1.png){width=90%}\n:::\n:::\n\n\n## Mathematical model for test of independence {#mathchisq}\n\n### The chi-squared test of independence\n\nPreviously, in Section \\@ref(math-2prop), we applied the Central Limit Theorem to the sampling variability of $\\hat{p}_1 - \\hat{p}_2.$ The result was that we could use the normal distribution (e.g., $z^*$ values (see Figure \\@ref(fig:choosingZForCI) ) and p-values from $Z$ scores) to complete the mathematical inferential procedure.\nThe chi-squared test statistic has a different mathematical distribution called the Chi-squared distribution.\nThe important specification to make in describing the chi-squared distribution is something called degrees of freedom.\nThe degrees of freedom change the shape of the chi-squared distribution to fit the problem at hand.\nFigure \\@ref(fig:chisqDistDF) visualizes different chi-squared distributions corresponding to different degrees of freedom.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The chi-squared distribution for differing degrees of freedom. The larger the degrees of freedom, the longer the right tail extends. The smaller the degrees of freedom, the more peaked the mode on the left becomes.](18-inference-tables_files/figure-html/chisqDistDF-1.png){width=90%}\n:::\n:::\n\n\n### Variability of the chi-squared statistic\n\nAs it turns out, the chi-squared test statistic follows a Chi-squared distribution when the null hypothesis is true.\nFor two way tables, the degrees of freedom is equal to: $df = \\text{(number of rows minus 1)}\\times \\text{(number of columns minus 1)}$.\nIn our example, the degrees of freedom parameter is $df = (2-1)\\times (3-1) = 2$.\n\n\n\n\n\n### Observed statistic vs. 
null chi-squared statistics\n\n::: {.important data-latex=\"\"}\n**The test statistic for assessing the independence between two categorical variables is a** $X^2.$\n\nThe $X^2$ statistic is a ratio of how the observed counts vary from the expected counts as compared to the expected counts (which are a measure of how large the sample size is).\n\n$$X^2 = \\sum_{i,j} \\frac{(\\text{observed count} - \\text{expected count})^2}{\\text{expected count}}$$\n\nWhen the null hypothesis is true and the conditions are met, $X^2$ has a Chi-squared distribution with $df = (r-1) \\times (c-1).$\n\nConditions:\n\n- Independent observations\n- Large samples: 5 expected counts in each cell\n:::\n\nTo bring it back to the example, we can safely assume that the observations are independent, as the question groups were randomly assigned.\nAdditionally, there are over 5 expected counts in each cell, so the conditions for using the Chi-square distribution are met.\nIf the null hypothesis is true (i.e., the questions had no impact on the sellers in the experiment), then the test statistic $X^2 = 40.13$ is expected to follow a Chi-squared distribution with 2 degrees of freedom.\nUsing this information, we can compute the p-value for the test, which is depicted in Figure \\@ref(fig:iPodChiSqTail).\n\n::: {.important data-latex=\"\"}\n**Computing degrees of freedom for a two-way table.**\n\nWhen applying the chi-squared test to a two-way table, we use $df = (R-1)\\times (C-1)$ where $R$ is the number of rows in the table and $C$ is the number of columns.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Visualization of the p-value for $X^2 = 40.13$ when $df = 2.$](18-inference-tables_files/figure-html/iPodChiSqTail-1.png){width=90%}\n:::\n:::\n\n\nThe software R can be used to find the p-value with the function `pchisq()`.\nJust like `pnorm()`, `pchisq()` always gives the area to the left of the cutoff value.\nBecause, in this example, the p-value is represented by the area to the right of 40.13, we subtract the output of `pchisq()` from 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n1 - pchisq(40.13, df = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.93e-09\n```\n\n\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nFind the p-value and draw a conclusion about whether the question affects the sellers likelihood of reporting the freezing problem.\n\n------------------------------------------------------------------------\n\nUsing a computer, we can compute a very precise value for the tail area above $X^2 = 40.13$ for a chi-squared distribution with 2 degrees of freedom: 0.000000002.\n\nUsing a significance level of $\\alpha=0.05,$ the null hypothesis is rejected since the p-value is smaller.\nThat is, the data provide convincing evidence that the question asked did affect a seller's likelihood to tell the truth about problems with the iPod.\n:::\n\n::: {.workedexample data-latex=\"\"}\nTable \\@ref(tab:diabetes2ExpMetRosiLifestyleSummary) summarizes the results of an experiment evaluating three treatments for Type 2 Diabetes in patients aged 10-17 who were being treated with metformin.\nThe three treatments considered were continued treatment with metformin (`met`), treatment with metformin combined with rosiglitazone (`rosi`), or a `lifestyle` intervention program.\nEach patient had a primary outcome, which was either lacked glycemic control (failure) or did not lack that control (success).\nWhat are appropriate hypotheses for this 
test?\n\n------------------------------------------------------------------------\n\n- $H_0:$ There is no difference in the effectiveness of the three treatments.\n- $H_A:$ There is some difference in effectiveness between the three treatments, e.g., perhaps the `rosi` treatment performed better than `lifestyle`.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Results for the Type 2 Diabetes study.
Treatment Failure Success Total
lifestyle 109 125 234
met 120 112 232
rosi 90 143 233
Total 319 380 699
\n\n`````\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`diabetes2`](http://openintrostat.github.io/openintro/reference/diabetes2.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nTypically we will use a computer to do the computational work of finding the chi-squared statistic.\nHowever, it is always good to have a sense for what the computer is doing, and in particular, calculating the values which would be expected if the null hypothesis is true can help to understand the null hypothesis claim.\nAdditionally, comparing the expected and observed values by eye often gives the researcher some insight into why the null hypothesis for a given test is or is not rejected.\n\n::: {.guidedpractice data-latex=\"\"}\nA chi-squared test for a two-way table may be used to test the hypotheses in the diabetes Example above.\nTo get a sense for the statistic used in the chi-squared test, first compute the expected values for each of the six table cells.[^18-inference-tables-3]\n:::\n\n[^18-inference-tables-3]: The expected count for row one / column one is found by multiplying the row one total (234) and column one total (319), then dividing by the table total (699): $\\frac{234\\times 319}{699} = 106.8.$ Similarly for the second column and the first row: $\\frac{234\\times 380}{699} = 127.2.$ Row 2: 105.9 and 126.1.\n    Row 3: 106.3 and 126.7.\n\nNote, when analyzing 2-by-2 contingency tables (that is, when both variables only have two possible options), one guideline is to use the two-proportion methods introduced in Chapter \@ref(inference-two-props).\n
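\nIn practice, the entire computation can be handed to R. The sketch below enters the counts from Table \@ref(tab:diabetes2ExpMetRosiLifestyleSummary) by hand rather than loading the `diabetes2` data, so the object name `diabetes_counts` is ours; `chisq.test()` then reports the statistic, the degrees of freedom, and the p-value in one step.\n\n```r\n# Counts from the Type 2 Diabetes study\n# rows: lifestyle, met, rosi; columns: Failure, Success\ndiabetes_counts <- matrix(c(109, 125,\n                            120, 112,\n                             90, 143), nrow = 3, byrow = TRUE)\n\nchisq.test(diabetes_counts)\n# Should report a chi-squared statistic of roughly 8.2 on 2 degrees of freedom\n# with a p-value near 0.02; the expected counts computed internally can be\n# inspected with chisq.test(diabetes_counts)$expected.\n```\n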
\n\\clearpage\n\n## Chapter review {#chp18-review}\n\n### Summary\n\nIn this chapter we extended the randomization / bootstrap / mathematical model paradigm to research questions involving categorical variables.\nWe continued working with one population proportion as well as the difference in population proportions, but the test of independence allowed for hypothesis testing on categorical variables with more than two levels.\nWe note that the normal model was an excellent mathematical approximation to the sampling distribution of sample proportions (or differences in sample proportions), but that the questions with categorical variables with more than 2 levels required a new mathematical model, the chi-squared distribution.\nAs seen in Chapters \@ref(foundations-randomization), \@ref(foundations-bootstrapping) and \@ref(foundations-mathematical), almost all the research questions can be approached using computational methods (e.g., randomization tests or bootstrapping) or using mathematical models.\nWe continue to emphasize the importance of experimental design in making conclusions about research claims.\nIn particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure \@ref(fig:randsampValloc)).\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n  \n \n \n \n \n \n \n \n \n \n\n
Chi-squared distribution expected counts
chi-squared statistic independence
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp18-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-18].\n\n::: {.exercises data-latex=\"\"}\n1. **Quitters.**\nDoes being part of a support group affect the ability of people to quit smoking? A county health department enrolled 300 smokers in a randomized experiment. 150 participants were randomly assigned to a group that used a nicotine patch and met weekly with a support group; the other 150 received the patch and did not meet with a support group. At the end of the study, 40 of the participants in the patch plus support group had quit smoking while only 30 smokers had quit in the other group.\n\n a. Create a two-way table presenting the results of this study.\n\n b. Answer each of the following questions under the null hypothesis that being part of a support group does not affect the ability of people to quit smoking, and indicate whether the expected values are higher or lower than the observed values.\n \n \\vspace{5mm}\n\n1. **Act on climate change.**\nThe table below summarizes results from a Pew Research poll which asked respondents whether they have personally taken action to help address climate change within the last year and their generation.\nThe differences in each generational group may be due to chance.\nComplete the following computations under the null hypothesis of independence between an individual's generation and whether they have personally taken action to help address climate change within the last year. [@pewclimatechange2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Response
Generation Took action Didn't take action Total
Gen Z 292 620 912
Millennial 885 2,275 3,160
Gen X 809 2,709 3,518
Boomer & older 1,276 4,798 6,074
Total 3,262 10,402 13,664
\n \n `````\n :::\n :::\n\n a. If there is no relationship between age and action, how many Gen Z'ers would you expect to have personally taken action to help address climate change within the last year?\n\n b. If there is no relationship between age and action, how many Millenials would you expect to have personally taken action to help address climate change within the last year?\n\n c. If there is no relationship between age and action, how many Gen X'ers would you expect to have personally taken action to help address climate change within the last year?\n \n d. If there is no relationship between age and action, how many Boomers and older would you expect to have personally taken action to help address climate change within the last year?\n \n \\clearpage\n\n1. **Lizard habitats, data.**\nIn order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizard (*Sceloporus occidentalis*) were observed across three different microhabitats.^[The [`lizard_habitat`](http://openintrostat.github.io/openintro/reference/lizard_habitat.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Adolph:1990;@Asbury:2007]\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
sunlight
site sun partial shade Total
desert 16 32 71 119
mountain 56 36 15 107
valley 42 40 24 106
Total 114 108 110 332
\n \n `````\n :::\n :::\n \n a. If the variables describing the habitat and the amount of sunlight are independent, what proporiton of lizards (total) would be expected in each of the three sunlight categories?\n \n b. Given the proportions of each sunlight condition, how many lizards of each type would you expect to see in the sun? in the partial sun? in the shade?\n \n c. Compare the observed (original data) and expected (part b.) tables. From a first glance, does it seem as though the habitat and choice of sunlight may be associated?\n \n d. Regardless of your answer to part (c), is it possible to tell from looking only at the expected and observed counts whether the two variables are associated?\n \n \\vspace{5mm}\n\n1. **Disaggregating Asian American tobacco use, data.**\nUnderstanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups including Asian-Indian (n = 4,373), Chinese (n = 4,736), and Filipino (n = 4,912), in comparison to non-Hispanic Whites (n = 275,025). The number of current smokers in each group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and non-Hispanic Whites (n = 50,880). [@Rao:2021]\n\n In order to assess whether there is a difference in current smoking rates across three Asian American ethnic groups, the observed data is compared to the data that would be expected if there were no association between the variables.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Smoking
ethnicity do not smoke smoke Total
Asian-Indian 4,150 223 4,373
Chinese 4,457 279 4,736
Filipino 4,303 609 4,912
Total 12,910 1,111 14,021
\n \n `````\n :::\n :::\n \n a. If the variables on ethnicity and smoking status are independent, estimate the proporiton of individuals (total) who smoke?\n \n b. Given the overall proportion who smoke, how many of each Asian American ethnicity would you expect to smoke?\n \n c. Compare the observed (original data) and expected (part b.) tables. From a first glance, does it seem as though the Asian American ethnicity and choice of smoking may be associated?\n \n d. Regardless of your answer to part (c), is it possible to tell from looking only at the expected and observed counts whether the two variables are associated?\n \n \\clearpage\n\n1. **Lizard habitats, randomize once.**\nIn order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizard (*Sceloporus occidentalis*) were observed across three different microhabitats. The original data is shown below. [@Adolph:1990;@Asbury:2007]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Original data
sunlight
site sun partial shade Total
desert 16 32 71 119
mountain 56 36 15 107
valley 42 40 24 106
Total 114 108 110 332
\n \n `````\n :::\n :::\n\n Then, the data were randomized once, where sunlight preference was randomly assigned to the lizards across different sites. The results of the randomization is shown below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Randomized data
sunlight
site sun partial shade Total
desert 44 42 33 119
mountain 39 31 37 107
valley 31 35 40 106
Total 114 108 110 332
\n \n `````\n :::\n :::\n\n Recall that the Chi-squared statistic $(X^2)$ measures the difference between the expected and observed counts. Without calculating the actual statistic, report on whether the original data or the randomized data will have a larger Chi-squared statistic. Explain your choice.\n \n \\clearpage\n\n1. **Disaggregating Asian American tobacco use, randomize once.**\nIn a study that aims to disaggregate tobacco use across Asian American ethnic groups (Asian-Indian, Chinese, and Filipino, in comparison to non-Hispanic Whites), respondents were asked whether they smoke tobacco or not. The original data is shown below. [@Rao:2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Original data
Smoking
ethnicity do not smoke smoke Total
Asian-Indian 4,150 223 4,373
Chinese 4,457 279 4,736
Filipino 4,303 609 4,912
Total 12,910 1,111 14,021
\n \n `````\n :::\n :::\n\n Then, the data were randomized once, where smoking status was randomly assigned to the participants across different ethnicities. The results of the randomization is shown below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Randomized data
Smoking
ethnicity do not smoke smoke Total
Asian-Indian 4,015 358 4,373
Chinese 4,385 351 4,736
Filipino 4,510 402 4,912
Total 12,910 1,111 14,021
\n \n `````\n :::\n :::\n\n Recall that the Chi-squared statistic $(X^2)$ measures the difference between the expected and observed counts. Without calculating the actual statistic, report on whether the original data or the randomized data will have a larger Chi-squared statistic. Explain your choice.\n \n \\clearpage\n\n1. **Lizard habitats, randomization test.**\nIn order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizard (*Sceloporus occidentalis*) were observed across three different microhabitats. [@Adolph:1990;@Asbury:2007]\n\n The original data were randomized 1,000 times (sunlight variable randomly assigned to the observations across different habitats), and the histogram of the Chi-squared statistic on each randomization is displayed.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](18-inference-tables_files/figure-html/unnamed-chunk-23-1.png){width=90%}\n :::\n :::\n\n a. The histogram above describes the Chi-squared statistics for 1,000 different randomization datasets. When randomizing the data, is the imposed structure that the variables are independent or that the variables are associated? Explain.\n \n b. What is the range of plausible values for the randomized Chi-squared statistic?\n \n c. The observed Chi-squared statistic is 68.8 (and seen in red on the graph). Does the observed value provide evidence against the null hypothesis? To answer the question, state the null and alternative hypotheses, approximate the p-value, and conclude the test in the context of the problem.\n \n \\clearpage\n\n1. **Disaggregating Asian American tobacco use, randomization test.**\nUnderstanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups including Asian-Indian (n = 4373), Chinese (n = 4736), and Filipino (n = 4912), in comparison to non-Hispanic Whites (n = 275,025). The number of current smokers in each group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and non-Hispanic Whites (n = 50,880). [@Rao:2021]\n\n The original data were randomized 1000 times (smoking status randomly assigned to the observations across ethnicities), and the histogram of the Chi-squared statistic on each randomization is displayed.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](18-inference-tables_files/figure-html/unnamed-chunk-24-1.png){width=90%}\n :::\n :::\n\n a. The histogram above describes the Chi-squared statistics for 1000 different randomization datasets. When randomizing the data, is the imposed structure that the variables are independent or that the variables are associated? Explain.\n \n b. What is the (approximate) range of plausible values for the randomized Chi-squared statistic?\n \n c. The observed Chi-squared statistic is 209.42 (and seen in red on the graph). Does the observed value provide evidence against the null hypothesis? To answer the question, state the null and alternative hypotheses, approximate the p-value, and conclude the test in the context of the problem.\n \n \\clearpage\n\n1. **Lizard habitats, larger data.**\nIn order to assess whether habitat conditions are related to the sunlight choices a lizard makes for resting, Western fence lizard (*Sceloporus occidentalis*) were observed across three different microhabitats. 
[@Adolph:1990;@Asbury:2007]\n\n Consider the situation where the data set is 5 times *larger* than the original data (but have the same proportional representation in each category). The distribution of lizards in each of the sites resting in the sun, partial sun, and shade are as follows.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Larger data
sunlight
site sun partial shade Total
desert 80 160 355 595
mountain 280 180 75 535
valley 210 200 120 530
Total 570 540 550 1,660
\n \n `````\n :::\n :::\n\n The larger dataset was randomized 1,000 times (sunlight preference randomly assigned to the observations across sites), and the histogram of the Chi-squared statistic on each randomization is displayed.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](18-inference-tables_files/figure-html/unnamed-chunk-26-1.png){width=90%}\n :::\n :::\n \n a. The histogram above describes the Chi-squared statistics for 1,000 different randomization of the larger dataset. When randomizing the data, is the imposed structure that the variables are independent or that the variables are associated? Explain.\n \n b. What is the (approximate) range of plausible values for the randomized Chi-squared statistic?\n \n c. The observed Chi-squared statistic is 343.865 (and seen in red on the graph). Does the observed value provide evidence against the null hypothesis? To answer the question, state the null and alternative hypotheses, approximate the p-value, and conclude the test in the context of the problem.\n \n d. If the alternative hypothesis is true, how does the sample size effect the ability to reject the null hypothesis? (*Hint:* Consider the original data as compared with the larger dataset that have the same proportional values.)\n \n \\clearpage\n\n1. **Disaggregating Asian American tobacco use, smaller data.**\nUnderstanding cultural differences in tobacco use across different demographic groups can lead to improved health care education and treatment. A recent study disaggregated tobacco use across Asian American ethnic groups [@Rao:2021].\n\n Consider the situation where the data set is 50 times *smaller* than the original data (but have the same proportional representation in each category). The distribution of smokers in each of the ethnicity groups in the smaller data are as follows.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Smaller data
Smoking
ethnicity do not smoke smoke Total
Asian-Indian 83 4 87
Chinese 89 6 95
Filipino 86 12 98
Total 258 22 280
\n \n `````\n :::\n :::\n\n The smaller dataset was randomized 1,000 times (smoking status randomly assigned to the observations across ethnicities), and the histogram of the Chi-squared statistic on each randomization is displayed.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](18-inference-tables_files/figure-html/unnamed-chunk-28-1.png){width=90%}\n :::\n :::\n \n a. The histogram above describes the Chi-squared statistics for 1,000 different randomization of the smaller dataset. When randomizing the data, is the imposed structure that the variables are independent or that the variables are associated? Explain.\n \n b. What is the (approximate) range of plausible values for the randomized Chi-squared statistic?\n \n c. The observed Chi-squared statistic is 4.19 (and seen in red on the graph). Does the observed value provide evidence against the null hypothesis? To answer the question, state the null and alternative hypotheses, approximate the p-value, and conclude the test in the context of the problem.\n \n d. If the alternative hypothesis is true, how does the sample size effect the ability to reject the null hypothesis? (*Hint:* Consider the original data as compared with the smaller dataset that have the same proportional values.)\n \n \\clearpage\n\n1. **True or false, I.**\nDetermine if the statements below are true or false. For each false statement, suggest an alternative wording to make it a true statement.\n\n a. The Chi-square distribution, just like the normal distribution, has two parameters, mean and standard deviation.\n\n b. The Chi-square distribution is always right skewed, regardless of the value of the degrees of freedom parameter.\n\n c. The Chi-square statistic is always greater than or equal to 0.\n\n d. As the degrees of freedom increases, the shape of the Chi-square distribution becomes more skewed.\n\n1. **True or false, II.**\nDetermine if the statements below are true or false. For each false statement, suggest an alternative wording to make it a true statement.\n\n a. As the degrees of freedom increases, the mean of the Chi-square distribution increases.\n\n b. If you found $\\chi^2 = 10$ with $df = 5$ you would fail to reject $H_0$ at the 5% significance level.\n\n c. When finding the p-value of a Chi-square test, we always shade the tail areas in both tails.\n\n d. As the degrees of freedom increases, the variability of the Chi-square distribution decreases.\n\n1. **Sleep deprived transportation workers.**\nThe National Sleep Foundation conducted a survey on the sleep habits of randomly sampled transportation workers and randomly sampled non-transportation workers that serve as a \"control\" for comparison. The results of the survey are shown below. [@data:sleepTransport]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Profession Less than 6 hours 6 to 8 hours More than 8 hours Total
Non-transportation workers 35 193 64 292
Transportation workers 104 499 192 795
Total 139 692 256 1,087
\n \n `````\n :::\n :::\n\n Conduct a hypothesis test to evaluate if these data provide evidence of an association between sleep levels and profession.\n\n1. **Parasitic worm.**\nLymphatic filariasis is a disease caused by a parasitic worm. Complications of the disease can lead to extreme swelling and other complications. Here we consider results from a randomized experiment that compared three different drug treatment options to clear people of the this parasite, which people are working to eliminate entirely. The results for the second year of the study are given below: [@KingSuamani2018]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Outcome
group Clear at Year 2 Not Clear at Year 2 Total
Three drugs 52 2 54
Two drugs 31 24 55
Two drugs annually 42 14 56
Total 125 40 165
\n \n `````\n :::\n :::\n\n a. Set up hypotheses for evaluating whether there is any difference in the performance of the treatments, and check conditions.\n\n b. Statistical software was used to run a Chi-square test, which output:\n \n \\vspace{-4mm}\n \n $$X^2 = 23.7 \\quad df = 2 \\quad \\text{p-value} < 0.0001$$\n \n Use these results to evaluate the hypotheses from part (a), and provide a conclusion in the context of the problem.\n\n1. **Shipping holiday gifts.**\nA local news survey asked 500 randomly sampled Los Angeles residents which shipping carrier they prefer to use for shipping holiday gifts. \nThe table below shows the distribution of responses by age group as well as the expected counts for each cell (shown in italics).\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Age
Shipping method
18-34
35-54
55+
Total
USPS 72 81 97 102 76 62 245
UPS 52 53 76 68 34 41 162
FedEx 31 21 24 27 9 16 64
Something else 7 5 6 7 3 4 16
Not sure 3 5 6 5 4 3 13
Total 165 209 126 500
\n \n `````\n :::\n :::\n\n a. State the null and alternative hypotheses for testing for independence of age and preferred shipping method for holiday gifts among Los Angeles residents.\n\n b. Are the conditions for inference using a Chi-square test satisfied?\n\n1. **Coffee and depression.**\nResearchers conducted a study investigating the relationship between caffeinated coffee consumption and risk of depression in women. They collected data on 50,739 women free of depression symptoms at the start of the study in the year 1996, and these women were followed through 2006. The researchers used questionnaires to collect data on caffeinated coffee consumption, asked each individual about physician-diagnosed depression, and asked about the use of antidepressants. The table below shows the distribution of incidences of depression by amount of caffeinated coffee consumption. [@Lucas:2011]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Caffeinated coffee consumption
Clinical depression 1 cup / week or fewer 2-6 cups / week 1 cup / day 2-3 cups / day 4 cups / day or more Total
Yes 670 ___ 905 564 95 2,607
No 11,545 6,244 16,329 11,726 2,288 48,132
Total 12,215 6,617 17,234 12,290 2,383 50,739
\n \n `````\n :::\n :::\n\n a. What type of test is appropriate for evaluating if there is an association between coffee intake and depression?\n\n b. Write the hypotheses for the test you identified in part (a).\n\n c. Calculate the overall proportion of women who do and do not suffer from depression.\n\n d. Identify the expected count for the empty cell, and calculate the contribution of this cell to the test statistic.\n\n e. The test statistic is $\\chi^2=20.93$. What is the p-value?\n\n f. What is the conclusion of the hypothesis test?\n\n g. One of the authors of this study was quoted on the NYTimes as saying it was \"too early to recommend that women load up on extra coffee\" based on just this study. [@news:coffeeDepression] Do you agree with this statement? Explain your reasoning.\n\n\n:::\n", + "supporting": [ + "18-inference-tables_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/18-inference-tables/figure-html/chisqDistDF-1.png b/_freeze/18-inference-tables/figure-html/chisqDistDF-1.png new file mode 100644 index 00000000..9ed86bcd Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/chisqDistDF-1.png differ diff --git a/_freeze/18-inference-tables/figure-html/iPodChiSqTail-1.png b/_freeze/18-inference-tables/figure-html/iPodChiSqTail-1.png new file mode 100644 index 00000000..50c0c550 Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/iPodChiSqTail-1.png differ diff --git a/_freeze/18-inference-tables/figure-html/ipodRandDotPlot-1.png b/_freeze/18-inference-tables/figure-html/ipodRandDotPlot-1.png new file mode 100644 index 00000000..664f0238 Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/ipodRandDotPlot-1.png differ diff --git a/_freeze/18-inference-tables/figure-html/unnamed-chunk-23-1.png b/_freeze/18-inference-tables/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 00000000..44be47ff Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/unnamed-chunk-23-1.png differ diff --git a/_freeze/18-inference-tables/figure-html/unnamed-chunk-24-1.png b/_freeze/18-inference-tables/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..78e69699 Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/unnamed-chunk-24-1.png differ diff --git a/_freeze/18-inference-tables/figure-html/unnamed-chunk-26-1.png b/_freeze/18-inference-tables/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 00000000..5a571ad7 Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/unnamed-chunk-26-1.png differ diff --git a/_freeze/18-inference-tables/figure-html/unnamed-chunk-28-1.png b/_freeze/18-inference-tables/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 00000000..416597fc Binary files /dev/null and b/_freeze/18-inference-tables/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/19-inference-one-mean/execute-results/html.json b/_freeze/19-inference-one-mean/execute-results/html.json new file mode 100644 index 00000000..88474379 --- /dev/null +++ b/_freeze/19-inference-one-mean/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "e13ea7ddfaf77968168f5dd1a552823c", + "result": { + "markdown": "# Inference for a single mean {#inference-one-mean}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nFocusing now on Statistical Inference for **numerical 
data**, again, we will revisit and expand upon the foundational aspects of hypothesis testing from Chapter \\@ref(foundations-randomization).\n\nThe important data structure for this chapter is a numeric response variable (that is, the outcome is quantitative).\nThe four data structures we detail are one numeric response variable, one numeric response variable which is a difference across a pair of observations, a numeric response variable broken down by a binary explanatory variable, and a numeric response variable broken down by an explanatory variable that has two or more levels.\nWhen appropriate, each of the data structures will be analyzed using the three methods from Chapters \\@ref(foundations-randomization), \\@ref(foundations-bootstrapping), and \\@ref(foundations-mathematical): randomization test, bootstrapping, and mathematical models, respectively.\n\nAs we build on the inferential ideas, we will visit new foundational concepts in statistical inference.\nOne key new idea rests in estimating how the sample mean (as opposed to the sample proportion) varies from sample to sample; the resulting value is referred to as the standard error of the mean.\nWe will also introduce a new important mathematical model, the $t$-distribution (as the foundation for the $t$-test).\n:::\n\n\n\n\n\nIn this chapter, we focus on the sample mean (instead of, for example, the sample median or the range of the observations) because of the well-studied mathematical model which describes the behavior of the sample mean.\nWe will not cover mathematical models which describe other statistics, but the bootstrap and randomization techniques described below are immediately extendable to any function of the observed data.\nThe sample mean will be calculated in one group, two paired groups, two independent groups, and many groups settings.\nThe techniques described for each setting will vary slightly, but you will be well served to find the structural similarities across the different settings.\n\nSimilar to how we can model the behavior of the sample proportion $\\hat{p}$ using a normal distribution, the sample mean $\\bar{x}$ can also be modeled using a normal distribution when certain conditions are met.\n\\index{point estimate!single mean} However, we'll soon learn that a new distribution, called the $t$-distribution, tends to be more useful when working with the sample mean.\nWe'll first learn about this new distribution, then we'll use it to construct confidence intervals and conduct hypothesis tests for the mean.\n\n## Bootstrap confidence interval for a mean {#boot1mean}\n\nConsider a situation where you want to know whether you should buy a franchise of the used car store Awesome Autos.\nAs part of your planning, you'd like to know for how much an average car from Awesome Autos sells.\nIn order to go through the example more clearly, let's say that you are only able to randomly sample five cars from Awesome Auto.\n(If this were a real example, you would surely be able to take a much larger sample size, possibly even being able to measure the entire population!)\n\n### Observed data\n\nFigure \\@ref(fig:5cars) shows a (small) random sample of observations from Awesome Auto.\nThe actual cars as well as their selling price is shown.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A sample of five cars from Awesome Auto.](images/5cars.png){fig-alt='Photographs of 5 different automobiles. The cars are different color and different makes and models. 
On top of the image of each car is its price; the five prices range from 9600 dollars to 27000 dollars.' width=75%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nThe sample average car price of \\$17140.00 is a first guess at the price of the average car price at Awesome Auto.\nHowever, as a student of statistics, you understand that one sample mean based on a sample of five observations will not necessarily equal the true population average car price for all the cars at Awesome Auto.\nIndeed, you can see that the observed car prices vary with a standard deviation of \\$7170.29, and surely the average car price would be different if a different sample of size five had been taken from the population.\nFortunately, as it did in previous chapters for the sample proportion, bootstrapping will approximate the variability of the sample mean from sample to sample.\n\n### Variability of the statistic\n\nAs with the inferential ideas covered in Chapters \\@ref(foundations-randomization), \\@ref(foundations-bootstrapping), and \\@ref(foundations-mathematical), the inferential analysis methods in this chapter are grounded in quantifying how one dataset differs from another when they are both taken from the same population.\nTo repeat, the idea is that we want to know how datasets differ from one another, but we aren't ever going to take more than one sample of observations.\nIt does not make sense to take repeated samples from the same population because if you have the ability to take more samples, a larger sample size will benefit you more than taking two samples from the population.\nInstead, of taking repeated samples from the actual population, we use bootstrapping to measure how the samples behave under an estimate of the population.\n\nAs mentioned previously, to get a sense of the cars at Awesome Auto, you take a sample of five cars from the Awesome Auto branch near you as a way to gauge the price of the cars being sold.\nFigure \\@ref(fig:bootpop1mean) shows how the unknown original population can be estimated by using the sample to approximate the distribution of car prices from the population of cars at Awesome Auto.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![As seen previously, the idea behind bootstrapping is to consider the sample at hand as an estimate of the population. Sampling from the sample (of 5 cars) is identical to sampling from an infinite population which is made up of only the cars in the original sample.](images/bootpop1mean.png){fig-alt='The sample of 5 cars is drawn from a large population with other values (i.e., other car prices) unknown. The sample of five cars is replicated infinitely many times to create a proxy population where the car prices are given by the original dataset in the same relative distribution as measured in the sample.' 
width=90%}\n:::\n:::\n\n\nBy taking repeated samples from the estimated population, the variability from sample to sample can be observed.\nIn Figure \\@ref(fig:boot2) the repeated bootstrap samples are seen to be different both from each other and from the original population.\nRecall that the bootstrap samples were taken from the same (estimated) population, and so the differences in bootstrap samples are due entirely to natural variability in the sampling procedure.\nFor the situation at hand where the sample mean is the statistic of interest, the variability from sample to sample can be seen in Figure \\@ref(fig:bootsamps1mean).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![To estimate the natural variability in the sample mean, different bootstrap samples are taken from the original sample. Notice that each bootstrap resample is different from each other as well as from the original sample](images/bootsamps1mean.png){fig-alt='The sample is shown being taken from the large unknown population. The bootstrap resamples, however, are taken directly from the original sample (sampling with replacement) as if the resamples had been taken from an infinitely large proxy population. Three bootstrap resamples of 5 cars each are shown, each resample is slightly different due to the process of resampling with replacement.' width=90%}\n:::\n:::\n\n\nBy summarizing each of the bootstrap samples (here, using the sample mean), we see, directly, the variability of the sample mean, $\\bar{x},$ from sample to sample.\nThe distribution of $\\bar{x}_{bs}$ for the Awesome Auto cars is shown in Figure \\@ref(fig:bootmeans1mean).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Because each of the bootstrap resamples respresents a different set of cars, the mean of the each bootstrap resample will be a different value. Each of the bootstrapped means is calculated, and a histogram of the values describes the inherent natural variability of the sample mean which is due to the sampling process.](images/bootmeans1mean.png){fig-alt='The sample is shown being taken from the large unknown population. Three bootstrap resamples of 5 cars each are shown; the three resamples have an average car price of 11780 dollars, 19020 dollars, and 20260 dollars, respectively. A histogram representing many bootstrap resamples indicates that the bootstrap averages vary from roughly 10000 dollars to 25000 dollars.' 
width=100%}\n:::\n:::\n\n\nFigure \\@ref(fig:carsbsmean) summarizes one thousand bootstrap samples in a histogram of the bootstrap sample means.\nThe bootstrapped average car prices vary from about \\$10,000 to \\$25,000.\nThe bootstrap percentile confidence interval is found by locating the middle 90% (for a 90% confidence interval) or a 95% (for a 95% confidence interval) of the bootstrapped statistics.\n\n::: {.workedexample data-latex=\"\"}\nUsing Figure \\@ref(fig:carsbsmean), find the 90% and 95% bootstrap percentile confidence intervals for the true average price of a car from Awesome Auto.\n\n------------------------------------------------------------------------\n\nA 90% confidence interval is given by \\$12,140 and \\$22,007.\nThe conclusion is that we are 90% confident that the true average car price at Awesome Auto lies somewhere between \\$12,140 and \\$22,007.\n\nA 95% confidence interval is given by \\$11,778 to \\$22,500.\nThe conclusion is that we are 95% confident that the true average car price at Awesome Auto lies somewhere between \\$11,778 to \\$22,500.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The original Awesome Auto data is bootstrapped 1,000 times. The histogram provides a sense for the variability of the average car price from sample to sample.](19-inference-one-mean_files/figure-html/carsbsmean-1.png){width=100%}\n:::\n:::\n\n\n### Bootstrap SE confidence interval\n\nAs seen in Section \\@ref(two-prop-boot-ci), another method for creating bootstrap confidence intervals directly uses a calculation of the variability of the bootstrap statistics (here, the bootstrap means).\nIf the bootstrap distribution is relatively symmetric and bell-shaped, then the 95% bootstrap SE confidence interval can be constructed with the formula familiar from the mathematical models in previous chapters:\n\n$$\\mbox{point estimate} \\pm 2 \\cdot SE_{BS}$$ The number 2 is an approximation connected to the \"95%\" part of the confidence interval (remember the 68-95-99.7 rule).\nAs will be seen in Section \\@ref(one-mean-math), a new distribution (the $t$-distribution) will be applied to most mathematical inference on numerical variables.\nHowever, because bootstrapping is not grounded in the same theory as the mathematical approach given in this text, we stick with the standard normal quantiles (in R use the function `qnorm()` to find normal percentiles other than 95%) for different confidence percentages.[^19-inference-one-mean-1]\n\n[^19-inference-one-mean-1]: There is a large literature on understanding and improving bootstrap intervals, see @Hesterbeg:2015 titled [\"What Teachers Should Know About the Bootstrap\"](https://www.tandfonline.com/doi/full/10.1080/00031305.2015.1089789) and @Hayden:2019 titled [\"Questionable Claims for Simple Versions of the Bootstrap\"](https://www.tandfonline.com/doi/full/10.1080/10691898.2019.1669507) for more information.\n\n::: {.workedexample data-latex=\"\"}\nExplain how the standard error (SE) of the bootstrapped means is calculated and what it is measuring.\n\n------------------------------------------------------------------------\n\nThe SE of the bootstrapped means measures how variable the means are from resample to resample.\nThe bootstrap SE is a good approximation to the SE of means as if we had taken repeated samples from the original population (which we agreed isn't something we would do because of wasted resources).\n\nLogistically, we can find the standard deviation of the bootstrapped means using the same calculations 
from Chapter \\@ref(explore-numerical).\nThat is, the bootstrapped means are the individual observations about which we measure the variability.\n:::\n\nAlthough we won't spend a lot of energy on this concept, you may be wondering some of the differences between a standard error and a standard deviation.\nThe **standard error**\\index{standard error} describes how a statistic (e.g., sample mean or sample proportion) varies from sample to sample.\nThe **standard deviation**\\index{standard deviation} can be thought of as a function applied to any list of numbers which measures how far those numbers vary from their own average.\nSo, you can have a standard deviation calculated on a column of dog heights or a standard deviation calculated on a column of bootstrapped means from the resampled data.\nNote that the standard deviation calculated on the bootstrapped means is referred to as the bootstrap standard error of the mean.\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nIt turns out that the standard deviation of the bootstrapped means from Figure \\@ref(fig:carsbsmean) is \\$2,891.87 (a value which is an excellent approximation for the standard error of sample means if we were to take repeated samples from the population).\n\\[Note: in R the calculation was done using the function `sd()`.\\] The average of the observed prices is \\$17,140, ad we will consider the sample average to be the best guess point estimate for $\\mu.$ .\n\nFind and interpret the confidence interval for $\\mu$ (the true average cost of a car at Awesome Auto) using the bootstrap SE confidence interval formula.[^19-inference-one-mean-2]\n:::\n\n[^19-inference-one-mean-2]: Using the formula for the bootstrap SE interval, we find the 95% confidence interval for $\\mu$ is: $17,140 \\pm 2 \\cdot 2,891.87 \\rightarrow$ (\\$11,356.26, \\$22,923.74).\n We are 95% confident that the true average car price at Awesome Auto is somewhere between \\$11,356.26 and \\$22,923.74.\n\n::: {.workedexample data-latex=\"\"}\nCompare and contrast the two different 95% confidence intervals for $\\mu$ created by finding the percentiles of the bootstrapped means and created by finding the SE of the bootstrapped means.\nDo you think the intervals *should* be identical?\n\n------------------------------------------------------------------------\n\n- Percentile interval: (\\$11,778, \\$22,500)\n- SE interval: (\\$11,356.26, \\$22,923.74)\n\nThe intervals were created using different methods, so it is not surprising that they are not identical.\nHowever, we are pleased to see that the two methods provide very similar interval approximations.\n\nThe technical details surrounding which data structures are best for percentile intervals and which are best for SE intervals is beyond the scope of this text.\nHowever, the larger the samples are, the better (and closer) the interval estimates will be.\n:::\n\n### Bootstrap percentile confidence interval for a standard deviation\n\nSuppose that the research question at hand seeks to understand how variable the prices of the cars are at Awesome Auto.\nThat is, your interest is no longer in the average car price but in the *standard deviation* of the prices of all cars at Awesome Auto, $\\sigma.$ You may have already realized that the sample standard deviation, $s,$ will work as a good **point estimate** for the parameter of interest: the population standard deviation, $\\sigma.$ The point estimate of the five observations is calculated to be $s = \\$7,170.286.$ While $s = \\$7,170.286$ might be a good 
guess for $\\sigma,$ we prefer to have an interval estimate for the parameter of interest.\nAlthough there is a mathematical model which describes how $s$ varies from sample to sample, the mathematical model will not be presented in this text.\nEven without the mathematical model, bootstrapping can be used to find a confidence interval for the parameter $\\sigma.$ Using the same technique as presented for a confidence interval for $\\mu,$ here we find the bootstrap percentile confidence interval for $\\sigma.$\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nDescribe the bootstrap distribution for the standard deviation shown in Figure \\@ref(fig:carsbssd).\n\n------------------------------------------------------------------------\n\nThe distribution is skewed left and centered near \\$7,170.286, which is the point estimate from the original data.\nMost observations in this distribution lie between \\$0 and \\$10,000.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nUsing Figure \\@ref(fig:carsbssd), find *and interpret* a 90% bootstrap percentile confidence interval for the population standard deviation for car prices at Awesome Auto.[^19-inference-one-mean-3]\n:::\n\n[^19-inference-one-mean-3]: By looking at the percentile values in Figure \\@ref(fig:carsbssd), the middle 90% of the bootstrap standard deviations are given by the 5 percentile (\\$3,602.5) and 95 percentile (\\$8,737.2).\n That is, we are 90% confident that the true standard deviation of car prices is between \\$3,602.5 and \\$8,737.2.\n Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such a 95% or 99%).\n A lower degree of confidence increases potential for error, but it also produces a more narrow interval.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The original Awesome Auto data is bootstrapped 1,000 times. 
The histogram provides a sense for the variability of the standard deviation of the car prices from sample to sample.](19-inference-one-mean_files/figure-html/carsbssd-1.png){width=100%}\n:::\n:::\n\n\n### Bootstrapping is not a solution to small sample sizes!\n\nThe example presented above is done for a sample with only five observations.\nAs with analysis techniques that build on mathematical models, bootstrapping works best when a large random sample has been taken from the population.\nBootstrapping is a method for capturing the variability of a statistic when the mathematical model is unknown (it is not a method for navigating small samples).\nAs you might guess, the larger the random sample, the more accurately that sample will represent the population of interest.\n\n## Mathematical model for a mean {#one-mean-math}\n\nAs with the sample proportion, the variability of the sample mean is well described by the mathematical theory given by the Central Limit Theorem.\nHowever, because of missing information about the inherent variability in the population ($\\sigma$), a $t$-distribution is used in place of the standard normal when performing hypothesis test or confidence interval analyses.\n\n### Mathematical distribution of the sample mean\n\nThe sample mean tends to follow a normal distribution centered at the population mean, $\\mu,$ when certain conditions are met.\nAdditionally, we can compute a standard error for the sample mean using the population standard deviation $\\sigma$ and the sample size $n.$\n\n::: {.important data-latex=\"\"}\n**Central Limit Theorem for the sample mean.**\n\nWhen we collect a sufficiently large sample of $n$ independent observations from a population with mean $\\mu$ and standard deviation $\\sigma,$ the sampling distribution of $\\bar{x}$ will be nearly normal with\n\n$$\\text{Mean} = \\mu \\qquad \\text{Standard Error }(SE) = \\frac{\\sigma}{\\sqrt{n}}$$\n:::\n\nBefore diving into confidence intervals and hypothesis tests using $\\bar{x},$ we first need to cover two topics:\n\n- When we modeled $\\hat{p}$ using the normal distribution, certain conditions had to be satisfied. The conditions for working with $\\bar{x}$ are a little more complex, and below, we will discuss how to check conditions for inference using a mathematical model.\n- The standard error is dependent on the population standard deviation, $\\sigma.$ However, we rarely know $\\sigma,$ and instead we must estimate it. 
Because this estimation is itself imperfect, we use a new distribution called the $t$-distribution to fix this problem, which we discuss below.\n\n\\index{t-distribution@$t$-distribution}\n\n\n\n\n\n### Evaluating the two conditions required for modeling $\\bar{x}$\n\nTwo conditions are required to apply the Central Limit Theorem\\index{Central Limit Theorem} for a sample mean $\\bar{x}:$\n\n- **Independence.** The sample observations must be independent.\n The most common way to satisfy this condition is when the sample is a simple random sample from the population.\n If the data come from a random process, analogous to rolling a die, this would also satisfy the independence condition.\n\n- **Normality.** When a sample is small, we also require that the sample observations come from a normally distributed population.\n We can relax this condition more and more for larger and larger sample sizes.\n This condition is obviously vague, making it difficult to evaluate, so next we introduce a couple rules of thumb to make checking this condition easier.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**General rule for performing the normality check.**\n\nThere is no perfect way to check the normality condition, so instead we use two general rules based on the number and magnitude of extreme observations.\nNote, it often takes practice to get a sense for whether a normal approximation is appropriate.\n\n- Small $n$: If the sample size $n$ is small and there are **no clear outliers** in the data, then we typically assume the data come from a nearly normal distribution to satisfy the condition.\n- Large $n$: If the sample size $n$ is large and there are no **particularly extreme** outliers, then we typically assume the sampling distribution of $\\bar{x}$ is nearly normal, even if the underlying distribution of individual observations is not.\n\nSome guidelines for determining whether $n$ is considered small or large are as follows: slight skew is okay for sample sizes of 15, moderate skew for sample sizes of 30, and strong skew for sample sizes of 60.\n:::\n\nIn this first course in statistics, you aren't expected to develop perfect judgment on the normality condition.\nHowever, you are expected to be able to handle clear cut cases based on the rules of thumb.[^19-inference-one-mean-4]\n\n[^19-inference-one-mean-4]: More nuanced guidelines would consider further relaxing the *particularly extreme outlier* check when the sample size is very large.\n However, we'll leave further discussion here to a future course.\n\n::: {.workedexample data-latex=\"\"}\nConsider the four plots provided in Figure \\@ref(fig:outliersandsscondition) that come from simple random samples from different populations.\nTheir sample sizes are $n_1 = 15$ and $n_2 = 50.$\n\nAre the independence and normality conditions met in each case?\n\n------------------------------------------------------------------------\n\nEach samples is from a simple random sample of its respective population, so the independence condition is satisfied.\nLet's next check the normality condition for each using the rule of thumb.\n\nThe first sample has fewer than 30 observations, so we are watching for any clear outliers.\nNone are present; while there is a small gap in the histogram on the right, this gap is small and over 20% of the observations in this small sample are represented to the left of the gap, so we can hardly call these clear outliers.\nWith no clear outliers, the normality condition can be reasonably assumed to be met.\n\nThe second 
sample has a sample size greater than 30 and includes an outlier that appears to be roughly 5 times further from the center of the distribution than the next furthest observation.\nThis is an example of a particularly extreme outlier, so the normality condition would not be satisfied.\n\nIt's often helpful to also visualize the data using a box plot to assess skewness and existence of outliers.\nThe box plots provided underneath each histogram confirms our conclusions that the first sample does not have any outliers and the second sample does, with one outlier being particularly more extreme than the others.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histograms of samples from two different populations.](19-inference-one-mean_files/figure-html/outliersandsscondition-1.png){width=90%}\n:::\n:::\n\n\nIn practice, it's typical to also do a mental check to evaluate whether we have reason to believe the underlying population would have moderate skew (if $n < 30)$ or have particularly extreme outliers $(n \\geq 30)$ beyond what we observe in the data.\nFor example, consider the number of followers for each individual account on Twitter, and then imagine this distribution.\nThe large majority of accounts have built up a couple thousand followers or fewer, while a relatively tiny fraction have amassed tens of millions of followers, meaning the distribution is extremely skewed.\nWhen we know the data come from such an extremely skewed distribution, it takes some effort to understand what sample size is large enough for the normality condition to be satisfied.\n\n\\index{Central Limit Theorem}\n\n### Introducing the t-distribution\n\n\\index{t-distribution}\n\nIn practice, we cannot directly calculate the standard error for $\\bar{x}$ since we do not know the population standard deviation, $\\sigma.$ We encountered a similar issue when computing the standard error for a sample proportion, which relied on the population proportion, $p.$ Our solution in the proportion context was to use the sample value in place of the population value when computing the standard error.\nWe'll employ a similar strategy for computing the standard error of $\\bar{x},$ using the sample standard deviation $s$ in place of $\\sigma:$\n\n$$SE = \\frac{\\sigma}{\\sqrt{n}} \\approx \\frac{s}{\\sqrt{n}}$$\n\nThis strategy tends to work well when we have a lot of data and can estimate $\\sigma$ using $s$ accurately.\nHowever, the estimate is less precise with smaller samples, and this leads to problems when using the normal distribution to model $\\bar{x}.$\n\nWe'll find it useful to use a new distribution for inference calculations called the $t$-distribution.\nA $t$-distribution, shown as a solid line in Figure \\@ref(fig:tDistCompareToNormalDist), has a bell shape.\nHowever, its tails are thicker than the normal distribution's, meaning observations are more likely to fall beyond two standard deviations from the mean than under the normal distribution.\n\nThe extra thick tails of the $t$-distribution are exactly the correction needed to resolve the problem (due to extra variability of the T score) of using $s$ in place of $\\sigma$ in the $SE$ calculation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Comparison of a $t$-distribution and a normal distribution.](19-inference-one-mean_files/figure-html/tDistCompareToNormalDist-1.png){width=60%}\n:::\n:::\n\n\nThe $t$-distribution is always centered at zero and has a single parameter: degrees of freedom.\nThe **degrees of freedom** describes the precise form of the 
bell-shaped $t$-distribution.\nSeveral $t$-distributions are shown in Figure \\@ref(fig:tDistConvergeToNormalDist) in comparison to the normal distribution.\nSimilar to the Chi-square distribution, the shape of the $t$-distribution also depends on the degrees of freedom.\n\nIn general, we'll use a $t$-distribution with $df = n - 1$ to model the sample mean when the sample size is $n.$ That is, when we have more observations, the degrees of freedom will be larger and the $t$-distribution will look more like the standard normal distribution; when the degrees of freedom is about 30 or more, the $t$-distribution is nearly indistinguishable from the normal distribution.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The larger the degrees of freedom, the more closely the $t$-distribution resembles the standard normal distribution.](19-inference-one-mean_files/figure-html/tDistConvergeToNormalDist-1.png){width=90%}\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Degrees of freedom: df.**\n\nThe degrees of freedom describes the shape of the $t$-distribution.\nThe larger the degrees of freedom, the more closely the distribution approximates the normal distribution.\n\nWhen modeling $\\bar{x}$ using the $t$-distribution, use $df = n - 1.$\n:::\n\nThe $t$-distribution allows us greater flexibility than the normal distribution when analyzing numerical data.\nIn practice, it's common to use statistical software, such as R, Python, or SAS for these analyses.\nIn R, the function used for calculating probabilities under a $t$-distribution is `pt()` (which should seem similar to previous R functions, `pnorm()` and `pchisq()`).\nDon't forget that with the $t$-distribution, the degrees of freedom must always be specified!\n\nFor the examples and guided practices below, you may have to use a table or statistical software to find the answers.\nWe recommend trying the problems so as to get a sense for how the $t$-distribution can vary in width depending on the degrees of freedom.\nNo matter the approach you choose, apply your method using the examples below to confirm your working understanding of the $t$-distribution.\n\n::: {.workedexample data-latex=\"\"}\nWhat proportion of the $t$-distribution with 18 degrees of freedom falls below -2.10?\n\n------------------------------------------------------------------------\n\nJust like a normal probability problem, we first draw the picture in Figure \\@ref(fig:tDistDF18LeftTail2Point10) and shade the area below -2.10.\n\nUsing statistical software, we can obtain a precise value: 0.0250.\n:::\n\n\\clearpage\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# use pt() to find probability under the $t$-distribution\npt(-2.10, df = 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.025\n```\n\n\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![The $t$-distribution with 18 degrees of freedom. 
The area below -2.10 has been shaded.](19-inference-one-mean_files/figure-html/tDistDF18LeftTail2Point10-1.png){width=60%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nA $t$-distribution with 20 degrees of freedom is shown in Figure \\@ref(fig:tDistDF20RightTail1Point65).\nEstimate the proportion of the distribution falling above 1.65.\n\n------------------------------------------------------------------------\n\nNote that with 20 degrees of freedom, the $t$-distribution is relatively close to the normal distribution.\nWith a normal distribution, this would correspond to about 0.05, so we should expect the $t$-distribution to give us a value in this neighborhood.\nUsing statistical software: 0.0573.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# use pt() to find probability under the $t$-distribution\n1 - pt(1.65, df = 20)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0573\n```\n\n\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Top: The $t$-distribution with 20 degrees of freedom, with the area above 1.65 shaded.](19-inference-one-mean_files/figure-html/tDistDF20RightTail1Point65-1.png){width=50%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nA $t$-distribution with 2 degrees of freedom is shown in Figure \\@ref(fig:tDistDF23UnitsFromMean).\nEstimate the proportion of the distribution falling more than 3 units from the mean (above or below).\n\n------------------------------------------------------------------------\n\nWith so few degrees of freedom, the $t$-distribution will give a more notably different value than the normal distribution.\nUnder a normal distribution, the area would be about 0.003 using the 68-95-99.7 rule.\nFor a $t$-distribution with $df = 2,$ the area in both tails beyond 3 units totals 0.0955.\nThis area is dramatically different than what we obtain from the normal distribution.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# use pt() to find probability under the $t$-distribution\npt(-3, df = 2) + (1 - pt(3, df = 2))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0955\n```\n\n\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![The $t$-distribution with 2 degrees of freedom, with the area further than 3 units from 0 shaded.](19-inference-one-mean_files/figure-html/tDistDF23UnitsFromMean-1.png){width=50%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWhat proportion of the $t$-distribution with 19 degrees of freedom falls above -1.79 units?\nUse your preferred method for finding tail areas.[^19-inference-one-mean-5]\n:::\n\n[^19-inference-one-mean-5]: We want to find the shaded area *above* -1.79 (we leave the picture to you).\n The lower tail area has an area of 0.0447, so the upper area would have an area of $1 - 0.0447 = 0.9553.$\n\n\\index{t-distribution}\n\n### One sample t-intervals\n\nLet's get our first taste of applying the $t$-distribution in the context of an example about the mercury content of dolphin muscle.\nElevated mercury concentrations are an important problem for both dolphins and other animals, like humans, who occasionally eat them.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A Risso's dolphin. Photo by Mike Baird, www.bairdphotos.com](images/rissosDolphin.jpg){fig-alt='Photograph of a Risso\\'s dolphin.' 
width=75%}\n:::\n:::\n\n\nWe will identify a confidence interval for the average mercury content in dolphin muscle using a sample of 19 Risso's dolphins from the Taiji area in Japan.\nThe data are summarized in Table \\@ref(tab:summaryStatsOfHgInMuscleOfRissosDolphins).\nThe minimum and maximum observed values can be used to evaluate whether there are clear outliers.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n\n
Summary of mercury content in the muscle of 19 Risso's dolphins from the Taiji area. Measurements are in micrograms of mercury per wet gram of muscle $(\\mu$g/wet g).
n Mean SD Min Max
19 4.4 2.3 1.7 9.2
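\n\nThe worked examples that follow check conditions and then build a 95% confidence interval from these summary statistics piece by piece. As a compact preview -- a sketch only, with the values typed in from the table above rather than computed from the raw data -- the same calculation can be scripted in R:\n\n::: {.cell}\n\n```{.r .cell-code}\n# 95% t-interval from the dolphin summary statistics\nxbar <- 4.4    # sample mean\ns    <- 2.3    # sample standard deviation\nn    <- 19     # sample size\nse    <- s / sqrt(n)            # standard error, about 0.528\ntstar <- qt(0.975, df = n - 1)  # cutoff with 95% in the middle, about 2.10\nxbar + c(-1, 1) * tstar * se    # interval, about (3.29, 5.51)\n```\n:::\n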
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nAre the independence and normality conditions satisfied for this dataset?\n\n------------------------------------------------------------------------\n\nThe observations are a simple random sample, therefore it is reasonable to assume that the dolphins are independent.\nThe summary statistics in Table \\@ref(tab:summaryStatsOfHgInMuscleOfRissosDolphins) do not suggest any clear outliers, with all observations within 2.5 standard deviations of the mean.\nBased on this evidence, the normality condition seems reasonable.\n:::\n\nIn the normal model, we used $z^{\\star}$ and the standard error to determine the width of a confidence interval.\nWe revise the confidence interval formula slightly when using the $t$-distribution:\n\n$$\n\\begin{aligned}\n\\text{point estimate} \\ &\\pm\\ t^{\\star}_{df} \\times SE \\\\\n\\bar{x} \\ &\\pm\\ t^{\\star}_{df} \\times \\frac{s}{\\sqrt{n}}\n\\end{aligned}\n$$\n\n::: {.workedexample data-latex=\"\"}\nUsing the summary statistics in Table \\@ref(tab:summaryStatsOfHgInMuscleOfRissosDolphins), compute the standard error for the average mercury content in the $n = 19$ dolphins.\n\n------------------------------------------------------------------------\n\nWe plug in $s$ and $n$ into the formula: $SE = \\frac{s}{\\sqrt{n}} = \\frac{2.3}{\\sqrt{19}} = 0.528.$\n:::\n\nThe value $t^{\\star}_{df}$ is a cutoff we obtain based on the confidence level and the $t$-distribution with $df$ degrees of freedom.\nThat cutoff is found in the same way as with a normal distribution: we find $t^{\\star}_{df}$ such that the fraction of the $t$-distribution with $df$ degrees of freedom within a distance $t^{\\star}_{df}$ of 0 matches the confidence level of interest.\n\n::: {.workedexample data-latex=\"\"}\nWhen $n = 19,$ what is the appropriate degrees of freedom?\nFind $t^{\\star}_{df}$ for this degrees of freedom and the confidence level of 95%\n\n------------------------------------------------------------------------\n\nThe degrees of freedom is easy to calculate: $df = n - 1 = 18.$\n\nUsing statistical software, we find the cutoff where the upper tail is equal to 2.5%: $t^{\\star}_{18} = 2.10.$ The area below -2.10 will also be equal to 2.5%.\nThat is, 95% of the $t$-distribution with $df = 18$ lies within 2.10 units of 0.\n:::\n\n\\clearpage\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# use qt() to find the t-cutoff (with 95% in the middle)\nqt(0.025, df = 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] -2.1\n```\n\n\n:::\n\n```{.r .cell-code}\nqt(0.975, df = 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2.1\n```\n\n\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Degrees of freedom for a single sample.**\n\nIf the sample has $n$ observations and we are examining a single mean, then we use the $t$-distribution with $df=n-1$ degrees of freedom.\n:::\n\n::: {.workedexample data-latex=\"\"}\nCompute and interpret the 95% confidence interval for the average mercury content in Risso's dolphins.\n\n------------------------------------------------------------------------\n\nWe can construct the confidence interval as\n\n$$\n\\begin{aligned}\n\\bar{x} \\ &\\pm\\ t^{\\star}_{18} \\times SE \\\\\n4.4 \\ &\\pm\\ 2.10 \\times 0.528 \\\\\n(3.29 \\ &, \\ 5.51)\n\\end{aligned} \n$$\n\nWe are 95% confident the average mercury content of muscles in Risso's dolphins is between 3.29 and 5.51 $\\mu$g/wet gram, which is considered extremely high.\n:::\n\n::: {.important data-latex=\"\"}\n**Calculating 
a** $t$**-confidence interval for the mean,** $\\mu.$\n\nBased on a sample of $n$ independent and nearly normal observations, a confidence interval for the population mean is\n\n$$\n\\begin{aligned}\n\\text{point estimate} \\ &\\pm\\ t^{\\star}_{df} \\times SE \\\\\n\\bar{x} \\ &\\pm\\ t^{\\star}_{df} \\times \\frac{s}{\\sqrt{n}}\n\\end{aligned}\n$$\n\nwhere $\\bar{x}$ is the sample mean, $t^{\\star}_{df}$ corresponds to the confidence level and degrees of freedom $df,$ and $SE$ is the standard error as estimated by the sample.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nThe FDA's webpage provides some data on mercury content of fish.\nBased on a sample of 15 croaker white fish (Pacific), a sample mean and standard deviation were computed as 0.287 and 0.069 ppm (parts per million), respectively.\nThe 15 observations ranged from 0.18 to 0.41 ppm.\nWe will assume these observations are independent.\nBased on the summary statistics of the data, do you have any objections to the normality condition of the individual observations?[^19-inference-one-mean-6]\n:::\n\n[^19-inference-one-mean-6]: The sample size is under 30, so we check for obvious outliers: since all observations are within 2 standard deviations of the mean, there are no such clear outliers.\n\n::: {.workedexample data-latex=\"\"}\nEstimate the standard error of $\\bar{x} = 0.287$ ppm using the data summaries in the previous Guided Practice.\nIf we are to use the $t$-distribution to create a 90% confidence interval for the actual mean of the mercury content, identify the degrees of freedom and $t^{\\star}_{df}.$\n\n------------------------------------------------------------------------\n\nThe standard error: $SE = \\frac{0.069}{\\sqrt{15}} = 0.0178.$\n\nDegrees of freedom: $df = n - 1 = 14.$\n\nSince the goal is a 90% confidence interval, we choose $t_{14}^{\\star}$ so that the two-tail area is 0.1: $t^{\\star}_{14} = 1.76.$\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# use qt() to find the t-cutoff (with 90% in the middle)\nqt(0.05, df = 14)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] -1.76\n```\n\n\n:::\n\n```{.r .cell-code}\nqt(0.95, df = 14)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.76\n```\n\n\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nUsing the information and results of the previous Guided Practice and Example, compute a 90% confidence interval for the average mercury content of croaker white fish (Pacific).[^19-inference-one-mean-7]\n:::\n\n[^19-inference-one-mean-7]: $\\bar{x} \\ \\pm\\ t^{\\star}_{14} \\times SE \\ \\to\\ 0.287 \\ \\pm\\ 1.76 \\times 0.0178 \\ \\to\\ (0.256, 0.318).$ We are 90% confident that the average mercury content of croaker white fish (Pacific) is between 0.256 and 0.318 ppm.\n\n::: {.guidedpractice data-latex=\"\"}\nThe 90% confidence interval from the previous Guided Practice is 0.256 ppm to 0.318 ppm.\nCan we say that 90% of croaker white fish (Pacific) have mercury levels between 0.256 and 0.318 ppm?[^19-inference-one-mean-8]\n:::\n\n[^19-inference-one-mean-8]: No, a confidence interval only provides a range of plausible values for a population parameter, in this case the population mean.\n It does not describe what we might observe for individual observations.\n\nRecall that the margin of error is defined by the standard error.\nThe margin of error for $\\bar{x}$ can be directly obtained from $SE(\\bar{x}).$\n\n::: {.important data-latex=\"\"}\n**Margin of error for** $\\bar{x}.$\n\nThe margin of error is $t^\\star_{df} \\times s/\\sqrt{n}$ 
where $t^\\star_{df}$ is calculated from a specified percentile on the t-distribution with *df* degrees of freedom.\n:::\n\n### One sample t-tests\n\nNow that we have used the $t$-distribution for making a confidence interval for a mean, let's speed on through to hypothesis tests for the mean.\n\n::: {.important data-latex=\"\"}\n**The test statistic for assessing a single mean is a T.**\n\nThe T score is a ratio of how the sample mean differs from the hypothesized mean as compared to how the observations vary.\n\n$$ T = \\frac{\\bar{x} - \\mbox{null value}}{s/\\sqrt{n}} $$\n\nWhen the null hypothesis is true and the conditions are met, T has a t-distribution with $df = n - 1.$\n\nConditions:\n\n- Independent observations.\n- Large samples and no extreme outliers.\n:::\n\n\\vspace{-3mm}\n\nIs the typical US runner getting faster or slower over time?\nWe consider this question in the context of the Cherry Blossom Race, which is a 10-mile race in Washington, DC each spring.\nThe average time for all runners who finished the Cherry Blossom Race in 2006 was 93.29 minutes (93 minutes and about 17 seconds).\nWe want to determine using data from 100 participants in the 2017 Cherry Blossom Race whether runners in this race are getting faster or slower, versus the other possibility that there has been no change.\n\n::: {.data data-latex=\"\"}\nThe [`run17`](http://openintrostat.github.io/cherryblossom/reference/run17.html) data can be found in the [**cherryblossom**](http://openintrostat.github.io/cherryblossom) R package.\n:::\n\n\\vspace{-3mm}\n\n::: {.guidedpractice data-latex=\"\"}\nWhat are appropriate hypotheses for this context?[^19-inference-one-mean-9]\n:::\n\n[^19-inference-one-mean-9]: $H_0:$ The average 10-mile run time was the same for 2006 and 2017.\n $\\mu = 93.29$ minutes.\n $H_A:$ The average 10-mile run time for 2017 was *different* than that of 2006.\n $\\mu \\neq 93.29$ minutes.\n\n\\vspace{-3mm}\n\n::: {.guidedpractice data-latex=\"\"}\nThe data come from a simple random sample of all participants, so the observations are independent.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](19-inference-one-mean_files/figure-html/unnamed-chunk-28-1.png){width=70%}\n:::\n:::\n\n\nA histogram of the race times is given to evaluate if we can move forward with a t-test.\nIs the normality condition met?[^19-inference-one-mean-10]\n:::\n\n[^19-inference-one-mean-10]: With a sample of 100, we should only be concerned if there is are particularly extreme outliers.\n The histogram of the data does not show any outliers of concern (and arguably, no outliers at all).\n\nWhen completing a hypothesis test for the one-sample mean, the process is nearly identical to completing a hypothesis test for a single proportion.\nFirst, we find the Z score using the observed value, null value, and standard error; however, we call it a **T score** since we use a $t$-distribution for calculating the tail area.\nThen we find the p-value using the same ideas we used previously: find the one-tail area under the sampling distribution, and double it.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nWith both the independence and normality conditions satisfied, we can proceed with a hypothesis test using the $t$-distribution.\nThe sample mean and sample standard deviation of the sample of 100 runners from the 2017 Cherry Blossom Race are 98.78 and 16.59 minutes, respectively.\nRecall that the average run time in 2006 was 93.29 minutes.\nFind the test statistic and p-value.\nWhat is your 
conclusion?\n\n------------------------------------------------------------------------\n\nTo find the test statistic (T score), we first must determine the standard error:\n\n$$ SE = 16.6 / \\sqrt{100} = 1.66 $$\n\nNow we can compute the **T score** using the sample mean (98.78), null value (93.29), and $SE:$\n\n$$ T = \\frac{98.8 - 93.29}{1.66} = 3.32 $$\n\nFor $df = 100 - 1 = 99,$ we can determine using statistical software (or a $t$-table) that the one-tail area is 0.000631, which we double to get the p-value: 0.00126.\n\nBecause the p-value is smaller than 0.05, we reject the null hypothesis.\nThat is, the data provide convincing evidence that the average run time for the Cherry Blossom Run in 2017 is different than the 2006 average.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# using pt() to find the left tail and multiply by 2 to get both tails\n(1 - pt(3.32, df = 99)) * 2 \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.00126\n```\n\n\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**When using a** $t$**-distribution, we use a T score (similar to a Z score).**\n\nTo help us remember to use the $t$-distribution, we use a $T$ to represent the test statistic, and we often call this a **T score**.\nThe Z score and T score are computed in the exact same way and are conceptually identical: each represents how many standard errors the observed value is from the null value.\n:::\n\n\\clearpage\n\n## Chapter review {#chp19-review}\n\n### Summary\n\nIn this chapter we extended the randomization / bootstrap / mathematical model paradigm to questions involving quantitative variables of interest.\nWhen there is only one variable of interest, we are often hypothesizing or finding confidence intervals about the population mean.\nNote, however, the bootstrap method can be used for other statistics like the population median or the population IQR.\nWhen comparing a quantitative variable across two groups, the question often focuses on the difference in population means (or sometimes a paired difference in means).\nThe questions revolving around one, two, and paired samples of means are addressed using the t-distribution; they are therefore called \"t-tests\" and \"t-intervals.\" When considering a quantitative variable across 3 or more groups, a method called ANOVA is applied.\nAgain, almost all the research questions can be approached using computational methods (e.g., randomization tests or bootstrapping) or using mathematical models.\nWe continue to emphasize the importance of experimental design in making conclusions about research claims.\nIn particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure \\@ref(fig:randsampValloc)).\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Central Limit Theorem point estimate T score single mean
degrees of freedom SD single mean t-distribution
numerical data SE single mean
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp19-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-19].\n\n::: {.exercises data-latex=\"\"}\n1. **Statistics vs. parameters: one mean.**\nEach of the following scenarios were set up to assess an average value. For each one, identify, in words: the statistic and the parameter.\n\n a. A sample of 25 New Yorkers were asked how much sleep they get per night.\n \n b. Researchers at two different universities in California collected information on undergraduates' heights.\n\n1. **Statistics vs. parameters: one mean.**\nEach of the following scenarios were set up to assess an average value. For each one, identify, in words: the statistic and the parameter.\n \n a. Georgianna samples 20 children from a particular city and measures how many years they have each been playing piano.\n \n b. Traffic police officers (who are regularly exposed to lead from automobile exhaust) had their lead levels measured in their blood.\n\n1. **Heights of adults.** \nResearchers studying anthropometry collected body measurements, as well as age, weight, height and gender, for 507 physically active individuals. \nSummary statistics for the distribution of heights (measured in centimeters), along with a histogram, are provided below.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Min Q1 Median Mean Q3 Max SD IQR
147 164 170 171 178 198 9.4 14
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](19-inference-one-mean_files/figure-html/unnamed-chunk-32-1.png){width=70%}\n :::\n :::\n\n a. What is the point estimate for the average height of active individuals? What about the median?\n\n b. What is the point estimate for the standard deviation of the heights of active individuals? What about the IQR?\n\n c. Is a person who is 1m 80cm (180 cm) tall considered unusually tall? And is a person who is 1m 55cm (155cm) considered unusually short? Explain your reasoning.\n\n d. The researchers take another random sample of physically active individuals. Would you expect the mean and the standard deviation of this new sample to be the ones given above? Explain your reasoning.\n\n e. The sample means obtained are point estimates for the mean height of all active individuals, if the sample of individuals is equivalent to a simple random sample. What measure do we use to quantify the variability of such an estimate? Compute this quantity using the data from the original sample under the condition that the data are a simple random sample.\n\n1. **Heights of adults, standard error.**\nHeights of 507 physically active individuals have a mean of 171 centimeters and a standard deviation of 9.4 centimeters.\nProvide an estimate for the standard error of the mean for samples of following sizes.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n a. n = 10\n \n b. n = 50\n \n c. n = 100\n \n d. n = 1000\n \n e. The standard error of the mean is a number which describes what?\n\n1. **Heights of adults vs. kindergartners.**\nHeights of 507 physically active individuals have a mean of 171 centimeters and a standard deviation of 9.4 centimeters.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n a. Would the standard deviation of the heights of a few hundred kindergartners be bigger or smaller than 9.4cm? Explain your reasoning.\n \n b. Suppose many samples of size 100 adults are taken and, separately, many samples of size 100 kindergarteners are taken. For each of the many samples, the average height is computed. Which set of sample averages would have a larger standard error of the mean, the adult sample averages or the kindergartner sample averages?\n\n1. **Heights of adults, bootstrap interval.**\nResearchers studying anthropometry collected body measurements, as well as age, weight, height and gender, for 507 physically active individuals. \nThe histogram below shows the sample distribution of bootstrapped means from 1,000 different bootstrap samples.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](19-inference-one-mean_files/figure-html/unnamed-chunk-33-1.png){width=90%}\n :::\n :::\n\n a. Given the bootstrap sampling distribution for the sample mean, find an approximate value for the standard error of the mean. \n \n b. 
By looking at the bootstrap sampling distribution (1,000 bootstrap samples were taken), find an approximate 90% bootstrap percentile confidence interval for the true average adult height in the population from which the data were randomly sampled. Provide the interval as well as a one-sentence interpretation of the interval.\n\n c. By looking at the bootstrap sampling distribution (1,000 bootstrap samples were taken), find an approximate 90% bootstrap SE confidence interval for the true average adult height in the population from which the data were randomly sampled. Provide the interval as well as a one-sentence interpretation of the interval.\n\n1. **Identify the critical $t$.** \nA random sample is selected from an approximately normal population with unknown standard deviation.\nFind the degrees of freedom and the critical $t$-value (t$^\\star$) for the given sample size and confidence level.\n\n a. $n = 6$, CL = 90%\n\n b. $n = 21$, CL = 98%\n\n c. $n = 29$, CL = 95%\n\n d. $n = 12$, CL = 99%\n\n1. **$t$-distribution.** \nThe figure below shows three unimodal and symmetric curves: the standard normal (z) distribution, the $t$-distribution with 5 degrees of freedom, and the $t$-distribution with 1 degree of freedom. \nDetermine which is which, and explain your reasoning.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](19-inference-one-mean_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n\n1. **Find the p-value, I.** \nA random sample is selected from an approximately normal population with an unknown standard deviation. \nFind the p-value for the given sample size and test statistic. \nAlso determine if the null hypothesis would be rejected at $\\alpha = 0.05$.\n\n a. $n = 11$, $T = 1.91$\n\n b. $n = 17$, $T = -3.45$\n\n c. $n = 7$, $T = 0.83$\n\n d. $n = 28$, $T = 2.13$\n\n1. **Find the p-value, II.** \nA random sample is selected from an approximately normal population with an unknown standard deviation. \nFind the p-value for the given sample size and test statistic. \nAlso determine if the null hypothesis would be rejected at $\\alpha = 0.01$.\n\n a. $n = 26$, $T = 2.485$\n\n b. $n = 18$, $T = 0.5$\n \n \\clearpage\n\n1. **Length of gestation, confidence interval.**\nEvery year, the United States Department of Health and Human Services releases to the public a large dataset containing information on births recorded in the country. \nThis dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. \nIn this exercise we work with a random sample of 1,000 cases from the dataset released in 2014.\nThe length of pregnancy, measured in weeks, is commonly referred to as gestation.\nThe histograms below show the distribution of lengths of gestation from the random sample of 1,000 births (on the left) and the distribution of bootstrapped means of gestation from 1,500 different bootstrap samples (on the right).^[The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] \n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](19-inference-one-mean_files/figure-html/unnamed-chunk-35-1.png){width=100%}\n :::\n :::\n\n a. Given the bootstrap sampling distribution for the sample mean, find an approximate value for the standard error of the mean. \n\n b. 
By looking at the bootstrap sampling distribution (1,500 bootstrap samples were taken), find an approximate 99% bootstrap percentile confidence interval for the true average gestation length in the population from which the data were randomly sampled. Provide the interval as well as a one-sentence interpretation of the interval.\n\n c. By looking at the bootstrap sampling distribution (1,500 bootstrap samples were taken), find an approximate 99% bootstrap SE confidence interval for the true average gestation length in the population from which the data were randomly sampled. Provide the interval as well as a one-sentence interpretation of the interval.\n \n \\clearpage\n\n1. **Length of gestation, hypothesis test.**\nIn this exercise we work with a random sample of 1,000 cases from the dataset released by the United States Department of Health and Human Services in 2014.\nProvided below are sample statistics for gestation (length of pregnancy, measured in weeks) of births in this sample.^[The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] \n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Min Q1 Median Mean Q3 Max SD IQR
21 38 39 38.7 40 46 2.6 2
\n \n `````\n :::\n :::\n\n a. What is the point estimate for the average length of pregnancy for all women? What about the median?\n\n b. You might have heard that human gestation is typically 40 weeks. Using the data, perform a complete hypothesis test, using mathematical models, to assess the 40 week claim. State the null and alternative hypotheses, find the T score, find the p-value, and provide a conclusion in context of the data.\n \n c. A quick internet search validates the claim of \"40 weeks gestation\" for humans. A friend of yours claims that there are different ways to measure gestation (starting at first day of last period, ovulation, or conception) which will result in estimates that are a week or two different. Another friend mentions that recent increases in cesarean births is likely to have decreased length of gestation. Do the data provide a mechanism to distinguish between your two friends' claims?\n\n1. **Interpreting confidence intervals for population mean.**\nFor each of the following statements, indicate if they are a true or false interpretation of the confidence interval.\nIf false, provide a reason or correction to the misinterpretation.\nYou collect a large sample and calculate a 95% confidence interval for the average number of cans of sodas consumed annually per adult in the US to be (440 cans, 520 cans), i.e., on average, adults in the US consume just under two cans of soda per day.\n\n a. 95% of adults in the US consume between 440 and 520 cans of soda per year.\n\n b. There is a 95% probability that the true population average per adult yearly soda consumption is between 440 and 520 cans.\n\n c. The true population average per adult yearly soda consumption is between 440 and 520 cans, with 95% confidence.\n\n d. The average soda consumption of the people who were sampled is between 440 and 520 cans of soda per year, with 95% confidence.\n\n1. **Interpreting p-values for population mean.**\nFor each of the following statements, indicate if they are a true or false interpretation of the p-value.\nIf false, provide a reason or correction to the misinterpretation.\nYou are wondering if the average amount of cereal in a 10oz cereal box is greater than 10oz. You collect 50 boxes of cereal, weigh them carefully, find a T score, and a p-value of 0.23.\n\n a. The probability that the average weight of all cereal boxes is 10 oz is 0.23.\n\n b. The probability that the average weight of all cereal boxes is greater than 10 oz is 0.23.\n\n c. Because the p-value is 0.23, the average weight of all cereal boxes is 10 oz.\n\n d. Because the p-value is small, the population average must be just barely above 10 oz (small effect).\n\n e. If $H_0$ is true, the probability of observing another sample with an average as or more extreme as the data is 0.23.\n \n \\clearpage\n\n1. **Working backwards, I.** \nA 95% confidence interval for a population mean, $\\mu$, is given as (18.985, 21.015). \nThe population distribution is approximately normal and the population standard deviation is unknown. \nThis confidence interval is based on a simple random sample of 36 observations. \nCalculate the sample mean, the margin of error, and the sample standard deviation.\nAssume that all conditions necessary for inference are satisfied. \nUse the $t$-distribution in any calculations.\n\n1. **Working backwards, II.** \nA 90% confidence interval for a population mean is (65, 77). \nThe population distribution is approximately normal and the population standard deviation is unknown. 
\nThis confidence interval is based on a simple random sample of 25 observations. \nCalculate the sample mean, the margin of error, and the sample standard deviation.\nAssume that all conditions necessary for inference are satisfied. \nUse the $t$-distribution in any calculations.\n\n1. **Sleep habits of New Yorkers.** \nNew York is known as \"the city that never sleeps\". \nA random sample of 25 New Yorkers were asked how much sleep they get per night. \nStatistical summaries of these data are shown below. \nThe point estimate suggests New Yorkers sleep less than 8 hours a night on average. \nEvaluate the claim that New York is the city that never sleeps keeping in mind that, despite this claim, the true average number of hours New Yorkers sleep could be less than 8 hours or more than 8 hours.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
n Mean SD Min Max
25 7.73 0.77 6.17 9.78
\n \n `````\n :::\n :::\n\n a. Write the hypotheses in symbols and in words.\n\n b. Check conditions, then calculate the test statistic, $T$, and the associated degrees of freedom.\n\n c. Find and interpret the p-value in this context. Drawing a picture may be helpful.\n\n d. What is the conclusion of the hypothesis test?\n\n e. If you were to construct a 90% confidence interval that corresponded to this hypothesis test, would you expect 8 hours to be in the interval?\n\n1. **Find the mean.** \nYou are given the hypotheses shown below.\nWe know that the sample standard deviation is 8 and the sample size is 20. \nFor what sample mean would the p-value be equal to 0.05? \nAssume that all conditions necessary for inference are satisfied.\n\n $$H_0: \\mu = 60 \\quad \\quad H_A: \\mu \\neq 60$$\n\n1. **$t^\\star$ for the correct confidence level.** \nAs you've seen, the tails of a $t-$distribution are longer than the standard normal which results in $t^{\\star}_{df}$ being larger than $z^{\\star}$ for any given confidence level. When finding a CI for a population mean, explain how mistakenly using $z^{\\star}$ (instead of the correct $t^{*}_{df}$) would affect the confidence level.\n\n1. **Possible bootstrap samples.**\nConsider a simple random sample of the following observations: 47, 4, 92, 47, 12, 8.\nWhich of the following could be a possible bootstrap samples from the observed data above?\nIf the set of values could not be a bootstrap sample, indicate why not.\n\n a. 47, 47, 47, 47, 47, 47\n \n b. 92, 4, 13, 8, 47, 4\n \n c. 92, 47, 12\n \n d. 8, 47, 12, 12, 8, 4, 92\n \n e. 12, 4, 8, 8, 92, 12\n \n \\clearpage\n\n1. **Play the piano.** \nGeorgianna claims that in a small city renowned for its music school, the average child takes less than 5 years of piano lessons. \nWe have a random sample of 20 children from the city, with a mean of 4.6 years of piano lessons and a standard deviation of 2.2 years.\n\n a. Evaluate Georgianna's claim (or that the opposite might be true) using a hypothesis test.\n\n b. Construct a 95% confidence interval for the number of years students in this city take piano lessons, and interpret it in context of the data.\n\n c. Do your results from the hypothesis test and the confidence interval agree? Explain your reasoning.\n\n1. **Auto exhaust and lead exposure.** \nResearchers interested in lead exposure due to car exhaust sampled the blood of 52 police officers subjected to constant inhalation of automobile exhaust fumes while working traffic enforcement in a primarily urban environment. \nThe blood samples of these officers had an average lead concentration of 124.32 $\\mu$g/l and a SD of 37.74 $\\mu$g/l; a previous study of individuals from a nearby suburb, with no history of exposure, found an average blood level concentration of 35 $\\mu$g/l. [@Mortada:2000]\n\n a. Write down the hypotheses that would be appropriate for testing if the police officers appear to have been exposed to a different concentration of lead.\n\n b. Explicitly state and check all conditions necessary for inference on these data.\n\n c. Test the hypothesis that the downtown police officers have a higher lead exposure than the group in the previous study. 
Interpret your results in context.\n\n\n:::\n", + "supporting": [ + "19-inference-one-mean_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/19-inference-one-mean/figure-html/carsbsmean-1.png b/_freeze/19-inference-one-mean/figure-html/carsbsmean-1.png new file mode 100644 index 00000000..203a130b Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/carsbsmean-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/carsbssd-1.png b/_freeze/19-inference-one-mean/figure-html/carsbssd-1.png new file mode 100644 index 00000000..70e27646 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/carsbssd-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/outliersandsscondition-1.png b/_freeze/19-inference-one-mean/figure-html/outliersandsscondition-1.png new file mode 100644 index 00000000..f74d7d03 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/outliersandsscondition-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/tDistCompareToNormalDist-1.png b/_freeze/19-inference-one-mean/figure-html/tDistCompareToNormalDist-1.png new file mode 100644 index 00000000..ac2fda55 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/tDistCompareToNormalDist-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/tDistConvergeToNormalDist-1.png b/_freeze/19-inference-one-mean/figure-html/tDistConvergeToNormalDist-1.png new file mode 100644 index 00000000..d66a8a13 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/tDistConvergeToNormalDist-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/tDistDF18LeftTail2Point10-1.png b/_freeze/19-inference-one-mean/figure-html/tDistDF18LeftTail2Point10-1.png new file mode 100644 index 00000000..32d079a8 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/tDistDF18LeftTail2Point10-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/tDistDF20RightTail1Point65-1.png b/_freeze/19-inference-one-mean/figure-html/tDistDF20RightTail1Point65-1.png new file mode 100644 index 00000000..daaf5759 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/tDistDF20RightTail1Point65-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/tDistDF23UnitsFromMean-1.png b/_freeze/19-inference-one-mean/figure-html/tDistDF23UnitsFromMean-1.png new file mode 100644 index 00000000..b542392a Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/tDistDF23UnitsFromMean-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-28-1.png b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 00000000..70abab32 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-32-1.png b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 00000000..a251469a Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-32-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-33-1.png b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 00000000..7b31cb67 Binary files /dev/null 
and b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-33-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-34-1.png b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 00000000..1a3cd2b7 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-34-1.png differ diff --git a/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-35-1.png b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-35-1.png new file mode 100644 index 00000000..981cecb3 Binary files /dev/null and b/_freeze/19-inference-one-mean/figure-html/unnamed-chunk-35-1.png differ diff --git a/_freeze/20-inference-two-means/execute-results/html.json b/_freeze/20-inference-two-means/execute-results/html.json new file mode 100644 index 00000000..929a2630 --- /dev/null +++ b/_freeze/20-inference-two-means/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "f3c2b0094215909294e8e7753472caaa", + "result": { + "markdown": "# Inference for comparing two independent means {#inference-two-means}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nWe now extend the methods from Chapter \\@ref(inference-one-mean) to apply confidence intervals and hypothesis tests to differences in population means that come from two groups, Group 1 and Group 2: $\\mu_1 - \\mu_2.$\n\nIn our investigations, we'll identify a reasonable point estimate of $\\mu_1 - \\mu_2$ based on the sample, and you may have already guessed its form: $\\bar{x}_1 - \\bar{x}_2.$ \\index{point estimate!difference of means} Then we'll look at the inferential analysis in three different ways: using a randomization test, applying bootstrapping for interval estimates, and, if we verify that the point estimate can be modeled using a normal distribution, we compute the estimate's standard error and apply the mathematical framework.\n:::\n\n\n\n\n\nIn this section we consider a difference in two population means, $\\mu_1 - \\mu_2,$ under the condition that the data are not paired.\nJust as with a single sample, we identify conditions to ensure we can use the $t$-distribution with a point estimate of the difference, $\\bar{x}_1 - \\bar{x}_2,$ and a new standard error formula.\n\nThe details for working through inferential problems in the two independent means setting are strikingly similar to those applied to the two independent proportions setting.\nWe first cover a randomization test where the observations are shuffled under the assumption that the null hypothesis is true.\nThen we bootstrap the data (with no imposed null hypothesis) to create a confidence interval for the true difference in population means, $\\mu_1 - \\mu_2.$ The mathematical model, here the $t$-distribution, is able to describe both the randomization test and the bootstrapping as long as the conditions are met.\n\nThe inferential tools are applied to three different data contexts: determining whether stem cells can improve heart function, exploring the relationship between pregnant women's smoking habits and birth weights of newborns, and exploring whether there is convincing evidence that one variation of an exam is harder than another variation.\nThis section is motivated by questions like \"Is there convincing evidence that newborns from mothers who smoke have a different average birth weight than newborns from mothers who do not smoke?\"\n\n## Randomization test for the difference in means {#rand2mean}\n\nAn instructor decided to run two slight variations of the same exam.\nPrior to passing out the 
exams, they shuffled the exams together to ensure each student received a random version.\nAnticipating complaints from students who took Version B, they would like to evaluate whether the difference observed in the groups is so large that it provides convincing evidence that Version B was more difficult (on average) than Version A.\n\n::: {.data data-latex=\"\"}\nThe [`classdata`](http://openintrostat.github.io/openintro/reference/classdata.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n### Observed data\n\nSummary statistics for how students performed on these two exams are shown in Table \\@ref(tab:summaryStatsForTwoVersionsOfExams) and plotted in Figure \\@ref(fig:boxplotTwoVersionsOfExams).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary statistics of scores for each exam version.
Group n Mean SD Min Max
A 58 75.1 13.9 44 100
B 55 72.0 13.8 38 100
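\n\nThe rest of this section describes the randomization approach: shuffle the version labels across the observed scores, recompute the difference in means, and repeat many times. As a rough sketch of that procedure in R (the data frame `exams` and its columns `score` and `version` are placeholder names, not the exact structure of the `classdata` object; results will vary with the random seed):\n\n::: {.cell}\n\n```{.r .cell-code}\n# one randomization test for a difference in means\nset.seed(47)\nobs_diff <- mean(exams$score[exams$version == \"A\"]) -\n  mean(exams$score[exams$version == \"B\"])\n\nshuffled_diffs <- replicate(1000, {\n  shuffled <- sample(exams$version)   # reallocate the version labels at random\n  mean(exams$score[shuffled == \"A\"]) - mean(exams$score[shuffled == \"B\"])\n})\n\n# double the upper tail proportion to get a two-sided p-value\n2 * mean(shuffled_diffs >= obs_diff)\n```\n:::\n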
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Exam scores for students given one of three different exams.](20-inference-two-means_files/figure-html/boxplotTwoVersionsOfExams-1.png){width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nConstruct hypotheses to evaluate whether the observed difference in sample means, $\\bar{x}_A - \\bar{x}_B=3.1,$ is likely to have happened due to chance, if the null hypothesis is true.\nWe will later evaluate these hypotheses using $\\alpha = 0.01.$[^20-inference-two-means-1]\n:::\n\n[^20-inference-two-means-1]: $H_0:$ the exams are equally difficult, on average.\n $\\mu_A - \\mu_B = 0.$ $H_A:$ one exam was more difficult than the other, on average.\n $\\mu_A - \\mu_B \\neq 0.$\n\n::: {.guidedpractice data-latex=\"\"}\nBefore moving on to evaluate the hypotheses in the previous Guided Practice, let's think carefully about the dataset.\nAre the observations across the two groups independent?\nAre there any concerns about outliers?[^20-inference-two-means-2]\n:::\n\n[^20-inference-two-means-2]: Since the exams were shuffled, the \"treatment\" in this case was randomly assigned, so independence within and between groups is satisfied.\n The summary statistics suggest the data are roughly symmetric about the mean, and the min/max values do not suggest any outliers of concern.\n\n### Variability of the statistic\n\nIn Section \\@ref(foundations-randomization), the variability of the statistic (previously: $\\hat{p}_1 - \\hat{p}_2)$ was visualized after shuffling the observations across the two treatment groups many times.\nThe shuffling process implements the null hypothesis model (that there is no effect of the treatment).\nIn the exam example, the null hypothesis is that exam A and exam B are equally difficult, so the average scores across the two tests should be the same.\nIf the exams were equally difficult, *due to natural variability*, we would sometimes expect students to do slightly better on exam A $(\\bar{x}_A > \\bar{x}_B)$ and sometimes expect students to do slightly better on exam B $(\\bar{x}_B > \\bar{x}_A).$ The question at hand is: does $\\bar{x}_A - \\bar{x}_B=3.1$ indicate that exam A is easier than exam B.\n\nFigure \\@ref(fig:rand2means) shows the process of randomizing the exam to the observed exam scores.\nIf the null hypothesis is true, then the score on each exam should represent the true student ability on that material.\nIt shouldn't matter whether they were given exam A or exam B.\nBy reallocating which student got which exam, we are able to understand how the difference in average exam scores changes due only to natural variability.\nThere is only one iteration of the randomization process in Figure \\@ref(fig:rand2means), leading to one simulated difference in average scores.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The version of the test (A or B) is randomly allocated to the test scores, under the null assumption that the tests are equally difficult.](images/rand2means.png){fig-alt='Four panels representing four different orientations of a toy dataset of 9 exam scores. The first panel provides the observed data; 4 of the exams were version A and the average score was 77.25; 5 of the exams were version B and the average score was 75.8, a difference of 1.45. The second panel shows the shuffled reassignment of the exam versions (4 of the scores are randomly reassigned to A, 5 of the scores are randomly reassigned to B). 
The third panel shows which score is connected with which new reassigned version of the exam. And the fourth panel sorts the exams so that version A exams are together and version B exams are together. In the randomly reassigned versions, the average score for version A is 74.25 and the average score for version B is 78.2, a difference of -3.95.' width=75%}\n:::\n:::\n\n\nBuilding on Figure \\@ref(fig:rand2means), Figure \\@ref(fig:randexams) shows the values of the simulated statistics $\\bar{x}_{1, sim} - \\bar{x}_{2, sim}$ over 1,000 random simulations.\nWe see that, just by chance, the difference in scores can range anywhere from -10 points to +10 points.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of differences in means, calculated from 1,000 different randomizations of the exam types.](20-inference-two-means_files/figure-html/randexams-1.png){width=90%}\n:::\n:::\n\n\n### Observed statistic vs. null statistics\n\nThe goal of the randomization test is to assess the observed data, here the statistic of interest is $\\bar{x}_A - \\bar{x}_B=3.1.$ The randomization distribution allows us to identify whether a difference of 3.1 points is more than one would expect by natural variability of the scores if the two tests were equally difficult.\nBy plotting the value of 3.1 on Figure \\@ref(fig:randexamspval), we can measure how different or similar 3.1 is to the randomized differences which were generated under the null hypothesis.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of differences in means, calculated from 1,000 different randomizations of the exam types. The observed difference of 3.1 points is plotted as a vertical line, and the area more extreme than 3.1 is shaded to represent the p-value.](20-inference-two-means_files/figure-html/randexamspval-1.png){width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nApproximate the p-value depicted in Figure \\@ref(fig:randexamspval), and provide a conclusion in the context of the case study.\n\n------------------------------------------------------------------------\n\nUsing software, we can find the number of shuffled differences in means that are less than the observed difference (of 3.14) is 900 (out of 1,000 randomizations).\nSo 10% of the simulations are larger than the observed difference.\nTo get the p-value, we double the proportion of randomized differences which are larger than the observed difference, p-value = 0.2.\n\nPreviously, we specified that we would use $\\alpha = 0.01.$ Since the p-value is larger than $\\alpha,$ we do not reject the null hypothesis.\nThat is, the data do not convincingly show that one exam version is more difficult than the other, and the teacher should not be convinced that they should add points to the Version B exam scores.\n:::\n\n\n::: {.cell}\n\n:::\n\n\nThe large p-value and consistency of $\\bar{x}_A - \\bar{x}_B=3.1$ with the randomized differences leads us to *not reject the null hypothesis*.\nSaid differently, there is no evidence to think that one of the tests is easier than the other.\nOne might be inclined to conclude that the tests have the same level of difficulty, but that conclusion would be wrong.\nThe hypothesis testing framework is set up only to reject a null claim, it is not set up to validate a null claim.\nAs we concluded, the data are consistent with exams A and B being equally difficult, but the data are also consistent with exam A being 3.1 points \"easier\" than exam B.\nThe data are not able to adjudicate on whether the exams are 
equally hard or whether one of them is slightly easier.\nIndeed, conclusions where the null hypothesis is not rejected often seem unsatisfactory.\nHowever, in this case, the teacher and class are probably all relieved that there is no evidence to demonstrate that one of the exams is more difficult than the other.\n\n## Bootstrap confidence interval for the difference in means\n\nBefore providing a full example working through a bootstrap analysis on actual data, we return to the fictional Awesome Auto example as a way to visualize the two sample bootstrap setting.\nConsider an expanded scenario where the research question centers on comparing the average price of a car at one Awesome Auto franchise (Group 1) to the average price of a car at a different Awesome Auto franchise (Group 2).\nThe process of bootstrapping can be applied to *each* Group separately, and the differences of means recalculated each time.\nFigure \\@ref(fig:bootmeans2means) visually describes the bootstrap process when interest is in a statistic computed on two separate samples.\nThe analysis proceeds as in the one sample case, but now the (single) statistic of interest is the *difference in sample means*.\nThat is, a bootstrap resample is done on each of the groups separately, but the results are combined to have a single bootstrapped difference in means.\nRepetition will produce $k$ bootstrapped differences in means, and the histogram will describe the natural sampling variability associated with the difference in means.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For the two group comparison, the bootstrap resampling is done separately on each group, but the statistic is calculated as a difference. The set of k differences is then analyzed as the statistic of interest with conclusions drawn on the parameter of interest.](images/bootmeans2means.png){fig-alt='Samples are shown as separately coming from two independent, large, unknown populations. Direcly from each of the two observed samples, bootstrap resamples can be taken (with replacement). Bootstrap resample 1 from sample 1 is compared to bootstrap resample 1 from sample 2 by comparing the difference in bootstrapped averages. A histogram of differences in bootstrapped averages displays the differences ranging from roughly -20000 dollars to +10000 dollars.' width=75%}\n:::\n:::\n\n\n### Observed data\n\nDoes treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack?\nTable \\@ref(tab:statsSheepEscStudy) contains summary statistics for an experiment to test ESCs in sheep that had a heart attack.\nEach of these sheep was randomly assigned to the ESC or control group, and the change in their hearts' pumping capacity was measured in the study.\n[@Menard:2005] Figure \\@ref(fig:stem-cell-histograms) provides histograms of the two datasets.\nA positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery.\nOur goal will be to identify a 95% confidence interval for the effect of ESCs on the change in heart pumping capacity relative to the control group.\n\n::: {.data data-latex=\"\"}\nThe [`stem_cell`](http://openintrostat.github.io/openintro/reference/stem_cell.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary statistics of the change in heart pumping capacity in the embryonic stem cell study.
Group n Mean SD
ESC 9 3.50 5.17
Control 9 -4.33 2.76
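\n\nFollowing the separate-resampling scheme described above (one bootstrap resample from each group, then take the difference in means), a bootstrap distribution for this study can be sketched in R. The vectors `esc` and `control` are placeholders for the nine pumping-capacity changes in each group, not the exact object names in the `stem_cell` data:\n\n::: {.cell}\n\n```{.r .cell-code}\n# bootstrap the difference in means, resampling each group separately\nset.seed(25)\nboot_diffs <- replicate(1000, {\n  mean(sample(esc, replace = TRUE)) - mean(sample(control, replace = TRUE))\n})\n\nquantile(boot_diffs, probs = c(0.05, 0.95))  # 90% bootstrap percentile interval\nsd(boot_diffs)                               # bootstrap estimate of the standard error\n```\n:::\n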
\n\n`````\n:::\n:::\n\n\nThe point estimate of the difference in the heart pumping variable is straightforward to find: it is the difference in the sample means.\n\n$$\\bar{x}_{esc} - \\bar{x}_{control}\\ =\\ 3.50 - (-4.33)\\ =\\ 7.83$$\n\n### Variability of the statistic\n\nAs we saw in Section \\@ref(two-prop-boot-ci), we will use bootstrapping to estimate the variability associated with the difference in sample means when taking repeated samples.\nIn a method akin to two proportions, a *separate* sample is taken with replacement from each group (here ESCs and control), the sample means are calculated, and their difference is taken.\nThe entire process is repeated multiple times to produce a bootstrap distribution of the difference in sample means (*without* the null hypothesis assumption).\n\nFigure \\@ref(fig:bootexamsci) displays the variability of the differences in means with the 90% percentile and SE CIs super imposed.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of differences in means after 1,000 bootstrap samples from each of the two groups. The observed difference is plotted as a black vertical line at 7.83. The blue dashed and red dotted lines provide the bootstrap percentile and boostrap SE confidence intervals, respectively, for the difference in true population means.](20-inference-two-means_files/figure-html/bootexamsci-1.png){width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"} Using the histogram of bootstrapped difference in means, estimate the standard error of the differences in sample means, $\\bar{x}_{ESC} - \\bar{x}_{Control}.$[^20-inference-two-means-3] :::\n\n[^20-inference-two-means-3]: The point estimate of the population difference ($\\bar{x}_{ESC} - \\bar{x}_{Control}$) is 7.83.\n\n::: {.workedexample data-latex=\"\"}\nChoose one of the bootstrap confidence intervals for the true difference in average pumping capacity, $\\mu_{ESC} - \\mu_{Control}.$ Does the interval show that there is a difference across the two treatments?\n\n------------------------------------------------------------------------\n\nBecause neither of the 90% intervals (either percentile or SE) above overlap zero (note that zero is never one of the bootstrapped differences so 95% and 99% intervals would have given the same conclusion!), we conclude that the ESC treatment is substantially better with respect to heart pumping capacity than the treatment.\n\nBecause the study is a randomized controlled experiment, we can conclude that it is the treatment (ESC) which is causing the change in pumping capacity.\n:::\n\n## Mathematical model for testing the difference in means {#math2samp}\n\nEvery year, the US releases to the public a large data set containing information on births recorded in the country.\nThis data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children.\nWe will work with a random sample of 1,000 cases from the data set released in 2014.\n\n::: {.data data-latex=\"\"}\nThe [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n### Observed data\n\nFour cases from this dataset are represented in Table \\@ref(tab:babySmokeDF).\nWe are particularly interested in two variables: `weight` and `smoke`.\nThe `weight` variable represents the weights of the newborns and the `smoke` variable describes which mothers smoked during 
pregnancy.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Four cases from the `births14` dataset. The empty cells indicate missing data.
fage mage weeks visits gained weight sex habit
34 34 37 14 28 6.96 male nonsmoker
36 31 41 12 41 8.86 female nonsmoker
37 36 37 10 28 7.51 female nonsmoker
16 38 29 6.19 male nonsmoker
\n\n`````\n:::\n:::\n\n\nWe would like to know, is there convincing evidence that newborns from mothers who smoke have a different average birth weight than newborns from mothers who do not smoke?\nWe will use data from this sample to try to answer this question.\n\n::: {.workedexample data-latex=\"\"}\nSet up appropriate hypotheses to evaluate whether there is a relationship between a mother smoking and average birth weight.\n\n------------------------------------------------------------------------\n\nThe null hypothesis represents the case of no difference between the groups.\n\n- $H_0:$ There is no difference in average birth weight for newborns from mothers who did and did not smoke. In statistical notation: $\\mu_{n} - \\mu_{s} = 0,$ where $\\mu_{n}$ represents non-smoking mothers and $\\mu_s$ represents mothers who smoked.\n- $H_A:$ There is some difference in average newborn weights from mothers who did and did not smoke $(\\mu_{n} - \\mu_{s} \\neq 0).$\n:::\n\nTable \\@ref(tab:births14-summary-stats) displays sample statistics from the data.\nWe can see that the average birth weight of babies born to smoker moms is lower than those born to nonsmoker moms.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary statistics of birth weight (in pounds) in the `births14` dataset, by the mother's smoking habit.
Habit n Mean SD
nonsmoker 867 7.27 1.23
smoker 114 6.68 1.60
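\n\nThe hypothesis test worked through below uses exactly these summary statistics. As a sketch, the T score and p-value can also be computed directly in R (values typed in from the table above; small differences from the worked example arise from rounding the standard error):\n\n::: {.cell}\n\n```{.r .cell-code}\n# T score and p-value for the difference in mean birth weights\nxbar_n <- 7.27; s_n <- 1.23; n_n <- 867   # nonsmokers\nxbar_s <- 6.68; s_s <- 1.60; n_s <- 114   # smokers\n\nse <- sqrt(s_n^2 / n_n + s_s^2 / n_s)       # about 0.16\nt_score <- (xbar_n - xbar_s) / se           # about 3.8 (3.69 with SE rounded to 0.16)\n2 * (1 - pt(t_score, df = min(n_n - 1, n_s - 1)))  # two-sided p-value, well below 0.05\n```\n:::\n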
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nWe check the two conditions necessary to model the difference in sample means using the $t$-distribution.\n\n- Because the data come from a simple random sample, the observations are independent, both within and between samples.\n- With both groups over 30 observations, we inspect the data in Figure \\@ref(fig:babySmokePlotOfTwoGroupsToExamineSkew) for any particularly extreme outliers and find none.\n\nSince both conditions are satisfied, the difference in sample means may be modeled using a $t$-distribution.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The top panel represents birth weights for infants whose mothers smoked during pregnancy. The bottom panel represents the birth weights for infants whose mothers who did not smoke during pregnancy.](20-inference-two-means_files/figure-html/babySmokePlotOfTwoGroupsToExamineSkew-1.png){width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nThe summary statistics in Table \\@ref(tab:births14-summary-stats) may be useful for this Guided Practice.\nWhat is the point estimate of the population difference, $\\mu_{n} - \\mu_{s}$?[^20-inference-two-means-4]\n:::\n\n[^20-inference-two-means-4]: The point estimate of the population difference ($\\bar{x}_{n} - \\bar{x}_{s}$) is 0.59.\n\n### Observed statistic vs. null statistics\n\n::: {.important data-latex=\"\"}\n**The test statistic for comparing two means is a T.**\n\nThe T score is a ratio of how the groups differ as compared to how the observations within a group vary.\n\n$$T = \\frac{(\\bar{x}_1 - \\bar{x}_2) - 0}{\\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$\n\nWhen the null hypothesis is true and the conditions are met, T has a t-distribution with $df = min(n_1 - 1, n_2 -1).$\n\nConditions:\n\n- Independent observations within and between groups.\n- Large samples and no extreme outliers.\n:::\n\n\n\n\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the standard error of the point estimate for the average difference between the weights of babies born to nonsmoker and smoker mothers.[^20-inference-two-means-5]\n:::\n\n[^20-inference-two-means-5]: $SE(\\bar{x}_{n} - \\bar{x}_{s}) = \\sqrt{s^2_{n}/ n_{n} + s^2_{s}/n_{s}}\\\\ = \\sqrt{1.23^2/867 + 1.60^2/114} = 0.16$\n\n::: {.workedexample data-latex=\"\"}\nComplete the hypothesis test started in the previous Example and Guided Practice on `births14` dataset and research question.\nUse a significance level of $\\alpha=0.05.$ For reference, $\\bar{x}_{n} - \\bar{x}_{s} = 0.59,$ $SE = 0.16,$ and the sample sizes were $n_n = 867$ and $n_s = 114.$\n\n------------------------------------------------------------------------\n\nWe can find the test statistic for this test using the previous information:\n\n$$T = \\frac{\\ 0.59 - 0\\ }{0.16} = 3.69$$\n\nWe find the single tail area using software.\nWe'll use the smaller of $n_n - 1 = 866$ and $n_s - 1 = 113$ as the degrees of freedom: $df = 113.$ The one tail area is roughly 0.00017; doubling this value gives the two-tail area and p-value, 0.00034.\n\nThe p-value is much smaller than the significance value, 0.05, so we reject the null hypothesis.\nThe data provide is convincing evidence of a difference in the average weights of babies born to mothers who smoked during pregnancy and those who did not.\n:::\n\nThis result is likely not surprising.\nWe all know that smoking is bad for you and you've probably also heard that smoking during pregnancy is not just bad for the mother but also for the baby as well.\nIn fact, some in the tobacco industry 
actually had the audacity to tout that as a *benefit* of smoking:\n\n> *It's true. The babies born from women who smoke are smaller, but they're just as healthy as the babies born from women who do not smoke. And some women would prefer having smaller babies.* - Joseph Cullman, Philip Morris' Chairman of the Board on CBS' *Face the Nation*, Jan 3, 1971\n\nFurthermore, health differences between babies born to mothers who smoke and those who do not are not limited to weight differences.[^20-inference-two-means-6]\n\n[^20-inference-two-means-6]: You can watch an episode of John Oliver on [*Last Week Tonight*](youtu.be/6UsHHOCH4q8) to explore the present day offenses of the tobacco industry.\n Please be aware that there is some adult language.\n\n## Mathematical model for estimating the difference in means\n\n### Observed data\n\nAs with hypothesis testing, for the question of whether we can model the difference using a $t$-distribution, we'll need to check new conditions.\nLike the 2-proportion cases, we will require a more robust version of independence so we are confident the two groups are also independent.\nSecondly, we also check for normality in each group separately, which in practice is a check for outliers.\n\n\n\n\n\n\\index{point estimate}\n\n::: {.important data-latex=\"\"}\n**Using the** $t$**-distribution for a difference in means.**\n\nThe $t$-distribution can be used for inference when working with the standardized difference of two means if\n\n- *Independence* (extended). The data are independent within and between the two groups, e.g., the data come from independent random samples or from a randomized experiment.\n- *Normality*. We check the outliers for each group separately.\n\nThe standard error may be computed as\n\n$$SE = \\sqrt{\\frac{\\sigma_1^2}{n_1} + \\frac{\\sigma_2^2}{n_2}}$$\n\nThe official formula for the degrees of freedom is quite complex and is generally computed using software, so instead you may use the smaller of $n_1 - 1$ and $n_2 - 1$ for the degrees of freedom if software isn't readily available.\n:::\n\nRecall that the margin of error is defined by the standard error.\nThe margin of error for $\\bar{x}_1 - \\bar{x}_2$ can be directly obtained from $SE(\\bar{x}_1 - \\bar{x}_2).$\n\n::: {.important data-latex=\"\"}\n**Margin of error for** $\\bar{x}_1 - \\bar{x}_2.$\n\nThe margin of error is $t^\\star_{df} \\times \\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}$ where $t^\\star_{df}$ is calculated from a specified percentile on the t-distribution with *df* degrees of freedom.\n:::\n\n\n\n\n\n\\index{standard error (SE)!difference in means}\n\n### Variability of the statistic\n\n::: {.workedexample data-latex=\"\"}\nCan the $t$-distribution be used to make inference using the point estimate, $\\bar{x}_{esc} - \\bar{x}_{control} = 7.83$?\n\n------------------------------------------------------------------------\n\nFirst, we check for independence.\nBecause the sheep were randomized into the groups, independence within and between groups is satisfied.\n\nFigure \\@ref(fig:stem-cell-histograms) does not reveal any clear outliers in either group.\n(The ESC group does look a bit more variable, but this is not the same as having clear outliers.)\n\nWith both conditions met, we can use the $t$-distribution to model the difference of sample means.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histograms for the difference in heart pumping function after a heart attack for both the treatment group (ESC, which received an the embryonic stem cell 
treatment) and the control group (which did not receive the treatment).](20-inference-two-means_files/figure-html/stem-cell-histograms-1.png){width=90%}\n:::\n:::\n\n\nGenerally, we use statistical software to find the appropriate degrees of freedom, or if software isn't available, we can use the smaller of $n_1 - 1$ and $n_2 - 1$ for the degrees of freedom, e.g., if using a $t$-table to find tail areas.\nFor transparency in the Examples and Guided Practice, we'll use the latter approach for finding $df$; in the case of the ESC example, this means we'll use $df = 8.$\n\n::: {.workedexample data-latex=\"\"}\nCalculate a 95% confidence interval for the effect of ESCs on the change in heart pumping capacity of sheep after they've suffered a heart attack.\n\n------------------------------------------------------------------------\n\nWe will use the sample difference and the standard error that we computed earlier:\n\n$$\n\\begin{aligned}\n\\bar{x}_{esc} - \\bar{x}_{control} &= 7.83 \\\\\nSE &= \\sqrt{\\frac{5.17^2}{9} + \\frac{2.76^2}{9}} = 1.95\n\\end{aligned}\n$$\n\nUsing $df = 8,$ we can identify the critical value of $t^{\\star}_{8} = 2.31$ for a 95% confidence interval.\nFinally, we can enter the values into the confidence interval formula:\n\n$$\n\\begin{aligned}\n\\text{point estimate} \\ &\\pm\\ t^{\\star} \\times SE \\\\\n7.83 \\ &\\pm\\ 2.31\\times 1.95 \\\\\n(3.32 \\ &, \\ 12.34)\n\\end{aligned} \n$$\n\nWe are 95% confident that the heart pumping function in sheep that received embryonic stem cells is between 3.32% and 12.34% higher than for sheep that did not receive the stem cell treatment.\n:::\n\n\\clearpage\n\n## Chapter review {#chp20-review}\n\n### Summary\n\nIn this chapter we extended the single mean inferential methods to questions of differences in means.\nYou may have seen parallels from the chapters that extended a single proportion (Chapter \\@ref(inference-one-prop)) to differences in proportions (Chapter \\@ref(inference-two-props)).\nWhen considering differences in sample means (indeed, when considering many quantitative statistics), we use the t-distribution to describe the sampling distribution of the T score (the standardized difference in sample means).\nIdeas of confidence level and type of error which might occur from a hypothesis test conclusion are similar to those seen in other chapters (see Section \\@ref(decerr)).\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n
difference in means SE difference in means t-CI
point estimate T score t-test
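\n\n`````\n:::\n:::\n\n\nAs a computational companion to this chapter's mathematical model, the short base R sketch below reproduces the two-sample T calculation from the `births14` summary statistics alone ($\bar{x}_{n} - \bar{x}_{s} = 0.59,$ $s_n = 1.23,$ $n_n = 867,$ $s_s = 1.60,$ $n_s = 114$). It is an illustrative sketch only: the variable names are arbitrary, and the results differ slightly from the rounded, by-hand values reported in the text (which rounds the standard error to 0.16 before computing $T$).\n\n```r\n# Two-sample T score from summary statistics (births14 example)\nxbar_diff <- 0.59          # observed difference in sample means (nonsmoker - smoker)\ns_n <- 1.23; n_n <- 867    # standard deviation and size of the nonsmoker sample\ns_s <- 1.60; n_s <- 114    # standard deviation and size of the smoker sample\n\nse <- sqrt(s_n^2 / n_n + s_s^2 / n_s)  # standard error of the difference\nt_stat <- (xbar_diff - 0) / se         # T score under H0: mu_n - mu_s = 0\ndf <- min(n_n - 1, n_s - 1)            # conservative degrees of freedom\np_value <- 2 * pt(-abs(t_stat), df)    # two-tailed p-value\n\nc(SE = se, T = t_stat, df = df, p_value = p_value)\n```\n\nWith access to the raw data, `t.test()` would carry out the same test in one step, using a software-computed degrees of freedom.\n\n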
\clearpage\n\n## Exercises {#chp20-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-20].\n\n::: {.exercises data-latex=\"\"}\n1. **Experimental baker.** \nA baker working on perfecting their bagel recipe is experimenting with active dry (AD) and instant (I) yeast. \nThey bake a dozen bagels with each type of yeast and score each bagel on a scale of 1 to 10 on how well the bagels rise.\nThey come up with the following set of hypotheses for evaluating whether there is a difference in the average rise of bagels baked with active dry and instant yeast.\nWhat is wrong with the hypotheses as stated?\n\n $$H_0: \bar{x}_{AD} \leq \bar{x}_{I} \quad \quad H_A: \bar{x}_{AD} > \bar{x}_{I}$$\n\n1. **Fill in the blanks.** \nWe use a \_\_\_ to evaluate if data provide convincing evidence of a difference between two population means and we use a \_\_\_ to estimate this difference.\n\n1. **Diamonds, randomization test.**\nThe prices of diamonds go up as the carat weight increases, but the increase is not smooth. \nFor example, the difference between the size of a 0.99 carat diamond and a 1 carat diamond is undetectable to the naked human eye, but the price of a 1 carat diamond tends to be much higher than the price of a 0.99 carat diamond. \nIn this question we use two random samples of diamonds, 0.99 carats and 1 carat, each sample of size 23, and randomize the carat weight to the price values in order to compare the average prices of the diamonds to a null distribution.\nIn order to be able to compare equivalent units, we first divide the price for each diamond by 100 times its weight in carats. \nThat is, for a 0.99 carat diamond, we divide the price by 99. For a 1 carat diamond, we divide the price by 100. \nThe randomization distribution (with 1,000 repetitions) below describes the null distribution of the difference in sample means (of price per carat) if there really was no difference in the population from which these diamonds came.^[The [`diamonds`](https://ggplot2.tidyverse.org/reference/diamonds.html) data used in this exercise can be found in the [**ggplot2**](http://ggplot2.tidyverse.org/) R package.] [@ggplot2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-21-1.png){width=90%}\n :::\n :::\n\n Using the randomization distribution of the difference in average price per carat (1,000 randomizations were run), conduct a hypothesis test to evaluate if there is a difference between the prices per carat of diamonds that weigh 0.99 carats and diamonds that weigh 1 carat. Note that the observed difference marked on the plot with a vertical line is -12.7. Make sure to state your hypotheses clearly and interpret your results in context of the data. [@ggplot2]\n \n \clearpage\n\n
1. **Lizards running, randomization test.**\nIn order to assess physiological characteristics of common lizards, researchers collected data on top speeds (in m/sec) measured on a laboratory race track for two species of lizards: Western fence lizard (Sceloporus occidentalis) and Sagebrush lizard (Sceloporus graciosus).\nThe original observed difference in lizard speeds is $\bar{x}_{Western fence} - \bar{x}_{Sagebrush} = 0.7 \mbox{m/sec}.$ The histogram below shows the distribution of average differences when speed has been randomly allocated across lizard species 1,000 times.^[The [`lizard_run`](http://openintrostat.github.io/openintro/reference/lizard_run.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Adolph:1987]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-22-1.png){width=90%}\n :::\n :::\n \n Using the randomization distribution, conduct a hypothesis test to evaluate if there is a difference between the average speed of the Western fence lizard as compared to the Sagebrush lizard. Make sure to state your hypotheses clearly and interpret your results in context of the data.\n\n1. **Diamonds, bootstrap interval.**\nWe have data on two random samples of diamonds: one with diamonds that weigh 0.99 carats and one with diamonds that weigh 1 carat. \nEach sample has 23 diamonds.\nProvided below is a histogram of bootstrap differences in means of price per carat of diamonds that weigh 0.99 carats and diamonds that weigh 1 carat.\n[@ggplot2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-23-1.png){width=90%}\n :::\n :::\n \n a. Using the bootstrap distribution, create a (rough) 95% bootstrap percentile confidence interval for the true population difference in prices per carat of diamonds that weigh 0.99 carats and 1 carat. \n \n b. Using the bootstrap distribution, create a (rough) 95% bootstrap SE confidence interval for the true population difference in prices per carat of diamonds that weigh 0.99 carats and 1 carat. Note that the standard error of the bootstrap distribution is 4.64.\n \n \clearpage\n\n1. **Lizards running, bootstrap interval.**\nWe have data on top speeds (in m/sec) measured on a laboratory race track for two species of lizards: Western fence lizard (Sceloporus occidentalis) and Sagebrush lizard (Sceloporus graciosus).\nThe bootstrap distribution below describes the variability of the difference in means captured from 1,000 bootstrap samples of the lizard data. [@Adolph:1987]\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-24-1.png){width=90%}\n :::\n :::\n \n a. Using the bootstrap distribution, create a (rough) 90% bootstrap percentile confidence interval for the true population difference in average top speed (in m/sec) of the Western fence lizard as compared with the Sagebrush lizard.\n \n b. Using the bootstrap distribution, create a (rough) 90% bootstrap SE confidence interval for the true population difference in average top speed (in m/sec) of the Western fence lizard as compared with the Sagebrush lizard.\n\n1. **Weight loss.**\nYou are reading an article in which the researchers have created a 95% confidence interval for the difference in average weight loss for two diets.
\nThey are 95% confident that the true difference in average weight loss over 6 months for the two diets is somewhere between (1 lb, 25 lbs).\nThe authors claim that, \"therefore diet A ($\bar{x}_A$ = 20 lbs average loss) results in a much larger average weight loss as compared to diet B ($\bar{x}_B$ = 7 lbs average loss).\" Comment on the authors' claim.\n\n1. **Possible randomized means.**\nData were collected from two groups (A and B). There were 3 measurements taken on Group A and 2 measurements taken on Group B, as shown in the table below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Group Measurement 1 Measurement 2 Measurement 3
A 1 15 5
B 7 3
\n \n `````\n :::\n :::\n \n If the data are (repeatedly) randomly allocated across the two conditions, provide the following: (1) the values which are assigned to group A, (2) the values which are assigned to group B, and (3) the difference in averages $(\\bar{x}_A - \\bar{x}_B)$ for each of the following: \n\n a. When the randomized difference in averages is as big as possible.\n \n b. When the randomized difference in averages is as small as possible (a big in magnitude negative number).\n \n c. When the randomized difference in averages is as close to zero as possible.\n \n d. When the observed values are randomly assigned to the two groups, to which of the previous parts would you expect the difference in means to fall closest? Explain your reasoning.\n\n1. **Diamonds, mathematical test.** \nWe have data on two random samples of diamonds: one with diamonds that weigh 0.99 carats and one with diamonds that weigh 1 carat. \nEach sample has 23 diamonds.\nSample statistics for the price per carat of diamonds in each sample are provided below.\nConduct a hypothesis test using a mathematical model to evaluate if there is a difference between the prices per carat of diamonds that weigh 0.99 carats and diamonds that weigh 1 carat \nMake sure to state your hypotheses clearly, check relevant conditions, and interpret your results in context of the data. [@ggplot2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Mean SD n
0.99 carats $44.51 $13.32 23
1 carat $57.20 $18.19 23
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-26-1.png){width=90%}\n :::\n :::\n\n1. **A/B testing.** \nA/B testing is a user experience research methodology where two variants of a page are shown to users at random.\nA company wants to evaluate whether users will spend more time, on average, on Page A or Page B using an A/B test.\nTwo user experience designers at the company, Lucie and Müge, are tasked with conducting the analysis of the data collected.\nThey agree on how the null hypothesis should be set: on average, users spend the same amount of time on Page A and Page B.\nLucie believes that Page B will provide a better experience for users and hence wants to use a one-tailed test, Müge believes that a two-tailed test would be a better choice.\nWhich designer do you agree with, and why?\n\n1. **Diamonds, mathematical interval.** \nWe have data on two random samples of diamonds: one with diamonds that weigh 0.99 carats and one with diamonds that weigh 1 carat. \nEach sample has 23 diamonds. \nSample statistics for the price per carat of diamonds in each sample are provided below.\nAssuming that the conditions for conducting inference using a mathematical model are satisfied, construct a 95% confidence interval for the true population difference in prices per carat of diamonds that weigh 0.99 carats and 1 carat. [@ggplot2]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Mean SD n
0.99 carats $44.51 $13.32 23
1 carat $57.20 $18.19 23
\n \n `````\n :::\n :::\n\n1. **True / False: comparing means.**\nDetermine if the following statements are true or false, and explain your reasoning for statements you identify as false.\n\n a. As the degrees of freedom increases, the $t$-distribution approaches normality.\n\n b. If a 95% confidence interval for the difference between two population means contains 0, a 99% confidence interval calculated based on the same two samples will also contain 0.\n \n c. If a 95% confidence interval for the difference between two population means contains 0, a 90% confidence interval calculated based on the same two samples will also contain 0.\n \n \\clearpage\n\n1. **Difference of means.** \nSuppose we will collect two random samples from the following distributions.\nIn each of the parts below, consider the sample means $\\bar{x}_1$ and $\\bar{x}_2$ that we might observe from these two samples.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Mean Standard deviation Sample size
Sample 1 15 20 50
Sample 2 20 10 30
\n \n `````\n :::\n :::\n\n a. What is the associated mean and standard deviation of $\\bar{x}_1$?\n\n b. What is the associated mean and standard deviation of $\\bar{x}_2$?\n\n c. Calculate and interpret the mean and standard deviation associated with the difference in sample means for the two groups, $\\bar{x}_2 - \\bar{x}_1$.\n\n d. How are the standard deviations from parts (a), (b), and (c) related?\n\n1. **Gaming, distracted eating, and intake.** \nA group of researchers who are interested in the possible effects of distracting stimuli during eating, such as an increase or decrease in the amount of food consumption, monitored food intake for a group of 44 patients who were randomized into two equal groups. \nThe treatment group ate lunch while playing solitaire, and the control group ate lunch without any added distractions. \nPatients in the treatment group ate 52.1 grams of biscuits, with a standard deviation of 45.1 grams, and patients in the control group ate 27.1 grams of biscuits, with a standard deviation of 26.4 grams. \nDo these data provide convincing evidence that the average food intake (measured in amount of biscuits consumed) is different for the patients in the treatment group compared to the control group? \nAssume that conditions for conducting inference using mathematical models are satisfied. [@Oldham:2011]\n\n1. **Chicken diet: horsebean vs. linseed.** \nChicken farming is a multi-billion dollar industry, and any methods that increase the growth rate of young chicks can reduce consumer costs while increasing company profits, possibly by millions of dollars. \nAn experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. \nNewly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. \nIn this exercise we consider chicks that were fed horsebean and linseed.\nBelow are some summary statistics from this dataset along with box plots showing the distribution of weights by feed type.^[The [`chickwts`](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/chickwts.html) data used in this exercise can be found in the [**datasets**](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) R package.] [@data:chickwts]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Feed type Mean SD n
horsebean 160.20 38.63 10
linseed 218.75 52.24 12
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-29-1.png){width=90%}\n :::\n :::\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for parts a to d.*\n :::\n \n \\clearpage\n\n a. Describe the distributions of weights of chickens that were fed horsebean and linseed.\n\n b. Do these data provide strong evidence that the average weights of chickens that were fed linseed and horsebean are different? Use a 5% significance level.\n\n c. What type of error might we have committed? Explain.\n\n d. Would your conclusion change if we used $\\alpha = 0.01$?\n\n1. **Fuel efficiency in the city.** \nEach year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. \nBelow are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2021. \nDo these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage?^[The [`epa2021`](http://openintrostat.github.io/openintro/reference/epa2021.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@data:epa2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
CITY Mean SD n
Automatic 17.4 3.44 25
Manual 22.7 4.58 25
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-30-1.png){width=90%}\n :::\n :::\n\n1. **Chicken diet: casein vs. soybean.** \nCasein is a common weight gain supplement for humans. \nDoes it have an effect on chickens? \nAn experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. \nNewly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. \nIn this exercise we consider chicks that were fed casein and soybean.\nAssume that the conditions for conducting inference using mathematical models are met, and using the data provided below, test the hypothesis that the average weight of chickens that were fed casein is different than the average weight of chickens that were fed soybean. \nIf your hypothesis test yields a statistically significant result, discuss whether the higher average weight of chickens can be attributed to the casein diet. [@data:chickwts]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Feed type Mean SD n
casein 323.58 64.43 12
soybean 246.43 54.13 14
\n \n `````\n :::\n :::\n \n \\clearpage\n\n1. **Fuel efficiency on the highway.** \nEach year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. \nBelow are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2021. \nDo these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average highway mileage? [@data:epa2021]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
HIGHWAY Mean SD n
Automatic 23.7 3.90 25
Manual 30.9 5.13 25
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](20-inference-two-means_files/figure-html/unnamed-chunk-32-1.png){width=90%}\n :::\n :::\n\n1. **Gaming, distracted eating, and intake.** \nA group of researchers who are interested in the possible effects of distracting stimuli during eating, such as an increase or decrease in the amount of food consumption, monitored food intake for a group of 44 patients who were randomized into two equal groups. \nThe treatment group ate lunch while playing solitaire, and the control group ate lunch without any added distractions. \nPatients in the treatment group ate 52.1 grams of biscuits, with a standard deviation of 45.1 grams, and patients in the control group ate 27.1 grams of biscuits, with a standard deviation of 26.4 grams. \nDo these data provide convincing evidence that the average food intake (measured in amount of biscuits consumed) is different for the patients in the treatment group compared to the control group? \nAssume that conditions for conducting inference using mathematical models are satisfied. [@Oldham:2011]\n\n1. **Gaming, distracted eating, and recall.** \nA group of researchers who are interested in the possible effects of distracting stimuli during eating, such as an increase or decrease in the amount of food consumption, monitored food intake for a group of 44 patients who were randomized into two equal groups. \nThe 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the food lunch items they ate. \nThe average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. \nThe average number of items recalled by the patients in the control group (no distraction) was 6.1, with a standard deviation of 1.8. \nDo these data provide strong evidence that the average numbers of food items recalled by the patients in the treatment and control groups are different?\nAssume that conditions for conducting inference using mathematical models are satisfied. 
[@Oldham:2011]\n\n\n:::\n", + "supporting": [ + "20-inference-two-means_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/20-inference-two-means/figure-html/babySmokePlotOfTwoGroupsToExamineSkew-1.png b/_freeze/20-inference-two-means/figure-html/babySmokePlotOfTwoGroupsToExamineSkew-1.png new file mode 100644 index 00000000..7df597e0 Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/babySmokePlotOfTwoGroupsToExamineSkew-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/bootexamsci-1.png b/_freeze/20-inference-two-means/figure-html/bootexamsci-1.png new file mode 100644 index 00000000..55729b12 Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/bootexamsci-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/boxplotTwoVersionsOfExams-1.png b/_freeze/20-inference-two-means/figure-html/boxplotTwoVersionsOfExams-1.png new file mode 100644 index 00000000..6c23b6bc Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/boxplotTwoVersionsOfExams-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/randexams-1.png b/_freeze/20-inference-two-means/figure-html/randexams-1.png new file mode 100644 index 00000000..365909a1 Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/randexams-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/randexamspval-1.png b/_freeze/20-inference-two-means/figure-html/randexamspval-1.png new file mode 100644 index 00000000..28fd909a Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/randexamspval-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/stem-cell-histograms-1.png b/_freeze/20-inference-two-means/figure-html/stem-cell-histograms-1.png new file mode 100644 index 00000000..8d1340e2 Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/stem-cell-histograms-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-21-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 00000000..6f0c438e Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-21-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-22-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 00000000..7b203715 Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-22-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-23-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 00000000..a0e2c45d Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-23-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-24-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..eac13d9c Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-24-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-26-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 00000000..743674f2 Binary files /dev/null and 
b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-26-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-29-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 00000000..71260a20 Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-30-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 00000000..78f25cbd Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/20-inference-two-means/figure-html/unnamed-chunk-32-1.png b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 00000000..67684c0f Binary files /dev/null and b/_freeze/20-inference-two-means/figure-html/unnamed-chunk-32-1.png differ diff --git a/_freeze/21-inference-paired-means/execute-results/html.json b/_freeze/21-inference-paired-means/execute-results/html.json new file mode 100644 index 00000000..0a3858fe --- /dev/null +++ b/_freeze/21-inference-paired-means/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "a505c29a785a2bbab5390469e67d3b99", + "result": { + "markdown": "# Inference for comparing paired means {#inference-paired-means}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn Chapter \\@ref(inference-two-means) analysis was done to compare the average population value across two different groups.\nRecall that one of the important conditions in doing a two-sample analysis is that the two groups are independent.\nHere, independence across groups means that knowledge of the observations in one group does not change what we would expect to happen in the other group.\nBut what happens if the groups are **dependent**?\nSometimes dependency is not something that can be addressed through a statistical method.\nHowever, a particular dependency, **pairing**, can be modeled quite effectively using many of the same tools we have already covered in this text.\n:::\n\nPaired data represent a particular type of experimental structure where the analysis is somewhat akin to a one-sample analysis (see Chapter \\@ref(inference-one-mean)) but has other features that resemble a two-sample analysis (see Chapter \\@ref(inference-two-means)).\nAs with a two-sample analysis, quantitative measurements are made on each of two different levels of the explanatory variable.\nHowever, because the observational unit is **paired** across the two groups, the two measurements are subtracted such that only the difference is retained.\nTable \\@ref(tab:pairedexamples) presents some examples of studies where paired designs were implemented.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Examples of studies where a paired design is used to measure the difference in the measurement over two conditions.
Observational unit Comparison groups Measurement Value of interest
Car Smooth Turn vs Quick Spin amount of tire tread after 1,000 miles difference in tread
Textbook UCLA vs Amazon price of new text difference in price
Individual person Pre-course vs Post-course exam score difference in score
\n\n`````\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Paired data.**\n\nTwo sets of observations are *paired* if each observation in one set has a special correspondence or connection with exactly one observation in the other dataset.\n:::\n\nIt is worth noting that if mathematical modeling is chosen as the analysis tool, paired data inference on the difference in measurements will be identical to the one-sample mathematical techniques described in Chapter \\@ref(inference-one-mean).\nHowever, recall from Chapter \\@ref(inference-one-mean) that with pure one-sample data, the computational tools for hypothesis testing are not easy to implement and were not presented (although the bootstrap was presented as a computational approach for constructing a one sample confidence interval).\nWith paired data, the randomization test fits nicely with the structure of the experiment and is presented here.\n\n\n\n\n\n## Randomization test for the mean paired difference\n\nConsider an experiment done to measure whether tire brand Smooth Turn or tire brand Quick Spin has longer tread wear (in cm).\nThat is, after 1,000 miles on a car, which brand of tires has more tread, on average?\n\n### Observed data\n\nThe observed data represent 25 tread measurements (in cm) taken on 25 tires of Smooth Turn and 25 tires of Quick Spin.\nThe study used a total of 25 cars, so on each car, one tire was of Smooth Turn and one was of Quick Spin.\nFigure \\@ref(fig:tiredata) presents the observed data, calculations on tread remaining (in cm).\n\nThe Smooth Turn manufacturer looks at the box plot and says:\n\n> *Clearly the tread on Smooth Turn tires is higher, on average, than the tread on Quick Spin tires after 1,000 miles of driving.*\n\nThe Quick Spin manufacturer is skeptical and retorts:\n\n> *But with only 25 cars, it seems that the variability in road conditions (sometimes one tire hits a pothole, etc.) could be what leads to the small difference in average tread amount.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Boxplots of the tire tread data (in cm) and the brand of tire from which the original measurements came.](21-inference-paired-means_files/figure-html/tiredata-1.png){width=90%}\n:::\n:::\n\n\nWe'd like to be able to systematically distinguish between what the Smooth Turn manufacturer sees in the plot and what the Quick Spin manufacturer sees in the plot.\nFortunately for us, we have an excellent way to simulate the natural variability (from road conditions, etc.) 
that can lead to tires being worn at different rates.\n\n### Variability of the statistic\n\nA randomization test will identify whether the differences seen in the box plot of the original data in Figure \\@ref(fig:tiredata) could have happened just by chance variability.\nAs before, we will simulate the variability in the study under the assumption that the null hypothesis is true.\nIn this study, the null hypothesis is that average tire tread wear is the same across Smooth Turn and Quick Spin tires.\n\n- $H_0: \\mu_{diff} = 0,$ the average tread wear is the same for the two tire brands.\n- $H_A: \\mu_{diff} \\ne 0,$ the average tread wear is different across the two tire brands.\n\nWhen observations are paired, the randomization process randomly assigns the tire brand to each of the observed tread values.\nNote that in the randomization test for the two-sample mean setting (see Section \\@ref(rand2mean)) the explanatory variable was *also* randomly assigned to the responses.\nThe change in the paired setting, however, is that the assignment happens *within* an observational unit (here, a car).\nRemember, if the null hypothesis is true, it will not matter which brand is put on which tire because the overall tread wear will be the same across pairs.\n\nFigures \\@ref(fig:tiredata4) and \\@ref(fig:tiredata5) show that the random assignment of group (tire brand) happens within a single car.\nThat is, every single car will still have one tire of each type.\nIn the first randomization, it just so happens that the 4th car's tire brands were swapped and the 5th car's tire brands were not swapped.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The 4th car: the tire brand was randomly permuted, and in the randomization calculation, the measurements (in cm) ended up in different groups.](21-inference-paired-means_files/figure-html/tiredata4-1.png){width=90%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![The 5th car: the tire brand was randomly permuted to stay the same! In the randomization calculation, the measurements (in cm) ended up in the original groups.](21-inference-paired-means_files/figure-html/tiredata5-1.png){width=90%}\n:::\n:::\n\n\nWe can put the shuffled assignments for all the cars into one plot as seen in Figure \\@ref(fig:tiredataPerm).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Tire tread data (in cm) with: the brand of tire from which the original measurements came (left) and shuffled brand assignment (right). 
As evidenced by the colors, some of the cars kept their original tire assignments and some cars swapped the tire assignments.](21-inference-paired-means_files/figure-html/tiredataPerm-1.png){width=100%}\n:::\n:::\n\n\nThe next step in the randomization test is to sort the brands so that the assigned brand value on the x-axis aligns with the assigned group from the randomization.\nSee Figure \\@ref(fig:tiredataPermSort) which has the same randomized groups (right image in Figure \\@ref(fig:tiredataPerm) and left image in Figure \\@ref(fig:tiredataPermSort)) as seen previously.\nHowever, the right image in Figure \\@ref(fig:tiredataPermSort) sorts the randomized groups so that we can measure the variability across groups as compared to the variability within groups.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Tire tread from (left) randomized brand assignment, (right) sorted by randomized brand.](21-inference-paired-means_files/figure-html/tiredataPermSort-1.png){width=100%}\n:::\n:::\n\n\nFigure \\@ref(fig:tiredatarand1) presents a second randomization of the data.\nNotice how the two observations from the same car are linked by a grey line; some of the tread values have been randomly assigned to the opposite tire brand than they were originally (while some are still connected to their original tire brands).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A second randomization where the brand is randomly swapped (or not) across the two tread wear measurements (in cm) from the same car.](21-inference-paired-means_files/figure-html/tiredatarand1-1.png){width=90%}\n:::\n:::\n\n\nFigure \\@ref(fig:tiredatarand2) presents yet another randomization of the data.\nAgain, the same observations are linked by a grey line, and some of the tread values have been randomly assigned to the opposite tire brand than they were originally (while some are still connected to their original tire brands).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An additional randomization where the brand is randomly swapped (or not) across the two tread wear measurements (in cm) from the same car.](21-inference-paired-means_files/figure-html/tiredatarand2-1.png){width=90%}\n:::\n:::\n\n\n### Observed statistic vs. 
null statistics\n\nBy repeating the randomization process, we can create a distribution of the average of the differences in tire treads, as seen in Figure \\@ref(fig:pairRandomiz).\nAs expected (because the differences were generated under the null hypothesis), the center of the histogram is zero.\nA line has been drawn at the observed difference which is well outside the majority of the null differences simulated from natural variability by mixing up which the tire received Smooth Turn and which received Quick Spin.\nBecause the observed statistic is so far away from the natural variability of the randomized differences, we are convinced that there is a difference between Smooth Turn and Quick Spin.\nOur conclusion is that the extra amount of average tire tread in Smooth Turn is due to more than just natural variability: we reject $H_0$ and conclude that $\\mu_{ST} \\ne \\mu_{QS}.$\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of 1,000 mean differences with tire brand randomly assigned across the two tread measurements (in cm) per pair.](21-inference-paired-means_files/figure-html/pairRandomiz-1.png){width=90%}\n:::\n:::\n\n\n## Bootstrap confidence interval for the mean paired difference\n\nFor both the bootstrap and the mathematical models applied to paired data, the analysis is virtually identical to the one-sample approach given in Chapter \\@ref(inference-one-mean).\nThe key to working with paired data (for bootstrapping and mathematical approaches) is to consider the measurement of interest to be the difference in measured values across the pair of observations.\n\n\n\n\n\n### Observed data\n\nIn an earlier edition of this textbook, we found that Amazon prices were, on average, lower than those of the UCLA Bookstore for UCLA courses in 2010.\nIt's been several years, and many stores have adapted to the online market, so we wondered, how is the UCLA Bookstore doing today?\n\nWe sampled 201 UCLA courses.\nOf those, 68 required books could be found on Amazon.\nA portion of the dataset from these courses is shown in Figure \\@ref(tab:textbooksDF), where prices are in US dollars.\n\n::: {.data data-latex=\"\"}\nThe [`ucla_textbooks_f18`](http://openintrostat.github.io/openintro/reference/ucla_textbooks_f18.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Four cases from the `ucla_textbooks_f18` dataset.
subject course_num bookstore_new amazon_new price_diff
American Indian Studies M10 48.0 47.5 0.52
Anthropology 2 14.3 13.6 0.71
Arts and Architecture 10 13.5 12.5 0.97
Asian M60W 49.3 55.0 -5.69
\n\n`````\n:::\n:::\n\n\n\\index{paired data}\n\nEach textbook has two corresponding prices in the dataset: one for the UCLA Bookstore and one for Amazon.\nWhen two sets of observations have this special correspondence, they are said to be **paired**.\n\n### Variability of the statistic\n\nFollowing the example of bootstrapping the one-sample statistic, the observed *differences* can be bootstrapped in order to understand the variability of the average difference from sample to sample.\nRemember, the differences act as a single value to bootstrap.\nThat is, the original dataset would include the list of 68 price differences, and each resample will also include 68 price differences (some repeated through the bootstrap resampling process).\nThe bootstrap procedure for paired differences is quite similar to the procedure applied to the one-sample statistic case in Section \\@ref(boot1mean).\n\nIn Figure \\@ref(fig:pairboot), two 99% confidence intervals for the difference in the cost of a new book at the UCLA bookstore compared with Amazon have been calculated.\nThe bootstrap percentile confidence interval is computing using the 0.5 percentile and 99.5 percentile bootstrapped differences and is found to be (\\$0.25, \\$7.87).\n\n::: {.guidedpractice data-latex=\"\"}\nUsing the histogram of bootstrapped difference in means, estimate the standard error of the mean of the sample differences, $\\bar{x}_{diff}.$[^21-inference-paired-means-1]\n:::\n\n[^21-inference-paired-means-1]: The bootstrapped differences in sample means vary roughly from 0.7 to 7.5, a range of \\$6.80.\n Although the bootstrap distribution is not symmetric, we use the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the mean differences is approximately \\$1.70.\n You might note that the standard error calculation given in Section \\@ref(mathpaired) is $SE(\\bar{x}_{diff}) = \\sqrt{s^2_{diff}/n_{diff}}\\\\ = \\sqrt{13.4^2/68} = \\$1.62$ (values from Section \\@ref(mathpaired)), very close to the bootstrap approximation.\n\nThe bootstrap SE interval is found by computing the SE of the bootstrapped differences $(SE_{\\overline{x}_{diff}} = \\$1.64)$ and the normal multiplier of $z^* = 2.58.$ The averaged difference is $\\bar{x} = \\$3.58.$ The 99% confidence interval is: $\\$3.58 \\pm 2.58 \\times \\$ 1.64 = (\\$-0.65, \\$7.81).$\n\nThe confidence intervals seem to indicate that the UCLA bookstore price is, on average, higher than the Amazon price, as the majority of the confidence interval is positive.\nHowever, if the analysis required a strong degree of certainty (e.g., 99% confidence), and the bootstrap SE interval was most appropriate (given a second course in statistics the nuances of the methods can be investigated), the results of which book seller is higher is not well determined (because the bootstrap SE interval overlaps zero).\nThat is, the 99% bootstrap SE interval gives potential for UCLA to be lower, on average, than Amazon (because of the possible negative values for the true mean difference in price).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:pairboot-cap)](21-inference-paired-means_files/figure-html/pairboot-1.png){width=90%}\n:::\n:::\n\n\n(ref:pairboot-cap) Bootstrap distribution for the average difference in new book price at the UCLA bookstore versus Amazon. 
99% confidence intervals are superimposed using blue dashed (bootstrap percentile interval) and red dotted (bootstrap SE interval) lines.\n\n\\clearpage\n\n## Mathematical model for the mean paired difference {#mathpaired}\n\nThinking about the differences as a single observation on an observational unit changes the paired setting into the one-sample setting.\nThe mathematical model for the one-sample case is covered in Section \\@ref(one-mean-math).\n\n### Observed data\n\nTo analyze paired data, it is often useful to look at the difference in outcomes of each pair of observations.\nIn the textbook data, we look at the differences in prices, which is represented as the `price_difference` variable in the dataset.\nHere the differences are taken as\n\n$$\\text{UCLA Bookstore price} - \\text{Amazon price}$$\n\nIt is important that we always subtract using a consistent order; here Amazon prices are always subtracted from UCLA prices.\nThe first difference shown in Table \\@ref(tab:textbooksDF) is computed as $47.97 - 47.45 = 0.52.$ Similarly, the second difference is computed as $14.26 - 13.55 = 0.71,$ and the third is $13.50 - 12.53 = 0.97.$ A histogram of the differences is shown in Figure \\@ref(fig:diffInTextbookPricesF18).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of the difference in price for each book sampled.](21-inference-paired-means_files/figure-html/diffInTextbookPricesF18-1.png){width=90%}\n:::\n:::\n\n\n### Variability of the statistic\n\nTo analyze a paired dataset, we simply analyze the differences.\nTable \\@ref(tab:textbooksSummaryStats) provides the data summaries from the textbook data.\nNote that instead of reporting the prices separately for UCLA and Amazon, the summary statistics are given by the mean of the differences, the standard deviation of the differences, and the total number of pairs (i.e., differences).\nThe parameter of interest is also a single value, $\\mu_{diff},$ so we can use the same $t$-distribution techniques we applied in Section \\@ref(one-mean-math) directly onto the observed differences.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n\n
Summary statistics for the 68 price differences.
n Mean SD
68 3.58 13.4
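\n\n`````\n:::\n:::\n\n\nThe calculations in the worked examples that follow can be reproduced from these summary values alone. The base R sketch below is illustrative only (the variable names are arbitrary, and the results differ slightly from the rounded values shown in the text); with the raw `price_diff` column, a single call to `t.test()` would return both the test statistic and the confidence interval.\n\n```r\n# Paired t calculations from the summary of the 68 price differences\nn_diff    <- 68       # number of differences (UCLA Bookstore - Amazon)\nxbar_diff <- 3.58     # mean of the differences, in dollars\ns_diff    <- 13.42    # standard deviation of the differences\n\nse     <- s_diff / sqrt(n_diff)                    # standard error of the mean difference\nt_stat <- (xbar_diff - 0) / se                     # T score under H0: mu_diff = 0\np_val  <- 2 * pt(-abs(t_stat), df = n_diff - 1)    # two-tailed p-value\nci_95  <- xbar_diff + c(-1, 1) * qt(0.975, df = n_diff - 1) * se  # 95% confidence interval\n\nc(SE = se, T = t_stat, p_value = p_val)\nci_95\n```\n\n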
::: {.workedexample data-latex=\"\"}\nSet up a hypothesis test to determine whether, on average, there is a difference between Amazon's price for a book and the UCLA bookstore's price.\nAlso, check the conditions for whether we can move forward with the test using the $t$-distribution.\n\n------------------------------------------------------------------------\n\nWe are considering two scenarios: there is no difference or there is some difference in average prices.\n\n- $H_0:$ $\mu_{diff} = 0.$ There is no difference in the average textbook price.\n\n- $H_A:$ $\mu_{diff} \neq 0.$ There is a difference in average prices.\n\nNext, we check the independence and normality conditions.\nThe observations are based on a simple random sample, so assuming the textbooks are independent seems reasonable.\nWhile there are some outliers, $n = 68$ and none of the outliers are particularly extreme, so the normality of $\bar{x}$ is satisfied.\nWith these conditions satisfied, we can move forward with the $t$-distribution.\n:::\n\n### Observed statistic vs. null statistics\n\nAs mentioned previously, the methods applied to a difference will be identical to the one-sample techniques.\nTherefore, the full hypothesis test framework is presented as guided practices.\n\n::: {.important data-latex=\"\"}\n**The test statistic for assessing a paired mean is a T.**\n\nThe T score is a ratio of how the sample mean difference varies from zero as compared to how the observations vary.\n\n$$T = \frac{\bar{x}_{diff} - 0 }{s_{diff}/\sqrt{n_{diff}}}$$\n\nWhen the null hypothesis is true and the conditions are met, T has a t-distribution with $df = n_{diff} - 1.$\n\nConditions:\n\n- Independently sampled pairs.\n- Large samples and no extreme outliers.\n:::\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nComplete the hypothesis test started in the previous Example.\n\n------------------------------------------------------------------------\n\nTo compute the test statistic, first compute the standard error associated with $\bar{x}_{diff}$ using the standard deviation of the differences $(s_{diff} = 13.42)$ and the number of differences $(n_{diff} = 68):$\n\n$$SE_{\bar{x}_{diff}} = \frac{s_{diff}}{\sqrt{n_{diff}}} = \frac{13.42}{\sqrt{68}} = 1.63$$\n\nThe test statistic is the T score of $\bar{x}_{diff}$ under the null condition that the actual mean difference is 0:\n\n$$T = \frac{\bar{x}_{diff} - 0}{SE_{\bar{x}_{diff}}} = \frac{3.58 - 0}{1.63} = 2.20$$\n\nTo visualize the p-value, the sampling distribution of $\bar{x}_{diff}$ is drawn as though $H_0$ is true, and the p-value is represented by the two shaded tails in the figure below.\nThe degrees of freedom is $df = 68 - 1 = 67.$ Using statistical software, we find the one-tail area of 0.0156.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](21-inference-paired-means_files/figure-html/textbooksF18HTTails-1.png){width=60%}\n:::\n:::\n\n\nDoubling this area gives the p-value: 0.0312.\nBecause the p-value is less than 0.05, we reject the null hypothesis.\nAmazon prices are, on average, lower than the UCLA Bookstore prices for UCLA courses.\n:::\n\nRecall that the margin of error is defined by the standard error.\nThe margin of error for $\bar{x}_{diff}$ can be directly obtained from $SE(\bar{x}_{diff}).$\n\n::: {.important data-latex=\"\"}\n**Margin of error for** $\bar{x}_{diff}.$\n\nThe margin of error is $t^\star_{df} \times s_{diff}/\sqrt{n_{diff}}$ where $t^\star_{df}$ is calculated from a specified percentile on the t-distribution with\n
*df* degrees of freedom.\n:::\n\n::: {.workedexample data-latex=\"\"}\nCreate a 95% confidence interval for the average price difference between books at the UCLA bookstore and books on Amazon.\n\n------------------------------------------------------------------------\n\nConditions have already verified and the standard error computed in a previous Example.\\\nTo find the confidence interval, identify $t^{\\star}_{67}$ using statistical software or the $t$-table $(t^{\\star}_{67} = 2.00),$ and plug it, the point estimate, and the standard error into the confidence interval formula:\n\n$$\n\\begin{aligned}\n\\text{point estimate} \\ &\\pm \\ z^{\\star} \\ \\times \\ SE \\\\\n3.58 \\ &\\pm \\ 2.00 \\ \\times \\ 1.63 \\\\\n(0.32 \\ &, \\ 6.84)\n\\end{aligned}\n$$\n\nWe are 95% confident that the UCLA Bookstore is, on average, between \\$0.32 and \\$6.84 more expensive than Amazon for UCLA course books.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nWe have convincing evidence that Amazon is, on average, less expensive.\nHow should this conclusion affect UCLA student buying habits?\nShould UCLA students always buy their books on Amazon?[^21-inference-paired-means-2]\n:::\n\n[^21-inference-paired-means-2]: The average price difference is only mildly useful for this question.\n Examine the distribution shown in Figure \\@ref(fig:diffInTextbookPricesF18).\n There are certainly a handful of cases where Amazon prices are far below the UCLA Bookstore's, which suggests it is worth checking Amazon (and probably other online sites) before purchasing.\n However, in many cases the Amazon price is above what the UCLA Bookstore charges, and most of the time the price isn't that different.\n Ultimately, if getting a book immediately from the bookstore is notably more convenient, e.g., to get started on reading or homework, it's likely a good idea to go with the UCLA Bookstore unless the price difference on a specific book happens to be quite large.\n For reference, this is a very different result from what we (the authors) had seen in a similar dataset from 2010.\n At that time, Amazon prices were almost uniformly lower than those of the UCLA Bookstore's and by a large margin, making the case to use Amazon over the UCLA Bookstore quite compelling at that time.\n Now we frequently check multiple websites to find the best price.\n\n\\index{paired}\n\n\n\n\n\n\\clearpage\n\n## Chapter review {#chp21-review}\n\n### Summary\n\nLike the two independent sample procedures in Chapter \\@ref(inference-two-means), the paired difference analysis can be done using a t-distribution.\nThe randomization test applied to the paired differences is slightly different, however.\nNote that when randomizing under the paired setting, each null statistic is created by randomly assigning the group to a numerical outcome **within** the individual observational unit.\nThe procedure for creating a confidence interval for the paired difference is almost identical to the confidence intervals created in Chapter \\@ref(inference-one-mean) for a single mean.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n 
\n \n\n
bootstrap CI paired difference paired difference CI T score paired difference
paired data paired difference t-test
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp21-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-21].\n\n::: {.exercises data-latex=\"\"}\n1. **Air quality.** \nAir quality measurements were collected in a random sample of 25 country capitals in 2013, and then again in the same cities in 2014. \nWe would like to use these data to compare average air quality between the two years. \nShould we use a paired or non-paired test? \nExplain your reasoning.\n\n \\vspace{5mm}\n\n1. **True / False: paired.** \nDetermine if the following statements are true or false. If false, explain.\n\n a. In a paired analysis we first take the difference of each pair of observations, and then we do inference on these differences.\n\n b. Two datasets of different sizes cannot be analyzed as paired data.\n\n c. Consider two sets of data that are paired with each other. Each observation in one dataset has a natural correspondence with exactly one observation from the other dataset.\n\n d. Consider two sets of data that are paired with each other. Each observation in one dataset is subtracted from the average of the other dataset's observations.\n\n \\vspace{5mm}\n\n1. **Paired or not? I.** \nIn each of the following scenarios, determine if the data are paired.\n\n a. Compare pre- (beginning of semester) and post-test (end of semester) scores of students.\n\n b. Assess gender-related salary gap by comparing salaries of randomly sampled men and women.\n\n c. Compare artery thicknesses at the beginning of a study and after 2 years of taking Vitamin E for the same group of patients.\n\n d. Assess effectiveness of a diet regimen by comparing the before and after weights of subjects.\n \n \\vspace{5mm}\n\n1. **Paired or not? II.** \nIn each of the following scenarios, determine if the data are paired.\n\n a. We would like to know if Intel's stock and Southwest Airlines' stock have similar rates of return. To find out, we take a random sample of 50 days, and record Intel's and Southwest's stock on those same days.\n\n b. We randomly sample 50 items from Target stores and note the price for each. Then we visit Walmart and collect the price for each of those same 50 items.\n\n c. A school board would like to determine whether there is a difference in average SAT scores for students at one high school versus another high school in the district. To check, they take a simple random sample of 100 students from each high school.\n \n \\vspace{5mm}\n\n1. **Sample size and pairing.**\nDetermine if the following statement is true or false, and if false, explain your reasoning: If comparing means of two groups with equal sample sizes, always use a paired test.\n\n \\clearpage\n\n1. **High School and Beyond, randomization test.** \nThe National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. 
\nHere we examine a simple random sample of 200 students from this survey.\nSide-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.\nAlso provided below is a histogram of randomized averages of paired differences of scores (read - write), with the observed difference ($\\bar{x}_{read-write} = -0.545$) marked with a red vertical line.\nThe randomization distribution was produced by doing the following 1000 times: for each student, the two scores were randomly assigned to either read or write, and the average was taken across all students in the sample.^[The [`hsb2`](http://openintrostat.github.io/openintro/reference/hsb2.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-22-1.png){width=100%}\n :::\n :::\n\n a. Is there a clear difference in the average reading and writing scores?\n\n b. Are the reading and writing scores of each student independent of each other?\n\n c. Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?\n\n d. Is the average of the observed difference in scores $(\\bar{x}_{read-write} = -0.545)$ consistent with the distribution of randomized average differences? Explain. \n\n e. Do these data provide convincing evidence of a difference between the average scores on the two exams? Estimate the p-value from the randomization test, and conclude the hypothesis test using words like \"score on reading test\" and \"score on writing test.\"\n\n1. **Forest management.**\nForest rangers wanted to better understand the rate of growth for younger trees in the park. They took measurements of a random sample of 50 young trees in 2009 and again measured those same trees in 2019. The data below summarize their measurements, where the heights are in feet.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
| Year       | Mean | SD  | n  |
|:-----------|-----:|----:|---:|
| 2009       | 12.0 | 3.5 | 50 |
| 2019       | 24.5 | 9.5 | 50 |
| Difference | 12.5 | 7.2 | 50 |
\n \n `````\n :::\n :::\n\n Construct a 99% confidence interval for the average growth of (what had been) younger trees in the park over 2009-2019.\n \n \\clearpage\n\n1. **High School and Beyond, bootstrap interval.** \nWe considered the differences between the reading and writing scores of a random sample of 200 students who took the High School and Beyond Survey.\nThe mean and standard deviation of the differences are $\\bar{x}_{read-write} = -0.545$ and $s_{read-write}$ = 8.887 points.\nThe bootstrap distribution below was produced by bootstrapping from the sample of differences in reading and writing scores 1,000 times.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-24-1.png){width=90%}\n :::\n :::\n\n a. Find an approximate 95% bootstrap percentile confidence interval for the true average difference in scores (read - write).\n \n b. Find an approximate 95% bootstrap SE confidence interval for the true average difference in scores (read - write).\n\n c. Interpret both confidence intervals using words like \"population\" and \"score\".\n\n d. From the confidence intervals calculated above, does it appear that there is a significant difference in reading and writing scores, on average?\n\n1. **Possible paired randomized differences.**\nData were collected on five people.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
|               | Person 1 | Person 2 | Person 3 | Person 4 | Person 5 |
|:--------------|---------:|---------:|---------:|---------:|---------:|
| Observation 1 |        3 |       14 |        4 |        5 |       10 |
| Observation 2 |        7 |        3 |        6 |        5 |        9 |
| Difference    |       -4 |       11 |       -2 |        0 |        1 |
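For reference, the within-pair reshuffling described in the chapter summary amounts to randomly swapping the two observations inside each pair, which is the same as randomly flipping the sign of each paired difference. Below is a minimal sketch in R using the five differences from the table above; the object names are ours.

```r
# One randomization of paired differences: flip each sign at random (sketch)
observed_diffs <- c(-4, 11, -2, 0, 1)       # differences from the table above

set.seed(1)                                 # only to make the example reproducible
signs <- sample(c(-1, 1), size = length(observed_diffs), replace = TRUE)
randomized_diffs <- signs * observed_diffs  # one possible randomized set of differences
mean(randomized_diffs)                      # one value from the null distribution of the average
```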
\n \n `````\n :::\n :::\n\n Which of the following could be a possible randomization of the paired differences given above? If the set of values could not be a randomized set of differences, indicate why not.\n\n a. -2, 1, 1, 11, -2\n \n b. -4\t11\t-2\t0\t1\n \n c. -2, 2, -11, 11, -2, 2, 0, 1, -1\n \n d. 0, -1, 2, -4, 11\n \n e. 4, -11, 2, 0, -1\n \n \\clearpage\n\n1. **Study environment.**\nIn order to test the effects of listening to music while studying versus studying in silence, students agree to be randomized to two treatments (i.e., study with music or study in silence).\nThere are two exams during the semester, so the researchers can either randomize the students to have one exam with music and one with silence (randomly selecting which exam corresponds to which study environment) or the researchers can randomize the students to one study habit for both exams.\n\n The researchers are interested in estimating the true population difference of exam score for those who listen to music while studying as compared to those who study in silence.\n\n a. Describe the experiment which is consistent with a paired designed experiment. How is the treatment assigned, and how are the data collected such that the observations are paired?\n\n b. Describe the experiment which is consistent with an indpenedent samples experiment. How is the treatment assigned, and how are the data collected such that the observations are independent?\n\n1. **Global warming, randomization test.** \nLet's consider a limited set of climate data, examining temperature differences in 1948 vs 2018. \nWe sampled 197 locations from the National Oceanic and Atmospheric Administration's (NOAA) historical data, where the data was available for both years of interest. \nWe want to know: were there more days with temperatures exceeding 90F in 2018 or in 1948? [@webpage:noaa19482018] \nThe difference in number of days exceeding 90F (number of days in 2018 - number of days in 1948) was calculated for each of the 197 locations. \nThe average of these differences was 2.9 days with a standard deviation of 17.2 days. \nWe are interested in determining whether these data provide strong evidence that there were more days in 2018 that exceeded 90F from NOAA's weather stations.^[The [`climate70`](http://openintrostat.github.io/openintro/reference/climate70.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-26-1.png){width=90%}\n :::\n :::\n \n a. Create hypotheses appropriate for the following research question: is there an evident difference in the average number of days greater than 90F across the two years (1948 and 2018)?\n\n b. Is the average of the observed difference in scores $(\\bar{x}_{2018-1948} = 2.9)$ consistent with the distribution of randomized average differences? Explain. \n\n c. Do these data provide convincing evidence of a difference between the average number of days? Estimate the p-value from the randomization test, and conclude the hypothesis test using words like \"number of days in 1948\" and \"number of days in 2018.\"\n \n \\clearpage\n\n1. **Global warming, bootstrap interval.** \nWe considered the change in the number of days exceeding 90F from 1948 and 2018 at 197 randomly sampled locations from the NOAA database. 
\nThe mean and standard deviation of the reported differences are 2.9 days and 17.2 days.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-27-1.png){width=90%}\n :::\n :::\n\n a. Calculate a 90% bootstrap percentile confidence interval for the average difference between number of days exceeding 90F between 1948 and 2018. \n\n b. Calculate a 90% bootstrap SE confidence interval for the average difference between number of days exceeding 90F between 1948 and 2018. \n \n c. Interpret both intervals in context.\n\n d. Do the confidence intervals provide convincing evidence that there were more days exceeding 90F in 2018 than in 1948 at NOAA stations? Explain your reasoning.\n\n1. **Global warming, mathematical test.** \nWe considered the change in the number of days exceeding 90F from 1948 and 2018 at 197 randomly sampled locations from the NOAA database. \nThe mean and standard deviation of the reported differences are 2.9 days and 17.2 days.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-28-1.png){width=90%}\n :::\n :::\n\n a. Is there a relationship between the observations collected in 1948 and 2018? Or are the observations in the two groups independent? Explain.\n\n b. Write hypotheses for this research in symbols and in words.\n\n c. Check the conditions required to complete this test. A histogram of the differences is given to the right.\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for parts d to g.*\n :::\n \n \\clearpage\n\n d. Calculate the test statistic and find the p-value.\n\n e. Use $\\alpha = 0.05$ to evaluate the test, and interpret your conclusion in context.\n\n f. What type of error might we have made? Explain in context what the error means.\n\n g. Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the number of days exceeding 90F from 1948 and 2018 to include 0? Explain your reasoning.\n\n1. **High School and Beyond, mathematical test.** \nWe considered the differences between the reading and writing scores of a random sample of 200 students who took the High School and Beyond Survey.\n\n a. Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?\n\n b. Check the conditions required to complete this test.\n\n c. The average observed difference in scores is $\\bar{x}_{read-write} = -0.545$, and the standard deviation of the differences is $s_{read-write} = 8.887$ points. Do these data provide convincing evidence of a difference between the average scores on the two exams?\n\n d. What type of error might we have made? Explain what the error means in the context of the application.\n\n e. Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.\n\n1. **Global warming, mathematical interval.** \nWe considered the change in the number of days exceeding 90F from 1948 and 2018 at 197 randomly sampled locations from the NOAA database. \nThe mean and standard deviation of the reported differences are 2.9 days and 17.2 days.\n\n a. Calculate a 90% confidence interval for the average difference between number of days exceeding 90F between 1948 and 2018. We've already checked the conditions for you.\n\n b. 
Interpret the interval in context.\n\n c. Does the confidence interval provide convincing evidence that there were more days exceeding 90F in 2018 than in 1948 at NOAA stations? Explain your reasoning.\n\n1. **High school and beyond, mathematical interval.** \nWe considered the differences between the reading and writing scores of a random sample of 200 students who took the High School and Beyond Survey. \nThe mean and standard deviation of the differences are $\\bar{x}_{read-write} = -0.545$ and $s_{read-write}$ = 8.887 points.\n\n a. Calculate a 95% confidence interval for the average difference between the reading and writing scores of all students.\n\n b. Interpret this interval in context.\n\n c. Does the confidence interval provide convincing evidence that there is a real difference in the average scores? Explain.\n \n \\clearpage\n\n1. **Friday the 13th, traffic.**\nIn the early 1990's, researchers in the UK collected data on traffic flow on Friday the 13th with the goal of addressing issues of how superstitions regarding Friday the 13th affect human behavior and and whether Friday the 13th is an unlucky day.\nThe histograms below show the distributions of numbers of cars passing by a specific intersection on Friday the 6th and Friday the 13th for many such date pairs.\nAlso provided are some sample statistics, where the difference is the number of cars on the 6th minus the number of cars on the 13th.^[The [`friday`](http://openintrostat.github.io/openintro/reference/friday.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Scanlon:1993]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-29-1.png){width=100%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
|            |  n |    Mean |    SD |
|:-----------|---:|--------:|------:|
| sixth      | 10 | 128,385 | 7,259 |
| thirteenth | 10 | 126,550 | 7,664 |
| diff       | 10 |   1,836 | 1,176 |
\n \n `````\n :::\n :::\n\n a. Are there any underlying structures in these data that should be considered in an analysis? Explain.\n\n b. What are the hypotheses for evaluating whether the number of people out on Friday the 6$^{\\text{th}}$ is different than the number out on Friday the 13$^{\\text{th}}$?\n\n c. Check conditions to carry out the hypothesis test from part (b) using mathematical models.\n\n d. Calculate the test statistic and the p-value.\n\n e. What is the conclusion of the hypothesis test?\n\n f. Interpret the p-value in this context.\n\n g. What type of error might have been made in the conclusion of your test? Explain.\n \n \\clearpage\n\n1. **Friday the 13th, accidents.** \nIn the early 1990's, researchers in the UK collected data the number of traffic accident related emergency room (ER) admissions on Friday the 13th with the goal of addressing issues of how superstitions regarding Friday the 13th affect human behavior and and whether Friday the 13th is an unlucky day.\nThe histograms below show the distributions of numbers of ER admissions at specific emergency rooms on Friday the 6th and Friday the 13th for many such date pairs.\nAlso provided are some sample statistics, where the difference is the ER admissions on the 6th minus the ER admissions on the 13th.[@Scanlon:1993]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](21-inference-paired-means_files/figure-html/unnamed-chunk-30-1.png){width=100%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
|            | n | Mean | SD |
|:-----------|--:|-----:|---:|
| sixth      | 6 |    8 |  3 |
| thirteenth | 6 |   11 |  4 |
| diff       | 6 |   -3 |  3 |
\n \n `````\n :::\n :::\n\n a. Conduct a hypothesis test using mathematical models to evaluate if there is a difference between the average numbers of traffic accident related emergency room admissions between Friday the 6$^{\\text{th}}$ and Friday the 13$^{\\text{th}}$.\n\n b. Calculate a 95% confidence interval using mathematical models for the difference between the average numbers of traffic accident related emergency room admissions between Friday the 6$^{\\text{th}}$ and Friday the 13$^{\\text{th}}$.\n\n c. The conclusion of the original study states, \"Friday 13th is unlucky for some. The risk of hospital admission as a result of a transport accident may be increased by as much as 52%. Staying at home is recommended.\" Do you agree with this statement? Explain your reasoning.\n\n\n:::\n", + "supporting": [ + "21-inference-paired-means_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/21-inference-paired-means/figure-html/diffInTextbookPricesF18-1.png b/_freeze/21-inference-paired-means/figure-html/diffInTextbookPricesF18-1.png new file mode 100644 index 00000000..b6375255 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/diffInTextbookPricesF18-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/pairRandomiz-1.png b/_freeze/21-inference-paired-means/figure-html/pairRandomiz-1.png new file mode 100644 index 00000000..077c7416 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/pairRandomiz-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/pairboot-1.png b/_freeze/21-inference-paired-means/figure-html/pairboot-1.png new file mode 100644 index 00000000..70955a62 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/pairboot-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/textbooksF18HTTails-1.png b/_freeze/21-inference-paired-means/figure-html/textbooksF18HTTails-1.png new file mode 100644 index 00000000..d20d588e Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/textbooksF18HTTails-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredata-1.png b/_freeze/21-inference-paired-means/figure-html/tiredata-1.png new file mode 100644 index 00000000..8a43b8c1 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredata-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredata4-1.png b/_freeze/21-inference-paired-means/figure-html/tiredata4-1.png new file mode 100644 index 00000000..91b95aad Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredata4-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredata5-1.png b/_freeze/21-inference-paired-means/figure-html/tiredata5-1.png new file mode 100644 index 00000000..d784ef67 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredata5-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredataPerm-1.png b/_freeze/21-inference-paired-means/figure-html/tiredataPerm-1.png new file mode 100644 index 00000000..5e7d4b3f Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredataPerm-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredataPermSort-1.png 
b/_freeze/21-inference-paired-means/figure-html/tiredataPermSort-1.png new file mode 100644 index 00000000..21779bee Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredataPermSort-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredatarand1-1.png b/_freeze/21-inference-paired-means/figure-html/tiredatarand1-1.png new file mode 100644 index 00000000..967ae35a Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredatarand1-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/tiredatarand2-1.png b/_freeze/21-inference-paired-means/figure-html/tiredatarand2-1.png new file mode 100644 index 00000000..77e78425 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/tiredatarand2-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-22-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 00000000..bca694a1 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-22-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-24-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..1b13c236 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-24-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-26-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 00000000..28b41198 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-26-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-27-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 00000000..661c713a Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-27-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-28-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 00000000..79bd8445 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-29-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-29-1.png new file mode 100644 index 00000000..02124c4f Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-29-1.png differ diff --git a/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-30-1.png b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 00000000..b8bd7357 Binary files /dev/null and b/_freeze/21-inference-paired-means/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/22-inference-many-means/execute-results/html.json b/_freeze/22-inference-many-means/execute-results/html.json new file mode 100644 index 00000000..5c2eb481 --- /dev/null +++ b/_freeze/22-inference-many-means/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "42e4e02583779063080f09c13c05e0dc", + "result": { + "markdown": "# Inference for comparing many means {#inference-many-means}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn Chapter \\@ref(inference-two-means) analysis was done to compare the average population value across two different groups.\nAn important aspect of the 
analysis was to look at the difference in sample means as an estimate for the difference in population means.\nWhen comparing more than two groups, the difference (i.e., subtraction) will not fully capture the nuance in variation across the three or more groups.\nAs with two groups, the research question will focus on whether the group membership is independent of the numerical response variable.\nHere, independence across groups means that knowledge of the observations in one group does not change what we would expect to happen in the other group.\nBut what happens if the groups are **dependent**?\nIn this section we focus on a new statistic which incorporates differences in means across more than two groups.\nAlthough the ideas in this chapter are quite similar to the t-test, they have earned themselves their own name: **AN**alysis **O**f **VA**riance, or ANOVA.\n:::\n\n\\index{analysis of variance (ANOVA)}\n\nSometimes we want to compare means across many groups.\nWe might initially think to do pairwise comparisons.\nFor example, if there were three groups, we might be tempted to compare the first mean with the second, then with the third, and then finally compare the second and third means for a total of three comparisons.\nHowever, this strategy can be treacherous.\nIf we have many groups and do many comparisons, it is likely that we will eventually find a difference just by chance, even if there is no difference in the populations.\nInstead, we should apply a holistic test to check whether there is evidence that at least one pair groups are in fact different, and this is where **ANOVA** saves the day.\n\n\n\n\n\nIn this section, we will learn a new method called **analysis of variance (ANOVA)** and a new test statistic called an $F$-statistic (which we will introduce in our discussion of mathematical models).\nANOVA uses a single hypothesis test to check whether the means across many groups are equal:\n\n- $H_0:$ The mean outcome is the same across all groups. In statistical notation, $\\mu_1 = \\mu_2 = \\cdots = \\mu_k$ where $\\mu_i$ represents the mean of the outcome for observations in category $i.$\\\n- $H_A:$ At least one mean is different.\n\nGenerally we must check three conditions on the data before performing ANOVA:\n\n- the observations are independent within and between groups,\n- the responses within each group are nearly normal, and\n- the variability across the groups is about equal.\n\nWhen these three conditions are met, we may perform an ANOVA to determine whether the data provide convincing evidence against the null hypothesis that all the $\\mu_i$ are equal.\n\n::: {.workedexample data-latex=\"\"}\nCollege departments commonly run multiple sections of the same introductory course each semester because of high demand.\nConsider a statistics department that runs three sections of an introductory statistics course.\nWe might like to determine whether there are substantial differences in first exam scores in these three classes (Section A, Section B, and Section C).\nDescribe appropriate hypotheses to determine whether there are any differences between the three classes.\n\n------------------------------------------------------------------------\n\nThe hypotheses may be written in the following form:\n\n- $H_0:$ The average score is identical in all sections, $\\mu_A = \\mu_B = \\mu_C$. Assuming each class is equally difficult, the observed difference in the exam scores is due to chance.\n- $H_A:$ The average score varies by class. 
We would reject the null hypothesis in favor of the alternative hypothesis if there were larger differences among the class averages than what we might expect from chance alone.\n:::\n\nStrong evidence favoring the alternative hypothesis in ANOVA is described by unusually large differences among the group means.\nWe will soon learn that assessing the variability of the group means relative to the variability among individual observations within each group is key to ANOVA's success.\n\n::: {.workedexample data-latex=\"\"}\nExamine Figure \\@ref(fig:toyANOVA).\nCompare groups I, II, and III.\nCan you visually determine if the differences in the group centers is unlikely to have occurred if there were no differences in the groups?\nNow compare groups IV, V, and VI.\nDo these differences appear to be unlikely to have occurred if there were no differences in the groups?\n\n------------------------------------------------------------------------\n\nAny real difference in the means of groups I, II, and III is difficult to discern, because the data within each group are very volatile relative to any differences in the average outcome.\nOn the other hand, it appears there are differences in the centers of groups IV, V, and VI.\nFor instance, group V appears to have a higher mean than that of the other two groups.\nInvestigating groups IV, V, and VI, we see the differences in the groups' centers are noticeable because those differences are large *relative to the variability in the individual observations within each group*.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Side-by-side dot plot for the outcomes for six groups. Two sets of groups: first set is comprised of Groups I, II, and III, the second set is comprised of Groups IV, V, and VI.](22-inference-many-means_files/figure-html/toyANOVA-1.png){width=90%}\n:::\n:::\n\n\n## Case study: Batting\n\nWe would like to discern whether there are real differences between the batting performance of baseball players according to their position: outfielder (OF), infielder (IF), and catcher (C).\nWe will use a dataset called `mlb_players_18`, which includes batting records of 429 Major League Baseball (MLB) players from the 2018 season who had at least 100 at bats.\nSix of the 429 cases represented in `mlb_players_18` are shown in Table \\@ref(tab:mlbBat18DataFrame), and descriptions for each variable are provided in Table \\@ref(tab:mlbBat18Variables).\nThe measure we will use for the player batting performance (the outcome variable) is on-base percentage (`OBP`).\nThe on-base percentage roughly represents the fraction of the time a player successfully gets on base or hits a home run.\n\n::: {.data data-latex=\"\"}\nThe [`mlb_players_18`](http://openintrostat.github.io/openintro/reference/mlb_players_18.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Six cases and some of the variables from the `mlb_players_18` data frame.
| name         | team | position |  AB |   H | HR | RBI |   AVG |   OBP |
|:-------------|:-----|:---------|----:|----:|---:|----:|------:|------:|
| Abreu, J     | CWS  | IF       | 499 | 132 | 22 |  78 | 0.265 | 0.325 |
| Acuna Jr., R | ATL  | OF       | 433 | 127 | 26 |  64 | 0.293 | 0.366 |
| Adames, W    | TB   | IF       | 288 |  80 | 10 |  34 | 0.278 | 0.348 |
| Adams, M     | STL  | IF       | 306 |  73 | 21 |  57 | 0.239 | 0.309 |
| Adduci, J    | DET  | IF       | 176 |  47 |  3 |  21 | 0.267 | 0.290 |
| Adrianza, E  | MIN  | IF       | 335 |  84 |  6 |  39 | 0.251 | 0.301 |
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variables and their descriptions for the `mlb_players_18` dataset.
| Variable | Description |
|:---------|:------------|
| name     | Player name |
| team     | The abbreviated name of the player's team |
| position | The player's primary field position (OF, IF, C) |
| AB       | Number of opportunities at bat |
| H        | Number of hits |
| HR       | Number of home runs |
| RBI      | Number of runs batted in |
| AVG      | Batting average, which is equal to H/AB |
| OBP      | On-base percentage, which is roughly equal to the fraction of times a player gets on base or hits a home run |
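A minimal sketch of how these variables might be loaded and previewed in R, assuming the **openintro** package is installed; if your copy of `mlb_players_18` includes all players, the filter below reproduces the restriction to at least 100 at bats described above.

```r
# Load the batting data and peek at a few of the variables described above (sketch)
library(openintro)                        # provides the mlb_players_18 data frame

mlb <- subset(mlb_players_18, AB >= 100)  # players with at least 100 at bats
head(mlb[, c("name", "team", "position", "AB", "H", "OBP")])
```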
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nThe null hypothesis under consideration is the following: $\\mu_{OF} = \\mu_{IF} = \\mu_{C} % = \\mu_{DH}.$ Write the null and corresponding alternative hypotheses in plain language.[^22-inference-many-means-1]\n:::\n\n[^22-inference-many-means-1]: $H_0:$ The average on-base percentage is equal across the four positions.\n $H_A:$ The average on-base percentage varies across some (or all) groups.\n\n::: {.workedexample data-latex=\"\"}\nThe player positions have been divided into three groups: outfield (OF), infield (IF), and catcher (C).\nWhat would be an appropriate point estimate of the on-base percentage by outfielders, $\\mu_{OF}$?\n\n------------------------------------------------------------------------\n\nA good estimate of the on-base percentage by outfielders would be the sample average of `OBP` for just those players whose position is outfield: $\\bar{x}_{OF} = 0.320.$\n:::\n\n## Randomization test for comparing many means {#randANOVA}\n\nTable \\@ref(tab:mlbHRPerABSummaryTable) provides summary statistics for each group.\nA side-by-side box plot for the on-base percentage is shown in Figure \\@ref(fig:mlbANOVABoxPlot).\nNotice that the variability appears to be approximately constant across groups; nearly constant variance across groups is an important assumption that must be satisfied before we consider the ANOVA approach.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary statistics of on-base percentage, split by player position.
| Position |   n |  Mean |    SD |
|:---------|----:|------:|------:|
| OF       | 160 | 0.320 | 0.043 |
| IF       | 205 | 0.318 | 0.038 |
| C        |  64 | 0.302 | 0.038 |
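The summary table above can be reproduced, up to rounding, with a few lines of base R. This is a sketch: it assumes the data have been restricted to the 429 players described earlier and that `position` is coded with the three levels OF, IF, and C; if your copy of the data codes positions more finely, they would need to be collapsed first.

```r
# Group sizes, means, and standard deviations of OBP by position (sketch)
library(openintro)
mlb <- subset(mlb_players_18, AB >= 100)   # players with at least 100 at bats

tapply(mlb$OBP, mlb$position, length)      # n for each position
tapply(mlb$OBP, mlb$position, mean)        # mean OBP for each position
tapply(mlb$OBP, mlb$position, sd)          # SD of OBP for each position
```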
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Side-by-side box plot of the on-base percentage for 429 players across three groups. There is one prominent outlier visible in the infield group, but with 205 observations in the infield group, this outlier is not extreme enough to have an impact on the calculations, so it is not a concern for moving forward with the analysis.](22-inference-many-means_files/figure-html/mlbANOVABoxPlot-1.png){width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nThe largest difference between the sample means is between the catcher and the outfielder positions.\nConsider again the original hypotheses:\n\n- $H_0:$ $\\mu_{OF} = \\mu_{IF} = \\mu_{C}$\n- $H_A:$ The average on-base percentage $(\\mu_i)$ varies across some (or all) groups.\n\nWhy might it be inappropriate to run the test by simply estimating whether the difference of $\\mu_{C}$ and $\\mu_{OF}$ is \"statistically significant\" at a 0.05 significance level?\n\n------------------------------------------------------------------------\n\nThe primary issue here is that we are inspecting the data before picking the groups that will be compared.\nIt is inappropriate to examine all data by eye (informal testing) and only afterwards decide which parts to formally test.\nThis is called **data snooping** or **data fishing**.\nNaturally, we would pick the groups with the large differences for the formal test, and this would leading to an inflation in the Type 1 Error rate.\nTo understand this better, let's consider a slightly different problem.\n\nSuppose we are to measure the aptitude for students in 20 classes in a large elementary school at the beginning of the year.\nIn this school, all students are randomly assigned to classrooms, so any differences we observe between the classes at the start of the year are completely due to chance.\nHowever, with so many groups, we will probably observe a few groups that look rather different from each other.\nIf we select only these classes that look so different and then perform a formal test, we will probably make the wrong conclusion that the assignment wasn't random.\nWhile we might only formally test differences for a few pairs of classes, we informally evaluated the other classes by eye before choosing the most extreme cases for a comparison.\n:::\n\nFor additional information on the ideas expressed above, we recommend reading about the **prosecutor's fallacy**.[^22-inference-many-means-2]\n\n[^22-inference-many-means-2]: See, for example, [this blog post](https://statmodeling.stat.columbia.edu/2007/05/18/the_prosecutors/).\n\n\n\n\n\n### Observed data\n\nIn the next section we will learn how to use the $F$ statistic to test whether observed differences in sample means could have happened just by chance even if there was no difference in the respective population means.\n\nThe method of analysis of variance in this context focuses on answering one question: is the variability in the sample means so large that it seems unlikely to be from chance alone?\nThis question is different from earlier testing procedures since we will *simultaneously* consider many groups, and evaluate whether their sample means differ more than we would expect from natural variation.\nWe call this variability the **mean square between groups (MSG)**, and it has an associated degrees of freedom, $df_{G} = k - 1$ when there are $k$ groups.\nThe $MSG$ can be thought of as a scaled variance formula for means.\nIf the null hypothesis is true, any variation in the 
sample means is due to chance and shouldn't be too large.\nDetails of $MSG$ calculations are provided in the footnote.[^22-inference-many-means-3]\nHowever, we typically use software for these computations.\n\\index{degrees of freedom}\n\n[^22-inference-many-means-3]: Let $\\bar{x}$ represent the mean of outcomes across all groups.\n Then the mean square between groups is computed as $MSG = \\frac{1}{df_{G}}SSG = \\frac{1}{k-1}\\sum_{i=1}^{k} n_{i} \\left(\\bar{x}_{i} - \\bar{x}\\right)^2$ where $SSG$ is called the **sum of squares between groups** and $n_{i}$ is the sample size of group $i.$\n\n\n\n\n\nThe mean square between the groups is, on its own, quite useless in a hypothesis test.\nWe need a benchmark value for how much variability should be expected among the sample means if the null hypothesis is true.\nTo this end, we compute a pooled variance estimate, often abbreviated as the **mean square error (**$MSE)$, which has an associated degrees of freedom value $df_E = n - k.$ It is helpful to think of $MSE$ as a measure of the variability within the groups.\nDetails of the computations of the $MSE$ and a link to an extra online section for ANOVA calculations are provided in the footnote.[^22-inference-many-means-4]\n\n[^22-inference-many-means-4]: See [additional details on ANOVA calculations](https://www.openintro.org/download.php?file=stat_extra_anova_calculations) for interested readers.\n Let $\\bar{x}$ represent the mean of outcomes across all groups.\n Then the **sum of squares total (**$SST)$ is computed as $$SST = \\sum_{i=1}^{n} \\left(x_{i} - \\bar{x}\\right)^2$$ where the sum is over all observations in the dataset.\n Then we compute the **sum of squared errors** $(SSE)$ in one of two equivalent ways: $SSE = SST - SSG = (n_1-1)s_1^2 + (n_2-1)s_2^2 + \\cdots + (n_k-1)s_k^2$ where $s_i^2$ is the sample variance (square of the standard deviation) of the residuals in group $i.$ Then the $MSE$ is the standardized form of $SSE: MSE = \\frac{1}{df_{E}}SSE.$\n\nWhen the null hypothesis is true, any differences among the sample means are only due to chance, and the $MSG$ and $MSE$ should be about equal.\nAs a test statistic for ANOVA, we examine the fraction of $MSG$ and $MSE:$\n\n$$F = \\frac{MSG}{MSE}$$\n\nThe $MSG$ represents a measure of the between-group variability,and $MSE$ measures the variability within each of the groups.\n\n::: {.important data-latex=\"\"}\n**The test statistic for three or more means is an F.**\n\nThe F statistic is a ratio of how the groups differ (MSG) as compared to how the observations within a group vary (MSE).\n\n$$F = \\frac{MSG}{MSE}$$\n\nWhen the null hypothesis is true and the conditions are met, F has an F-distribution with $df_1 = k-1$ and $df_2 = n-k.$\n\nConditions:\n\n- independent observations, both within and across groups\\\n- large samples and no extreme outliers\\\n:::\n\n### Variability of the statistic\n\nWe recall the exams from Section \\@ref(rand2mean) which demonstrated a two-sample randomization test for a comparison of means.\nSuppose now that the teacher had had such an extremely large class that three different exams were given: A, B, and C.\nTable \\@ref(tab:summaryStatsForThreeVersionsOfExams) and Figure \\@ref(fig:boxplotThreeVersionsOfExams) provide a summary of the data including exam C.\nAgain, we would like to investigate whether the difficulty of the exams is the same across the three exams, so the test is\n\n- $H_0: \\mu_A = \\mu_B = \\mu_C.$ The inherent average difficulty is the same across the three exams.\n- 
$H_A:$ not $H_0.$ At least one of the exams is inherently more (or less) difficult than the others.\n\n::: {.data data-latex=\"\"}\nThe [`classdata`](http://openintrostat.github.io/openintro/reference/classdata.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary statistics of scores for each exam version.
| Exam |  n | Mean |   SD | Min | Max |
|:-----|---:|-----:|-----:|----:|----:|
| A    | 58 | 75.1 | 13.9 |  44 | 100 |
| B    | 55 | 72.0 | 13.8 |  38 | 100 |
| C    | 51 | 78.9 | 13.1 |  45 | 100 |
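The randomization scheme can be sketched in R as follows: shuffle the exam version labels across students, recompute the F statistic after each shuffle, and compare the observed F statistic to the resulting null distribution. The sketch below builds a toy stand-in for the class data from the group sizes, means, and standard deviations in the table above, since we do not assume the column names of the `classdata` file; all object names here are ours.

```r
# Randomization test comparing three exam versions (sketch)
# Toy stand-in for the real scores in the openintro `classdata` file, built from
# the group sizes, means, and SDs in the table above; all names here are ours.
set.seed(25)
exams <- data.frame(
  version = factor(rep(c("A", "B", "C"), times = c(58, 55, 51))),
  score   = round(c(rnorm(58, 75, 14), rnorm(55, 72, 14), rnorm(51, 79, 13)))
)

f_stat <- function(score, group) {
  oneway.test(score ~ group, var.equal = TRUE)$statistic  # the classic ANOVA F
}

obs_F  <- f_stat(exams$score, exams$version)  # F statistic for the (toy) observed data

null_F <- replicate(1000, f_stat(exams$score, sample(exams$version)))

mean(null_F >= obs_F)   # randomization p-value: share of shuffles at least as extreme
```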
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Exam scores for students given one of three different exams.](22-inference-many-means_files/figure-html/boxplotThreeVersionsOfExams-1.png){width=90%}\n:::\n:::\n\n\nFigure \\@ref(fig:randANOVA) shows the process of randomizing the three different exams to the observed exam scores.\nIf the null hypothesis is true, then the score on each exam should represent the true student ability on that material.\nIt shouldn't matter whether they were given exam A or exam B or exam C.\nBy reallocating which student got which exam, we are able to understand how the difference in average exam scores changes due only to natural variability.\nThere is only one iteration of the randomization process in Figure \\@ref(fig:randANOVA), leading to three different randomized sample means (computed assuming the null hypothesis is true).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The version of the test (A or B or C) is randomly allocated to the test scores, under the null assumption that the tests are equally difficult.](images/randANOVA.png){fig-alt='Four panels representing four different orientations of a toy dataset of 13 exam scores. The first panel provides the observed data; 4 of the exams were version A and the average score was 77.25; 5 of the exams were version B and the average score was 75.8; 4 of the exams were version C and the average score was 78.5. The observed F statistic is 0.0747. The second panel shows the shuffled reassignment of the exam versions (4 of the scores are randomly reassigned to A, 5 of the scores are randomly reassigned to B, 4 of the scores are randomly reassigned to C). The third panel shows which score is connected with which new reassigned version of the exam. And the fourth panel sorts the exams so that version A exams are together, version B exams are together, and version C exams are together. In the randomly reassigned versions, the average score for version A is 72.25, the average score for version B is 78, and the average score for version C is 75.25. The randomized F statistic is 0.1637.' width=75%}\n:::\n:::\n\n\nIn the two-sample case, the null hypothesis was investigated using the difference in the sample means.\nHowever, as noted above, with three groups (three different exams), the comparison of the three sample means gets slightly more complicated.\nWe have already derived the F-statistic which is exactly the way to compare the averages across three or more groups!\nRecall, the F statistic is a ratio of how the groups differ (MSG) as compared to how the observations within a group vary (MSE).\n\nBuilding on Figure \\@ref(fig:randANOVA), Figure \\@ref(fig:rand3exams) shows the values of the simulated $F$ statistics over 1,000 random simulations.\nWe see that, just by chance, the F statistic can be as large as 7.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of F statistics calculated from 1,000 different randomizations of the exam type.](22-inference-many-means_files/figure-html/rand3exams-1.png){width=90%}\n:::\n:::\n\n\n### Observed statistic vs. null statistic\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histogram of F statistics calculated from 1000 different randomizations of the exam type. The observed F statistic is given as a red vertical line 3.48. 
The area to the right is more extreme than the observed value and represents the p-value.](22-inference-many-means_files/figure-html/rand3examspval-1.png){width=90%}\n:::\n:::\n\n\nUsing statistical software, we can calculate that 3.6% of the randomized F test statistics were at or above the observed test statistic of $F= 3.48.$ That is, the p-value of the test is 0.036.\nAssuming that we had set the level of significance to be $\\alpha = 0.05,$ the p-value is smaller than the level of significance which would lead us to reject the null hypothesis.\nWe claim that the difficulty level (i.e., the true average score, $\\mu)$ is different for at least one of the exams.\n\nWhile it is temping to say that exam C is harder than the other two (given the inability to differentiate between exam A and exam B in Section \\@ref(rand2mean)), we must be very careful about conclusions made using different techniques on the same data.\n\nWhen the null hypothesis is true, random variability that exists in nature produces data with p-values less than 0.05.\nHow often does that happen?\n5% of the time.\nThat is to say, if you use 20 different models applied to the same data where there is no signal (i.e., the null hypothesis is true), you are reasonably likely to to get a p-value less than 0.05 in one of the tests you run.\nThe details surrounding the ideas of this problem, called a **multiple comparisons test** or **multiple comparisons problem**, are outside the scope of this textbook, but should be something that you keep in the back of your head.\nTo best mitigate any extra type I errors, we suggest that you set up your hypotheses and testing protocol before running any analyses.\nOnce the conclusions have been reached, you should report your findings instead of running a different type of test on the same data.\n\n## Mathematical model for test for comparing many means {#mathANOVA}\n\nAs seen with many of the tests and statistics from previous sections, the randomization test on the F statistic has mathematical theory to describe the distribution without using a computational approach.\n\nWe return to the baseball example from Table \\@ref(tab:mlbHRPerABSummaryTable) to demonstrate the mathematical model applied to the ANOVA setting.\n\n### Variability of the statistic\n\nThe larger the observed variability in the sample means $(MSG)$ relative to the within-group observations $(MSE)$, the larger $F$-statistic will be and the stronger the evidence against the null hypothesis.\nBecause larger $F$-statistics represent stronger evidence against the null hypothesis, we use the upper tail of the distribution to compute a p-value.\n\n::: {.important data-latex=\"\"}\n**The F statistic and the F-test.**\n\nAnalysis of variance (ANOVA) is used to test whether the mean outcome differs across two or more groups.\nANOVA uses a test statistic, the $F$-statistic, which represents a standardized ratio of variability in the sample means relative to the variability within the groups.\nIf $H_0$ is true and the model conditions are satisfied, an $F$-statistic follows an $F$ distribution with parameters $df_{1} = k - 1$ and $df_{2} = n - k.$ The upper tail of the $F$ distribution is used to represent the p-value.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nFor the baseball data, $MSG = 0.00803$ and $MSE=0.00158.$ Identify the degrees of freedom associated with MSG and MSE and verify the $F$-statistic is approximately 5.077.[^22-inference-many-means-5]\n:::\n\n[^22-inference-many-means-5]: There are $k = 3$ groups, so 
$df_{G} = k - 1 = 2.$ There are $n = n_1 + n_2 + n_3 = 429$ total observations, so $df_{E} = n - k = 426.$ Then the $F$-statistic is computed as the ratio of $MSG$ and $MSE:$ $F = \\frac{MSG}{MSE} = \\frac{0.00803}{0.00158} = 5.082 \\approx 5.077.$ $(F = 5.077$ was computed by using values for $MSG$ and $MSE$ that were not rounded.)\n\n### Observed statistic vs. null statistics\n\nWe can use the $F$-statistic to evaluate the hypotheses in what is called an F-test.\nA p-value can be computed from the $F$ statistic using an $F$ distribution, which has two associated parameters: $df_{1}$ and $df_{2}.$ For the $F$-statistic in ANOVA, $df_{1} = df_{G}$ and $df_{2} = df_{E}.$ An $F$ distribution with 2 and 426 degrees of freedom, corresponding to the $F$ statistic for the baseball hypothesis test, is shown in Figure \\@ref(fig:fDist2And423Shaded).\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![An $F$ distribution with $df_1=2$ and $df_2=426.$](22-inference-many-means_files/figure-html/fDist2And423Shaded-1.png){width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nThe p-value corresponding to the shaded area in Figure \\@ref(fig:fDist2And423Shaded) is equal to about 0.0066.\nDoes this provide strong evidence against the null hypothesis?\n\n------------------------------------------------------------------------\n\nThe p-value is smaller than 0.05, indicating the evidence is strong enough to reject the null hypothesis at a significance level of 0.05.\nThat is, the data provide strong evidence that the average on-base percentage varies by player's primary field position.\n:::\n\nNote that the small p-value indicates that there is a notable difference between the mean batting averages of the different positions.\nHowever, the ANOVA test does not provide a mechanism for knowing *which* group is driving the differences.\nIf we move forward with all possible two mean comparisons, we run the risk of a high type I error rate.\nAs we saw at the end of Section \\@ref(randANOVA), the follow-up questions surrounding individual group comparisons is called a problem of **multiple comparisons**\\index{multiple comparisons} and is outside the scope of this text.\nWe encourage you to learn more about multiple comparisons, however, so that additional comparisons, after you have rejected the null hypothesis in an ANOVA test, do not lead to undue false positive conclusions.\n\n\n\n\n\n### Reading an ANOVA table from software\n\nThe calculations required to perform an ANOVA by hand are tedious and prone to human error.\nFor these reasons, it is common to use statistical software to calculate the $F$-statistic and p-value.\n\nAn ANOVA can be summarized in a table very similar to that of a regression summary, which we saw in Chapters \\@ref(model-slr) and \\@ref(model-mlr).\nTable \\@ref(tab:anovaSummaryTableForOBPAgainstPosition) shows an ANOVA summary to test whether the mean of on-base percentage varies by player positions in the MLB.\nMany of these values should look familiar; in particular, the $F$-statistic and p-value can be retrieved from the last two columns.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
ANOVA summary for testing whether the average on-base percentage differs across player positions.
| term      |  df |  sumsq | meansq | statistic | p.value |
|:----------|----:|-------:|-------:|----------:|--------:|
| position  |   2 | 0.0161 | 0.0080 |      5.08 |  0.0066 |
| Residuals | 426 | 0.6740 | 0.0016 |           |         |
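The table above is standard ANOVA software output; the sketch below shows how it could be produced and double-checked in R, assuming the data have been restricted to the 429 players and that `position` has the three levels used here.

```r
# Fit the one-way ANOVA and read off the F statistic and p-value (sketch)
library(openintro)
mlb <- subset(mlb_players_18, AB >= 100)   # the players analyzed in this chapter

fit <- aov(OBP ~ position, data = mlb)
summary(fit)   # df, sums of squares, mean squares, F statistic, and p-value

# The p-value is the upper tail of an F distribution with df1 = 2 and df2 = 426,
# evaluated at the observed statistic F = MSG / MSE (about 5.08):
pf(5.08, df1 = 2, df2 = 426, lower.tail = FALSE)   # about 0.0066
```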
\n\n`````\n:::\n:::\n\n\n### Conditions for an ANOVA analysis\n\nThere are three conditions we must check for an ANOVA analysis: all observations must be independent, the data in each group must be nearly normal, and the variance within each group must be approximately equal.\n\n- **Independence.** If the data are a simple random sample, this condition can be assumed to be satisfied.\n For processes and experiments, carefully consider whether the data may be independent (e.g., no pairing).\n For example, in the MLB data, the data were not sampled.\n However, there are not obvious reasons why independence would not hold for most or all observations.\n\n- **Approximately normal.** As with one- and two-sample testing for means, the normality assumption is especially important when the sample size is quite small when it is ironically difficult to check for non-normality.\n A histogram of the observations from each group is shown in Figure \\@ref(fig:mlbANOVADiagNormalityGroups).\n Since each of the groups we are considering have relatively large sample sizes, what we are looking for are major outliers.\n None are apparent, so this conditions is reasonably met.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Histograms of OBP for each field position.](22-inference-many-means_files/figure-html/mlbANOVADiagNormalityGroups-1.png){width=90%}\n:::\n:::\n\n\n- **Constant variance.** The last assumption is that the variance in the groups is about equal from one group to the next. This assumption can be checked by examining a side-by-side box plot of the outcomes across the groups, as in Figure \\@ref(fig:mlbANOVABoxPlot). In this case, the variability is similar in the four groups but not identical. We see in Table \\@ref(tab:mlbHRPerABSummaryTable) that the standard deviation does not vary much from one group to the next.\n\n::: {.important data-latex=\"\"}\n**Diagnostics for an ANOVA analysis.**\n\nIndependence is always important to an ANOVA analysis.\nThe normality condition is very important when the sample sizes for each group are relatively small.\nThe constant variance condition is especially important when the sample sizes differ between groups.\n:::\n\n\\clearpage\n\n## Chapter review {#chp22-review}\n\n### Summary\n\nIn this chapter we have provided both the randomization test and the mathematical model appropriate for addressing questions of equality of means across two or more groups.\nNote that there were important technical conditions required for confirming that the F distribution appropriately modeled the ANOVA test statistic.\nAlso, you may have noticed that there was no discussion of creating confidence intervals.\nThat is because the ANOVA statistic does not have a direct analogue parameter to estimate.\nIf there is interest in comparisons of mean differences (across each set of two groups), then the methods from Chapter \\@ref(inference-two-means) comparing two independent means should be applied.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
analysis of variance, F-test, sum of squared error (SSE)
ANOVA, mean square between groups (MSG), sum of squares between groups (SSG)
data fishing, mean square error (MSE), sum of squares total (SST)
data snooping, multiple comparisons
degrees of freedom, prosecutor's fallacy
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp22-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-22].\n\n::: {.exercises data-latex=\"\"}\n1. **Fill in the blank.**\nWhen doing an ANOVA, you observe large differences in means between groups. Within the ANOVA framework, this would most likely be interpreted as evidence strongly favoring the _____________ hypothesis.\n\n1. **Which test?**\nWe would like to test if students who are in the social sciences, natural sciences, arts and humanities, and other fields spend the same amount of time, on average, studying for a course. What type of test should we use? Explain your reasoning.\n\n1. **Cuckoo bird egg lengths, randomize once.**\nCuckoo birds lay their eggs in other birds' nests, making them known as brood parasites. One question relates to whether the size of the cuckoo egg differs depending on the species of the host bird.^[The [`Cuckoo`](https://rdrr.io/cran/Stat2Data/man/Cuckoo.html) data used in this exercise can be found in the [**Stat2Data**](https://cran.r-project.org/web/packages/Stat2Data/index.html) R package.] [@Latter:1902]\n\n Consider the following plots, one represents the original data, the second represents data where the host species has been randomly assigned to the egg length.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-22-1.png){width=90%}\n :::\n :::\n\n a. Consider the average length of the eggs for each species. Is the average length for the original data: more variable, less variable, or about the same as the randomized species? Describe what you see in the plots.\n\n b. Consider the standard deviation of the lengths of the eggs within each species. Is the within species standard deviation of the length for the original data: bigger, smaller, or about the same as the randomized species?\n\n c. Recall that the F statistic's numerator measures how much the groups vary (MSG) with the denominator measuring how much the within species values vary (MSE), which of the plots above would have a larger F statistic, the original data or the randomized data? Explain.\n\n1. **Cuckoo bird egg lengths, randomization test.**\nCuckoo birds lay their eggs in other birds' nests, making them known as brood parasites. One question relates to whether the size of the cuckoo egg differs depending on the species of the host bird.^[The data [`Cuckoo`](https://rdrr.io/cran/Stat2Data/man/Cuckoo.html) used in this exercise can be found in the [**Stat2Data**](https://cran.r-project.org/web/packages/Stat2Data/index.html) R package.] [@Latter:1902]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-23-1.png){width=90%}\n :::\n :::\n\n Using the randomization distribution of the F statistic (host species randomized to egg length), conduct a hypothesis test to evaluate if there is a difference, in the population, between the average egg lengths for different host bird species. Make sure to state your hypotheses clearly and interpret your results in context of the data.\n\n1. **Chicken diet and weight, many groups.**\nAn experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. \nNewly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement.\nSample statistics and a visualization of the observed data are shown below. 
[@data:chickwts]\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-24-1.png){width=70%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
| Feed type |   Mean |    SD |  n |
|:----------|-------:|------:|---:|
| casein    | 323.58 | 64.43 | 12 |
| horsebean | 160.20 | 38.63 | 10 |
| linseed   | 218.75 | 52.24 | 12 |
| meatmeal  | 276.91 | 64.90 | 11 |
| soybean   | 246.43 | 54.13 | 14 |
| sunflower | 328.92 | 48.84 | 12 |
\n \n `````\n :::\n :::\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for the rest of the exercise.*\n :::\n \n \\clearpage\n\n The ANOVA output below can be used to test for differences between the average weights of chicks on different diets. Conduct a hypothesis test to determine if these data provide convincing evidence that the average weight of chicks varies across some (or all) groups. Make sure to check relevant conditions.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term df sumsq meansq statistic p.value
feed 5 231,129 46,226 15.4 <0.0001
Residuals 65 195,556 3,009
\n \n `````\n :::\n :::\n\n1. **Teaching descriptive statistics.**\nA study compared five different methods for teaching descriptive statistics. The five methods were traditional lecture and discussion, programmed textbook instruction, programmed text with lectures, computer instruction, and computer instruction with lectures. 45 students were randomly assigned, 9 to each method. After completing the course, students took a 1-hour exam.\n\n a. What are the hypotheses for evaluating if the average test scores are different for the different teaching methods?\n\n b. What are the degrees of freedom associated with the $F$-test for evaluating these hypotheses?\n\n c. Suppose the p-value for this test is 0.0168. What is the conclusion?\n\n1. **Coffee, depression, and physical activity.**\nCaffeine is the world's most widely used stimulant, with approximately 80% consumed in the form of coffee.\nParticipants in a study investigating the relationship between coffee consumption and exercise were asked to report the number of hours they spent per week on moderate (e.g., brisk walking) and vigorous (e.g., strenuous sports and jogging) exercise.\nBased on these data the researchers estimated the total hours of metabolic equivalent tasks (MET) per week, a value always greater than 0.\nThe table below gives summary statistics of MET for women in this study based on the amount of coffee consumed. [@Lucas:2011]\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Caffeinated coffee consumption
1 cup / week or fewer 2-6 cups / week 1 cup / day 2-3 cups / day 4 cups / day or more
Mean 18.7 19.6 19.3 18.9 17.5
SD 21.1 25.5 22.5 22.0 22.0
n 12,215 6,617 17,234 12,290 2,383
\n \n `````\n :::\n :::\n\n a. Write the hypotheses for evaluating if the average physical activity level varies among the different levels of coffee consumption.\n\n b. Check conditions and describe any assumptions you must make to proceed with the test.\n\n c. Below is the output associated with this test. What is the conclusion of the test?\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
df sumsq meansq statistic p.value
coffee 4 10,508 2,627 5.2 0
Residuals 50,734 25,564,819 504
Total 50,738 25,575,327
\n \n `````\n :::\n :::\n \n \\clearpage\n\n1. **Student performance across discussion sections.**\nA professor who teaches a large introductory statistics class (197 students) with eight discussion sections would like to test if student performance differs by discussion section, where each discussion section has a different teaching assistant. The summary table below shows the average final exam score for each discussion section as well as the standard deviation of scores and the number of students in each section.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Sec 1 Sec 2 Sec 3 Sec 4 Sec 5 Sec 6 Sec 7 Sec 8
Mean 92.94 91.11 91.80 92.45 89.30 88.30 90.12 93.35
SD 4.21 5.58 3.43 5.92 9.32 7.27 6.93 4.57
n 33 19 10 29 33 10 32 31
\n \n `````\n :::\n :::\n\n The ANOVA output below can be used to test for differences between the average scores from the different discussion sections.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
df sumsq meansq statistic p.value
section 7 525 75.0 1.87 0.077
Residuals 189 7,584 40.1
Total 196 8,109
\n \n `````\n :::\n :::\n\n Conduct a hypothesis test to determine if these data provide convincing evidence that the average score varies across some (or all) groups. Check conditions and describe any assumptions you must make to proceed with the test.\n\n1. **GPA and major.**\nUndergraduate students taking an introductory statistics course at Duke University conducted a survey about GPA and major. \nThe side-by-side box plots show the distribution of GPA among three groups of majors. Also provided is the ANOVA output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-30-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term df sumsq meansq statistic p.value
major 2 0.03 0.02 0.21 0.81
Residuals 195 15.77 0.08
\n \n `````\n :::\n :::\n\n a. Write the hypotheses for testing for a difference between average GPA across majors.\n\n b. What is the conclusion of the hypothesis test?\n\n c. How many students answered these questions on the survey, i.e. what is the sample size?\n \n \\clearpage\n\n1. **Work hours and education.**\nThe General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. [@data:gss:2010] Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Educational attainment Mean SD n
Lt High School 38.7 15.8 121
High School 39.6 15.0 546
Junior College 41.4 18.1 97
Bachelor 42.5 13.6 253
Graduate 40.8 15.5 155
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-31-1.png){width=100%}\n :::\n :::\n\n a. Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.\n\n b. Check conditions and describe any assumptions you must make to proceed with the test.\n\n c. Below is the output associated with this test. What is the conclusion of the test?\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term df sumsq meansq statistic p.value
degree 4 2,006 502 2.19 0.07
Residuals 1,167 267,382 229
\n \n `````\n :::\n :::\n \n \\clearpage\n\n1. **True / False: ANOVA, I.**\nDetermine if the following statements are true or false in ANOVA, and explain your reasoning for statements you identify as false.\n\n a. As the number of groups increases, the modified significance level for pairwise tests increases as well.\n\n b. As the total sample size increases, the degrees of freedom for the residuals increases as well.\n\n c. The constant variance condition can be somewhat relaxed when the sample sizes are relatively consistent across groups.\n\n d. The independence assumption can be relaxed when the total sample size is large.\n\n1. **True / False: ANOVA, II.**\nDetermine if the following statements are true or false, and explain your reasoning for statements you identify as false.\n\n If the null hypothesis that the means of four groups are all the same is rejected using ANOVA at a 5% significance level, then...\n\n a. we can then conclude that all the means are different from one another.\n\n b. the standardized variability between groups is higher than the standardized variability within groups.\n\n c. the pairwise analysis will identify at least one pair of means that are significantly different.\n\n d. the appropriate $\\alpha$ to be used in pairwise comparisons is 0.05 / 4 = 0.0125 since there are four groups.\n\n1. **Matching observed data with randomized F statistics.**\nConsider the following two datasets. The response variable is the `score` and the explanatory variable is whether the individual is in one of four groups.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-33-1.png){width=100%}\n :::\n :::\n\n The randomizations (randomly assigning group to the score, calculating a randomization F statistic) were done 1000 times for each of Dataset A and B. The red line on each plot indicates the observed F statistic for the original (unrandomized) data.\n\n a. Does the randomization distribution on the left correspond to Dataset A or B? Explain.\n\n b. Does the randomization distribution on the right correspond to Dataset A or B? Explain.\n \n \\clearpage\n\n1. **Child care hours.**\nThe China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments. [@data:china] It, for example, collects information on number of hours Chinese parents spend taking care of their children under age 6. The side-by-side box plots below show the distribution of this variable by educational attainment of the parent. Also provided below is the ANOVA output for comparing average hours across educational attainment categories.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](22-inference-many-means_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term df sumsq meansq statistic p.value
edu 4 4,142 1,036 1.26 0.28
Residuals 794 653,048 822
\n \n `````\n :::\n :::\n\n a. Write the hypotheses for testing for a difference between the average number of hours spent on child care across educational attainment levels.\n\n b. What is the conclusion of the hypothesis test?\n\n\n:::\n", + "supporting": [ + "22-inference-many-means_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/22-inference-many-means/figure-html/boxplotThreeVersionsOfExams-1.png b/_freeze/22-inference-many-means/figure-html/boxplotThreeVersionsOfExams-1.png new file mode 100644 index 00000000..990fef66 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/boxplotThreeVersionsOfExams-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/fDist2And423Shaded-1.png b/_freeze/22-inference-many-means/figure-html/fDist2And423Shaded-1.png new file mode 100644 index 00000000..fd0bae70 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/fDist2And423Shaded-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/mlbANOVABoxPlot-1.png b/_freeze/22-inference-many-means/figure-html/mlbANOVABoxPlot-1.png new file mode 100644 index 00000000..b586918a Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/mlbANOVABoxPlot-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/mlbANOVADiagNormalityGroups-1.png b/_freeze/22-inference-many-means/figure-html/mlbANOVADiagNormalityGroups-1.png new file mode 100644 index 00000000..cec5da14 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/mlbANOVADiagNormalityGroups-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/rand3exams-1.png b/_freeze/22-inference-many-means/figure-html/rand3exams-1.png new file mode 100644 index 00000000..83bee21d Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/rand3exams-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/rand3examspval-1.png b/_freeze/22-inference-many-means/figure-html/rand3examspval-1.png new file mode 100644 index 00000000..d0355b03 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/rand3examspval-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/toyANOVA-1.png b/_freeze/22-inference-many-means/figure-html/toyANOVA-1.png new file mode 100644 index 00000000..22731719 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/toyANOVA-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-22-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 00000000..8ea9dc40 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-22-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-23-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 00000000..74aef051 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-23-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-24-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..fbf45dcb Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-24-1.png differ diff --git 
a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-30-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 00000000..decbedb0 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-31-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 00000000..99183edd Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-31-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-33-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 00000000..c22cfcf5 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-33-1.png differ diff --git a/_freeze/22-inference-many-means/figure-html/unnamed-chunk-34-1.png b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 00000000..60fe59d1 Binary files /dev/null and b/_freeze/22-inference-many-means/figure-html/unnamed-chunk-34-1.png differ diff --git a/_freeze/23-inference-applications/execute-results/html.json b/_freeze/23-inference-applications/execute-results/html.json new file mode 100644 index 00000000..d0d4f120 --- /dev/null +++ b/_freeze/23-inference-applications/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "ff8fa3ae8a115469eeb2d74571dd844c", + "result": { + "markdown": "# Applications: Infer {#inference-applications}\n\n\n\n\n\n## Recap: Computational methods {#comp-methods-summary}\n\nThe computational methods we have presented are used in two settings.\nFirst, in many real life applications (as in those covered here), the mathematical model and computational model give identical conclusions.\nWhen there are no differences in conclusions, the advantage of the computational method is that it gives the analyst a good sense for the logic of the statistical inference process.\nSecond, when there is a difference in the conclusions (seen primarily in methods beyond the scope of this text), it is often the case that the computational method relies on fewer technical conditions and is therefore more appropriate to use.\n\n### Randomization\n\nThe important feature of randomization tests is that the data is permuted in such a way that the null hypothesis is true.\nThe randomization distribution provides a distribution of the statistic of interest under the null hypothesis, which is exactly the information needed to calculate a p-value --- where the p-value is the probability of obtaining the observed data or more extreme when the null hypothesis is true.\nAlthough there are ways to adjust the randomization for settings other than the null hypothesis being true, they are not covered in this book and they are not used widely.\nIn approaching research questions with a randomization test, be sure to ask yourself what the null hypothesis represents and how it is that permuting the data is creating different possible null data representations.\n\n**Hypothesis tests.** When using a randomization test, we proceed as follows:\n\n- Write appropriate hypotheses.\n\n- Compute the observed statistic of interest.\n\n- Permute the data repeatedly, each time, recalculating the statistic of interest.\n\n- Compute the proportion of times the permuted statistics are as extreme as or more extreme than the observed statistic, this is the p-value.\n\n- Make a conclusion based 
on the p-value, and write the conclusion in context and in plain language so anyone can understand the result.\n\n\\clearpage\n\n### Bootstrapping\n\nBootstrapping, in contrast to randomization tests, represents a proxy sampling of the original population.\nWith bootstrapping, the analyst is not forcing the null hypothesis to be true (or false, for that matter), but instead, they are replicating the variability seen in taking repeated samples from a population.\nBecause there is no underlying true (or false) null hypothesis, bootstrapping is typically used for creating confidence intervals for the parameter of interest.\nBootstrapping can be used to test particular values of a parameter (e.g., by evaluating whether a particular value of interest is contained in the confidence interval), but generally, bootstrapping is used for interval estimation instead of testing.\n\n**Confidence intervals.** The following is how we generally computed a confidence interval using bootstrapping:\n\n- Repeatedly resample the original data, with replacement, using the same sample size as the original data.\n\n- For each resample, calculate the statistic of interest.\n\n- Calculate the confidence interval using one of the following methods:\n\n - Bootstrap percentile interval: Obtain the endpoints representing the middle (e.g., 95%) of the bootstrapped statistics.\n The endpoints will be the confidence interval.\n\n - Bootstrap standard error (SE) interval: Find the SE of the bootstrapped statistics.\n The confidence interval will be given by the original observed statistic plus or minus some multiple (e.g., 2) of SEs.\n\n- Put the conclusions in context and in plain language so even non-statisticians and data scientists can understand the results.\n\n\\vspace{-4mm}\n\n## Recap: Mathematical models {#math-models-summary}\n\nThe mathematical models which have been used to produce inferential analyses follow a consistent framework for different parameters of interest.\nAs a way to contrast and compare the mathematical approach, we offer the following summaries in Tables \\@ref(tab:zcompare) and \\@ref(tab:tcompare).\n\n\\vspace{-4mm}\n\n### z-procedures\n\nGenerally, when the response variable is categorical (or binary), the summary statistic is a proportion and the model used to describe the proportion is the standard normal curve (also referred to as a $z$-curve or a $z$-distribution).\nWe provide Table \\@ref(tab:zcompare) partly as a mechanism for understanding $z$-procedures and partly to highlight the extremely common usage of the $z$-distribution in practice.\n\n\\vspace{-2mm}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Similarities of $z$-methods across one-sample and two-independent-sample analyses of a binary response variable.
One sample Two independent samples
Response variable Binary Binary
Parameter of interest Proportion: $p$ Difference in proportions: $p_1 - p_2$
Statistic of interest Proportion: $\\hat{p}$ Difference in proportions: $\\hat{p}_1 - \\hat{p}_2$
Standard error: HT $\\sqrt{\\frac{p_0(1-p_0)}{n}}$ $\\sqrt{\\hat{p}_{pool}\\bigg(1-\\hat{p}_{pool}\\bigg)\\bigg(\\frac{1}{n_1} + \\frac{1}{n_2}\\bigg)}$
Standard error: CI $\\sqrt{\\frac{\\hat{p}(1-\\hat{p})}{n}}$ $\\sqrt{\\frac{\\hat{p}_{1}(1-\\hat{p}_{1})}{n_1} + \\frac{\\hat{p}_{2}(1-\\hat{p}_{2})}{n_2}}$
Conditions 1. Independence, 2. Success-failure 1. Independence, 2. Success-failure
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n**Hypothesis tests.** When applying the $z$-distribution for a hypothesis test, we proceed as follows:\n\n- Write appropriate hypotheses.\n\n- Verify conditions for using the $z$-distribution.\n\n - One-sample: the observations (or differences) must be independent. The success-failure condition of at least 10 success and at least 10 failures should hold.\n - For a difference of proportions: each sample must separately satisfy the success-failure conditions, and the data in the groups must also be independent.\n\n- Compute the point estimate of interest and the standard error.\n\n- Compute the Z score and p-value.\n\n- Make a conclusion based on the p-value, and write a conclusion in context and in plain language so anyone can understand the result.\n\n**Confidence intervals.** Similarly, the following is how we generally computed a confidence interval using a $z$-distribution:\n\n- Verify conditions for using the $z$-distribution. (See above.)\n- Compute the point estimate of interest, the standard error, and $z^{\\star}.$\n- Calculate the confidence interval using the general formula:\\\n point estimate $\\pm\\ z^{\\star} SE.$\n- Put the conclusions in context and in plain language so even non-statisticians and data scientists can understand the results.\n\n### t-procedures\n\nWith quantitative response variables, the $t$-distribution was applied as the appropriate mathematical model in three distinct settings.\nAlthough the three data structures are different, their similarities and differences are worth pointing out.\nWe provide Table \\@ref(tab:tcompare) partly as a mechanism for understanding $t$-procedures and partly to highlight the extremely common usage of the $t$-distribution in practice.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Similarities of $t$-methods across one-sample, paired-sample, and two-independent-sample analyses of a numeric response variable.
One sample Paired sample Two independent samples
Response variable Numeric Numeric Numeric
Parameter of interest Mean: $\\mu$ Paired mean: $\\mu_{diff}$ Difference in means: $\\mu_1 - \\mu_2$
Statistic of interest Mean: $\\bar{x}$ Paired mean: $\\bar{x}_{diff}$ Difference in means: $\\bar{x}_1 - \\bar{x}_2$
Standard error $\\frac{s}{\\sqrt{n}}$ $\\frac{s_{diff}}{\\sqrt{n_{diff}}}$ $\\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}$
Degrees of freedom $n-1$ $n_{diff} -1$ $\\min(n_1 -1, n_2 - 1)$
Conditions 1. Independence, 2. Normality or large samples 1. Independence, 2. Normality or large samples 1. Independence, 2. Normality or large samples
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n**Hypothesis tests.** When applying the $t$-distribution for a hypothesis test, we proceed as follows:\n\n- Write appropriate hypotheses.\n\n- Verify conditions for using the $t$-distribution.\n\n - One-sample or differences from paired data: the observations (or differences) must be independent and nearly normal. For larger sample sizes, we can relax the nearly normal requirement, e.g., slight skew is okay for sample sizes of 15, moderate skew for sample sizes of 30, and strong skew for sample sizes of 60.\n - For a difference of means when the data are not paired: each sample mean must separately satisfy the one-sample conditions for the $t$-distribution, and the data in the groups must also be independent.\n\n- Compute the point estimate of interest, the standard error, and the degrees of freedom For $df,$ use $n-1$ for one sample, and for two samples use either statistical software or the smaller of $n_1 - 1$ and $n_2 - 1.$\n\n- Compute the T score and p-value.\n\n- Make a conclusion based on the p-value, and write a conclusion in context and in plain language so anyone can understand the result.\n\n**Confidence intervals.** Similarly, the following is how we generally computed a confidence interval using a $t$-distribution:\n\n- Verify conditions for using the $t$-distribution. (See above.)\n- Compute the point estimate of interest, the standard error, the degrees of freedom, and $t^{\\star}_{df}.$\n- Calculate the confidence interval using the general formula:\\\n point estimate $\\pm\\ t_{df}^{\\star} SE.$\n- Put the conclusions in context and in plain language so even non-statisticians and data scientists can understand the results.\n\n## Case study: Redundant adjectives {#case-study-redundant-adjectives}\n\nTake a look at the images in Figure \\@ref(fig:blue-triangle-shapes).\nHow would you describe the circled item in the top image (A)?\nWould you call it \"the triangle\"?\nOr \"the blue triangle\"?\nHow about in the bottom image (B)?\nDoes your answer change?\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two sets of four shapes. In A, the circled triangle is the only triangle. In B, the circled triangle is the only blue triangle.](23-inference-applications_files/figure-html/blue-triangle-shapes-1.png){fig-alt='Four shapes are presented twice. In A (the top presentation) the shapes and colors are all different: pink circle, yellow square, red diamond, blue triangle. In B (the bottom presentation) the colors are all different but the traingle shape is repeated: pink circle, yellow square, red triangle, blue triangle. In each of the two presentations the blue triangle is circled.' 
width=90%}\n:::\n:::\n\n\nIn the top image in Figure \\@ref(fig:blue-triangle-shapes) the circled item is the only triangle, while in the bottom image the circled item is one of two triangles.\nWhile in the top image \"the triangle\" is a sufficient description for the circled item, many of us might choose to refer to it as the \"blue triangle\" anyway.\nIn the bottom image there are two triangles, so \"the triangle\" is no longer sufficient, and to describe the circled item we must qualify it with the color as well, as \"the blue triangle\".\n\nYour answers to the above questions might be different if you're answering in a different language than English.\nFor example, in Spanish, the adjective comes after the noun (e.g., \"el triángulo azul\") therefore the incremental value of the additional adjective might be different for the top image.\n\nResearchers studying frequent use of redundant adjectives (e.g., referring to a single triangle as \"the blue triangle\") and incrementality of language processing designed an experiment where they showed the following two images to 22 native English speakers (undergraduates from University College London) and 22 native Spanish speakers (undergraduates from the Universidad de las Islas Baleares).\nThey found that in both languages, the subjects used more redundant color adjectives in denser displays where it would be more efficient.\n[@rubio-fernandez2021]\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Images used in one of the experiments described in [@rubio-fernandez2021].](images/redundant-adjectives-blue-triangle.png){fig-alt='Two presentations of shapes. In each presentation all of the shapes and their colors are unique. In the left presentation, the blue triangle is circled and is one of four shapes. In the right presentation, the blue triangle is circled and is one of sixteen shapes.' width=90%}\n:::\n:::\n\n\nIn this case study we will examine data from redundant adjective study, which the authors have made available on Open Science Framework at [osf.io/9hw68](https://osf.io/9hw68/).\n\n\n::: {.cell}\n\n:::\n\n\nTable \\@ref(tab:redundant-data) shows the top six rows of the data.\nThe full dataset has 88 rows.\nRemember that there are a total of 44 subjects in the study (22 English and 22 Spanish speakers).\nThere are two rows in the dataset for each of the subjects: one representing data from when they were shown an image with 4 items on it and the other with 16 items on it.\nEach subject was asked 10 questions for each type of image (with a different layout of items on the image for each question).\nThe variable of interest to us is `redundant_perc`, which gives the percentage of questions the subject used a redundant adjective to identify \"the blue triangle\".\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Top six rows of the data collected in the study.
language subject items n_questions redundant_perc
English 1 4 10 100
English 1 16 10 100
English 2 4 10 0
English 2 16 10 0
English 3 4 10 100
English 3 16 10 100
\n\n`````\n:::\n:::\n\n\n### Exploratory analysis\n\nIn one of the images shown to the subjects, there are 4 items, and in the other, there are 16 items.\nIn each of the images the circled item is the only triangle, therefore referring to it as \"the blue triangle\" or as \"el triángulo azul\" is considered redundant.\nIf the subject's response was \"the triangle\", they were recorded to have not used a redundant adjective.\nIf the response was \"the blue triangle\", they were recorded to have used a redundant adjective.\nFigure \\@ref(fig:reduntant-bar) shows the results of the experiment.\nWe can see that English speakers are more likely than Spanish speakers to use redundant adjectives, and that in both languages, subjects are more likely to use a redundant adjective when there are more items in the image (i.e. in a denser display).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Results of redundant adjective usage experiment from [@rubio-fernandez2021]. English speakers are more likely than Spanish speakers to use redundant adjectives, regardless of number of items in image. For both images, respondents are more likely to use a redundant adjective when there are more items in the image.](23-inference-applications_files/figure-html/reduntant-bar-1.png){width=90%}\n:::\n:::\n\n\nThese values are also shown in Table \\@ref(tab:redundant-table).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of the redundant adjective usage experiment.
Language Number of items Percentage redundant
English 4 37.27
English 16 80.45
Spanish 4 2.73
Spanish 16 61.36
\n\n`````\n:::\n:::\n\n\n### Confidence interval for a single mean\n\n\n::: {.cell}\n\n:::\n\n\nIn this experiment, the average percentage of redundant adjective usage among subjects who responded in English when presented with an image with 4 items in it is 37.27.\nAlong with the sample average as a point estimate, however, we can construct a confidence interval for the true mean redundant adjective usage of English speakers who use redundant color adjectives when describing items in an image that is not very dense.\n\n\n::: {.cell}\n\n:::\n\n\nUsing a computational method, we can construct the interval via bootstrapping.\nFigure \\@ref(fig:boot-eng-4-viz) shows the distribution of 1,000 bootstrapped means from this sample.\nThe 95% confidence interval (that is calculated by taking the 2.5th and 97.5th percentile of the bootstrap distribution is 19.1% to 56.4%.\nNote that this interval for the true population parameter is only valid if we can assume that the sample of English speakers are representative of the population of all English speakers.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:boot-eng-4-viz-cap)](23-inference-applications_files/figure-html/boot-eng-4-viz-1.png){width=90%}\n:::\n:::\n\n\n(ref:boot-eng-4-viz-cap) Distribution of 1,000 bootstrapped means of redundant adjective usage percentage among English speakers who were shown four items in images. Overlaid on the distribution is the 95% bootstrap percentile interval that ranges from 19.1% to 56.4%.\n\nUsing a similar technique, we can also construct confidence intervals for the true mean redundant adjective usage percentage for English speakers who are shown dense (16 item) displays and for Spanish speakers with both types (4 and 16 items) displays.\nHowever, these confidence intervals are not very meaningful to compare to one another as the interpretation of the \"true mean redundant adjective usage percentage\" is quite an abstract concept.\nInstead, we might be more interested in comparative questions such as \"Does redundant adjective usage differ between dense and sparse displays among English speakers and among Spanish speakers?\" or \"Does redundant adjective usage differ between English speakers and Spanish speakers?\" To answer either of these questions we need to conduct a hypothesis test.\n\n### Paired mean test\n\n\n::: {.cell}\n\n:::\n\n\nLet's start with the following question: \"Do the data provide convincing evidence of a difference in mean redundant adjective usage percentages between sparse (4 item) and dense (16 item) displays for English speakers?\" Note that the English speaking participants were each evaluated on both the 4 item and the 16 item displays.\nTherefore, the variable of interest is the difference in redundant percentage.\nThe statistic of interest will be the average of the differences, here $\\bar{x}_{diff} =$ 43.18.\n\nData from the first six English speaking participants are seen in Table \\@ref(tab:redundant-data-paired).\nAlthough the redundancy percentages seem higher in the 16 item task, a hypothesis test will tell us whether the differences observed in the data could be due to natural variability.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Redundancy percentages for the first six English-speaking participants, along with the difference between the two display conditions.
subject redundant_perc_4 redundant_perc_16 diff_redundant_perc
1 100 100 0
2 0 0 0
3 100 100 0
4 10 80 70
5 0 90 90
6 0 70 70
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\nWe can answer the research question using a hypothesis test with the following hypotheses:\n\n$$H_0: \\mu_{diff} = 0$$ $$H_A: \\mu_{diff} \\ne 0$$\n\nwhere $\\mu_{diff}$ is the true difference in redundancy percentages when comparing a 16 item display with a 4 item display.\nRecall that the computational method used to assess a hypothesis pertaining to the true average of a paired difference shuffles the observed percentage across the two groups (4 item vs 16 item) but **within** a single participant.\nThe shuffling process allows for repeated calculations of potential sample differences under the condition that the null hypothesis is true.\n\nFigure \\@ref(fig:eng-viz) shows the distribution of 1,000 mean differences from redundancy percentages permuted across the two conditions.\nNote that the distribution is centered at 0, since the structure of randomly assigning redundancy percentages to each item display will balance the data out such that the average of any differences will be zero.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Distribution of 1,000 mean differences of redundant adjective usage percentage among English speakers who were shown images with 4 and 16 items. Overlaid on the distribution is the observed average difference in the sample (solid line) as well as the difference in the other direction (dashed line), which is far out in the tail, yielding a p-value that is approximately 0.](23-inference-applications_files/figure-html/eng-viz-1.png){width=90%}\n:::\n:::\n\n\nWith such a small p-value, we reject the null hypothesis and conclude that the data provide convincing evidence of a difference in mean redundant adjective usage percentages across different displays for English speakers.\n\n### Two independent means test\n\nFinally, let's consider the question \"How does redundant adjective usage differ between English speakers and Spanish speakers?\" The English speakers are independent from the Spanish speakers, but since the same subjects were shown the two types of displays, we can't combine data from the two display types (4 objects and 16 objects) together while maintaining independence of observations.\nTherefore, to answer questions about language differences, we will need to conduct two hypothesis tests, one for sparse displays and the other for dense displays.\nIn each of the tests, the hypotheses are as follows:\n\n$$H_0: \\mu_{English} = \\mu_{Spanish}$$ $$H_A: \\mu_{English} \\ne \\mu_{Spanish}$$\n\nHere, the randomization process is slightly different than the paired setting (because the English and Spanish speakers do not have a natural pairing across the two groups).\nTo answer the research question using a computational method, we can use a randomization test where we permute the data across all participants under the assumption that the null hypothesis is true (no difference in mean redundant adjective usage percentages across English vs Spanish speakers).\n\n\n::: {.cell}\n\n:::\n\n\nFigure \\@ref(fig:compare-lang-viz) shows the null distributions for each of these hypothesis tests.\nThe p-value for the 4 item display comparison is very small (0.002) while the p-value for the 16 item display is much larger (0.102).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Distributions of 1,000 differences in randomized means of redundant adjective usage percentage between English and Spanish speakers. Plot A shows the differences in 4 item displays and Plot B shows the differences in 16 item displays. 
In each plot, the observed differences in the sample (solid line) as well as the differences in the other direction (dashed line) are overlaid.](23-inference-applications_files/figure-html/compare-lang-viz-1.png){width=90%}\n:::\n:::\n\n\nBased on the p-values (a measure of deviation from the null claim), we can conclude that the data provide convincing evidence of a difference in mean redundant adjective usage percentages between languages in 4 item displays (small p-value) but not in 16 item displays (not small p-value).\nThe results suggests that language patterns around redundant adjective usage might be more similar for denser displays than sparser displays.\n\n\\clearpage\n\n## Interactive R tutorials {#inference-tutorials}\n\nNavigate the concepts you've learned in this chapter in R using the following self-paced tutorials.\nAll you need is your browser to get started!\n\n::: {.alltutorials data-latex=\"\"}\n[Tutorial 5: Statistical inference](https://openintrostat.github.io/ims-tutorials/05-infer/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintrostat.github.io/ims-tutorials/05-infer\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 1: Inference for a single proportion](https://openintro.shinyapps.io/ims-05-infer-01/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-01\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 2: Hypothesis tests to compare proportions](https://openintro.shinyapps.io/ims-05-infer-02/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-02\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 3: Chi-squared test of independence](https://openintro.shinyapps.io/ims-05-infer-03/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-03\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 4: Chi-squared goodness of fit Test](https://openintro.shinyapps.io/ims-05-infer-04/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-04\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 5: Bootstrapping for estimating a parameter](https://openintro.shinyapps.io/ims-05-infer-05/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-05\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 6: Introducing the t-distribution](https://openintro.shinyapps.io/ims-05-infer-06/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-06\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 7: Inference for difference in two means](https://openintro.shinyapps.io/ims-05-infer-07/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-07\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 5 - Lesson 8: Comparing many means](https://openintro.shinyapps.io/ims-05-infer-08/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-05-infer-08\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of tutorials supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).\n:::\n\n\\vspace{-3mm}\n\n## R labs 
{#inference-labs}\n\nFurther apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.\n\n::: {.singlelab data-latex=\"\"}\n[Inference for categorical responses - Texting while driving](https://www.openintro.org/go?id=ims-r-lab-infer-1)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://www.openintro.org/go?id=ims-r-lab-infer-1\n:::\n\n:::\n\n\\vspace{-2mm}\n\n::: {.singlelab data-latex=\"\"}\n[Inference for numerical responses - Youth Risk Behavior Surveillance System](https://www.openintro.org/go?id=ims-r-lab-infer-2)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://www.openintro.org/go?id=ims-r-lab-infer-2\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of labs supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).\n:::\n", + "supporting": [ + "23-inference-applications_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/23-inference-applications/figure-html/blue-triangle-shapes-1.png b/_freeze/23-inference-applications/figure-html/blue-triangle-shapes-1.png new file mode 100644 index 00000000..ec62945e Binary files /dev/null and b/_freeze/23-inference-applications/figure-html/blue-triangle-shapes-1.png differ diff --git a/_freeze/23-inference-applications/figure-html/boot-eng-4-viz-1.png b/_freeze/23-inference-applications/figure-html/boot-eng-4-viz-1.png new file mode 100644 index 00000000..573df133 Binary files /dev/null and b/_freeze/23-inference-applications/figure-html/boot-eng-4-viz-1.png differ diff --git a/_freeze/23-inference-applications/figure-html/compare-lang-viz-1.png b/_freeze/23-inference-applications/figure-html/compare-lang-viz-1.png new file mode 100644 index 00000000..28ee812d Binary files /dev/null and b/_freeze/23-inference-applications/figure-html/compare-lang-viz-1.png differ diff --git a/_freeze/23-inference-applications/figure-html/eng-viz-1.png b/_freeze/23-inference-applications/figure-html/eng-viz-1.png new file mode 100644 index 00000000..3e037ff2 Binary files /dev/null and b/_freeze/23-inference-applications/figure-html/eng-viz-1.png differ diff --git a/_freeze/23-inference-applications/figure-html/reduntant-bar-1.png b/_freeze/23-inference-applications/figure-html/reduntant-bar-1.png new file mode 100644 index 00000000..154bd8cb Binary files /dev/null and b/_freeze/23-inference-applications/figure-html/reduntant-bar-1.png differ diff --git a/_freeze/24-inf-model-slr/execute-results/html.json b/_freeze/24-inf-model-slr/execute-results/html.json new file mode 100644 index 00000000..61b80fd0 --- /dev/null +++ b/_freeze/24-inf-model-slr/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "7df20055bb7b899c867239ffddab372f", + "result": { + "markdown": "\n\n\n# Inference for linear regression with a single predictor {#sec-inf-model-slr}\n\n\\chaptermark{Inference for regression with a single predictor}\n\n::: {.chapterintro data-latex=\"\"}\nWe now bring together ideas of inferential analyses with the descriptive models seen in Chapters \\@ref(model-slr).\nIn particular, we will use the least squares regression line to test whether there is a relationship between two continuous 
variables.\nAdditionally, we will build confidence intervals which quantify the slope of the linear regression line.\nThe setting is now focused on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample.\nThe sampling variability will inform the conclusions about the population that can be drawn.\n\nMany of the inferential ideas are remarkably similar to those covered in previous chapters.\nThe technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.\n\nWe encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from the statistical software.\nInference on models with multiple explanatory variables can suffer from data snooping which result in false positive claims.\nWe provide some guidance and hope the reader will further their statistical learning after working through the material in this text.\n:::\n\n\n\n\n\n## Case study: Sandwich store\n\n### Observed data\n\nWe start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue.\nThe hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample.\nBecause we have made up the example and the data (and the entire population), we can take many many samples from the population to visualize the variability.\nNote that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample.\nThe change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.\n\nConsider the following hypothetical population of all of the sandwich stores of a particular chain seen in Figure \\@ref(fig:sandpop).\nIn this made-up world, the CEO actually has all the relevant data, which is why they can plot it here.\nThe CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue.\nThere appears to be a linear relationship between advertising dollars and revenue (both in \\$1,000).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands of dollars.](24-inf-model-slr_files/figure-html/sandpop-1.png){width=90%}\n:::\n:::\n\n\nYou may remember from Chapter \\@ref(model-slr) that the population model is: $$y = \\beta_0 + \\beta_1 x + \\varepsilon.$$\n\nAgain, the omniscient CEO (with the full population information) can write down the true population model as: $$\\texttt{expected revenue} = 11.23 + 4.8 \\times \\texttt{advertising}.$$\n\n### Variability of the statistic\n\nUnfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether set up a new franchise.\nThe CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.\n\nAs with any numerical characteristic which describes a subset of the population, the estimated 
slope of a sample will vary from sample to sample.\nConsider the linear model which describes revenue (in \\$1,000) based on advertising dollars (in \\$1,000).\n\nThe least squares regression model uses the data to find a sample linear fit: $$\\hat{y} = b_0 + b_1 x.$$\n\nA random sample of 20 stores shows a different least square regression line depending on which observations are selected.\nA subset of size 20 stores shows a similar positive trend between advertising and revenue (to what we saw in Figure \\@ref(fig:sandpop) which described the population) despite having fewer observations on the plot.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed.](24-inf-model-slr_files/figure-html/unnamed-chunk-5-1.png){width=90%}\n:::\n:::\n\n\nA second sample of size 20 also shows a positive trend!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed.](24-inf-model-slr_files/figure-html/unnamed-chunk-6-1.png){width=90%}\n:::\n:::\n\n\nBut the lines are slightly different!\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The linear models from the two different random samples are quite similar, but they are not the same line.](24-inf-model-slr_files/figure-html/unnamed-chunk-7-1.png){width=90%}\n:::\n:::\n\n\nThat is, there is **variability** in the regression line from sample to sample.\nThe concept of the sampling variability is something you've seen before, but in this lesson, you will focus on the variability of the line often measured through the variability of a single statistic: **the slope of the line**.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. 
The red line provides the linear fit to the entire population.](24-inf-model-slr_files/figure-html/slopes-1.png){width=90%}\n:::\n:::\n\n\nYou might notice in Figure \\@ref(fig:slopes) that the $\\hat{y}$ values given by the lines are much more consistent in the middle of the dataset than at the ends.\nThe reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud.\nThe effect of the fan-shaped lines is that predicted revenue for advertising close to \\$4,000 will be much more precise than the revenue predictions made for \\$1,000 or \\$7,000 of advertising.\n\nThe distribution of slopes (for samples of size $n=20$) can be seen in a histogram, as in Figure \\@ref(fig:sand20lm).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Variability of slope estimates taken from many different samples of stores, each of size 20.](24-inf-model-slr_files/figure-html/sand20lm-1.png){width=90%}\n:::\n:::\n\n\nRecall, the example described in this introduction is hypothetical.\nThat is, we created an entire population in order demonstrate how the slope of a line would vary from sample to sample.\nThe tools in this textbook are designed to evaluate only one single sample of data.\nWith actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes.\nWe have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.\nHowever, it is nice to visually consider the linear models produced by different slopes.\nAdditionally, as with measuring the variability of previous statistics (e.g., $\\overline{X}_1 - \\overline{X}_2$ or $\\hat{p}_1 - \\hat{p}_2$), the histogram of the sample statistics can provide information related to inferential considerations.\n\nIn the following sections, the distribution (i.e., histogram) of $b_1$ (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you.\nFirst (in Section \\@ref(randslope)), the distribution of $b_1$ when $\\beta_1 = 0$ is constructed by randomizing (permuting) the response variable.\nNext (in Section \\@ref(bootbeta1)), we can bootstrap the data by taking random samples of size n from the original dataset.\nAnd last (in Section \\@ref(mathslope)), we use mathematical tools to describe the variability using the $t$-distribution that was first encountered in Section \\@ref(one-mean-math).\n\n## Randomization test for the slope {#randslope}\n\nConsider data on 100 randomly selected births gathered originally from the US Department of Health and Human Services.\nSome of the variables are plotted in Figure \\@ref(fig:babyweight).\n\nThe scientific research interest at hand will be in determining the linear relationship between weight of baby at birth (in lbs) and number of weeks of gestation.\nThe dataset is quite rich and deserves exploring, but for this example, we will focus only on the weight of the baby.\n\n::: {.data data-latex=\"\"}\nThe [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\nWe will work with a random sample of 100 observations from these data.\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Weight of baby at birth (in lbs) as plotted by four other birth variables (mother's weight gain, mother's age, number of hospital visits, and weeks 
gestation).](24-inf-model-slr_files/figure-html/babyweight-1.png){width=90%}\n:::\n:::\n\n\nAs you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.\nIn the linear model setting, we might hope to have a linear relationship between `weeks` and `weight` in settings where `weeks` gestation is known and `weight` of baby needs to be predicted.\n\nThe relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.\nHere the population refers to a larger population of births in the US.\n\n- $H_0: \\beta_1= 0$, there is no linear relationship between `weight` and `weeks`.\n- $H_A: \\beta_1 \\ne 0$, there is some linear relationship between `weight` and `weeks`.\n\nRecall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables.\nThat is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but **not** due to variables being correlated.\nFigure \\@ref(fig:permweightScatter) shows the observed data and a scatterplot of one permutation of the `weight` variable.\nThe careful observer can see that each of the observed values for `weight` (and for `weeks`) exist in both the original data plot as well as the permuted `weight` plot, but the `weight` and `weeks` gestation are no longer matched for a given birth.\nThat is, each `weight` value is randomly assigned to a new `weeks` gestation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:permweightScatter-cap)](24-inf-model-slr_files/figure-html/permweightScatter-1.png){width=90%}\n:::\n:::\n\n\n(ref:permweightScatter-cap) Original (left) and permuted (right) data. The permutation removes the linear relationship between `weight` and `weeks`. Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true).\n\nBy repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship).\nThe randomization test compares the slopes calculated from the permuted response variable with the observed slope.\nIf the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).\n\n### Observed data\n\nWe will continue to use the births data to investigate the linear relationship between `weight` and `weeks` gestation.\nNote that the least squares model (see Chapter \\@ref(model-slr)) describing the relationship is given in Table \\@ref(tab:ls-births).\nThe columns in Table \\@ref(tab:ls-births) are further described in Section \\@ref(mathslope).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.335.
term estimate std.error statistic p.value
(Intercept) -5.72 1.61 -3.54 6e-04
weeks 0.34 0.04 8.07 <0.0001
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nAfter permuting the data, the least squares estimate of the line can be computed.\nRepeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between `weight` and `weeks` gestation.\nFigure \\@ref(fig:permweekslm) shows two different permutations of `weight` and the resulting linear models.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:permweekslm-cap)](24-inf-model-slr_files/figure-html/permweekslm-1.png){width=90%}\n:::\n:::\n\n\n(ref:permweekslm-cap) Two different permutations of the `weight` variable with slightly different least squares regression lines.\n\nAs you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.\nBecause the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.\n\n### Observed statistic vs. null statistics\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:nulldistBirths-cap)](24-inf-model-slr_files/figure-html/nulldistBirths-1.png){width=90%}\n:::\n:::\n\n\n(ref:nulldistBirths-cap) Histogram of slopes given different permutations of the `weight` variable. The vertical red line is at the observed value of the slope, 0.335.\n\nAs we can see from Figure \\@ref(fig:nulldistBirths), a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the `weight` variable.\nThat is, if indeed there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.\nWe reject the null hypothesis.\nTherefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between `weight` of baby and `weeks` gestation for births in the US.\n\n## Bootstrap confidence interval for the slope {#bootbeta1}\n\nAs we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test).\nBecause interest is now in creating a CI, there is no null hypothesis, so there won't be any reason to permute either of the variables.\n\n\n\n\n\n### Observed data\n\nReturning to the births data, we may want to consider the relationship between `mage` (mother's age) and `weight`.\nIs `mage` a good predictor of `weight`?\nAnd if so, what is the relationship?\nThat is, what is the slope that models average `weight` of baby as a function of `mage` (mother's age)?\nThe linear model regressing `weight` on `mage` is provided in Table \\@ref(tab:ls-births-mage).\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:magePlot-cap)](24-inf-model-slr_files/figure-html/magePlot-1.png){width=90%}\n:::\n:::\n\n\n(ref:magePlot-cap) Original data: `weight` of baby as a linear model of mother's age. Notice that the relationship between `mage` and `weight` is not as strong as the relationship we saw previously between `weeks` and `weight`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.036.
term estimate std.error statistic p.value
(Intercept) 6.23 0.71 8.79 <0.0001
mage 0.04 0.02 1.50 0.1362
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nBecause the focused is not on a null distribution, sample with replacement $n=100$ observations from the original dataset.\nRecall that with bootstrapping the resample always has the same number of observations as the original dataset in order to mimic the process of taking a sample from the population.\nWhen sampling in the linear model case, consider each observation to be a single dot.\nIf the dot is resampled, both the `weight` and the `mage` measurement are observed.\nThe measurements are linked to the dot (i.e., to the birth in the sample).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Original and one bootstrap sample of the births data. Note that it is difficult to differentiate the two plots, as (within a single bootstrap sample) the observations which have been resampled twice are plotted as points on top of one another. The red circles represent points in the original data which were not included in the bootstrap sample. The blue circles represents a data point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circles represents a particular structure to the data which is observed in both the original and bootstrap samples.](24-inf-model-slr_files/figure-html/birth2BS-1.png){width=90%}\n:::\n:::\n\n\nFigure \\@ref(fig:birth2BS) shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models.\nThe red circles represent points in the original data which were not included in the bootstrap sample.\nThe blue circles represents a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample.\nThe green circles represents a particular structure to the data which is observed in both the original and bootstrap samples.\nBy repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in Figure \\@ref(fig:birthBS).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Repeated bootstrap resamples of size 100 are taken from the original data. Each of the bootstrapped linear model is slightly different.](24-inf-model-slr_files/figure-html/birthBS-1.png){width=90%}\n:::\n:::\n\n\nRecall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.\nFigure \\@ref(fig:mageBSslopes) is a histogram of the relevant bootstrapped slopes.\nWe can see that a 95% bootstrap percentile interval for the true population slope is given by (-0.01, 0.081).\nWe are 95% confident that for the model describing the population of births, described by mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with an increase in predicted average baby `weight` of between -0.01 and 0.081 pounds (notice that the CI overlaps zero!).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:mageBSslopes-cap)](24-inf-model-slr_files/figure-html/mageBSslopes-1.png){width=90%}\n:::\n:::\n\n\n(ref:mageBSslopes-cap) The original births data on `weight` and `mage` is bootstrapped 1,000 times. 
The histogram provides a sense for the variability of the slope of the linear model slope from sample to sample.\n\n::: {.workedexample data-latex=\"\"}\nUsing Figure \\@ref(fig:mageBSslopes), calculate the bootstrap estimate for the standard error of the slope.\nUsing the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true population slope, and interpret the interval in context.\n\n------------------------------------------------------------------------\n\nNotice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).\nUsing the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the slopes is approximately 0.0225.\nThe normal cutoff for a 95% confidence interval is $z^\\star = 1.96$ which leads to a confidence interval of $b_1 \\pm 1.96 \\cdot SE \\rightarrow 0.036 \\pm 1.96 \\cdot 0.0225 \\rightarrow (-0.0081, 0.0801).$ The bootstrap SE confidence interval is almost identical to the bootstrap percentile interval.\nIn context, we are 95% confident that for the model describing the population of births, described by mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with an increase in predicted average baby `weight` of between -0.0081 and 0.0801 pounds\n:::\n\n## Mathematical model for testing the slope {#mathslope}\n\nWhen certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.\nThe approximations will build on the t-distribution which was described in Chapter \\@ref(inference-one-mean).\nThe mathematical model is often correct and is usually easy to implement computationally.\nThe validity of the technical conditions will be considered in detail in Section \\@ref(tech-cond-linmod).\n\nIn this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line.\nJust as we identified standard errors for point estimates in previous chapters, we first discuss standard errors for these new estimates.\n\n### Observed data\n\n**Midterm elections and unemployment**\n\nElections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S.\nPresidential election.\nThe set of House elections occurring during the middle of a Presidential term are called midterm elections.\nIn America's two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.\nIn 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.\n\nTo assess the validity of this claim, we can compile historical data and look for a connection.\nWe consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression.\nThe House of Representatives is made up of 435 voting members.\n\n::: {.data data-latex=\"\"}\nThe [`midterms_house`](http://openintrostat.github.io/openintro/reference/midterms_house.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nFigure \\@ref(fig:unemploymentAndChangeInHouse) shows these data and the least-squares regression line:\n\n$$\n\\begin{aligned}\n&\\texttt{percent change in House seats for President's party} \\\\\n&\\qquad\\qquad= -7.36 - 0.89 \\times 
\\texttt{(unemployment rate)}\n\\end{aligned}\n$$\n\nWe consider the percent change in the number of seats of the President's party (e.g., percent change in the number of seats for Republicans in 2018) against the unemployment rate.\n\nExamining the data, there are no clear deviations from linearity or substantial outliers (see Section \\@ref(resids) for a discussion on using residuals to visualize how well a linear model fits the data).\nWhile the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The percent change in House seats for the President's party in each election from 1898 to 2010 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data.](24-inf-model-slr_files/figure-html/unemploymentAndChangeInHouse-1.png){width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nThe data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively.\nDo you agree that they should be removed for this investigation?\nWhy or why not?[^24-inf-model-slr-1]\n:::\n\n[^24-inf-model-slr-1]: The answer to this question relies on the idea that statistical data analysis is somewhat of an art.\n That is, in many situations, there is no \"right\" answer.\n As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset.\n In terms of the Great Depression, we will provide two contrasting considerations.\n Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high.\n On the other hand, these are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.\n\nThere is a negative slope in the line shown in Figure \\@ref(fig:unemploymentAndChangeInHouse).\nHowever, this slope (and the y-intercept) are only estimates of the parameter values.\nWe might wonder, is this convincing evidence that the \"true\" linear model has a negative slope?\nThat is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of the midterm election?\nWe can frame this investigation into a statistical hypothesis test:\n\n- $H_0$: $\\beta_1 = 0$. The true linear model has slope zero.\n- $H_A$: $\\beta_1 \\neq 0$. The true linear model has a slope different than zero. 
The unemployment is predictive of whether the President's party wins or loses seats in the House of Representatives.\n\nWe would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope parameter is different than zero.\nTo assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.\n\n### Variability of the statistic\n\nJust like other point estimates we have seen before, we can compute a standard error and test statistic for $b_1$.\nWe will generally label the test statistic using a $T$, since it follows the $t$-distribution.\n\nWe will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.\nTable \\@ref(tab:midtermUnempRegTable) shows software output for the least squares regression line in Figure \\@ref(fig:unemploymentAndChangeInHouse).\nThe row labeled `unemp` includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable).\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Output from statistical software for the regression line modeling the midterm election losses for the President's party as a response to unemployment.
term estimate std.error statistic p.value
(Intercept) -7.36 5.16 -1.43 0.16
unemp -0.89 0.83 -1.07 0.30
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nWhat do the first and second columns of Table \\@ref(tab:midtermUnempRegTable) represent?\n\n------------------------------------------------------------------------\n\nThe entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values in the second column correspond to the standard errors of each estimate.\nUsing the estimates, we could write the equation for the least square regression line as\n\n$$ \\hat{y} = -7.36 - 0.89 x $$\n\nwhere $\\hat{y}$ in this case represents the predicted change in the number of seats for the president's party, and $x$ represents the unemployment rate.\n:::\n\nWe previously used a $t$-test statistic for hypothesis testing in the context of numerical data.\nRegression is very similar.\nIn the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T score formula:\n\n$$\nT \\ = \\ \\frac{\\text{estimate} - \\text{null value}}{\\text{SE}} = \\ \\frac{-0.89 - 0}{0.835} = \\ -1.07\n$$\n\nThis corresponds to the third column of Table \\@ref(tab:midtermUnempRegTable) .\n\n::: {.workedexample data-latex=\"\"}\nUse Table \\@ref(tab:midtermUnempRegTable) to determine the p-value for the hypothesis test\n\n------------------------------------------------------------------------\n\nThe last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate 0.2961 That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.\n:::\n\n### Observed statistic vs. null statistics\n\nAs the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether the data could have come from a population where the true slope was zero (i.e., $\\beta_1 = 0$).\nBefore evaluating the formal hypothesis claim, sometimes it is important to check your intuition.\nBased on everything we have seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.\n\n::: {.workedexample data-latex=\"\"}\nExamine Figure \\@ref(fig:elmhurstScatterWLine), which relates the Elmhurst College aid and student family income.\nAre you convinced that the slope is meaningfully different from zero?\nThat is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?\n\n------------------------------------------------------------------------\n\nWhile the relationship between the variables is not perfect, there is an evident decreasing trend in the data.\nThis suggests the hypothesis test will reject the null claim that the slope is zero.\n:::\n\n::: {.data data-latex=\"\"}\nThe [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe tools in this section help you go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about whether the slope estimate is meaningfully different from 0 to suggest that the true population slope is different from 0.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n 
\n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.
term estimate std.error statistic p.value
(Intercept) 24319.33 1291.45 18.83 <0.0001
family_income -0.04 0.01 -3.98 2e-04
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nTable \\@ref(tab:rOutputForIncomeAidLSRLineInInferenceSection) shows statistical software output from fitting the least squares regression line shown in Figure \\@ref(fig:elmhurstScatterWLine).\nUse the output to formally evaluate the following hypotheses.[^24-inf-model-slr-2]\n\n- $H_0$: The true coefficient for family income is zero.\n- $H_A$: The true coefficient for family income is not zero.\n:::\n\n[^24-inf-model-slr-2]: We look in the second row corresponding to the family income variable.\n We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the $t$-test statistic is $T = -3.98$.\n The p-value corresponds exactly to the two-sided test we are interested in: 0.0002.\n The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshman entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of Figure \\@ref(fig:elmhurstScatterWLine).\n\n::: {.important data-latex=\"\"}\n**Inference for regression.**\n\nWe usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice.\nHowever, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.\nSee Section \\@ref(tech-cond-linmod).\n:::\n\n\\clearpage\n\n## Mathematical model, interval for the slope\n\n### Observed data\n\nSimilar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct a confidence interval for that coefficient.\n\n::: {.workedexample data-latex=\"\"}\nCompute the 95% confidence interval for the coefficient using the regression output from Table \\@ref(tab:rOutputForIncomeAidLSRLineInInferenceSection).\n\n------------------------------------------------------------------------\n\nThe point estimate is -0.0431 and the standard error is $SE = 0.0108$.\nWhen constructing a confidence interval for a model coefficient, we generally use a $t$-distribution.\nThe degrees of freedom for the distribution are noted in the regression output, $df = 48$, allowing us to identify $t_{48}^{\\star} = 2.01$ for use in the confidence interval.\n\nWe can now construct the confidence interval in the usual way:\n\n$$\n\\begin{aligned}\n\\text{point estimate} &\\pm t_{48}^{\\star} \\times SE \\\\\n-0.0431 &\\pm 2.01 \\times 0.0108 \\\\\n(-0.0648 &, -0.0214)\n\\end{aligned}\n$$\n\nWe are 95% confident that with each dollar increase in , the university's gift aid is predicted to decrease on average by \\$0.0214 to \\$0.0648.\n:::\n\n### Variability of the statistic\n\n::: {.important data-latex=\"\"}\n**Confidence intervals for coefficients.**\n\nConfidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the $t$-distribution:\n\n$$ b_i \\ \\pm\\ t_{df}^{\\star} \\times SE_{b_{i}} $$\n\nwhere $t_{df}^{\\star}$ is the appropriate $t$-value corresponding to the confidence level with the model's degrees of freedom.\n:::\n\nOn the topic of intervals in this book, we have focused exclusively on confidence intervals for model parameters.\nHowever, there are other types of intervals that may be of interest, including prediction intervals for a response value and confidence intervals for a mean response value in the context of 
regression.\n\n\\clearpage\n\n## Checking model conditions {#tech-cond-linmod}\n\nIn the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions.\nIn this section, we'll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.\n\n\n\n\n\n### What are the technical conditions for the mathematical model?\n\nWhen fitting a least squares line, we generally require\n\n- **Linearity.** The data should show a linear trend.\n If there is a nonlinear trend (e.g., first panel of Figure \\@ref(fig:whatCanGoWrongWithLinearModel)) an advanced regression method from another book or later course should be applied.\n\n- **Independent observations.** Be cautious about applying regression to data, which are sequential observations in time such as a stock price each day.\n Such data may have an underlying structure that should be considered in a model and analysis.\n An example of a dataset where successive observations are not independent is shown in the fourth panel of Figure \\@ref(fig:whatCanGoWrongWithLinearModel).\n There are also other instances where correlations within the data are important, which is further discussed in Chapter \\@ref(inf-model-mlr).\n\n- **Nearly normal residuals.** Generally, the residuals must be nearly normal.\n When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we'll talk about more in Section \\@ref(outliers-in-regression).\n An example of a residual that would be a potentially concern is shown in the second panel of Figure \\@ref(fig:whatCanGoWrongWithLinearModel), where one observation is clearly much further from the regression line than the others.\n\n- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.\n An example of non-constant variability is shown in the third panel of Figure \\@ref(fig:whatCanGoWrongWithLinearModel), which represents the most common pattern observed when this condition fails: the variability of $y$ is larger when $x$ is larger.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Four examples showing when the methods in this chapter are insufficient to apply to the data. The top set of graphs represents the $x$ and $y$ relationship. The bottom set of graphs is a residual plot.First panel: linearity fails. Second panel: there are outliers, most especially one point that is very far away from the line. Third panel: the variability of the errors is related to the value of $x$. 
Fourth panel: a time series dataset is shown, where successive observations are highly correlated.](24-inf-model-slr_files/figure-html/whatCanGoWrongWithLinearModel-1.png){width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nShould we have concerns about applying least squares regression to the Elmhurst data in Figure \\@ref(fig:elmhurstScatterW2Lines)?[^24-inf-model-slr-3]\n:::\n\n[^24-inf-model-slr-3]: The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant.\n These are also not time series observations.\n Least squares regression can be applied to these data.\n\nThe technical conditions are often remembered using the **LINE** mnemonic.\nThe linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in Figure \\@ref(fig:whatCanGoWrongWithLinearModel).\nA careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.\n\n- L: **linear** model\n- I: **independent** observations\n- N: points are **normally** distributed around the line\n- E: **equal** variability around the line for all values of the explanatory variable\n\n### Why do we need technical conditions?\n\nAs with other inferential techniques we have covered in this text, if the technical conditions above do not hold, then it is not possible to make concluding claims about the population.\nThat is, without the technical conditions, the T score (or Z score) will not have the assumed t-distribution (or standard normal $z$-distribution).\nThat said, it is almost always impossible to check the conditions precisely, so we look for large deviations from the conditions.\nIf there are large deviations, we will be unable to trust the calculated p-value or the endpoints of the resulting confidence interval.\n\n**The model based on Linearity**\n\nThe linearity condition is among the most important if your goal is to understand a linear model between $x$ and $y$.\nFor example, the value of the slope will not be at all meaningful if the true relationship between $x$ and $y$ is quadratic, as in Figure \\@ref(fig:notGoodAtAllForALinearModel).\nNot only should we be cautious about the inference, but the model *itself* is also not an accurate portrayal of the relationship between the variables.\n\nIn Section \\@ref(inf-model-mlr) we discuss model modifications that can often lead to an excellent fit of strong relationships other than linear ones.\nHowever, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.\n\n**The importance of Independence**\n\nThe technical condition describing the independence of the observations is often the most crucial but also the most difficult to diagnose.\nIt is also extremely difficult to gather a dataset which is a true random sample from the population of interest.\n(Note: a true randomized experiment from a fixed set of individuals is much easier to implement, and indeed, randomized experiments are done in most medical studies these days.)\n\nDependent observations can bias results in ways that produce fundamentally flawed analyses.\nThat is, if you hang out at the gym measuring height and weight, your linear model is surely not a representation of all students at your university.\nAt best it is a model describing students who use the gym (but also who are willing to talk to you, that use the gym at the times you were there measuring, 
etc.).\n\nIn lieu of trying to answer whether your observations are a true random sample, you might instead focus on whether you believe your observations are representative of the populations.\nHumans are notoriously bad at implementing random procedures, so you should be wary of any process that used human intuition to balance the data with respect to, for example, the demographics of the individuals in the sample.\n\n\\clearpage\n\n**Some thoughts on Normality**\n\nThe normality condition requires that points vary symmetrically around the line, spreading out in a bell-shaped fashion.\nYou should consider the \"bell\" of the normal distribution as sitting on top of the line (coming off the paper in a 3-D sense) so as to indicate that the points are dense close to the line and disperse gradually as they get farther from the line.\n\nThe normality condition is less important than linearity or independence for a few reasons.\nFirst, the linear model fit with least squares will still be an unbiased estimate of the true population model.\nHowever, the standard errors associated with variability of the line will not be well estimated.\nFortunately the Central Limit Theorem tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough.\nOne analysis method that *does* require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given $x$ value, using the linear model.\nOne additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.\n\n**Equal variability for prediction in particular**\n\nAs with normality, the equal variability condition (that points are spread out in similar ways around the line for all values of $x$) will not cause problems for the estimate of the linear model.\nThat said, the **inference** on the model (e.g., computing p-values) will be incorrect if the variability around the line is heterogeneous.\nData that exhibit non-equal variance across the range of x-values will have the potential to seriously mis-estimate the variability of the slope which will have consequences for the inference results (i.e., hypothesis tests and confidence intervals).\n\nThe inference results for both a randomization test or a bootstrap confidence interval are robust to the equal variability condition, so they give the analyst methods to use when the data are heteroskedastic (that is, exhibit unequal variability around the regression line).\nAlthough randomization tests and bootstrapping allow us to analyze data using fewer conditions, some technical conditions are required for all methods described in this text (e.g., independent observation).\nWhen the equal variability condition is violated and a mathematical analysis (e.g., p-value from T score) is needed, there are other existing methods (outside the scope of this text) which can easily handle the unequal variance (e.g., weighted least squares analysis).\n\n### What if all the technical conditions are met?\n\nWhen the technical conditions are met, the least squares regression model and inference is provided by virtually all statistical software.\nIn addition to being ubiquitous, however, an additional advantage to the least squares regression model (and related inference) is that the linear model 
has important extensions (which are not trivial to implement with bootstrapping and randomization tests).\nIn particular, random effects models, repeated measures, and interaction are all linear model extensions which require the above technical conditions.\nWhen the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand.\nWe will discuss some of the extended modeling and associated inference in Chapter \\@ref(inf-model-mlr) and Section \\@ref(inf-model-logistic).\nMany of the techniques used to deal with technical condition violations are outside the scope of this text, but they are taught in universities in the very next class after this one.\nIf you are working with linear models or curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.\n\n\\clearpage\n\n## Chapter review {#chp24-review}\n\n### Summary\n\nRecall that early in the text we presented graphical techniques which communicated relationships across multiple variables.\nWe also used modeling to formalize the relationships.\nMany chapters were dedicated to inferential methods which allowed claims about the population to be made based on samples of data.\nNot only did we present the mathematical model for each of the inferential techniques, but when appropriate, we also presented bootstrapping and permutation methods.\n\nHere in @sec-inf-model-slr we brought all of those ideas together by considering inferential claims on linear models through randomization tests, bootstrapping, and mathematical modeling.\nWe continue to emphasize the importance of experimental design in making conclusions about research claims.\nIn particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure \\@ref(fig:randsampValloc)).\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n
bootstrap CI for the slope randomization test for the slope technical conditions linear regression
inference with single predictor regression t-distribution for slope variability of the slope
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp24-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-24].\n\n::: {.exercises data-latex=\"\"}\n1. **Body measurements, randomization test.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nA linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n \n Below are two items.\n The first is the standard linear model output for predicting height from shoulder girth.\n The second is a histogram of slopes from 1,000 randomized datasets (1,000 times, `hgt` was permuted and regressed against `sho_gi`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 105.832 3.27 32.3 <0.0001
sho_gi 0.604 0.03 20.0 <0.0001
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-31-1.png){width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model predicting height from shoulder girth is differen than 0.\n \n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like shoulder girth and height).\n \n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model? Explain.\n \n \\clearpage\n\n1. **Body measurements, mathematical test.**\nThe scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals. [@Heinz:2003]\n\n \\vspace{-2mm}\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-32-1.png){width=70%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -105.01 7.54 -13.9 <0.0001
hgt 1.02 0.04 23.1 <0.0001
\n \n `````\n :::\n :::\n\n a. Describe the relationship between height and weight.\n\n b. Write the equation of the regression line. Interpret the slope and intercept in context.\n\n c. Do the data provide convincing evidence that the true slope parameter is different than 0? State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n\n d. The correlation coefficient for height and weight is 0.72. Calculate $R^2$ and interpret it.\n\n1. **Body measurements, bootstrap percentile interval.** \nIn order to estimate the slope of the model predicting height based on shoulder girth (circumference of shoulders measured over deltoid muscles), 1,000 bootstrap samples are taken from a dataset of body measurements from 507 people. \nA linear model predicting height from shoulder girth is fit to each bootstrap sample, and the slope is estimated.\nA histogram of these slopes is shown below. [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-33-1.png){width=90%}\n :::\n :::\n\n a. Using the bootstrap percentile method and the histogram above, find a 98% confidence interval for the slope parameter.\n \n b. Interpret the confidence interval in the context of the problem.\n \n \\clearpage\n\n1. **Body measurements, standard error bootstrap interval.** \nA linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters. [@Heinz:2003]\n\n Below are two items.\n The first is the standard linear model output for predicting height from shoulder girth.\n The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 105.832 3.27 32.3 <0.0001
sho_gi 0.604 0.03 20.0 <0.0001
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n\n a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).\n \n b. Find a 98% bootstrap SE confidence interval for the slope parameter.\n \n c. Interpret the confidence interval in the context of the problem.\n \n \\clearpage\n\n1. **Murders and poverty, randomization test.** \nThe following regression output is for predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.\n\n Below are two items.\n The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.\n The second is a histogram of slopes from 1000 randomized datasets (1000 times, `annual_murders_per_mil` was permuted and regressed against `perc_pov`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -29.90 7.79 -3.84 0.0012
perc_pov 2.56 0.39 6.56 <0.0001
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-35-1.png){width=90%}\n :::\n :::\n \n a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting annual murder rate from poverty percentage is different than 0?\n \n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like murder rate and poverty).\n \n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model? Explain.\n \n \\clearpage\n\n1. **Murders and poverty, mathematical test.**\nThe table below shows the output of a linear model annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -29.90 7.79 -3.84 0.0012
perc_pov 2.56 0.39 6.56 <0.0001
\n \n `````\n :::\n :::\n\n a. What are the hypotheses for evaluating whether the slope of the model predicting annual murder rate from poverty percentage is different than 0?\n\n b. State the conclusion of the hypothesis test from part (a) in context of the data. What does this say about whether poverty percentage is a useful predictor of annual murder rate?\n\n c. Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.\n\n d. Do your results from the hypothesis test and the confidence interval agree? Explain.\n\n1. **Murders and poverty, bootstrap percentile interval.**\nData on annual murders per million (`annual_murders_per_mil`) and percentage living in poverty (`perc_pov`) is collected from a random sample of 20 metropolitan areas.\nUsing these data we want to estimate the slope of the model predicting `annual_murders_per_mil` from `perc_pov`.\nWe take 1,000 bootstrap samples of the data and fit a linear model predicting `annual_murders_per_mil` from `perc_pov` to each bootstrap sample.\nA histogram of these slopes is shown below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-37-1.png){width=90%}\n :::\n :::\n \n a. Using the percentile bootstrap method and the histogram above, find a 90% confidence interval for the slope parameter.\n \n b. Interpret the confidence interval in the context of the problem.\n \n \\clearpage\n\n1. **Murders and poverty, standard error bootstrap interval.**\nA linear model is built to predict annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.\n\n Below are two items.\n The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.\n The second is the bootstrap distribution of the slope statistic from 1000 different bootstrap samples of the data.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -29.90 7.79 -3.84 0.0012
perc_pov 2.56 0.39 6.56 <0.0001
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-38-1.png){width=90%}\n :::\n :::\n \n a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).\n \n b. Find a 90% bootstrap SE confidence interval for the slope parameter.\n \n c. Interpret the confidence interval in the context of the problem.\n \n \\clearpage\n\n1. **Baby's weight and father's age, randomization test.**\nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1000 births from 2014.\nHere, we study the relationship between the father's age and the weight of the baby.^[The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@data:births14]\n\n Below are two items.\n The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).\n The second is a histogram of slopes from 1000 randomized datasets (1000 times, `weight` was permuted and regressed against `fage`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 7.101 0.199 35.674 <0.0001
fage 0.005 0.006 0.757 0.4495
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting baby's weight from father's age is different than 0?\n\n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like father's age and weight of baby). What does the conclusion of your test say about whether the father's age is a useful predictor of baby's weight?\n\n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model? Explain.\n \n \\clearpage\n\n1. **Baby's weight and father's age, mathematical test.**\nIs the father's age useful in predicting the baby's weight?\nThe scatterplot and least squares summary below show the relationship between baby's weight (measured in pounds) and father's age for a random sample of babies. [@data:births14]\n\n \\vspace{-2mm}\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-40-1.png){width=70%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 7.1042 0.1936 36.698 <0.0001
fage 0.0047 0.0061 0.779 0.4359
\n \n `````\n :::\n :::\n \n a. What is the predicted weight of a baby whose father is 30 years old.\n \n b. Do the data provide convincing evidence that the model for predicting baby weights from father's age has a slope different than 0? State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n \n c. Based on your conclusion, is father's age a useful predictor of baby's weight?\n\n1. **Baby's weight and father's age, bootstrap percentile interval.**\nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1000 births from 2014.\nHere, we study the relationship between the father's age and the weight of the baby.\nBelow is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data. [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-41-1.png){width=90%}\n :::\n :::\n\n a. Using the bootstrap percentile method and the histogram above, find a 95% confidence interval for the slope parameter.\n \n b. Interpret the confidence interval in the context of the problem.\n \n \\clearpage\n\n1. **Baby's weight and father's age, standard error bootstrap interval.** \nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1000 births from 2014.\nHere, we study the relationship between the father's age and the weight of the baby. [@data:births14]\n\n Below are two items.\n The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).\n The second is the bootstrap distribution of the slope statistic from 1000 different bootstrap samples of the data.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 7.101 0.199 35.674 <0.0001
fage 0.005 0.006 0.757 0.4495
\n \n `````\n :::\n \n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-42-1.png){width=90%}\n :::\n :::\n\n a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).\n \n b. Find a 95% bootstrap SE confidence interval for the slope parameter.\n \n c. Interpret the confidence interval in the context of the problem.\n\n1. **I heart cats.**\nResearchers collected data on heart and body weights of 144 domestic adult cats. \nThe table below shows the output of a linear model predicting heat weight (measured in grams) from body weight (measured in kilograms) of these cats.^[The [`cats`](https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/cats.html) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.357 0.692 -0.515 0.6072
Bwt 4.034 0.250 16.119 <0.0001
\n \n `````\n :::\n :::\n\n a. What are the hypotheses for evaluating whether body weight is positively associated with heart weight in cats?\n\n b. State the conclusion of the hypothesis test from part (a) in context of the data.\n\n c. Calculate a 95% confidence interval for the slope of body weight, and interpret it in context of the data.\n\n d. Do your results from the hypothesis test and the confidence interval agree? Explain.\n \n \\clearpage\n\n1. **Beer and blood alcohol content**\nMany people believe that weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings. ^[The [`bac`](http://openintrostat.github.io/openintro/reference/bac.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Malkevitc+Lesser:2008] \n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-44-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.0127 0.0126 -1.00 0.332
beers 0.0180 0.0024 7.48 <0.0001
\n \n `````\n :::\n :::\n\n a. Describe the relationship between the number of cans of beer and BAC.\n\n b. Write the equation of the regression line. Interpret the slope and intercept in context.\n\n c. Do the data provide convincing evidence that drinking more cans of beer is associated with an increase in blood alcohol? State the null and alternative hypotheses, report the p-value, and state your conclusion.\n\n d. The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate $R^2$ and interpret it in context.\n\n e. Suppose we visit a bar, ask people how many drinks they have had, and take their BAC. Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?\n \n \\clearpage\n\n1. **Urban homeowners, conditions.**\nThe scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas. [@data:urbanOwner] There are 52 observations, each corresponding to a state in the US. Puerto Rico and District of Columbia are also included.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](24-inf-model-slr_files/figure-html/unnamed-chunk-45-1.png){width=50%}\n :::\n :::\n\n a. For these data, $R^2$ is 29.16%. What is the value of the correlation coefficient? How can you tell if it is positive or negative?\n\n b. Examine the residual plot. What do you observe? Is a simple least squares fit appropriate for these data? Which of the LINE conditions are met or not met?\n\n\n:::\n", + "supporting": [ + "24-inf-model-slr_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/24-inf-model-slr/figure-html/babyweight-1.png b/_freeze/24-inf-model-slr/figure-html/babyweight-1.png new file mode 100644 index 00000000..4630c584 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/babyweight-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/birth2BS-1.png b/_freeze/24-inf-model-slr/figure-html/birth2BS-1.png new file mode 100644 index 00000000..d8cae071 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/birth2BS-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/birthBS-1.png b/_freeze/24-inf-model-slr/figure-html/birthBS-1.png new file mode 100644 index 00000000..6603291d Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/birthBS-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/mageBSslopes-1.png b/_freeze/24-inf-model-slr/figure-html/mageBSslopes-1.png new file mode 100644 index 00000000..399fa891 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/mageBSslopes-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/magePlot-1.png b/_freeze/24-inf-model-slr/figure-html/magePlot-1.png new file mode 100644 index 00000000..97cdc488 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/magePlot-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/nulldistBirths-1.png b/_freeze/24-inf-model-slr/figure-html/nulldistBirths-1.png new file mode 100644 index 00000000..4e0a21ea Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/nulldistBirths-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/permweekslm-1.png b/_freeze/24-inf-model-slr/figure-html/permweekslm-1.png new file mode 100644 index 00000000..6a4d97b2 Binary files /dev/null and 
b/_freeze/24-inf-model-slr/figure-html/permweekslm-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/permweightScatter-1.png b/_freeze/24-inf-model-slr/figure-html/permweightScatter-1.png new file mode 100644 index 00000000..49d1fc13 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/permweightScatter-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/sand20lm-1.png b/_freeze/24-inf-model-slr/figure-html/sand20lm-1.png new file mode 100644 index 00000000..1b2ed14f Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/sand20lm-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/sandpop-1.png b/_freeze/24-inf-model-slr/figure-html/sandpop-1.png new file mode 100644 index 00000000..127a00bb Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/sandpop-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/slopes-1.png b/_freeze/24-inf-model-slr/figure-html/slopes-1.png new file mode 100644 index 00000000..578596af Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/slopes-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unemploymentAndChangeInHouse-1.png b/_freeze/24-inf-model-slr/figure-html/unemploymentAndChangeInHouse-1.png new file mode 100644 index 00000000..19e38965 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unemploymentAndChangeInHouse-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-31-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 00000000..e71dd8ce Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-31-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 00000000..7817aa93 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png new file mode 100644 index 00000000..3f713886 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 00000000..3f713886 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png new file mode 100644 index 00000000..5e1c008f Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png new file mode 100644 index 00000000..ee00cbf7 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png new file mode 100644 index 00000000..f4591087 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 00000000..5688ad8f Binary files /dev/null and 
b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png new file mode 100644 index 00000000..2d04bc42 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png new file mode 100644 index 00000000..ca2c1ee8 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-42-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-42-1.png new file mode 100644 index 00000000..ca2c1ee8 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-42-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png new file mode 100644 index 00000000..71215dee Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png new file mode 100644 index 00000000..82f2d028 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-5-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 00000000..ad73d828 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-5-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-6-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 00000000..f4ebc100 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-6-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-7-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 00000000..2a15bed4 Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-7-1.png differ diff --git a/_freeze/24-inf-model-slr/figure-html/whatCanGoWrongWithLinearModel-1.png b/_freeze/24-inf-model-slr/figure-html/whatCanGoWrongWithLinearModel-1.png new file mode 100644 index 00000000..cdc021ec Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/whatCanGoWrongWithLinearModel-1.png differ diff --git a/_freeze/25-inf-model-mlr/execute-results/html.json b/_freeze/25-inf-model-mlr/execute-results/html.json new file mode 100644 index 00000000..643fbc0d --- /dev/null +++ b/_freeze/25-inf-model-mlr/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "6af045999ba3bbc428a4149ffdc6121a", + "result": { + "markdown": "# Inference for linear regression with multiple predictors {#inf-model-mlr}\n\n\\chaptermark{Inference for regression with multiple predictors}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nIn Chapter \\@ref(model-mlr), the least squares regression method was used to estimate linear models which predicted a particular response variable given more than one explanatory variable.\nHere, we discuss whether each of the variables individually is a statistically significant predictor of the outcome or whether the model might be just as strong without that variable.\nThat is, as before, we apply inferential methods to ask whether a 
variable's observed coefficient could plausibly have come from a population where the corresponding true coefficient is zero.\nIf one of the linear model coefficients is truly zero (in the population), then the estimate of the coefficient (using least squares) will vary around zero.\nThe inference task at hand is to decide whether the coefficient's difference from zero is large enough to conclude that the data could not plausibly have come from a model where the true population coefficient is zero.\nBoth the derivations from the mathematical model and the randomization model are beyond the scope of this book, but we are able to calculate p-values using statistical software.\nWe will discuss interpreting p-values in the multiple regression setting and note some scenarios where careful understanding of the context and the relationship between variables is important.\nWe use cross-validation as a method for independent assessment of the multiple linear regression model.\n:::\n\n\n\n\n\n## Multiple regression output from software {#inf-mult-reg-soft}\n\nRecall the `loans` data from Chapter \\@ref(model-mlr).\n\n::: {.data data-latex=\"\"}\nThe [`loans_full_schema`](http://openintrostat.github.io/openintro/reference/loans_full_schema.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\nBased on the data in this dataset we have created two new variables: `credit_util`, which is calculated as the total credit utilized divided by the total credit limit, and `bankruptcy`, which turns the number of bankruptcies into an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcy).\nWe will refer to this modified dataset as `loans`.\n:::\n\nNow, our goal is to create a model where `interest_rate` can be predicted using the variables `debt_to_income`, `term`, and `credit_checks`.\nAs you learned in Chapter \\@ref(model-mlr), least squares can be used to find the coefficient estimates for the linear model.\nThe unknown population model can be written as:\n\n$$\n\\begin{aligned}\nE[\\texttt{interest_rate}] = \\beta_0 &+ \\beta_1\\times \\texttt{debt_to_income} \\\\\n&+ \\beta_2 \\times \\texttt{term}\\\\\n&+ \\beta_3 \\times \\texttt{credit_checks}\\\\\n\\end{aligned}\n$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n\n \n \n \n\n
Summary of a linear model for predicting interest rate based on `debt_to_income`, `term`, and `credit_checks`. Each of the variables has its own coefficient estimate as well as p-value significance.
term estimate std.error statistic p.value
(Intercept) 4.31 0.20 22.1 <0.0001
debt_to_income 0.04 0.00 13.3 <0.0001
term 0.16 0.00 37.9 <0.0001
credit_checks 0.25 0.02 12.8 <0.0001
\n\n`````\n:::\n:::\n\n\nThe estimated equation for the regression model may be written as a model with three predictor variables:\n\n$$\n\\begin{aligned}\n\\widehat{\\texttt{interest_rate}} = 4.31 &+ 0.041 \\times \\texttt{debt_to_income} \\\\\n&+ 0.16 \\times \\texttt{term} \\\\\n&+ 0.25 \\times \\texttt{credit_checks}\n\\end{aligned}\n$$\n\nNot only does Table \\@ref(tab:loansmodel) provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which is the focus of this chapter.\n\nIn @sec-inf-model-slr, you learned that the hypothesis test for a linear model with **one predictor**[^25-inf-model-mlr-1] can be written as:\n\n[^25-inf-model-mlr-1]: In previous sections, the term **explanatory variable** was used instead of **predictor**.\n    The words are synonymous and are used separately in the different sections to be consistent with how most analysts use them: explanatory variable for testing, predictor for modeling.\n\n\n\n\n\n> if only one predictor, $H_0: \\beta_1 = 0.$\n\nThat is, if the true population slope is zero, the p-value measures how likely it would be to select data which produced the observed slope ($b_1$) value.\n\nWith **multiple predictors**, the hypothesis is similar; however, it is now conditioned on each of the other variables remaining in the model.\n\n\n\n\n\n> if multiple predictors, $H_0: \\beta_i = 0$ given other variables in the model\n\nUsing the example above and focusing on each of the variable p-values (here we won't discuss the p-value associated with the intercept), we can write out the three different hypotheses:\n\n- $H_0: \\beta_1 = 0$, given `term` and `credit_checks` are included in the model\n- $H_0: \\beta_2 = 0$, given `debt_to_income` and `credit_checks` are included in the model\n- $H_0: \\beta_3 = 0$, given `debt_to_income` and `term` are included in the model\n\nThe very low p-values from the software output tell us that each of the variables acts as an important predictor in the model, despite the inclusion of the other two.\nConsider the p-value on $H_0: \\beta_1 = 0$.\nThe low p-value says that it would be extremely unlikely to see data that produce a coefficient on `debt_to_income` as large as 0.041 if the true relationship between `debt_to_income` and `interest_rate` were non-existent (i.e., if $\\beta_1 = 0$) and the model also included `term` and `credit_checks`.\nYou might have thought that the value 0.041 is a small number (i.e., close to zero), but in the units of the problem, 0.041 turns out to be far away from zero; it's all about context!\nThe p-values on `term` and on `credit_checks` are interpreted similarly.\n\nSometimes a set of predictor variables can impact the model in unusual ways, often due to the predictor variables themselves being correlated.\n\n## Multicollinearity {#inf-mult-reg-collin}\n\nIn practice, there will almost always be some degree of correlation between the explanatory variables in a multiple regression model.\nFor regression models, it is important to understand the entire context of the model, particularly for correlated variables.\nOur discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the significance (i.e., the p-value) of each coefficient.\n\nConsider an example where we would like to predict how much money is in a coin dish based only on the number of coins in the dish.\nWe ask 26 students to tell us about their individual coin dishes, collecting data on the total dollar 
amount, the total number of coins, and the total number of low coins.[^25-inf-model-mlr-2]\nThe number of low coins is the number of coins minus the number of quarters (a quarter is the largest commonly used US coin, at US\\$0.25).\nFigure \\@ref(fig:money) illustrates a sample of U.S. coins, their total worth (`total_amount`), the total `number of coins`, and the `number of low coins`.\n\n[^25-inf-model-mlr-2]: In all honesty, this particular dataset is fabricated, and the original idea for the problem comes from Jeff Witmer at Oberlin College.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:money-cap)](images/money.png){fig-alt='Graphic of five pennies, three nickels, two dimes, and six quarters. A summary table indicates that there are 16 total coins, 10 of which are considered low coins. The total amount of money is one dollar and ninety cents.' width=90%}\n:::\n:::\n\n\n(ref:money-cap) A sample of coins with 16 total coins, 10 low coins, and a net worth of \\$1.90.\n\nThe collected data are given in Figure \\@ref(fig:coinfig) and show that the `total_amount` of money is more highly correlated with the total `number of coins` than it is with the `number of low coins`.\nWe also note that the total `number of coins` and the `number of low coins` are positively correlated.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Plot describing the total amount of money (USD) as a function of the number of coins and the number of low coins. As you might expect, the total amount of money is more highly positively correlated with the total number of coins than with the number of low coins.](25-inf-model-mlr_files/figure-html/coinfig-1.png){width=90%}\n:::\n:::\n\n\nUsing the total `number of coins` as the predictor variable, Table \\@ref(tab:coinhigh) provides the least squares estimate of the coefficient as 0.13.\nFor every additional coin in the dish, we would predict that the student had US\\$0.13 more.\nThe $b_1 = 0.13$ coefficient has a small p-value associated with it, suggesting we would not have seen data like this if `number of coins` and `total_amount` of money were not linearly related.\n\n$$\\widehat{\\texttt{total_amount}} = 0.55 + 0.13 \\times \\texttt{number_of_coins}$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n\n \n \n \n\n
Linear model output predicting the total amount of money based on the total number of coins.
term estimate std.error statistic p.value
(Intercept) 0.55 0.44 1.23 0.2301
number_of_coins 0.13 0.02 5.54 <0.0001
\n\n`````\n:::\n:::\n\n\nUsing the `number of low coins` as the predictor variable, Table \\@ref(tab:coinlow) provides the least squares estimate of the coefficient as 0.02.\nFor every additional low coin in the dish, we would predict that the student had US\\$0.02 more.\nThe $b_1 = 0.02$ coefficient has a large p-value associated with it, suggesting we could easily have seen data like ours even if the `number of low coins` and `total_amount` of money are not at all linearly related.\n\n$$\\widehat{\\texttt{total_amount}} = 2.28 + 0.02 \\times \\texttt{number_of_low_coins}$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n\n \n \n \n\n
Linear model output predicting the total amount of money based on the number of low coins.
term estimate std.error statistic p.value
(Intercept) 2.28 0.58 3.9 0.0
number_of_low_coins 0.02 0.05 0.4 0.7
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nCome up with an example of two observations that have the same number of low coins but where the number of total coins differs by one.\nWhat is the difference in total amount?\n\n------------------------------------------------------------------------\n\nTwo samples of coins with the same number of low coins (3), but a different number of total coins (4 vs 5) and a different total amount (\\$0.41 vs \\$0.66).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/lowsame.png){fig-alt='A set of four coins is shown: one of each penny, nickel, dime, and quarter. One quarter is then added to create a set of five coins: one penny, one nickel, one dime, and two quarters. The two sets of coins have the same number of low coins but a different number of total coins.' width=90%}\n:::\n:::\n\n:::\n\n::: {.workedexample data-latex=\"\"}\nCome up with an example of two observations that have the same total number of coins but a different number of low coins.\nWhat is the difference in total amount?\n\n------------------------------------------------------------------------\n\nTwo samples of coins with the same total number of coins (4), but a different number of low coins (3 vs 4) and a different total amount (\\$0.41 vs \\$0.17).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/totalsame.png){fig-alt='A set of four coins is shown: one of each penny, nickel, dime, and quarter. The quarter is then replaced with a penny to create a new set of four coins: two pennies, one nickel, and one dime. The two sets of coins have the same number of total coins but a different number of low coins.' width=90%}\n:::\n:::\n\n:::\n\nUsing both the total `number of coins` and the `number of low coins` as predictor variables, Table \\@ref(tab:coinhighlow) provides the least squares estimates of both coefficients as 0.21 and -0.16.\nNow, with two variables in the model, the interpretation is more nuanced.\n\n- The coefficient indicates a change in one variable while keeping the other variable constant.\\\n    For every additional coin in the dish **while** the `number of low coins` stays constant, we would predict that the student had US\\$0.21 more. Re-considering the phrase \"every additional coin in the dish **while** the number of low coins stays constant\" makes us realize that each increase is a single additional quarter (larger sample sizes would have led to a $b_1$ coefficient closer to 0.25 because of the deterministic relationship described here).\\\n- For every additional low coin in the dish **while** the total `number of coins` stays constant, we would predict that the student had US\\$0.16 less. 
Re-considering the phrase \"every additional low coin in the dish **while** the number of total coins stays constant\" makes us realize that a quarter is being swapped out for a penny, nickel, or dime.\n\n\\clearpage\n\nConsidering the coefficients across Tables \\@ref(tab:coinhigh), \\@ref(tab:coinlow), and \\@ref(tab:coinhighlow) within the context and knowledge we have of US coins allows us to understand the correlation between variables and why the signs of the coefficients would change depending on the model.\nNote also, however, that the p-value for the `number of low coins` coefficient changed from Table \\@ref(tab:coinlow) to Table \\@ref(tab:coinhighlow).\nIt makes sense that the variable describing the `number of low coins` provides more information about the `total_amount` of money when it is part of a model which also includes the total `number of coins` than it does when it is used as a single variable in a simple linear regression model.\n\n$$\\widehat{\\texttt{total_amount}} = 0.80 + 0.21 \\times \\texttt{number_of_coins} - 0.16 \\times \\texttt{number of low coins}$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Linear model output predicting the total amount of money based on both the total number of coins and the number of low coins.
term estimate std.error statistic p.value
(Intercept) 0.80 0.30 2.65 0.0142
number_of_coins 0.21 0.02 9.89 <0.0001
number_of_low_coins -0.16 0.03 -5.51 <0.0001
\n\n`````\n:::\n:::\n\n\nWhen working with multiple regression models, interpreting the model coefficients is not always as straightforward as it was with the coin example.\nHowever, we encourage you to always think carefully about the variables in the model, consider how they might be correlated among themselves, and work through different models to see how using different sets of variables might produce different relationships for predicting the response variable of interest.\n\n::: {.important data-latex=\"\"}\n**Multicollinearity.**\n\nMulticollinearity happens when the predictor variables are correlated with one another.\nWhen the predictor variables themselves are correlated, the coefficients in a multiple regression model can be difficult to interpret.\n:::\n\n\n\n\n\nAlthough diving into the details is beyond the scope of this text, we will provide one more reflection about multicollinearity.\nIf the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of a coefficient or to evaluate whether the variable is a statistically significant predictor of the outcome.\nHowever, even a model that suffers from high multicollinearity will likely lead to unbiased predictions of the response variable.\nSo if the task at hand is only prediction, multicollinearity is not likely to cause you substantial problems.\n\n\n```{=html}\n\n```\n\n## Cross-validation for prediction error {#inf-mult-reg-cv}\n\nIn Section \\@ref(inf-mult-reg-soft), p-values were calculated on each of the model coefficients.\nThe p-value gives a sense of which variables are important to the model; however, a more extensive treatment of variable selection is warranted in a follow-up course or textbook.\nHere, we use cross-validation prediction error to focus on which variable(s) are important for predicting the response variable of interest.\nIn general, linear models are also used to make predictions of individual observations.\nIn addition to model building, cross-validation provides a method for generating predictions that are not overfit to the particular dataset at hand.\nWe continue to encourage you to take up further study on the topic of cross-validation, as it is among the most important ideas in modern data analysis, and we are only able to scratch the surface here.\n\nCross-validation is a computational technique which removes some observations before a model is run, then assesses the model accuracy on the held-out sample.\nBy removing some observations, we provide ourselves with an independent evaluation of the model (that is, the removed observations do not contribute to finding the parameters which minimize the least squares equation).\nCross-validation can be used in many different ways (as an independent assessment), and here we will just scratch the surface with respect to one way the technique can be used to compare models.\nSee Figure \\@ref(fig:cv) for a visual representation of the cross-validation process.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The dataset is broken into k folds (here k = 4). One at a time, a model is built using k-1 of the folds, and predictions are calculated on the single held out sample which will be completely independent of the model estimation.](images/CV.png){fig-alt='A square represents the original observed data which has been partitioned into four triangular segments. From the original partition, four different settings are considered: 
in each setting, three of the four segments (for example, red, green, and yellow) are used to build the model, and the remaining segment (for example, blue) is held out and used for independent model prediction.' width=90%}\n:::\n:::\n\n\n\n\n::: {.data data-latex=\"\"}\nThe [`penguins`](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) data can be found in the [**palmerpenguins**](https://github.com/allisonhorst/palmerpenguins) R package.\n:::\n\nOur goal in this section is to compare two different regression models which both seek to predict the mass of an individual penguin in grams.\nThe observations of three different penguin species include measurements on body size and sex.\nThe data were collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/) as part of the [Long Term Ecological Research Network](https://lternet.edu/) [@Gorman:2014].\nAlthough not exactly aligned with this research project, you might be able to imagine a setting where the dimensions of the penguin are known (through, for example, aerial photographs) but the mass is not known.\nThe first model will predict `body_mass_g` by using only `bill_length_mm`, a variable denoting the length of a penguin's bill in mm.\nThe second model will predict `body_mass_g` by using `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species`.\n\n::: {.important data-latex=\"\"}\n**Prediction error.**\n\nThe prediction error (also previously called the **residual**) is the difference between the observed value and the predicted value (from the regression model).\n\n$$\\text{prediction error}_i = e_i = y_i - \\hat{y}_i$$\n:::\n\nThe presentation below (see the comparison of Figures \\@ref(fig:peng-mass1) and \\@ref(fig:peng-mass2)) shows that the model with more variables predicts `body_mass_g` with much smaller errors (observed minus predicted body mass) than the model which uses only `bill_length_mm`.\nWe have deliberately used a model that intuitively makes sense (the more body measurements, the more predictable mass is).\nHowever, in many settings, it is not obvious which variables or which models contribute most to accurate predictions.\nCross-validation is one way to get accurate independent predictions with which to compare different models.\n\n### Comparing two models to predict body mass in penguins\n\nThe question we will seek to answer is whether the predictions of `body_mass_g` are substantially better when `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` are used in the model, as compared with a model on `bill_length_mm` only.\n\nWe refer to the model given with only `bill_length_mm` as the **smaller** model.\nIt is seen in Table \\@ref(tab:peng-lm-bill) with 
coefficient estimates of the parameters as well as standard errors and p-values.\nWe refer to the model given with `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` as the **larger** model.\nIt is seen in Table \\@ref(tab:peng-lm-all) with coefficient estimates of the parameters as well as standard errors and p-values.\nGiven what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables have low p-values, suggesting that each variable is a statistically significant predictor of `body_mass_g`, given all other variables in the model.\nHowever, in this section, we will go beyond the use of p-values to consider independent predictions of `body_mass_g` as a way to compare the smaller and larger models.\n\n**The smaller model:**\n\n$$\n\\begin{aligned}\nE[\\texttt{body_mass_g}] &= \\ \\beta_0 + \\beta_1 \\times \\texttt{bill_length_mm}\\\\\n\\widehat{\\texttt{body_mass_g}} &= \\ 362.31 + 87.42 \\times \\texttt{bill_length_mm}\\\\\n\\end{aligned}\n$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n\n \n \n \n\n
The smaller model: least squares estimates of the regression model predicting `body_mass_g` from `bill_length_mm`.
term estimate std.error statistic p.value
(Intercept) 362.3 283.4 1.28 0.2019
bill_length_mm 87.4 6.4 13.65 <0.0001
\n\n`````\n:::\n:::\n\n\n**The larger model:**\n\n$$\n\\begin{aligned}\nE[\\texttt{body_mass_g}] = \\beta_0 &+ \\beta_1 \\times \\texttt{bill_length_mm} \\\\\n&+ \\beta_2 \\times \\texttt{bill_depth_mm} \\\\\n&+ \\beta_3 \\times \\texttt{flipper_length_mm} \\\\\n&+ \\beta_4 \\times \\texttt{sex}_{male} \\\\\n&+ \\beta_5 \\times \\texttt{species}_{Chinstrap} \\\\\n&+ \\beta_6 \\times \\texttt{species}_{Gentoo}\\\\\n\\widehat{\\texttt{body_mass_g}} = -1460.99 &+ 18.20 \\times \\texttt{bill_length_mm} \\\\\n&+ 67.22 \\times \\texttt{bill_depth_mm} \\\\\n&+ 15.95 \\times \\texttt{flipper_length_mm} \\\\\n&+ 389.89 \\times \\texttt{sex}_{male} \\\\\n&- 251.48 \\times \\texttt{species}_{Chinstrap} \\\\\n&+ 1014.63 \\times \\texttt{species}_{Gentoo}\\\\\n\\end{aligned}\n$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
The larger model: least squares estimates of the regression model predicting `body_mass_g` from `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species`.
term estimate std.error statistic p.value
(Intercept) -1461.0 571.31 -2.56 0.011
bill_length_mm 18.2 7.11 2.56 0.0109
bill_depth_mm 67.2 19.74 3.40 7e-04
flipper_length_mm 15.9 2.91 5.48 <0.0001
sexmale 389.9 47.85 8.15 <0.0001
speciesChinstrap -251.5 81.08 -3.10 0.0021
speciesGentoo 1014.6 129.56 7.83 <0.0001
\n\n`````\n:::\n:::\n\n\nIn order to compare the smaller and larger models in terms of their **ability to predict penguin mass**, we need to build models that can provide independent predictions based on the penguins in the holdout samples created by cross-validation.\nTo reiterate, each of the predictions that (when combined) will allow us to distinguish between the smaller and larger models is independent of the data which were used to build the model.\nIn this example, using cross-validation, we remove one quarter of the data before running the least squares calculations.\nThen the least squares model is used to predict the `body_mass_g` of the penguins in the holdout sample.\nHere we use 4-fold cross-validation (meaning that one quarter of the data is removed each time) to produce four different versions of each model (other times it might be more appropriate to use 2-fold or 10-fold cross-validation, or even to run the model separately after removing each individual data point one at a time).\n\nFigure \\@ref(fig:massCV1) displays how a model is fit to 3/4 of the data (note the slight differences in coefficients as compared to Table \\@ref(tab:peng-lm-bill)), and then predictions are made on the holdout sample.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The coefficients are estimated using the least squares model on 3/4 of the dataset with only a single predictor variable. Predictions are made on the remaining 1/4 of the observations. The y-axis in the scatterplot represents the residual: true observed value minus the predicted value. Note that the predictions are independent of the estimated model coefficients.](images/massCV1.png){fig-alt='The left panel shows the linear model predicting body mass in grams using bill length in mm; the model was built using the red, green, and yellow triangular sections of the observed data. The right panel shows a scatterplot of the prediction error versus the fitted values for the set of observations in the blue triangular section of the observed data. The prediction errors range from roughly -1000 grams to +1000 grams.' width=100%}\n:::\n:::\n\n\nBy repeating the process for each holdout quarter sample, the residuals from the model can be plotted against the predicted values.\nWe see that the prediction errors are scattered around zero with no obvious pattern, which indicates a reasonable model fit, but that they can be as large as roughly $\\pm$ 1000g of the true body mass.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![One quarter at a time, the data were removed from the model building, and the body mass of the removed penguins was predicted. The least squares regression model was fit independently of the removed penguins. The predictions of body mass are based on bill length only. 
The x-axis represents the predicted value, and the y-axis represents the error (the observed value minus the predicted value).](25-inf-model-mlr_files/figure-html/peng-mass1-1.png){width=100%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nThe cross-validation SSE is the sum of squared errors associated with the predictions.\nLet $\\hat{y}_{cv,i}$ be the prediction for the $i^{th}$ observation where the $i^{th}$ observation was in the hold-out fold and the other three folds were used to create the linear model.\nFor the model using only `bill_length_mm` to predict `body_mass_g`, the CV SSE is 141,552,822.\n\n::: {.important data-latex=\"\"}\n**Cross-validation SSE.**\n\nThe prediction error from the cross-validated model can be used to calculate a single numerical summary of the model.\nThe cross-validation SSE is the sum of squared cross-validation prediction errors.\n\n$$\\mbox{CV SSE} = \\sum_{i=1}^n (\\hat{y}_{cv,i} - y_i)^2$$\n:::\n\nThe same process is repeated for the model with the larger number of explanatory variables.\nNote that the coefficients estimated for the first cross-validation model (in Figure \\@ref(fig:massCV2)) are slightly different from the estimates computed on the entire dataset (seen in Table \\@ref(tab:peng-lm-all)).\nFigure \\@ref(fig:massCV2) displays the cross-validation process for the multivariable model with a full set of residual plots given in Figure \\@ref(fig:peng-mass2).\nNote that the residuals are mostly within $\\pm$ 500g, providing much more precise predictions for the independent body mass values of the individual penguins.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The coefficients are estimated using the least squares model on 3/4 of the dataset with the five specified predictor variables. Predictions are made on the remaining 1/4 of the observations. The y-axis in the scatterplot represents the residual: true observed value minus the predicted value. Note that the predictions are independent of the estimated model coefficients.](images/massCV2.png){fig-alt='The left panel shows the linear model predicting body mass in grams using bill length, bill depth, flipper length, sex, and species; the model was built using the red, green, and yellow triangular sections of the observed data. The right panel shows a scatterplot of the prediction error versus the fitted values for the set of observations in the blue triangular section of the observed data. The prediction errors range from roughly -500 grams to +500 grams.' width=100%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![One quarter at a time, the data were removed from the model building, and the body mass of the removed penguins was predicted. The least squares regression model was fit independently of the removed penguins. The predictions of body mass are based on bill length, bill depth, flipper length, sex, and species. 
The x-axis represents the predicted value, and the y-axis represents the error (the observed value minus the predicted value).](25-inf-model-mlr_files/figure-html/peng-mass2-1.png){width=90%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nFigure \\@ref(fig:peng-mass1) shows that the independent predictions are centered around the true values (i.e., errors are centered around zero), but that the predictions can be as much as 1000g off when using only `bill_length_mm` to predict `body_mass_g`.\nOn the other hand, when using `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` to predict `body_mass_g`, the prediction errors seem to be about half as big, as seen in Figure \\@ref(fig:peng-mass2).\nFor the model using `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`, and `species` to predict `body_mass_g`, the CV SSE is 27,728,698.\nConsistent with visually comparing the two sets of residual plots, the sum of squared prediction errors is smaller for the model which uses more predictor variables.\nThe model with more predictor variables seems like the better model (according to the cross-validated prediction error criterion).\n\nWe have provided a very brief overview of, and an example using, cross-validation.\nCross-validation is a computational approach to model building and model validation which serves as an alternative to reliance on p-values.\nWhile p-values have a role to play in understanding model coefficients, throughout this text, we have continued to present computational methods that broaden statistical approaches to data analysis.\nCross-validation will be used again in Section \\@ref(inf-model-logistic) with logistic regression.\nWe encourage you to consider both standard inferential methods (such as p-values) and computational approaches (such as cross-validation) as you build and use multivariable models of all varieties.\n\n\\clearpage\n\n## Chapter review {#chp25-review}\n\n### Summary\n\nBuilding on the modeling ideas from Chapter \\@ref(model-mlr), we have now introduced methods for evaluating coefficients (based on p-values) and evaluating models (cross-validation).\nThere are many important aspects to consider when working with multiple variables in a single model, and we have only glanced at a few topics.\nRemember, multicollinearity can make coefficient interpretation difficult.\nA topic not covered in this text but important for multiple regression models is interaction, and we hope that you learn more about how variables work together as you continue to build up your modeling skills.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n\n
cross-validation multicollinearity prediction error
inference on multiple linear regression multiple predictors predictor
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp25-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-25].\n\n::: {.exercises data-latex=\"\"}\n1. **GPA, mathematical interval.**\nA survey of 55 Duke University students asked about their GPA (`gpa`), number of hours they study weekly (`studyweek`), number of hours they sleep nightly (`sleepnight`), and whether they go out more than two nights a week (`out_mt2`).\nWe use these data to build a model predicting GPA from the other variables. Summary of the model is shown below. Note that `out_mt2` is `1` if the student goes out more than two nights a week, and 0 otherwise.^[The [`gpa`](http://openintrostat.github.io/openintro/reference/gpa.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 3.508 0.347 10.114 <0.0001
studyweek 0.002 0.004 0.400 0.6908
sleepnight 0.000 0.047 0.008 0.994
out_mt2 0.151 0.097 1.551 0.127
\n \n `````\n :::\n :::\n\n a. Calculate a 95% confidence interval for the coefficient of `out_mt2` (go out more than two nights a week) in the model, and interpret it in the context of the data.\n\n b. Would you expect a 95% confidence interval for the slope of the remaining variables to include 0? Explain.\n\n1. **GPA, collinear predictors.**\nIn this exercise we work with data from a survey of 55 Duke University students who were asked about their GPA, number of hours they sleep nightly, and number of nights they go `out` each week.\n\n The plots below show the distribution of each of these variables (on the diagonal) as well as provide information on the pairwise correlations between them.\n \n Also provided below are three regression model outputs: `gpa` vs. `out`, `gpa` vs. `sleepnight`, and `gpa` vs. `out + sleepnight`.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 3.504 0.106 33.011 <0.0001
out 0.045 0.046 0.998 0.3229
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 3.46 0.318 10.874 <0.0001
sleepnight 0.02 0.045 0.445 0.6583
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 3.483 0.320 10.888 <0.0001
out 0.044 0.050 0.886 0.3796
sleepnight 0.003 0.048 0.072 0.9432
\n \n `````\n :::\n :::\n :::\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for the plots and parts a to c.*\n :::\n \n \\clearpage\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-27-1.png){width=100%}\n :::\n :::\n \n a. There are three variables described in the figure, and each is paired with each other to create three different scatterplots. Rate the pairwise relationships from most correlated to least correlated.\n \n b. When using only one variable to model `gpa`, is `out` a significant predictor variable? Is `sleepnight` a significant predictor variable? Explain.\n \n c. When using both `out` and `sleepnight` to predict `gpa` in a multiple regression model, are either of the variables significant? Explain.\n \n \\clearpage\n\n1. **Tourism spending.** \nThe Association of Turkish Travel Agencies reports the number of foreign tourists visiting Turkey and tourist spending by year. \nThree plots are provided: a scatterplot showing the relationship between these two variables along with the least squares fit, a residuals plot, and a histogram of residuals.^[The [`tourism`](http://openintrostat.github.io/openintro/reference/tourism.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-28-1.png){width=100%}\n :::\n :::\n\n a. Describe the relationship between number of tourists and spending.\n\n b. What are the predictor and the outcome variables?\n\n c. Why might we want to fit a regression line to these data?\n\n d. Do the data meet the LINE conditions required for fitting a least squares line? In addition to the scatterplot, use the residual plot and histogram to answer this question.\n \n \\clearpage\n\n1. **Cherry trees with collinear predictors.**\nTimber yield is approximately equal to the volume of a tree; however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree's volume and yield.\nResearchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.^[The [`cherry`](http://openintrostat.github.io/openintro/reference/cherry.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Hand:1994]\n\n The plots below show the distribution of each of these variables (on the diagonal) as well as provide information on the pairwise correlations between them.\n\n Also provided below are three regression model outputs: `volume` vs. `diam`, `volume` vs. `height`, and `volume` vs. `height + diam`.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -36.94 3.365 -11.0 <0.0001
diam 5.07 0.247 20.5 <0.0001
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -87.12 29.273 -2.98 0.006
height 1.54 0.384 4.02 0.000
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -57.988 8.638 -6.71 <0.0001
height 0.339 0.130 2.61 0.0145
diam 4.708 0.264 17.82 <0.0001
\n \n `````\n :::\n :::\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for the plots and parts a to c.*\n :::\n \n \\clearpage\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-30-1.png){width=100%}\n :::\n :::\n\n a. There are three variables described in the figure, and each is paired with each other to create three different scatterplots. Rate the pairwise relationships from most correlated to least correlated.\n \n b. When using only one variable to model a tree's `volume`, is `diam`eter a significant predictor variable? Is `height` a significant predictor variable? Explain.\n \n c. When using both `diam`eter and `height` to predict a tree's `volume`, are both predictor variables still significant? Explain.\n \n \\clearpage\n\n1. **Movie returns.** \nA FiveThirtyEight.com article reports that \"Horror movies get nowhere near as much draw at the box office as the big-time summer blockbusters or action/adventure movies, but there's a huge incentive for studios to continue pushing them out. The return-on-investment potential for horror movies is absurd.\" \nTo investigate how the return-on-investment (ROI) compares between genres and how this relationship has changed over time, an introductory statistics student fit a linear regression model to predict the ratio of gross revenue of movies to the production costs from genre and release year for 1,070 movies released between 2000 and 2018.\nUsing the plots given below, determine if this regression model is appropriate for these data. In particular, use the residual plot to check the LINE conditons. [@webpage:horrormovies]\n\n \\vspace{5mm}\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-31-1.png){width=100%}\n :::\n :::\n \n \\clearpage\n\n1. **Difficult encounters.**\nA study was conducted at a university outpatient primary care clinic in Switzerland to identify factors associated with difficult doctor-patient encounters. The data consist of 527 patient encounters, conducted by the 27 medical residents employed at the clinic. After each encounter, the attending physician completed two questionnaires: the Difficult Doctor Patient Relationship Questionnaire (DDPRQ-10) and the patient's vulnerability grid (PVG).\n\n A higher score on the DDPRQ-10 indicates a more difficult encounter. The maximum possible score is 60 and encounters with score 30 and higher are considered difficult.\n \n A model was fit for the association of DDPRQ-10 score with features of the attending physician: age, sex (identify as male or not), and years of training.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 30.594 2.886 10.601 <0.0001
age -0.016 0.104 -0.157 0.876
sexMale -0.535 0.781 -0.686 0.494
yrs_train 0.096 0.215 0.445 0.656
\n \n `````\n :::\n :::\n \n a. The intercept of the model is 30.594. What is the age, sex, and years of training of a physician whom this model would predict to have a DDPRQ-10 score of 30.594.\n \n b. Is there evidence of a significant association between DDPRQ-10 score and any of the physician features?\n \n \\clearpage\n\n1. **Baby's weight, mathematical test.**\nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1,000 births from 2014.\nHere, we study the relationship between smoking and weight of the baby. \nThe variable `smoke` is coded 1 if the mother is a smoker, and 0 if not. \nThe summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in pounds, based on the smoking status of the mother.^[The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -3.82 0.57 -6.73 <0.0001
weeks 0.26 0.01 18.93 <0.0001
mage 0.02 0.01 2.53 0.0115
sexmale 0.37 0.07 5.30 <0.0001
visits 0.02 0.01 2.09 0.0373
habitsmoker -0.43 0.13 -3.41 7e-04
\n \n `````\n :::\n :::\n \n Also shown below is a series of diagnostic plots.\n \n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-34-1.png){width=90%}\n :::\n :::\n \n a. Determine if the conditions for doing inference based on mathematical models with these data are met using the diagnostic plots above. If not, describe how to proceed with the analysis.\n \n b. Using the regression output, evaluate whether the true slope of `habit` (i.e., whether the mother is a smoker) is different from 0, given the other variables in the model. State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n \n \\clearpage\n\n1. **Baby's weight with collinear predictors.** \nIn this exercise we study the relationship between the weight of the baby and two explanatory variables: number of `weeks` of gestation and number of pregnancy hospital `visits`. [@data:births14]\n\n The plots below show the distribution of each of these variables (on the diagonal) as well as provide information on the pairwise correlations between them.\n \n Also provided below are three regression model outputs: `weight` vs. `weeks`, `weight` vs. `visits`, and `weight` vs. `weeks + visits`.\n \n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.83 1.76 -0.47 0.6395
weeks 0.21 0.05 4.70 <0.0001
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 6.56 0.36 18.36 <0.0001
visits 0.08 0.03 2.61 0.0105
\n \n `````\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.44 1.78 -0.25 0.81
weeks 0.19 0.05 4.00 0.00
visits 0.04 0.03 1.26 0.21
\n \n `````\n :::\n :::\n \n ::: {.content-hidden unless-format=\"pdf\"}\n *See next page for the plots and parts a to c.*\n :::\n \n \\clearpage\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-36-1.png){width=100%}\n :::\n :::\n \n a. There are three variables described in the figure, and each is paired with each other to create three different scatterplots. Rate the pairwise relationships from most correlated to least correlated.\n \n b. When using only one variable to model the baby's `weight`, is `weeks` a significant predictor variable? Is `visits` a significant predictor variable? Explain.\n \n c. When using both `visits` and `weeks` to predict the baby's `weight`, are both predictor variables still significant? Explain.\n \n \\clearpage\n\n1. **Baby's weight, cross-validation.** \nUsing a random sample of 1,000 US births from 2014, we study the relationship between the weight of the baby and various explanatory variables. [@data:births14]\n\n The plots below describe prediction errors associated with two different models designed to predict `weight` of baby at birth; one model uses 7 predictors, one model uses 2 predictors. Using 4-fold cross-validation, the data were split into 4 folds. Three of the folds estimate the $\\beta_i$ parameters using $b_i$, and the model is applied to the held out fold for prediction. The process was repeated 4 times (each time holding out one of the folds).\n\n $$\n \\begin{aligned}\n E[\\texttt{weight}] = \\beta_0 &+ \\beta_1\\times \\texttt{fage} + \\beta_2\\times \\texttt{mage}\\\\\n &+ \\beta_3 \\times \\texttt{mature} + \\beta_4 \\times \\texttt{weeks}\\\\\\\n &+ \\beta_5 \\times \\texttt{visits}+ \\beta_6 \\times \\texttt{gained}\\\\\n &+ \\beta_7 \\times \\texttt{habit}\\\\\n \\end{aligned}\n $$\n\n $$\n \\begin{aligned}\n E[\\texttt{weight}] = \\beta_0 &+ \\beta_1\\times \\texttt{weeks} + \\beta_2\\times \\texttt{mature}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-37-1.png){width=90%}\n :::\n :::\n \n a. In the first graph, note the point at roughly (predicted = 11 and error = -4). Estimate the observed and predicted value for that observation.\n \n b. Using the same point, describe which cross-validation fold(s) were used to build its prediction model.\n \n c. For the plot on the left, for one of the cross-validation folds, how many coefficients were estimated in the linear model? For the plot on the right, for one of the cross-validation folds, how many coefficients were estimated in the linear model? \n \n d. Do the values of the residuals (along the y-axis, not the x-axis) seem markedly different for the two models? Explain.\n \n \\clearpage\n\n1. **Baby's weight, cross-validation to choose model.** \nUsing a random sample of 1,000 US births from 2014, we study the relationship between the weight of the baby and various explanatory variables. [@data:births14]\n\n The plots below describe prediction errors associated with two different models designed to predict `weight` of baby at birth; one model uses 7 predictors, one model uses 2 predictors. Using 4-fold cross-validation, the data were split into 4 folds. Three of the folds estimate the $\\beta_i$ parameters using $b_i$, and the model is applied to the held out fold for prediction. 
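The process was repeated 4 times (each time holding out one of the folds).

    As a rough illustration of the fold-wise procedure described here (not part of the original exercise), the R sketch below computes 4-fold cross-validation prediction errors and the CV SSE for the two-predictor model; it assumes the `births14` data referenced in these exercises (from the **openintro** R package), and because the fold assignment is random its numbers will not reproduce the plots or CV SSE values discussed in the text.

    ```r
    # Minimal "by hand" 4-fold cross-validation sketch (illustrative only).
    # Assumes births14 from the openintro package; uses the two-predictor
    # model E[weight] = b0 + b1*weeks + b2*mature described in this exercise.
    library(openintro)

    set.seed(47)
    keep   <- !is.na(births14$weight) & !is.na(births14$weeks) & !is.na(births14$mature)
    births <- births14[keep, ]
    fold   <- sample(rep(1:4, length.out = nrow(births)))   # random fold labels

    cv_error <- numeric(nrow(births))
    for (k in 1:4) {
      train <- births[fold != k, ]                  # three folds fit the model
      test  <- births[fold == k, ]                  # one fold is held out
      fit   <- lm(weight ~ weeks + mature, data = train)
      cv_error[fold == k] <- test$weight - predict(fit, newdata = test)
    }

    sum(cv_error^2)                                 # CV SSE for this model
    ```
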
\n\n $$\n \\begin{aligned}\n E[\\texttt{weight}] = \\beta_0 &+ \\beta_1\\times \\texttt{fage} + \\beta_2\\times \\texttt{mage}\\\\\n &+ \\beta_3 \\times \\texttt{mature} + \\beta_4 \\times \\texttt{weeks}\\\\\n &+ \\beta_5 \\times \\texttt{visits}+ \\beta_6 \\times \\texttt{gained}\\\\\n &+ \\beta_7 \\times \\texttt{habit}\\\\\n \\end{aligned}\n $$\n\n $$\n \\begin{aligned}\n E[\\texttt{weight}] = \\beta_0 &+ \\beta_1\\times \\texttt{weeks} + \\beta_2\\times \\texttt{mature}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-38-1.png){width=90%}\n :::\n :::\n \n a. Using the spread of the points, which model should be chosen for a final report on these data? Explain.\n \n b. Using the summary statistic (CV SSE), which model should be chosen for a final report on these data? Explain.\n \n c. Why would the model with more predictors fit the data less closely than the model with only two predictors?\n \n \\clearpage\n\n1. **RailTrail, cross-validation.**\nThe Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.^[The [`RailTrail`](https://rdrr.io/cran/mosaicData/man/RailTrail.html) data used in this exercise can be found in the [**mosaicData**](https://cran.r-project.org/web/packages/mosaicData/index.html) R package.]\n\n The plots below describe prediction errors associated with two different models designed to predict the `volume` of riders on the RailTrail; one model uses 6 predictors, one model uses 2 predictors. Using 3-fold cross-validation, the data were split into 3 folds. Two of the folds estimate the $\\beta_i$ parameters using $b_i$, and the model is applied to the held out fold for prediction. The process was repeated 3 times (each time holding out one of the folds).\n\n $$\n \\begin{aligned}\n E[\\texttt{volume}] = \\beta_0 &+ \\beta_1\\times \\texttt{hightemp} + \\beta_2\\times \\texttt{lowtemp}\\\\\n &+ \\beta_3 \\times \\texttt{spring} + \\beta_4 \\times \\texttt{weekday}\\\\\n &+ \\beta_5 \\times \\texttt{cloudcover}+ \\beta_6 \\times \\texttt{precip}\\\\\n \\end{aligned}\n $$\n \n $$\n \\begin{aligned}\n E[\\texttt{volume}] = \\beta_0 &+ \\beta_1\\times \\texttt{hightemp} + \\beta_2\\times \\texttt{precip}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n :::\n :::\n \n a. In the second graph, note the point at roughly (predicted = 400 and error = 100). Estimate the observed and predicted value for that observation.\n \n b. Using the same point, describe which cross-validation fold(s) were used to build its prediction model.\n \n c. For the plot on the left, for one of the cross-validation folds, how many coefficients were estimated in the linear model? For the plot on the right, for one of the cross-validation folds, how many coefficients were estimated in the linear model?\n \n d. Do the values of the residuals (along the y-axis, not the x-axis) seem markedly different for the two models? Explain.\n \n \\clearpage\n\n1. 
**RailTrail, cross-validation to choose model.**\nThe Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.\n\n The plots below describe prediction errors associated with two different models designed to predict the `volume` of riders on the RailTrail; one model uses 6 predictors, one model uses 2 predictors. Using 3-fold cross-validation, the data were split into 3 folds. Three of the folds estimate the $\\beta_i$ parameters using $b_i$, and the model is applied to the held out fold for prediction. The process was repeated 4 times (each time holding out one of the folds).\n\n $$\n \\begin{aligned}\n E[\\texttt{volume}] = \\beta_0 &+ \\beta_1\\times \\texttt{hightemp} + \\beta_2\\times \\texttt{lowtemp}\\\\\n &+ \\beta_3 \\times \\texttt{spring} + \\beta_4 \\times \\texttt{weekday}\\\\\\\n &+ \\beta_5 \\times \\texttt{cloudcover}+ \\beta_6 \\times \\texttt{precip}\\\\\n \\end{aligned}\n $$\n \n $$\n \\begin{aligned}\n E[\\texttt{volume}] = \\beta_0 &+ \\beta_1\\times \\texttt{hightemp} + \\beta_2\\times \\texttt{precip}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](25-inf-model-mlr_files/figure-html/unnamed-chunk-40-1.png){width=90%}\n :::\n :::\n \n a. Using the spread of the points, which model should be chosen for a final report on these data? Explain.\n \n b. Using the summary statistic (CV SSE), which model should be chosen for a final report on these data? Explain.\n \n c. Why would the model with more predictors fit the data less closely than the model with only two predictors?\n\n\n:::\n", + "supporting": [ + "25-inf-model-mlr_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/25-inf-model-mlr/figure-html/coinfig-1.png b/_freeze/25-inf-model-mlr/figure-html/coinfig-1.png new file mode 100644 index 00000000..88e6e4ce Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/coinfig-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/peng-mass1-1.png b/_freeze/25-inf-model-mlr/figure-html/peng-mass1-1.png new file mode 100644 index 00000000..cc460730 Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/peng-mass1-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/peng-mass2-1.png b/_freeze/25-inf-model-mlr/figure-html/peng-mass2-1.png new file mode 100644 index 00000000..cdac4ced Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/peng-mass2-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-27-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 00000000..61db0a94 Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-27-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-28-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-28-1.png new file mode 100644 index 00000000..7cbe7234 Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-28-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-30-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 
00000000..eb99a4de Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-30-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-31-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 00000000..92ae2cd4 Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-31-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-34-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-34-1.png new file mode 100644 index 00000000..cae2f95a Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-34-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-36-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-36-1.png new file mode 100644 index 00000000..a0df20eb Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-36-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-37-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-37-1.png new file mode 100644 index 00000000..6252c2a6 Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-37-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-38-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-38-1.png new file mode 100644 index 00000000..6252c2a6 Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-38-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-39-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-39-1.png new file mode 100644 index 00000000..2a6bc6ab Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-39-1.png differ diff --git a/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-40-1.png b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-40-1.png new file mode 100644 index 00000000..2a6bc6ab Binary files /dev/null and b/_freeze/25-inf-model-mlr/figure-html/unnamed-chunk-40-1.png differ diff --git a/_freeze/26-inf-model-logistic/execute-results/html.json b/_freeze/26-inf-model-logistic/execute-results/html.json new file mode 100644 index 00000000..d16e8db3 --- /dev/null +++ b/_freeze/26-inf-model-logistic/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "911c27faa9d94687ede8b2064a6c2b94", + "result": { + "markdown": "# Inference for logistic regression {#inf-model-logistic}\n\n\n\n\n\n::: {.chapterintro data-latex=\"\"}\nCombining ideas from Chapter \\@ref(model-logistic) on logistic regression, Chapter \\@ref(foundations-mathematical) on inference with mathematical models, and Chapters \\@ref(inf-model-slr) and \\@ref(inf-model-mlr) which apply inferential techniques to the linear model, we wrap up the book by considering inferential methods applied to a logistic regression model.\nAdditionally, we use cross-validation as a method for independent assessment of the logistic regression model.\n:::\n\n\n\n\n\nAs with multiple linear regression, the inference aspect for logistic regression will focus on interpretation of coefficients and relationships between explanatory variables.\nBoth p-values and cross-validation will be used for assessing a logistic regression model.\n\nConsider the `email` data which describes email characteristics which can be used to predict whether a particular incoming email is (unsolicited bulk email).\nWithout reading every incoming message, it might be nice to have an automated way to identify spam emails.\nWhich of the 
variables describing each email are important for predicting the status of the email?\n\n::: {.data data-latex=\"\"}\nThe [`email`](http://openintrostat.github.io/openintro/reference/email.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variables and their descriptions for the `email` dataset. Many of the variables are indicator variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.
Variable Description
spam Indicator for whether the email was spam.
to_multiple Indicator for whether the email was addressed to more than one recipient.
from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
cc Number of people cc'ed.
sent_email Indicator for whether the sender had been sent an email in the last 30 days.
attach The number of attached files.
dollar The number of times a dollar sign or the word “dollar” appeared in the email.
winner Indicates whether “winner” appeared in the email.
format Indicates whether the email was written using HTML (e.g., may have included bolding or active links).
re_subj Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
exclaim_subj Whether there was an exclamation point in the subject.
urgent_subj Whether the word “urgent” was in the email subject.
exclaim_mess The number of exclamation points in the email message.
number Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
\n\n`````\n:::\n:::\n\n\n## Model diagnostics\n\nBefore looking at the hypothesis tests associated with the coefficients (turns out they are very similar to those in linear regression!), it is valuable to understand the technical conditions that underlie the inference applied to the logistic regression model.\nGenerally, as you've seen in the logistic regression modeling examples, it is imperative that the response variable is binary.\nAdditionally, the key technical condition for logistic regression has to do with the relationship between the predictor variables $(x_i$ values) and the probability the outcome will be a success.\nIt turns out, the relationship is a specific functional form called a logit function, where ${\\rm logit}(p) = \\log_e(\\frac{p}{1-p}).$ The function may feel complicated, and memorizing the formula of the logit is not necessary for understanding logistic regression.\nWhat you do need to remember is that the probability of the outcome being a success is a function of a linear combination of the explanatory variables.\n\n::: {.important data-latex=\"\"}\n**Logistic regression conditions.**\n\nThere are two key conditions for fitting a logistic regression model:\n\n1. Each outcome $Y_i$ is independent of the other outcomes.\n2. Each predictor $x_i$ is linearly related to logit$(p_i)$ if all other predictors are held constant.\n:::\n\n\n\n\n\nThe first logistic regression model condition --- independence of the outcomes --- is reasonable if we can assume that the emails that arrive in an inbox within a few months are independent of each other with respect to whether they're spam or not.\n\nThe second condition of the logistic regression model is not easily checked without a fairly sizable amount of data.\nLuckily, we have 3921 emails in the dataset!\nLet's first visualize these data by plotting the true classification of the emails against the model's fitted probabilities, as shown in Figure \\@ref(fig:spam-predict).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The predicted probability that each of the 3921 emails that are spam. Points have been jittered so that those with nearly identical values aren’t plotted exactly on top of one another.](26-inf-model-logistic_files/figure-html/spam-predict-1.png){width=90%}\n:::\n:::\n\n\nWe'd like to assess the quality of the model.\nFor example, we might ask: if we look at emails that we modeled as having 10% chance of being spam, do we find out 10% of the actually are spam?\nWe can check this for groups of the data by constructing a plot as follows:\n\n1. Bucket the data into groups based on their predicted probabilities.\n2. Compute the average predicted probability for each group.\n3. Compute the observed probability for each group, along with a 95% confidence interval for the true probability of success for those individuals.\n4. 
Plot the observed probabilities (with 95% confidence intervals) against the average predicted probabilities for each group.\n\nIf the model does a good job describing the data, the plotted points should fall close to the line $y = x$, since the predicted probabilities should be similar to the observed probabilities.\nWe can use the confidence intervals to roughly gauge whether anything might be amiss.\nSuch a plot is shown in Figure \\@ref(fig:logisticModelBucketDiag).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:logisticModelBucketDiag-cap)](26-inf-model-logistic_files/figure-html/logisticModelBucketDiag-1.png){width=90%}\n:::\n:::\n\n\n(ref:logisticModelBucketDiag-cap) The dashed line is within the confidence bound of the 95% confidence intervals of each of the buckets, suggesting the logistic fit is reasonable.\n\nA plot like Figure \\@ref(fig:logisticModelBucketDiag) helps to better understand the deviations.\nAdditional diagnostics may be created that are similar to those featured in Section \\@ref(tech-cond-linmod).\nFor instance, we could compute residuals as the observed outcome minus the expected outcome ($e_i = Y_i - \\hat{p}_i$), and then we could create plots of these residuals against each predictor.\n\n\\index{logistic regression}\n\n## Multiple logistic regression output from software {#inf-log-reg-soft}\n\nAs you learned in Chapter \\@ref(model-mlr), optimization can be used to find the coefficient estimates for the logistic model.\nThe unknown population model can be written as:\n\n$$\n\\begin{aligned}\n\\log_e\\bigg(\\frac{p}{1-p}\\bigg) &= \\beta_0 + \\beta_1 \\times \\texttt{to_multiple} + \\beta_2 \\times \\texttt{cc} \\\\\n&+ \\beta_3 \\times \\texttt{dollar} + \\beta_4 \\times \\texttt{urgent_subj}\n\\end{aligned}\n$$\n\nThe estimated equation for the regression model may be written as a model with four predictor variables (where $\\hat{p}$ is the estimated probability of being a spam email message):\n\n$$\n\\begin{aligned}\n\\log_e\\bigg(\\frac{\\hat{p}}{1-\\hat{p}}\\bigg) &= -2.05 + -1.91 \\times \\texttt{to_multiple} + 0.02 \\times \\texttt{cc} \\\\\n&- 0.07 \\times \\texttt{dollar} + 2.66 \\times \\texttt{urgent_subj}\n\\end{aligned}\n$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a logistic model for predicting whether an email is spam based on the variables `to_multiple`, `cc`, `dollar`, and `urgent_subj`. Each of the variables has its own coefficient estimate and p-value.
term estimate std.error statistic p.value
(Intercept) -2.05 0.06 -34.67 <0.0001
to_multiple1 -1.91 0.30 -6.37 <0.0001
cc 0.02 0.02 1.16 0.245
dollar -0.07 0.02 -3.38 7e-04
urgent_subj1 2.66 0.80 3.32 9e-04
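The coefficient table above can be reproduced, up to rounding, with a short R sketch. This is a minimal sketch rather than the authors' own code; it assumes the **openintro** package (for the `email` data) and the **broom** package are installed.

```r
library(openintro)  # provides the email data
library(broom)      # tidy() turns a fitted model into a summary table

# Four-predictor logistic regression model for whether an email is spam
fit_spam <- glm(spam ~ to_multiple + cc + dollar + urgent_subj,
                data = email, family = binomial)

# Estimates, standard errors, z statistics, and p-values
tidy(fit_spam)
```

Each row of the `tidy()` output corresponds to a row of the table, including the p-values discussed below.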
\n\n`````\n:::\n:::\n\n\nNot only does Table \\@ref(tab:emaillogmodel) provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which are the focus of this chapter.\n\nAs in Section \\@ref(inf-model-mlr), with **multiple predictors**, each hypothesis test (for each of the explanatory variables) is conditioned on each of the other variables remaining in the model.\n\n\n\n\n\n> if multiple predictors $H_0: \\beta_i = 0$ given other variables in the model\n\nUsing the example above and focusing on each of the variable p-values (here we won't discuss the p-value associated with the intercept), we can write out the four different hypotheses (associated with the p-value corresponding to each of the coefficients / row in Table \\@ref(tab:emaillogmodel)):\n\n- $H_0: \\beta_1 = 0$ given `cc`, `dollar`, and `urgent_subj` are included in the model\n- $H_0: \\beta_2 = 0$ given `to_multiple`, `dollar`, and `urgent_subj` are included in the model\n- $H_0: \\beta_3 = 0$ given `to_multiple`, `cc`, and `urgent_subj` are included in the model\n- $H_0: \\beta_4 = 0$ given `to_multiple`, `dollar`, and `dollar` are included in the model\n\nThe very low p-values from the software output tell us that three of the variables (that is, not `cc`) act as statistically significant predictors in the model at the significance level of 0.05, despite the inclusion of any of the other variables.\nConsider the p-value on $H_0: \\beta_1 = 0$.\nThe low p-value says that it would be extremely unlikely to observe data that yield a coefficient on `to_multiple` at least as far from 0 as -1.91 (i.e. $|b_1| > 1.91$) if the true relationship between `to_multiple` and `spam` was non-existent (i.e., if $\\beta_1 = 0$) **and** the model also included `cc` and `dollar` and `urgent_subj`.\nNote also that the coefficient on `dollar` has a small associated p-value, but the magnitude of the coefficient is also seemingly small (0.07).\nIt turns out that in units of standard errors (0.02 here), 0.07 is actually quite far from zero, it's all about context!\nThe p-values on the remaining variables are interpreted similarly.\nFrom the initial output (p-values) in Table \\@ref(tab:emaillogmodel), it seems as though `to_multiple`, `dollar`, and `urgent_subj` are important variables for modeling whether an email is `spam`.\nWe remind you that although p-values provide some information about the importance of each of the predictors in the model, there are many, arguably more important, aspects to consider when choosing the best model.\n\nAs with linear regression (see Section \\@ref(inf-mult-reg-collin)), existence of predictors that are correlated with each other can affect both the coefficient estimates and the associated p-values.\nHowever, investigating multicollinearity in a logistic regression model is saved for a text which provides more detail about logistic regression.\nNext, as a model building alternative (or enhancement) to p-values, we revisit cross-validation within the context of predicting status for each of the individual emails.\n\n## Cross-validation for prediction error {#inf-log-reg-cv}\n\nThe p-value is a probability measure under a setting of no relationship.\nThat p-value provides information about the degree of the relationship (e.g., above we measure the relationship between `spam` and `to_multiple` using a p-value), but the p-value does not measure how well the model will predict the individual emails (e.g., the accuracy of the model).\nDepending on the 
goal of the research project, you might be inclined to focus on variable importance (through p-values) or you might be inclined to focus on prediction accuracy (through cross-validation).\n\nHere we present a method for using cross-validation accuracy to determine which variables (if any) should be used in a model which predicts whether an email is spam.\nA full treatment of cross-validation and logistic regression models is beyond the scope of this text.\nUsing cross-validation, we can build $k$ different models which are used to predict the observations in each of the $k$ holdout samples.\nThe smaller model uses only the `to_multiple` variable, see the complete dataset (not cross-validated) model output in Table \\@ref(tab:emaillogmodel1).\nThe logistic regression model can be written as (where $\\hat{p}$ is the estimated probability of being a spam email message):\n\n\n\n\n\n$$\\log_e\\bigg(\\frac{\\hat{p}}{1-\\hat{p}}\\bigg) = -2.12 + -1.81 \\times \\texttt{to_multiple}$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a logistic model for predicting whether an email is spam based on only the predictor variable `to_multiple`. Each of the variables has its own coefficient estimate and p-value.
term estimate std.error statistic p.value
(Intercept) -2.12 0.06 -37.67 <0.0001
to_multiple1 -1.81 0.30 -6.09 <0.0001
\n\n`````\n:::\n:::\n\n\nFor each cross-validated model, the coefficients change slightly, and the model is used to make independent predictions on the holdout sample.\nThe model from the first cross-validation sample is given in Table \\@ref(fig:emailCV1) and can be compared to the coefficients in Table \\@ref(tab:emaillogmodel1).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The coefficients are estimated using the least squares model on 3/4 of the dataset with a single predictor variable. Predictions are made on the remaining 1/4 of the observations. Note that the predictions are independent of the estimated model coefficients, and the prediction error rate is quite high.](images/emailCV1.png){fig-alt='The left panel shows the logistic model predicting the probability of an email being spam as a function of whether the email was sent to multiple individuals; the model was built using the red, green, and yellow triangular sections of the observed data. The right panel shows a confusion matrix of the predicted spam label crossed with the observed spam label for the set of observations in the blue triangular section of the observed data. Of the 83 spam emails, 80 were correctly classified as spam. Of the 897 non-spam emails, 143 were correctly classified as not spam.' width=100%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
One quarter at a time, the data were removed from the model building, and whether the email was spam (TRUE) or not (FALSE) was predicted. The logistic regression model was fit independently of the removed emails. Only `to_multiple` is used to predict whether the email is spam. Because we used a cutoff designed to identify spam emails, the accuracy of the non-spam email predictions is very low.
fold count accuracy notspamTP spamTP
1st quarter 980 0.26 0.19 0.98
2nd quarter 981 0.23 0.15 0.96
3rd quarter 979 0.25 0.18 0.96
4th quarter 981 0.24 0.17 0.98
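The fold-by-fold numbers above can be approximated with a short cross-validation loop. This is a minimal sketch under stated assumptions, not the authors' exact procedure: it assumes the **openintro** package for the `email` data, assigns the four folds at random (so the results will differ slightly from the table), and uses the 10% cutoff described in the text.

```r
library(openintro)  # email data

set.seed(47)
k <- 4
fold <- sample(rep(1:k, length.out = nrow(email)))

cv_summary <- data.frame(fold = 1:k, accuracy = NA, notspamTP = NA, spamTP = NA)

for (i in 1:k) {
  train <- email[fold != i, ]
  test  <- email[fold == i, ]

  # Smaller model: to_multiple is the only predictor
  fit <- glm(spam ~ to_multiple, data = train, family = binomial)

  # Predicted probability of spam for the held-out quarter; flag spam at 10%
  p_hat     <- predict(fit, newdata = test, type = "response")
  pred_spam <- p_hat >= 0.10
  obs_spam  <- test$spam == 1   # spam is coded 0/1 in the email data

  cv_summary$accuracy[i]  <- mean(pred_spam == obs_spam)
  cv_summary$notspamTP[i] <- mean(!pred_spam[!obs_spam])  # non-spam kept as non-spam
  cv_summary$spamTP[i]    <- mean(pred_spam[obs_spam])    # spam correctly flagged
}

cv_summary
```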
\n\n`````\n:::\n:::\n\n\nBecause the `email` dataset has a ratio of roughly 90% non-spam and 10% spam emails, a model which randomly guessed all non-spam would have an overall accuracy of 90%!\nClearly, we would like to capture the information with the spam emails, so our interest is in the percent of spam emails which are identified as spam (see Table \\@ref(tab:email-spam)).\nAdditionally, in the logistic regression model, we use a 10% cutoff to predict whether the email is spam.\nFortunately, we have done a great job of predicting !\nHowever, the trade-off was that most of the non-spam emails are now predicted to be which is not acceptable for a prediction algorithm.\nAdding more variables to the model may help with both the spam and non-spam predictions.\n\nThe larger model uses `to_multiple`, `attach`, `winner`, `format`, `re_subj`, `exclaim_mess`, and `number` as the set of seven predictor variables, see the complete dataset (not cross-validated) model output in Table \\@ref(tab:emaillogmodel2).\nThe logistic regression model can be written as (where $\\hat{p}$ is the estimated probability of being a spam email message):\n\n$$\n\\begin{aligned}\n\\log_e\\bigg(\\frac{\\hat{p}}{1-\\hat{p}}\\bigg) = -0.34 &- 2.56 \\times \\texttt{to_multiple} + 0.20 \\times \\texttt{attach} \\\\\n&+ 1.73 \\times \\texttt{winner}_{yes} - 1.28 \\times \\texttt{format} \\\\\n&- 2.86 \\times \\texttt{re_subj} + 0 \\times \\texttt{exclaim_mess} \\\\\n&- 1.07 \\times \\texttt{number}_{small} - 0.42 \\times \\texttt{number}_{big}\n\\end{aligned}\n$$\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a logistic model for predicting whether an email is spam based on the predictor variables `to_multiple`, `attach`, `winner`, `format`, `re_subj`, `exclaim_mess`, and `number`. Each of the variables has its own coefficient estimate and p-value.
term estimate std.error statistic p.value
(Intercept) -0.34 0.11 -3.02 0.0025
to_multiple1 -2.56 0.31 -8.28 <0.0001
attach 0.20 0.06 3.29 0.001
winneryes 1.73 0.33 5.33 <0.0001
format1 -1.28 0.13 -9.80 <0.0001
re_subj1 -2.86 0.37 -7.83 <0.0001
exclaim_mess 0.00 0.00 0.26 0.7925
numbersmall -1.07 0.14 -7.54 <0.0001
numberbig -0.42 0.20 -2.10 0.0357
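As a quick, if optimistic, look at how the larger model classifies individual emails, one can fit it on the full dataset and cross-tabulate the predicted labels (again using the 10% cutoff) against the observed status. This is a minimal sketch; unlike the held-out confusion matrices pictured in the cross-validation figures, it predicts the same emails used to fit the model, so its counts will look somewhat better than honest out-of-sample counts.

```r
library(openintro)  # email data

fit_big <- glm(spam ~ to_multiple + attach + winner + format +
                 re_subj + exclaim_mess + number,
               data = email, family = binomial)

p_hat <- predict(fit_big, type = "response")
pred  <- ifelse(p_hat >= 0.10, "predicted spam", "predicted not spam")

# Confusion matrix: predicted label versus observed spam status (0/1)
table(pred, observed = email$spam)
```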
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![The coefficients are estimated using the least squares model on 3/4 of the dataset with the seven specified predictor variables. Predictions are made on the remaining 1/4 of the observations. Note that the predictions are independent of the estimated model coefficients. The predictions are now much better for both the and the non-spam emails (than they were with a single predictor variable).](images/emailCV2.png){fig-alt='The left panel shows the logistic model predicting the probability of an email being spam as a function of the email being sent to multiple recipients, number of attachments, using the word winner, format of HTML, RE in the subject line, number of exclamation points in the message, and the existence of a number in the email; the model was built using the red, green, and yellow triangular sections of the observed data. The right panel shows a confusion matrix of the predicted spam label crossed with the observed spam label for the set of observations in the blue triangular section of the observed data. Of the 97 spam emails, 73 were correctly classified as spam. Of the 883 non-spam emails, 688 were correctly classified as not spam.' width=100%}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
One quarter at a time, the data were removed from the model building, and whether the email was spam (TRUE) or not (FALSE) was predicted. The logistic regression model was fit independently of the removed emails. Now, the variables `to_multiple`, `attach`, `winner`, `format`, `re_subj`, `exclaim_mess`, and `number` are used to predict whether the email is spam.
fold count accuracy notspamTP spamTP
1st quarter 980 0.77 0.77 0.71
2nd quarter 981 0.80 0.81 0.70
3rd quarter 979 0.76 0.77 0.65
4th quarter 981 0.78 0.79 0.75
\n\n`````\n:::\n:::\n\n\nSomewhat expected, the larger model (see Table \\@ref(tab:email-spam2)) was able to capture more nuance in the emails which lead to better predictions.\nHowever, it is not true that adding variables will always lead to better predictions, as correlated or noise variables may dampen the signal from those variables which truly predict the status.\nWe encourage you to learn more about multiple variable models and cross-validation in your future exploration of statistical topics.\n\n\\clearpage\n\n## Chapter review {#chp27-review}\n\n### Summary\n\nThroughout the text, we have presented a modern view to introduction to statistics.\nEarly we presented graphical techniques which communicated relationships across multiple variables.\nWe also used modeling to formalize the relationships.\nIn Chapter \\@ref(inf-model-logistic) we considered inferential claims on models which include many variables used to predict the probability of the outcome being a success.\nWe continue to emphasize the importance of experimental design in making conclusions about research claims.\nIn particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure \\@ref(fig:randsampValloc)).\n\nAs you might guess, this text has only scratched the surface of the world of statistical analyses that can be applied to different datasets.\nIn particular, to do justice to the topic, the linear models and generalized linear models we have introduced can each be covered with their own course or book.\nHierarchical models, alternative methods for fitting parameters (e.g., Ridge Regression or LASSO), and advanced computational methods applied to models (e.g., permuting the response variable? one explanatory variable? all the explanatory variables?) are all beyond the scope of this book.\nHowever, your successful understanding of the ideas we have covered has set you up perfectly to move on to a higher level of statistical modeling and inference.\nEnjoy!\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n \n\n
cross-validation multiple predictors
inference on logistic regression technical conditions
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp26-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-26].\n\n::: {.exercises data-latex=\"\"}\n1. **Marijuana use in college.** \nResearchers studying whether the value systems adolescents conflict with those of their children asked 445 college students if they use marijuana. They also asked the students' parents if they've used marijuana when they were in college.\nThe following model was fit to predict student drug use from parent drug use.^[The [`drug_use`](http://openintrostat.github.io/openintro/reference/drug_use.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Ellis:1979]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) -0.405 0.133 -3.04 0.0023
parentsused 0.791 0.194 4.09 <0.0001
\n \n `````\n :::\n :::\n\n a. State the hypotheses for evaluating whether parents' marijuana usage is a significant predictor of their kids' marijuana usage.\n \n b. Based on the regression output, state the sample statistic and the p-value of the test.\n \n c. State the conclusion of the hypothesis test in context of the data and the research question.\n\n1. **Treating heart attacks.**\nResearchers studying the effectiveness of Sulfinpyrazone in the prevention of sudden death after a heart attack conducted a randomized experiment on 1,475 patients. \nThe following model was fit to predict experiment outcome (`died` or `lived`, where success is defined as `lived`) from the experimental group (`control` and `treatment`).^[The [`sulphinpyrazone`](http://openintrostat.github.io/openintro/reference/sulphinpyrazone.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Anturane:1980]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
term estimate std.error statistic p.value
(Intercept) 2.431 0.135 18.05 <0.0001
grouptreatment 0.395 0.210 1.89 0.0594
\n \n `````\n :::\n :::\n\n a. State the hypotheses for evaluating whether experimental group is a significant predictor of treatment outcome.\n \n b. Based on the regression output, state the sample statistic and the p-value of the test.\n \n c. State the conclusion of the hypothesis test in context of the data and the research question.\n \n \\clearpage\n\n1. **Possum classification, cross-validation.** \nThe common brushtail possum of the Australia region is a bit cuter than its distant cousin, the American opossum.\nWe consider 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population.\nThe first region is Victoria, which is in the eastern half of Australia and traverses the southern coast. \nThe second region consists of New South Wales and Queensland, which make up eastern and northeastern Australia. \nWe use logistic regression to differentiate between possums in these two regions.\nThe outcome variable, called `pop`, takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland.^[The [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n \\vspace{-2mm}\n\n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{tail_l}\n \\end{aligned}\n $$\n \n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{total_l} + \\beta_2\\times \\texttt{sex}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-21-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-21-2.png){width=90%}\n :::\n :::\n \n a. How many observations are in Fold2? Use the model with only tail length as a predictor variable. Of the observations in Fold2, how many of them were correctly predicted to be from Vicotria? How many of them were incorrectly predicted to be from Victoria?\n \n b. How many observations are used to build the model which predicts for the observations in Fold2?\n \n c. For one of the cross-validation folds, how many coefficients were estimated for the model which uses tail length as a predictor? For one of the cross-validation folds, how many coefficients were estimated for the model which uses total length and sex as predictors?\n \n \\clearpage\n\n1. **Possum classification, cross-validation to choose model.** \nIn this exercise we consider 104 brushtail possums from two regions in Australia (the first region is Victoria and the second is New South Wales and Queensland), where the possums may be considered a random sample from the population. 
\nWe use logistic regression to classify the possums into the two regions.\nThe outcome variable, called `pop`, takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland.\n\n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{tail_l}\n \\end{aligned}\n $$\n \n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{total_l} + \\beta_2\\times \\texttt{sex}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-22-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-22-2.png){width=90%}\n :::\n :::\n\n a. For the model with tail length, how many of the observations were correctly classified? What proportion of the observations were correctly classified?\n \n b. For the model with total length and sex, how many of the observations were correctly classified? What proportion of the observations were correctly classified?\n \n c. If you have to choose between using only tail length as a predictor versus using total length and sex as predictors (for classification into region), which model would you choose? Explain.\n \n d. Given the predictions provided, what model might be preferable to either of the models given above?\n \n \\clearpage\n\n1. **Premature babies, cross-validation.** \nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1000 births from 2014 (with some rows removed due to missing data).\nHere, we use logistic regression to model whether the baby is premature from various explanatory variables.^[The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@data:births14]\n\n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{mage} + \\beta_2\\times \\texttt{weight}\\\\\n &+ \\beta_3 \\times \\texttt{mature} + \\beta_4 \\times \\texttt{visits}\\\\\n &+ \\beta_5 \\times \\texttt{gained}+ \\beta_6 \\times \\texttt{habit}\\\\\n \\end{aligned}\n $$\n \n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{weight} + \\beta_2\\times \\texttt{mature}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-23-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-23-2.png){width=90%}\n :::\n :::\n \n a. How many observations are in Fold2? Use the model with only `weight` and `mature` as predictor variables. Of the observations in Fold2, how many of them were correctly predicted to be premature? How many of them were incorrectly predicted to be premature?\n \n b. How many observations are used to build the model which predicts for the observations in Fold2?\n \n c. In the original dataset, are most of the births premature or full term? Explain.\n \n d. For one of the cross-validation folds, how many coefficients were estimated for the model which uses `mage`, `weight`, `mature`, `visits`, `gained`, and `habit` as predictors? 
For one of the cross-validation folds, how many coefficients were estimated for the model which uses `weight` and `mature` as predictors?\n \n \\clearpage\n\n1. **Premature babies, cross-validation to choose model.** \nUS Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\nThe data used here are a random sample of 1000 births from 2014 (with some rows removed due to missing data).\nHere, we use logistic regression to model whether the baby is premature from various explanatory variables. [@data:births14]\n\n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{mage} + \\beta_2\\times \\texttt{weight}\\\\\n &+ \\beta_3 \\times \\texttt{mature} + \\beta_4 \\times \\texttt{visits}\\\\\n &+ \\beta_5 \\times \\texttt{gained}+ \\beta_6 \\times \\texttt{habit}\\\\\n \\end{aligned}\n $$\n \n $$\n \\begin{aligned}\n \\log_e \\bigg(\\frac{p}{1-p}\\bigg) = \\beta_0 &+ \\beta_1\\times \\texttt{weight} + \\beta_2\\times \\texttt{mature}\n \\end{aligned}\n $$\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-24-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n ![](26-inf-model-logistic_files/figure-html/unnamed-chunk-24-2.png){width=90%}\n :::\n :::\n\n a. For the model with 6 predictors, how many of the observations were correctly classified? What proportion of the observations were correctly classified?\n \n b. For the model with 2 predictors, how many of the observations were correctly classified? What proportion of the observations were correctly classified?\n \n c. If you have to choose between the model with 6 predictors and the model with 2 predictors (for predicting whether a baby will be premature), which model would you choose? 
Explain.\n\n\n:::\n", + "supporting": [ + "26-inf-model-logistic_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/26-inf-model-logistic/figure-html/logisticModelBucketDiag-1.png b/_freeze/26-inf-model-logistic/figure-html/logisticModelBucketDiag-1.png new file mode 100644 index 00000000..719d457e Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/logisticModelBucketDiag-1.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/spam-predict-1.png b/_freeze/26-inf-model-logistic/figure-html/spam-predict-1.png new file mode 100644 index 00000000..c489938a Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/spam-predict-1.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-21-1.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 00000000..cbfe6260 Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-21-1.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-21-2.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-21-2.png new file mode 100644 index 00000000..925c0e33 Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-21-2.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-22-1.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 00000000..cbfe6260 Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-22-1.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-22-2.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-22-2.png new file mode 100644 index 00000000..925c0e33 Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-22-2.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-23-1.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-23-1.png new file mode 100644 index 00000000..c7ff187b Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-23-1.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-23-2.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-23-2.png new file mode 100644 index 00000000..c065ffdc Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-23-2.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-24-1.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-24-1.png new file mode 100644 index 00000000..c7ff187b Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-24-1.png differ diff --git a/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-24-2.png b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-24-2.png new file mode 100644 index 00000000..c065ffdc Binary files /dev/null and b/_freeze/26-inf-model-logistic/figure-html/unnamed-chunk-24-2.png differ diff --git a/_freeze/27-inf-model-applications/execute-results/html.json b/_freeze/27-inf-model-applications/execute-results/html.json new file mode 100644 index 00000000..c40fc042 --- /dev/null +++ b/_freeze/27-inf-model-applications/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "41b4f81c00b9460e5b3e6a663a11e7a9", + 
"result": { + "markdown": "# Applications: Model and infer {#inf-model-applications}\n\n\n\n\n\n## Case study: Mario Kart {#case-study-mario-cart}\n\nIn this case study, we consider Ebay auctions of a video game called *Mario Kart* for the Nintendo Wii.\nThe outcome variable of interest is the total price of an auction, which is the highest bid plus the shipping cost.\nWe will try to determine how total price is related to each characteristic in an auction while simultaneously controlling for other variables.\nFor instance, all other characteristics held constant, are longer auctions associated with higher or lower prices?\nAnd, on average, how much more do buyers tend to pay for additional Wii wheels (plastic steering wheels that attach to the Wii controller) in auctions?\nMultiple regression will help us answer these and other questions.\n\n::: {.data data-latex=\"\"}\nThe [`mariokart`](http://openintrostat.github.io/openintro/reference/mariokart.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe `mariokart` data set includes results from 141 auctions.\nFour observations from this data set are shown in Table \\@ref(tab:mariokart-data-frame), and descriptions for each variable are shown in Table \\@ref(tab:mariokart-var-def).\nNotice that the condition and stock photo variables are indicator variables\\index{indicator variable}, similar to `bankruptcy` in the `loans` data set from Chapter \\@ref(inf-model-mlr).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Top four rows of the `mariokart` dataset.
price cond_new stock_photo duration wheels
51.5 new yes 3 1
37.0 used yes 7 1
45.5 new no 3 1
44.0 new yes 3 1
\n\n`````\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Variables and their descriptions for the `mariokart` dataset.
Variable Description
price Final auction price plus shipping costs, in US dollars.
cond_new Indicator variable for whether the game is new (1) or used (0).
stock_photo Indicator variable for whether the auction's main photo is a stock photo.
duration The length of the auction, in days, taking values from 1 to 10.
wheels The number of Wii wheels included with the auction. A Wii wheel is an optional steering wheel accessory that holds the Wii controller.
\n\n`````\n:::\n:::\n\n\n### Mathematical approach to linear models\n\nIn Table \\@ref(tab:mariokart-model-output) we fit a mathematical linear regression model with the game's condition as a predictor of auction price.\n\n$$E[\\texttt{price}] = \\beta_0 + \\beta_1\\times \\texttt{cond_new}$$\n\nResults of the model are summarized below:\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a linear model for predicting `price` based on `cond_new`.
term estimate std.error statistic p.value
(Intercept) 42.9 0.81 52.67 <0.0001
cond_new 10.9 1.26 8.66 <0.0001
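A summary table like the one above can be produced with `lm()` and **broom**. The sketch below assumes a data frame named `mariokart` whose columns are named as in the tables of this case study (`price`, `cond_new`, and so on), with `cond_new` coded 0/1; the raw `mariokart` data in the **openintro** package uses somewhat different column names (for example, the price variable is `total_pr`), so some renaming may be needed first.

```r
library(broom)

# Assumes columns named as in this case study: price, cond_new (0/1),
# stock_photo, duration, wheels
fit_cond <- lm(price ~ cond_new, data = mariokart)
tidy(fit_cond)
```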
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWrite down the equation for the model, note whether the slope is statistically different from zero, and interpret the coefficient.[^27-inf-model-applications-1]\n:::\n\n[^27-inf-model-applications-1]: The equation for the line may be written as $\\widehat{\\texttt{price}} = 47.15 + 10.90\\times \\texttt{cond_new}$.\n Examining the regression output in we can see that the p-value for `cond_new` is very close to zero, indicating there is strong evidence that the coefficient is different from zero when using this one-variable model.\n The variable `cond_new` is a two-level categorical variable that takes value 1 when the game is new and value 0 when the game is used.\n This means the 10.90 model coefficient predicts a price of an extra \\$10.90 for those games that are new versus those that are used.\n\nSometimes there are underlying structures or relationships between predictor variables.\nFor instance, new games sold on Ebay tend to come with more Wii wheels, which may have led to higher prices for those auctions.\nWe would like to fit a model that includes all potentially important variables simultaneously, which would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables.\n\nWe want to construct a model that accounts for not only the game condition but simultaneously accounts for three other variables:\n\n$$\n\\begin{aligned}\nE[\\texttt{price}]\n &= \\beta_0 + \\beta_1\\times \\texttt{cond_new} +\n \\beta_2\\times \\texttt{stock_photo} \\\\\n &\\qquad\\ + \\beta_3 \\times \\texttt{duration} +\n \\beta_4 \\times \\texttt{wheels}\n\\end{aligned}\n$$\n\nTable \\@ref(tab:mariokart-full-model-output) summarizes the full model.\nUsing the output, we identify the point estimates of each coefficient and the corresponding impact (measured with information for the standard error to compute the p-value).\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n \n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Summary of a linear model for predicting `price` based on `cond_new`, `stock_photo`, `duration`, and `wheels`.
term estimate std.error statistic p.value
(Intercept) 36.21 1.51 23.92 <0.0001
cond_new 5.13 1.05 4.88 <0.0001
stock_photo 1.08 1.06 1.02 0.3085
duration -0.03 0.19 -0.14 0.8882
wheels 7.29 0.55 13.13 <0.0001
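Under the same naming assumptions as the previous sketch, the four-predictor model can be fit and used to compute fitted values and residuals for individual auctions, the same quantities the guided practice below asks about and that the cross-validation plots later in this section are built from (there, predictions come from models fit without the held-out fold).

```r
# Assumes the same `mariokart` data frame as in the previous sketch
fit_full <- lm(price ~ cond_new + stock_photo + duration + wheels,
               data = mariokart)

# Fitted value and residual for the first auction in the data
y_hat_1 <- predict(fit_full)[1]
resid_1 <- mariokart$price[1] - y_hat_1   # equivalently: residuals(fit_full)[1]
resid_1
```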
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nWrite out the model's equation using the point estimates from Table \\@ref(tab:mariokart-full-model-output).\nHow many predictors are there in the model?\nHow many coefficients are estimated?[^27-inf-model-applications-2]\n:::\n\n[^27-inf-model-applications-2]: $\\widehat{\\texttt{price}} = 36.21 + 5.13 \\times \\texttt{cond_new} + 1.08 \\times \\texttt{stock_photo} - 0.03 \\times \\texttt{duration} + 7.29 \\times \\\\texttt{wheels}$, with the $k=4$ predictors but 5 coefficients (including the intercept).\n\n::: {.guidedpractice data-latex=\"\"}\nWhat does $\\beta_4,$ the coefficient of variable $x_4$ (Wii wheels), represent?\nWhat is the point estimate of $\\beta_4?$[^27-inf-model-applications-3]\n:::\n\n[^27-inf-model-applications-3]: In the population of all auctions, it is the average difference in auction price for each additional Wii wheel included when holding the other variables constant.\n The point estimate is $b_4 = 7.29$\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the residual of the first observation in Table \\@ref(tab:mariokart-data-frame) using the equation identified in Table \\@ref(tab:mariokart-full-model-output).[^27-inf-model-applications-4]\n:::\n\n[^27-inf-model-applications-4]: $e_i = y_i - \\hat{y_i} = 51.55 - 49.62 = 1.93$.\n\n::: {.workedexample data-latex=\"\"}\nIn Table \\@ref(tab:mariokart-model-output), we estimated a coefficient for `cond_new` in of $b_1 = 10.90$ with a standard error of $SE_{b_1} = 1.26$ when using simple linear regression.\nWhy might there be a difference between that estimate and the one in the multiple regression setting?\n\n------------------------------------------------------------------------\n\nIf we examined the data carefully, we would see that there is multicollinearity\\index{multicollinearity} among some predictors.\nFor instance, when we estimated the connection of the outcome `price` and predictor `cond_new` using simple linear regression, we were unable to control for other variables like the number of Wii wheels included in the auction.\nThat model was biased by the confounding variable `wheels`.\nWhen we use both variables, this particular underlying and unintentional bias is reduced or eliminated (though bias from other confounding variables may still remain).\n:::\n\n\\clearpage\n\n### Computational approach to linear models\n\nPreviously, using a mathematical model, we investigated the coefficients associated with `cond_new` when predicting `price` in a linear model.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:mariokart-rand-dist-cap)](27-inf-model-applications_files/figure-html/mariokart-rand-dist-1.png){width=90%}\n:::\n:::\n\n\n(ref:mariokart-rand-dist-cap) Estimated slopes from linear models (`price` regressed on `cond_new`) built on 1,000 randomized datasets. 
Each dataset was permuted under the null hypothesis.\n\n::: {.workedexample data-latex=\"\"}\nIn Figure \\@ref(fig:mariokart-rand-dist), the red line (the observed slope) is far from the bulk of the histogram.\nExplain why the randomly permuted datasets produce slopes that are quite different from the observed slope.\n\n------------------------------------------------------------------------\n\nThe null hypothesis is that, in the population, there is no linear relationship between the `price` and the `cond_new` of the *Mario Kart* games.\nWhen the data are randomly permuted, prices are randomly assigned to a condition (new or used), so that the null hypothesis is forced to be true, i.e. permutation is done under the assumption that no relationship between the two variables exists.\nIn the actual study, the new *Mario Kart* games do actually cost more (on average) than the used games!\nSo the slope describing the actual observed relationship is not one that is likely to have happened in a randomly dataset permuted under the assumption that the null hypothesis is true.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nUsing the histogram in Figure \\@ref(fig:mariokart-rand-dist), find the p-value and conclude the hypothesis test in the context of the problem (use words like price of the game and whether it is new).[^27-inf-model-applications-5]\n:::\n\n[^27-inf-model-applications-5]: The observed slope is 10.9 which is nowhere near the range of values for the permuted slopes (roughly -5 to 5).\n Because the observed slope is not a plausible value under the null distribution, the p-value is essentially zero.\n We reject the null hypothesis and claim that there is a relationship between whether the game is new (or not) and the average predicted price of the game.\n\n::: {.guidedpractice data-latex=\"\"}\nIs the conclusion based on the histogram of randomized slopes consistent with the conclusion obtained using the mathematical model?\nExplain.[^27-inf-model-applications-6]\n:::\n\n[^27-inf-model-applications-6]: The p-value in Table \\@ref(tab:mariokart-model-output) is also essentially zero, so the null hypothesis is also rejected when the mathematical model approach is taken.\n Often, the mathematical and computational approaches to inference will give quite similar answers.\n\nAlthough knowing there is a relationship between the condition of the game and its price, we might be more interested in the difference in price, here given by the slope of the linear regression line.\nThat is, $\\beta_1$ represents the population value for the difference in price between new *Mario Kart* games and used games.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![(ref:mariokart-boot-dist-cap)](27-inf-model-applications_files/figure-html/mariokart-boot-dist-1.png){width=90%}\n:::\n:::\n\n\n(ref:mariokart-boot-dist-cap) Estimated slopes from linear models (`price` regressed on `cond_new`) built on 1000 bootstrapped datasets. 
Each bootstrap dataset was a resample taken from the original Mario Kart auction data.\n\n::: {.workedexample data-latex=\"\"}\nFigure \\@ref(fig:mariokart-boot-dist) displays the slope estimates taken from bootstrap samples of the original data.\nUsing the histogram, estimate the standard error of the slope.\nIs your estimate similar to the value of the standard error of the slope provided in the output of the mathematical linear model?\n\n------------------------------------------------------------------------\n\nThe slopes seem to vary from approximately 8 to 14.\nUsing the empirical rule, we know that if a variable has a bell-shaped distribution, most of the observations will be with 2 standard errors of the center.\nTherefore, a rough approximation of the standard error is 1.5.\nThe standard error given in Table \\@ref(tab:mariokart-model-output) is 1.26 which is not too different from the value computed using the bootstrap approach.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nUse Figure \\@ref(fig:mariokart-boot-dist) to create a 90% standard error bootstrap confidence interval for the true slope.\nInterpret the interval in context.[^27-inf-model-applications-7]\n:::\n\n[^27-inf-model-applications-7]: Using the bootstrap SE method, we know the normal percentile is $z^\\star = 1.645$, which gives a CI of $b_1 \\pm 1.645 \\cdot SE \\rightarrow 10.9 \\pm 1.645 \\cdot 1.5 \\rightarrow (8.43, 13.37).$ For games that are new, the average price is higher by between \\$8.43 and \\$13.37, with 90% confidence.\n\n::: {.guidedpractice data-latex=\"\"}\nUse Figure \\@ref(fig:mariokart-boot-dist) to create a 90% bootstrap percentile confidence interval for the true slope.\nInterpret the interval in context.[^27-inf-model-applications-8]\n:::\n\n[^27-inf-model-applications-8]: Because there were 1000 bootstrap resamples, we look for the cutoffs which provide 50 bootstrap slopes on the left, 900 in the middle, and 50 on the right.\n Looking at the bootstrap histogram, the rough 95% confidence interval is \\$9 to \\$13.10.\n For games that are new, the average price is higher by between \\$9.00 and \\$13.10, with 90% confidence.\n\n### Cross-validation\n\nIn Chapter \\@ref(model-mlr), models were compared using $R^2_{adj}.$ In Chapter \\@ref(inf-model-mlr), however, a computational approach was introduced to compare models by removing chunks of data one at a time and assessing how well the variables predicted the observations that had been held out.\n\nFigure \\@ref(fig:mariokart-cv-residuals) was created by cross-validating models with the same variables as in Table \\@ref(tab:mariokart-model-output) and Table \\@ref(tab:mariokart-full-model-output).\nWe applied 3-fold cross-validation, so 1/3 of the data was removed while 2/3 of the observations were used to build each model (first `cond_new` only and then `cond_new`, `stock_photo`, `duration`, and `wheels`).\nNote that each time 1/3 of the data is removed, the resulting model will produce slightly different coefficients.\n\nThe points in Figure \\@ref(fig:mariokart-cv-residuals) represent the prediction (x-axis) and residual (y-axis) for each observation run through the cross-validated model.\nIn other words, the model is built (using the other 2/3) without the observation (which is in the 1/3) being used.\nThe residuals give us a sense for how well the model will do at predicting observations which were not a part of the original dataset (e.g., future studies).\n\n\n::: {.cell}\n::: 
{.cell-output-display}\n![(ref:mariokart-cv-residuals-cap)](27-inf-model-applications_files/figure-html/mariokart-cv-residuals-1.png){width=100%}\n:::\n:::\n\n\n(ref:mariokart-cv-residuals-cap) Cross-validation predictions and errors from linear models built on two different sets of variables. Left regressed `price` on `cond_new`; right regressed `price` on `cond_new`, `stock_photo`, `duration`, and `wheels`.\n\n::: {.guidedpractice data-latex=\"\"}\nIn the second graph in Figure \\@ref(fig:mariokart-cv-residuals), note the point at roughly (predicted = 50 and error = 10).\nEstimate the observed and predicted value for that observation.[^27-inf-model-applications-9]\n:::\n\n[^27-inf-model-applications-9]: The predicted value is roughly $\\widehat{\\texttt{price}} = \\$50.$ The observed value is roughly $\\texttt{price}_i = \\$60$ riders (using $e_i = y_i - \\hat{y}_i).$\n\n::: {.guidedpractice data-latex=\"\"}\nIn the second graph in Figure \\@ref(fig:mariokart-cv-residuals), for the same point at roughly (predicted = 50 and error = 10), describe which cross-validation fold(s) were used to build its prediction model.[^27-inf-model-applications-10]\n:::\n\n[^27-inf-model-applications-10]: The point appears to be in fold 2, so folds 1 and 3 were used to build the prediction model.\n\n::: {.guidedpractice data-latex=\"\"}\nBy noting the spread of the cross-validated prediction errors (on the y-axis) in Figure \\@ref(fig:mariokart-cv-residuals), which model should be chosen for a final report on these data?[^27-inf-model-applications-11]\n:::\n\n[^27-inf-model-applications-11]: The cross-validated residuals on `cond_new` only vary roughly from -15 to 15, while the cross-validated residuals on the four predictor model vary less, roughly from -10 to 10.\n Given the smaller residuals from the four predictor model, it seems as though the larger model is better.\n\n::: {.guidedpractice data-latex=\"\"}\nUsing the summary statistic cross-validation sum of squared errors (CV SSE), which model should be chosen for a final report on these data?[^27-inf-model-applications-12]\n:::\n\n[^27-inf-model-applications-12]: The CV SSE is smaller (by a factor of almost two!) 
for the model with four predictors.\n Using a single valued criterion (CV SSE) allows us to make a decision to choose the model with four predictors.\n\n\\clearpage\n\n## Interactive R tutorials\n\nNavigate the concepts you've learned in this chapter in R using the following self-paced tutorials.\nAll you need is your browser to get started!\n\n::: {.alltutorials data-latex=\"\"}\n[Tutorial 6: Inferential modeling](https://openintrostat.github.io/ims-tutorials/06-model-infer/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintrostat.github.io/ims-tutorials/06-model-infer\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 6 - Lesson 1: Inference in regression](https://openintro.shinyapps.io/ims-06-model-infer-01/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-06-model-infer-01\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 6 - Lesson 2: Randomization test for slope](https://openintro.shinyapps.io/ims-06-model-infer-02/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-06-model-infer-02\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 6 - Lesson 3: t-test for slope](https://openintro.shinyapps.io/ims-06-model-infer-03/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-06-model-infer-03\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 6 - Lesson 4: Checking technical conditions for slope inference](https://openintro.shinyapps.io/ims-06-model-infer-04/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-06-model-infer-04\n:::\n\n:::\n\n::: {.singletutorial data-latex=\"\"}\n[Tutorial 6 - Lesson 5: Inference beyond the simple linear regression model](https://openintro.shinyapps.io/ims-06-model-infer-05/)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://openintro.shinyapps.io/ims-06-model-infer-05\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of tutorials supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).\n:::\n\n## R labs\n\nFurther apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.\n\n::: {.singlelab data-latex=\"\"}\n[Multiple linear regression - Grading the professor](https://www.openintro.org/go?id=ims-r-lab-model-infer)\\\n::: {.content-hidden unless-format=\"pdf\"}\nhttps://www.openintro.org/go?id=ims-r-lab-model-infer\n:::\n\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nYou can also access the full list of labs supporting this book at\\\n.\n:::\n\n::: {.content-visible when-format=\"html\"}\nYou can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).\n:::\n", + "supporting": [ + "27-inf-model-applications_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/27-inf-model-applications/figure-html/mariokart-boot-dist-1.png b/_freeze/27-inf-model-applications/figure-html/mariokart-boot-dist-1.png new file mode 100644 index 00000000..1db4d439 Binary files /dev/null and b/_freeze/27-inf-model-applications/figure-html/mariokart-boot-dist-1.png differ diff 
--git a/_freeze/27-inf-model-applications/figure-html/mariokart-cv-residuals-1.png b/_freeze/27-inf-model-applications/figure-html/mariokart-cv-residuals-1.png new file mode 100644 index 00000000..a8d304f5 Binary files /dev/null and b/_freeze/27-inf-model-applications/figure-html/mariokart-cv-residuals-1.png differ diff --git a/_freeze/27-inf-model-applications/figure-html/mariokart-rand-dist-1.png b/_freeze/27-inf-model-applications/figure-html/mariokart-rand-dist-1.png new file mode 100644 index 00000000..d843f6cf Binary files /dev/null and b/_freeze/27-inf-model-applications/figure-html/mariokart-rand-dist-1.png differ diff --git a/_freeze/exercise-solutions/execute-results/html.json b/_freeze/exercise-solutions/execute-results/html.json new file mode 100644 index 00000000..43579de1 --- /dev/null +++ b/_freeze/exercise-solutions/execute-results/html.json @@ -0,0 +1,20 @@ +{ + "hash": "94906b5f562df036840d2a4fb8857261", + "result": { + "markdown": "---\ntoc: true\n---\n\n\n# Exercise solutions {#sec-exercise-solutions}\n\n\n\n\n\n## Chapter 1 {#sec-exercise-solutions-01 .unlisted}\n\n::: exercises-solution\n1. 23 observations and 7 variables.\n\\addtocounter{enumi}{1}\n\n1. \\(a) \"Is there an association between air pollution exposure and preterm births?\" (b) 143,196 births in Southern California between 1989 and 1993. (c) Measurements of carbon monoxide, nitrogen dioxide, ozone, and particulate matter less than 10$\\mu g/m^3$ (PM$_{10}$) collected at air-quality-monitoring stations as well as length of gestation. Continuous numerical variables.\n\\addtocounter{enumi}{1}\n\n1. \\(a) \"What is the effect of gamification on learning outcomes compared to traditional teaching methods?\" (b) 365 college students taking a statistics course (c) Gender (categorical), level of studies (categorical, ordinal), academic major (categorical), expertise in English language (categorical, ordinal), use of personal computers and games (categorical, ordinal), treatment group (categorical), score (numerical, discrete).\n\\addtocounter{enumi}{1}\n\n1. \\(a) Treatment: $10/43 = 0.23 \\rightarrow 23\\%$. (b) Control: $2/46 = 0.04 \\rightarrow 4\\%$. (c) A higher percentage of patients in the treatment group were pain free 24 hours after receiving acupuncture. (d) It is possible that the observed difference between the two group percentages is due to chance. (e) Explanatory: acupuncture or not. Response: if the patient was pain free or not.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Experiment; researchers are evaluating the effect of fines on parents' behavior related to picking up their children late from daycare. (b) 10 cases: the daycare centers. (c) Number of late pickups (discrete numerical). (d) Week (numerical, discrete), group (categorical, nominal), number of late pickups (numerical discrete), and study period (categorical, ordinal).\n\\addtocounter{enumi}{1}\n\n1. \\(a) 344 cases (penguins) are included in the data. (b) There are 4 numerical variables in the data: bill length, bill depth, and flipper length (measured in millimeters) and body mass (measured in grams). They are all continuous. (c) There are 3 categorical variables in the data: species (Adelie, Chinstrap, Gentoo), island (Torgersen, Biscoe, and Dream), and sex (female and male).\n\\addtocounter{enumi}{1}\n\n1. \\(a) Airport ownership status (public/private), airport usage status (public/private), region (Central, Eastern, Great Lakes, New England, Northwest Mountain, Southern, Southwest, Western Pacific), latitude, and longitude. 
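Solutions like the ones in this chapter repeatedly classify each variable as numerical (discrete or continuous) or categorical (ordinal or not). As an aside, a quick way to check how R is storing each column is `str()`; the tiny data frame below is a hypothetical illustration, not one of the book's datasets.

```r
# Hypothetical mini data frame illustrating common variable types
airports <- data.frame(
  ownership = factor(c("public", "private", "public")),  # categorical, not ordinal
  latitude  = c(33.6, 61.6, 44.5),                        # numerical, continuous
  runways   = c(2L, 1L, 3L)                               # numerical, discrete
)

str(airports)  # shows the type R assigned to each column
```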
(b) Airport ownership status: categorical, not ordinal. Airport usage status: categorical, not ordinal. Region: categorical, not ordinal. Latitude: numerical, continuous. Longitude: numerical, continuous.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Year, number of baby girls named Fiona born in that year, nation. (b) Year (numerical, discrete), number of baby girls named Fiona born in that year (numerical, discrete), nation (categorical, nominal).\n\\addtocounter{enumi}{1}\n\n1. \\(a) County, state, driver's race, whether the car was searched or not, and whether the driver was arrested or not. (b) All categorical, non-ordinal. (c) Response: whether the car was searched or not. Explanatory: race of the driver.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Observational study. (b) Dog: Lucy. Cat: Luna. (c) Oliver and Lily. (d) Positive, as the popularity of a name for dogs increases, so does the popularity of that name for cats.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 2 {#sec-exercise-solutions-02 .unlisted}\n\n\\vspace{-2mm}\n\n::: exercises-solution\n1. \\(a) Population mean, $\\mu_{2007} = 52$; sample mean, $\\bar{x}_{2008} = 58$. (b) Population mean, $\\mu_{2001} = 3.37$; sample mean, $\\bar{x}_{2012} = 3.59$.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Population: all births, sample: 143,196 births between 1989 and 1993 in Southern California. (b) If births in this time span at the geography can be considered to be representative of all births, then the results are generalizable to the population of Southern California. However, since the study is observational the findings cannot be used to establish causal relationships.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The population of interest is all college students studying statistics. The sample consists of 365 such students. (b) If the students in this sample, who are likely not randomly sampled, can be considered to be representative of all college students studying statistics, then the results are generalizable to the population defined above. This is probably not a reasonable assumption since these students are from two specific majors only. Additionally, since the study is experimental, the findings can be used to establish causal relationships.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Observation. (b) Variable. (c) Sample statistic (mean).\n (d) Population parameter (mean).\n\\addtocounter{enumi}{1}\n\n1. \\(a) Observational. (b) Use stratified sampling to randomly sample a fixed number of students, say 10, from each section for a total sample size of 40 students.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Positive, non-linear, somewhat strong. Countries in which a higher percentage of the population have access to the internet also tend to have higher average life expectancies, however rise in life expectancy trails off before around 80 years old. (b) Observational. (c) Wealth: countries with individuals who can widely afford the internet can probably also afford basic medical care. (Note: Answers may vary.)\n\\addtocounter{enumi}{1}\n\n1. \\(a) Simple random sampling is okay. In fact, it's rare for simple random sampling to not be a reasonable sampling method! (b) The student opinions may vary by field of study, so the stratifying by this variable makes sense and would be reasonable. (c) Students of similar ages are probably going to have more similar opinions, and we want clusters to be diverse with respect to the outcome of interest, so this would **not** be a good approach. 
(Additional thought: the clusters in this case may also have very different numbers of people, which can also create unexpected sample sizes.)\n\\addtocounter{enumi}{1}\n\n1. \\(a) The cases are 200 randomly sampled men and women. (b) The response variable is attitude towards a fictional microwave oven. (c) The explanatory variable is dispositional attitude. (d) Yes, the cases are sampled randomly, recruited online using Amazon's Mechanical Turk. (e) This is an observational study since there is no random assignment to treatments. (f) No, we cannot establish a causal link between the explanatory and response variables since the study is observational. (g) Yes, the results of the study can be generalized to the population at large since the sample is random.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Simple random sample. Non-response bias: if only those people who have strong opinions about the survey respond, the sample may not be representative of the population. (b) Convenience sample. Undercoverage bias: their sample may not be representative of the population since it consists only of their friends. It is also possible that the study will have non-response bias if some choose to not bring back the survey. (c) Convenience sample. This will have similar issues to handing out surveys to friends. (d) Multi-stage sampling. If the classes are similar to each other with respect to student composition, this approach should not introduce bias, other than potential non-response bias.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Exam performance. (b) Light level: fluorescent overhead lighting, yellow overhead lighting, no overhead lighting (only desk lamps). (c) Wearing glasses or not.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Experiment. (b) Light level (overhead lighting, yellow overhead lighting, no overhead lighting) and noise level (no noise, construction noise, and human chatter noise). (c) Since the researchers want to ensure equal representation of those wearing glasses and not wearing glasses, wearing glasses is a blocking variable.\n\\addtocounter{enumi}{1}\n\n1. Need randomization and blinding. One possible outline: (1) Prepare two cups for each participant, one containing regular Coke and the other containing Diet Coke. Make sure the cups are identical and contain equal amounts of soda. Label the cups A (regular) and B (diet). (Be sure to randomize A and B for each trial!) (2) Give each participant the two cups, one cup at a time, in random order, and ask the participant to record a value that indicates how much she liked the beverage. Be sure that neither the participant nor the person handing out the cups knows the identity of the beverage to make this a double-blind experiment. (Answers may vary.)\n\\addtocounter{enumi}{1}\n\n1. \\(a) Experiment. (b) Treatment: 25 grams of chia seeds twice a day, control: placebo. (c) Yes, gender. (d) Yes, single blind since the patients were blinded to the treatment they received. (e) Since this is an experiment, we can make a causal statement. However, since the sample is not random, the causal statement cannot be generalized to the population at large.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Non-responders may have a different response to this question, e.g., parents who returned the surveys likely do not have difficulty spending time with their children. (b) It is unlikely that the women who were reached at the same address 3 years later are a random sample. 
These missing responders are probably renters (as opposed to homeowners) which means that they might have a lower socio-economic status than the respondents. (c) There is no control group in this study, this is an observational study, and there may be confounding variables, e.g., these people may go running because they are generally healthier and/or do other exercises.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Randomized controlled experiment. (b) Explanatory: treatment group (categorical, with 3 levels). Response variable: Psychological well-being. (c) No, because the participants were volunteers. (d) Yes, because it was an experiment. (e) The statement should say \"evidence\" instead of \"proof\".\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 3 {#sec-exercise-solutions-03 .unlisted}\n\nApplication chapter, no exercises.\n\n## Chapter 4 {#sec-exercise-solutions-04 .unlisted}\n\n::: exercises-solution\n1. \\(a) We see the order of the categories and the relative frequencies in the bar plot. (b) There are no features that are apparent in the pie chart but not in the bar plot. (c) We usually prefer to use a bar plot as we can also see the relative frequencies of the categories in this graph.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The horizontal locations at which the age groups break into the various opinion levels differ, which indicates that likelihood of supporting protests varies by age group. Two variables may be associated. (b) Answers may vary. Political ideology/leaning and education level.\n\\addtocounter{enumi}{1}\n\n1. (a) Number of participants in each group. (b) Proportion of survival. (c) The standardized bar plot should be displayed as a way to visualize the survival improvement in the treatment versus the control group.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 5 {#sec-exercise-solutions-05 .unlisted}\n\n::: exercises-solution\n1. \\(a) Positive association: mammals with longer gestation periods tend to live longer as well. (b) Association would still be positive. (c) No, they are not independent. See part (a).\n\\addtocounter{enumi}{1}\n\n1. The graph below shows a ramp up period. There may also be a period of exponential growth at the start before the size of the petri dish becomes a factor in slowing growth.\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](exercise-solutions_files/figure-html/unnamed-chunk-2-1.png){width=30%}\n :::\n :::\n\\addtocounter{enumi}{1}\n\n1. \\(a) Decrease: the new score is smaller than the mean of the 24 previous scores. (b) Calculate a weighted mean. Use a weight of 24 for the old mean and 1 for the new mean: $(24\\times 74 + 1\\times64)/(24+1) = 73.6$. (c) The new score is more than 1 standard deviation away from the previous mean, so increase.\n\\addtocounter{enumi}{1}\n\n1. Any 10 employees whose average number of days off is between the minimum and the mean number of days off for the entire workforce at this plant.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Dist B has a higher mean since $20 > 13$, and a higher standard deviation since 20 is further from the rest of the data than 13. (b) Dist A has a higher mean since $-20 > -40$, and Dist B has a higher standard deviation since -40 is farther away from the rest of the data than -20. (c) Dist B has a higher mean since all values in this Dist Are higher than those in Dist A, but both distribution have the same standard deviation since they are equally variable around their respective means. 
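The weighted mean used in the exam-score solution above is just the old total plus the new score, divided by the new count. A minimal base R sketch of that arithmetic, using the numbers quoted in the solution:

```r
# Updating a class average after one more exam score arrives
old_mean  <- 74   # mean of the first 24 scores
new_score <- 64   # the 25th score
n_old     <- 24

(n_old * old_mean + new_score) / (n_old + 1)
#> [1] 73.6
```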
(d) Both distributions have the same mean since they're both centered at 300, but Dist B has a higher standard deviation since the observations are farther from the mean than in Dist A.\n\\addtocounter{enumi}{1}\n\n1. \\(a) About 30. (b) Since the distribution is right skewed, the mean is higher than the median. (c) Q1: between 15 and 20, Q3: between 35 and 40, IQR: about 20. (d) Values that are considered to be unusually low or high lie more than 1.5$\\times$IQR away from the quartiles. Upper fence: Q3 + 1.5 $\\times$ IQR = $37.5 + 1.5 \\times 20 = 67.5$; Lower fence: Q1 - 1.5 $\\times$ IQR = $17.5 - 1.5 \\times 20 = -12.5$. The lowest AQI recorded is not lower than 5 and the highest AQI recorded is not higher than 65, which are both within the fences. Therefore none of the days in this sample would be considered to have an unusually low or high AQI.\n\\addtocounter{enumi}{1}\n\n1. The histogram shows that the distribution is bimodal, which is not apparent in the box plot. The box plot makes it easy to identify more precise values of observations outside of the whiskers.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Right skewed, there is a natural boundary at 0 and only a few people have many pets. Center: median, variability: IQR. (b) Right skewed, there is a natural boundary at 0 and only a few people live a very long distance from work. Center: median, variability: IQR. (c) Symmetric. Center: mean, variability: standard deviation. (d) Left skewed. Center: median, variability: IQR. (e) Left skewed. Center: median, variability: IQR.\n\\addtocounter{enumi}{1}\n\n1. No, we would expect this distribution to be right skewed. There are two reasons for this: there is a natural boundary at 0 (it is not possible to watch less than 0 hours of TV) and the standard deviation of the distribution is very large compared to the mean.\n\\addtocounter{enumi}{1}\n\n1. No, the outliers are likely the maximum and the minimum of the distribution so a statistic based on these values cannot be robust to outliers.\n\\addtocounter{enumi}{1}\n\n1. The 75th percentile is 82.5, so 5 students will get an A. Also, by definition 25% of students will be above the 75th percentile.\n\\addtocounter{enumi}{1}\n\n1. \\(a) If $\\frac{\\bar{x}}{median} = 1$, then $\\bar{x} = median$. This is most likely to be the case for symmetric distributions. (b) If $\\frac{\\bar{x}}{median} < 1$, then $\\bar{x} < median$. This is most likely to be the case for left skewed distributions, since the mean is affected (and pulled down) by the lower values more so than the median. (c) If $\\frac{\\bar{x}}{median} > 1$, then $\\bar{x} > median$. This is most likely to be the case for right skewed distributions, since the mean is affected (and pulled up) by the higher values more so than the median.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The distribution of percentage of population that is Hispanic is extremely right skewed with the majority of counties having less than 10% Hispanic residents. However, there are a few counties that have more than 90% Hispanic population. It might be preferable, in certain analyses, to use the log-transformed values since this distribution is much less skewed. (b) The map reveals that counties with higher proportions of Hispanic residents are clustered along the Southwest border, all of New Mexico, a large swath of Southwest Texas, the bottom two-thirds of California, and in Southern Florida. 
In the map all counties with more than 40% of Hispanic residents are indicated by the darker shading, so it is impossible to discern how high Hispanic percentages go. The histogram reveals that there are counties with over 90% Hispanic residents. The histogram is also useful for estimating measures of center and spread. (c) Both visualizations are useful, but if we could only examine one, we should examine the map since it explicitly ties geographic data to each county's percentage.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 6 {#sec-exercise-solutions-06 .unlisted}\n\nApplication chapter, no exercises.\n\n## Chapter 7 {#sec-exercise-solutions-07 .unlisted}\n\n::: exercises-solution\n1. \\(a) The residual plot will show randomly distributed residuals around 0. The variance is also approximately constant. (b) The residuals will show a fan shape, with higher variability for smaller $x$. There will also be many points on the right above the line. There is trouble with the model being fit here.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Strong relationship, but a straight line would not fit the data. (b) Strong relationship, and a linear fit would be reasonable. (c) Weak relationship, and trying a linear fit would be reasonable. (d) Moderate relationship, but a straight line would not fit the data. (e) Strong relationship, and a linear fit would be reasonable. (f) Weak relationship, and trying a linear fit would be reasonable.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Exam 2 since there is less of a scatter in the plot of course grade versus exam 2. Notice that the relationship between Exam 1 and the course grade appears to be slightly nonlinear. (b) (Answers may vary.) If Exam 2 is cumulative it might be a better indicator of how a student is doing in the class.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $r = -0.7$ $\\rightarrow$ (4). (b) $r = 0.45$ $\\rightarrow$ (3). (c) $r = 0.06$ $\\rightarrow$ (1). (d) $r = 0.92$ $\\rightarrow$ (2).\n\\addtocounter{enumi}{1}\n\n1. \\(a) There is a moderate, positive, and linear relationship between shoulder girth and height. (b) Changing the units, even if just for one of the variables, will not change the form, direction or strength of the relationship between the two variables.\n\\addtocounter{enumi}{1}\n\n1. \\(a) There is a somewhat weak, positive, possibly linear relationship between the distance traveled and travel time. There is clustering near the lower left corner that we should take special note of. (b) Changing the units will not change the form, direction or strength of the relationship between the two variables. If longer distances measured in miles are associated with longer travel time measured in minutes, longer distances measured in kilometers will be associated with longer travel time measured in hours. (c) Changing units does not affect correlation: $r = 0.636$.\n\\addtocounter{enumi}{1}\n\n1. In each part, we can write the age of one partner as a linear function of the other. (a) $age_{P1} = age_{P2} + 3$. (b) $age_{P1} = age_{P2} - 2$. (c) $age_{P1} = 2 \\times age_{P2}$. Since the slopes are positive and these are perfect linear relationships, the correlation will be exactly 1 in all three parts. An alternative way to gain insight into this solution is to create a mock dataset, e.g., 5 women aged 26, 27, 28, 29, and 30, then find the husband ages for each wife in each part and create a scatterplot.\n\\addtocounter{enumi}{1}\n\n1. Correlation: no units. Intercept: cal. Slope: cal/cm.\n\\addtocounter{enumi}{1}\n\n1. Over-estimate. 
Since the residual is calculated as $observed - predicted$, a negative residual means that the predicted value is higher than the observed value.\n\\addtocounter{enumi}{1}\n\n1. \\(a) There is a positive, moderate, linear association between number of calories and amount of protein. In addition, the amount of protein is more variable for menu items with higher calories, indicating non-constant variance. There also appear to be two clusters of data: a patch of about a dozen observations in the lower left and a larger patch on the right side. (b) Explanatory: number of calories. Response: amount of protein (in grams). (c) With a regression line, we can predict the amount of protein for a given number of calories. This may be useful if only calorie counts for the food items are posted but the amount of protein in each food item is not readily available. (d) Food menu items with higher predicted protein are predicted with higher variability than those without, suggesting that the model is doing a better job predicting protein amount for food menu items with lower predicted proteins.\n\\addtocounter{enumi}{1}\n\n1. \\(a) First calculate the slope: $b_1 = R\\times s_y/s_x = 0.636 \\times 113 / 99 = 0.726$. Next, make use of the fact that the regression line passes through the point $(\\bar{x},\\bar{y})$: $\\bar{y} = b_0 + b_1 \\times \\bar{x}$. Plug in $\\bar{x}$, $\\bar{y}$, and $b_1$, and solve for $b_0$: 51. Solution: $\\widehat{travel~time} = 51 + 0.726 \\times distance$. (b) $b_1$: For each additional mile in distance, the model predicts an additional 0.726 minutes in travel time. $b_0$: When the distance travelled is 0 miles, the travel time is expected to be 51 minutes. It does not make sense to have a travel distance of 0 miles in this context. Here, the $y$-intercept serves only to adjust the height of the line and is meaningless by itself. (c) $R^2 = 0.636^2 = 0.40$. About 40% of the variability in travel time is accounted for by the model, i.e., explained by the distance travelled. (d) $\\widehat{travel~time} = 51 + 0.726 \\times distance = 51 + 0.726 \\times 103 \\approx 126$ minutes. (Note: we should be cautious in our predictions with this model since we have not yet evaluated whether it is a well-fit model.) (e) $e_i = y_i - \\hat{y}_i = 168 - 126 = 42$ minutes. A positive residual means that the model underestimates the travel time. (f) No, this calculation would require extrapolation.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $\\widehat{\\texttt{poverty}} = 4.60 + 2.05 \\times \\texttt{unemployment_rate}.$ (b) The model predicts a poverty rate of 4.60\\% for counties with 0\\% unemployment, on average. This is not a meaningful value as no counties have such low unexmployment, it just serves to adjust the height of the regression line. (c) For each additional percentage increase in unemployment rate, poverty rate is predicted to be higher, on average, by 2.05\\%. (d) Unemployment rate explains 46\\% of the variability in poverty levels in US counties. (e) $\\sqrt{0.46} = 0.678.$\n\\addtocounter{enumi}{1}\n\n1. \\(a) There is an outlier in the bottom right. Since it is far from the center of the data, it is a point with high leverage. It is also an influential point since, without that observation, the regression line would have a very different slope. (b) There is an outlier in the bottom right. Since it is far from the center of the data, it is a point with high leverage. However, it does not appear to be affecting the line much, so it is not an influential point. 
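The travel-time solution above rebuilds the least squares line from summary statistics alone, using $b_1 = r \, s_y / s_x$ and the fact that the line passes through $(\bar{x}, \bar{y})$. A short base R sketch of that arithmetic, using only the values quoted in the solution (the intercept of 51 is taken as given rather than recomputed from the sample means):

```r
# Least squares slope from summary statistics (values from the solution above)
r   <- 0.636   # correlation between distance and travel time
s_x <- 99      # standard deviation of distance (miles)
s_y <- 113     # standard deviation of travel time (minutes)

b1 <- r * s_y / s_x   # about 0.726 minutes per mile
b0 <- 51              # intercept reported in the solution (b0 = y_bar - b1 * x_bar)

y_hat <- b0 + b1 * 103   # predicted travel time for a 103 mile trip, about 126 minutes
168 - y_hat              # residual for the observed 168 minute trip, about 42 minutes
r^2                      # about 0.40 of the variability in travel time explained
```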
(c) The observation is in the center of the data (in the x-axis direction), so this point does *not* have high leverage. This means the point won't have much effect on the slope of the line and so is not an influential point.\n\\addtocounter{enumi}{1}\n\n1. \\(a) There is a negative, moderate-to-strong, somewhat linear relationship between percent of families who own their home and the percent of the population living in urban areas in 2010. There is one outlier: a state where 100% of the population is urban. The variability in the percent of homeownership also increases as we move from left to right in the plot. (b) The outlier is located in the bottom right corner, horizontally far from the center of the other points, so it is a point with high leverage. It is an influential point since excluding this point from the analysis would greatly affect the slope of the regression line.\n\\addtocounter{enumi}{1}\n\n1. \\(a) True. (b) False, correlation is a measure of the linear association between any two numerical variables.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $r = 0.7 \\to (1)$ (b) $r = 0.09 \\to (4)$ (c) $r = -0.91 \\to (2)$ (d) $r = 0.96 \\to (3)$.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 8 {#sec-exercise-solutions-08 .unlisted}\n\n::: exercises-solution\n1. Annika is right. All variables being highly correlated, including the predictor variables being highly correlated with each other, is not desirable as this would result in multicollinearity.\n\\addtocounter{enumi}{1}\n\n1. No, they shouldn't include all variables as `days_since_start` and `days_since_race` are perfectly correlated with each other. They should only include one of them.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $\\widehat{\\texttt{weight}} = 7.270 - 0.593 \\times \\texttt{habit}_\\texttt{smoker}$. (b) The estimated body weight of babies born to smoking mothers is 0.593 pounds lower than those who are born to non-smoking mothers. Smoker: $\\widehat{\\texttt{weight}} = 7.270 - 0.593 \\times 1 = 6.68$ pounds. Non-smoker: $\\widehat{\\texttt{weight}} = 7.270 - 0.593 \\times 0 = 7.270$ pounds.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Horror movies. (b) Not necessarily, the change in adjusted $R^2$ is quite small.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $\\widehat{\\texttt{weight}} = -3.82 + 0.26 \\times \\texttt{weeks} + 0.02 \\times \\texttt{mage} + 0.37 \\times \\texttt{sex_male} + 0.02 \\times \\texttt{visits} - 0.43 \\times \\texttt{habit}\\texttt{_smoker}.$ (b) $b_{\\texttt{weeks}}$: The model predicts a 0.26 pound increase in the birth weight of the baby for each additional week in length of pregnancy, all else held constant. $b_{\\texttt{habit}_\\texttt{smoker}}$: The model predicts a 0.43 pound decrease in the birth weight of the babies born to smoker mothers compared to non-smokers, all else held constant. (c) Habit might be correlated with one of the other variables in the model, which introduces multicollinearity and complicates model estimation. (d) 7.13~lbs.\n\\addtocounter{enumi}{1}\n\n1. Remove `gained`.\n\\addtocounter{enumi}{1}\n\n1. Add `weeks`.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 9 {#sec-exercise-solutions-09 .unlisted}\n\n::: exercises-solution\n1. (a) False. The line is fit to predict the probability of success, not the binary outcome. (b) False. Residuals are not used in logistic regression like they are in linear regression because the observed value is always either zero or one (and the predicted value is a probability). 
The goal of the logistic regression is not to get a perfect prediction (of zero or one), so minimizing residuals is not part of the modeling process. (c) True.\n\\addtocounter{enumi}{1}\n\n1. \\(a) There are a few potential outliers, e.g., on the left in the variable, but nothing that will be of serious concern in a dataset this large. (b) When coefficient estimates are sensitive to which variables are included in the model, this typically indicates that some variables are collinear. For example, a possum's gender may be related to its head length, which would explain why the coefficient (and p-value) changed when we removed the variable. Likewise, a possum's skull width is likely to be related to its head length, probably even much more closely related than the head length was to gender.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The logistic model relating $\\hat{p}$ to the predictors may be written as $\\log\\left( \\frac{\\hat{p}}{1 - \\hat{p}} \\right) = 33.5095 - 1.4207\\times \\texttt{sex}_{\\texttt{male}} - 0.2787 \\times \\texttt{skull_w} + 0.5687 \\times \\texttt{total_l} - 1.8057 \\times \\texttt{tail_l}$. Only `total_l` has a positive association with a possum being from Victoria. (b) $\\hat{p} = 0.0062$. While the probability is very near zero, we have not run diagnostics on the model. We might also be a little skeptical that the model will remain accurate for a possum found in a US zoo. For example, perhaps the zoo selected a possum with specific characteristics but only looked in one region. On the other hand, it is encouraging that the possum was caught in the wild. (Answers regarding the reliability of the model probability will vary.)\n\\addtocounter{enumi}{1}\n\n1. \\(a) The variable `exclaim_subj` should be removed, since it's removal reduces AIC the most (and the resulting model has lower AIC than the None Dropped model). (b) The variable `cc` should be removed. (c) Removing any variable will increase AIC, so we should not remove any variables from this set.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 10 {#sec-exercise-solutions-10 .unlisted}\n\nApplication chapter, no exercises.\n\n## Chapter 11 {#sec-exercise-solutions-11 .unlisted}\n\n::: exercises-solution\n1. \\(a) Mean. Each student reports a numerical value: a number of hours. (b) Mean. Each student reports a number, which is a percentage, and we can average over these percentages. (c) Proportion. Each student reports Yes or No, so this is a categorical variable and we use a proportion. (d) Mean. Each student reports a number, which is a percentage like in part (b). (e) Proportion. Each student reports whether s/he expects to get a job, so this is a categorical variable and we use a proportion.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Alternative. (b) Null. (c) Alternative. (d) Alternative. (e) Null. (f) Alternative. (g) Null.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0: \\mu = 8$ (On average, New Yorkers sleep 8 hours a night.) $H_A: \\mu < 8$ (On average, New Yorkers sleep less than 8 hours a night.) (b) $H_0: \\mu = 15$ (The average amount of company time each employee spends not working is 15 minutes for March Madness.) $H_A: \\mu > 15$ (The average amount of company time each employee spends not working is greater than 15 minutes for March Madness.)\n\\addtocounter{enumi}{1}\n\n1. \\(a) (i) False. Instead, of comparing counts, we should compare percentages of people in each group who suffered cardiovascular problems. (ii) True. (iii) False. Association does not imply causation. 
We cannot infer a causal relationship based on an observational study. The difference from part (ii) is subtle. (iv) True. (b) Proportion of all patients who had cardiovascular problems: $\\frac{7,979}{227,571} \\approx 0.035$ (c) The expected number of heart attacks in the Rosiglitazone group, if having cardiovascular problems and treatment were independent, can be calculated as the number of patients in that group multiplied by the overall cardiovascular problem rate in the study: $67,593 * \\frac{7,979}{227,571} \\approx 2370$. (d) (i) $H_0$: The treatment and cardiovascular problems are independent. They have no relationship, and the difference in incidence rates between the Rosiglitazone and Pioglitazone groups is due to chance. $H_A$: The treatment and cardiovascular problems are not independent. The difference in the incidence rates between the Rosiglitazone and Pioglitazone groups is not due to chance and Rosiglitazone is associated with an increased risk of serious cardiovascular problems. (ii) A higher number of patients with cardiovascular problems than expected under the assumption of independence would provide support for the alternative hypothesis as this would suggest that Rosiglitazone increases the risk of such problems. (iii) In the actual study, we observed 2,593 cardiovascular events in the Rosiglitazone group. In the 100 simulations under the independence model, the simulated differences were never so high, which suggests that the actual results did not come from the independence model. That is, the variables do not appear to be independent, and we reject the independence model in favor of the alternative. The study's results provide convincing evidence that Rosiglitazone is associated with an increased risk of cardiovascular problems.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 12 {#sec-exercise-solutions-12 .unlisted}\n\n::: exercises-solution\n1. \\(a) The statistic is the sample proportion (0.289); the parameter is the population proportion (unknown). (b) $\\hat{p}$ and $p$. (c) Bootstrap sample proportion. (d) 0.289. (e) Roughly (0.22, 0.35). (f) We can be 95% confident that between 22% and 35% of all YouTube videos take place outdoors.\n\\addtocounter{enumi}{1}\n\n1. With 98% confidence, the true proportion of all US adult Twitter users (in 2013) who get at least some of the news from Twitter is between 0.48 and 0.56.\n\\addtocounter{enumi}{1}\n\n1. \\(a) A or perhaps D. (b) A, B, C, or D. (c) B or C. (d) B. (e) None.\n\\addtocounter{enumi}{1}\n\n1. \\(a) This claim is reasonable, since the entire interval lies above 50%. (b) The value of 70% lies outside of the interval, so we have convincing evidence that the researcher's conjecture is wrong. (c) A 90% confidence interval will be narrower than a 95% confidence interval. Even without calculating the interval, we can tell that 70% would not fall in the interval, and we would reject the researcher's conjecture based on a 90% confidence level as well.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 13 {#sec-exercise-solutions-13 .unlisted}\n\n::: exercises-solution\n1. \\(a) Recall that the general formula is $point~estimate \\pm z^{\\star} \\times SE$. First, identify the three different values. The point estimate is 45%, $z^{\\star} = 1.96$ for a 95% confidence level, and $SE = 1.2\\%$. 
Then, plug the values into the formula: $45\\% \\pm 1.96 \\times 1.2\\% \\quad\\to\\quad (42.6\\%, 47.4\\%)$ We are 95% confident that the proportion of US adults who live with one or more chronic conditions is between 42.6% and 47.4%. (b) (i) False. Confidence intervals provide a range of plausible values, and sometimes the truth is missed. A 95% confidence interval \"misses\" about 5% of the time. (ii) True. Notice that the description focuses on the true population value. (iii) True. If we examine the 95% confidence interval, we can see that 50% is not included in this interval. This means that in a hypothesis test, we would reject the null hypothesis that the proportion is 0.5. (iv) False. The standard error describes the uncertainty in the overall estimate from natural fluctuations due to randomness, not the uncertainty corresponding to individuals' responses.\n\\addtocounter{enumi}{1}\n\n1. A Z score of 0.47 denotes that the sample proportion is 0.47 standard errors greater than the hypothesized value of the population proportion. \n\\addtocounter{enumi}{1}\n\n1. \\(a) Sampling distribution. (b) To know whether the distribution is skewed, we need to know the proportion. We've been told the proportion is likely above 5% and below 30%, and the success-failure condition would be satisfied for any of these values. If the population proportion is in this range, the sampling distribution will be symmetric. (c) Standard error. (d) The distribution will tend to be more variable when we have fewer observations per sample.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 14 {#sec-exercise-solutions-14 .unlisted}\n\n::: exercises-solution\n1. \\(a) $H_0$: Anti-depressants do not affect the symptoms of Fibromyalgia. $H_A$: Anti-depressants do affect the symptoms of Fibromyalgia (either helping or harming). (b) Concluding that anti-depressants either help or worsen Fibromyalgia symptoms when they actually do neither. (c) Concluding that anti-depressants do not affect Fibromyalgia symptoms when they actually do.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0$: The restaurant meets food safety and sanitation regulations. $H_A$: The restaurant does not meet food safety and sanitation regulations. (b) The food safety inspector concludes that the restaurant does not meet food safety and sanitation regulations and shuts down the restaurant when the restaurant is actually safe. (c) The food safety inspector concludes that the restaurant meets food safety and sanitation regulations and the restaurant stays open when the restaurant is actually not safe. (d) A Type 1 Error may be more problematic for the restaurant owner since his restaurant gets shut down even though it meets the food safety and sanitation regulations. (e) A Type 2 Error may be more problematic for diners since the restaurant deemed safe by the inspector is actually not. (f) Strong evidence. Diners would rather a restaurant that meet the regulations get shut down than a restaurant that does not meet the regulations not get shut down.\n\\addtocounter{enumi}{1}\n\n1. The hypotheses should be about the population proportion ($p$), not the sample proportion. The null hypothesis should have an equal sign. The alternative hypothesis should have a not-equals sign, and it should reference the null value, $p_0 = 0.6$, not the observed sample proportion. 
The correct way to set up these hypotheses is: $H_0: p = 0.6$ and $H_A: p \\neq 0.6$.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 15 {#sec-exercise-solutions-15 .unlisted}\n\nApplication chapter, no exercises.\n\n## Chapter 16 {#sec-exercise-solutions-16 .unlisted}\n\n::: exercises-solution\n1. First, the hypotheses should be about the population proportion ($p$), not the sample proportion. Second, the null value should be what we are testing (0.25), not the observed value (0.29). The correct way to set up these hypotheses is: $H_0: p = 0.25$ and $H_A: p > 0.25.$\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0 : p = 0.20,$ $H_A : p > 0.20.$ (b) $\\hat{p} = 159/650 = 0.245.$ (c) Answers will vary. Each student can be represented with a card. Take 100 cards, 20 black cards representing those who support proposals to defund police departments and 80 red cards representing those who do not. Shuffle the cards and draw with replacement (shuffling each time in between draws) 650 cards representing the 650 respondents to the poll. Calculate the proportion of black cards in this sample, $\\hat{p}_{sim},$ i.e., the proportion of those who support proposals to defund police departments. The p-value will be the proportion of simulations where $\\hat{p}_{sim} \\geq 0.245.$ (Note: We would generally use a computer to perform the simulations.) (d) There is only one simulated proportion that is at least 0.245, therefore the approximate p-value is 0.001. Your p-value may vary slightly since it is based on a visual estimate. Since the p-value is smaller than 0.05, we reject $H_0.$ The data provide convincing evidence that the proportion of Seattle adults who support proposals to defund police departments is greater than 0.20, i.e., more than one in five.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0: p = 0.5$, $H_A: p \\ne 0.5$. (b) The p-value is roughly 0.4, so there is not evidence in the data (possibly because there are only 7 cats being measured!) to conclude that the cats have a preference one way or the other between the two shapes.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $SE(\\hat{p}) = 0.189$. (b) Roughly 0.188. (c) Yes. (d) No. (e) The parametric bootstrap is discrete (only a few distinct options) and the mathematical model is continuous (infinite options on a continuum).\n\\addtocounter{enumi}{1}\n\n1. \\(a) The parametric bootstrap simulation was done with $p=0.7$, and the data bootstrap simulation was done with $p = 0.6.$ (b) The parametric bootstrap is centered at 0.7; the data bootstrap is centered at 0.6. (c) The standard error of the sample proportion is given to be roughly 0.1 for both histograms. (d) Both histograms are reasonably symmetric. Note that histograms which describe the variability of proportions become more skewed as the center of the distribution gets closer to 1 (or zero) because the boundary of 1.0 restricts the symmetry of the tail of the distribution. For this reason, the parametric bootstrap histogram is slightly more skewed (left).\n\\addtocounter{enumi}{1}\n\n1. \\(a) The parametric bootstrap for testing. The data bootstrap distribution for confidence intervals. (b) $H_0: p = 0.7;$ $H_A: p \\ne 0.7.$ p-value $> 0.05.$ There is no evidence that the proportion of full-time statistics majors who work is different from 70%. (c) We are 98% confident that the true proportion of all full-time statistics majors who work at least 5 hours per week is between 35% and 80%. (d) Using $z^\\star = 2.33$, the 98% confidence interval is 0.367 to 0.833.\n\\addtocounter{enumi}{1}\n\n1. 
\\(a)  False. Doesn't satisfy success-failure condition. (b) True. The success-failure condition is not satisfied. In most samples we would expect $\\hat{p}$ to be close to 0.08, the true population proportion. While $\\hat{p}$ can be much above 0.08, it is bound below by 0, suggesting it would take on a right skewed shape. Plotting the sampling distribution would confirm this suspicion. (c) False. $SE_{\\hat{p}} = 0.0243$, and $\\hat{p} = 0.12$ is only $\\frac{0.12 - 0.08}{0.0243} = 1.65$ SEs away from the mean, which would not be considered unusual. (d) True. $\\hat{p}=0.12$ is 2.32 standard errors away from the mean, which is often considered unusual. (e) False. Decreases the SE by a factor of $1/\\sqrt{2}$.\n\\addtocounter{enumi}{1}\n\n1. \\(a)  True. See the reasoning of 6.1(b). (b) True. We take the square root of the sample size in the SE formula. (c) True. The independence and success-failure conditions are satisfied. (d) True. The independence and success-failure conditions are satisfied.\n\\addtocounter{enumi}{1}\n\n1. \\(a)  False. A confidence interval is constructed to estimate the population proportion, not the sample proportion. (b) True. 95% CI: $82\\%\\ \\pm\\ 2\\%$. (c) True. By the definition of the confidence level. (d) True. Quadrupling the sample size decreases the SE and ME by a factor of $1/\\sqrt{4}$. (e) True. The 95% CI is entirely above 50%.\n\\addtocounter{enumi}{1}\n\n1. With a random sample, independence is satisfied. The success-failure condition is also satisfied. $ME = z^{\\star} \\sqrt{ \\frac{\\hat{p} (1-\\hat{p})} {n} } = 1.96 \\sqrt{ \\frac{0.56 \\times 0.44}{600} }= 0.0397 \\approx 4\\%.$\n\\addtocounter{enumi}{1}\n\n1. \\(a)  No. The sample only represents students who took the SAT, and this was also an online survey. (b) (0.5289, 0.5711). We are 90% confident that 53% to 57% of high school seniors who took the SAT are fairly certain that they will participate in a study abroad program in college. (c) 90% of such random samples would produce a 90% confidence interval that includes the true proportion. (d) Yes. The interval lies entirely above 50%.\n\\addtocounter{enumi}{1}\n\n1. \\(a)  We want to check for a majority (or minority), so we use the following hypotheses: $H_0: p = 0.5$ and $H_A: p \\neq 0.5$. We have a sample proportion of $\\hat{p} = 0.55$ and a sample size of $n = 617$ independents. Since this is a random sample, independence is satisfied. The success-failure condition is also satisfied: $617 \\times 0.5$ and $617 \\times (1 - 0.5)$ are both at least 10 (we use the null proportion $p_0 = 0.5$ for this check in a one-proportion hypothesis test). Therefore, we can model $\\hat{p}$ using a normal distribution with a standard error of $SE = \\sqrt{\\frac{p(1 - p)}{n}} = 0.02$. (We use the null proportion $p_0 = 0.5$ to compute the standard error for a one-proportion hypothesis test.) Next, we compute the test statistic: $Z = \\frac{0.55 - 0.5}{0.02} = 2.5.$ This yields a one-tail area of 0.0062, and a p-value of $2 \\times 0.0062 = 0.0124.$ Because the p-value is smaller than 0.05, we reject the null hypothesis. We have strong evidence that the support is different from 0.5, and since the data provide a point estimate above 0.5, we have strong evidence to support this claim by the TV pundit. (b) No. Generally we expect a hypothesis test and a confidence interval to align, so we would expect the confidence interval to show a range of plausible values entirely above 0.5. 
However, if the confidence level is misaligned (e.g., a 99% confidence level and a $\\alpha = 0.05$ significance level), then this is no longer generally true.\n\\addtocounter{enumi}{1}\n\n1. \\(a)  $H_0: p = 0.5$. $H_A: p > 0.5$. Independence (random sample, $<10\\%$ of population) is satisfied, as is the success-failure conditions (using $p_0 = 0.5$, we expect 40 successes and 40 failures). $Z = 2.91$ $\\to$ p- value $= 0.0018$. Since the p-value $< 0.05$, we reject the null hypothesis. The data provide strong evidence that the rate of correctly identifying a soda for these people is significantly better than just by random guessing. (b) If in fact people cannot tell the difference between diet and regular soda and they randomly guess, the probability of getting a random sample of 80 people where 53 or more identify a soda correctly would be 0.0018.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The sample is from all computer chips manufactured at the factory during the week of production. We might be tempted to generalize the population to represent all weeks, but we should exercise caution here since the rate of defects may change over time. (b) The fraction of computer chips manufactured at the factory during the week of production that had defects. (c) Estimate the parameter using the data: $\\hat{p} = \\frac{27}{212} = 0.127$. (d) *Standard error* (or $SE$). (e) Compute the $SE$ using $\\hat{p} = 0.127$ in place of $p$: $SE \\approx \\sqrt{\\frac{\\hat{p}(1 - \\hat{p})}{n}} = \\sqrt{\\frac{0.127(1 - 0.127)}{212}} = 0.023$. (f) The standard error is the standard deviation of $\\hat{p}$. A value of 0.10 would be about one standard error away from the observed value, which would not represent a very uncommon deviation. (Usually beyond about 2 standard errors is a good rule of thumb.) The engineer should not be surprised. (g) Recomputed standard error using $p = 0.1$: $SE = \\sqrt{\\frac{0.1(1 - 0.1)}{212}} = 0.021$. This value isn't very different, which is typical when the standard error is computed using relatively similar proportions (and even sometimes when those proportions are quite different!).\n\\addtocounter{enumi}{1}\n\n1. \\(a) The visitors are from a simple random sample, so independence is satisfied. The success-failure condition is also satisfied, with both 64 and $752 - 64 = 688$ above 10. Therefore, we can use a normal distribution to model $\\hat{p}$ and construct a confidence interval. (b) The sample proportion is $\\hat{p} = \\frac{64}{752} = 0.085$. The standard error is $SE = \\sqrt{\\frac{0.085 (1 - 0.085)}{752}} = 0.010.$ (c) For a 90% confidence interval, use $z^{\\star} = 1.65$. The confidence interval is $0.085 \\pm 1.65 \\times 0.010 \\to (0.0685, 0.1015)$. We are 90% confident that 6.85% to 10.15% of first-time site visitors will register using the new design.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 17 {#sec-exercise-solutions-17 .unlisted}\n\n::: exercises-solution\n1. \\(a) The parameter is $p_{Asican-Indian} - p_{Chinese}.$ The statistic is $\\hat{p}_{Asian-Indian} - \\hat{p}_{Chinese} = 223/4373 - 279/4736 = -0.008$ (b) Roughly 0.005. (c) $H_0: p_{Asian-Indian} - p_{Chinese} = 0;$, $H_A: p_{Asian-Indian} - p_{Chinese} \\ne 0.$ The evidence is borderline but worth further study. There is not strong evidence that the true difference in proportion of current smokers is different across the two ethnic groups.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Roughly 0.00625. 
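The standard errors quoted in these two-proportion solutions come from adding the variances of the two sample proportions: $SE = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$. A base R sketch using the Asian-Indian and Chinese counts given in the first solution of this chapter:

```r
# Difference in current-smoker proportions (counts from the solution above)
x1 <- 223; n1 <- 4373   # Asian-Indian respondents who smoke / sample size
x2 <- 279; n2 <- 4736   # Chinese respondents who smoke / sample size

p1 <- x1 / n1
p2 <- x2 / n2
p1 - p2                                              # about -0.008

se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # about 0.005

# A 95% confidence interval for the difference, using the normal model
(p1 - p2) + c(-1, 1) * qnorm(0.975) * se
```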
(b) We are 95% confident that the true proportion of Filipino Americans who are current smokers is between 5.28 and 7.72 percentage points higher than the proportion of Chinese Americans who smoke. (c) We are 95% confident that the true proportion of Filipino Americans who are current smokers is between 5.2 and 7.7 percentage points higher than the proportion of Chinese Americans who smoke.\n\\addtocounter{enumi}{1}\n\n1. \\(a) While the standard errors of the difference in proportion across the two graphs are roughly the same (approximately 0.012), the centers are not. Computational method A is centered at 0.07 (the difference in the observed sample proportions) and Computational method B is centered at 0. (b) What is the difference between the proportions of Bachelor's and Associate's students who believe that the COVID-19 pandemic will negatively impact their ability to complete the degree? (c) Is the proportion of Bachelor's students who believe that their ability to complete the degree will be negatively impacted by the COVID-19 pandemic different than that of Associate's students?\n\\addtocounter{enumi}{1}\n\n1. \\(a) 26 Yes and 94 No in the Nevaripine group and 10 Yes and 110 No in the Lopinavir group. (b) $H_0: p_N = p_L$. There is no difference in virologic failure rates between the Nevaripine and Lopinavir groups. $H_A: p_N \\ne p_L$. There is some difference in virologic failure rates between the Nevaripine and Lopinavir groups. (c) Random assignment was used, so the observations in each group are independent. If the patients in the study are representative of those in the general population (something impossible to check with the given information), then we can also confidently generalize the findings to the population. The success-failure condition, which we would check using the pooled proportion ($\\hat{p}_{pool} = 36/240 = 0.15$), is satisfied. $Z = 2.89$ $\\to$ p-value $= 0.0039$. Since the p-value is low, we reject $H_0$. There is strong evidence of a difference in virologic failure rates between the Nevaripine and Lopinavir groups. Treatment and virologic failure do not appear to be independent.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Standard error: $SE = \\sqrt{\\frac{0.79(1 - 0.79)}{347} + \\frac{0.55(1 - 0.55)}{617}} = 0.03.$ Using $z^{\\star} = 1.96$, we get: $0.79 - 0.55 \\pm 1.96 \\times 0.03 \\to (0.181, 0.299).$ We are 95% confident that the proportion of Democrats who support the plan is 18.1% to 29.9% higher than the proportion of Independents who support the plan. (b) True.\n\\addtocounter{enumi}{1}\n\n1. \\(a) In effect, we are checking whether men are paid more than women (or vice-versa), and we would expect these two outcomes to be equally likely under the null hypothesis: $H_0: p = 0.5$ and $H_A: p \\neq 0.5.$ We'll use $p$ to represent the fraction of cases where men are paid more than women. (b) There isn't a good way to check independence here since the jobs are not a simple random sample. However, independence does not seem unreasonable, since the individuals in each job are different from each other. The success-failure condition is met since we check it using the null proportion: $p_0 n = (1 - p_0) n = 10.5$ is greater than 10. 
We can compute the sample proportion, $SE$, and test statistic: $\\hat{p} = 19 / 21 = 0.905$ and $SE = \\sqrt{\\frac{0.5 \\times (1 - 0.5)}{21}} = 0.109$ and $Z = \\frac{0.905 - 0.5}{0.109} = 3.72.$ The test statistic $Z$ corresponds to an upper tail area of about 0.0001, so the p-value is 2 times this value: 0.0002. Because the p-value is smaller than 0.05, we reject the notion that all these gender pay disparities are due to chance. Because we observe that men are paid more in a higher proportion of cases and we have rejected $H_0$, we can conclude that men are being paid higher amounts in ways not explainable by chance alone. If you're curious for more info around this topic, including a discussion about adjusting for additional factors that affect pay, please see the following video by Healthcare Triage: youtu.be/aVhgKSULNQA.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0: p = 0.5$. $H_A: p \\neq 0.5$. Independence (random sample) is satisfied, as is the success-failure conditions (using $p_0 = 0.5$, we expect 40 successes and 40 failures). $Z = 2.91$ $\\to$ the one tail area is 0.0018, so the p-value is 0.0036. Since the p-value $< 0.05$, we reject the null hypothesis. Since we rejected $H_0$ and the point estimate suggests people are better than random guessing, we can conclude the rate of correctly identifying a soda for these people is significantly better than just by random guessing. (b) If in fact people cannot tell the difference between diet and regular soda and they were randomly guessing, the probability of getting a random sample of 80 people where 53 or more identify a soda correctly (or 53 or more identify a soda incorrectly) would be 0.0036.\n\\addtocounter{enumi}{1}\n\n1. Before we can calculate a confidence interval, we must first check that the conditions are met. There aren't at least 10 successes and 10 failures in each of the four groups (treatment/control and yawn/not yawn), $(\\hat{p}_C - \\hat{p}_T)$ is not expected to be approximately normal and therefore cannot calculate a confidence interval for the difference between the proportions of participants who yawned in the treatment and control groups using large sample techniques and a critical Z score.\n\\addtocounter{enumi}{1}\n\n1. \\(a) False. The confidence interval includes 0. (b) False. We are 95% confident that 16% fewer to 2% Americans who make less than \\$40,000 per year are not at all personally affected by the government shutdown compared to those who make \\$40,000 or more per year. (c) False. As the confidence level decreases the width of the confidence level decreases as well. (d) True.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Type 1. (b) Type 2. (c) Type 2.\n\\addtocounter{enumi}{1}\n\n1. No. The samples at the beginning and at the end of the semester are not independent since the survey is conducted on the same students.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The proportion of the normal curve centered at -0.1 with a standard deviation of 0.15 that is less than -2 * standard error is 0.09. (b) The proportion of the normal curve centered at -0.4 with a standard deviation of 0.145 that is less than 2 * standard error is 0.78. (c) The proportion of the normal curve centered at -0.1 with a standard deviation of 0.0671 that is less than 2 * standard error is 0.31. (d) The proportion of the normal curve centered at -0.4 with a standard deviation of 0.0678 that is less than 2 * standard error is 1. 
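Each probability in the solution above is a normal tail area: the chance that a future sample difference lands below the rejection cutoff of two standard errors below zero when the true difference is $\delta$. Part (a), for example, can be reproduced with `pnorm()`; the other parts only change the center and standard error.

```r
# Sketch of the calculation in part (a):
# true difference -0.1, standard error 0.15
delta  <- -0.1
se     <- 0.15
cutoff <- -2 * se   # reject H0 when the observed difference falls below this value

pnorm(cutoff, mean = delta, sd = se)
#> [1] 0.09121122
```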
(e) The larger the value of $\\delta$ and the larger the sample size, the more likely that the future study will lead to sample proportions which are able to reject the null hypothesis.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 18 {#sec-exercise-solutions-18 .unlisted}\n\n::: exercises-solution\n1. \\(a) Two-way table is shown below. (b-i) $E_{row_1, col_1} = \\frac{(row~1~total)\\times(col~1~total)}{table~total} = 35$. This is lower than the observed value. (b-ii) $E_{row_2, col_2} = \\frac{(row~2~total)\\times(col~2~total)}{table~total} = 115$. This is lower than the observed value.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Quit
Treatment Yes No Total
Patch + support group 40 110 150
Only patch 30 120 150
Total 70 230 300
\n \n `````\n :::\n :::\n\\addtocounter{enumi}{1}\n\n1. \\(a) Sun = 0.343, Partial = 0.325, Shade = 0.331. (b) For each, the numbers are listed in the order sun, partial, and shade: Desert (40,9, 38,7, 39.4), Mountain (36.7, 34.8, 35.5), Valley (36.4, 34.5, 35.1). (c) Yes. (d) We can't evaluate the association without a formal test.\n\\addtocounter{enumi}{1}\n\n1. The original dataset will have a higher Chi-squared statistic than the randomized dataset.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The two variables are independent. (b) The randomized Chi-squared values range from zero to approximately 15. (c) The null hypothesis is that the variables are independent; the alternative hypothesis is that the variables are associated. The p-value is extremely small. The habitat provides information about the likelihood of being in the different sunshine states.\n\\addtocounter{enumi}{1}\n\n1. \\(a) The two variables are independent. (b) The randomized Chi-squared values range from zero to approximately 25. (c) The null hypothesis is that the variables are independent; the alternative hypothesis is that the variables are associated. The p-value is around 0. There is convincing evidence to claim that site and sunlight preference are associated. (d) With larger sample sizes, the power (the probability of rejecting $H_0$ when $H_A$ is true) is higher.\n\\addtocounter{enumi}{1}\n\n1. \\(a) False. The Chi-square distribution has one parameter called degrees of freedom. (b) True. (c) True. (d) False. As the degrees of freedom increases, the shape of the Chi-square distribution becomes more symmetric.\n\\addtocounter{enumi}{1}\n\n1. The hypotheses are $H_0:$ Sleep levels and profession are independent. $H_A:$ Sleep levels and profession are associated. The observations are independent and the sample sizes are large enough to conduct a Chi-square test of independence. The Chi-square statistic is 1 with 2 degrees of freedom. The p-value is 0.6. Since the p-value is high (default to alpha = 0.05), we fail to reject $H_0$. The data do not provide convincing evidence of an association between sleep levels and profession.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0$: The age of Los Angeles residents is independent of shipping carrier preference variable. $H_A$: The age of Los Angeles residents is associated with the shipping carrier preference variable. (b) The conditions are not satisfied since some expected counts are below 5.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 19 {#sec-exercise-solutions-19 .unlisted}\n\n::: exercises-solution\n1. \\(a) Average sleep of 20 in sample vs. all New Yorkers. (b) Average height of students in study vs all undergraduates.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Use the sample mean to estimate the population mean: 171.1. Likewise, use the sample median to estimate the population median: 170.3. (b) Use the sample standard deviation (9.4) and sample IQR ($177.8-163.8 = 14$). (c) $Z_{180} = 0.95$ and $Z_{155} = -1.71.$ Neither of these observations is more than two standard deviations away from the mean, so neither would be considered unusual. (d) No, sample point estimates only estimate the population parameter, and they vary from one sample to another. Therefore we cannot expect to get the same mean and standard deviation with each random sample. (e) We use the standard error of the mean to measure the variability in means of random samples of same size taken from a population. The variability in the means of random samples is quantified by the standard error. 
Based on this sample, $SE_{\\bar{x}} = \\frac{9.4}{\\sqrt{507}} = 0.417.$\n\\addtocounter{enumi}{1}\n\n1. \\(a) The kindergartners will have a smaller standard deviation of heights. We would expect their heights to be more similar to each other compared to a group of adults' heights. (b) The standard error of the mean will depend on the variability of individual heights. The standard error of the adult sample averages will be around $9.4 / \\sqrt{100} = 0.94$ cm. The standard error of the kindergartner sample averages will be smaller.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $df=6-1=5$, $t_{5}^{\\star} = 2.02$. (b) $df=21-1=20$, $t_{20}^{\\star} = 2.53$. (c) $df=28$, $t_{28}^{\\star} = 2.05$. (d) $df=11$, $t_{11}^{\\star} = 3.11$.\n\\addtocounter{enumi}{1}\n\n1. \\(a) 0.085, do not reject $H_0$. (b) 0.003, reject $H_0$. (c) 0.438, do not reject $H_0$. (d) 0.042, reject $H_0$.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Roughly 0.1 weeks. (b) Roughly (38.45 weeks, 38.85 weeks). (c) Roughly (38.49 weeks, 38.91 weeks).\n\\addtocounter{enumi}{1}\n\n1. \\(a) False. (b) False. (c) True. (d) False.\n\\addtocounter{enumi}{1}\n\n1. The mean is the midpoint: $\\bar{x} = 20$. Identify the margin of error: $ME = 1.015$, then use $t^{\\star}_{35} = 2.03$ and $SE = s / \\sqrt{n}$ in the formula for margin of error to identify $s = 3$.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0$: $\\mu = 8$ (New Yorkers sleep 8 hrs per night on average.) \n$H_A$: $\\mu \\neq 8$ (New Yorkers sleep less or more than 8 hrs per night on average.) (b) Independence: The sample is random. The min/max suggest there are no concerning outliers. $T = -1.75$. $df=25-1=24$. (c) p-value $= 0.093$. If in fact the true population mean of the amount New Yorkers sleep per night was 8 hours, the probability of getting a random sample of 25 New Yorkers where the average amount of sleep is 7.73 hours per night or less (or 8.27 hours or more) is 0.093. (d) Since p-value $>$ 0.05, do not reject $H_0$. The data do not provide strong evidence that New Yorkers sleep more or less than 8 hours per night on average. (e) Yes, since we did not reject $H_0$.\n\\addtocounter{enumi}{1}\n\n1. With a larger critical value, the confidence interval ends up being wider. This makes intuitive sense: when we have a small sample size and the population standard deviation is unknown, we should have a wider interval than if we knew the population standard deviation or if we had a large enough sample size.\n\\addtocounter{enumi}{1}\n\n1. \\(a) We will conduct a 1-sample $t$-test. $H_0$: $\\mu = 5$. $H_A$: $\\mu \\neq 5$. We'll use $\\alpha = 0.05$. This is a random sample, so the observations are independent. To proceed, we assume the distribution of years of piano lessons is approximately normal. $SE = 2.2 / \\sqrt{20} = 0.4919$. The test statistic is $T = (4.6 - 5) / SE = -0.81$. $df = 20 - 1 = 19$. The one-tail area is about 0.21, so the p-value is about 0.42, which is bigger than $\\alpha = 0.05$ and we do not reject $H_0$. That is, we do not have sufficiently strong evidence to reject the notion that the average is 5 years. (b) Using $SE = 0.4919$ and $t_{df = 19}^{\\star} = 2.093$, the confidence interval is (3.57, 5.63). We are 95% confident that the average number of years a child takes piano lessons in this city is 3.57 to 5.63 years. (c) They agree, since we did not reject the null hypothesis and the null value of 5 was in the $t$-interval.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 20 {#sec-exercise-solutions-20 .unlisted}\n\n::: exercises-solution\n1. 
The hypotheses should use population means ($\\mu$), not sample means ($\\bar{x}$); the null hypothesis should set the two population means equal to each other; and the alternative hypothesis should be two-tailed and use a not-equal-to sign.\n\\addtocounter{enumi}{1}\n\n1. $H_0: \\mu_{0.99} = \\mu_{1}$ and $H_A: \\mu_{0.99} \\ne \\mu_{1}.$ p-value $<$ 0.05, reject $H_0.$ The data provide convincing evidence that the population averages of price per carat of 0.99 carat and 1 carat diamonds are different.\n\\addtocounter{enumi}{1}\n\n1. \\(a) We are 95% confident that the population average price per carat of 0.99 carat diamonds is \\$2 to \\$23 lower than the population average price per carat of 1 carat diamonds. (b) We are 95% confident that the population average price per carat of 0.99 carat diamonds is \\$2.91 to \\$21.10 lower than the population average price per carat of 1 carat diamonds.\n\\addtocounter{enumi}{1}\n\n1. The difference is not zero (statistically significant), but there is no evidence that the difference is large (practically significant), because the interval provides values as low as 1 lb.\n\\addtocounter{enumi}{1}\n\n1. $H_0: \\mu_{0.99} = \\mu_{1}$ and $H_A: \\mu_{0.99} \\ne \\mu_{1}$. Independence: Both samples are random and represent less than 10% of their respective populations. Also, we have no reason to think that the 0.99 carat diamonds are not independent of the 1 carat diamonds since they are both sampled randomly. Normality: The distributions are not extremely skewed, hence we can assume that the distribution of the average differences will be nearly normal as well. $T_{22} = 2.23$, p-value = 0.0131. Since the p-value is less than 0.05, reject $H_0$. The data provide convincing evidence that the population averages of price per carat of 0.99 carat and 1 carat diamonds are different.\n\\addtocounter{enumi}{1}\n\n1. We are 95% confident that the population average price per carat of 0.99 carat diamonds is \\$2.96 to \\$22.42 lower than the population average price per carat of 1 carat diamonds.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $\\mu_{\\bar{x}_1} = 15$, $\\sigma_{\\bar{x}_1} = 20 / \\sqrt{50} = 2.8284.$ (b) $\\mu_{\\bar{x}_2} = 20$, $\\sigma_{\\bar{x}_2} = 10 / \\sqrt{30} = 1.8257.$ (c) $\\mu_{\\bar{x}_2 - \\bar{x}_1} = 20 - 15 = 5$, $\\sigma_{\\bar{x}_2 - \\bar{x}_1} = \\sqrt{\\left(20 / \\sqrt{50}\\right)^2 + \\left(10 / \\sqrt{30}\\right)^2} = 3.3665.$ (d) Think of $\\bar{x}_1$ and $\\bar{x}_2$ as being random variables, and we are considering the standard deviation of the difference of these two random variables, so we square each standard deviation, add them together, and then take the square root of the sum: $SD_{\\bar{x}_2 - \\bar{x}_1} = \\sqrt{SD_{\\bar{x}_2}^2 + SD_{\\bar{x}_1}^2}.$\n\\addtocounter{enumi}{1}\n\n1. \\(a) Chickens fed linseed weighed an average of 218.75 grams while those fed horsebean weighed an average of 160.20 grams. Both distributions are relatively symmetric with no apparent outliers. There is more variability in the weights of chickens fed linseed. (b) $H_0: \\mu_{ls} = \\mu_{hb}$. $H_A: \\mu_{ls} \\ne \\mu_{hb}$. We leave the conditions to you to consider. $T=3.02$, $df = \\min(11, 9) = 9$ $\\to$ p-value $= 0.014$. Since p-value $<$ 0.05, reject $H_0$. The data provide strong evidence that there is a significant difference between the average weights of chickens that were fed linseed and horsebean. (c) Type 1 Error, since we rejected $H_0$. 
(d) Yes, since p-value $>$ 0.01, we would not have rejected $H_0$.\n\\addtocounter{enumi}{1}\n\n1. $H_0: \\mu_C = \\mu_S$. $H_A: \\mu_C \\ne \\mu_S$. $T = 3.27$, $df=11$ $\\to$ p-value $= 0.007$. Since p-value $< 0.05$, reject $H_0$. The data provide strong evidence that the average weight of chickens that were fed casein is different than the average weight of chickens that were fed soybean (with weights from casein being higher). Since this is a randomized experiment, the observed difference can be attributed to the diet.\n\\addtocounter{enumi}{1}\n\n1. $H_0: \\mu_{T} = \\mu_{C}$. $H_A: \\mu_{T} \\ne \\mu_{C}$. $T=2.24$, $df=21$ $\\to$ p-value $= 0.036$. Since p-value $<$ 0.05, reject $H_0$. The data provide strong evidence that the average food consumption by the patients in the treatment and control groups is different. Furthermore, the data indicate patients in the distracted eating (treatment) group consume more food than patients in the control group.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 21 {#sec-exercise-solutions-21 .unlisted}\n\n::: exercises-solution\n1. Paired; data are recorded in the same cities at two different time points. The temperature in a city at one time point is not independent of the temperature in the same city at another time point.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Since it's the same students at the beginning and the end of the semester, there is a pairing between the datasets; for a given student, their beginning and end of semester grades are dependent. (b) Since the subjects were sampled randomly, each observation in the men's group does not have a special correspondence with exactly one observation in the other (women's) group. (c) Since it's the same subjects at the beginning and the end of the study, there is a pairing between the datasets; for a given subject, their beginning and end of study artery thickness measurements are dependent. (d) Since it's the same subjects at the beginning and the end of the study, there is a pairing between the datasets; for a given subject, their beginning and end of study weights are dependent.\n\\addtocounter{enumi}{1}\n\n1. False. While it is true that paired analysis requires equal sample sizes, having equal sample sizes isn't, on its own, sufficient for doing a paired test. Paired tests require that there be a special correspondence between each pair of observations in the two groups.\n\\addtocounter{enumi}{1}\n\n1. The data are paired, since this is a before-after measurement of the same trees, so we will construct a confidence interval using the summary statistics of the differences. But before we proceed with a confidence interval, we must first check conditions: Independence: this is satisfied since the trees were randomly sampled. Normality: since $n = 50 \\geq 30$, we only need to consider whether there are any particularly extreme outliers. None are mentioned, and it does not seem like we would expect to observe any such cases from data of this type, so we'll consider this condition to be satisfied. With the conditions satisfied, we can proceed with calculations. First, compute the standard error and degrees of freedom: $SE = \\frac{7.2}{\\sqrt{50}} = 1.02$ and $df = 50 - 1 = 49$. Next, we find $t^{\\star} = 2.68$ for a 99% confidence interval using a $t$-distribution with 49 degrees of freedom, and then we construct the confidence interval: $\\bar{x} \\pm t^{\\star} \\times SE = 12.5 \\pm 2.68 \\times 1.02 = (9.77, 15.23)$. 
We are 99% confident that the average growth of young trees in this area during the 10-year period was 9.77 to 15.23 feet.\n\\addtocounter{enumi}{1}\n\n1. \\(a) No. (b) Yes. (c) No. (d) No and yes. (e) Yes.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Let $diff = 2018 - 1948$. Then, $H_0: \\mu_{diff} = 0$ and $H_A: \\mu_{diff} \\ne 0$. (b) The observed average difference is just outside the range of the randomized differences. (c) Since the p-value $<$ 0.05, reject $H_0$. The data provide convincing evidence of a difference between the average number of 90F days in 2018 and the average number of 90F days in 1948.\n\\addtocounter{enumi}{1}\n\n1. \\(a) For each observation in one dataset, there is exactly one specially corresponding observation in the other dataset for the same geographic location. The data are paired. (b) $H_0: \\mu_{\\text{diff}} = 0$ (There is no difference in average number of days exceeding 90F in 1948 and 2018 for NOAA stations.) $H_A: \\mu_{\\text{diff}} \\neq 0$ (There is a difference.) (c) Locations were randomly sampled, so independence is reasonable. The sample size is at least 30, so we are just looking for particularly extreme outliers: none are present (the observation off to the left in the histogram would be considered a clear outlier, but not a particularly extreme one). Therefore, the conditions are satisfied. (d) $SE = 17.2 / \\sqrt{197} = 1.23$. $T = \\frac{2.9 - 0}{1.23} = 2.36$ with degrees of freedom $df = 197 - 1 = 196$. This leads to a one-tail area of 0.0096 and a p-value of about 0.019. (e) Since the p-value is less than 0.05, we reject $H_0$. The data provide strong evidence that NOAA stations observed more 90F days in 2018 than in 1948. (f) Type 1 Error, since we may have incorrectly rejected $H_0$. This error would mean that NOAA stations did not actually observe more 90F days in 2018 than in 1948, but the sample we took just so happened to make it appear that this was the case. (g) No, since we rejected $H_0$, which had a null value of 0.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $SE = 1.23$ and $z^{\\star} = 1.65$. $2.9 \\pm 1.65 \\times 1.23 \\to (0.87, 4.93)$. (b) We are 90% confident that there was an increase of 0.87 to 4.93 in the average number of days that hit 90F in 2018 relative to 1948 for NOAA stations. (c) Yes, since the interval lies entirely above 0.\n\\addtocounter{enumi}{1}\n\n1. \\(a) These data are paired. For example, the Friday the 13th in, say, September 1991 would probably be more similar to the Friday the 6th in September 1991 than to a Friday the 6th in another month or year. (b) Let $\\mu_{\\textit{diff}} = \\mu_{sixth} - \\mu_{thirteenth}$. $H_0: \\mu_{\\textit{diff}} = 0$. $H_A: \\mu_{\\textit{diff}} \\ne 0$. (c) Independence: The months selected are not random. However, if we think these dates are roughly equivalent to a simple random sample of all such Friday 6th/13th date pairs, then independence is reasonable. To proceed, we must make this strong assumption, though we should note this assumption in any reported results. Normality: With fewer than 10 observations, we would need to see clear outliers to be concerned. There is a borderline outlier on the right of the histogram of the differences, so we would want to report this in formal analysis results. (d) $T = 4.93$ for $df = 10 - 1 = 9$ $\\to$ p-value = 0.001. (e) Since p-value $<$ 0.05, reject $H_0$. The data provide strong evidence that the average number of cars at the intersection is higher on Friday the 6$^{\\text{th}}$ than on Friday the 13$^{\\text{th}}$. 
(We should exercise caution about generalizing the interpretation to all intersections or roads.) (f) If the average number of cars passing the intersection actually was the same on Friday the 6$^{\\text{th}}$ and Friday the 13$^{\\text{th}}$, then the probability that we would observe a test statistic so far from zero is less than 0.01. (g) We might have made a Type 1 Error, i.e., incorrectly rejected the null hypothesis.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 22 {#sec-exercise-solutions-22 .unlisted}\n\n::: exercises-solution\n1. Alternative.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Means across the original data are more variable. (b) Standard deviations of egg lengths are about the same for both plots. (c) The F statistic is bigger for the original data.\n\\addtocounter{enumi}{1}\n\n1. $H_0$: $\\mu_1 = \\mu_2 = \\cdots = \\mu_6$. $H_A$: The average weight varies across some (or all) groups. Independence: Chicks are randomly assigned to feed types (presumably kept separate from one another); therefore, independence of observations is reasonable. Approx. normal: the distributions of weights within each feed type appear to be fairly symmetric. Constant variance: Based on the side-by-side box plots, the constant variance assumption appears to be reasonable. There are differences in the actual computed standard deviations, but these might be due to chance as these are quite small samples. $F_{5,65} = 15.36$ and the p-value is approximately 0. With such a small p-value, we reject $H_0$. The data provide convincing evidence that the average weight of chicks varies across some (or all) feed supplement groups.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0$: The population mean of MET for each group is equal to the others. $H_A$: At least one pair of means is different. (b) Independence: We do not have any information on how the data were collected, so we cannot assess independence. To proceed, we must assume the subjects in each group are independent. In practice, we would inquire for more details. Normality: The data are bound below by zero and the standard deviations are larger than the means, indicating very strong skew. However, since the sample sizes are extremely large, even extreme skew is acceptable. Constant variance: This condition is sufficiently met, as the standard deviations are reasonably consistent across groups. (c) Since the p-value is very small, reject $H_0$. The data provide convincing evidence that the average MET differs between at least one pair of groups.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0$: Average GPA is the same for all majors. $H_A$: At least one pair of means is different. (b) Since p-value $>$ 0.05, fail to reject $H_0$. The data do not provide convincing evidence of a difference between the average GPAs across the three groups of majors. (c) The total degrees of freedom is $195 + 2 = 197$, so the sample size is $197+1=198$.\n\\addtocounter{enumi}{1}\n\n1. \\(a) False. As the number of groups increases, so does the number of comparisons, and hence the modified significance level decreases. (b) True. (c) True. (d) False. We need observations to be independent regardless of sample size.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Left is Dataset B. (b) Right is Dataset A.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 23 {#sec-exercise-solutions-23 .unlisted}\n\nApplication chapter, no exercises.\n\n## Chapter 24 {#sec-exercise-solutions-24 .unlisted}\n\n::: exercises-solution\n1. \\(a) $H_0: \\beta_1 = 0$, $H_A: \\beta_1 \\ne 0$. 
(b) The observed slope of 0.604 is not a plausible value, the p-value is extremely small, and the null hypothesis can be rejected. (c) The p-value is also extremely small.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Roughly 0.53 to 0.67. (b) For individuals with one cm larger shoulder girth, their average height is predicted to be between 0.53 and 0.67 cm taller, with 98% confidence.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0: \\beta_1 = 0$, $H_A: \\beta_1 \\ne 0$. (b) The observed slope of 2.559 is not a plausible value, the p-value is extremely small, and the null hypothesis can be rejected. (c) The p-value is also extremely small.\n\\addtocounter{enumi}{1}\n\n1. \\(a) A rough 90% confidence interval is 1.9 to 3.1. (b) For a one unit (one percentage point) increase in poverty across given metropolitan areas, the predicted average annual murder rate will be between 1.9 and 3.1 persons per million larger, with 90% confidence.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0: \\beta_1 = 0$, $H_A: \\beta_1 \\ne 0$. (b) The p-value is roughly 0.45, which is much bigger than 0.05. The null hypothesis cannot be rejected. There is no evidence with these data that there is a linear relationship between a father's age and the baby's weight. (c) The p-value of 0.449 is quite similar. The hypothesis test conclusion is the same: the data do not support a linear model.\n\\addtocounter{enumi}{1}\n\n1. \\(a) A rough 95% confidence interval is (-0.008, 0.016). (b) We are 95% confident that for individuals with fathers who are one year older, their average weight is predicted to be between -0.008 and 0.016 pounds heavier.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $H_0$: The true slope coefficient of body weight is zero ($\\beta_1 = 0$). $H_A$: The true slope coefficient of body weight is different than zero ($\\beta_1 \\neq 0$). (b) The p-value is extremely small (zero to 4 decimal places), which is lower than the significance level of 0.05. With such a low p-value, we reject $H_0$. The data provide strong evidence that the true slope coefficient of body weight is greater than zero and that body weight is positively associated with heart weight in cats. (c) (3.539, 4.529). We are 95% confident that for each additional kilogram in cats' weights, we expect their hearts to be heavier by 3.539 to 4.529 grams, on average. (d) Yes, since we rejected the null hypothesis and the confidence interval lies above 0.\n\\addtocounter{enumi}{1}\n\n1. \\(a) $r = -\\sqrt{0.292} \\approx -0.54$. We know the correlation is negative due to the negative association shown in the scatterplot. (b) The residuals appear to be fan shaped, indicating non-constant variance. Therefore a simple least squares fit is not appropriate for these data.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 25 {#sec-exercise-solutions-25 .unlisted}\n\n::: exercises-solution\n1. \\(a) (-0.044, 0.346). We are 95% confident that students who go out more than two nights a week on average have GPAs 0.044 points lower to 0.346 points higher than those who do not go out more than two nights a week, when controlling for the other variables in the model. (b) Yes, since the p-value is larger than 0.05 in all cases (not including the intercept).\n\\addtocounter{enumi}{1}\n\n1. \\(a) There is a positive, very strong, linear association between the number of tourists and spending. (b) Explanatory: number of tourists (in thousands). Response: spending (in millions of US dollars). (c) We can predict spending for a given number of tourists using a regression line. 
This may be useful information for determining how much the country may want to spend on advertising abroad, or to forecast expected revenues from tourism. (d) Even though the relationship appears linear in the scatterplot, the residual plot actually shows a nonlinear relationship (**L**inearity is violated). This is not a contradiction: residual plots can show divergences from linearity that can be difficult to see in a scatterplot. A simple linear model is inadequate for modeling these data. It is also important to consider that these data are observed sequentially, which means there may be a hidden structure not evident in the current plots but that is important to consider (and might lead to a violation of the **I**ndependence condition).\n\\addtocounter{enumi}{1}\n\n1. \\(a) **L**inearity: Horror movies seem to show a much different pattern than the other genres. While the residual plots show a random scatter over years and in order of data collection, there is a clear pattern in the residuals for the various genres, which signals that this regression model is not appropriate for these data. **I**ndependent observations: The variability of the residuals is higher for data that come later in the dataset. We do not know if the data are sorted by year, but if so, there may be a temporal pattern in the data that violates the independence condition. **N**ormality: The residuals are right skewed (skewed to the high end). Constant or **E**qual variability: The residuals vs. predicted values plot reveals some outliers. This plot for only babies with predicted birth weights between 6 and 8.5 pounds looks a lot better, suggesting that for the bulk of the data the constant variance condition is met.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Linearity: With so many observations in the dataset, we look for particularly extreme outliers in the histogram of residuals and do not see any. We also do not see a non-linear pattern emerging in the residuals vs. predicted plot. Independent observations: The sample is random and there does not seem to be a trend in the residuals vs. order of data collection plot. Normality: The histogram of residuals appears to be unimodal and symmetric, centered at 0. Constant or equal variability: The residuals vs. predicted values plot reveals some outliers. This plot for only babies with predicted birth weights between 6 and 8.5 pounds looks a lot better, suggesting that for the bulk of the data the constant variance condition is met. All concerns raised here are relatively mild. There are some outliers, but there is so much data that the influence of such observations will be minor. (b) $H_0$: The true slope coefficient of habit is zero ($\\beta_5 = 0$). $H_A$: The true slope coefficient of habit is different than zero ($\\beta_5 \\neq 0$). The p-value for the two-sided alternative hypothesis ($\\beta_5 \\ne 0$) is incredibly small, 0.0007 (smaller than 0.05), so we reject $H_0$. The data provide convincing evidence that habit and weight are positively correlated, given the other variables in the model. The true slope parameter is indeed greater than 0.\n\\addtocounter{enumi}{1}\n\n1. \\(a) Roughly $\\widehat{\\texttt{weight}} = 11$ pounds and $\\texttt{weight}_i = 7$ pounds. (b) Folds 1, 2, and 4 were used to build the prediction model. (c) The plot on the left estimates 8 parameters; the plot on the right estimates 3 parameters. (d) The residuals are not substantially different.\n\\addtocounter{enumi}{1}\n\n1. 
\\(a) Roughly $\\widehat{\\texttt{volume}} = 400$ riders and $\\texttt{volume}_i = 500$ riders. (b) Folds 2 and 3 were used to build the prediction model. (c) The plot on the left estimates 7 parameters; the plot on the right estimates 3 parameters. (d) The residuals are not substantially different.\n\\addtocounter{enumi}{1}\n\n\n:::\n\n## Chapter 26 {#sec-exercise-solutions-26 .unlisted}\n\n::: exercises-solution\n1. \\(a) $H_0: \\beta_1 = 0$, the slope of the model predicting kids' marijuana use in college from their parents' marijuana use in college is 0. $H_A: \\beta_1 \\neq 0$, the slope of the model predicting kids' marijuana use in college from their parents' marijuana use in college is different than 0. (b) The test statistic is $Z = 4.09$ and the associated p-value is less than 0.0001. (c) With a small p-value we reject $H_0$. The data provide convincing evidence that the slope of the model predicting kids' marijuana use in college from their parents' marijuana use in college is different than 0, i.e. that parents' marijuana use in college is a significant predictor of kids' marijuana use in college.\n\\addtocounter{enumi}{1}\n\n1. \\(a) 26 observations are in Fold2. 8 correctly and 2 incorrectly predicted to be from Victoria. (b) 78 observations are used to build the model. (c) 2 coefficients for tail length; 3 coefficients for total length and sex.\n\\addtocounter{enumi}{1}\n\n1. \\(a) 298 observations are in Fold2. 10 correctly and 26 incorrectly predicted to be premature. (b) 596 observations are used to build the model. (c) The vast majority of the observations fall into the row corresponding to the observed status of full term. (d) 7 coefficients for the larger model; 3 coefficients for the smaller model.\n\\addtocounter{enumi}{1}\n\n\n:::\n", + "supporting": [ + "exercise-solutions_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/exercise-solutions/figure-html/unnamed-chunk-2-1.png b/_freeze/exercise-solutions/figure-html/unnamed-chunk-2-1.png new file mode 100644 index 00000000..588938d5 Binary files /dev/null and b/_freeze/exercise-solutions/figure-html/unnamed-chunk-2-1.png differ diff --git a/_freeze/index/execute-results/html.json b/_freeze/index/execute-results/html.json new file mode 100644 index 00000000..c7b07dc6 --- /dev/null +++ b/_freeze/index/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "1dff4029a690380fb3ddad0b8c87e824", + "result": { + "markdown": "::: welcome\n::: {.content-visible when-format=\"html\"}\n# Welcome to IMS2 {.unnumbered}\n:::\n\n\\chapter*{}\n\n\\vfill\n\n::: {.content-visible when-format=\"html\"}\nThis is the website for **Introduction to Modern Statistics**, Second Edition by Mine Çetinkaya-Rundel and Johanna Hardin.\nIntroduction to Modern Statistics, which we'll refer to as IMS going forward, is a textbook from the [OpenIntro](https://www.openintro.org/) project.\n:::\n\n\n```{=html}\n\n```\n\n\n::: {.content-visible when-format=\"html\"}\nCopyright © 2023.\n:::\n\n::: {.content-hidden unless-format=\"pdf\"}\nCopyright $\\copyright$ 2023.\n:::\n\nSecond Edition.\n\nVersion date: September 10, 2023.\n\nThis textbook and its supplements, including slides, labs, and interactive tutorials, may be downloaded for free at\\\n[**openintro.org/book/ims**](http://openintro.org/book/ims).\n\nThis textbook is a derivative of *OpenIntro 
Statistics* 4th Edition and *Introduction to Statistics with Randomization and Simulation* 1st Edition by Diez, Barr, and Çetinkaya-Rundel, and it's available under a Creative Commons Attribution-ShareAlike 3.0 Unported United States License.\nLicense details are available at the Creative Commons website:\\\n[**creativecommons.org**](https://www.openintro.org/go/?id=creativecommons_org&referrer=ims1_pdf).\n\nSource files for this book can be found on GitHub at\\\n[github.com/OpenIntroStat/ims](https://github.com/OpenIntroStat/ims).\n:::\n\n\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/site_libs/bsTable-3.3.7/bootstrapTable.js b/_freeze/site_libs/bsTable-3.3.7/bootstrapTable.js new file mode 100644 index 00000000..0c83d3b8 --- /dev/null +++ b/_freeze/site_libs/bsTable-3.3.7/bootstrapTable.js @@ -0,0 +1,801 @@ +/* ======================================================================== + * Bootstrap: tooltip.js v3.4.1 + * https://getbootstrap.com/docs/3.4/javascript/#tooltip + * Inspired by the original jQuery.tipsy by Jason Frame + * ======================================================================== + * Copyright 2011-2019 Twitter, Inc. + * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE) + * ======================================================================== */ + ++function ($) { + 'use strict'; + + var DISALLOWED_ATTRIBUTES = ['sanitize', 'whiteList', 'sanitizeFn'] + + var uriAttrs = [ + 'background', + 'cite', + 'href', + 'itemtype', + 'longdesc', + 'poster', + 'src', + 'xlink:href' + ] + + var ARIA_ATTRIBUTE_PATTERN = /^aria-[\w-]*$/i + + var DefaultWhitelist = { + // Global attributes allowed on any supplied element below. + '*': ['class', 'dir', 'id', 'lang', 'role', ARIA_ATTRIBUTE_PATTERN], + a: ['target', 'href', 'title', 'rel'], + area: [], + b: [], + br: [], + col: [], + code: [], + div: [], + em: [], + hr: [], + h1: [], + h2: [], + h3: [], + h4: [], + h5: [], + h6: [], + i: [], + img: ['src', 'alt', 'title', 'width', 'height'], + li: [], + ol: [], + p: [], + pre: [], + s: [], + small: [], + span: [], + sub: [], + sup: [], + strong: [], + u: [], + ul: [] + } + + /** + * A pattern that recognizes a commonly useful subset of URLs that are safe. + * + * Shoutout to Angular 7 https://github.com/angular/angular/blob/7.2.4/packages/core/src/sanitization/url_sanitizer.ts + */ + var SAFE_URL_PATTERN = /^(?:(?:https?|mailto|ftp|tel|file):|[^&:/?#]*(?:[/?#]|$))/gi + + /** + * A pattern that matches safe data URLs. Only matches image, video and audio types. + * + * Shoutout to Angular 7 https://github.com/angular/angular/blob/7.2.4/packages/core/src/sanitization/url_sanitizer.ts + */ + var DATA_URL_PATTERN = /^data:(?:image\/(?:bmp|gif|jpeg|jpg|png|tiff|webp)|video\/(?:mpeg|mp4|ogg|webm)|audio\/(?:mp3|oga|ogg|opus));base64,[a-z0-9+/]+=*$/i + + function allowedAttribute(attr, allowedAttributeList) { + var attrName = attr.nodeName.toLowerCase() + + if ($.inArray(attrName, allowedAttributeList) !== -1) { + if ($.inArray(attrName, uriAttrs) !== -1) { + return Boolean(attr.nodeValue.match(SAFE_URL_PATTERN) || attr.nodeValue.match(DATA_URL_PATTERN)) + } + + return true + } + + var regExp = $(allowedAttributeList).filter(function (index, value) { + return value instanceof RegExp + }) + + // Check if a regular expression validates the attribute. 
+ for (var i = 0, l = regExp.length; i < l; i++) { + if (attrName.match(regExp[i])) { + return true + } + } + + return false + } + + function sanitizeHtml(unsafeHtml, whiteList, sanitizeFn) { + if (unsafeHtml.length === 0) { + return unsafeHtml + } + + if (sanitizeFn && typeof sanitizeFn === 'function') { + return sanitizeFn(unsafeHtml) + } + + // IE 8 and below don't support createHTMLDocument + if (!document.implementation || !document.implementation.createHTMLDocument) { + return unsafeHtml + } + + var createdDocument = document.implementation.createHTMLDocument('sanitization') + createdDocument.body.innerHTML = unsafeHtml + + var whitelistKeys = $.map(whiteList, function (el, i) { return i }) + var elements = $(createdDocument.body).find('*') + + for (var i = 0, len = elements.length; i < len; i++) { + var el = elements[i] + var elName = el.nodeName.toLowerCase() + + if ($.inArray(elName, whitelistKeys) === -1) { + el.parentNode.removeChild(el) + + continue + } + + var attributeList = $.map(el.attributes, function (el) { return el }) + var whitelistedAttributes = [].concat(whiteList['*'] || [], whiteList[elName] || []) + + for (var j = 0, len2 = attributeList.length; j < len2; j++) { + if (!allowedAttribute(attributeList[j], whitelistedAttributes)) { + el.removeAttribute(attributeList[j].nodeName) + } + } + } + + return createdDocument.body.innerHTML + } + + // TOOLTIP PUBLIC CLASS DEFINITION + // =============================== + + var Tooltip = function (element, options) { + this.type = null + this.options = null + this.enabled = null + this.timeout = null + this.hoverState = null + this.$element = null + this.inState = null + + this.init('tooltip', element, options) + } + + Tooltip.VERSION = '3.4.1' + + Tooltip.TRANSITION_DURATION = 150 + + Tooltip.DEFAULTS = { + animation: true, + placement: 'top', + selector: false, + template: '', + trigger: 'hover focus', + title: '', + delay: 0, + html: false, + container: false, + viewport: { + selector: 'body', + padding: 0 + }, + sanitize : true, + sanitizeFn : null, + whiteList : DefaultWhitelist + } + + Tooltip.prototype.init = function (type, element, options) { + this.enabled = true + this.type = type + this.$element = $(element) + this.options = this.getOptions(options) + this.$viewport = this.options.viewport && $(document).find($.isFunction(this.options.viewport) ? this.options.viewport.call(this, this.$element) : (this.options.viewport.selector || this.options.viewport)) + this.inState = { click: false, hover: false, focus: false } + + if (this.$element[0] instanceof document.constructor && !this.options.selector) { + throw new Error('`selector` option must be specified when initializing ' + this.type + ' on the window.document object!') + } + + var triggers = this.options.trigger.split(' ') + + for (var i = triggers.length; i--;) { + var trigger = triggers[i] + + if (trigger == 'click') { + this.$element.on('click.' + this.type, this.options.selector, $.proxy(this.toggle, this)) + } else if (trigger != 'manual') { + var eventIn = trigger == 'hover' ? 'mouseenter' : 'focusin' + var eventOut = trigger == 'hover' ? 'mouseleave' : 'focusout' + + this.$element.on(eventIn + '.' + this.type, this.options.selector, $.proxy(this.enter, this)) + this.$element.on(eventOut + '.' + this.type, this.options.selector, $.proxy(this.leave, this)) + } + } + + this.options.selector ? 
+ (this._options = $.extend({}, this.options, { trigger: 'manual', selector: '' })) : + this.fixTitle() + } + + Tooltip.prototype.getDefaults = function () { + return Tooltip.DEFAULTS + } + + Tooltip.prototype.getOptions = function (options) { + var dataAttributes = this.$element.data() + + for (var dataAttr in dataAttributes) { + if (dataAttributes.hasOwnProperty(dataAttr) && $.inArray(dataAttr, DISALLOWED_ATTRIBUTES) !== -1) { + delete dataAttributes[dataAttr] + } + } + + options = $.extend({}, this.getDefaults(), dataAttributes, options) + + if (options.delay && typeof options.delay == 'number') { + options.delay = { + show: options.delay, + hide: options.delay + } + } + + if (options.sanitize) { + options.template = sanitizeHtml(options.template, options.whiteList, options.sanitizeFn) + } + + return options + } + + Tooltip.prototype.getDelegateOptions = function () { + var options = {} + var defaults = this.getDefaults() + + this._options && $.each(this._options, function (key, value) { + if (defaults[key] != value) options[key] = value + }) + + return options + } + + Tooltip.prototype.enter = function (obj) { + var self = obj instanceof this.constructor ? + obj : $(obj.currentTarget).data('bs.' + this.type) + + if (!self) { + self = new this.constructor(obj.currentTarget, this.getDelegateOptions()) + $(obj.currentTarget).data('bs.' + this.type, self) + } + + if (obj instanceof $.Event) { + self.inState[obj.type == 'focusin' ? 'focus' : 'hover'] = true + } + + if (self.tip().hasClass('in') || self.hoverState == 'in') { + self.hoverState = 'in' + return + } + + clearTimeout(self.timeout) + + self.hoverState = 'in' + + if (!self.options.delay || !self.options.delay.show) return self.show() + + self.timeout = setTimeout(function () { + if (self.hoverState == 'in') self.show() + }, self.options.delay.show) + } + + Tooltip.prototype.isInStateTrue = function () { + for (var key in this.inState) { + if (this.inState[key]) return true + } + + return false + } + + Tooltip.prototype.leave = function (obj) { + var self = obj instanceof this.constructor ? + obj : $(obj.currentTarget).data('bs.' + this.type) + + if (!self) { + self = new this.constructor(obj.currentTarget, this.getDelegateOptions()) + $(obj.currentTarget).data('bs.' + this.type, self) + } + + if (obj instanceof $.Event) { + self.inState[obj.type == 'focusout' ? 'focus' : 'hover'] = false + } + + if (self.isInStateTrue()) return + + clearTimeout(self.timeout) + + self.hoverState = 'out' + + if (!self.options.delay || !self.options.delay.hide) return self.hide() + + self.timeout = setTimeout(function () { + if (self.hoverState == 'out') self.hide() + }, self.options.delay.hide) + } + + Tooltip.prototype.show = function () { + var e = $.Event('show.bs.' + this.type) + + if (this.hasContent() && this.enabled) { + this.$element.trigger(e) + + var inDom = $.contains(this.$element[0].ownerDocument.documentElement, this.$element[0]) + if (e.isDefaultPrevented() || !inDom) return + var that = this + + var $tip = this.tip() + + var tipId = this.getUID(this.type) + + this.setContent() + $tip.attr('id', tipId) + this.$element.attr('aria-describedby', tipId) + + if (this.options.animation) $tip.addClass('fade') + + var placement = typeof this.options.placement == 'function' ? 
+ this.options.placement.call(this, $tip[0], this.$element[0]) : + this.options.placement + + var autoToken = /\s?auto?\s?/i + var autoPlace = autoToken.test(placement) + if (autoPlace) placement = placement.replace(autoToken, '') || 'top' + + $tip + .detach() + .css({ top: 0, left: 0, display: 'block' }) + .addClass(placement) + .data('bs.' + this.type, this) + + this.options.container ? $tip.appendTo($(document).find(this.options.container)) : $tip.insertAfter(this.$element) + this.$element.trigger('inserted.bs.' + this.type) + + var pos = this.getPosition() + var actualWidth = $tip[0].offsetWidth + var actualHeight = $tip[0].offsetHeight + + if (autoPlace) { + var orgPlacement = placement + var viewportDim = this.getPosition(this.$viewport) + + placement = placement == 'bottom' && pos.bottom + actualHeight > viewportDim.bottom ? 'top' : + placement == 'top' && pos.top - actualHeight < viewportDim.top ? 'bottom' : + placement == 'right' && pos.right + actualWidth > viewportDim.width ? 'left' : + placement == 'left' && pos.left - actualWidth < viewportDim.left ? 'right' : + placement + + $tip + .removeClass(orgPlacement) + .addClass(placement) + } + + var calculatedOffset = this.getCalculatedOffset(placement, pos, actualWidth, actualHeight) + + this.applyPlacement(calculatedOffset, placement) + + var complete = function () { + var prevHoverState = that.hoverState + that.$element.trigger('shown.bs.' + that.type) + that.hoverState = null + + if (prevHoverState == 'out') that.leave(that) + } + + $.support.transition && this.$tip.hasClass('fade') ? + $tip + .one('bsTransitionEnd', complete) + .emulateTransitionEnd(Tooltip.TRANSITION_DURATION) : + complete() + } + } + + Tooltip.prototype.applyPlacement = function (offset, placement) { + var $tip = this.tip() + var width = $tip[0].offsetWidth + var height = $tip[0].offsetHeight + + // manually read margins because getBoundingClientRect includes difference + var marginTop = parseInt($tip.css('margin-top'), 10) + var marginLeft = parseInt($tip.css('margin-left'), 10) + + // we must check for NaN for ie 8/9 + if (isNaN(marginTop)) marginTop = 0 + if (isNaN(marginLeft)) marginLeft = 0 + + offset.top += marginTop + offset.left += marginLeft + + // $.fn.offset doesn't round pixel values + // so we use setOffset directly with our own function B-0 + $.offset.setOffset($tip[0], $.extend({ + using: function (props) { + $tip.css({ + top: Math.round(props.top), + left: Math.round(props.left) + }) + } + }, offset), 0) + + $tip.addClass('in') + + // check to see if placing tip in new offset caused the tip to resize itself + var actualWidth = $tip[0].offsetWidth + var actualHeight = $tip[0].offsetHeight + + if (placement == 'top' && actualHeight != height) { + offset.top = offset.top + height - actualHeight + } + + var delta = this.getViewportAdjustedDelta(placement, offset, actualWidth, actualHeight) + + if (delta.left) offset.left += delta.left + else offset.top += delta.top + + var isVertical = /top|bottom/.test(placement) + var arrowDelta = isVertical ? delta.left * 2 - width + actualWidth : delta.top * 2 - height + actualHeight + var arrowOffsetPosition = isVertical ? 'offsetWidth' : 'offsetHeight' + + $tip.offset(offset) + this.replaceArrow(arrowDelta, $tip[0][arrowOffsetPosition], isVertical) + } + + Tooltip.prototype.replaceArrow = function (delta, dimension, isVertical) { + this.arrow() + .css(isVertical ? 'left' : 'top', 50 * (1 - delta / dimension) + '%') + .css(isVertical ? 
'top' : 'left', '') + } + + Tooltip.prototype.setContent = function () { + var $tip = this.tip() + var title = this.getTitle() + + if (this.options.html) { + if (this.options.sanitize) { + title = sanitizeHtml(title, this.options.whiteList, this.options.sanitizeFn) + } + + $tip.find('.tooltip-inner').html(title) + } else { + $tip.find('.tooltip-inner').text(title) + } + + $tip.removeClass('fade in top bottom left right') + } + + Tooltip.prototype.hide = function (callback) { + var that = this + var $tip = $(this.$tip) + var e = $.Event('hide.bs.' + this.type) + + function complete() { + if (that.hoverState != 'in') $tip.detach() + if (that.$element) { // TODO: Check whether guarding this code with this `if` is really necessary. + that.$element + .removeAttr('aria-describedby') + .trigger('hidden.bs.' + that.type) + } + callback && callback() + } + + this.$element.trigger(e) + + if (e.isDefaultPrevented()) return + + $tip.removeClass('in') + + $.support.transition && $tip.hasClass('fade') ? + $tip + .one('bsTransitionEnd', complete) + .emulateTransitionEnd(Tooltip.TRANSITION_DURATION) : + complete() + + this.hoverState = null + + return this + } + + Tooltip.prototype.fixTitle = function () { + var $e = this.$element + if ($e.attr('title') || typeof $e.attr('data-original-title') != 'string') { + $e.attr('data-original-title', $e.attr('title') || '').attr('title', '') + } + } + + Tooltip.prototype.hasContent = function () { + return this.getTitle() + } + + Tooltip.prototype.getPosition = function ($element) { + $element = $element || this.$element + + var el = $element[0] + var isBody = el.tagName == 'BODY' + + var elRect = el.getBoundingClientRect() + if (elRect.width == null) { + // width and height are missing in IE8, so compute them manually; see https://github.com/twbs/bootstrap/issues/14093 + elRect = $.extend({}, elRect, { width: elRect.right - elRect.left, height: elRect.bottom - elRect.top }) + } + var isSvg = window.SVGElement && el instanceof window.SVGElement + // Avoid using $.offset() on SVGs since it gives incorrect results in jQuery 3. + // See https://github.com/twbs/bootstrap/issues/20280 + var elOffset = isBody ? { top: 0, left: 0 } : (isSvg ? null : $element.offset()) + var scroll = { scroll: isBody ? document.documentElement.scrollTop || document.body.scrollTop : $element.scrollTop() } + var outerDims = isBody ? { width: $(window).width(), height: $(window).height() } : null + + return $.extend({}, elRect, scroll, outerDims, elOffset) + } + + Tooltip.prototype.getCalculatedOffset = function (placement, pos, actualWidth, actualHeight) { + return placement == 'bottom' ? { top: pos.top + pos.height, left: pos.left + pos.width / 2 - actualWidth / 2 } : + placement == 'top' ? { top: pos.top - actualHeight, left: pos.left + pos.width / 2 - actualWidth / 2 } : + placement == 'left' ? 
{ top: pos.top + pos.height / 2 - actualHeight / 2, left: pos.left - actualWidth } : + /* placement == 'right' */ { top: pos.top + pos.height / 2 - actualHeight / 2, left: pos.left + pos.width } + + } + + Tooltip.prototype.getViewportAdjustedDelta = function (placement, pos, actualWidth, actualHeight) { + var delta = { top: 0, left: 0 } + if (!this.$viewport) return delta + + var viewportPadding = this.options.viewport && this.options.viewport.padding || 0 + var viewportDimensions = this.getPosition(this.$viewport) + + if (/right|left/.test(placement)) { + var topEdgeOffset = pos.top - viewportPadding - viewportDimensions.scroll + var bottomEdgeOffset = pos.top + viewportPadding - viewportDimensions.scroll + actualHeight + if (topEdgeOffset < viewportDimensions.top) { // top overflow + delta.top = viewportDimensions.top - topEdgeOffset + } else if (bottomEdgeOffset > viewportDimensions.top + viewportDimensions.height) { // bottom overflow + delta.top = viewportDimensions.top + viewportDimensions.height - bottomEdgeOffset + } + } else { + var leftEdgeOffset = pos.left - viewportPadding + var rightEdgeOffset = pos.left + viewportPadding + actualWidth + if (leftEdgeOffset < viewportDimensions.left) { // left overflow + delta.left = viewportDimensions.left - leftEdgeOffset + } else if (rightEdgeOffset > viewportDimensions.right) { // right overflow + delta.left = viewportDimensions.left + viewportDimensions.width - rightEdgeOffset + } + } + + return delta + } + + Tooltip.prototype.getTitle = function () { + var title + var $e = this.$element + var o = this.options + + title = $e.attr('data-original-title') + || (typeof o.title == 'function' ? o.title.call($e[0]) : o.title) + + return title + } + + Tooltip.prototype.getUID = function (prefix) { + do prefix += ~~(Math.random() * 1000000) + while (document.getElementById(prefix)) + return prefix + } + + Tooltip.prototype.tip = function () { + if (!this.$tip) { + this.$tip = $(this.options.template) + if (this.$tip.length != 1) { + throw new Error(this.type + ' `template` option must consist of exactly 1 top-level element!') + } + } + return this.$tip + } + + Tooltip.prototype.arrow = function () { + return (this.$arrow = this.$arrow || this.tip().find('.tooltip-arrow')) + } + + Tooltip.prototype.enable = function () { + this.enabled = true + } + + Tooltip.prototype.disable = function () { + this.enabled = false + } + + Tooltip.prototype.toggleEnabled = function () { + this.enabled = !this.enabled + } + + Tooltip.prototype.toggle = function (e) { + var self = this + if (e) { + self = $(e.currentTarget).data('bs.' + this.type) + if (!self) { + self = new this.constructor(e.currentTarget, this.getDelegateOptions()) + $(e.currentTarget).data('bs.' + this.type, self) + } + } + + if (e) { + self.inState.click = !self.inState.click + if (self.isInStateTrue()) self.enter(self) + else self.leave(self) + } else { + self.tip().hasClass('in') ? self.leave(self) : self.enter(self) + } + } + + Tooltip.prototype.destroy = function () { + var that = this + clearTimeout(this.timeout) + this.hide(function () { + that.$element.off('.' + that.type).removeData('bs.' 
+ that.type) + if (that.$tip) { + that.$tip.detach() + } + that.$tip = null + that.$arrow = null + that.$viewport = null + that.$element = null + }) + } + + Tooltip.prototype.sanitizeHtml = function (unsafeHtml) { + return sanitizeHtml(unsafeHtml, this.options.whiteList, this.options.sanitizeFn) + } + + // TOOLTIP PLUGIN DEFINITION + // ========================= + + function Plugin(option) { + return this.each(function () { + var $this = $(this) + var data = $this.data('bs.tooltip') + var options = typeof option == 'object' && option + + if (!data && /destroy|hide/.test(option)) return + if (!data) $this.data('bs.tooltip', (data = new Tooltip(this, options))) + if (typeof option == 'string') data[option]() + }) + } + + var old = $.fn.tooltip + + $.fn.tooltip = Plugin + $.fn.tooltip.Constructor = Tooltip + + + // TOOLTIP NO CONFLICT + // =================== + + $.fn.tooltip.noConflict = function () { + $.fn.tooltip = old + return this + } + +}(jQuery); + +/* ======================================================================== + * Bootstrap: popover.js v3.4.1 + * https://getbootstrap.com/docs/3.4/javascript/#popovers + * ======================================================================== + * Copyright 2011-2019 Twitter, Inc. + * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE) + * ======================================================================== */ + + ++function ($) { + 'use strict'; + + // POPOVER PUBLIC CLASS DEFINITION + // =============================== + + var Popover = function (element, options) { + this.init('popover', element, options) + } + + if (!$.fn.tooltip) throw new Error('Popover requires tooltip.js') + + Popover.VERSION = '3.4.1' + + Popover.DEFAULTS = $.extend({}, $.fn.tooltip.Constructor.DEFAULTS, { + placement: 'right', + trigger: 'click', + content: '', + template: '' + }) + + + // NOTE: POPOVER EXTENDS tooltip.js + // ================================ + + Popover.prototype = $.extend({}, $.fn.tooltip.Constructor.prototype) + + Popover.prototype.constructor = Popover + + Popover.prototype.getDefaults = function () { + return Popover.DEFAULTS + } + + Popover.prototype.setContent = function () { + var $tip = this.tip() + var title = this.getTitle() + var content = this.getContent() + + if (this.options.html) { + var typeContent = typeof content + + if (this.options.sanitize) { + title = this.sanitizeHtml(title) + + if (typeContent === 'string') { + content = this.sanitizeHtml(content) + } + } + + $tip.find('.popover-title').html(title) + $tip.find('.popover-content').children().detach().end()[ + typeContent === 'string' ? 'html' : 'append' + ](content) + } else { + $tip.find('.popover-title').text(title) + $tip.find('.popover-content').children().detach().end().text(content) + } + + $tip.removeClass('fade top bottom left right in') + + // IE8 doesn't accept hiding via the `:empty` pseudo selector, we have to do + // this manually by checking the contents. + if (!$tip.find('.popover-title').html()) $tip.find('.popover-title').hide() + } + + Popover.prototype.hasContent = function () { + return this.getTitle() || this.getContent() + } + + Popover.prototype.getContent = function () { + var $e = this.$element + var o = this.options + + return $e.attr('data-content') + || (typeof o.content == 'function' ? 