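The chi-square ($\chi^2$) test is a family of hypothesis tests for categorical data that compare observed frequencies with the frequencies expected under a null hypothesis.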
Types of Chi-Square Tests:
- The goodness-of-fit test determines whether an observed frequency distribution aligns with an expected distribution.
- The test of homogeneity assesses if different populations share the same distribution of a single categorical variable.
- The test of independence evaluates whether two categorical variables are independent within a single population.
Categorical data involves variables that represent groupings or categories rather than numerical values. These categories are usually qualitative and can be nominal (no inherent order) or ordinal (with a logical order). Each data point falls into one and only one category.
- An example would be the survival status of passengers on the Titanic, categorized by ticket class.
- The survival status could be classified as "Survived" or "Died."
- The ticket class includes categories such as First, Second, Third, and Crew.
A contingency table (also known as a cross-tabulation or crosstab) is a matrix used to display the frequency distribution of variables. It allows us to analyze the relationship between two or more categorical variables.
Example: The Titanic survival data organized into a 2×4 contingency table:
Status | First Class | Second Class | Third Class | Crew | Total |
---|---|---|---|---|---|
Survived | a | b | c | d | S |
Died | e | f | g | h | D |
Total | 325 | 285 | 706 | 885 | 2,201 |
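Here, $a$ through $h$ are the observed cell counts, $S$ and $D$ are the row totals of survivors and deaths, and the bottom row gives the column total for each ticket class.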
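As a rough illustration of how such a table is built from raw data, the sketch below tallies individual records into cells. The six records and the `crosstab` helper are invented for this example; they are not the Titanic data.

```python
from collections import Counter

# Hypothetical raw data: one (status, ticket_class) pair per person.
# These six records are illustrative only, not the real passenger list.
records = [
    ("Survived", "First"), ("Died", "First"), ("Died", "Third"),
    ("Died", "Crew"), ("Survived", "Third"), ("Died", "Crew"),
]

def crosstab(pairs):
    """Tally (row, column) pairs into a nested dict of cell counts."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}

table = crosstab(records)
first_row = next(iter(table.values()))

# Marginal totals, as in the "Total" row and column of a contingency table.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in first_row}
grand_total = sum(row_totals.values())

print(table)
print(row_totals, col_totals, grand_total)
```

Every record lands in exactly one cell, so the row totals and the column totals both add up to the grand total.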
Hypotheses for the goodness-of-fit test:
- The null hypothesis ($H_0$) states that the observed frequencies match the expected frequencies, indicating no significant difference between the observed and expected distributions.
- The alternative hypothesis ($H_A$) claims that the observed frequencies do not match the expected frequencies.
Suppose we want to test if the color distribution of M&Ms has changed since 2008.
2008 Expected Color Distribution:
Color | Percentage (%) |
---|---|
Blue | 24 |
Orange | 20 |
Green | 16 |
Yellow | 14 |
Red | 13 |
Brown | 13 |
Observed Counts: From a sample of 410 M&Ms, we record the number of each color.
Color | Count |
---|---|
Blue | 105 |
Orange | 91 |
Green | 70 |
Yellow | 50 |
Red | 45 |
Brown | 49 |
For each color, calculate the expected count ($E_i$):

$$E_i = N \times P_i$$

- $N$: Total sample size (410).
- $P_i$: Expected proportion for color $i$ (e.g., 24% for blue).

Example for Blue M&Ms:

$$E_{\text{Blue}} = 410 \times 0.24 = 98.4$$
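The same calculation for all six colors can be sketched in a few lines. The variable names are chosen for this example; the percentages and sample size are the ones given in the text.

```python
# Expected percentages from the 2008 table and the sample size from the text.
expected_pct = {"Blue": 24, "Orange": 20, "Green": 16,
                "Yellow": 14, "Red": 13, "Brown": 13}
n = 410

# E_i = N * P_i, with P_i converted from a percentage to a proportion.
expected = {color: n * pct / 100 for color, pct in expected_pct.items()}

for color, e in expected.items():
    print(f"{color:<7}{e:7.1f}")
```

Because the percentages sum to 100, the expected counts sum to the sample size, which is a quick sanity check on the inputs.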
The chi-square statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

- $O_i$: Observed frequency for category $i$.
- $E_i$: Expected frequency for category $i$.
- $k$: Number of categories (6 colors).
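A minimal sketch of this sum for the M&M sample follows. It keeps the per-color contributions so that it is visible which categories drive the statistic; the names `contributions` and `chi2` are chosen for this example.

```python
observed = {"Blue": 105, "Orange": 91, "Green": 70,
            "Yellow": 50, "Red": 45, "Brown": 49}
expected_pct = {"Blue": 24, "Orange": 20, "Green": 16,
                "Yellow": 14, "Red": 13, "Brown": 13}

n = sum(observed.values())  # total sample size

# Per-category contribution (O_i - E_i)^2 / E_i, with E_i = N * P_i.
contributions = {}
for color, o in observed.items():
    e = n * expected_pct[color] / 100
    contributions[color] = (o - e) ** 2 / e

chi2 = sum(contributions.values())

for color, term in contributions.items():
    print(f"{color:<7}{term:7.3f}")
print(f"chi-square = {chi2:.2f}")
```

Each term is non-negative, so a single badly fitting category is enough to inflate the statistic.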
Calculate the degrees of freedom:

$$df = k - 1$$

- For 6 colors: $df = 6 - 1 = 5$.
- The significance level ($\alpha$) is commonly set at 0.05.
- The critical value is obtained from the chi-square distribution table with $df = 5$.
- Compare the calculated $\chi^2$ with the critical value: if $\chi^2_{\text{calculated}}$ exceeds $\chi^2_{\text{critical}}$, reject $H_0$; otherwise, fail to reject $H_0$.
- Rejecting $H_0$ suggests that the color distribution has changed since 2008.
- Failing to reject $H_0$ indicates no significant change in the color distribution.
Analysis Results:
- Chi-square statistic: 4.32
- p-value: 0.5045
- Critical value: 11.07
- Decision: Fail to reject the null hypothesis, because 4.32 is below the critical value of 11.07 (equivalently, the p-value of 0.5045 exceeds 0.05).
Based on the sample of 410 M&Ms, there is no statistically significant evidence that the color distribution differs from the 2008 distribution.
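The p-value and critical value reported above would normally come from a table or a statistics package. As a self-contained sketch, they can also be approximated with the standard library alone. The helpers `chi2_sf` and `chi2_critical` are hypothetical names introduced here; the first uses the closed-form finite series that holds for integer degrees of freedom, and the second finds the critical value by bisection.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2))
    #         + sqrt(2x/pi) * exp(-x/2) * sum_j x^j / (1*3*...*(2j+1))
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail

def chi2_critical(alpha, df, lo=0.0, hi=1000.0):
    """Value x with P(X >= x) = alpha, found by bisection on chi2_sf."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid  # tail still too heavy, so move right
        else:
            hi = mid
    return (lo + hi) / 2

statistic, df, alpha = 4.32, 5, 0.05
p_value = chi2_sf(statistic, df)
critical = chi2_critical(alpha, df)
reject = statistic > critical

print(f"p-value = {p_value:.4f}, critical value = {critical:.2f}, reject H0: {reject}")
```

Comparing the statistic with the critical value and comparing the p-value with $\alpha$ always lead to the same decision; the p-value simply carries more information about how far the data are from the rejection threshold.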
Hypotheses for the test of homogeneity:
- Null hypothesis ($H_0$): The different populations have the same distribution of the categorical variable.
- Alternative hypothesis ($H_A$): At least one population has a different distribution.
We want to test whether survival rates are the same across ticket classes.
Data Summary:
Class | Survived | Died | Total |
---|---|---|---|
First Class | 203 | 122 | 325 |
Second Class | 118 | 167 | 285 |
Third Class | 178 | 528 | 706 |
Crew | 212 | 673 | 885 |
Total | 711 | 1,490 | 2,201 |
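Expected count for each cell:

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$

Example for First Class Survivors:

$$E_{\text{First, Survived}} = \frac{325 \times 711}{2201} \approx 104.99$$

The chi-square statistic is:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

- $r$: Number of rows (4 ticket classes).
- $c$: Number of columns (2 survival statuses).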
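The expected counts for every cell of the Titanic table can be sketched as follows. The nested-dict layout and variable names are choices made for this example; the counts are those in the data summary.

```python
observed = {
    "First Class":  {"Survived": 203, "Died": 122},
    "Second Class": {"Survived": 118, "Died": 167},
    "Third Class":  {"Survived": 178, "Died": 528},
    "Crew":         {"Survived": 212, "Died": 673},
}
statuses = ("Survived", "Died")

# Marginal totals of the table.
row_totals = {cls: sum(cells.values()) for cls, cells in observed.items()}
col_totals = {s: sum(observed[cls][s] for cls in observed) for s in statuses}
grand_total = sum(row_totals.values())

# E_ij = (row total * column total) / grand total
expected = {
    cls: {s: row_totals[cls] * col_totals[s] / grand_total for s in statuses}
    for cls in observed
}

for cls, cells in expected.items():
    print(f"{cls:<13}" + "".join(f"{cells[s]:10.2f}" for s in statuses))
```

The expected counts preserve the marginal totals: each row of `expected` sums to the same row total as the observed table.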
Calculate the degrees of freedom:

$$df = (r - 1)(c - 1)$$

- For 4 ticket classes and 2 survival statuses: $df = (4 - 1)(2 - 1) = 3$.
- The significance level ($\alpha$) is usually set at 0.05.
- The critical value is determined from the chi-square table with $df = 3$.
- Compare $\chi^2_{\text{calculated}}$ to $\chi^2_{\text{critical}}$: if $\chi^2_{\text{calculated}}$ exceeds $\chi^2_{\text{critical}}$, reject $H_0$; otherwise, fail to reject $H_0$.
- Rejecting $H_0$ indicates that survival rates differ across ticket classes.
- Failing to reject $H_0$ suggests no significant difference in survival rates among classes.
Analysis Results:
- Chi-square statistic: 190.40
- p-value: 4.9999e-41
- Critical value: 7.81
- Decision: Reject the null hypothesis. Survival rates differ across ticket classes.
This suggests that there is a significant difference in survival rates among the different ticket classes (First Class, Second Class, Third Class, and Crew) on the Titanic. The plot compares observed and expected counts for survival and death in each class, highlighting the differences between them.
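To see where the large statistic comes from, the sketch below recomputes it from the data summary and keeps each cell's contribution. The list-of-lists layout and the `contributions` dictionary are choices made for this example.

```python
classes = ["First Class", "Second Class", "Third Class", "Crew"]
statuses = ["Survived", "Died"]
observed = [[203, 122], [118, 167], [178, 528], [212, 673]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
contributions = {}
for i, cls in enumerate(classes):
    for j, status in enumerate(statuses):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        term = (observed[i][j] - e) ** 2 / e             # cell contribution
        contributions[(cls, status)] = term
        chi2 += term

df = (len(classes) - 1) * (len(statuses) - 1)

# List the cells from largest to smallest contribution.
for cell, term in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{cell[0]:<13}{cell[1]:<9}{term:8.2f}")
print(f"chi-square = {chi2:.2f}, df = {df}")
```

Sorting the contributions shows that the First Class survivors cell is by far the largest single term: many more first-class passengers survived than equal survival rates would predict.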
Hypotheses for the test of independence:
- The null hypothesis ($H_0$) states that the two categorical variables are independent.
- The alternative hypothesis ($H_A$) asserts that the variables are associated, meaning they are not independent.
Suppose we survey individuals to see if gender is associated with voting preference.
Data Summary:
Gender | Liberal | Conservative | Total |
---|---|---|---|
Male | 40 | 60 | 100 |
Female | 70 | 30 | 100 |
Total | 110 | 90 | 200 |
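Example for Male Liberals (row total times column total, divided by the grand total):

$$E_{\text{Male, Liberal}} = \frac{100 \times 110}{200} = 55$$

Calculate the degrees of freedom:

$$df = (2 - 1)(2 - 1) = 1$$

For a 2×2 table, apply Yates' correction to adjust for continuity:

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$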
- The significance level ($\alpha$) is commonly set at 0.05.
- The critical value is obtained from the chi-square table with $df = 1$.
- Compare $\chi^2_{\text{calculated}}$ to $\chi^2_{\text{critical}}$: if $\chi^2_{\text{calculated}}$ is greater than $\chi^2_{\text{critical}}$, reject $H_0$; otherwise, fail to reject $H_0$.
- Rejecting $H_0$ suggests a significant association between gender and voting preference.
- Failing to reject $H_0$ indicates no significant association.
Analysis Results:
- Chi-square statistic: 16.99
- p-value: 3.76e-05
- Critical value: 3.84
- Decision: Reject the null hypothesis. There is a significant association between gender and voting preference.
This result suggests that gender is significantly associated with voting preference in the observed data. The plot compares observed and expected counts for "Liberal" and "Conservative" preferences across genders.
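The reported statistic of 16.99 includes Yates' continuity correction. The sketch below computes the statistic both with and without the correction, so the size of the adjustment is visible. The function name `chi2_2x2` and its `yates` flag are hypothetical; the p-value uses the closed form that holds for one degree of freedom.

```python
import math

# Rows: Male, Female. Columns: Liberal, Conservative.
observed = [[40, 60], [70, 30]]

def chi2_2x2(table, yates=True):
    """Chi-square statistic for a 2x2 table, optionally Yates-corrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / total
            diff = abs(table[i][j] - e)
            if yates:
                # Shrink each deviation by 0.5, but never below zero.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat

corrected = chi2_2x2(observed, yates=True)
uncorrected = chi2_2x2(observed, yates=False)

# For df = 1 the upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(corrected / 2))

print(f"with Yates: {corrected:.2f}, without: {uncorrected:.2f}, p = {p_value:.2e}")
```

The correction always lowers the statistic, which makes the test more conservative. With counts this large the decision is the same either way.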
Although the test of homogeneity and the test of independence use the same chi-square statistic and the same computations, they differ in their applications and interpretations.
Test of homogeneity:
- The objective is to determine whether multiple populations share the same distribution of a single categorical variable.
- The data structure involves separate random samples drawn from different populations.
- An example would be comparing the distribution of M&M colors across different product lines, such as milk chocolate, peanut, and caramel.
Test of independence:
- The objective is to evaluate whether two categorical variables are independent within a single population.
- The data structure consists of one random sample where observations are categorized by two variables.
- An example would be examining the relationship between gender and voting preference in a survey.
The population focus differs:
- In a homogeneity test, there are multiple populations being compared.
- In an independence test, we look at a single population.
Regarding the research question:
- In a homogeneity test, we ask whether distributions are the same across different populations.
- In an independence test, we explore whether variables are associated within the population.
For chi-square tests to be valid, several conditions must be met:
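- Random sampling: the data should come from a random sample of the population (or populations) of interest.
- Expected frequencies: the expected count in each cell should be at least 5.
- Independence of observations: each observation should be counted in exactly one cell and should not influence any other observation.
- Categorical data: the variables analyzed should be categorical.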
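Of these conditions, the expected-frequency rule is the one that can be checked mechanically from a table. A minimal sketch follows, assuming a two-way table stored as a list of rows; the helper names and the `sparse` table are hypothetical and used only to show a failing case.

```python
def expected_counts(table):
    """Expected counts under H0: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def small_expected_cells(table, threshold=5.0):
    """Return (row, column) indices of cells whose expected count is too small."""
    return [
        (i, j)
        for i, row in enumerate(expected_counts(table))
        for j, e in enumerate(row)
        if e < threshold
    ]

survey = [[40, 60], [70, 30]]  # gender by voting preference, from the text
sparse = [[12, 3], [9, 2]]     # hypothetical small sample, for illustration

print(small_expected_cells(survey))  # no cells flagged
print(small_expected_cells(sparse))  # cells in the second column are flagged
```

Note that the check applies to expected counts, not observed ones: a cell with a small observed count can still satisfy the condition if its expected count is large enough.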
- State the hypotheses by defining $H_0$ and $H_A$.
- Collect data and organize observed frequencies into a contingency table.
- Calculate expected counts using the appropriate formulas based on the test.
- Compute the chi-square statistic by applying the chi-square formula.
- Determine degrees of freedom based on the dimensions of the table.
- Find the critical value or p-value using chi-square distribution tables or statistical software.
- Make a decision by comparing $\chi^2_{\text{calculated}}$ with $\chi^2_{\text{critical}}$.
- Interpret the results and draw conclusions in the context of the research question.
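The computational steps above can be sketched as a single function for a two-way table. This is only an outline under stated assumptions: the function name `chi_square_decision` is hypothetical, the critical values are the standard table entries for $\alpha = 0.05$ and are hard-coded for $df \le 5$ only, and no continuity correction is applied.

```python
# Standard chi-square critical values at alpha = 0.05, indexed by df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def chi_square_decision(table):
    """Run the computational steps on an r x c table of observed counts."""
    # Expected counts from the marginal totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    # Chi-square statistic summed over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Degrees of freedom from the table dimensions.
    df = (len(row_totals) - 1) * (len(col_totals) - 1)

    # Critical value lookup and decision.
    critical = CRITICAL_05[df]
    return {"statistic": statistic, "df": df,
            "critical": critical, "reject_h0": statistic > critical}

# Titanic data: rows are ticket classes, columns are Survived and Died.
titanic = [[203, 122], [118, 167], [178, 528], [212, 673]]
result = chi_square_decision(titanic)
print(result)
```

Stating the hypotheses, collecting the data, and interpreting the result remain judgment steps that the function cannot do; it only covers the arithmetic in between.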