Welcome to Day 17 of the 30 Days of Data Science series! Today, we delve into Hypothesis Testing, a fundamental concept in statistics, widely used to make data-driven decisions. This session will focus on t-tests and chi-square tests, two commonly used techniques for hypothesis testing.
- Hypothesis Testing: Basics, importance, and applications.
- t-Tests: Types and examples (one-sample, two-sample).
- Chi-Square Test: Concepts and practical applications.
Hypothesis Testing is a statistical method used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population.
- Null Hypothesis (H₀): Assumes no effect or no difference in the population.
- Alternative Hypothesis (H₁): States that an effect or difference exists in the population.
Example:
- H₀: The average height of students is 5.5 feet.
- H₁: The average height of students is not 5.5 feet.
- State the hypotheses: Define H₀ and H₁.
- Choose a significance level (α): Commonly 0.05.
- Select the appropriate test: t-test, chi-square, etc.
- Calculate the test statistic and its p-value: Using the chosen method on the sample data.
- Make a decision: Compare the p-value to α.
  - p-value ≤ α: Reject H₀ (evidence supports H₁).
  - p-value > α: Fail to reject H₀ (the data do not provide enough evidence against it).
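The steps above can be sketched end to end in a few lines. This is a minimal illustration for the height example, assuming a small made-up sample; the `heights` values and the variable names are hypothetical and are not data from this series.

```python
from scipy.stats import ttest_1samp

# Step 1: hypotheses. H0: mean height = 5.5 ft; H1: mean height != 5.5 ft.
hypothesized_mean = 5.5

# Step 2: significance level.
alpha = 0.05

# Hypothetical sample of student heights in feet (made up for illustration).
heights = [5.4, 5.6, 5.5, 5.7, 5.3, 5.8, 5.5, 5.6, 5.4, 5.7]

# Steps 3 and 4: a one-sample t-test returns the test statistic and p-value.
t_stat, p_value = ttest_1samp(heights, hypothesized_mean)

# Step 5: compare the p-value to alpha.
if p_value <= alpha:
    decision = "Reject H0"
else:
    decision = "Fail to reject H0"

print(f"t = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
```

With this particular sample the p-value is above 0.05, so the decision is to fail to reject H₀. A different sample could lead to the opposite decision.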
A t-test compares means to determine whether an observed difference is statistically significant. It assumes the data are approximately normally distributed; the standard two-sample version also assumes the two groups have equal variances.
- One-Sample t-Test: Compares the sample mean to a known value.
- Two-Sample t-Test: Compares the means of two independent groups.
- Paired t-Test: Compares the means of two related sets of measurements, such as the same group at different times.
from scipy.stats import ttest_1samp
import numpy as np
# Sample data
data = [12, 15, 14, 10, 13, 12, 14, 15, 11]
pop_mean = 13
# Perform t-test
t_stat, p_value = ttest_1samp(data, pop_mean)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")
Output:
T-statistic: -0.189
P-value: 0.855
- Since p-value > 0.05, we fail to reject H₀.
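To see where the t-statistic comes from, the sketch below recomputes it by hand for the same sample using t = (x̄ − μ₀) / (s / √n), where x̄ is the sample mean, s the sample standard deviation, and n the sample size. It then compares the result with `ttest_1samp`. The helper variable names are our own and are only for illustration.

```python
import math
from scipy import stats

data = [12, 15, 14, 10, 13, 12, 14, 15, 11]
pop_mean = 13

n = len(data)
sample_mean = sum(data) / n
# The sample variance uses n - 1 in the denominator.
sample_var = sum((x - sample_mean) ** 2 for x in data) / (n - 1)
standard_error = math.sqrt(sample_var / n)

# Test statistic: how many standard errors the sample mean is from pop_mean.
t_manual = (sample_mean - pop_mean) / standard_error
# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(data, pop_mean)
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.3f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.3f}")
```

Both lines should print the same values, which confirms that the library call is doing nothing more than this formula plus a lookup in the t distribution.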
from scipy.stats import ttest_ind
# Two independent groups
group1 = [22, 24, 19, 23, 21]
group2 = [30, 29, 34, 28, 27]
# Perform t-test
t_stat, p_value = ttest_ind(group1, group2)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
Output:
T-statistic: -5.259
P-value: 0.0008
- Since p-value ≤ 0.05, we reject H₀ and conclude there is a significant difference.
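The paired t-test listed earlier has no example of its own, so here is a minimal sketch using SciPy's `ttest_rel`. The `before` and `after` scores are hypothetical values invented for illustration; each position in the two lists is assumed to belong to the same person.

```python
from scipy.stats import ttest_rel

# Hypothetical scores for the same five people before and after training.
before = [72, 68, 75, 80, 66]
after = [75, 71, 78, 82, 70]

# The paired test works on the per-person differences (before - after).
t_stat, p_value = ttest_rel(before, after)

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value <= 0.05:
    print("Reject H0: the mean score changed.")
else:
    print("Fail to reject H0.")
```

Because every person improved by a similar amount, the differences vary little and the test rejects H₀ even with only five pairs. Running `ttest_ind` on the same lists would ignore the pairing and generally be less sensitive.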
The Chi-Square Test determines whether there is a significant association between two categorical variables. Here H₀ states that the variables are independent, and H₁ states that they are associated.
import numpy as np
from scipy.stats import chi2_contingency
# Contingency table
data = np.array([[50, 30], [20, 100]])
# Perform chi-square test (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p, dof, expected = chi2_contingency(data)
print(f"Chi-Square Statistic: {chi2:.2f}")
print(f"P-value: {p:.1e}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")
Output:
Chi-Square Statistic: 42.33
P-value: 7.7e-11
Degrees of Freedom: 1
Expected Frequencies:
[[28. 52.]
 [42. 78.]]
- Since p-value ≤ 0.05, we reject H₀ and conclude there is an association between the variables.
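The expected frequencies are the counts we would see if the two variables were independent: each cell is (row total × column total) / grand total. The sketch below rebuilds them by hand for the same table and computes the plain Pearson statistic. It passes `correction=False` to `chi2_contingency` so that SciPy skips the continuity correction it applies to 2x2 tables by default and the two results match. The variable names are our own.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[50, 30], [20, 100]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected_manual = row_totals * col_totals / grand_total

# Pearson chi-square statistic, without continuity correction.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(f"Expected (manual):\n{expected_manual}")
print(f"Chi-square (manual): {chi2_manual:.2f}")
print(f"Chi-square (SciPy, no correction): {chi2_scipy:.2f}")
```

The uncorrected statistic is slightly larger than the default corrected one, since the correction shrinks each |observed − expected| by 0.5 before squaring.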
- Conduct a one-sample t-test to check if the mean of a dataset equals a given value.
- Perform a two-sample t-test on two independent datasets.
- Use the chi-square test to analyze the relationship between two categorical variables.
- Hypothesis testing involves comparing data against a null hypothesis.
- t-tests assess differences in means for one or two groups.
- Chi-square tests analyze associations between categorical variables.
- Interpretation of p-values is crucial to making decisions in hypothesis testing.