Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is a fundamental concept in statistics, enabling researchers and analysts to understand how one variable may predict or relate to another. The most commonly used correlation coefficients are the Pearson correlation coefficient and the Spearman rank correlation coefficient.
- A positive correlation occurs when, as one variable increases, the other tends to rise as well.
- In contrast, a negative correlation happens when an increase in one variable is associated with a decrease in the other.
- Lastly, a zero correlation indicates that there is no linear relationship between the variables.
Important Note: Correlation does not imply causation. A high correlation between two variables does not mean that one variable causes changes in the other.
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It is sensitive to outliers and assumes that the relationship is linear and that both variables are normally distributed.
The Pearson correlation coefficient $r$ is defined as:

$$r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
Where:
- $\text{Cov}(X, Y)$ is the covariance between variables $X$ and $Y$.
- $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.
Alternative Formula:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
Where:
- $n$ is the number of observations.
- $X_i$ and $Y_i$ are the $i$-th observations of $X$ and $Y$.
- $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$.
- $r = 1$: Perfect positive linear correlation.
- $r = -1$: Perfect negative linear correlation.
- $r = 0$: No linear correlation.
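For illustration, here is a minimal sketch in plain Python that computes $r$ directly from the deviation form of the formula (the helper name `pearson_r` is hypothetical, not from the original text):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from the deviation form of the formula (illustrative sketch)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the square roots of the sums of squared deviations
    den = sqrt(sum((xi - mean_x) ** 2 for xi in x)) * \
          sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / den

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # approximately 0.775
```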
General Guidelines:
| Correlation Strength | Range (r) |
|---|---|
| Strong Positive Correlation | 0.7 ≤ r ≤ 1.0 |
| Moderate Positive Correlation | 0.3 ≤ r < 0.7 |
| Weak Positive Correlation | 0 < r < 0.3 |
| No Correlation | r = 0 |
| Weak Negative Correlation | -0.3 < r < 0 |
| Moderate Negative Correlation | -0.7 < r ≤ -0.3 |
| Strong Negative Correlation | -1.0 ≤ r ≤ -0.7 |
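These thresholds are rules of thumb rather than hard cutoffs. As a small illustration, the sketch below simply translates the table into code (the helper name `describe_strength` is hypothetical):

```python
def describe_strength(r):
    """Map a correlation coefficient to the rule-of-thumb labels in the table above."""
    if r == 0:
        return "No Correlation"
    magnitude = abs(r)
    sign = "Positive" if r > 0 else "Negative"
    if magnitude >= 0.7:
        strength = "Strong"
    elif magnitude >= 0.3:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {sign} Correlation"

print(describe_strength(0.85))   # Strong Positive Correlation
print(describe_strength(-0.4))   # Moderate Negative Correlation
```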
Consider the following data on the number of hours studied ($X$) and the corresponding test scores ($Y$):
| Observation ($i$) | Hours Studied ($X$) | Test Score ($Y$) |
|---|---|---|
| 1 | 1 | 50 |
| 2 | 2 | 60 |
| 3 | 3 | 70 |
| 4 | 4 | 80 |
| 5 | 5 | 90 |
First, compute the sample means: $\bar{X} = \frac{1+2+3+4+5}{5} = 3$ and $\bar{Y} = \frac{50+60+70+80+90}{5} = 70$. Then compute the deviations from the means, their products, and their squares:
| $i$ | $X_i$ | $Y_i$ | $X_i - \bar{X}$ | $Y_i - \bar{Y}$ | $(X_i - \bar{X})(Y_i - \bar{Y})$ | $(X_i - \bar{X})^2$ | $(Y_i - \bar{Y})^2$ |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 50 | -2 | -20 | 40 | 4 | 400 |
| 2 | 2 | 60 | -1 | -10 | 10 | 1 | 100 |
| 3 | 3 | 70 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 80 | 1 | 10 | 10 | 1 | 100 |
| 5 | 5 | 90 | 2 | 20 | 40 | 4 | 400 |
| Sum | | | | | 100 | 10 | 1000 |
Compute the denominators:

$$\sqrt{\sum (X_i - \bar{X})^2} = \sqrt{10}, \qquad \sqrt{\sum (Y_i - \bar{Y})^2} = \sqrt{1000}$$
Now compute $r$:

$$r = \frac{100}{\sqrt{10} \times \sqrt{1000}} = \frac{100}{\sqrt{10000}} = \frac{100}{100} = 1$$
Pearson's $r = 1$, indicating a perfect positive linear correlation: every additional hour studied corresponds to exactly 10 more points on the test.
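Assuming NumPy is available, the worked result can be double-checked with `np.corrcoef`, which returns the full correlation matrix:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
scores = np.array([50, 60, 70, 80, 90])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(hours, scores)[0, 1]
print(r)  # 1.0
```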
The Spearman rank correlation coefficient measures the strength and direction of the monotonic relationship between two ranked variables. It is a non-parametric measure, meaning it does not assume a specific distribution for the variables and is less sensitive to outliers.
The Spearman correlation coefficient $\rho$ is calculated as:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Where:
- $d_i = R(X_i) - R(Y_i)$ is the difference between the ranks of $X_i$ and $Y_i$.
- $R(X_i)$ and $R(Y_i)$ are the ranks of $X_i$ and $Y_i$, respectively.
- $n$ is the number of observations.
To compute Spearman's $\rho$:

- Assign ranks to the data points in $X$ and $Y$ separately.
- Compute the differences of ranks $d_i$.
- Square the differences to obtain $d_i^2$.
- Compute $\rho$ using the formula above.
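These steps can be turned into a short sketch in plain Python (the helper names `rank` and `spearman_rho` are hypothetical; the sketch assumes there are no tied values, since ties would require averaged ranks):

```python
def rank(values):
    """Rank values from 1 to n (no handling of ties in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    # Squared differences between paired ranks
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3, 4, 5], [2, 5, 3, 8, 7]))  # 0.8
```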
Using the same dataset:
| Observation ($i$) | Hours Studied ($X$) | Test Score ($Y$) |
|---|---|---|
| 1 | 1 | 50 |
| 2 | 2 | 60 |
| 3 | 3 | 70 |
| 4 | 4 | 80 |
| 5 | 5 | 90 |
Since the data is already ordered, the ranks correspond to the order of observations.
| Observation ($i$) | $X_i$ | Rank $R(X_i)$ | $Y_i$ | Rank $R(Y_i)$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 50 | 1 |
| 2 | 2 | 2 | 60 | 2 |
| 3 | 3 | 3 | 70 | 3 |
| 4 | 4 | 4 | 80 | 4 |
| 5 | 5 | 5 | 90 | 5 |
Calculate the rank differences $d_i = R(X_i) - R(Y_i)$ and their squares $d_i^2$:
| Observation ($i$) | Rank $R(X_i)$ | Rank $R(Y_i)$ | $d_i$ | $d_i^2$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 |
| 2 | 2 | 2 | 0 | 0 |
| 3 | 3 | 3 | 0 | 0 |
| 4 | 4 | 4 | 0 | 0 |
| 5 | 5 | 5 | 0 | 0 |
| Sum | | | 0 | 0 |
Substituting into the formula:

$$\rho = 1 - \frac{6 \times 0}{5(5^2 - 1)} = 1 - 0 = 1$$

Spearman's $\rho = 1$, indicating a perfect monotonic relationship between hours studied and test scores.
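Assuming SciPy is available, the same value can be confirmed with `scipy.stats.spearmanr`:

```python
from scipy import stats

hours = [1, 2, 3, 4, 5]
scores = [50, 60, 70, 80, 90]

# spearmanr returns the coefficient and a p-value
rho, p_value = stats.spearmanr(hours, scores)
print(rho)  # 1.0 -- perfect monotonic relationship
```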
Spearman's $\rho$ is typically preferred in the following situations:

- When the data are ordinal.
- When the relationship between variables is monotonic but not necessarily linear.
- When there are outliers that might affect Pearson's $r$.
- When the variables are not normally distributed.
- Pearson's $r$ measures the strength of a linear relationship.
- Spearman's $\rho$ measures the strength of a monotonic relationship.
- Both coefficients range from -1 to 1.
- Spearman's $\rho$ is less sensitive to outliers and does not require the assumption of normality.
For two random variables $X$ and $Y$, the population correlation coefficient is defined as:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
Where:
- $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
- $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$.
| Property | Description |
|---|---|
| Range | $-1 \leq \rho_{XY} \leq 1$ |
| Symmetry | $\rho_{XY} = \rho_{YX}$ |
| Dimensionless | The correlation coefficient is unitless. |
| Linearity | If $Y = aX + b$ with $a \neq 0$, then $\rho_{XY} = \pm 1$, with the sign of $a$. |
| Independence | If $X$ and $Y$ are independent, then $\rho_{XY} = 0$; the converse does not hold in general. |
- $\rho_{XY} > 0$: Positive linear relationship.
- $\rho_{XY} < 0$: Negative linear relationship.
- $\rho_{XY} = 0$: No linear relationship.
Suppose $X$ and $Y$ are random variables with:

- $\mu_X = \mathbb{E}[X] = 5$
- $\mu_Y = \mathbb{E}[Y] = 10$
- $\sigma_X = \sqrt{\text{Var}(X)} = 2$
- $\sigma_Y = \sqrt{\text{Var}(Y)} = 4$
- $\text{Cov}(X, Y) = 6$
Compute the correlation coefficient:

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{6}{2 \times 4} = \frac{6}{8} = 0.75$$
Interpretation:
- A correlation coefficient of $0.75$ indicates a strong positive linear relationship between $X$ and $Y$.
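As a rough numerical check (assuming NumPy, and taking the joint distribution to be bivariate normal purely for illustration, which the example above does not specify), simulated draws with this covariance structure should show an empirical correlation near 0.75:

```python
import numpy as np

rng = np.random.default_rng(0)

mean = [5, 10]          # mu_X, mu_Y
cov = [[4, 6],          # Var(X) = sigma_X^2 = 4, Cov(X, Y) = 6
       [6, 16]]         # Cov(X, Y) = 6, Var(Y) = sigma_Y^2 = 16

samples = rng.multivariate_normal(mean, cov, size=100_000)
empirical_rho = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print(empirical_rho)  # approximately 0.75
```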
- Correlation does not imply causation. A high correlation between two variables does not mean that one variable causes changes in the other.
- Pearson's $r$ is sensitive to outliers, which can distort the correlation coefficient.
- Spearman's $\rho$ is more robust to outliers due to ranking.
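A small sketch (assuming SciPy) illustrates the point: a single outlier noticeably shifts Pearson's $r$, while Spearman's $\rho$ is unchanged because the data remain monotonically ordered:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 200]  # last point is an outlier

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(r, rho)  # Pearson drops to around 0.6; Spearman stays at 1.0
```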
- Variables can have a strong non-linear relationship but a low Pearson correlation coefficient.
- In such cases, Spearman's $\rho$ may detect the monotonic relationship.
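For example (again assuming SciPy), an exponential relationship is strictly increasing, so Spearman's $\rho$ equals 1 even though Pearson's $r$ falls well short of 1:

```python
import math
from scipy import stats

x = list(range(1, 11))
y = [math.exp(v) for v in x]  # strongly non-linear but strictly increasing

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(r, rho)  # Pearson is well below 1 (around 0.7); Spearman is exactly 1.0
```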
Pearson's $r$ rests on several assumptions:

- Linearity means that the relationship between $X$ and $Y$ follows a straight-line pattern.
- Normality refers to the condition where both $X$ and $Y$ are normally distributed.
- Homoscedasticity means that the variance of $Y$ remains consistent across all values of $X$.
If these assumptions are violated, Pearson's $r$ can be misleading, and a rank-based measure such as Spearman's $\rho$ may be more appropriate.
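One informal way to probe the normality assumption before relying on Pearson's $r$ is a Shapiro-Wilk test (a sketch assuming SciPy; note that this checks marginal normality only, not linearity or homoscedasticity, and the data here are made up for illustration):

```python
from scipy import stats

x = [1.2, 2.4, 2.9, 3.7, 4.1, 5.3, 5.9, 6.8, 7.5, 8.2]
y = [2.1, 2.8, 3.5, 4.2, 4.0, 5.6, 6.1, 6.5, 7.3, 7.9]

# Shapiro-Wilk: a small p-value is evidence against normality
for name, sample in (("X", x), ("Y", y)):
    stat, p = stats.shapiro(sample)
    print(f"{name}: W={stat:.3f}, p={p:.3f}")

# If normality looks doubtful, Spearman's rho is a reasonable fallback
rho, _ = stats.spearmanr(x, y)
print(rho)
```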