-
-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check_collinearity() does not work with orthogonal polynomials #733
Comments
Have you checked with |
car returns an error: x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)
fit <- lm(y ~ poly(x, degree = 2), data = df) performance::check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#> multicollinearity.
#> NULL car::vif(fit)
#> Error in vif.default(fit): model contains fewer than 2 terms Created on 2024-06-12 with reprex v2.1.0 |
I think that makes sense - there is only 1 underlying variable, and adding orthogonal functions of the variable won't have any collinearity between them. |
They won't have collinearity and they are powered by a single variable, but technically won't there be one term per degree? And in some weird world, couldn't it be possible for the different terms to have different VIFs in a model with additional variables? |
Also, note that this same behavior occurs even when using raw polynomials, which definitely could be collinear: x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)
fit <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = df) performance::check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#> multicollinearity.
#> NULL car::vif(fit)
#> Error in vif.default(fit): model contains fewer than 2 terms Created on 2024-06-12 with reprex v2.1.0 |
poly is considered one term - like factors or matrices: data("mtcars")
mod <- lm(mpg ~ cbind(disp, hp) + poly(qsec, 2) + factor(cyl),
data = mtcars)
insight::find_terms(mod)
#> $response
#> [1] "mpg"
#>
#> $conditional
#> [1] "cbind(disp, hp)" "poly(qsec, 2)" "factor(cyl)" anova(mod)
#> Analysis of Variance Table
#>
#> Response: mpg
#> Df Sum Sq Mean Sq F value Pr(>F)
#> cbind(disp, hp) 2 842.55 421.28 51.7367 1.3e-09 ***
#> poly(qsec, 2) 2 11.95 5.97 0.7337 0.49020
#> factor(cyl) 2 67.98 33.99 4.1742 0.02728 *
#> Residuals 25 203.57 8.14
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 performance::check_collinearity(mod)
#> # Check for Multicollinearity
#>
#> Low Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> poly(qsec, 2) 4.39 [ 2.91, 6.99] 2.09 0.23 [0.14, 0.34]
#>
#> Moderate Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> factor(cyl) 9.45 [ 5.99, 15.31] 3.07 0.11 [0.07, 0.17]
#>
#> High Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> cbind(disp, hp) 23.03 [14.25, 37.63] 4.80 0.04 [0.03, 0.07] Created on 2024-06-13 with reprex v2.1.0 If you want each feature to be a term, you will need to hand code them. mod <- lm(mpg ~ disp + hp + qsec + I(qsec^2) + I(cyl==6) + I(cyl==8),
data = mtcars)
insight::find_terms(mod)
#> $response
#> [1] "mpg"
#>
#> $conditional
#> [1] "disp" "hp" "qsec" "I(qsec^2)" "I(cyl == 6)"
#> [6] "I(cyl == 8)" anova(mod)
#> Analysis of Variance Table
#>
#> Response: mpg
#> Df Sum Sq Mean Sq F value Pr(>F)
#> disp 1 808.89 808.89 99.3390 3.428e-10 ***
#> hp 1 33.67 33.67 4.1344 0.05277 .
#> qsec 1 6.71 6.71 0.8235 0.37282
#> I(qsec^2) 1 5.24 5.24 0.6438 0.42990
#> I(cyl == 6) 1 59.13 59.13 7.2616 0.01241 *
#> I(cyl == 8) 1 8.85 8.85 1.0867 0.30718
#> Residuals 25 203.57 8.14
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 performance::check_collinearity(mod)
#> # Check for Multicollinearity
#>
#> Low Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> I(cyl == 6) 2.09 [ 1.53, 3.25] 1.45 0.48 [0.31, 0.65]
#>
#> Moderate Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> disp 7.88 [ 5.03, 12.72] 2.81 0.13 [0.08, 0.20]
#> hp 9.03 [ 5.74, 14.62] 3.01 0.11 [0.07, 0.17]
#>
#> High Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> qsec 348.67 [212.31, 573.00] 18.67 2.87e-03 [0.00, 0.00]
#> I(qsec^2) 320.53 [195.20, 526.75] 17.90 3.12e-03 [0.00, 0.01]
#> I(cyl == 8) 11.38 [ 7.17, 18.48] 3.37 0.09 [0.05, 0.14] Created on 2024-06-13 with reprex v2.1.0 |
If I modified the function to do the following, would you accept the PR?
|
I don't think it makes sense to treat polynomials as being more than one variable/term. The collinearity of polynomial terms is non-essential, so VIF diagnostics don't make much sense for them. Collinearity isn't diagnostic of model issues for polynomial terms (raw or not)--the use of orthogonal polynomials is for computational reasons, not model validity reasons. |
That makes sense to me. What about just a message when giving it only a poly term that explains this? For example: check_collinearity(lm(y ~ poly(x, 2), data = df))
#> Message: Polynomial transformations are considered a single term and thus VIFs are not calculated between them. |
I think a note in the details section of the function documentation should suffice to avoid a very verbose messaging |
Related to easystats/see#346
I know that orthogonal polynomials won't be collinear (that's their point). But I want to be able to show that while teaching.
Created on 2024-06-12 with reprex v2.1.0
When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?
Created on 2024-06-12 with reprex v2.1.0
The text was updated successfully, but these errors were encountered: