Contrast coding systems for categorical variables #757

Closed
awellis opened this issue Jan 5, 2015 · 6 comments

Comments

@awellis

awellis commented Jan 5, 2015

Are there any plans to add more contrast coding systems for categorical variables? At the moment, contr_treatment seems to be the only one.
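For reference, here is a minimal sketch of what three common coding systems look like as contrast matrices (rows are levels, columns are the generated predictors). It follows the conventions of R's `contr.treatment`, `contr.sum` and `contr.helmert`; the Python function names are purely illustrative and are not part of DataFrames or any other package.

```python
def treatment_contrasts(k):
    """Level 0 is the reference; level i (i >= 1) gets its own 0/1 column."""
    return [[1.0 if i == j + 1 else 0.0 for j in range(k - 1)] for i in range(k)]


def sum_contrasts(k):
    """Each of the first k-1 levels gets a column; the last level is -1 everywhere."""
    rows = []
    for i in range(k):
        if i == k - 1:
            rows.append([-1.0] * (k - 1))
        else:
            rows.append([1.0 if i == j else 0.0 for j in range(k - 1)])
    return rows


def helmert_contrasts(k):
    """Column j compares level j+1 against the levels before it."""
    rows = []
    for i in range(k):
        row = []
        for j in range(k - 1):
            if i <= j:
                row.append(-1.0)
            elif i == j + 1:
                row.append(float(j + 1))
            else:
                row.append(0.0)
        rows.append(row)
    return rows


# Contrast matrices for a three-level variable such as "A", "B", "C".
treatment3 = treatment_contrasts(3)
sum3 = sum_contrasts(3)
helmert3 = helmert_contrasts(3)
```

Each system produces k - 1 columns for k levels, so all three can be combined with an intercept without making the model matrix rank deficient.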

@johnmyleswhite
Contributor

I don't know if anyone is planning to work on that soon, but it seems clearly worth having eventually. @dmbates might have more to say. I'm still planning on rewriting our underlying representation of categorical variables, but that should be orthogonal.

@awellis
Author

awellis commented Jan 5, 2015

Another thing: in R, omitting the intercept creates indicator (dummy) variables for all levels:

df <- as.data.frame(c("A", "A", "B", "C", "C", "B"))
names(df) <- "var"
df <- model.matrix(~ 0 + df$var)

will return

  df$varA df$varB df$varC
1       1       0       0
2       1       0       0
3       0       1       0
4       0       0       1
5       0       0       1
6       0       1       0

In Julia, if I omit the intercept, I get only two columns:

df = DataFrame(var = ["A", "A", "B", "C", "C", "B"], y = randn(6))
pool!(df, [:var])
mm = ModelMatrix(ModelFrame( y ~ var + 0, df))
ModelMatrix{Float64}(6x2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0,[1,1])

No dummy variable is created for the reference level ("A"). Is this intended?
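To make the expected behaviour concrete, here is a small language-neutral sketch of the rule R applies to a single categorical variable: with an intercept the reference level is dropped, and without one every level gets its own indicator column. The helper `indicator_columns` and the `(Intercept)` and `var` column labels are illustrative assumptions, not an existing API.

```python
def indicator_columns(values, intercept):
    """Build 0/1 indicator columns for one categorical variable.

    With an intercept the first (reference) level is dropped;
    without one, every level gets its own column.
    """
    levels = sorted(set(values))
    kept = levels[1:] if intercept else levels
    names = (["(Intercept)"] if intercept else []) + ["var" + lv for lv in kept]
    rows = []
    for v in values:
        row = [1.0] if intercept else []
        row += [1.0 if v == lv else 0.0 for lv in kept]
        rows.append(row)
    return names, rows


values = ["A", "A", "B", "C", "C", "B"]

# Without an intercept: three columns, one per level, as in the R output.
names_full, rows_full = indicator_columns(values, intercept=False)

# With an intercept: "A" is absorbed into the intercept column.
names_ref, rows_ref = indicator_columns(values, intercept=True)
```

Either way the matrix has three linearly independent columns, whereas the two-column result shown above cannot distinguish level "A" from an all-zero row.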

@johnmyleswhite
Contributor

That seems like a bug to me.

@awellis
Author

awellis commented Jan 5, 2015

Should I open a separate issue?

@kleinschmidt
Contributor

The contrast generation code was pretty primitive last time I looked. The most obvious problem is that it doesn't check whether an intercept is present. Things get tricky when you have more than one categorical variable (in R, only the first categorical variable gets the full complement of predictors), or interactions between categorical predictors and other predictors (you need to include more levels when an interaction is included but the lower-order terms are not).
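One way to express these cases is a single rule: a categorical variable in a term needs the full set of indicator columns unless the term obtained by dropping that variable has already been accounted for. The sketch below is a simplified illustration of that idea, not the code of any package. It assumes every variable is categorical, represents terms as tuples of variable names, and uses the hypothetical name `choose_codings`.

```python
def choose_codings(terms, has_intercept):
    """Decide, per term and variable, between 'contrast' (k-1 columns)
    and 'full' (k indicator columns).

    terms: sequence of tuples of variable names, e.g. [("a",), ("a", "b")].
    The empty term frozenset() stands for the intercept.
    """
    seen = set()
    if has_intercept:
        seen.add(frozenset())
    codings = {}
    for term in terms:
        term_set = frozenset(term)
        key = tuple(sorted(term_set))
        for factor in key:
            # The term that would absorb this variable's reference level.
            aliased = term_set - {factor}
            if aliased in seen:
                codings[(key, factor)] = "contrast"
            else:
                codings[(key, factor)] = "full"
                # Full coding spans the aliased term, so mark it as covered.
                seen.add(aliased)
        seen.add(term_set)
    return codings


# y ~ 0 + a + b: only the first variable gets the full complement.
no_intercept = choose_codings([("a",), ("b",)], has_intercept=False)

# y ~ 1 + a + a:b: the main effect of b is missing, so inside a:b
# the variable a is promoted to full coding.
missing_lower = choose_codings([("a",), ("a", "b")], has_intercept=True)
```

Under these assumptions, the first example reproduces the R behaviour for multiple categorical variables, and the second shows the interaction case where extra levels are needed because a lower-order term is absent.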

I think someone had opened an issue or pull request about other contrast types a long time ago but obviously it didn't make it in.

@nalimilan
Member

Fixed by #870.


4 participants