generating model matrix broken for categorical variables #788

Closed
kleinschmidt opened this issue Mar 24, 2015 · 3 comments

@kleinschmidt

I came across this issue while writing tests for the formula/statsmodel functionality: if you create a model matrix for two different subsets of a data frame where the levels of a categorical variable are different in the two subsets, you get different columns in your model matrix.

using DataFrames

dd = DataFrame()
dd[:x] = @pdata round(rand(100) * 3)
dd[:y] = 2 + dd[:x]

f = y ~ x

mf = ModelFrame(f, dd)
mm = ModelMatrix(mf)

dd_sub = dd[ dd[:x] .!= 1, : ]
mf_sub = ModelFrame(f, dd_sub)
mm_sub = ModelMatrix(mf_sub)

# Compare the number of columns (dimension 2) of the two model matrices
size(mm, 2)                             # == 4
size(mm_sub, 2)                         # == 3

I think the reason that we're doing it this way is to avoid creating unnecessary columns in the model matrix. This is fine when you just want to fit a model to different subsets of a DataFrame, but if you're trying to generate predictions from a fitted model, you'll get an error whenever the new data don't contain every level that was present when the model was fit:

using GLM

dd[:y] = 2 + dd[:x] + rand(100) - 0.5

fitted = lm(y ~ x, dd)
predict(fitted, dd)
predict(fitted, dd[ dd[:x] .!= 1, :])   # ERROR: DimensionMismatch

(I just did an analysis like this in R that relied on fitting one big model and then generating predictions for separate subsets, so it's not a super weird edge case.)

One way around this would be to just do some checks in the predict method that takes a fitted model and a DataFrame to make sure all the columns are there.
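A minimal sketch of what such a check might look like, written in plain Julia with no package dependencies. The helper name `check_newdata_levels` is hypothetical and is not part of GLM or DataFrames; it assumes the levels seen at fit time have been kept somewhere alongside the fitted model.

```julia
# Hypothetical helper (not a GLM/DataFrames function): compare the levels
# seen when the model was fit with the levels present in the new data.
function check_newdata_levels(fit_levels::AbstractVector, newdata::AbstractVector)
    new_levels = unique(newdata)
    # Levels the model has never seen cannot be mapped to any column.
    unseen = setdiff(new_levels, fit_levels)
    isempty(unseen) ||
        throw(ArgumentError("new data contains levels not seen during fitting: $unseen"))
    # Levels that are absent from the new data still need (all-zero) columns.
    return setdiff(fit_levels, new_levels)
end

fit_levels = [0, 1, 2, 3]
newdata = [0, 2, 3, 3, 0]          # level 1 has been filtered out
absent = check_newdata_levels(fit_levels, newdata)
```

Here `absent` lists the levels whose columns would have to be added back before multiplying by the coefficient vector.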

However, I think that the "right" way would be to store some additional information in the ModelFrame type that tells how to transform variables to columns. Currently this is handled by multiple dispatch on the cols function. I'd started working on something to this effect already, in order to allow for specifying different contrast coding schemes (which is something we need to do anyway, #757).
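To illustrate the idea of storing the variable-to-column mapping, here is a rough sketch in plain Julia. The names `DummyCoding`, `fit_coding`, and `modelcols` are made up for this example and are not the actual ModelFrame implementation; the sketch assumes simple dummy (treatment) coding with the first sorted level as the baseline.

```julia
# Hypothetical sketch: remember the levels seen at fit time so that any
# subset of the data is expanded into the same set of columns.
struct DummyCoding{T}
    levels::Vector{T}   # levels seen at fit time; the first is the baseline
end

# Record the sorted unique levels of the data used for fitting.
fit_coding(x::AbstractVector) = DummyCoding(sort(unique(x)))

# Build an intercept column plus one indicator column per non-baseline level.
function modelcols(c::DummyCoding, x::AbstractVector)
    for v in x
        v in c.levels ||
            throw(ArgumentError("level $v was not present when the coding was fit"))
    end
    X = zeros(length(x), length(c.levels))
    X[:, 1] .= 1.0                               # intercept
    for (j, lev) in enumerate(c.levels[2:end])
        X[:, j + 1] .= (x .== lev)               # indicator for this level
    end
    return X
end

x = [0, 1, 2, 3, 1, 0, 2, 3]
coding = fit_coding(x)
x_sub = x[x .!= 1]                               # drop every row with level 1

ncols_full = size(modelcols(coding, x), 2)
ncols_sub = size(modelcols(coding, x_sub), 2)
```

Because the coding is fixed once and reused, `ncols_full` and `ncols_sub` agree, and the column for the dropped level is simply all zeros in the subset.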

@kleinschmidt

The other possibility I initially thought about for the contrast coding was to create something like a ContrastedPooledDataArray container type that could be parametrized with contrast coding information and dispatched on, but that, I think, requires a lot of overhead in delegating all of the internal array-like functionality to the underlying PooledDataArray. But perhaps that's the more Julian way to go about it?
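For comparison, a wrapper along these lines could be sketched as follows. The type `ContrastedVector` is hypothetical and wraps a plain vector rather than a PooledDataArray; it only forwards the minimal array interface, so anything specific to the wrapped type (such as its pool of levels) would still need its own forwarding methods.

```julia
# Hypothetical wrapper type: carries contrast-coding information alongside
# the data and can be dispatched on. Not an existing DataFrames type.
struct ContrastedVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    data::A
    contrasts::Symbol    # e.g. :treatment; the label itself is an assumption
end

# Minimal read-only array interface, delegated to the wrapped vector.
Base.size(v::ContrastedVector) = size(v.data)
Base.getindex(v::ContrastedVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:ContrastedVector}) = IndexLinear()

v = ContrastedVector([0, 1, 2, 3], :treatment)
total = sum(v)           # generic array functions work through the interface
```

Generic functions such as `sum` and `length` come for free once `size` and `getindex` are defined, but mutation and type-specific behavior would each require additional delegation.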

@quinnj

quinnj commented Sep 7, 2017

@kleinschmidt, still relevant?

@kleinschmidt

Nope this is fixed (and tested)
