generating model matrix broken for categorical variables #788

kleinschmidt · 2015-03-24T02:39:57Z

I came across this issue while writing tests for the formula/statsmodel functionality: if you create a model matrix for two different subsets of a data frame where the levels of a categorical variable are different in the two subsets, you get different columns in your model matrix.

using DataFrames

dd = DataFrame()
dd[:x] = @pdata round(rand(100) * 3)
dd[:y] = 2 + dd[:x]

f = y ~ x

mf = ModelFrame(f, dd)
mm = ModelMatrix(mf)

dd_sub = dd[ dd[:x] .!= 1, : ]
mf_sub = ModelFrame(f, dd_sub)
mm_sub = ModelMatrix(mf_sub)

size(mm, 2)                             # == 4
size(mm_sub, 2)                         # == 3
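
The mismatch can be reproduced without DataFrames. The sketch below is a hypothetical stand-in for the model-matrix code, not the actual implementation (the name `dummy_columns` is invented for illustration). It builds an intercept plus one indicator column per non-reference level, with the levels inferred from whatever data is passed in, so a subset that lacks a level yields a narrower matrix.

```julia
# Hypothetical stand-in for model-matrix construction; NOT the DataFrames code.
# Levels are inferred from the data itself, which is the source of the problem.
function dummy_columns(x::AbstractVector)
    levels = sort(unique(x))
    X = ones(Float64, length(x), length(levels))   # column 1 is the intercept
    for (j, lev) in enumerate(levels[2:end])        # first level is the reference
        X[:, j + 1] .= (x .== lev)
    end
    return X
end

x = [0, 1, 2, 3, 0, 1, 2, 3]
x_sub = x[x .!= 1]                  # drop every observation of level 1

ncols_full = size(dummy_columns(x), 2)      # intercept + levels 1, 2, 3
ncols_sub  = size(dummy_columns(x_sub), 2)  # intercept + levels 2, 3
```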

I think we do it this way to avoid creating unnecessary columns in the model matrix. That is desirable, and totally fine, when you just want to fit a model to different subsets of a DataFrame. But if you're trying to generate predictions from a fitted model, you'll get an error whenever some of the levels that were present at fit time are missing from the new data:

using GLM

dd[:y] = 2 + dd[:x] + rand(100) - 0.5

fitted = lm(y ~ x, dd)
predict(fitted, dd)
predict(fitted, dd[ dd[:x] .!= 1, :])   # ERROR: DimensionMismatch

(I just did an analysis like this in R that relied on fitting one big model and then generating predictions for separate subsets, so it's not a super weird edge case.)

One way around this would be to just do some checks in the predict method that takes a fitted model and a DataFrame to make sure all the columns are there.
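
A minimal sketch of what such a check could look like, assuming the prediction boils down to multiplying a new model matrix by the coefficient vector. `checked_predict` is a hypothetical helper, not a function in GLM or DataFrames. Note that it only turns the `DimensionMismatch` into a more informative error; it does not make the prediction work.

```julia
# Hypothetical helper, not part of GLM or DataFrames: validate the new model
# matrix against the fitted coefficients before multiplying.
function checked_predict(coefs::AbstractVector, X::AbstractMatrix)
    if size(X, 2) != length(coefs)
        throw(ArgumentError(
            "new data produced $(size(X, 2)) model-matrix columns, but the model " *
            "was fit with $(length(coefs)); are some factor levels missing?"))
    end
    return X * coefs
end

coefs = [2.0, 1.0, 2.0, 3.0]                    # intercept + three level effects
X_full = [1.0 0.0 0.0 0.0; 1.0 1.0 0.0 0.0]     # all four columns present
preds = checked_predict(coefs, X_full)

X_narrow = [1.0 0.0 0.0; 1.0 1.0 0.0]           # one level's column is missing
# checked_predict(coefs, X_narrow) throws an ArgumentError
```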

However, I think that the "right" way would be to store some additional information in the ModelFrame type that tells how to transform variables to columns. Currently this is handled by multiple dispatch on the cols function. I'd started working on something to this effect already, in order to allow for specifying different contrast coding schemes (which is something we need to do anyway, #757).
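
To make the idea concrete, here is a rough sketch of storing the levels at fit time and reusing them when encoding new data. `LevelCoding`, `fit_coding`, and `encode` are illustrative names only; this is not the ModelFrame design, and it hard-codes treatment (dummy) coding rather than supporting other contrast schemes.

```julia
# Hypothetical sketch: remember the factor levels seen at fit time and reuse
# them for new data, instead of re-inferring levels from each data set.
struct LevelCoding{T}
    levels::Vector{T}
end

# Record the sorted unique levels of the training data.
fit_coding(x::AbstractVector) = LevelCoding(sort(unique(x)))

# Encode x with the stored levels: intercept + one indicator per non-reference level.
function encode(coding::LevelCoding, x::AbstractVector)
    for v in x
        v in coding.levels || throw(ArgumentError("unseen level: $v"))
    end
    X = ones(Float64, length(x), length(coding.levels))
    for (j, lev) in enumerate(coding.levels[2:end])
        X[:, j + 1] .= (x .== lev)
    end
    return X
end

x = [0, 1, 2, 3, 0, 1, 2, 3]
coding = fit_coding(x)

X_full = encode(coding, x)
X_sub  = encode(coding, x[x .!= 1])   # same width; the level-1 column is all zeros
```

With the coding stored alongside the fitted model, both matrices have the same number of columns, so the coefficient vector lines up with either one.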


kleinschmidt · 2015-03-24T02:48:53Z

The other possibility I initially thought about for the contrast coding was to create something like a ContrastedPooledDataArray container type that could be parametrized with contrast coding information and dispatched on, but that, I think, requires a lot of overhead in delegating all of the internal array-like functionality to the underlying PooledDataArray. But perhaps that's the more Julian way to go about it?

quinnj · 2017-09-07T05:30:10Z

@kleinschmidt, still relevant?

kleinschmidt · 2017-09-07T12:33:26Z

Nope, this is fixed (and tested).

This was referenced Sep 17, 2015

Categorical variables in Formula #867

Closed

RFC: contrast coding #870

Merged

cjprybol mentioned this issue Aug 18, 2017

WIP: DataTables.jl Backport #1214

Closed

4 tasks

kleinschmidt closed this as completed Sep 7, 2017
