I came across this issue while writing tests for the formula/statsmodel functionality: if you create a model matrix for two different subsets of a data frame where the levels of a categorical variable are different in the two subsets, you get different columns in your model matrix.
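using DataFrames
# x is a pooled (categorical) variable with levels 0, 1, 2, 3
dd = DataFrame()
dd[:x] = @pdata round(rand(100) * 3)
dd[:y] = 2 + dd[:x]
f = y ~ x
# model matrix for the full data
mf = ModelFrame(f, dd)
mm = ModelMatrix(mf)
# model matrix for a subset with one level of x removed
dd_sub = dd[ dd[:x] .!= 1, : ]
mf_sub = ModelFrame(f, dd_sub)
mm_sub = ModelMatrix(mf_sub)
size(mm, 2) # == 4
size(mm_sub, 2) # == 3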
I think the reason that we're doing it this way is to avoid creating unnecessary columns in the model matrix. This is desirable/totally fine when you just want to fit a model to different subsets of a DataFrame, but if you're trying to generate predictions for a fitted model you'll get an error if any of the levels that were originally present are missing from the new data:
(I just did an analysis like this in R that relied on fitting one big model and then generating predictions for separate subsets, so it's not a super weird edge case.)
One way around this would be to just do some checks in the predict method that takes a fitted model and a DataFrame to make sure all the columns are there.
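A minimal sketch of what such a check could look like, in plain Julia. The helper name `check_columns` and the idea of comparing column names as strings are assumptions for illustration only; nothing like this exists in DataFrames as far as this issue is concerned.

```julia
# Hypothetical helper: compare the model matrix column names recorded at
# fit time with the column names generated from the new data.
function check_columns(fit_names::Vector{String}, new_names::Vector{String})
    missing_cols = setdiff(fit_names, new_names)
    isempty(missing_cols) || throw(ArgumentError(
        "new data is missing model matrix columns: " * join(missing_cols, ", ")))
    return true
end

# Column names here are made up for the example.
fit_names = ["(Intercept)", "x: 1.0", "x: 2.0", "x: 3.0"]
sub_names = ["(Intercept)", "x: 2.0", "x: 3.0"]

check_columns(fit_names, fit_names)   # passes
# check_columns(fit_names, sub_names) would throw an ArgumentError
```

This only turns a confusing dimension mismatch into a clearer error; it does not make prediction on the subset work.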
However, I think that the "right" way would be to store some additional information in the ModelFrame type that tells how to transform variables to columns. Currently this is handled by multiple dispatch on the cols function. I'd started working on something to this effect already, in order to allow for specifying different contrast coding schemes (which is something we need to do anyway, #757).
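To make the idea concrete, here is a rough sketch in plain Julia that does not use DataFrames at all. The type `TreatmentCoding` and the functions `fit_coding` and `model_columns` are hypothetical names invented for this example, not part of ModelFrame or any existing API. The point is only that the levels are recorded once and then reused, so a subset yields the same columns as the full data.

```julia
# Hypothetical: records the levels seen when the coding was created.
struct TreatmentCoding{T}
    levels::Vector{T}
end

# Record the sorted unique levels of the full data once.
fit_coding(x::AbstractVector) = TreatmentCoding(sort(unique(x)))

# Build an intercept plus one indicator column per non-reference level,
# always using the stored levels rather than the levels present in x.
function model_columns(c::TreatmentCoding, x::AbstractVector)
    all(in(c.levels), x) || throw(ArgumentError(
        "data contains levels not seen when the coding was created"))
    X = zeros(length(x), length(c.levels))
    X[:, 1] .= 1.0                                # intercept
    for (j, lev) in enumerate(c.levels[2:end])    # first level is the reference
        X[:, j + 1] .= (x .== lev)
    end
    return X
end

x = [0, 1, 2, 3, 1, 2, 0, 3]
coding = fit_coding(x)
X_full = model_columns(coding, x)
X_sub  = model_columns(coding, x[x .!= 1])        # level 1 absent from the subset
size(X_full, 2) == size(X_sub, 2)                 # same number of columns
```

In the subset, the column for the absent level is simply all zeros, which is what a predict method needs in order to line up with the fitted coefficients. A different contrast scheme would only change how `model_columns` fills in the matrix.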
The other possibility I initially thought about for the contrast coding was to create something like a ContrastedPooledDataArray container type that could be parametrized with contrast coding information and dispatched on, but that, I think, requires a lot of overhead in delegating all of the internal array-like functionality to the underlying PooledDataArray. But perhaps that's the more Julian way to go about it?