-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ModelMatrix need to be able to align categorical variables #946
Conversation
@@ -79,7 +79,7 @@ function StatsBase.predict(mm::DataFrameRegressionModel, df::AbstractDataFrame) | |||
newTerms = remove_response(mm.mf.terms) | |||
# create new model frame/matrix | |||
mf = ModelFrame(newTerms, df) | |||
newX = ModelMatrix(mf).m | |||
newX = ModelMatrix(mf, mm.mf.df[1:0,:]).m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you passing an empty data frame? AFAICT, passing the whole object will have no additional cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to be explicit about that no values of the original dataframe is used. Merely the pool of any pooled data.
Thanks for the fix. But I think we shouldn't choose a more radical fix, by storing the |
Thanks for fixing this! |
@nalimilan, what about storing a reference to the ModelFrame that generated the ModelMatrix? Does that have any downsides? As it stands, the |
@kleinschmidt Makes sense. Though I think we should at the same time remove the reference to the data frame from Then, should the |
I think the Also, now that I'm thinking about this again, the stats models interfaces all hold onto the model frame, not the matrix. I think that's the appropriate place to store any information about how to generate a new |
So by keeping 0 rows but all column names and any associated levels we can describe the data without wasting any space. |
Now that I've looked at it again, #870 conflicts with this change. The approach there is more general, storing information in the |
As you and I said, I'd prefer the more general approach, in which |
The test added in this PR passes in master now that #870 is merged, nice! |
That's good to hear since fixing this was on of the motivations for #870 😅 (please pardon my thumb-typing) dave.f.kleinschmidt@gmail.com
|
Categorical variables are described by PooledDataArrays. This fix is needed for two reasons.
I uncommented a previously failing test.