ModelMatrix need to be able to align categorical variables #946

Closed
wants to merge 1 commit into from

gustafsson
Contributor

Categorical variables are described by PooledDataArrays. This fix is needed for two reasons.

  1. A PooledDataArray could be recoded with a different level order, in which case the design matrix would be wrong.
  2. The rows passed to predict might not span all levels of a factor; consider, for instance, running predict on a single row.

I uncommented a previously failing test.
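
Both failure modes can be reproduced without any package. The following is a minimal sketch in plain Julia (1.x syntax), not the DataFrames implementation: `dummy_columns` is a hypothetical helper that treatment-codes a column against an explicit level list, and the level names are made up for illustration.

```julia
# Dummy-code a categorical column against an explicit, ordered list of levels.
# The first level is the reference and gets no column (treatment coding).
function dummy_columns(values::AbstractVector, levels::AbstractVector)
    X = zeros(Float64, length(values), length(levels) - 1)
    for (i, v) in enumerate(values)
        j = findfirst(isequal(v), levels)
        j === nothing && error("unknown level: $v")
        if j > 1
            X[i, j - 1] = 1.0
        end
    end
    return X
end

train_levels = ["a", "b", "c"]   # levels seen when the model was fitted
new_rows = ["c"]                 # a single row passed to predict

# Levels taken from the new rows only: the matrix loses its columns.
inferred = dummy_columns(new_rows, unique(new_rows))

# Levels taken from the fitted model: the columns line up with the coefficients.
aligned = dummy_columns(new_rows, train_levels)

# Same levels in a different order: same shape, but a different row.
reordered = dummy_columns(new_rows, ["c", "a", "b"])
```

Here `inferred` has no columns at all, while `aligned` and `reordered` have the same shape but disagree on which column is set, so only the aligned matrix can be multiplied by the fitted coefficients.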

@@ -79,7 +79,7 @@ function StatsBase.predict(mm::DataFrameRegressionModel, df::AbstractDataFrame)
     newTerms = remove_response(mm.mf.terms)
     # create new model frame/matrix
     mf = ModelFrame(newTerms, df)
-    newX = ModelMatrix(mf).m
+    newX = ModelMatrix(mf, mm.mf.df[1:0,:]).m
Member

Why are you passing an empty data frame? AFAICT, passing the whole object will have no additional cost.

Contributor Author

The idea is to make explicit that no values from the original data frame are used, merely the pool of any pooled data.

@nalimilan
Member

Thanks for the fix. But I think we should choose a more radical fix: storing the levels in the model matrix when creating it. In general, storing a matrix without the information needed to interpret it isn't great. Also, in practical terms, one shouldn't have to store the original data frame to be able to call predict. Do you think you could update the PR to do that?

@kleinschmidt
Contributor

Thanks for fixing this!

@kleinschmidt
Contributor

@nalimilan, what about storing a reference to the ModelFrame that generated the ModelMatrix? Does that have any downsides? As it stands, the assign field of a ModelMatrix tells you the mapping between MM and df columns, but that doesn't do much if you lose track of the MF...

@nalimilan
Member

@kleinschmidt Makes sense. Though I think we should at the same time remove the reference to the data frame from ModelFrame, and only store information regarding the levels. Keeping a copy of the data is a waste of space, and doesn't even make sense for some data sources such as databases (I think that's one of the design mistakes in R). Using a common format for all data sources will make it possible for models to store information regarding the meaning of their coefficients, which is currently added by DataFrameRegressionModel accessors.

Then, should the ModelFrame be stored in the ModelMatrix, or should it be the contrary, or should a model store both separately?

@kleinschmidt
Contributor

I think the ModelFrame actually acts as that kind of glue: there's no reason that it needs to hold onto a copy of the data. Conceptually, it describes how to transform the data source into a matrix suitable for regression etc. I don't think there's any reason (in principle) that you couldn't implement a ModelFrame parametrized by the type of the data source and dispatch ModelMatrix based on that.

Also, now that I'm thinking about this again, the stats model interfaces all hold onto the model frame, not the matrix. I think that's the appropriate place to store any information about how to generate a new ModelMatrix (e.g., the contrast coding scheme in #870, which also solves this problem).

@gustafsson
Contributor Author

So by keeping 0 rows but all column names and any associated levels we can describe the data without wasting any space.

@kleinschmidt
Contributor

Now that I've looked at it again, #870 conflicts with this change. The approach there is more general, storing information in the ModelFrame on contrast coding for all the categorical variables that are present in the DataFrame that it's constructed with. Those stored contrasts play the role that storing a zero-row copy of the dataframe does here, in addition to allowing users to specify the contrast coding that they want to use for individual variables. Is there any reason to prefer the approach here instead of/on top of #870?

@nalimilan
Member

As you and I said, I'd prefer the more general approach, in which ModelFrame would contain everything needed to reconstruct the ModelMatrix, but not the actual data. So it would be great if you fixed the current issue with #870.
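
The idea of keeping the levels but not the data can be sketched in a few lines of plain Julia (1.x syntax). This is only an illustration under stated assumptions, not the design adopted in #870: `LevelInfo`, `record_levels`, and `codes` are hypothetical names, and sorting the levels is an arbitrary choice made here to get a stable order.

```julia
# Hypothetical stand-in for the level information a model frame could keep.
struct LevelInfo
    levels::Dict{Symbol,Vector{String}}
end

# Record the ordered levels of each categorical column; the data itself is not stored.
record_levels(columns::Dict{Symbol,Vector{String}}) =
    LevelInfo(Dict(name => sort(unique(col)) for (name, col) in columns))

# Map new values to integer codes using the stored levels.
# Values that were not seen at fit time get code 0 in this sketch.
function codes(info::LevelInfo, name::Symbol, values::Vector{String})
    lv = info.levels[name]
    return [something(findfirst(isequal(v), lv), 0) for v in values]
end

train = Dict(:group => ["b", "a", "c", "a"])
info = record_levels(train)

# The training data is no longer needed to code a single new row consistently.
single_row = codes(info, :group, ["c"])
unseen = codes(info, :group, ["z"])
```

With the levels stored this way, a single new row is coded against the same ordering that was used at fit time, regardless of which levels the new data happens to contain.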

@gustafsson
Contributor Author

The test added in this PR passes in master now that #870 is merged, nice!

@gustafsson gustafsson closed this Aug 26, 2016
@gustafsson gustafsson deleted the align-factors branch August 26, 2016 09:56
@kleinschmidt
Contributor

That's good to hear, since fixing this was one of the motivations for #870 😅