ModelMatrix need to be able to align categorical variables #946

gustafsson · 2016-04-28T19:16:46Z

Categorical variables are described by PooledDataArrays. This fix is needed for two reasons:

1. A PooledDataArray in the new data could be coded with its levels in a different order, in which case the design matrix will be wrong.
2. The rows passed to predict might not span all levels of a factor; consider, for instance, running predict on a single row.
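
To make the two cases concrete, here is a minimal sketch in plain Julia with no packages. `dummycode` is a hypothetical helper written for this illustration, not a DataFrames function, and treatment coding with the first level as baseline is an assumption. It shows how coding new data against its own levels differs from coding it against the levels seen when the model was fitted.

```julia
# Hypothetical helper, not part of DataFrames: treatment-code `values`
# against an explicit, ordered list of `levels` (first level is the baseline).
function dummycode(values::AbstractVector, levels::AbstractVector)
    X = zeros(Float64, length(values), length(levels) - 1)
    for (i, v) in enumerate(values)
        j = findfirst(isequal(v), levels)
        j === nothing && error("unknown level: $v")
        j > 1 && (X[i, j - 1] = 1.0)
    end
    return X
end

train_levels = ["a", "b", "c"]    # level order seen when the model was fitted

# Reason 1: the new data pools the same levels in a different order.
new_levels = ["c", "b", "a"]
newdata = ["c"]
wrong = dummycode(newdata, new_levels)     # "c" is treated as the baseline
right = dummycode(newdata, train_levels)   # "c" lands in the second column

# Reason 2: a single row only knows about its own level.
single = ["b"]
own = dummycode(single, unique(single))    # no columns at all
aligned = dummycode(single, train_levels)  # "b" lands in the first column
```

Only `right` and `aligned` have columns that line up with the fitted coefficients, which is why the levels recorded at fit time have to be available to predict.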

I uncommented a previously failing test.


nalimilan · 2016-04-29T13:42:48Z

src/statsmodels/statsmodel.jl

@@ -79,7 +79,7 @@ function StatsBase.predict(mm::DataFrameRegressionModel, df::AbstractDataFrame)
    newTerms = remove_response(mm.mf.terms)
    # create new model frame/matrix
    mf = ModelFrame(newTerms, df)
-    newX = ModelMatrix(mf).m
+    newX = ModelMatrix(mf, mm.mf.df[1:0,:]).m


Why are you passing an empty data frame? AFAICT, passing the whole object will have no additional cost.

The idea is to be explicit that no values from the original data frame are used, merely the pools of any pooled data.

nalimilan · 2016-04-29T13:46:55Z

Thanks for the fix. But I think we should choose a more radical fix: storing the levels in the model matrix when creating it. In general, storing a matrix without the information needed to interpret it isn't great. Also, in practical terms, one shouldn't have to store the original data frame to be able to call predict. Do you think you could update the PR to do that?

kleinschmidt · 2016-04-29T13:48:26Z

Thanks for fixing this!

kleinschmidt · 2016-04-29T13:53:37Z

@nalimilan, what about storing a reference to the ModelFrame that generated the ModelMatrix? Does that have any downsides? As it stands, the assign field of a ModelMatrix tells you the mapping between MM and df columns, but that doesn't do much if you lose track of the MF...

nalimilan · 2016-04-29T14:54:53Z

@kleinschmidt Makes sense. Though I think we should at the same time remove the reference to the data frame from ModelFrame, and only store information regarding the levels. Keeping a copy of the data is a waste of space, and doesn't even make sense for some data sources like databases (I think that's one of the design mistakes in R). Using a common format for all data sources will make it possible for models to store information regarding the meaning of their coefficients, which is currently added by DataFrameRegressionModel accessors.

Then, should the ModelFrame be stored in the ModelMatrix, or the other way around, or should a model store both separately?

kleinschmidt · 2016-04-30T01:39:19Z

I think the ModelFrame actually acts as that kind of glue: there's no reason that it needs to hold onto a copy of the data. Conceptually, it describes how to transform the data source into a matrix suitable for regression etc. I don't think there's any reason (in principle) that you couldn't implement a ModelFrame parametrized by the type of the data source and dispatch ModelMatrix based on that.
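
As a rough illustration of a frame parametrized by the type of the data source, here is a sketch in plain Julia. Every name in it (`SketchFrame`, `levelcodes`, `sketchcodes`) is hypothetical and none of it is the DataFrames API. The frame stores only a column name and the ordered levels, and the method that reads the data is chosen by dispatch on the type parameter.

```julia
# All names here are hypothetical; this is not the DataFrames API.
# S records the kind of data source; the frame itself holds no data.
struct SketchFrame{S}
    column::Symbol            # name of the categorical predictor
    levels::Vector{String}    # ordered levels recorded at fit time
end

# Shared step: map each value to its position in the stored levels.
levelcodes(levels, values) = [findfirst(isequal(v), levels) for v in values]

# Column-oriented source: a NamedTuple of vectors.
sketchcodes(f::SketchFrame{<:NamedTuple}, src::NamedTuple) =
    levelcodes(f.levels, getproperty(src, f.column))

# Row-oriented source: a vector of NamedTuples, e.g. rows fetched one at a time.
sketchcodes(f::SketchFrame{<:AbstractVector}, src::AbstractVector) =
    levelcodes(f.levels, [getproperty(row, f.column) for row in src])

cols = (x = ["a", "c"],)
rows = [(x = "a",), (x = "c",)]
levels = ["a", "b", "c"]

codes_from_cols = sketchcodes(SketchFrame{typeof(cols)}(:x, levels), cols)
codes_from_rows = sketchcodes(SketchFrame{typeof(rows)}(:x, levels), rows)
```

Both calls produce the same codes because the level order lives in the frame rather than in either data source.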

Also, now that I'm thinking about this again, the stats models interfaces all hold onto the model frame, not the matrix. I think that's the appropriate place to store any information about how to generate a new ModelMatrix (e.g., the contrast coding scheme in #870, which also solves this problem).

gustafsson · 2016-04-30T08:37:59Z

So by keeping 0 rows but all column names and any associated levels we can describe the data without wasting any space.

kleinschmidt · 2016-04-30T13:11:39Z

Now that I've looked at it again, #870 conflicts with this change. The approach there is more general, storing information in the ModelFrame on contrast coding for all the categorical variables that are present in the DataFrame that it's constructed with. Those stored contrasts play the role that storing a zero-row copy of the dataframe does here, in addition to allowing users to specify the contrast coding that they want to use for individual variables. Is there any reason to prefer the approach here instead of/on top of #870?
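
For readers unfamiliar with the idea, a stored contrast can be sketched as the ordered levels plus a matrix with one row per level. The code below is plain Julia and only an illustration: `StoredContrasts`, `treatment`, `sumcoding`, and `modelcols` are hypothetical names and are not taken from #870. Building the columns for new data is then a row lookup, so level order and unseen levels no longer depend on the new data.

```julia
# Hypothetical stand-in for contrast information stored at fit time.
struct StoredContrasts
    levels::Vector{String}     # ordered levels recorded at fit time
    matrix::Matrix{Float64}    # one row per level, one column per coefficient
end

# Treatment (dummy) coding: the first level is the all-zero baseline.
function treatment(levels::Vector{String})
    k = length(levels)
    C = zeros(k, k - 1)
    for j in 2:k
        C[j, j - 1] = 1.0
    end
    return StoredContrasts(levels, C)
end

# Sum (deviation) coding: the last level is coded -1 in every column.
function sumcoding(levels::Vector{String})
    k = length(levels)
    C = zeros(k, k - 1)
    for j in 1:(k - 1)
        C[j, j] = 1.0
    end
    C[k, :] .= -1.0
    return StoredContrasts(levels, C)
end

# Look up each value's row in the stored contrast matrix.
function modelcols(c::StoredContrasts, values)
    idx = [findfirst(isequal(v), c.levels) for v in values]
    any(i -> i === nothing, idx) && error("value outside the stored levels")
    return c.matrix[idx, :]
end

levels = ["a", "b", "c"]
single_treatment = modelcols(treatment(levels), ["b"])
single_sum = modelcols(sumcoding(levels), ["c"])
```

A single-row prediction still yields a full-width row here, because the number of columns comes from the stored matrix and not from the levels present in the new data.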

nalimilan · 2016-04-30T13:57:03Z

As you and I said, I'd prefer the more general approach, in which ModelFrame would contain everything needed to reconstruct the ModelMatrix, but not the actual data. So it would be great if you fixed the current issue with #870.

gustafsson · 2016-08-26T09:56:29Z

The test added in this PR passes in master now that #870 is merged, nice!

kleinschmidt · 2016-08-26T15:04:47Z

That's good to hear, since fixing this was one of the motivations for #870 😅


ModelMatrix need to be able to align factors (pooled arrays used in M…

ede097f

…odelFrame).

nalimilan reviewed Apr 29, 2016

gustafsson closed this Aug 26, 2016

gustafsson deleted the align-factors branch August 26, 2016 09:56
