Categorical variables in Formula #867

matthieugomez · 2015-09-17T15:19:34Z

In the current syntax for formula, categorical variables must be PooledDataVectors.

However, in typical data analysis, the same variable can be treated as either continuous or categorical, depending on the model. For instance, one may want to regress on a trend (i.e. year as a continuous variable) or on year dummies.

It would be great to be able to specify in a formula that a variable should be treated as a categorical variable. For instance:

fit(LinearModel, y ~ pool(x), df)

johnmyleswhite · 2015-09-17T15:21:30Z

I think this is a good idea, but it requires massive changes to how things work.

That said, I'm very much on board. I've often thought we should do something like y ~ x' to make dummy variables really cheap to request.

kleinschmidt · 2015-09-17T15:30:07Z

A few months ago I started to mock up something similar to change the way that a categorical variable is transformed into a model matrix (e.g., dummy, sum, Helmert, etc. coding). I think that coding it as a continuous variable might fit into the same sort of framework: they all are just ways of specifying transformations between data frame columns (or data vectors more generally) and model matrix columns.

My first idea had been to specify a dictionary of column-to-contrast type mappings in the ModelFrame (since that's what encapsulates the transformation from data to ModelMatrix) but I was a bit stuck on how to specify that mapping (other than just directly passing such a dictionary). But it might be possible to wrap formula terms in constructors for contrast types and build up the contrast mappings in the parsing of the formulas.
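A minimal sketch of the dictionary idea, in plain Julia with no DataFrames API. All names here (`continuous`, `dummy`, `model_columns`) are hypothetical and only illustrate that "continuous" and "dummy" can both be seen as column-to-matrix transformations selected per column:

```julia
# Hypothetical codings: each maps one data column to a block of model matrix columns.
continuous(x) = reshape(Float64.(x), :, 1)

function dummy(x)
    lv = sort(unique(x))
    # One indicator column per level, dropping the first level as the baseline.
    return Float64[xi == l for xi in x, l in lv[2:end]]
end

# Build an intercept plus one block per term; columns without an entry in
# `codings` fall back to `continuous` (an assumption made for this sketch).
function model_columns(data::AbstractDict, terms, codings::AbstractDict)
    blocks = [get(codings, t, continuous)(data[t]) for t in terms]
    return hcat(ones(length(data[first(terms)])), blocks...)
end

data = Dict(:year => [1999, 2000, 2001, 2000])

trend   = model_columns(data, [:year], Dict{Symbol,Function}())  # year as a trend
dummies = model_columns(data, [:year], Dict(:year => dummy))     # year as dummies
```

The same `:year` column yields one slope column in the first call and two indicator columns in the second, without changing the data itself.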

nalimilan · 2015-09-17T15:51:13Z

Somewhat related is the question of whether to allow y ~ log(x) in formulas. If calling arbitrary functions like this were supported, then y ~ pool(x) could simply call pool(), i.e. create a PooledDataArray, which would then be handled as currently in a second step, by creating dummies when computing the design matrix.

@kleinschmidt I don't think the contrast matrices can be computed when parsing the formulas, since they depend on the data (e.g., on the levels of the PDA). Why not pass the dictionary?

kleinschmidt · 2015-09-17T16:30:51Z

@nalimilan no reason other than that it feels kind of clunky/bolted on, for something that's such an important part of how you specify the transformation from data to model matrix. To be honest, I suspect that my desire to put it in the Formula itself comes from a bit of a pet peeve of mine: for the most part, psychologists/psycholinguists ignore how their categorical variables (like experimental condition) are transformed into regressors, which can have huge implications for how the resulting model is interpreted (and whether it can be fit at all, since many choices lead to collinearity in the design matrix).

Also, you're right that the contrasts columns themselves can't be computed until there's data, so for that reason it makes sense to pass them only when you have data. On the other hand, there are complications surrounding how to generate contrasts based on whether certain lower-order terms are present in the model matrix. For instance, if you have a three-level categorical variable, you can only have two contrast columns when there's also an intercept present in the model, but you probably want three (e.g. dummy coding) when there's no intercept present (see #757). Same thing for an interaction between a categorical variable and another variable: if there's no main effect for the other variable, you want to generate as many contrast columns as there are levels; otherwise, one fewer. So even evaluating the contrast scheme depends on the formula, not just the contrast itself, and so that seems like a reason to put the contrasts in the Formula/Terms bits. Of course, if you're passing them to ModelFrame you can do the same thing just by examining the parsed Terms, and I haven't really worked out whether this would make the implementation easier or harder. Just a hunch.
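The intercept point can be checked numerically. The following is a rough sketch in plain Julia; `dummy_columns` is a hypothetical helper, not a DataFrames function. It builds indicator columns for a three-level variable and compares the rank of the resulting matrices:

```julia
using LinearAlgebra  # stdlib, for `rank`

# Hypothetical helper: indicator columns for `x` given its levels.
# With `full = false` the first level is dropped and acts as the baseline.
function dummy_columns(x::AbstractVector, levels::AbstractVector; full::Bool = false)
    used = full ? levels : levels[2:end]
    return Float64[xi == l for xi in x, l in used]
end

x  = ["a", "b", "c", "a", "c"]
lv = ["a", "b", "c"]
intercept = ones(length(x))

with_intercept = hcat(intercept, dummy_columns(x, lv))               # 1 + 2 columns
no_intercept   = dummy_columns(x, lv; full = true)                   # 3 columns
overparam      = hcat(intercept, dummy_columns(x, lv; full = true))  # 1 + 3 columns

println(rank(with_intercept), " of ", size(with_intercept, 2), " columns")
println(rank(no_intercept), " of ", size(no_intercept, 2), " columns")
println(rank(overparam), " of ", size(overparam, 2), " columns")
```

The first two matrices have full column rank, while the third is rank deficient because the three indicator columns sum to the intercept column. That is the collinearity that forces the number of contrast columns to depend on which other terms are in the formula.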

One last complication is that you might want to maintain the same contrasts across multiple datasets (e.g., for making out-of-sample predictions). If you re-generate the contrasts every time you construct a new ModelFrame, then whenever the levels differ between the two datasets (e.g., the prediction data only has a subset of the levels of the original) you'll get a model matrix of the wrong size, or one whose contrast columns don't correspond to the fitted coefficients in the same way (#788). So even though you need the data to generate the contrasts, it also makes sense to make them portable across different model frames. Again, that doesn't preclude just sticking them in a field in the model frame; it's just a hunch!
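A small sketch of what "portable" contrasts could look like, assuming the levels are captured once from the fitting data and reused afterwards. The type and function names (`FixedDummyCoding`, `fit_coding`, `apply_coding`) are hypothetical and plain Julia, not how DataFrames implements it:

```julia
# Hypothetical container: levels are recorded once, at fit time.
struct FixedDummyCoding{T}
    levels::Vector{T}
end

fit_coding(x::AbstractVector) = FixedDummyCoding(sort(unique(x)))

# Apply a previously fitted coding to new data; the first level is the baseline.
function apply_coding(c::FixedDummyCoding, x::AbstractVector)
    unknown = setdiff(unique(x), c.levels)
    isempty(unknown) || throw(ArgumentError("levels not seen at fit time: $unknown"))
    return Float64[xi == l for xi in x, l in c.levels[2:end]]
end

train   = [1999, 2000, 2001, 2000]
newdata = [2001, 1999]             # the level 2000 is absent here

coding  = fit_coding(train)
X_train = apply_coding(coding, train)                   # 4×2
X_new   = apply_coding(coding, newdata)                 # 2×2, columns match X_train
X_naive = apply_coding(fit_coding(newdata), newdata)    # 2×1, columns no longer match
```

Re-deriving the levels from `newdata` gives a matrix with one column instead of two, which is the size mismatch described above; reusing `coding` keeps the columns aligned with the fitted coefficients.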

nalimilan · 2015-09-17T21:17:58Z

Honestly, I'm not even sure how it would be possible to compute the contrasts for categorical variables when parsing the formula: you need to have access to the data for that. Of course you also need the formula, which seems to imply that you need to create the contrasts when creating the model frame. Regarding #788, fixing predict could be enough.

matthieugomez · 2015-09-18T13:51:26Z

@nalimilan Yes. It's similar to the fact that one can do y ~ as.factor(year) in R, right? One minor issue with this implementation is that the table of coefficients prints something like pool(year)-1999 rather than year-1999, but that's the only thing I can think of.

nalimilan · 2016-08-31T12:16:33Z

@kleinschmidt How do you think this should be addressed now that #870 has been merged?

kleinschmidt · 2016-08-31T19:26:01Z

I see two ways forward on this. In the short term I'll look into adding functionality for specifying the contrasts in the formula itself. The scheme used in #870 allows for specifying the contrast type independent of the data so that's not an issue. I'm imagining something like R's y ~ C(x) or C(x, DummyCoding). This can be stored as an evaluation term which then gets passed to the normal contrast evaluating mechanism. It's not totally straightforward but I think it won't be too much of a pain.

In the long term, as @nalimilan noted, this is a specific case of the more general problem of specifying transformations of variables in formulas. I know @johnmyleswhite has been working on making that performant, but I imagine that's a way off. Still, treating contrast codings as a specific kind of data transformation is, I think, the right way to think of it.
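For the short-term idea, here is a rough sketch of how `C(x)` and `C(x, DummyCoding)` wrappers might be pulled out of a formula expression at parse time. It only walks a quoted Julia expression and records names; `extract_codings!` is a hypothetical function, and defaulting to `DummyCoding` when no type is given is an assumption of this sketch, not something settled in this thread:

```julia
# Hypothetical: strip `C(var)` / `C(var, Coding)` wrappers from a formula
# expression, record them in `found`, and return the rewritten expression.
function extract_codings!(found::Dict{Symbol,Symbol}, ex)
    ex isa Expr || return ex   # symbols and literals are left untouched
    if ex.head == :call && ex.args[1] == :C &&
       length(ex.args) in (2, 3) && all(a -> a isa Symbol, ex.args[2:end])
        # Assumption: a bare C(x) means DummyCoding.
        found[ex.args[2]] = length(ex.args) == 3 ? ex.args[3] : :DummyCoding
        return ex.args[2]
    end
    return Expr(ex.head, map(a -> extract_codings!(found, a), ex.args)...)
end

found = Dict{Symbol,Symbol}()
stripped = extract_codings!(found, :(y ~ 1 + C(year, HelmertCoding) + C(region) + x))

println(stripped)
println(found)
```

Only the contrast type is recorded, by name; nothing here needs the data, so the actual contrast matrix could still be computed later when the model frame is built.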

quinnj · 2017-09-07T05:16:11Z

Can be re-opened at https://github.com/JuliaStats/StatsModels.jl if still relevant.

nalimilan · 2017-09-07T12:48:58Z

Yes, that's JuliaStats/StatsModels.jl#25.

kleinschmidt mentioned this issue Sep 21, 2015

RFC: contrast coding #870

Merged

nalimilan mentioned this issue Feb 22, 2016

Formula/ModelMatrix should support functions of variables #19

Closed

nalimilan mentioned this issue Jun 9, 2017

applying transformations on formula arguments JuliaStats/StatsModels.jl#25

Closed

cjprybol mentioned this issue Aug 18, 2017

WIP: DataTables.jl Backport #1214

Closed

quinnj closed this as completed Sep 7, 2017

nalimilan mentioned this issue Sep 23, 2017

Make CategoricalValue <: AbstractString JuliaData/CategoricalArrays.jl#77

Merged

nalimilan mentioned this issue Jul 3, 2018

CategoricalArrays without CategoricalValue JuliaData/CategoricalArrays.jl#151

Closed
