Categorical variables in Formula #867

Closed
matthieugomez opened this issue Sep 17, 2015 · 10 comments

@matthieugomez
Contributor

In the current formula syntax, categorical variables must be PooledDataVectors.

However, in typical data analysis, the same variable can be treated as either continuous or categorical, depending on the model. For instance, one may want to regress on a trend (i.e., year as a continuous variable) or on year dummies.
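To make the distinction concrete, here is a minimal sketch in plain Python (not the DataFrames API) of the two design matrices the same column can produce. The data and the helper names `trend_columns` and `dummy_columns` are hypothetical, and the base level is assumed to be the smallest value.

```python
# Hypothetical data: the same `year` column used two ways.
years = [1999, 2000, 2001, 2000]

def trend_columns(values):
    """Continuous treatment: intercept plus the raw value (a single slope)."""
    return [[1.0, float(v)] for v in values]

def dummy_columns(values):
    """Categorical treatment: intercept plus one indicator per non-base level."""
    levels = sorted(set(values))  # assumption: smallest value is the base level
    return [[1.0] + [1.0 if v == lev else 0.0 for lev in levels[1:]]
            for v in values]

trend = trend_columns(years)    # 2 columns: intercept, year
dummies = dummy_columns(years)  # 3 columns: intercept, year==2000, year==2001
```

The trend version estimates one coefficient for year, while the dummy version estimates one coefficient per non-base year.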

It would be great to be able to specify in a formula that a variable should be treated as a categorical variable. For instance:

fit(LinearModel, y ~ pool(x), df)

@johnmyleswhite
Contributor

I think this is a good idea, but it requires massive changes to how things work.

That said, I'm very much on board. I've often thought we should do something like y ~ x' to make dummy variables really cheap to request.

@kleinschmidt
Contributor

A few months ago I started to mock up something similar to change the way that a categorical variable is transformed into a model matrix (e.g., dummy, sum, or Helmert coding). I think that coding it as a continuous variable might fit into the same sort of framework: they are all just ways of specifying transformations between data frame columns (or data vectors more generally) and model matrix columns.

My first idea was to specify a dictionary of column-to-contrast-type mappings in the ModelFrame (since that's what encapsulates the transformation from data to ModelMatrix), but I got stuck on how to specify that mapping other than by passing such a dictionary directly. It might instead be possible to wrap formula terms in constructors for contrast types and build up the contrast mappings while parsing the formula.
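As a rough illustration of the dictionary idea, the sketch below (plain Python, not the ModelFrame API) maps column names to coding schemes and builds the contrast matrix once the levels are known. The names `CODINGS`, `contrasts`, and `contrast_matrix` and the column names are hypothetical; the sum and Helmert layouts are assumed to follow R's `contr.sum` and `contr.helmert` conventions.

```python
def dummy_coding(k):
    # Row i is level i; the first level is the base (all zeros).
    return [[1 if i == j + 1 else 0 for j in range(k - 1)] for i in range(k)]

def sum_coding(k):
    # The last level is coded -1 in every column (R's contr.sum layout).
    return [[-1] * (k - 1) if i == k - 1
            else [1 if i == j else 0 for j in range(k - 1)]
            for i in range(k)]

def helmert_coding(k):
    # Column j contrasts level j+1 with the levels before it (R's contr.helmert layout).
    return [[-1 if i <= j else (j + 1 if i == j + 1 else 0)
             for j in range(k - 1)] for i in range(k)]

CODINGS = {"dummy": dummy_coding, "sum": sum_coding, "helmert": helmert_coding}

# Hypothetical column-to-contrast-type mapping, as it might be passed alongside a formula.
contrasts = {"condition": "helmert", "year": "dummy"}

def contrast_matrix(column, levels):
    """Build the (levels x columns) contrast matrix; needs the data's levels."""
    return CODINGS[contrasts[column]](len(levels))

print(contrast_matrix("condition", ["a", "b", "c"]))  # [[-1, -1], [1, -1], [0, 2]]
```

Note that the mapping itself names only a coding type; the matrix is computed only when the levels are supplied.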

@nalimilan
Member

Somewhat related is the question of whether to allow y ~ log(x) in formulas. If calling arbitrary functions like this were supported, then y ~ pool(x) could simply call pool(), i.e. create a PooledDataArray, which would then be handled as it is currently in a second step, by creating dummies when computing the design matrix.
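A minimal sketch of that two-step idea, in plain Python rather than the actual formula machinery: a term of the form `f(x)` is evaluated by calling a registered function on the column, and the result is handed to whatever step builds the design matrix. The names `TRANSFORMS` and `eval_term` are hypothetical, and converting values to strings is only a stand-in for marking a column as categorical.

```python
import math
import re

# Hypothetical registry of functions that may appear in a formula term.
TRANSFORMS = {
    "log": lambda col: [math.log(v) for v in col],
    # Stand-in for pool(): mark the column as categorical by making it non-numeric.
    "pool": lambda col: [str(v) for v in col],
}

def eval_term(term, data):
    """Evaluate a term such as 'x', 'log(x)' or 'pool(x)' against a dict of columns."""
    m = re.fullmatch(r"(\w+)\((\w+)\)", term)
    if m is None:
        return data[term]          # plain column reference
    func, name = m.groups()
    return TRANSFORMS[func](data[name])

data = {"year": [1999, 2000], "x": [1.0, math.e]}
pooled = eval_term("pool(year)", data)   # ['1999', '2000']
```

A later step would then expand any non-numeric column into dummies, just as it would for a column that was categorical to begin with.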

@kleinschmidt I don't think the contrast matrices can be computed when parsing the formulas, since they depend on the data (e.g., on the levels of the PDA). Why not pass the dictionary?

@kleinschmidt
Contributor

@nalimilan no reason other than that it feels kind of clunky/bolted on, for something that's such an important part of how you specify the transformation from data to model matrix. To be honest, I suspect that my desire to put it in the Formula itself comes from a bit of a pet peeve of mine: for the most part, psychologists/psycholinguists ignore how their categorical variables (like experimental condition) are transformed into regressors, which can have huge implications for how the resulting model is interpreted (and whether it can be fit at all, since many choices lead to collinearity in the design matrix).

Also, you're right that the contrast columns themselves can't be computed until there's data, so for that reason it makes sense to pass them only when you have data. On the other hand, there are complications surrounding how to generate contrasts based on whether certain lower-order terms are present in the model matrix. For instance, if you have a three-level categorical variable, you can only have two contrast columns when there's also an intercept in the model, but you probably want three (e.g., dummy coding) when there's no intercept (see #757). The same goes for an interaction between a categorical variable and another variable: if there's no main effect for the other variable, you want as many contrast columns as there are levels; otherwise, one fewer. So even evaluating the contrast scheme depends on the formula, not just on the contrast itself, which seems like a reason to put the contrasts in the Formula/Terms bits. Of course, if you're passing them to ModelFrame you can do the same thing just by examining the parsed Terms, and I haven't really worked out whether this would make the implementation easier or harder. Just a hunch.
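The intercept case can be sketched as follows, in plain Python with hypothetical helper names (`categorical_columns`, `design_matrix`); this only illustrates the column-count rule for dummy coding and is not how ModelFrame works internally.

```python
def categorical_columns(values, levels, has_intercept):
    # With an intercept, drop the first level: its indicator would be
    # collinear with the intercept column. Without one, keep every level.
    kept = levels[1:] if has_intercept else levels
    return [[1.0 if v == lev else 0.0 for lev in kept] for v in values]

def design_matrix(values, levels, has_intercept):
    rows = categorical_columns(values, levels, has_intercept)
    return [([1.0] if has_intercept else []) + row for row in rows]

levels = ["a", "b", "c"]
values = ["a", "b", "c"]

with_intercept = design_matrix(values, levels, has_intercept=True)
# [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]  -> intercept + 2 contrasts

without_intercept = design_matrix(values, levels, has_intercept=False)
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  -> 3 indicator columns
```

Either way the matrix has three columns, but which columns they are depends on what else is in the formula, not only on the variable and its coding.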

One last complication is that you might want to maintain the same contrasts across multiple datasets (e.g., for making out-of-sample predictions). If you re-generate the contrasts every time you construct a new ModelFrame, then whenever the levels differ between the two data subsets (e.g., the prediction data only has a subset of the levels of the original), you'll get a model matrix of the wrong size, or one whose contrast columns don't correspond to the fitted coefficients in the same way (#788). So even though you need the data to generate the contrasts, it also makes sense to make them portable across different model frames. Again, that doesn't preclude just sticking them in a field in the model frame; it's just a hunch!
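One way to picture "portable" contrasts is an encoder that fixes its levels once, from the fitting data, and reuses them for any later data. The sketch below is plain Python; the class name `DummyEncoder` and its behaviour on unseen levels are assumptions made for illustration.

```python
class DummyEncoder:
    """Dummy coding whose levels are fixed by the data used for fitting."""

    def __init__(self, training_values):
        # Levels are computed once and then carried along with the model.
        self.levels = sorted(set(training_values))

    def encode(self, values):
        unknown = set(values) - set(self.levels)
        if unknown:
            raise ValueError(f"levels not seen when fitting: {sorted(unknown)}")
        # Always one column per non-base training level, whatever `values` contains.
        return [[1.0 if v == lev else 0.0 for lev in self.levels[1:]]
                for v in values]

enc = DummyEncoder(["a", "b", "c", "a"])
new_rows = enc.encode(["c", "a"])  # prediction data with only a subset of the levels
# [[0.0, 1.0], [0.0, 0.0]]: still two columns, aligned with the fitted coefficients
```

Re-deriving the levels from the prediction data alone would instead give a single column here, which is the size mismatch described above.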

@nalimilan
Member

Honestly, I'm not even sure how it would be possible to compute the contrasts for categorical variables when parsing the formula: you need to have access to the data for that. Of course you also need the formula, which seems to imply that you need to create the contrasts when creating the model frame. Regarding #788, fixing predict could be enough.

@matthieugomez
Contributor Author

@nalimilan Yes. It's similar to the fact that one can do y ~ as.factor(year) in R, right? One minor issue with this implementation is that the table of coefficients prints something like pool(year)-1999 rather than year-1999, but that's the only thing I can think of.

@nalimilan
Member

@kleinschmidt How do you think this should be addressed now that #870 has been merged?

@kleinschmidt
Contributor

I see two ways forward on this. In the short term I'll look into adding functionality for specifying the contrasts in the formula itself. The scheme used in #870 allows for specifying the contrast type independent of the data so that's not an issue. I'm imagining something like R's y ~ C(x) or C(x, DummyCoding). This can be stored as an evaluation term which then gets passed to the normal contrast evaluating mechanism. It's not totally straightforward but I think it won't be too much of a pain.

In the long term, as @nalimilan noted, this is a specific case of the more general problem of specifying transformations of variables in formulas. I know @johnmyleswhite has been working on making that performant, but I imagine that's a way off. Still, treating contrast codings as a specific kind of data transformation is, I think, the right way to think of it.

@quinnj
Member

quinnj commented Sep 7, 2017

Can be re-opened at https://github.com/JuliaStats/StatsModels.jl if still relevant.

@quinnj closed this as completed Sep 7, 2017

@nalimilan
Member

Yes, that's JuliaStats/StatsModels.jl#25.
