Categorical variables in Formula #867
Comments
I think this is a good idea, but it requires massive changes to how things work. That said, I'm very much on board. I've often thought we should do something like
A few months ago I started to mock up something similar to change the way that a categorical variable is transformed into a model matrix (e.g., dummy, sum, Helmert, etc. coding). I think that coding it as a continuous variable might fit into the same sort of framework: these are all just ways of specifying transformations between data frame columns (or data vectors more generally) and model matrix columns. My first idea had been to specify a dictionary of column-to-contrast-type mappings in the ModelFrame (since that's what encapsulates the transformation from data to ModelMatrix), but I was a bit stuck on how to specify that mapping (other than just directly passing such a dictionary). It might be possible, though, to wrap formula terms in constructors for contrast types and build up the contrast mappings while parsing the formulas.
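To make the "coding scheme as a transformation" idea concrete, here is a minimal Python sketch (not the DataFrames.jl implementation). The function names `dummy_contrasts`, `sum_contrasts`, `helmert_contrasts` and `encode` are hypothetical, chosen only for illustration. Each scheme is a small matrix with one row per level, and encoding a column just looks up the row for each value.

```python
def dummy_contrasts(k):
    """Dummy (treatment) coding: level 0 is the reference, coded as all zeros."""
    return [[1 if i == j + 1 else 0 for j in range(k - 1)] for i in range(k)]


def sum_contrasts(k):
    """Sum coding: the last level is coded -1 in every column."""
    rows = []
    for i in range(k):
        if i == k - 1:
            rows.append([-1] * (k - 1))
        else:
            rows.append([1 if i == j else 0 for j in range(k - 1)])
    return rows


def helmert_contrasts(k):
    """Helmert coding: column j contrasts level j+1 with the levels before it."""
    rows = []
    for i in range(k):
        row = []
        for j in range(k - 1):
            if i <= j:
                row.append(-1)
            elif i == j + 1:
                row.append(j + 1)
            else:
                row.append(0)
        rows.append(row)
    return rows


def encode(values, levels, contrasts):
    """Map each value to the contrast row of its level (data -> model matrix columns)."""
    index = {level: i for i, level in enumerate(levels)}
    return [contrasts[index[v]] for v in values]


levels = ["a", "b", "c"]
column = ["b", "a", "c"]
encoded_dummy = encode(column, levels, dummy_contrasts(len(levels)))
encoded_helmert = encode(column, levels, helmert_contrasts(len(levels)))
```

Under this view, a dictionary of column-to-contrast mappings would simply pair each column name with one of these matrix-building functions.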
Somewhat related is the question of whether to allow
@kleinschmidt I don't think the contrast matrices can be computed when parsing the formulas, since they depend on the data (e.g., on the levels of the PDA). Why not pass the dictionary?
@nalimilan no reason other than that it feels kind of clunky/bolted on, for something that's such an important part of how you specify the transformation from data to model matrix. To be honest, I suspect that my desire to put it in the Formula itself comes from a bit of a pet peeve of mine: for the most part, psychologists/psycholinguists ignore how their categorical variables (like experimental condition) are transformed into regressors, which can have huge implications for how the resulting model is interpreted (and whether it can be fit at all, since many choices lead to collinearity in the design matrix).
Also, you're right that the contrast columns themselves can't be computed until there's data, so for that reason it makes sense to pass them only when you have data.
On the other hand, there are complications surrounding how to generate contrasts based on whether certain lower-order terms are present in the model matrix. For instance, if you have a three-level categorical variable, you can only have two contrast columns when there's also an intercept present in the model, but you probably want three (e.g. dummy coding) when there's no intercept present (see #757). The same goes for an interaction between a categorical variable and another variable: if there's no main effect for the other variable, you want to generate as many contrast columns as there are levels; otherwise, one fewer. So even evaluating the contrast scheme depends on the formula, not just the contrast itself, and that seems like a reason to put the contrasts in the Formula/Terms bits. Of course, if you're passing them to ModelFrame you can do the same thing just by examining the parsed Terms, and I haven't really worked out whether this would make the implementation easier or harder. Just a hunch.
One last complication is that you might want to maintain the same contrasts across multiple datasets (e.g., for making out-of-sample predictions). If you re-generate the contrasts every time you construct a new ModelFrame, then whenever the levels differ between the two data subsets (e.g., the prediction data only has a subset of the levels of the original) you'll get a model matrix of the wrong size, or with contrast columns that don't correspond to fitted coefficients in the same way (#788). So even though you need the data to generate the contrasts, it also makes sense to make them portable across different model frames. Again, that doesn't preclude just sticking them in a field in the model frame. Just a hunch!
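The two complications described above can be sketched in a few lines of Python. This is an illustrative sketch only, not how ModelFrame works: the helper `dummy_columns` and its `has_intercept` argument are hypothetical names. It shows that the number of columns depends on whether an intercept is present, and that storing the levels from the original data keeps the columns aligned when the prediction data contains only a subset of the levels.

```python
def dummy_columns(values, levels, has_intercept):
    """Indicator columns for `values`, using a fixed, stored list of `levels`.

    With an intercept, the first level is dropped as the reference level to
    avoid collinearity; without one, every level gets its own column.
    """
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"unseen levels: {sorted(unknown)}")
    kept = levels[1:] if has_intercept else levels
    return [[1 if v == level else 0 for level in kept] for v in values]


train = ["a", "b", "c", "a"]
levels = sorted(set(train))      # computed once, from the original data
predict = ["c", "a"]             # only a subset of the original levels

with_intercept = dummy_columns(train, levels, has_intercept=True)      # 2 columns
without_intercept = dummy_columns(train, levels, has_intercept=False)  # 3 columns

# Reusing the stored levels keeps the prediction matrix aligned with the fit.
aligned = dummy_columns(predict, levels, has_intercept=True)

# Re-deriving the levels from the prediction data yields the wrong width.
misaligned = dummy_columns(predict, sorted(set(predict)), has_intercept=True)
```

Here `aligned` has two columns, matching the fitted coefficients, while `misaligned` has only one, which is the kind of mismatch the comment attributes to regenerating contrasts per model frame.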
Honestly, I'm not even sure how it would be possible to compute the contrasts for categorical variables when parsing the formula: you need to have access to the data for that. Of course you also need the formula, which seems to imply that you need to create the contrasts when creating the model frame. Regarding #788, fixing
@nalimilan Yes. It's similar to the fact that one can do
@kleinschmidt How do you think this should be addressed now that #870 has been merged?
I see two ways forward on this. In the short term I'll look into adding functionality for specifying the contrasts in the formula itself. The scheme used in #870 allows for specifying the contrast type independently of the data, so that's not an issue. I'm imagining something like R's
In the long term, as @nalimilan noted, this is a specific case of the more general problem of specifying transformations of variables in formulas. I know @johnmyleswhite has been working on making that performant, but I imagine that's a way off. But treating contrast codings as a specific kind of data transformation is, I think, the right way to think of it.
Can be re-opened at https://github.com/JuliaStats/StatsModels.jl if still relevant.
Yes, that's JuliaStats/StatsModels.jl#25.
In the current formula syntax, categorical variables must be PooledDataVectors.
However, in typical data analysis, the same variable can be alternatively seen as a continuous or a categorical variable, depending on the model. For instance one may want to regress on a trend (i.e. year as a continuous variable) or on year dummies.
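As a rough illustration of the two readings of the same variable, the Python sketch below builds both sets of model matrix columns from one `years` vector. The data and variable names are made up for the example, and the choice of the earliest year as the reference level is an assumption.

```python
years = [2001, 2002, 2003, 2002]

# Continuous reading: a single column, whose coefficient is a linear trend.
trend_column = [[float(y)] for y in years]

# Categorical reading: one indicator column per year, except the reference
# year (assumed here to be the earliest one).
levels = sorted(set(years))
year_dummies = [[1 if y == level else 0 for level in levels[1:]] for y in years]
```

The underlying data is identical in both cases; only the transformation into model matrix columns differs, which is why the choice arguably belongs to the model specification rather than to the column's type.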
It would be great to be able to specify in a formula that a variable should be treated as a categorical variable. For instance