-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: contrast coding #870
RFC: contrast coding #870
Conversation
This is really timely as the issue of contrast coding came this weekend when talking with the folks at Dato about their DataFrame -> Matrix conversion tools. Excited to review this. |
|
||
type TreatmentContrast <: AbstractContrast | ||
base::Integer | ||
matrix::Array{Any,2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably shouldn't be Any
, but rather a type parameter inferred from the arguments to the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the matrix, I think the type should actually be Float64
, at least by analogy with how continuous columns are coerced to Vector{Float64}
. But the types for termnames
and levels
probably should be inferred based on the underlying data (or that termnames is always going to be coerced to a string...)
Thanks, this is quite nice! I've only made superficial comments, but the global strategy sounds good to me -- though I'm not a specialist of this code at all. A feature I've been dreaming of for some time, and which R doesn't have, would be to have the ability to compute the redundant/omitted coefficient after the model has been estimated. This would be extremely useful to print tables for publications. For example, for treatment contrasts, to get the name of the reference level, and the associated coefficient (0). For sum contrasts, it would consist in getting the name of the omitted level, with the coefficient equal to the sum of all coefficients estimated in the model for other levels. A method for each contrasts type could provide this information. |
## Could write this as a macro, too, so that people can register their own | ||
## contrast types easily, without having to write out this boilerplate...the | ||
## downside of that would be that they'd be locked in to a particular set of | ||
## fields...although they could always just write the boilerplate themselves... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The boilerplate isn't that bad, I wouldn't bother too much. Creating new contrast types isn't that common AFAIK.
I've updated this PR with the following two big changes:
|
@johnmyleswhite @dmbates Any opinions about our discussion above? |
Hm, this PR completely got off the radar. Anybody willing to review? |
@kleinschmidt What's the state of this PR? Since nobody else seems to have time to review this, I think we should go ahead with the design we consider as the best one. Would you have time to resume starting on it? |
Oof, yeah this did totally fall off the radar (mine included). I might have some time to look at this over the weekend. |
I think I now see the appeal of being able to specify the levels directly in the contrast instances (vs. the Other than that, the only substantive thing was the possibility of storing contrasts in Finally, there are some cosmetic issues: how to format term names, renaming the internal I think it's worth getting this merged. Currently the generation of model matrices from formulas with categorical variables is broken and this fixes the biggest problems. |
@@ -44,6 +48,7 @@ export @~, | |||
combine, | |||
complete_cases, | |||
complete_cases!, | |||
contrast!, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the convention would be to call this setcontrasts!
.
I think this is about ready to merge. The last thing to settle is whether to be more specific about types in a few places (I've replied to line comments about that above). @nalimilan, want to take one more look? |
(Test failures on appveyor are unrelated, I think; failing in RDA.jl) |
I won't have the time to review this again, but since you addressed my comments, I think this can be merged. Do you think you could squash this into a few meaningful separate commits, or would it be too much work? |
Okay, thanks for all your careful attention to this! I'll merge tomorrow unless anyone else objects. @johnmyleswhite, @ararslan, @dmbates, speak now or forever hold your peace. |
My peace will forever be held. Looks good to me! Great work here! |
LGTM |
@nalimilan it's going to be a pain in the ass to squash these commits manually. Is it okay to use the autosquash "Squash and merge" option? |
Yes, that's fine since all changes are logically related. |
I just wanted to say thanks for doing such an impressive job with this, @kleinschmidt. This was one of the most impressive PR's I've ever seen in a JuliaStats repo. |
Thanks, it was my pleasure! Lots more to work on here :) |
implement contrast coding for categorical variables * types for specific contrast coding schemes and contrasts matrices * smarter generation of model matrix columns, including generating full-rank versions for terms for categorical variables which are not redundant with lower-order terms.
This PR introduces a system for contrast coding, or controlling how categorical variables are converted into columns in a
ModelMatrix
. The basic interface is that theModelFrame
constructor takes an optional keyword argumentcontrasts
, which is aDict
mapping column name symbols to contrasts, specified as subtypes ofAbstractContrast
or instances thereof. This addresses (at least partially) #757 and #788.The code could be cleaned up quite a bit and refactored, but I wanted to get feedback on the general approach so far. It also still doesn't address the problem of needing to switch contrast coding based on whether or not lower-order effects are also present (also raised in #757), but I'm planning on also implementing those checks.