RFC: contrast coding #870

Merged
merged 75 commits into from
Jul 31, 2016

kleinschmidt
Contributor

This PR introduces a system for contrast coding, or controlling how categorical variables are converted into columns in a ModelMatrix. The basic interface is that the ModelFrame constructor takes an optional keyword argument contrasts, which is a Dict mapping column name symbols to contrasts, specified as subtypes of AbstractContrast or instances thereof. This addresses (at least partially) #757 and #788.

The code could be cleaned up quite a bit and refactored, but I wanted to get feedback on the general approach so far. It also still doesn't address the problem of needing to switch contrast coding based on whether or not lower-order effects are also present (also raised in #757), but I'm planning on also implementing those checks.

@johnmyleswhite
Contributor

This is really timely, as the issue of contrast coding came up this weekend when talking with the folks at Dato about their DataFrame -> Matrix conversion tools.

Excited to review this.


type TreatmentContrast <: AbstractContrast
    base::Integer
    matrix::Array{Any,2}
Member

Probably shouldn't be Any, but rather a type parameter inferred from the arguments to the constructor.

Contributor Author

For the matrix, I think the type should actually be Float64, at least by analogy with how continuous columns are coerced to Vector{Float64}. But the types for termnames and levels probably should be inferred based on the underlying data (or that termnames is always going to be coerced to a string...)
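
The two options discussed in this thread can be sketched as follows. This is a minimal illustration in current Julia syntax (the PR itself predates Julia 1.0 and uses `type`); `TreatmentContrastSketch` is a hypothetical name, not the definition in the diff.

```julia
abstract type AbstractContrast end

# Hypothetical variant of the reviewed type: the element type of `matrix`
# is a parameter T inferred from the constructor arguments instead of Any.
struct TreatmentContrastSketch{T<:Real} <: AbstractContrast
    base::Int
    matrix::Matrix{T}
end

# Option 1 (reviewer): T follows whatever matrix is passed in.
c_int = TreatmentContrastSketch(1, [0 0; 1 0; 0 1])

# Option 2 (author): always coerce to Float64, as continuous columns are.
c_float = TreatmentContrastSketch(1, Matrix{Float64}([0 0; 1 0; 0 1]))
```

Either way the field is concretely typed, which is the point of the review comment; the remaining choice is only whether `T` is free or fixed to `Float64`.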

@nalimilan
Member

Thanks, this is quite nice! I've only made superficial comments, but the global strategy sounds good to me -- though I'm not a specialist of this code at all.

A feature I've been dreaming of for some time, and which R doesn't have, would be the ability to compute the redundant/omitted coefficient after the model has been estimated. This would be extremely useful for printing tables for publications. For example, for treatment contrasts, it would return the name of the reference level and the associated coefficient (0). For sum contrasts, it would return the name of the omitted level, with the coefficient equal to minus the sum of all coefficients estimated in the model for the other levels. A method for each contrast type could provide this information.
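
A rough sketch of what such a helper might compute, assuming the estimated coefficients are available as a plain vector in level order. All function names here are hypothetical and nothing below is part of DataFrames. Under the sum-to-zero constraint, the omitted level's coefficient is the negated sum of the estimated ones.

```julia
# Hypothetical helper: pair each estimated level with its coefficient and
# add an entry for the level that the coding scheme left out.
function fill_omitted(levels::Vector{String}, omitted::String,
                      estimates::Vector{Float64}, value::Float64)
    kept = filter(l -> l != omitted, levels)
    length(kept) == length(estimates) ||
        throw(ArgumentError("expected one estimate per non-omitted level"))
    coefs = Dict(zip(kept, estimates))
    coefs[omitted] = value
    return coefs
end

# Treatment coding: the reference level has coefficient 0 by construction.
treatment_full(levels, base, estimates) =
    fill_omitted(levels, base, estimates, 0.0)

# Sum coding: level effects sum to zero, so the omitted level's coefficient
# is minus the sum of the estimated ones.
sum_full(levels, omitted, estimates) =
    fill_omitted(levels, omitted, estimates, -sum(estimates))

t = treatment_full(["a", "b", "c"], "a", [0.7, 1.2])
s = sum_full(["a", "b", "c"], "c", [1.5, -1.0])
```

In the design proposed above, these would be methods dispatched on the contrast type rather than separately named functions.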

## Could write this as a macro, too, so that people can register their own
## contrast types easily, without having to write out this boilerplate...the
## downside of that would be that they'd be locked in to a particular set of
## fields...although they could always just write the boilerplate themselves...
Member

The boilerplate isn't that bad, I wouldn't bother too much. Creating new contrast types isn't that common AFAIK.

@kleinschmidt
Contributor Author

I've updated this PR with the following two big changes:

  • Split off a ContrastMatrix type that combines a contrast with data and holds the levels, term names, and the actual matrix of contrast codes. The subtypes of AbstractContrast are now just containers for settings that don't depend on the data (e.g., the base index for the contrast).
  • Introduce checking for non-redundant terms in model matrix construction, and "promote" non-redundant categorical variables to have full rank (Contrast coding systems for categorical variables #757)
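
The split described in the first bullet might look roughly like this. It is a self-contained sketch in current Julia syntax with hypothetical names (`TreatmentSketch`, `ContrastMatrixSketch`, `contrast_matrix`, `full_rank_matrix`), not the code in this PR. The last function only shows what a full-rank coding looks like; the redundancy check that decides when to use it is not sketched.

```julia
abstract type AbstractContrastSketch end

# Settings only: nothing here depends on the data.
struct TreatmentSketch <: AbstractContrastSketch
    base::Int   # index of the reference level
end

# Settings combined with data: levels, term names, and the coding matrix.
struct ContrastMatrixSketch{C<:AbstractContrastSketch}
    matrix::Matrix{Float64}
    termnames::Vector{String}
    levels::Vector{String}
    contrasts::C
end

# Build the k x (k-1) treatment coding for the observed levels.
function contrast_matrix(c::TreatmentSketch, levels::Vector{String})
    k = length(levels)
    1 <= c.base <= k || throw(ArgumentError("base must be in 1:$k"))
    keep = [j for j in 1:k if j != c.base]
    mat = Float64[i == j ? 1.0 : 0.0 for i in 1:k, j in keep]
    return ContrastMatrixSketch(mat, levels[keep], levels, c)
end

# Full-rank version: one indicator column per level (k x k identity).
function full_rank_matrix(levels::Vector{String})
    k = length(levels)
    return Float64[i == j ? 1.0 : 0.0 for i in 1:k, j in 1:k]
end

cm = contrast_matrix(TreatmentSketch(1), ["a", "b", "c"])
```

Here `cm.matrix` has one row per level and one column per non-reference level, while `full_rank_matrix` keeps a column for every level.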

@nalimilan
Member

@johnmyleswhite @dmbates Any opinions about our discussion above?

@nalimilan
Member

Hm, this PR completely got off the radar. Anybody willing to review?

@nalimilan
Member

@kleinschmidt What's the state of this PR? Since nobody else seems to have time to review this, I think we should go ahead with the design we consider the best one. Would you have time to resume working on it?

@kleinschmidt
Contributor Author

kleinschmidt commented Apr 29, 2016

Oof, yeah this did totally fall off the radar (mine included). I might have some time to look at this over the weekend.

@kleinschmidt
Contributor Author

kleinschmidt commented Apr 30, 2016

I think I now see the appeal of being able to specify the levels directly in the contrast instances (vs. the ContrastMatrix). It would make the code a little messier (and would slightly muddy the currently clear distinction between AbstractContrast subtypes that are independent of data, and ContrastMatrix instances that depend on data), but might improve the interface from the user's point of view. This is related to the suggestion to specify the base level, rather than the base index. It would also allow for things like re-ordering contrast levels in fitting models, or limiting to only certain levels. That might be too clever for its own good though. I'll try this out and see how it feels.

Other than that, the only substantive thing was the possibility of storing contrasts in Terms (from Formulas). But those ultimately get ingested into ModelFrames so that's something that's best tackled separately.

Finally, there are some cosmetic issues: how to format term names, renaming the internal cols function something more descriptive, cleaning up @compats now that we require 0.4, and possibly other things I've missed.

I think it's worth getting this merged. Currently the generation of model matrices from formulas with categorical variables is broken and this fixes the biggest problems.

@@ -44,6 +48,7 @@ export @~,
combine,
complete_cases,
complete_cases!,
contrast!,
Member

I think the convention would be to call this setcontrasts!.

@kleinschmidt
Contributor Author

I think this is about ready to merge. The last thing to settle is whether to be more specific about types in a few places (I've replied to line comments about that above). @nalimilan, want to take one more look?

@kleinschmidt
Contributor Author

(Test failures on appveyor are unrelated, I think; failing in RDA.jl)

@nalimilan
Member

I won't have the time to review this again, but since you addressed my comments, I think this can be merged. Do you think you could squash this into a few meaningful separate commits, or would it be too much work?

@kleinschmidt
Contributor Author

Okay, thanks for all your careful attention to this! I'll merge tomorrow unless anyone else objects. @johnmyleswhite, @ararslan, @dmbates, speak now or forever hold your peace.

@ararslan
Member

My peace will forever be held. Looks good to me! Great work here!

@dmbates
Contributor

dmbates commented Jul 30, 2016

LGTM

@kleinschmidt
Contributor Author

@nalimilan it's going to be a pain in the ass to squash these commits manually. Is it okay to use the autosquash "Squash and merge" option?

@nalimilan
Member

Yes, that's fine since all changes are logically related.

@kleinschmidt merged commit 3b21a03 into JuliaData:master on Jul 31, 2016
@johnmyleswhite
Contributor

I just wanted to say thanks for doing such an impressive job with this, @kleinschmidt. This was one of the most impressive PRs I've ever seen in a JuliaStats repo.

@kleinschmidt
Contributor Author

Thanks, it was my pleasure! Lots more to work on here :)

maximerischard pushed a commit to maximerischard/DataFrames.jl that referenced this pull request Sep 28, 2016
implement contrast coding for categorical variables

* types for specific contrast coding schemes and contrasts matrices
* smarter generation of model matrix columns, including generating full-rank
  versions for terms for categorical variables which are not redundant with
  lower-order terms.