RFC: contrast coding #870

kleinschmidt · 2015-09-21T04:50:48Z

This PR introduces a system for contrast coding, or controlling how categorical variables are converted into columns in a ModelMatrix. The basic interface is that the ModelFrame constructor takes an optional keyword argument contrasts, which is a Dict mapping column name symbols to contrasts, specified as subtypes of AbstractContrast or instances thereof. This addresses (at least partially) #757 and #788.

The code could be cleaned up quite a bit and refactored, but I wanted to get feedback on the general approach so far. It also still doesn't address the problem of needing to switch contrast coding based on whether or not lower-order effects are also present (also raised in #757), but I'm planning on also implementing those checks.
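To make the interface described above concrete, here is a minimal, self-contained sketch of how a `contrasts` Dict might be resolved per column. It is written in current Julia syntax (`struct` rather than the 0.4-era `type`). `SumContrast`, `instantiate`, and `resolve_contrasts` are hypothetical names used for illustration only, and the types are stand-ins rather than the PR's actual code.

```julia
# Hypothetical stand-ins for the types named in the description.
abstract type AbstractContrast end

struct TreatmentContrast <: AbstractContrast
    base::Int                      # index of the reference level
end
TreatmentContrast() = TreatmentContrast(1)

struct SumContrast <: AbstractContrast
    base::Int                      # index of the omitted level
end
SumContrast() = SumContrast(1)

# Accept either a contrast type or an instance, as the description allows.
instantiate(c::AbstractContrast) = c
instantiate(::Type{C}) where {C<:AbstractContrast} = C()

# Look up the contrast for each categorical column, falling back to a default
# when the user did not mention that column.
function resolve_contrasts(columns::Vector{Symbol}, contrasts::Dict;
                           default = TreatmentContrast())
    return Dict(col => instantiate(get(contrasts, col, default)) for col in columns)
end

resolved = resolve_contrasts([:group, :site], Dict(:group => SumContrast))
```

In this sketch `:group` picks up the user-supplied sum contrast, while `:site` falls back to the default treatment contrast.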

johnmyleswhite · 2015-09-21T15:03:58Z

This is really timely, as the issue of contrast coding came up this weekend when talking with the folks at Dato about their DataFrame -> Matrix conversion tools.

Excited to review this.

nalimilan · 2015-09-21T18:18:20Z

src/statsmodels/contrasts.jl

+
+type TreatmentContrast <: AbstractContrast
+    base::Integer
+    matrix::Array{Any,2}


Probably shouldn't be Any, but rather a type parameter inferred from the arguments to the constructor.

kleinschmidt · 2015-09-21

For the matrix, I think the type should actually be Float64, at least by analogy with how continuous columns are coerced to Vector{Float64}. But the types for termnames and levels probably should be inferred based on the underlying data (or termnames could simply always be coerced to strings...)
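The typing suggestion in this exchange could look roughly like the sketch below: the matrix is always `Float64`, term names are coerced to strings, and the element type of the levels is a type parameter inferred from the constructor arguments. `CodedLevels` is a hypothetical name, and the code uses current Julia syntax rather than the 0.4-era `type` shown in the diff.

```julia
# Hypothetical container: Float64 matrix, String term names, and a levels
# vector whose element type T is inferred from the constructor arguments.
struct CodedLevels{T}
    matrix::Matrix{Float64}
    termnames::Vector{String}
    levels::Vector{T}
end

function CodedLevels(matrix::AbstractMatrix{<:Real}, termnames::AbstractVector,
                     levels::AbstractVector{T}) where {T}
    # One row per level, one column per term.
    size(matrix) == (length(levels), length(termnames)) ||
        throw(ArgumentError("matrix must have one row per level and one column per term"))
    return CodedLevels{T}(Matrix{Float64}(matrix), string.(termnames), collect(levels))
end

# Integer codes and Symbol levels go in; Float64 codes and String names come out.
cl = CodedLevels([0 0; 1 0; 0 1], [:b, :c], [:a, :b, :c])
```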

nalimilan · 2015-09-21T20:34:26Z

Thanks, this is quite nice! I've only made superficial comments, but the global strategy sounds good to me -- though I'm not a specialist of this code at all.

A feature I've been dreaming of for some time, and which R doesn't have, would be the ability to compute the redundant/omitted coefficient after the model has been estimated. This would be extremely useful for printing tables for publications. For example, for treatment contrasts, it would give the name of the reference level and the associated coefficient (0). For sum contrasts, it would give the name of the omitted level, with the coefficient equal to minus the sum of the coefficients estimated in the model for the other levels. A method for each contrast type could provide this information.
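A rough sketch of that idea follows, with one method per contrast type. It assumes the usual sum-to-zero convention, under which the omitted level's effect is the negative of the sum of the estimated ones. `SumContrast`, `omitted_coef`, and `with_omitted` are hypothetical names, not part of the PR.

```julia
abstract type AbstractContrast end

struct TreatmentContrast <: AbstractContrast
    base::Int                      # index of the reference level
end

struct SumContrast <: AbstractContrast
    base::Int                      # index of the omitted level
end

# Coefficient implied for the level that has no column in the model matrix.
omitted_coef(::TreatmentContrast, coefs) = zero(eltype(coefs))
omitted_coef(::SumContrast, coefs) = -sum(coefs)

# Re-insert the omitted level so that every level gets a row in a printed table.
# `coefs` holds the estimates for the non-omitted levels, in level order.
function with_omitted(c::AbstractContrast, levels::AbstractVector,
                      coefs::AbstractVector{<:Real})
    length(coefs) == length(levels) - 1 ||
        throw(ArgumentError("expected one coefficient per non-omitted level"))
    full = Vector{Float64}(coefs)  # fresh copy, safe to mutate
    insert!(full, c.base, omitted_coef(c, coefs))
    return [lvl => b for (lvl, b) in zip(levels, full)]
end

treat = with_omitted(TreatmentContrast(1), ["a", "b", "c"], [0.5, -1.5])
# ["a" => 0.0, "b" => 0.5, "c" => -1.5]
sumc = with_omitted(SumContrast(3), ["a", "b", "c"], [0.5, -1.5])
# ["a" => 0.5, "b" => -1.5, "c" => 1.0]
```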

nalimilan · 2015-09-22T07:34:45Z

src/statsmodels/contrasts.jl

+## Could write this as a macro, too, so that people can register their own
+## contrast types easily, without having to write out this boilerplate...the
+## downside of that would be that they'd be locked in to a particular set of
+## fields...although they could always just write the boilerplate themselves...


The boilerplate isn't that bad, I wouldn't bother too much. Creating new contrast types isn't that common AFAIK.

kleinschmidt · 2015-10-05T04:11:22Z

I've updated this PR with the following two big changes:

1. Split off a ContrastMatrix type that combines a contrast with data and holds the levels, term names, and the actual matrix of contrast codes. The subtypes of AbstractContrast are now just containers for settings that don't depend on the data (e.g., the base index for the contrast).
2. Introduce checking for non-redundant terms in model matrix construction, and "promote" non-redundant categorical variables to have full rank (#757).
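The two changes can be sketched together as follows. This is a simplified illustration, not the PR's code: the redundancy check is reduced to a single `has_intercept` flag (the real check looks at lower-order terms in general), and `fullrank` and `columns_for` are hypothetical helper names.

```julia
abstract type AbstractContrast end

# Settings only: nothing here depends on the data.
struct TreatmentContrast <: AbstractContrast
    base::Int
end

# Contrast plus data: levels, term names, and the matrix of contrast codes.
struct ContrastMatrix{T}
    matrix::Matrix{Float64}
    termnames::Vector{String}
    levels::Vector{T}
end

# k×k indicator matrix: row i codes level i.
indicators(k::Int) = [Float64(i == j) for i in 1:k, j in 1:k]

# Treatment coding drops the column of the base level (k levels, k-1 columns).
function ContrastMatrix(c::TreatmentContrast, levels::AbstractVector{T}) where {T}
    k = length(levels)
    1 <= c.base <= k || throw(ArgumentError("base index out of range"))
    keep = setdiff(1:k, c.base)
    return ContrastMatrix{T}(indicators(k)[:, keep], string.(levels[keep]), collect(levels))
end

# Full-rank version: one indicator column per level.
function fullrank(levels::AbstractVector{T}) where {T}
    return ContrastMatrix{T}(indicators(length(levels)), string.(levels), collect(levels))
end

# Simplified stand-in for the redundancy check: without an intercept, the
# variable is not redundant with a lower-order term and is promoted.
columns_for(c, levels; has_intercept::Bool) =
    has_intercept ? ContrastMatrix(c, levels) : fullrank(levels)

reduced  = columns_for(TreatmentContrast(1), ["a", "b", "c"]; has_intercept = true)
promoted = columns_for(TreatmentContrast(1), ["a", "b", "c"]; has_intercept = false)
```

With an intercept the variable gets two columns (`b` and `c`); without one it gets all three.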

nalimilan · 2015-10-05T11:56:02Z

@johnmyleswhite @dmbates Any opinions about our discussion above?

nalimilan · 2016-02-22T12:42:52Z

Hm, this PR completely got off the radar. Anybody willing to review?

nalimilan · 2016-04-29T14:56:30Z

@kleinschmidt What's the state of this PR? Since nobody else seems to have time to review this, I think we should go ahead with the design we consider the best one. Would you have time to resume working on it?

kleinschmidt · 2016-04-29T15:18:19Z

Oof, yeah this did totally fall off the radar (mine included). I might have some time to look at this over the weekend.

kleinschmidt · 2016-04-30T02:50:19Z

I think I now see the appeal of being able to specify the levels directly in the contrast instances (vs. the ContrastMatrix). It would make the code a little messier (and would slightly muddy the currently clear distinction between AbstractContrast subtypes that are independent of data, and ContrastMatrix instances that depend on data), but might improve the interface from the user's point of view. This is related to the suggestion to specify the base level, rather than the base index. It would also allow for things like re-ordering contrast levels in fitting models, or limiting to only certain levels. That might be too clever for its own good though. I'll try this out and see how it feels.
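One way this could look is sketched below: the contrast stores the base level by value plus an optional explicit level ordering, and both are resolved against the data only when the matrix is built. `TreatmentByLevel` and `resolve` are hypothetical names, and this is a sketch of the idea rather than what the PR ended up doing.

```julia
# Hypothetical contrast that names its base level by value and may carry an
# explicit ordering or subset of levels.
struct TreatmentByLevel{T}
    base::T
    levels::Union{Nothing,Vector{T}}
end
TreatmentByLevel(base::T) where {T} = TreatmentByLevel{T}(base, nothing)

# Resolve against the levels actually present in the data.
function resolve(c::TreatmentByLevel{T}, data_levels::AbstractVector{T}) where {T}
    lvls = c.levels === nothing ? collect(data_levels) : c.levels
    issubset(lvls, data_levels) ||
        throw(ArgumentError("contrast levels not found in the data"))
    idx = findfirst(isequal(c.base), lvls)
    idx === nothing &&
        throw(ArgumentError("base level $(c.base) is not among the levels"))
    return (levels = lvls, base_index = idx)
end

by_value  = resolve(TreatmentByLevel("b"), ["a", "b", "c"])
reordered = resolve(TreatmentByLevel("b", ["c", "b"]), ["a", "b", "c"])
```

The second call both re-orders the levels and restricts them to a subset, which is the extra flexibility mentioned above.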

Other than that, the only substantive thing was the possibility of storing contrasts in Terms (from Formulas). But those ultimately get ingested into ModelFrames so that's something that's best tackled separately.

Finally, there are some cosmetic issues: how to format term names, renaming the internal cols function something more descriptive, cleaning up @compats now that we require 0.4, and possibly other things I've missed.

I think it's worth getting this merged. Currently the generation of model matrices from formulas with categorical variables is broken and this fixes the biggest problems.

nalimilan · 2016-05-06T11:33:40Z

src/DataFrames.jl

@@ -44,6 +48,7 @@ export @~,
       combine,
       complete_cases,
       complete_cases!,
+       contrast!,


I think the convention would be to call this setcontrasts!.

kleinschmidt · 2016-07-30T14:23:53Z

I think this is about ready to merge. The last thing to settle is whether to be more specific about types in a few places (I've replied to line comments about that above). @nalimilan, want to take one more look?

kleinschmidt · 2016-07-30T14:24:39Z

(Test failures on appveyor are unrelated, I think; failing in RDA.jl)

nalimilan · 2016-07-30T15:14:52Z

I won't have the time to review this again, but since you addressed my comments, I think this can be merged. Do you think you could squash this into a few meaningful separate commits, or would it be too much work?

kleinschmidt · 2016-07-30T17:40:08Z

Okay, thanks for all your careful attention to this! I'll merge tomorrow unless anyone else objects. @johnmyleswhite, @ararslan, @dmbates, speak now or forever hold your peace.

ararslan · 2016-07-30T17:43:01Z

My peace will forever be held. Looks good to me! Great work here!

dmbates · 2016-07-30T18:24:34Z

LGTM

kleinschmidt · 2016-07-31T19:01:45Z

@nalimilan it's going to be a pain in the ass to squash these commits manually. Is it okay to use the autosquash "Squash and merge" option?

nalimilan · 2016-07-31T21:55:45Z

Yes, that's fine since all changes are logically related.

johnmyleswhite · 2016-08-03T16:59:47Z

I just wanted to say thanks for doing such an impressive job with this, @kleinschmidt. This was one of the most impressive PRs I've ever seen in a JuliaStats repo.

kleinschmidt · 2016-08-03T21:48:16Z

Thanks, it was my pleasure! Lots more to work on here :)

implement contrast coding for categorical variables

* types for specific contrast coding schemes and contrasts matrices
* smarter generation of model matrix columns, including generating full-rank versions for terms for categorical variables which are not redundant with lower-order terms.

nalimilan reviewed Sep 21, 2015

nalimilan reviewed Sep 22, 2015

kleinschmidt mentioned this pull request Apr 30, 2016

ModelMatrix need to be able to align categorical variables #946

Closed

nalimilan reviewed May 6, 2016

kleinschmidt added 10 commits July 29, 2016 22:32

update docstrings

42c5b6f

explicit test for dummy coding

ea3ff23

vector -> abstractvector

4108bef

one-line summaries of coding schemes

666647f

replace ternary operator with explicit if

527e248

do not convert levels on contrasts to data type

3c4a144

no space after !

709740c

cosmetic tweaks

0a9c7a8

contrasts types must be instantiated

94ec01e

move todo

6dc7755

kleinschmidt added 4 commits July 31, 2016 11:55

fix doc to use concrete type

796c871

add tests for specifying contrasts in fit()

a20f290

typealiases for R names

a2e3ff5

Revert "typealiases for R names"

c113bbe

This reverts commit a2e3ff5.

kleinschmidt merged commit 3b21a03 into JuliaData:master Jul 31, 2016

GordStephen mentioned this pull request Aug 20, 2016

RFC: Sparse ModelMatrix support #1040

Merged

GordStephen mentioned this pull request Aug 28, 2016

Add compatibility with pre-contrasts ModelFrame constructor #1042

Merged

This was referenced Aug 31, 2016

Contrast coding systems for categorical variables #757

Closed

Categorical variables in Formula #867

Closed
