Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: contrast coding #870

Merged
merged 75 commits into from
Jul 31, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
75 commits
Select commit Hold shift + click to select a range
49cccb1
[WIP] contrast coding flexibility
kleinschmidt Jan 22, 2015
ef415f7
actually store matrix in contrast type
kleinschmidt Sep 20, 2015
45f704f
fix concat syntax, add defaults for cols/termnames, notes, and
kleinschmidt Sep 20, 2015
d921818
evaluate contrast _type_ passed in dict
kleinschmidt Sep 20, 2015
25334cf
update coefnames to be more like cols construction
kleinschmidt Sep 20, 2015
66ded8e
fix helmert coding constructor
kleinschmidt Sep 21, 2015
6495dd0
modify ModelFrame contrasts in place
kleinschmidt Sep 21, 2015
1e8bc79
tests for contrasts
kleinschmidt Sep 21, 2015
9194096
set default contrasts, copy on predict, and test
kleinschmidt Sep 21, 2015
83efff3
remove Base.call for 0.3 compatibility
kleinschmidt Sep 21, 2015
78dee18
...and fix push!([], ...)
kleinschmidt Sep 21, 2015
7371fe7
code review: type of matrix, spacing, validation
kleinschmidt Sep 21, 2015
2808f70
factor out contrast matrix construction
kleinschmidt Sep 21, 2015
20cbb00
code generation to factor out contrast type boilerplate
kleinschmidt Sep 22, 2015
13c4ffd
stray end
kleinschmidt Sep 22, 2015
fe05a55
parametrize contrast types by eltype of data
kleinschmidt Sep 24, 2015
e444a41
find non-redundant columns for promoting to full rank
kleinschmidt Sep 24, 2015
a93a4e1
full-rank dummy contrasts type
kleinschmidt Sep 26, 2015
a0672e7
include non_redundants in model frame
kleinschmidt Oct 4, 2015
d590913
only evaluate contrasts for columns in the df
kleinschmidt Oct 4, 2015
1c76104
clean up mm constructor using mf cols function
kleinschmidt Oct 4, 2015
62a9cf7
fix iterator syntax and add missing Dict for kwargs
kleinschmidt Oct 4, 2015
5d3bb94
handle non-redundant columns right in model matrix
kleinschmidt Oct 4, 2015
efb93fe
termnames checking for non-redundancy, plus tests
kleinschmidt Oct 5, 2015
bdb2119
default contrast is managed in formula.jl/ModelFrame
kleinschmidt Oct 5, 2015
9eb4570
introduce ContrastMatrix type
kleinschmidt Oct 5, 2015
03303ef
add @compat on Dicts
kleinschmidt Oct 9, 2015
b2c0475
optionally specify levels and base level
kleinschmidt Apr 30, 2016
c3eae6e
actually used specified levels (+tests)
kleinschmidt Apr 30, 2016
1e05fd0
actually, don't allow baseind in Contrast(...)
kleinschmidt Apr 30, 2016
0b87e93
docstrings for contrasts
kleinschmidt Apr 30, 2016
ca5be00
settled on behavior for mismatched levels
kleinschmidt May 3, 2016
74eed83
coerce levels when constructing ContrastMatrix
kleinschmidt May 3, 2016
5415c3b
move cols and termnames to be w/ friends
kleinschmidt May 3, 2016
b09b847
use "x: a" instead of "x - a" for termnames
kleinschmidt May 3, 2016
04c702e
Contrast -> Contrasts
kleinschmidt May 3, 2016
eaee627
cols -> modelmat_cols
kleinschmidt May 3, 2016
0dc6fa3
revise documentation
kleinschmidt May 3, 2016
b3c547c
contrast! -> setcontrasts!
kleinschmidt May 7, 2016
1e875ca
test for 1 + x&y when x and y are categorical
kleinschmidt May 7, 2016
faa592f
code review
kleinschmidt May 11, 2016
566adae
generalize termnames generation
kleinschmidt May 11, 2016
87e0662
rename C->contrasts in ContrastsMatrix constructor
kleinschmidt May 12, 2016
c405df5
type parameters + convert for promote_contrast
kleinschmidt May 12, 2016
bdcca01
restrict type in model frame contrasts dict
kleinschmidt May 12, 2016
eb44d54
clean up check_non_redundancy
kleinschmidt May 12, 2016
dc261c2
replace needsContrasts and can_promote with is_categorical
kleinschmidt May 12, 2016
4106835
explicit tuple and clearer if/index
kleinschmidt May 12, 2016
34402bf
evaluateContrasts -> ContrastMatrix methods
kleinschmidt May 12, 2016
109df41
pass contrasts in DataFramesModel fitting
kleinschmidt May 12, 2016
c26d2ec
contrastsmatrix constructor fix and test
kleinschmidt May 12, 2016
7493d9a
more tests: redundancy
kleinschmidt May 13, 2016
433bfd2
use get for Nullable contrasts.base
kleinschmidt Jul 22, 2016
83b6f96
[WIP] documenting contrasts in formulas.md
kleinschmidt Jul 22, 2016
f515628
rep -> Compat.repeat
kleinschmidt Jul 22, 2016
997d963
merge refactored ModelMatrix
kleinschmidt Jul 23, 2016
5410e21
fix Nothing failure on nightlies
kleinschmidt Jul 23, 2016
95c515d
use throw(ArgumentError()) instead of error()
kleinschmidt Jul 23, 2016
8798f3b
rename contrasts
kleinschmidt Jul 26, 2016
d7ce429
wrap long error message lines
kleinschmidt Jul 26, 2016
8632624
add ContrastsCoding
kleinschmidt Jul 27, 2016
42c5b6f
update docstrings
kleinschmidt Jul 30, 2016
ea3ff23
explicit test for dummy coding
kleinschmidt Jul 30, 2016
4108bef
vector -> abstractvector
kleinschmidt Jul 30, 2016
666647f
one-line summaries of coding schemes
kleinschmidt Jul 30, 2016
527e248
replace ternary operator with explicit if
kleinschmidt Jul 30, 2016
3c4a144
do not convert levels on contrasts to data type
kleinschmidt Jul 30, 2016
709740c
no space after !
kleinschmidt Jul 30, 2016
0a9c7a8
cosmetic tweaks
kleinschmidt Jul 30, 2016
94ec01e
contrasts types must be instantiated
kleinschmidt Jul 30, 2016
6dc7755
move todo
kleinschmidt Jul 30, 2016
796c871
fix doc to use concrete type
kleinschmidt Jul 31, 2016
a20f290
add tests for specifying contrasts in fit()
kleinschmidt Jul 31, 2016
a2e3ff5
typealiases for R names
kleinschmidt Jul 31, 2016
c113bbe
Revert "typealiases for R names"
kleinschmidt Jul 31, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions docs/src/man/formulas.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,18 @@ If you would like to specify both main effects and an interaction term at once,
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df))
```

You can control how categorical variables (e.g., `PooledDataArray` columns) are converted to `ModelMatrix` columns by specifying _contrasts_ when you construct a `ModelFrame`:

```julia
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df, contrasts = Dict(:X => HelmertCoding())))
```

Contrasts can also be modified in an existing `ModelFrame`:

```julia
mf = ModelFrame(Z ~ X*Y, df)
contrasts!(mf, X = HelmertCoding())
```

The construction of model matrices makes it easy to formulate complex statistical models. These are used to good effect by the [GLM Package.](https://github.com/JuliaStats/GLM.jl)

7 changes: 7 additions & 0 deletions src/DataFrames.jl
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ export @~,
@wsv_str,

AbstractDataFrame,
AbstractContrasts,
DataFrame,
DataFrameRow,
Formula,
Expand All @@ -43,6 +44,10 @@ export @~,
ModelFrame,
ModelMatrix,
SubDataFrame,
EffectsCoding,
DummyCoding,
HelmertCoding,
ContrastsCoding,

aggregate,
by,
Expand All @@ -51,6 +56,7 @@ export @~,
combine,
complete_cases,
complete_cases!,
setcontrasts!,
deleterows!,
describe,
eachcol,
Expand Down Expand Up @@ -109,6 +115,7 @@ for (dir, filename) in [
("abstractdataframe", "sort.jl"),
("dataframe", "sort.jl"),

("statsmodels", "contrasts.jl"),
("statsmodels", "formula.jl"),
("statsmodels", "statsmodel.jl"),

Expand Down
327 changes: 327 additions & 0 deletions src/statsmodels/contrasts.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,327 @@
# Specify contrasts for coding categorical data in model matrix. Contrasts types
# are a subtype of AbstractContrasts. ContrastsMatrix types hold a contrast
# matrix, levels, and term names and provide the interface for creating model
# matrix columns and coefficient names.
#
# Contrasts types themselves can be instantiated to provide containers for
# contrast settings (currently, just the base level).
#
# ModelFrame will hold a Dict{Symbol, ContrastsMatrix} that maps column
# names to contrasts.
#
# ModelMatrix will check this dict when evaluating terms, falling back to a
# default for any categorical data without a specified contrast.


"""
Interface to describe contrast coding schemes for categorical variables.

Concrete subtypes of `AbstractContrasts` describe a particular way of converting a
categorical data vector into numeric columns in a `ModelMatrix`. Each
instantiation optionally includes the levels to generate columns for and the base
level. If not specified these will be taken from the data when a `ContrastsMatrix` is
generated (during `ModelFrame` construction).

# Constructors

For `C <: AbstractContrast`:

```julia
C() # levels are inferred later
C(levels = ::Vector{Any}) # levels checked against data later
C(base = ::Any) # specify base level
C(levels = ::Vector{Any}, base = ::Any) # specify levels and base
```

If specified, levels will be checked against data when generating a
`ContrastsMatrix`. Any mismatch will result in an error, because missing data
levels would lead to empty columns in the model matrix, and missing contrast
levels would lead to empty or undefined rows.

You can also specify the base level of the contrasts. The actual interpretation
of this depends on the particular contrast type, but in general it can be
thought of as a "reference" level. It defaults to the first level.

# Concrete types

* `DummyCoding` - Code each non-base level as a 0-1 indicator column.
* `EffectsCoding` - Code each non-base level as 1, and base as -1.
* `HelmertCoding` - Code each non-base level as the difference from the mean of
the lower levels
* `ContrastsCoding` - Manually specify contrasts matrix

The last coding type, `ContrastsCoding`, provides a way to manually specify a
contrasts matrix. For a variable `x` with k levels, a contrasts matrix `M` is a
k by k-1 matrix, that maps the k levels onto k-1 model matrix columns.
Specifically, let ``X^*`` be the full-rank indicator matrix for `x`, where
``X^*_{i,j} = 1`` if `x[i]` is level `j`, and 0 otherwise. Then the model matrix
columns generated by the contrasts matrix `M` are ``X = X^* M``.

To implement your own `AbstractContrasts` type, implement a constructor, a
`contrasts_matrix` method for constructing the actual contrasts matrix that maps
from levels to `ModelMatrix` column values, and (optionally) a `termnames`
method:

```julia
type MyCoding <: AbstractContrasts
...
end

contrasts_matrix(C::MyCoding, baseind, n) = ...
termnames(C::MyCoding, levels, baseind) = ...
```

"""
abstract AbstractContrasts

# Contrasts + Levels (usually from data) = ContrastsMatrix
type ContrastsMatrix{C <: AbstractContrasts, T}
matrix::Matrix{Float64}
termnames::Vector{T}
levels::Vector{T}
contrasts::C
end

"""
ContrastsMatrix{C<:AbstractContrasts}(contrasts::C, levels::AbstractVector)

Compute contrasts matrix for given data levels.

If levels are specified in the `AbstractContrasts`, those will be used, and likewise
for the base level (which defaults to the first level).
"""
function ContrastsMatrix{C <: AbstractContrasts}(contrasts::C, levels::AbstractVector)

# if levels are defined on contrasts, use those, validating that they line up.
# what does that mean? either:
#
# 1. contrasts.levels == levels (best case)
# 2. data levels missing from contrast: would generate empty/undefined rows.
# better to filter data frame first
# 3. contrast levels missing from data: would have empty columns, generate a
# rank-deficient model matrix.
c_levels = get(contrasts.levels, levels)
if eltype(c_levels) != eltype(levels)
throw(ArgumentError("mismatching levels types: got $(eltype(levels)), expected " *
"$(eltype(c_levels)) based on contrasts levels."))
end
mismatched_levels = symdiff(c_levels, levels)
if !isempty(mismatched_levels)
throw(ArgumentError("contrasts levels not found in data or vice-versa: " *
"$mismatched_levels." *
"\n Data levels: $levels." *
"\n Contrast levels: $c_levels"))
end

n = length(c_levels)
if n == 0
throw(ArgumentError("empty set of levels found (need at least two to compute " *
"contrasts)."))
elseif n == 1
throw(ArgumentError("only one level found: $(c_levels[1]) (need at least two to " *
"compute contrasts)."))
end

# find index of base level. use contrasts.base, then default (1).
baseind = isnull(contrasts.base) ?
1 :
findfirst(c_levels, get(contrasts.base))
if baseind < 1
throw(ArgumentError("base level $(get(contrasts.base)) not found in levels " *
"$c_levels."))
end

tnames = termnames(contrasts, c_levels, baseind)

mat = contrasts_matrix(contrasts, baseind, n)

ContrastsMatrix(mat, tnames, c_levels, contrasts)
end

# Methods for constructing ContrastsMatrix from data. These are called in
# ModelFrame constructor and setcontrasts!.
# TODO: add methods for new categorical types

ContrastsMatrix(C::AbstractContrasts, v::PooledDataArray) =
ContrastsMatrix(C, levels(v))
ContrastsMatrix{C <: AbstractContrasts}(c::Type{C}, col::PooledDataArray) =
throw(ArgumentError("contrast types must be instantiated (use $c() instead of $c)"))
# given an existing ContrastsMatrix, check that all of the levels present in the
# data are present in the contrasts. Note that this behavior is different from the
# ContrastsMatrix constructor, which requires that the levels be exactly the same.
# This method exists to support things like `predict` that can operate on new data
# which may contain only a subset of the original data's levels. Checking here
# (instead of in `modelmat_cols`) allows an informative error message.
function ContrastsMatrix(c::ContrastsMatrix, col::PooledDataArray)
if !isempty(setdiff(levels(col), c.levels))
throw(ArgumentError("there are levels in data that are not in ContrastsMatrix: " *
"$(setdiff(levels(col), c.levels))" *
"\n Data levels: $(levels(col))" *
"\n Contrast levels: $(c.levels)"))
end
return c
end

function termnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer)
not_base = [1:(baseind-1); (baseind+1):length(levels)]
levels[not_base]
end

nullify(x::Nullable) = x
nullify(x) = Nullable(x)

# Making a contrast type T only requires that there be a method for
# contrasts_matrix(T, v::PooledDataArray). The rest is boilerplate.
for contrastType in [:DummyCoding, :EffectsCoding, :HelmertCoding]
@eval begin
type $contrastType <: AbstractContrasts
base::Nullable{Any}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make fields types parametric on type (including Vector{T} for levels). With this, I think you could even get rid of nullify(), since convert() should do its job automatically as long as you call ContrastType{T}(base, levels).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should restrict the types of the fields. levels should be Nullable{Vector{T}} with T a type parameter, and base should be a Nullable{Int} (or Nullable{T}). Apart from removing type instability, that should help making it clearer which types are expected in the rest of the code.

Copy link
Contributor Author

@kleinschmidt kleinschmidt Jul 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought a bit about this and I'm not so sure. For one, if we only allow instantiate contrasts types to be passed to ModelFrame et al., that would force users to actually specify the type of the column when they specify the contrasts, as in DummyCoding{String}() instead of just DummyCoding(). I'm also not sure that type instability would actually have any substantial performance costs here, since all of the heavy lifting is done by the ContrastsMatrix type which is parametrized on the element type of the column it is created from (or rather the eltype of the levels).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

levels::Nullable{Vector}
end
## constructor with optional keyword arguments, defaulting to Nullables
$contrastType(;
base=Nullable{Any}(),
levels=Nullable{Vector}()) =
$contrastType(nullify(base),
nullify(levels))
end
end

"""
FullDummyCoding()

Coding that generates one indicator (1 or 0) column for each level,
__including__ the base level.

Needed internally when a term is non-redundant with lower-order terms (e.g., in
`~0+x` vs. `~1+x`, or in the interactions terms in `~1+x+x&y` vs. `~1+x+y+x&y`. In the
non-redundant cases, we can expand x into `length(levels(x))` columns
without creating a non-identifiable model matrix (unless the user has done
something foolish in specifying the model, which we can't do much about anyway).
"""
type FullDummyCoding <: AbstractContrasts
# Dummy contrasts have no base level (since all levels produce a column)
end

ContrastsMatrix{T}(C::FullDummyCoding, lvls::Vector{T}) =
ContrastsMatrix(eye(Float64, length(lvls)), lvls, lvls, C)

"Promote contrasts matrix to full rank version"
Base.convert(::Type{ContrastsMatrix{FullDummyCoding}}, C::ContrastsMatrix) =
ContrastsMatrix(FullDummyCoding(), C.levels)

Copy link
Member

@nalimilan nalimilan May 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two spaces line breaks should be enough... :-)

"""
DummyCoding([base[, levels]])

Contrast coding that generates one indicator column (1 or 0) for each non-base level.

Columns have non-zero mean and are collinear with an intercept column (and
lower-order columns for interactions) but are orthogonal to each other. In a
regression model, dummy coding leads to an intercept that is the mean of the
dependent variable for base level.

Also known as "treatment coding" (`contr.treatment` in R) or "one-hot encoding".
"""
DummyCoding

contrasts_matrix(C::DummyCoding, baseind, n) = eye(n)[:, [1:(baseind-1); (baseind+1):n]]


"""
EffectsCoding([base[, levels]])

Contrast coding that generates columns that code each non-base level as the
deviation from the base level. For each non-base level `x` of `variable`, a
column is generated with 1 where `variable .== x` and -1 where `col .== base`.

`EffectsCoding` is like `DummyCoding`, but using -1 for the base level instead
of 0.

When all levels are equally frequent, effects coding generates model matrix
columns that are mean centered (have mean 0). For more than two levels the
generated columns are not orthogonal. In a regression model with an
effects-coded variable, the intercept corresponds to the grand mean.

Also known as "sum coding" (`contr.sum` in R) or "simple coding" (SPSS). Note
though that the default in R and SPSS is to use the _last_ level as the base.
Here we use the _first_ level as the base, for consistency with other coding
schemes.
"""
EffectsCoding

function contrasts_matrix(C::EffectsCoding, baseind, n)
not_base = [1:(baseind-1); (baseind+1):n]
mat = eye(n)[:, not_base]
mat[baseind, :] = -1
return mat
end

"""
HelmertCoding([base[, levels]])

Contrasts that code each level as the difference from the average of the lower
levels.

For each non-base level, Helmert coding generates a columns with -1 for each of
n levels below, n for that level, and 0 above.

# Examples

```julia
julia> ContrastsMatrix(HelmertCoding(), collect(1:4)).matrix
4x3 Array{Float64,2}:
-1.0 -1.0 -1.0
1.0 -1.0 -1.0
0.0 2.0 -1.0
0.0 0.0 3.0
```

When all levels are equally frequent, Helmert coding generates columns that are
mean-centered (mean 0) and orthogonal.
"""
HelmertCoding

function contrasts_matrix(C::HelmertCoding, baseind, n)
mat = zeros(n, n-1)
for i in 1:n-1
mat[1:i, i] = -1
mat[i+1, i] = i
end

# re-shuffle the rows such that base is the all -1.0 row (currently first)
mat = mat[[baseind; 1:(baseind-1); (baseind+1):end], :]
return mat
end

"""
ContrastsCoding(mat::Matrix[, base[, levels]])

Coding by manual specification of contrasts matrix. For k levels, the contrasts
must be a k by k-1 Matrix.
"""
type ContrastsCoding <: AbstractContrasts
mat::Matrix
base::Nullable{Any}
levels::Nullable{Vector}

function ContrastsCoding(mat, base, levels)
if !isnull(levels)
check_contrasts_size(mat, length(get(levels)))
end
new(mat, base, levels)
end
end

check_contrasts_size(mat::Matrix, n_lev) =
size(mat) == (n_lev, n_lev-1) ||
throw(ArgumentError("contrasts matrix wrong size for $n_lev levels. " *
"Expected $((n_lev, n_lev-1)), got $(size(mat))"))

## constructor with optional keyword arguments, defaulting to Nullables
ContrastsCoding(mat::Matrix; base=Nullable{Any}(), levels=Nullable{Vector}()) =
ContrastsCoding(mat, nullify(base), nullify(levels))

function contrasts_matrix(C::ContrastsCoding, baseind, n)
check_contrasts_size(C.mat, n)
C.mat
end
Loading