Skip to content

Commit

Permalink
RFC: contrast coding (JuliaData#870)
Browse files Browse the repository at this point in the history
implement contrast coding for categorical variables

* types for specific contrast coding schemes and contrasts matrices
* smarter generation of model matrix columns, including generating full-rank
  versions for terms for categorical variables which are not redundant with
  lower-order terms.
  • Loading branch information
kleinschmidt authored and maximerischard committed Sep 28, 2016
1 parent 685d949 commit 25c3e78
Show file tree
Hide file tree
Showing 9 changed files with 843 additions and 90 deletions.
13 changes: 13 additions & 0 deletions docs/src/man/formulas.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,18 @@ If you would like to specify both main effects and an interaction term at once,
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df))
```

You can control how categorical variables (e.g., `PooledDataArray` columns) are converted to `ModelMatrix` columns by specifying _contrasts_ when you construct a `ModelFrame`:

```julia
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df, contrasts = Dict(:X => HelmertCoding())))
```

Contrasts can also be modified in an existing `ModelFrame`:

```julia
mf = ModelFrame(Z ~ X*Y, df)
contrasts!(mf, X = HelmertCoding())
```

The construction of model matrices makes it easy to formulate complex statistical models. These are used to good effect by the [GLM Package.](https://github.com/JuliaStats/GLM.jl)

7 changes: 7 additions & 0 deletions src/DataFrames.jl
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ export @~,
@wsv_str,

AbstractDataFrame,
AbstractContrasts,
DataFrame,
DataFrameRow,
Formula,
Expand All @@ -43,6 +44,10 @@ export @~,
ModelFrame,
ModelMatrix,
SubDataFrame,
EffectsCoding,
DummyCoding,
HelmertCoding,
ContrastsCoding,

aggregate,
by,
Expand All @@ -51,6 +56,7 @@ export @~,
combine,
complete_cases,
complete_cases!,
setcontrasts!,
deleterows!,
describe,
eachcol,
Expand Down Expand Up @@ -109,6 +115,7 @@ for (dir, filename) in [
("abstractdataframe", "sort.jl"),
("dataframe", "sort.jl"),

("statsmodels", "contrasts.jl"),
("statsmodels", "formula.jl"),
("statsmodels", "statsmodel.jl"),

Expand Down
327 changes: 327 additions & 0 deletions src/statsmodels/contrasts.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,327 @@
# Specify contrasts for coding categorical data in model matrix. Contrasts types
# are a subtype of AbstractContrasts. ContrastsMatrix types hold a contrast
# matrix, levels, and term names and provide the interface for creating model
# matrix columns and coefficient names.
#
# Contrasts types themselves can be instantiated to provide containers for
# contrast settings (currently, just the base level).
#
# ModelFrame will hold a Dict{Symbol, ContrastsMatrix} that maps column
# names to contrasts.
#
# ModelMatrix will check this dict when evaluating terms, falling back to a
# default for any categorical data without a specified contrast.


"""
Interface to describe contrast coding schemes for categorical variables.
Concrete subtypes of `AbstractContrasts` describe a particular way of converting a
categorical data vector into numeric columns in a `ModelMatrix`. Each
instantiation optionally includes the levels to generate columns for and the base
level. If not specified these will be taken from the data when a `ContrastsMatrix` is
generated (during `ModelFrame` construction).
# Constructors
For `C <: AbstractContrast`:
```julia
C() # levels are inferred later
C(levels = ::Vector{Any}) # levels checked against data later
C(base = ::Any) # specify base level
C(levels = ::Vector{Any}, base = ::Any) # specify levels and base
```
If specified, levels will be checked against data when generating a
`ContrastsMatrix`. Any mismatch will result in an error, because missing data
levels would lead to empty columns in the model matrix, and missing contrast
levels would lead to empty or undefined rows.
You can also specify the base level of the contrasts. The actual interpretation
of this depends on the particular contrast type, but in general it can be
thought of as a "reference" level. It defaults to the first level.
# Concrete types
* `DummyCoding` - Code each non-base level as a 0-1 indicator column.
* `EffectsCoding` - Code each non-base level as 1, and base as -1.
* `HelmertCoding` - Code each non-base level as the difference from the mean of
the lower levels
* `ContrastsCoding` - Manually specify contrasts matrix
The last coding type, `ContrastsCoding`, provides a way to manually specify a
contrasts matrix. For a variable `x` with k levels, a contrasts matrix `M` is a
k by k-1 matrix, that maps the k levels onto k-1 model matrix columns.
Specifically, let ``X^*`` be the full-rank indicator matrix for `x`, where
``X^*_{i,j} = 1`` if `x[i]` is level `j`, and 0 otherwise. Then the model matrix
columns generated by the contrasts matrix `M` are ``X = X^* M``.
To implement your own `AbstractContrasts` type, implement a constructor, a
`contrasts_matrix` method for constructing the actual contrasts matrix that maps
from levels to `ModelMatrix` column values, and (optionally) a `termnames`
method:
```julia
type MyCoding <: AbstractContrasts
...
end
contrasts_matrix(C::MyCoding, baseind, n) = ...
termnames(C::MyCoding, levels, baseind) = ...
```
"""
abstract AbstractContrasts

# Contrasts + Levels (usually from data) = ContrastsMatrix
type ContrastsMatrix{C <: AbstractContrasts, T}
matrix::Matrix{Float64}
termnames::Vector{T}
levels::Vector{T}
contrasts::C
end

"""
ContrastsMatrix{C<:AbstractContrasts}(contrasts::C, levels::AbstractVector)
Compute contrasts matrix for given data levels.
If levels are specified in the `AbstractContrasts`, those will be used, and likewise
for the base level (which defaults to the first level).
"""
function ContrastsMatrix{C <: AbstractContrasts}(contrasts::C, levels::AbstractVector)

# if levels are defined on contrasts, use those, validating that they line up.
# what does that mean? either:
#
# 1. contrasts.levels == levels (best case)
# 2. data levels missing from contrast: would generate empty/undefined rows.
# better to filter data frame first
# 3. contrast levels missing from data: would have empty columns, generate a
# rank-deficient model matrix.
c_levels = get(contrasts.levels, levels)
if eltype(c_levels) != eltype(levels)
throw(ArgumentError("mismatching levels types: got $(eltype(levels)), expected " *
"$(eltype(c_levels)) based on contrasts levels."))
end
mismatched_levels = symdiff(c_levels, levels)
if !isempty(mismatched_levels)
throw(ArgumentError("contrasts levels not found in data or vice-versa: " *
"$mismatched_levels." *
"\n Data levels: $levels." *
"\n Contrast levels: $c_levels"))
end

n = length(c_levels)
if n == 0
throw(ArgumentError("empty set of levels found (need at least two to compute " *
"contrasts)."))
elseif n == 1
throw(ArgumentError("only one level found: $(c_levels[1]) (need at least two to " *
"compute contrasts)."))
end

# find index of base level. use contrasts.base, then default (1).
baseind = isnull(contrasts.base) ?
1 :
findfirst(c_levels, get(contrasts.base))
if baseind < 1
throw(ArgumentError("base level $(get(contrasts.base)) not found in levels " *
"$c_levels."))
end

tnames = termnames(contrasts, c_levels, baseind)

mat = contrasts_matrix(contrasts, baseind, n)

ContrastsMatrix(mat, tnames, c_levels, contrasts)
end

# Methods for constructing ContrastsMatrix from data. These are called in
# ModelFrame constructor and setcontrasts!.
# TODO: add methods for new categorical types

ContrastsMatrix(C::AbstractContrasts, v::PooledDataArray) =
ContrastsMatrix(C, levels(v))
ContrastsMatrix{C <: AbstractContrasts}(c::Type{C}, col::PooledDataArray) =
throw(ArgumentError("contrast types must be instantiated (use $c() instead of $c)"))
# given an existing ContrastsMatrix, check that all of the levels present in the
# data are present in the contrasts. Note that this behavior is different from the
# ContrastsMatrix constructor, which requires that the levels be exactly the same.
# This method exists to support things like `predict` that can operate on new data
# which may contain only a subset of the original data's levels. Checking here
# (instead of in `modelmat_cols`) allows an informative error message.
function ContrastsMatrix(c::ContrastsMatrix, col::PooledDataArray)
if !isempty(setdiff(levels(col), c.levels))
throw(ArgumentError("there are levels in data that are not in ContrastsMatrix: " *
"$(setdiff(levels(col), c.levels))" *
"\n Data levels: $(levels(col))" *
"\n Contrast levels: $(c.levels)"))
end
return c
end

function termnames(C::AbstractContrasts, levels::AbstractVector, baseind::Integer)
not_base = [1:(baseind-1); (baseind+1):length(levels)]
levels[not_base]
end

nullify(x::Nullable) = x
nullify(x) = Nullable(x)

# Making a contrast type T only requires that there be a method for
# contrasts_matrix(T, v::PooledDataArray). The rest is boilerplate.
for contrastType in [:DummyCoding, :EffectsCoding, :HelmertCoding]
@eval begin
type $contrastType <: AbstractContrasts
base::Nullable{Any}
levels::Nullable{Vector}
end
## constructor with optional keyword arguments, defaulting to Nullables
$contrastType(;
base=Nullable{Any}(),
levels=Nullable{Vector}()) =
$contrastType(nullify(base),
nullify(levels))
end
end

"""
FullDummyCoding()
Coding that generates one indicator (1 or 0) column for each level,
__including__ the base level.
Needed internally when a term is non-redundant with lower-order terms (e.g., in
`~0+x` vs. `~1+x`, or in the interactions terms in `~1+x+x&y` vs. `~1+x+y+x&y`. In the
non-redundant cases, we can expand x into `length(levels(x))` columns
without creating a non-identifiable model matrix (unless the user has done
something foolish in specifying the model, which we can't do much about anyway).
"""
type FullDummyCoding <: AbstractContrasts
# Dummy contrasts have no base level (since all levels produce a column)
end

ContrastsMatrix{T}(C::FullDummyCoding, lvls::Vector{T}) =
ContrastsMatrix(eye(Float64, length(lvls)), lvls, lvls, C)

"Promote contrasts matrix to full rank version"
Base.convert(::Type{ContrastsMatrix{FullDummyCoding}}, C::ContrastsMatrix) =
ContrastsMatrix(FullDummyCoding(), C.levels)

"""
DummyCoding([base[, levels]])
Contrast coding that generates one indicator column (1 or 0) for each non-base level.
Columns have non-zero mean and are collinear with an intercept column (and
lower-order columns for interactions) but are orthogonal to each other. In a
regression model, dummy coding leads to an intercept that is the mean of the
dependent variable for base level.
Also known as "treatment coding" (`contr.treatment` in R) or "one-hot encoding".
"""
DummyCoding

contrasts_matrix(C::DummyCoding, baseind, n) = eye(n)[:, [1:(baseind-1); (baseind+1):n]]


"""
EffectsCoding([base[, levels]])
Contrast coding that generates columns that code each non-base level as the
deviation from the base level. For each non-base level `x` of `variable`, a
column is generated with 1 where `variable .== x` and -1 where `col .== base`.
`EffectsCoding` is like `DummyCoding`, but using -1 for the base level instead
of 0.
When all levels are equally frequent, effects coding generates model matrix
columns that are mean centered (have mean 0). For more than two levels the
generated columns are not orthogonal. In a regression model with an
effects-coded variable, the intercept corresponds to the grand mean.
Also known as "sum coding" (`contr.sum` in R) or "simple coding" (SPSS). Note
though that the default in R and SPSS is to use the _last_ level as the base.
Here we use the _first_ level as the base, for consistency with other coding
schemes.
"""
EffectsCoding

function contrasts_matrix(C::EffectsCoding, baseind, n)
not_base = [1:(baseind-1); (baseind+1):n]
mat = eye(n)[:, not_base]
mat[baseind, :] = -1
return mat
end

"""
HelmertCoding([base[, levels]])
Contrasts that code each level as the difference from the average of the lower
levels.
For each non-base level, Helmert coding generates a columns with -1 for each of
n levels below, n for that level, and 0 above.
# Examples
```julia
julia> ContrastsMatrix(HelmertCoding(), collect(1:4)).matrix
4x3 Array{Float64,2}:
-1.0 -1.0 -1.0
1.0 -1.0 -1.0
0.0 2.0 -1.0
0.0 0.0 3.0
```
When all levels are equally frequent, Helmert coding generates columns that are
mean-centered (mean 0) and orthogonal.
"""
HelmertCoding

function contrasts_matrix(C::HelmertCoding, baseind, n)
mat = zeros(n, n-1)
for i in 1:n-1
mat[1:i, i] = -1
mat[i+1, i] = i
end

# re-shuffle the rows such that base is the all -1.0 row (currently first)
mat = mat[[baseind; 1:(baseind-1); (baseind+1):end], :]
return mat
end

"""
ContrastsCoding(mat::Matrix[, base[, levels]])
Coding by manual specification of contrasts matrix. For k levels, the contrasts
must be a k by k-1 Matrix.
"""
type ContrastsCoding <: AbstractContrasts
mat::Matrix
base::Nullable{Any}
levels::Nullable{Vector}

function ContrastsCoding(mat, base, levels)
if !isnull(levels)
check_contrasts_size(mat, length(get(levels)))
end
new(mat, base, levels)
end
end

check_contrasts_size(mat::Matrix, n_lev) =
size(mat) == (n_lev, n_lev-1) ||
throw(ArgumentError("contrasts matrix wrong size for $n_lev levels. " *
"Expected $((n_lev, n_lev-1)), got $(size(mat))"))

## constructor with optional keyword arguments, defaulting to Nullables
ContrastsCoding(mat::Matrix; base=Nullable{Any}(), levels=Nullable{Vector}()) =
ContrastsCoding(mat, nullify(base), nullify(levels))

function contrasts_matrix(C::ContrastsCoding, baseind, n)
check_contrasts_size(C.mat, n)
C.mat
end
Loading

0 comments on commit 25c3e78

Please sign in to comment.