Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Sparse ModelMatrix support #1040

Merged
merged 11 commits into from
Aug 26, 2016
31 changes: 17 additions & 14 deletions src/statsmodels/formula.jl
Original file line number Diff line number Diff line change
Expand Up @@ -323,8 +323,8 @@ function setcontrasts!(mf::ModelFrame, new_contrasts::Dict)
end
setcontrasts!(mf::ModelFrame; kwargs...) = setcontrasts!(mf, Dict(kwargs))

asmatrix(a::AbstractMatrix) = a
asmatrix(v::AbstractVector) = reshape(v, (length(v), 1))
asmatrix(T::Type, a::AbstractMatrix) = convert(T, a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is only used in modelmat_cols, probably better remove it altogether. You can call reshape(a, size(a, 1), size(a, 2))) instead, which works for both vectors and matrices.

asmatrix(T::Type, v::AbstractVector) = convert(T, reshape(v, (length(v), 1)))

"""
StatsBase.model_response(mf::ModelFrame)
Expand All @@ -339,33 +339,35 @@ function StatsBase.model_response(mf::ModelFrame)
end
end

modelmat_cols(v::DataVector) = asmatrix(convert(Vector{Float64}, v.data))
modelmat_cols(v::Vector) = asmatrix(convert(Vector{Float64}, v))
modelmat_cols{T<:ModelMatrixContainer}(::Type{T}, v::DataVector) = asmatrix(T, convert(Vector{Float64}, v.data))
modelmat_cols{T<:ModelMatrixContainer}(::Type{T}, v::Vector) = asmatrix(T, convert(Vector{Float64}, v))

## construct model matrix columns from model frame + name (checks for contrasts)
function modelmat_cols(name::Symbol, mf::ModelFrame; non_redundant::Bool = false)
function modelmat_cols{T<:ModelMatrixContainer}(::Type{T}, name::Symbol, mf::ModelFrame; non_redundant::Bool = false)
if haskey(mf.contrasts, name)
modelmat_cols(mf.df[name],
modelmat_cols(T, mf.df[name],
non_redundant ?
ContrastsMatrix{FullDummyCoding}(mf.contrasts[name]) :
mf.contrasts[name])
else
modelmat_cols(mf.df[name])
modelmat_cols(T, mf.df[name])
end
end

"""
modelmat_cols(v::PooledDataVector, contrast::ContrastsMatrix)
modelmat_cols(T::Type{ModelMatrixContainer}, v::PooledDataVector, contrast::ContrastsMatrix)

Construct `ModelMatrix` columns based on specified contrasts, ensuring that
Construct `ModelMatrix` columns of type `T` based on specified contrasts, ensuring that
levels align properly.
"""
function modelmat_cols(v::PooledDataVector, contrast::ContrastsMatrix)
function modelmat_cols{T<:ModelMatrixContainer}(::Type{T}, v::PooledDataVector, contrast::ContrastsMatrix)
## make sure the levels of the contrast matrix and the categorical data
## are the same by constructing a re-indexing vector. Indexing into
## reindex with v.refs will give the corresponding row number of the
## contrast matrix
reindex = [findfirst(contrast.levels, l) for l in levels(v)]
return contrast.matrix[reindex[v.refs], :]
contrastmatrix = convert(T, contrast.matrix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases can T be different from contrast.matrix's type?

Copy link
Author

@GordStephen GordStephen Aug 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contrast.matrix is always a Matrix{Float64}, while T is whatever your desired output type is (e.g. any AbstractFloatMatrix subtype).

return contrastmatrix[reindex[v.refs], :]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This array creation can be extremely slow for sparse T and large datasets... but I'm not sure if there's a faster way to do it without creating dense columns first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. Why is it slow? Indexing rows shouldn't be a problem for sparse matrices AFAIK.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on sparse matrix indexing, but it seems to spend a lot of time sorting... Truncated profile output from a million-row reference vector and 5-column constrast matrix:

 3594 ./event.jl:68; (::Base.REPL.##3#4{Base.REPL.REPLBackend})()
 3594 ./REPL.jl:95; macro expansion
  3594 ./REPL.jl:64; eval_user_input(::Any, ::Base.REPL.REPLBackend)
   3594 ./boot.jl:234; eval(::Module, ::Any)
    3594 ./<missing>:?; anonymous
     3594 ./profile.jl:16; macro expansion;
      3594 ./sparse/sparsematrix.jl:2099; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
       4    ./sparse/sparsematrix.jl:2437; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
        3 ./reduce.jl:371; extrema(::Array{Int64,1})
        1 ./reduce.jl:372; extrema(::Array{Int64,1})
       3590 ./sparse/sparsematrix.jl:2448; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
        3547 ./sparse/sparsematrix.jl:2422; getindex_general(::SparseMatrixCSC{Float64,Int64}, ::Array{In...
         1    ./sort.jl:451; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
         4    ./sort.jl:452; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
         3542 ./sort.jl:454; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
          3542 ./sort.jl:404; sort!(::Array{Int64,1}, ::Base.Sort.QuickSortAlg, ::Base.Ord...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... You could ask on the mailing list for advice about the best algorithm to do this for sparse matrices. I guess working column by column (for SparseMatrixCSC) would make more sense.

end

"""
Expand All @@ -374,7 +376,7 @@ Create pairwise products of columns from a vector of matrices
"""
function expandcols(trm::Vector)
if length(trm) == 1
asmatrix(convert(Array{Float64}, trm[1]))
asmatrix(Matrix{Float64}, convert(Array{Float64}, trm[1]))
Copy link
Author

@GordStephen GordStephen Aug 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the conversions here (and just above) are redundant since elements of trm are always created by modelmat_cols, which already handles this?

else
a = convert(Array{Float64}, trm[1])
b = expandcols(trm[2 : end])
Expand Down Expand Up @@ -439,7 +441,8 @@ If there is an intercept in the model, that column occurs first and its
Mixed-effects models include "random-effects" terms which are ignored when
creating the model matrix.
"""
function (::Type{ModelMatrix{T}}){T<:ModelMatrixContainer}(mf::ModelFrame)
@compat function (::Type{ModelMatrix{T}}){T<:ModelMatrixContainer}(mf::ModelFrame)
sparsemm = T <: AbstractSparseMatrix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable doesn't seem to be used anywhere.

dfrm = mf.df
terms = droprandomeffects(dropresponse!(mf.terms))

Expand Down Expand Up @@ -473,7 +476,7 @@ function (::Type{ModelMatrix{T}}){T<:ModelMatrixContainer}(mf::ModelFrame)
## and storing as necessary)
for (et, nr) in zip(eterms, non_redundants)
if ! haskey(eterm_cols, (et, nr))
eterm_cols[(et, nr)] = modelmat_cols(et, mf, non_redundant=nr)
eterm_cols[(et, nr)] = modelmat_cols(T, et, mf, non_redundant=nr)
end
push!(term_cols, eterm_cols[(et, nr)])
end
Expand Down