-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Sparse ModelMatrix support #1040
RFC: Sparse ModelMatrix support #1040
Conversation
@@ -48,8 +48,8 @@ type ModelFrame | |||
contrasts::Dict{Symbol, ContrastsMatrix} | |||
end | |||
|
|||
type ModelMatrix{T <: @compat(Union{Float32, Float64})} | |||
m::Matrix{T} | |||
type ModelMatrix{T <: @compat(Union{Matrix{Float32}, Matrix{Float64}, SparseMatrixCSC{Float32,Int}, SparseMatrixCSC{Float64,Int}})} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could drop the @compat
here since the package no longer supports Julia 0.3.
end | ||
|
||
factors = terms.factors | ||
|
||
## Map eval. term name + redundancy bool to cached model matrix columns | ||
eterm_cols = @compat Dict{Tuple{Symbol,Bool}, Array{Float64}}() | ||
eterm_cols = @compat Dict{Tuple{Symbol,Bool}, T}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find any reason not to restrict the Array dimension here. Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only issue I can think of is the case where a single-column term would give a column vector instead of a one-column matrix. But conversion will probably happen automatically, and tests should catch this. Have you run the tests of GLM.jl on the modified package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I could tell modelmat_cols
handles the vector->matrix conversion. The GLM.jl tests pass as well, so hopefully this is OK.
Should the new constructor be |
Both are type-stable, but the former is probably more idiomatic. You can implement it via the |
else | ||
a = convert(Array{Float64}, trm[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the conversions here (and just above) are redundant since elements of trm
are always created by modelmat_cols
, which already handles this?
## make sure the levels of the contrast matrix and the categorical data | ||
## are the same by constructing a re-indexing vector. Indexing into | ||
## reindex with v.refs will give the corresponding row number of the | ||
## contrast matrix | ||
reindex = [findfirst(contrast.levels, l) for l in levels(v)] | ||
return contrast.matrix[reindex[v.refs], :] | ||
contrastmatrix = convert(T, contrast.matrix) | ||
return contrastmatrix[reindex[v.refs], :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This array creation can be extremely slow for sparse T
and large datasets... but I'm not sure if there's a faster way to do it without creating dense columns first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea. Why is it slow? Indexing rows shouldn't be a problem for sparse matrices AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert on sparse matrix indexing, but it seems to spend a lot of time sorting... Truncated profile output from a million-row reference vector and 5-column constrast matrix:
3594 ./event.jl:68; (::Base.REPL.##3#4{Base.REPL.REPLBackend})()
3594 ./REPL.jl:95; macro expansion
3594 ./REPL.jl:64; eval_user_input(::Any, ::Base.REPL.REPLBackend)
3594 ./boot.jl:234; eval(::Module, ::Any)
3594 ./<missing>:?; anonymous
3594 ./profile.jl:16; macro expansion;
3594 ./sparse/sparsematrix.jl:2099; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
4 ./sparse/sparsematrix.jl:2437; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
3 ./reduce.jl:371; extrema(::Array{Int64,1})
1 ./reduce.jl:372; extrema(::Array{Int64,1})
3590 ./sparse/sparsematrix.jl:2448; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
3547 ./sparse/sparsematrix.jl:2422; getindex_general(::SparseMatrixCSC{Float64,Int64}, ::Array{In...
1 ./sort.jl:451; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
4 ./sort.jl:452; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
3542 ./sort.jl:454; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
3542 ./sort.jl:404; sort!(::Array{Int64,1}, ::Base.Sort.QuickSortAlg, ::Base.Ord...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... You could ask on the mailing list for advice about the best algorithm to do this for sparse matrices. I guess working column by column (for SparseMatrixCSC
) would make more sense.
Ok, this should be functionally complete now. I essentially just generalized the existing model matrix creation logic to abritrary types beyond |
@@ -48,8 +48,10 @@ type ModelFrame | |||
contrasts::Dict{Symbol, ContrastsMatrix} | |||
end | |||
|
|||
type ModelMatrix{T <: @compat(Union{Float32, Float64})} | |||
m::Matrix{T} | |||
typealias ModelMatrixContainer{T<:AbstractFloat} AbstractMatrix{T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer if it was called AbstractFloatMatrix
or something like that.
else | ||
a = convert(Array{Float64}, trm[1]) | ||
b = expandcols(trm[2 : end]) | ||
a, b = trm[1], expandcols(trm[2 : end]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the assignments on two lines.
""" | ||
modelmat_cols(v::PooledDataVector, contrast::ContrastsMatrix) | ||
modelmat_cols(T::Type{AbstractFloatMatrix}, v::PooledDataVector, contrast::ContrastsMatrix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signature is incorrect, follow the actual one from the code.
LGTM apart from the details I just commented. The test failures are due to RDA. Anything else to do before merging? Could you also test that MixedModels tests pass? |
MixedModels tests pass locally on 0.4 and 0.5 - GLM tests fail on 0.4 on both I think that's it then... I'll look into the sparse indexing performance. For now waiting a bit during large sparse model matrix creation is certainly preferable to an |
@@ -352,7 +352,7 @@ modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::DataVector) = convert(T, res | |||
modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::Vector) = convert(T, reshape(v, length(v), 1)) | |||
|
|||
""" | |||
modelmat_cols(T::Type{AbstractFloatMatrix}, v::PooledDataVector, contrast::ContrastsMatrix) | |||
modelmat_cols(::Type{T}, v::PooledDataVector, contrast::ContrastsMatrix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also mention the restriction T<:AbstractFloatMatrix
as a type parameter.
## make sure the levels of the contrast matrix and the categorical data | ||
## are the same by constructing a re-indexing vector. Indexing into | ||
## reindex with v.refs will give the corresponding row number of the | ||
## contrast matrix | ||
reindex = [findfirst(contrast.levels, l) for l in levels(v)] | ||
return contrast.matrix[reindex[v.refs], :] | ||
contrastmatrix = convert(T, contrast.matrix) | ||
return contrastmatrix[reindex[v.refs], :] | ||
end | ||
|
||
""" | ||
expandcols(trm::Vector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This signature should also be updated to mention the restriction on the element type.
Ok, docstrings updated. |
Thanks! |
Parametrize ModelMatrix container type.
Parametrize ModelMatrix container type.
In case anyone is googling and cannot find the right way to call this functionality, you can use
Note that SparseMatrixCSC requires two types, while Matrix only requires one, e.g. Matrix{Float64} |
Or rather |
As discussed in #614 - coming back to this now that #870 and #1017 have been merged.