
RFC: Sparse ModelMatrix support #1040

Merged — 11 commits merged into JuliaData:master on Aug 26, 2016

Conversation

@GordStephen

As discussed in #614 - coming back to this now that #870 and #1017 have been merged.

@@ -48,8 +48,8 @@ type ModelFrame
contrasts::Dict{Symbol, ContrastsMatrix}
end

type ModelMatrix{T <: @compat(Union{Float32, Float64})}
m::Matrix{T}
type ModelMatrix{T <: @compat(Union{Matrix{Float32}, Matrix{Float64}, SparseMatrixCSC{Float32,Int}, SparseMatrixCSC{Float64,Int}})}
Member

We could drop the @compat here since the package no longer supports Julia 0.3.

end

factors = terms.factors

## Map eval. term name + redundancy bool to cached model matrix columns
eterm_cols = @compat Dict{Tuple{Symbol,Bool}, Array{Float64}}()
eterm_cols = @compat Dict{Tuple{Symbol,Bool}, T}()
Author

I couldn't find any reason not to restrict the Array dimension here. Did I miss something?

Member

The only issue I can think of is the case where a single-column term would give a column vector instead of a one-column matrix. But conversion will probably happen automatically, and tests should catch this. Have you run the tests of GLM.jl on the modified package?

Author (@GordStephen, Aug 20, 2016)

From what I could tell modelmat_cols handles the vector->matrix conversion. The GLM.jl tests pass as well, so hopefully this is OK.

@GordStephen (Author) commented Aug 20, 2016

Should the new constructor be ModelMatrix{Matrix{Float64}}(mf) instead of ModelMatrix(Matrix{Float64}, mf)? The latter isn't type-stable (right?), but I wasn't sure how to implement the former.

@nalimilan (Member)

Both are type-stable, but the former is probably more idiomatic. You can implement it via the (::Type{ModelMatrix{T}}){T<:AbstractMatrix{...}}(...) = ... syntax.
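The `(::Type{ModelMatrix{T}}){T<:...}(...)` form above is the Julia 0.5-era call-overloading syntax; in current Julia the same pattern is written with a `where` clause, and the default struct constructor already converts its argument to the field type. A minimal toy sketch of the idea (`Wrapper` is a hypothetical stand-in for `ModelMatrix`, not code from this PR):

```julia
using SparseArrays

# Hypothetical stand-in for ModelMatrix: the container type is itself the
# type parameter, so callers write Wrapper{SomeMatrixType}(x).
struct Wrapper{T<:AbstractMatrix{<:AbstractFloat}}
    m::T
end

# The default constructor converts the argument to the requested container
# type, so one definition serves dense and sparse alike.
dense_mm  = Wrapper{Matrix{Float64}}(ones(2, 2))
sparse_mm = Wrapper{SparseMatrixCSC{Float64,Int}}(ones(2, 2))
```

Calling `Wrapper{SparseMatrixCSC{Float64,Int}}(ones(2, 2))` stores the field as a sparse matrix because construction goes through `convert` on the field type.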

else
a = convert(Array{Float64}, trm[1])
Author (@GordStephen, Aug 21, 2016)

As far as I can tell, the conversions here (and just above) are redundant since elements of trm are always created by modelmat_cols, which already handles this?

## make sure the levels of the contrast matrix and the categorical data
## are the same by constructing a re-indexing vector. Indexing into
## reindex with v.refs will give the corresponding row number of the
## contrast matrix
reindex = [findfirst(contrast.levels, l) for l in levels(v)]
return contrast.matrix[reindex[v.refs], :]
contrastmatrix = convert(T, contrast.matrix)
return contrastmatrix[reindex[v.refs], :]
Author

This array creation can be extremely slow for sparse T and large datasets... but I'm not sure if there's a faster way to do it without creating dense columns first?

Member

No idea. Why is it slow? Indexing rows shouldn't be a problem for sparse matrices AFAIK.

Author

I'm not an expert on sparse matrix indexing, but it seems to spend a lot of time sorting... Truncated profile output from a million-row reference vector and 5-column contrast matrix:

 3594 ./event.jl:68; (::Base.REPL.##3#4{Base.REPL.REPLBackend})()
 3594 ./REPL.jl:95; macro expansion
  3594 ./REPL.jl:64; eval_user_input(::Any, ::Base.REPL.REPLBackend)
   3594 ./boot.jl:234; eval(::Module, ::Any)
    3594 ./<missing>:?; anonymous
     3594 ./profile.jl:16; macro expansion;
      3594 ./sparse/sparsematrix.jl:2099; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
       4    ./sparse/sparsematrix.jl:2437; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
        3 ./reduce.jl:371; extrema(::Array{Int64,1})
        1 ./reduce.jl:372; extrema(::Array{Int64,1})
       3590 ./sparse/sparsematrix.jl:2448; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
        3547 ./sparse/sparsematrix.jl:2422; getindex_general(::SparseMatrixCSC{Float64,Int64}, ::Array{In...
         1    ./sort.jl:451; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
         4    ./sort.jl:452; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
         3542 ./sort.jl:454; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
          3542 ./sort.jl:404; sort!(::Array{Int64,1}, ::Base.Sort.QuickSortAlg, ::Base.Ord...

Member

Hmm... You could ask on the mailing list for advice about the best algorithm to do this for sparse matrices. I guess working column by column (for SparseMatrixCSC) would make more sense.
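To illustrate the column-by-column idea (a hedged sketch, not code from this PR; `expand_refs_columnwise` is a hypothetical helper): rather than indexing the converted sparse contrast matrix with a million-element row-index vector, which triggers the `sortperm` call dominating the profile above, the expanded columns can be assembled directly in COO form from the integer refs:

```julia
using SparseArrays

# Hypothetical sketch: build the expanded model-matrix columns directly
# from the refs, avoiding a large row-vector getindex on a SparseMatrixCSC.
function expand_refs_columnwise(refs::Vector{Int}, contrast::AbstractMatrix)
    n, k = length(refs), size(contrast, 2)
    I = Int[]; J = Int[]; V = Float64[]
    for j in 1:k, i in 1:n        # column-major order matches CSC storage
        v = contrast[refs[i], j]
        if v != 0
            push!(I, i); push!(J, j); push!(V, Float64(v))
        end
    end
    sparse(I, J, V, n, k)
end
```

On small inputs this matches the row-indexing result, `contrast[refs, :]`, entry for entry; whether it is actually faster at the million-row scale would need benchmarking.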

@GordStephen (Author)

Ok, this should be functionally complete now. I essentially just generalized the existing model matrix creation logic to arbitrary types beyond Matrix{Float64}, of which SparseMatrixCSC is one possible option. There may be a more type-specific approach that's faster... Not sure.

@GordStephen changed the title from "WIP: Sparse ModelMatrix support" to "RFC: Sparse ModelMatrix support" on Aug 21, 2016
@@ -48,8 +48,10 @@ type ModelFrame
contrasts::Dict{Symbol, ContrastsMatrix}
end

type ModelMatrix{T <: @compat(Union{Float32, Float64})}
m::Matrix{T}
typealias ModelMatrixContainer{T<:AbstractFloat} AbstractMatrix{T}
Member

I think it would be clearer if it was called AbstractFloatMatrix or something like that.

else
a = convert(Array{Float64}, trm[1])
b = expandcols(trm[2 : end])
a, b = trm[1], expandcols(trm[2 : end])
Member

Keep the assignments on two lines.

"""
modelmat_cols(v::PooledDataVector, contrast::ContrastsMatrix)
modelmat_cols(T::Type{AbstractFloatMatrix}, v::PooledDataVector, contrast::ContrastsMatrix)
Member

Signature is incorrect, follow the actual one from the code.

@nalimilan (Member)

LGTM apart from the details I just commented. The test failures are due to RDA. Anything else to do before merging? Could you also test that MixedModels tests pass?

@GordStephen (Author)

MixedModels tests pass locally on 0.4 and 0.5 - GLM tests fail on 0.4 on both master and this PR (with the same error... not anything to do with DataFrames as far as I can tell). GLM tests pass on v0.5.

I think that's it then... I'll look into the sparse indexing performance. For now waiting a bit during large sparse model matrix creation is certainly preferable to an OutOfMemoryError.

@@ -352,7 +352,7 @@ modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::DataVector) = convert(T, res
modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::Vector) = convert(T, reshape(v, length(v), 1))

"""
modelmat_cols(T::Type{AbstractFloatMatrix}, v::PooledDataVector, contrast::ContrastsMatrix)
modelmat_cols(::Type{T}, v::PooledDataVector, contrast::ContrastsMatrix)
Member

You should also mention the restriction T<:AbstractFloatMatrix as a type parameter.

## make sure the levels of the contrast matrix and the categorical data
## are the same by constructing a re-indexing vector. Indexing into
## reindex with v.refs will give the corresponding row number of the
## contrast matrix
reindex = [findfirst(contrast.levels, l) for l in levels(v)]
return contrast.matrix[reindex[v.refs], :]
contrastmatrix = convert(T, contrast.matrix)
return contrastmatrix[reindex[v.refs], :]
end
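A small worked example of the reindexing step above, with purely illustrative data (modern Julia spells the two-argument `findfirst` in the diff as `findfirst(==(l), levels)`):

```julia
# Illustrative data: v has levels ["a","b","c"], while the contrast
# matrix rows happen to be ordered ["b","c","a"].
contrast_levels = ["b", "c", "a"]
contrast_matrix = [1.0 0.0;    # row for "b"
                   0.0 1.0;    # row for "c"
                   0.0 0.0]    # row for "a" (base level)
v_levels = ["a", "b", "c"]
refs = [1, 3, 2, 1]            # v.refs: integer codes into v_levels

# reindex[i] is the contrast-matrix row for the i-th level of v, so
# indexing reindex with v.refs yields contrast-matrix row numbers.
reindex = [findfirst(==(l), contrast_levels) for l in v_levels]
cols = contrast_matrix[reindex[refs], :]
```

Here `reindex` comes out as `[3, 1, 2]`, and `cols` has one contrast-matrix row per observation, in observation order.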

"""
expandcols(trm::Vector)
Member

This signature should also be updated to mention the restriction on the element type.

@GordStephen (Author)

Ok, docstrings updated.

@nalimilan merged commit d853418 into JuliaData:master on Aug 26, 2016
@nalimilan (Member)

Thanks!

GordStephen pushed a commit to GordStephen/DataFrames.jl that referenced this pull request Sep 13, 2016
Parametrize ModelMatrix container type.
maximerischard pushed a commit to maximerischard/DataFrames.jl that referenced this pull request Sep 28, 2016
Parametrize ModelMatrix container type.
@jeffwong commented Jul 5, 2017

In case anyone is googling and cannot find the right way to call this functionality, you can use

X = ModelMatrix{SparseMatrixCSC{Float64, Integer}}(ModelFrame(@formula(y ~ x), df)).m

Note that SparseMatrixCSC takes two type parameters, while Matrix takes only one, e.g. Matrix{Float64}.

@nalimilan (Member)

Or rather X = ModelMatrix{SparseMatrixCSC{Float64, Int}}(ModelFrame(@formula(y ~ x), df)).m, since the abstract Integer will make the code considerably slower. It would make sense to allow skipping the second type parameter and defaulting to Int, just like sparse does.
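The two points here are checkable directly in the REPL: `sparse` already defaults its index type to `Int`, and `Int` is a concrete type while `Integer` is abstract, so an `Integer` index type forces boxed storage and dynamic dispatch (a small illustrative check, not code from the PR):

```julia
using SparseArrays

# sparse() defaults its index type to Int -- the behavior suggested
# above for the ModelMatrix type parameter.
S = sparse([1, 2], [1, 2], [1.0, 2.0])

# Int is concrete; Integer is abstract, so SparseMatrixCSC{Float64,Integer}
# would have to box its indices, defeating type inference.
@assert S isa SparseMatrixCSC{Float64,Int}
@assert isconcretetype(Int) && !isconcretetype(Integer)
```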

@GordStephen deleted the gs/sparse-model-matrix branch on July 5, 2017