
RFC: Sparse ModelMatrix support #1040

Merged — 11 commits merged into JuliaData:master on Aug 26, 2016

Conversation

@GordStephen

As discussed in #614 - coming back to this now that #870 and #1017 have been merged.

@@ -48,8 +48,8 @@ type ModelFrame
contrasts::Dict{Symbol, ContrastsMatrix}
end

type ModelMatrix{T <: @compat(Union{Float32, Float64})}
m::Matrix{T}
type ModelMatrix{T <: @compat(Union{Matrix{Float32}, Matrix{Float64}, SparseMatrixCSC{Float32,Int}, SparseMatrixCSC{Float64,Int}})}
Member

We could drop the @compat here since the package no longer supports Julia 0.3.

end

factors = terms.factors

## Map eval. term name + redundancy bool to cached model matrix columns
eterm_cols = @compat Dict{Tuple{Symbol,Bool}, Array{Float64}}()
eterm_cols = @compat Dict{Tuple{Symbol,Bool}, T}()
Author

I couldn't find any reason not to restrict the Array dimension here. Did I miss something?

Member

The only issue I can think of is the case where a single-column term would give a column vector instead of a one-column matrix. But conversion will probably happen automatically, and tests should catch this. Have you run the tests of GLM.jl on the modified package?

Author (@GordStephen, Aug 20, 2016)

From what I could tell modelmat_cols handles the vector->matrix conversion. The GLM.jl tests pass as well, so hopefully this is OK.

@GordStephen (Author) commented Aug 20, 2016

Should the new constructor be ModelMatrix{Matrix{Float64}}(mf) instead of ModelMatrix(Matrix{Float64}, mf)? The latter isn't type-stable (right?), but I wasn't sure how to implement the former.

@nalimilan (Member)

Both are type-stable, but the former is probably more idiomatic. You can implement it via the (::Type{ModelMatrix{T}}){T<:AbstractMatrix{...}}(...) = ... syntax.
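The `(::Type{ModelMatrix{T}}){T<:...}(...)` form above is the Julia 0.5-era call-overloading syntax; in current Julia the same pattern is written with a `where` clause, and the default struct constructor already converts its argument to the field type. A minimal toy sketch of the idea (`Wrapper` is a hypothetical stand-in for `ModelMatrix`, not code from this PR):

```julia
using SparseArrays

# Hypothetical stand-in for ModelMatrix: the container type is itself the
# type parameter, so callers write Wrapper{SomeMatrixType}(x).
struct Wrapper{T<:AbstractMatrix{<:AbstractFloat}}
    m::T
end

# The default constructor converts the argument to the requested container
# type, so one definition serves dense and sparse alike.
dense_mm  = Wrapper{Matrix{Float64}}(ones(2, 2))
sparse_mm = Wrapper{SparseMatrixCSC{Float64,Int}}(ones(2, 2))
```

Calling `Wrapper{SparseMatrixCSC{Float64,Int}}(ones(2, 2))` stores the field as a sparse matrix because construction goes through `convert` on the field type.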

else
a = convert(Array{Float64}, trm[1])
Author (@GordStephen, Aug 21, 2016)

As far as I can tell, the conversions here (and just above) are redundant since elements of trm are always created by modelmat_cols, which already handles this?

## make sure the levels of the contrast matrix and the categorical data
## are the same by constructing a re-indexing vector. Indexing into
## reindex with v.refs will give the corresponding row number of the
## contrast matrix
reindex = [findfirst(contrast.levels, l) for l in levels(v)]
return contrast.matrix[reindex[v.refs], :]
contrastmatrix = convert(T, contrast.matrix)
return contrastmatrix[reindex[v.refs], :]
Author

This array creation can be extremely slow for sparse T and large datasets... but I'm not sure if there's a faster way to do it without creating dense columns first?

Member

No idea. Why is it slow? Indexing rows shouldn't be a problem for sparse matrices AFAIK.

Author

I'm not an expert on sparse matrix indexing, but it seems to spend a lot of time sorting... Truncated profile output from a million-row reference vector and 5-column contrast matrix:

 3594 ./event.jl:68; (::Base.REPL.##3#4{Base.REPL.REPLBackend})()
 3594 ./REPL.jl:95; macro expansion
  3594 ./REPL.jl:64; eval_user_input(::Any, ::Base.REPL.REPLBackend)
   3594 ./boot.jl:234; eval(::Module, ::Any)
    3594 ./<missing>:?; anonymous
     3594 ./profile.jl:16; macro expansion;
      3594 ./sparse/sparsematrix.jl:2099; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
       4    ./sparse/sparsematrix.jl:2437; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
        3 ./reduce.jl:371; extrema(::Array{Int64,1})
        1 ./reduce.jl:372; extrema(::Array{Int64,1})
       3590 ./sparse/sparsematrix.jl:2448; getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{Int64,1}, :...
        3547 ./sparse/sparsematrix.jl:2422; getindex_general(::SparseMatrixCSC{Float64,Int64}, ::Array{In...
         1    ./sort.jl:451; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
         4    ./sort.jl:452; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
         3542 ./sort.jl:454; #sortperm#11(::Base.Sort.QuickSortAlg, ::Function, ::Function...
          3542 ./sort.jl:404; sort!(::Array{Int64,1}, ::Base.Sort.QuickSortAlg, ::Base.Ord...

Member

Hmm... You could ask on the mailing list for advice about the best algorithm to do this for sparse matrices. I guess working column by column (for SparseMatrixCSC) would make more sense.
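To illustrate the column-by-column idea (a hedged sketch, not code from this PR; `expand_refs_columnwise` is a hypothetical helper): rather than indexing the converted sparse contrast matrix with a million-element row-index vector, which triggers the `sortperm` call dominating the profile above, the expanded columns can be assembled directly in COO form from the integer refs:

```julia
using SparseArrays

# Hypothetical sketch: build the expanded model-matrix columns directly
# from the refs, avoiding a large row-vector getindex on a SparseMatrixCSC.
function expand_refs_columnwise(refs::Vector{Int}, contrast::AbstractMatrix)
    n, k = length(refs), size(contrast, 2)
    I = Int[]; J = Int[]; V = Float64[]
    for j in 1:k, i in 1:n        # column-major order matches CSC storage
        v = contrast[refs[i], j]
        if v != 0
            push!(I, i); push!(J, j); push!(V, Float64(v))
        end
    end
    sparse(I, J, V, n, k)
end
```

On small inputs this matches the row-indexing result, `contrast[refs, :]`, entry for entry; whether it is actually faster at the million-row scale would need benchmarking.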

@GordStephen (Author)

Ok, this should be functionally complete now. I essentially just generalized the existing model matrix creation logic to arbitrary types beyond Matrix{Float64}, of which SparseMatrixCSC is one possible option. There may be a more type-specific approach that's faster... Not sure.

@GordStephen changed the title from "WIP: Sparse ModelMatrix support" to "RFC: Sparse ModelMatrix support" on Aug 21, 2016
@@ -48,8 +48,10 @@ type ModelFrame
contrasts::Dict{Symbol, ContrastsMatrix}
end

type ModelMatrix{T <: @compat(Union{Float32, Float64})}
m::Matrix{T}
typealias ModelMatrixContainer{T<:AbstractFloat} AbstractMatrix{T}
Member

I think it would be clearer if it was called AbstractFloatMatrix or something like that.

else
a = convert(Array{Float64}, trm[1])
b = expandcols(trm[2 : end])
a, b = trm[1], expandcols(trm[2 : end])
Member

Keep the assignments on two lines.

"""
modelmat_cols(v::PooledDataVector, contrast::ContrastsMatrix)
modelmat_cols(T::Type{AbstractFloatMatrix}, v::PooledDataVector, contrast::ContrastsMatrix)
Member

Signature is incorrect, follow the actual one from the code.

@nalimilan (Member)

LGTM apart from the details I just commented. The test failures are due to RDA. Anything else to do before merging? Could you also test that MixedModels tests pass?

@GordStephen (Author)

MixedModels tests pass locally on 0.4 and 0.5 - GLM tests fail on 0.4 on both master and this PR (with the same error... not anything to do with DataFrames as far as I can tell). GLM tests pass on v0.5.

I think that's it then... I'll look into the sparse indexing performance. For now waiting a bit during large sparse model matrix creation is certainly preferable to an OutOfMemoryError.

@@ -352,7 +352,7 @@ modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::DataVector) = convert(T, res
modelmat_cols{T<:AbstractFloatMatrix}(::Type{T}, v::Vector) = convert(T, reshape(v, length(v), 1))

"""
modelmat_cols(T::Type{AbstractFloatMatrix}, v::PooledDataVector, contrast::ContrastsMatrix)
modelmat_cols(::Type{T}, v::PooledDataVector, contrast::ContrastsMatrix)
Member

You should also mention the restriction T<:AbstractFloatMatrix as a type parameter.

## make sure the levels of the contrast matrix and the categorical data
## are the same by constructing a re-indexing vector. Indexing into
## reindex with v.refs will give the corresponding row number of the
## contrast matrix
reindex = [findfirst(contrast.levels, l) for l in levels(v)]
return contrast.matrix[reindex[v.refs], :]
contrastmatrix = convert(T, contrast.matrix)
return contrastmatrix[reindex[v.refs], :]
end
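A small worked example of the reindexing step above, with purely illustrative data (modern Julia spells the two-argument `findfirst` in the diff as `findfirst(==(l), levels)`):

```julia
# Illustrative data: v has levels ["a","b","c"], while the contrast
# matrix rows happen to be ordered ["b","c","a"].
contrast_levels = ["b", "c", "a"]
contrast_matrix = [1.0 0.0;    # row for "b"
                   0.0 1.0;    # row for "c"
                   0.0 0.0]    # row for "a" (base level)
v_levels = ["a", "b", "c"]
refs = [1, 3, 2, 1]            # v.refs: integer codes into v_levels

# reindex[i] is the contrast-matrix row for the i-th level of v, so
# indexing reindex with v.refs yields contrast-matrix row numbers.
reindex = [findfirst(==(l), contrast_levels) for l in v_levels]
cols = contrast_matrix[reindex[refs], :]
```

Here `reindex` comes out as `[3, 1, 2]`, and `cols` has one contrast-matrix row per observation, in observation order.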

"""
expandcols(trm::Vector)
Member

This signature should also be updated to mention the restriction on the element type.

@GordStephen (Author)

Ok, docstrings updated.

@nalimilan merged commit d853418 into JuliaData:master on Aug 26, 2016
@nalimilan (Member)

Thanks!

GordStephen pushed a commit to GordStephen/DataFrames.jl that referenced this pull request Sep 13, 2016
Parametrize ModelMatrix container type.
maximerischard pushed a commit to maximerischard/DataFrames.jl that referenced this pull request Sep 28, 2016
Parametrize ModelMatrix container type.
@jeffwong commented Jul 5, 2017

In case anyone is googling and cannot find the right way to call this functionality, you can use

X = ModelMatrix{SparseMatrixCSC{Float64, Integer}}(ModelFrame(@formula(y ~ x), df)).m

Note that SparseMatrixCSC takes two type parameters, while Matrix takes only one, e.g. Matrix{Float64}.

@nalimilan (Member)

Or rather X = ModelMatrix{SparseMatrixCSC{Float64, Int}}(ModelFrame(@formula(y ~ x), df)).m, since the abstract Integer will make the code considerably slower. It would make sense to allow skipping the second type parameter and defaulting to Int, just like sparse does.
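The two points here are checkable directly in the REPL: `sparse` already defaults its index type to `Int`, and `Int` is a concrete type while `Integer` is abstract, so an `Integer` index type forces boxed storage and dynamic dispatch (a small illustrative check, not code from the PR):

```julia
using SparseArrays

# sparse() defaults its index type to Int -- the behavior suggested
# above for the ModelMatrix type parameter.
S = sparse([1, 2], [1, 2], [1.0, 2.0])

# Int is concrete; Integer is abstract, so SparseMatrixCSC{Float64,Integer}
# would have to box its indices, defeating type inference.
@assert S isa SparseMatrixCSC{Float64,Int}
@assert isconcretetype(Int) && !isconcretetype(Integer)
```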

@GordStephen deleted the gs/sparse-model-matrix branch on July 5, 2017