Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse Model.Matrix #614

Closed
gragusa opened this issue Jun 6, 2014 · 9 comments
Closed

Sparse Model.Matrix #614

gragusa opened this issue Jun 6, 2014 · 9 comments

Comments

@gragusa
Copy link

gragusa commented Jun 6, 2014

Are there any plans to make Model.Matrix return a Sparse matrix? Something similar to sparse.model.matrix implemented in the R package Matrix written by @dmbates. I wanted to give it a try, but I wanted to know first whether there are issues blocking such an attempt.

@johnmyleswhite
Copy link
Contributor

I'd love to provide methods for doing this. It should be a separate function to maintain type-stability of the output.

The only thing that could be seen as blocking this is the generally weak state of the Formula interface these days. Since we're no longer eval-ing things, we need a formal definition of what Formula means.

@GordStephen
Copy link

Has this situation evolved in the past few years? This is something I'd be very interested in and would be happy to help contribute to.

@nalimilan
Copy link
Member

I don't think anybody has worked on it, but we'd be happy to help you if you want to make a pull request.

@nalimilan
Copy link
Member

As for the implementation/interface, I guess we should just make the T parameter of ModelMatrix include the container type and not only the element type. Then a default constructor would use Matrix, but you could also specify SparseMatrixCSC.

@GordStephen
Copy link

Ok, thanks, I'll take a look and come back with any questions.

@GordStephen
Copy link

OK, so if I'm understanding things properly, what would need to be done is:

  1. Provide a sparse equivalent to cols - shouldn't be difficult since the existing PooledDataArray case just calls contr_treatment, which already has a sparse option, and the other cases are trivial
  2. Provide a sparse equivalent to expandcols. Also should be fairly straightforward.

With that, hcat should abstract away the rest. Does that seem sane?

@nalimilan
Copy link
Member

Actually, I would be a shame to base your work on the current code while there's a large rework in the tubes at #870. Better base your work on that PR. You should look at the contrasts_matrix functions, and add version taking a type parameter (see my previous comment).

Provide a sparse equivalent to cols - shouldn't be difficult since the existing PooledDataArray case just calls contr_treatment, which already has a sparse option, and the other cases are trivial

I think you can keep cols (renamed to modelmat_cols in the PR) as-is, and maybe just add a method col(x::AbstractSparseVector) = convert(SparseVector{Float64}, x): if the input (non-categorical) data isn't stored in a sparse vector, there's probably no gain to expect from that conversion. Anyway, I guess most of the gains should come from categorical variables, right?

Provide a sparse equivalent to expandcols. Also should be fairly straightforward.

I guess so. Though the current code is quite inefficient even for dense matrices, see https://github.com/JuliaStats/DataFrames.jl/pull/870/files#r62338400. Since using sparse matrices is all about performance and memory use, you may want to give more attention to the way you concatenate columns to create the matrix.

@GordStephen
Copy link

Thanks, that's exactly the kind of feedback I was looking for! I'll check out #870.

@nalimilan
Copy link
Member

@GordStephen Should this be closed now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants