Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] SparseVector support + tests #8718

Closed
wants to merge 1 commit into from
Closed

[WIP] SparseVector support + tests #8718

wants to merge 1 commit into from

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Oct 18, 2014

Addresses #6479

Definitely not ready yet, but there's been some chatter, so I figured I'd push what I've been working on. I'm kind of adding to it here and there (since there's a lot to do), but if anyone else wants to help push this along, I'd appreciate the help.

Also a question, I'm not sure I quite understand the use of the two type parameters for SparseMatrixCSC{Tv, Ti}. Tv is obvious, but what does Ti really do as opposed to just using a consistent Int32 or Int64 vector. Is it really a space conservation option? I'm sure there's a good reason that I'm not aware of.

@quinnj
Copy link
Member Author

quinnj commented Oct 18, 2014

cc: @tanmaykm, @ViralBShah, @tkelman, and anyone else interested.

@ViralBShah
Copy link
Member

I really think we should be doing this as part of the general N-d COO support, which is not too much more effort. Given the interest, perhaps we can all do this in short time.

@ViralBShah ViralBShah added the sparse Sparse arrays label Oct 18, 2014
@ViralBShah
Copy link
Member

Ti is the index type. For sparse matrices with a lot of fill, 32-bit indices use much lesser memory. In sparse solvers, as they fill up during the solve, 32-bit indices should have lesser pressure on the memory buses and such. Things vary by architecture, but for these reasons, solvers offer 32-bit and 64-bit index types, and so does Julia.

@quinnj
Copy link
Member Author

quinnj commented Oct 18, 2014

Thanks @ViralBShah, that's along the lines of what I though, but wasn't sure if there was something else going on.

What would the type internals look like for COO? A quick google search didn't seem to turn up how things are actually stored apart from somehow using tuples.

m::Int # Number of rows
rowval::Vector{Ti} # Indices of nonzeros
nzval::Vector{Tv} # Nonzero values
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems strange to talk about the number of rows in a 1-d vector. I'd rename rowval to nzindex or something similarly generic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually left it like this to make it really easy to port code from sparsematrix.jl (just deleting the J vector references). I think once things are ported over from there, it could easily be renamed to something more appropriate with a find and replace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about nzind/nzval, or just ind/val?

@ViralBShah
Copy link
Member

@quinnj To start with, I would have N-vectors for the n dimensional sparse representation. One can then sort them based on operations to perform. Here's an example and related papers:

http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.5.html

@quinnj
Copy link
Member Author

quinnj commented Oct 20, 2014

I can't seem to access any of the papers (behind paywalls). I would argue it may be worth having a separate SparseVector type as the code mirrors the sparse matrix code very cleanly and it could be done in a week or two with tests/docs as well.

@tkelman
Copy link
Contributor

tkelman commented Oct 21, 2014

cc @jasax hopefully this could be used to fix the broadcasting behavior in JuliaLang/LinearAlgebra.jl#142

@ViralBShah
Copy link
Member

It is certainly useful to have a separate SparseVector type. Let's get this PR in shape.

@quinnj
Copy link
Member Author

quinnj commented Nov 12, 2014

I'll continue plugging away on this then. It may be slow going until December though when my time frees up.

@ViralBShah
Copy link
Member

@quinnj Do you think we can get this in place for 0.4?

@ViralBShah ViralBShah added this to the 0.4 milestone Feb 15, 2015
@ViralBShah
Copy link
Member

Marking this 0.4 for now.

@quinnj
Copy link
Member Author

quinnj commented Feb 15, 2015

Yeah I'll start tackling this again. Definitely good to have for 0.4

@andreasnoack
Copy link
Member

Sorry being a bit negative so late in this discussion, but I think we should consider this one in relation to the discussion in JuliaLang/LinearAlgebra.jl#42 and maybe also #10064.

A partial conclusion from the first discussion is that as long as n<3 then there is really no need for a vectors since a one columns or row matrix would suffice. To my knowledge the CSC format doesn't generalize to larger dimensions, so I think we should think carefully if a SparseVector makes thing simpler or more complicated.

The only argument for a SparseVector I could find in #6479 was to have similar behavior for columns indexing of a SparseMatrixCSC and other matrix types. I think we should come up with more arguments for a SparseVector. Indexing is not really a key feature of CSC matrices and #10064 also questions if SparseMatrixCSC should be considered a AbstractMatrix.

I think it would be more natural to get sparse vector functionality through a COO type.

@tkelman
Copy link
Contributor

tkelman commented Feb 16, 2015

If we have and treat dense vectors as separate things from dense 1-column matrices, we should do the same for sparse. If we don't, there is a slight overhead in the sparse case since a 1-column SparseMatrixCSC needs to store 3 extra integers that a sparse Vector doesn't. Probably not a big deal. But conflating 1-column matrices with vectors is harmful and buggy if we only do it for sparse (see JuliaLang/LinearAlgebra.jl#142).

We could transition to the Matlab convention of doing all vectors as 1-column matrices even in the dense case, but I don't think that was the direction JuliaLang/LinearAlgebra.jl#42 or #10064 were going, was it? That would also have significant consequences for broadcast and elsewhere.

I think it would be more natural to get sparse vector functionality through a COO type.

This is true but would be a more complicated implementation. I don't see any reason we couldn't deprecate a non-generalizable SparseVector type in the future as the 1-dimensional special case of COO. That requires having an arbitrary-dimensional COO implementation first, though.

@ViralBShah
Copy link
Member

I do think that irrespective of the storage, it is confusing for users if sparse and dense cases behave differently, when it comes to vectors. I suspect there will be other use cases for sparse vectors, and avoiding the extra loop over colptr will help with performance.

I think the eventual best solution is to have a 1-vector as a special case of COO, but it is probably ok to do a special case implementation until we get there.

@andreasnoack
Copy link
Member

Think that JuliaLang/LinearAlgebra.jl#142 is mainly a problem with the broadcasting behavior of .* for sparse matrices.

The point is that by definition CSC cannot be similar to our dense arrays because it doesn't generalize. Adding a SparseVector type requires work to ensure interoperability and it doesn't really add any functionality.

@tkelman
Copy link
Contributor

tkelman commented Feb 16, 2015

CSC doesn't generalize beyond dimension 2, but that's not a very good reason for dimensions 1 and 2 to be inconsistent between sparse and dense. We need to make a decision, globally, whether or not 1-d vectors are the same thing as 1-column matrices. If they aren't, there needs to be a way to tell them apart, so the existing implementation of sparse vectors is broken and it would be good to replace it. Or remove the current sparsevec function - which people do use, and have reported issues due to its implementation being inconsistent with our handling of dense vectors.

@quinnj
Copy link
Member Author

quinnj commented Feb 19, 2015

I (obviously) think this is worth the effort. My use cases came from wanting broader support in the Distances.jl and Clustering.jl packages for sparse, but running into design issues because of the lack of a 1st class sparse vector type. I think the general feeling is that this is good, so I'll pick it back up and push forward, but if there are serious objections, it'd be nice to decide on a green light/red light before I go further.

@ViralBShah
Copy link
Member

I personally feel it is worth seriously exploring, and if we can do it as part of COO, that would be even better. It will need changes across the sparse code, which will be a bit painful, but not terrible.

@andreasnoack
Copy link
Member

I'm still not sure this is the right solution here, but I'm also a democrat and it appears that I'm the only one against, so please go ahead.

I'm just wondering about the design here. If this is a first step towards COO then the storage should probably be something like an index vector Vector{NTuple{1,Ti}} and a vector of values Vector{Tv}.

@IainNZ
Copy link
Member

IainNZ commented Feb 19, 2015

+1 against. I think it should be in a package first, I feel like its not a good idea to be adding code to Base without it having the tires kicked. There are some funny things going on with field names that I think are a symptom of something deeper being wrong too. I don't see why a whole new system of sparse types (column-, row-, coordinate, vectors, tensors, whatever) can't live in a package for a while. Those who need that stuff will use it and improve. My only hesitation against critiqueing this is that I'm not going to be doing that work, because I don't need it myself. Thats why I'd like someone who needs it to use it - but outside Base first.

@StefanKarpinski
Copy link
Member

Please let's have it in a package first. The sparse matrix code is only just starting to feel less half-baked.

@johnmyleswhite
Copy link
Member

Can we make getting reliable sparse matrix support a GSoC project?

@ViralBShah
Copy link
Member

@andreasnoack I originally had the same view about not having sparse vectors. However, it does get annoying when dense and sparse codes behave differently. For now, we should probably remove the sparsevec stuff, and have a pure 2d implementation in base, which we should go ahead and make fully robust and reliable. Then, we would have a COO package, with all operations on COO implemented. A SparseVector type can be bolted later as a special case into base when all this is ready.

@lindahua
Copy link
Contributor

+100 for a sparse vector in Julia base.

From an application standpoint, sparse vectors have been widely useful in AI-related fields, e.g. stochastic optimization, machine learning, document analysis, etc. The lack of sparse vector support makes it particularly difficult to develop such applications.

As a stop gap, I pull a package SparseVectors.jl. However, I strongly agree that we need sparse vectors in Julia base, and allow getting views of columns of SparseMatrixCSC as sparse vectors (or sparse vector views).

@ViralBShah
Copy link
Member

Why not just bring in that SparseVectors package into Base?

@lindahua
Copy link
Contributor

It seems to me that people are still debating whether we should have this in a package or in the Base. If the consensus has been reached, we can proceed with this PR.

Also, that package can work with Julia 0.3.

@lindahua
Copy link
Contributor

Note that even if this PR is merged, the package is still useful, as we want to have many machine learning packages to work with Julia 0.3, and the SparseVectors.jl package can provide such support.

That being said, if this PR is merged, the package can be rewritten as

if VERSION < xxxx # the version that the PR is merged
    include("x.jl")
    include("xx.jl")
end

@ViralBShah
Copy link
Member

I think there is reasonable consensus that we should do this. Do they already operate with sparse matrices (like matvecs and such)? Perhaps some things like views may still need to stay in the package, until we have a coherent framework in Base, which also works for sparse matrices.

@ViralBShah ViralBShah modified the milestones: 0.4.0, 0.4.x May 17, 2015
L = 1/L

rows = Array(Int, 0)
for j in randsubseq(1:n, coldensity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is obviously for multiple-column cases. For a single vector, it can be much simpler.

@ViralBShah
Copy link
Member

@lindahua IIUC, parts of your package would replace this PR. Or, would you merge some of the stuff here, and parts of your package?

@lindahua
Copy link
Contributor

If people think that it is good to have this PR, I can go ahead to fix issues in this PR, add tests, and make it mergeable.

@quinnj
Copy link
Member Author

quinnj commented May 17, 2015

@lindahua, yes, please go ahead with a PR for your SparseVectors.jl package. This has unfortunately been a couple notches down on my TODO since I first needed the functionality and I haven't gotten around to really finishing it.

@quinnj quinnj closed this May 17, 2015
@lindahua
Copy link
Contributor

I will make a new PR later.

@StefanKarpinski
Copy link
Member

If we're going to move any packages into Base I think it makes sense to insist that they have 100% test coverage first. I know Base itself isn't up to that standard yet, but we are actively working towards it.

@mlubin
Copy link
Member

mlubin commented May 17, 2015

Is there really a need to rush this into base? Now that we have a usable package (thanks @lindahua!), what's the harm of leaving it there for a bit so it can mature and develop according to people's needs?

With sparse data structures, the general rule seems to be that everyone wants something different. For example, in JuMP, we use indexed vectors which use dense linear storage together with a list of nonzero indices. These feature O(1) getindex and setindex, which is useful for us. On the other hand, @lindahua's sparse vector uses packed storage and sorted indices. Someone else might want to use packed storage and unsorted indices. Given that it's very hard to change a name that's exported from Base, it seems possibly premature for this particular sparse vector implementation to have the name SparseVector. At least sparse matrices include the data structure in the name.

@tkelman
Copy link
Contributor

tkelman commented May 17, 2015

The problem is that we're currently faking sparse vectors in Base via 1-column CSC matrices, and we need to stop doing that because it's buggy and wrong.

@tkelman
Copy link
Contributor

tkelman commented May 18, 2015

Given that Dahua's new package is quite good already, and being outside of Base does allow it to experiment and add features considerably more quickly, it might actually be best to remove sparsevec from base at this time. It's been said that we shouldn't remove any more code until 0.5-dev which I largely agree with, but Miles makes a good point too.

@lindahua
Copy link
Contributor

I am ok with either way, leaving this stuff in the package or moving it to Base.

If we decide to have SparseVector in a package for a while, then we should definitely remove sparsevec and vec(::SparseMatrixCSC), etc. As @tkelman said, these functions are supposed to return a vector, that is, an instance of AbstractVector, instead they return a matrix.

Corrected versions of these functions can be provided in SparseVectors.jl after they are removed from the Base.

@tkelman
Copy link
Contributor

tkelman commented May 18, 2015

Corrected versions of these functions can be provided in SparseVectors.jl after they are removed from the Base.

Could they be added to the package before deleting them from base? Perhaps not exported from the package to avoid name conflicts.

@lindahua
Copy link
Contributor

Sounds like a good idea.

@ViralBShah
Copy link
Member

This PR is not about adding a new package to base, but SparseVector functionality that has been developed separately in a package. Doing this now, will help avoid a lot of pain going forward - otherwise 0.4 will have the fake sparse vectors too.

@ViralBShah
Copy link
Member

@mlubin We could call this SparseVectorCompressed or something else so that the generic name is not taken. But, the general design of compressed and sorted indices is in line with the current sparse matrix data structure. This is exactly what this PR has been discussing for a really long time, and we finally have an implementation!

I really would like to get this into 0.4, since as @tkelman pointed out, not doing anything means continuing to do what we are doing currently.

@StefanKarpinski I do agree about 100% test coverage.

@tkelman
Copy link
Contributor

tkelman commented May 18, 2015

Given that we have SparseMatrixCSC in base and that's not going anywhere before 0.4, we should make a decision to either deprecate or fix the broken sparse vectors that we have now. We should make that decision and do one or the other relatively soon.

Fixing them requires introducing a 1-d sparse vector type. There are many possible representations for a 1-d sparse vector type that someone might want to use, but for the methods related to taking a slice or view from a column of a SparseMatrixCSC, there's a pretty obvous natural choice. It doesn't have to be called exactly SparseVector, and it might not even need to be exported.

(and I apparently typed this at the same time as @ViralBShah made almost exactly the same point, and @lindahua too in #11324)

@lindahua lindahua mentioned this pull request May 18, 2015
@lindahua
Copy link
Contributor

#11324 makes sure that we will make a decision before 0.4.

@tkelman tkelman deleted the jq/sparsevec branch October 4, 2015 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sparse Sparse arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants