[WIP] SparseVector support + tests #8718

quinnj · 2014-10-18T01:25:01Z

Addresses #6479

Definitely not ready yet, but there's been some chatter, so I figured I'd push what I've been working on. I'm kind of adding to it here and there (since there's a lot to do), but if anyone else wants to help push this along, I'd appreciate the help.

Also a question, I'm not sure I quite understand the use of the two type parameters for SparseMatrixCSC{Tv, Ti}. Tv is obvious, but what does Ti really do as opposed to just using a consistent Int32 or Int64 vector. Is it really a space conservation option? I'm sure there's a good reason that I'm not aware of.

quinnj · 2014-10-18T01:25:25Z

cc: @tanmaykm, @ViralBShah, @tkelman, and anyone else interested.

ViralBShah · 2014-10-18T11:27:56Z

I really think we should be doing this as part of the general N-d COO support, which is not too much more effort. Given the interest, perhaps we can all do this in short time.

ViralBShah · 2014-10-18T12:49:24Z

Ti is the index type. For sparse matrices with a lot of fill, 32-bit indices use much lesser memory. In sparse solvers, as they fill up during the solve, 32-bit indices should have lesser pressure on the memory buses and such. Things vary by architecture, but for these reasons, solvers offer 32-bit and 64-bit index types, and so does Julia.

quinnj · 2014-10-18T19:45:26Z

Thanks @ViralBShah, that's along the lines of what I though, but wasn't sure if there was something else going on.

What would the type internals look like for COO? A quick google search didn't seem to turn up how things are actually stored apart from somehow using tuples.

mlubin · 2014-10-19T03:02:10Z

base/sparse/sparsevector.jl

+    m::Int                  # Number of rows
+    rowval::Vector{Ti}      # Indices of nonzeros
+    nzval::Vector{Tv}       # Nonzero values
+end


Seems strange to talk about the number of rows in a 1-d vector. I'd rename rowval to nzindex or something similarly generic.

I actually left it like this to make it really easy to port code from sparsematrix.jl (just deleting the J vector references). I think once things are ported over from there, it could easily be renamed to something more appropriate with a find and replace.

How about nzind/nzval, or just ind/val?

ViralBShah · 2014-10-20T03:51:32Z

@quinnj To start with, I would have N-vectors for the n dimensional sparse representation. One can then sort them based on operations to perform. Here's an example and related papers:

http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.5.html

quinnj · 2014-10-20T15:05:03Z

I can't seem to access any of the papers (behind paywalls). I would argue it may be worth having a separate SparseVector type as the code mirrors the sparse matrix code very cleanly and it could be done in a week or two with tests/docs as well.

tkelman · 2014-10-21T13:09:47Z

cc @jasax hopefully this could be used to fix the broadcasting behavior in JuliaLang/LinearAlgebra.jl#142

ViralBShah · 2014-11-06T19:46:33Z

It is certainly useful to have a separate SparseVector type. Let's get this PR in shape.

quinnj · 2014-11-12T00:17:48Z

I'll continue plugging away on this then. It may be slow going until December though when my time frees up.

ViralBShah · 2015-02-14T11:13:13Z

@quinnj Do you think we can get this in place for 0.4?

ViralBShah · 2015-02-15T09:22:22Z

Marking this 0.4 for now.

quinnj · 2015-02-15T14:40:18Z

Yeah I'll start tackling this again. Definitely good to have for 0.4

andreasnoack · 2015-02-15T19:13:01Z

Sorry being a bit negative so late in this discussion, but I think we should consider this one in relation to the discussion in JuliaLang/LinearAlgebra.jl#42 and maybe also #10064.

A partial conclusion from the first discussion is that as long as n<3 then there is really no need for a vectors since a one columns or row matrix would suffice. To my knowledge the CSC format doesn't generalize to larger dimensions, so I think we should think carefully if a SparseVector makes thing simpler or more complicated.

The only argument for a SparseVector I could find in #6479 was to have similar behavior for columns indexing of a SparseMatrixCSC and other matrix types. I think we should come up with more arguments for a SparseVector. Indexing is not really a key feature of CSC matrices and #10064 also questions if SparseMatrixCSC should be considered a AbstractMatrix.

I think it would be more natural to get sparse vector functionality through a COO type.

tkelman · 2015-02-16T05:39:47Z

If we have and treat dense vectors as separate things from dense 1-column matrices, we should do the same for sparse. If we don't, there is a slight overhead in the sparse case since a 1-column SparseMatrixCSC needs to store 3 extra integers that a sparse Vector doesn't. Probably not a big deal. But conflating 1-column matrices with vectors is harmful and buggy if we only do it for sparse (see JuliaLang/LinearAlgebra.jl#142).

We could transition to the Matlab convention of doing all vectors as 1-column matrices even in the dense case, but I don't think that was the direction JuliaLang/LinearAlgebra.jl#42 or #10064 were going, was it? That would also have significant consequences for broadcast and elsewhere.

I think it would be more natural to get sparse vector functionality through a COO type.

This is true but would be a more complicated implementation. I don't see any reason we couldn't deprecate a non-generalizable SparseVector type in the future as the 1-dimensional special case of COO. That requires having an arbitrary-dimensional COO implementation first, though.

ViralBShah · 2015-02-16T06:02:43Z

I do think that irrespective of the storage, it is confusing for users if sparse and dense cases behave differently, when it comes to vectors. I suspect there will be other use cases for sparse vectors, and avoiding the extra loop over colptr will help with performance.

I think the eventual best solution is to have a 1-vector as a special case of COO, but it is probably ok to do a special case implementation until we get there.

andreasnoack · 2015-02-16T12:19:28Z

Think that JuliaLang/LinearAlgebra.jl#142 is mainly a problem with the broadcasting behavior of .* for sparse matrices.

The point is that by definition CSC cannot be similar to our dense arrays because it doesn't generalize. Adding a SparseVector type requires work to ensure interoperability and it doesn't really add any functionality.

tkelman · 2015-02-16T12:26:01Z

CSC doesn't generalize beyond dimension 2, but that's not a very good reason for dimensions 1 and 2 to be inconsistent between sparse and dense. We need to make a decision, globally, whether or not 1-d vectors are the same thing as 1-column matrices. If they aren't, there needs to be a way to tell them apart, so the existing implementation of sparse vectors is broken and it would be good to replace it. Or remove the current sparsevec function - which people do use, and have reported issues due to its implementation being inconsistent with our handling of dense vectors.

quinnj · 2015-02-19T14:09:07Z

I (obviously) think this is worth the effort. My use cases came from wanting broader support in the Distances.jl and Clustering.jl packages for sparse, but running into design issues because of the lack of a 1st class sparse vector type. I think the general feeling is that this is good, so I'll pick it back up and push forward, but if there are serious objections, it'd be nice to decide on a green light/red light before I go further.

ViralBShah · 2015-02-19T14:25:12Z

I personally feel it is worth seriously exploring, and if we can do it as part of COO, that would be even better. It will need changes across the sparse code, which will be a bit painful, but not terrible.

andreasnoack · 2015-02-19T14:39:46Z

I'm still not sure this is the right solution here, but I'm also a democrat and it appears that I'm the only one against, so please go ahead.

I'm just wondering about the design here. If this is a first step towards COO then the storage should probably be something like an index vector Vector{NTuple{1,Ti}} and a vector of values Vector{Tv}.

IainNZ · 2015-02-19T15:10:52Z

+1 against. I think it should be in a package first, I feel like its not a good idea to be adding code to Base without it having the tires kicked. There are some funny things going on with field names that I think are a symptom of something deeper being wrong too. I don't see why a whole new system of sparse types (column-, row-, coordinate, vectors, tensors, whatever) can't live in a package for a while. Those who need that stuff will use it and improve. My only hesitation against critiqueing this is that I'm not going to be doing that work, because I don't need it myself. Thats why I'd like someone who needs it to use it - but outside Base first.

StefanKarpinski · 2015-02-19T15:12:14Z

Please let's have it in a package first. The sparse matrix code is only just starting to feel less half-baked.

johnmyleswhite · 2015-02-19T15:14:45Z

Can we make getting reliable sparse matrix support a GSoC project?

ViralBShah · 2015-02-19T16:08:04Z

@andreasnoack I originally had the same view about not having sparse vectors. However, it does get annoying when dense and sparse codes behave differently. For now, we should probably remove the sparsevec stuff, and have a pure 2d implementation in base, which we should go ahead and make fully robust and reliable. Then, we would have a COO package, with all operations on COO implemented. A SparseVector type can be bolted later as a special case into base when all this is ready.

lindahua · 2015-05-17T09:19:44Z

+100 for a sparse vector in Julia base.

From an application standpoint, sparse vectors have been widely useful in AI-related fields, e.g. stochastic optimization, machine learning, document analysis, etc. The lack of sparse vector support makes it particularly difficult to develop such applications.

As a stop gap, I pull a package SparseVectors.jl. However, I strongly agree that we need sparse vectors in Julia base, and allow getting views of columns of SparseMatrixCSC as sparse vectors (or sparse vector views).

ViralBShah · 2015-05-17T10:04:13Z

Why not just bring in that SparseVectors package into Base?

lindahua · 2015-05-17T10:12:15Z

It seems to me that people are still debating whether we should have this in a package or in the Base. If the consensus has been reached, we can proceed with this PR.

Also, that package can work with Julia 0.3.

lindahua · 2015-05-17T10:19:13Z

Note that even if this PR is merged, the package is still useful, as we want to have many machine learning packages to work with Julia 0.3, and the SparseVectors.jl package can provide such support.

That being said, if this PR is merged, the package can be rewritten as

if VERSION < xxxx # the version that the PR is merged
    include("x.jl")
    include("xx.jl")
end

ViralBShah · 2015-05-17T10:19:53Z

I think there is reasonable consensus that we should do this. Do they already operate with sparse matrices (like matvecs and such)? Perhaps some things like views may still need to stay in the package, until we have a coherent framework in Base, which also works for sparse matrices.

lindahua · 2015-05-17T10:43:59Z

base/sparse/sparsevector.jl

+    L = 1/L
+
+    rows = Array(Int, 0)
+    for j in randsubseq(1:n, coldensity)


This logic is obviously for multiple-column cases. For a single vector, it can be much simpler.

ViralBShah · 2015-05-17T10:54:51Z

@lindahua IIUC, parts of your package would replace this PR. Or, would you merge some of the stuff here, and parts of your package?

lindahua · 2015-05-17T11:34:05Z

If people think that it is good to have this PR, I can go ahead to fix issues in this PR, add tests, and make it mergeable.

quinnj · 2015-05-17T14:28:05Z

@lindahua, yes, please go ahead with a PR for your SparseVectors.jl package. This has unfortunately been a couple notches down on my TODO since I first needed the functionality and I haven't gotten around to really finishing it.

lindahua · 2015-05-17T14:54:47Z

I will make a new PR later.

StefanKarpinski · 2015-05-17T16:41:59Z

If we're going to move any packages into Base I think it makes sense to insist that they have 100% test coverage first. I know Base itself isn't up to that standard yet, but we are actively working towards it.

mlubin · 2015-05-17T17:22:01Z

Is there really a need to rush this into base? Now that we have a usable package (thanks @lindahua!), what's the harm of leaving it there for a bit so it can mature and develop according to people's needs?

With sparse data structures, the general rule seems to be that everyone wants something different. For example, in JuMP, we use indexed vectors which use dense linear storage together with a list of nonzero indices. These feature O(1) getindex and setindex, which is useful for us. On the other hand, @lindahua's sparse vector uses packed storage and sorted indices. Someone else might want to use packed storage and unsorted indices. Given that it's very hard to change a name that's exported from Base, it seems possibly premature for this particular sparse vector implementation to have the name SparseVector. At least sparse matrices include the data structure in the name.

tkelman · 2015-05-17T21:12:20Z

The problem is that we're currently faking sparse vectors in Base via 1-column CSC matrices, and we need to stop doing that because it's buggy and wrong.

tkelman · 2015-05-18T04:43:42Z

Given that Dahua's new package is quite good already, and being outside of Base does allow it to experiment and add features considerably more quickly, it might actually be best to remove sparsevec from base at this time. It's been said that we shouldn't remove any more code until 0.5-dev which I largely agree with, but Miles makes a good point too.

lindahua · 2015-05-18T05:17:15Z

I am ok with either way, leaving this stuff in the package or moving it to Base.

If we decide to have SparseVector in a package for a while, then we should definitely remove sparsevec and vec(::SparseMatrixCSC), etc. As @tkelman said, these functions are supposed to return a vector, that is, an instance of AbstractVector, instead they return a matrix.

Corrected versions of these functions can be provided in SparseVectors.jl after they are removed from the Base.

tkelman · 2015-05-18T05:23:55Z

Corrected versions of these functions can be provided in SparseVectors.jl after they are removed from the Base.

Could they be added to the package before deleting them from base? Perhaps not exported from the package to avoid name conflicts.

lindahua · 2015-05-18T05:33:53Z

Sounds like a good idea.

ViralBShah · 2015-05-18T06:32:38Z

This PR is not about adding a new package to base, but SparseVector functionality that has been developed separately in a package. Doing this now, will help avoid a lot of pain going forward - otherwise 0.4 will have the fake sparse vectors too.

ViralBShah · 2015-05-18T06:46:54Z

@mlubin We could call this SparseVectorCompressed or something else so that the generic name is not taken. But, the general design of compressed and sorted indices is in line with the current sparse matrix data structure. This is exactly what this PR has been discussing for a really long time, and we finally have an implementation!

I really would like to get this into 0.4, since as @tkelman pointed out, not doing anything means continuing to do what we are doing currently.

@StefanKarpinski I do agree about 100% test coverage.

tkelman · 2015-05-18T06:49:46Z

Given that we have SparseMatrixCSC in base and that's not going anywhere before 0.4, we should make a decision to either deprecate or fix the broken sparse vectors that we have now. We should make that decision and do one or the other relatively soon.

Fixing them requires introducing a 1-d sparse vector type. There are many possible representations for a 1-d sparse vector type that someone might want to use, but for the methods related to taking a slice or view from a column of a SparseMatrixCSC, there's a pretty obvous natural choice. It doesn't have to be called exactly SparseVector, and it might not even need to be exported.

(and I apparently typed this at the same time as @ViralBShah made almost exactly the same point, and @lindahua too in #11324)

lindahua · 2015-05-18T06:53:36Z

#11324 makes sure that we will make a decision before 0.4.

Initial SparseVector support + tests

2c81b7c

ViralBShah added the sparse Sparse arrays label Oct 18, 2014

mlubin reviewed Oct 19, 2014
View reviewed changes

jiahao force-pushed the master branch from cdde4df to 7fdc860 Compare October 28, 2014 04:20

MikeInnes force-pushed the master branch from 5c60996 to b1c3df3 Compare November 14, 2014 17:07

ViralBShah added this to the 0.4 milestone Feb 15, 2015

quinnj mentioned this pull request Mar 10, 2015

So how do I use this thing? JuliaTime/TimeZones.jl#1

Closed

tkelman mentioned this pull request Apr 14, 2015

Base.squeeze: bug for sparse matrix #10813

Closed

ViralBShah modified the milestones: 0.4.0, 0.4.x May 17, 2015

lindahua reviewed May 17, 2015
View reviewed changes

quinnj closed this May 17, 2015

lindahua mentioned this pull request May 18, 2015

Remove methods for creating "sparse vectors" #11323

Closed

lindahua mentioned this pull request May 18, 2015

About SparseVectors #11324

Closed

tkelman deleted the jq/sparsevec branch October 4, 2015 07:35

[WIP] SparseVector support + tests #8718

[WIP] SparseVector support + tests #8718

Conversation

quinnj commented Oct 18, 2014

quinnj commented Oct 18, 2014

ViralBShah commented Oct 18, 2014

ViralBShah commented Oct 18, 2014

quinnj commented Oct 18, 2014

mlubin Oct 19, 2014

Choose a reason for hiding this comment

quinnj Oct 20, 2014

Choose a reason for hiding this comment

ViralBShah Feb 15, 2015

Choose a reason for hiding this comment

ViralBShah commented Oct 20, 2014

quinnj commented Oct 20, 2014

tkelman commented Oct 21, 2014

ViralBShah commented Nov 6, 2014

quinnj commented Nov 12, 2014

ViralBShah commented Feb 14, 2015

ViralBShah commented Feb 15, 2015

quinnj commented Feb 15, 2015

andreasnoack commented Feb 15, 2015

tkelman commented Feb 16, 2015

ViralBShah commented Feb 16, 2015

andreasnoack commented Feb 16, 2015

tkelman commented Feb 16, 2015

quinnj commented Feb 19, 2015

ViralBShah commented Feb 19, 2015

andreasnoack commented Feb 19, 2015

IainNZ commented Feb 19, 2015

StefanKarpinski commented Feb 19, 2015

johnmyleswhite commented Feb 19, 2015

ViralBShah commented Feb 19, 2015

lindahua commented May 17, 2015

ViralBShah commented May 17, 2015

lindahua commented May 17, 2015

lindahua commented May 17, 2015

ViralBShah commented May 17, 2015

lindahua May 17, 2015

Choose a reason for hiding this comment

ViralBShah commented May 17, 2015

lindahua commented May 17, 2015

quinnj commented May 17, 2015

lindahua commented May 17, 2015

StefanKarpinski commented May 17, 2015

mlubin commented May 17, 2015

tkelman commented May 17, 2015

tkelman commented May 18, 2015

lindahua commented May 18, 2015

tkelman commented May 18, 2015

lindahua commented May 18, 2015

ViralBShah commented May 18, 2015

ViralBShah commented May 18, 2015

tkelman commented May 18, 2015

lindahua commented May 18, 2015