Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster sprand, new randsubseq (fixes #6714 and #6708) #6726

Merged
merged 2 commits into from
May 6, 2014

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented May 3, 2014

This patch implements a much faster (at least 5× on my laptop, but on an older machine I tried it was only 2×) sprand (fixing #6708), by filling in the matrix one (nonempty) column at a time. It also changes the meaning of the specified density to something that is (I think) easier to explain: the probability of any element of the matrix being nonzero is (independently) given by density.

A subroutine used by both this sprand and the previous sprand is a function to compute a random subsequence of a given AbstractArray (e.g. 1:n). Since this is generally useful functionality, it seemed sensible to export it as functions randsubseq(A, p) and randsubseq!(S, A, p). What these functions do is to create a random (ordered) subsequence of A by selecting each element of A with independent probability p. (This is different from the problem of finding a random subset of a fixed size.)

randsubseq is implemented using a new algorithm different from the previous one (which had the same mean (modulo the bug in #6714) but a more complicated, and I think less useful, distribution). The new algorithm is somewhat faster, requires no temporary arrays, has Θ(p * length(A)) complexity, and avoids the statistical bug of #6714; it is inspired by an algorithm by Vitter for a related problem.

(As discussed in #6714, it might be nice to also have a randsubseq(S, A) that finds a random subsequence of fixed length == length(S). The algorithm in StatsBase for this problem is currently quite suboptimal according to my benchmarks as well as theoretically. However, that may go into StatsBase rather than Base and in any case should be a separate PR.)

cc: @timholy, @StefanKarpinski, @lindahua

@stevengj
Copy link
Member Author

stevengj commented May 3, 2014

Whoops, the statistics of sprand still aren't quite right for small matrices: once I've selected a column to fill, I need to ensure that the rows array is not empty. This could be accomplished by while isempty(rows); randsubseq!(rows, 1:m, density); end, but that is quite inefficient for low density. I'm pretty sure I can do better with a simple fix, however, just by computing the expected last nonzero index of a nonempty column.... will revise shortly.

Okay, should be fixed now:

julia> ns = 10 .^ [2:7]; er = [mean([nfilled(sprand(10,10,0.1)) for i = 1:n]) for n in ns] .- 10
6-element Array{Any,1}:
 -0.23    
 -0.141   
  0.0075  
  0.00411 
  0.003047
 -0.001179

@stevengj
Copy link
Member Author

stevengj commented May 3, 2014

cc: @alanedelman, our resident random-matrix expert

@andreasnoack
Copy link
Member

I have always been confused by MATLAB's sprandn. The documentation doesn't say how the random sparsity pattern is chosen, but only how many non-zeros you should expect (and that number happens to be way off). The number of non-zeros doesn't characterise the distribution at all.

I think it is a big improvement over MATLAB's implementation that you, in addition to have the right expected number of non-zeros, are explicit about how the random sparsity structure is chosen. As I understand your explanation, your new sprandn would return something like m*n independent Bernoulli(density) draws where each non-zero is then replaced by a normal variate. If so, I think the independence property should be made explicit in the documentation.

A minor comment is that I think the label density is misleading and I would prefer p or probability or something like that.

@stevengj
Copy link
Member Author

stevengj commented May 3, 2014

@andreasnoackjensen, that's right. I will update the docs.

@timholy
Copy link
Member

timholy commented May 4, 2014

LGTM

@stevengj
Copy link
Member Author

stevengj commented May 4, 2014

Hmm, Travis error seems unrelated:

exception on 9: Worker 2 terminated.
ERROR: ProcessExitedException()
in remotecall_fetch at multi.jl:696
in remotecall_fetch at multi.jl:701
in runtests at /tmp/julia/share/julia/test/testdefs.jl:5
in anonymous at multi.jl:847
in run_work_thunk at multi.jl:613
in anonymous at task.jl:847
while loading parallel.jl, in expression starting on line 1276
ERROR: ProcessExitedException()
in anonymous at task.jl:1352
while loading /tmp/julia/share/julia/test/runtests.jl, in expression starting on line 46

Another one of the usual random Travis failures.

@StefanKarpinski
Copy link
Member

Seems like a very positive change. It would be good to have @alanedelman's input on the new behavior. Is this behavior something that's useful? Are random matrices generated like this handy in practice? Do they have well understood statistical structure? Is there some other kind of random sparse matrix that's particularly useful? Should this be in base at all?

@stevengj
Copy link
Member Author

stevengj commented May 5, 2014

There seem to be a large number of papers on sparse random matrices (google "random sparse matrix").

Many (though not all) seem to use a distribution similar to my sprand (e.g. this paper and the ~100 papers that cite it, or more recent papers such as this one), albeit often restricted to symmetric sparsity patterns, motivated by various types of random-network models (e.g. spin glass models). (It might be useful to have an analogous sprandsym for random symmetric sparse matrices, similar to Matlab except with a documented and useful distribution.)

Aside from models of physical systems, random networks etc., and compressive-sensing applications, random sparse matrices are also pretty useful for tests and benchmarks of sparse-matrix algorithms (for which application the precise distribution probably doesn't matter much, but it is useful to have a well-defined distribution in any case).

I'll try to bug Alan about it tomorrow if I run into him.

@stevengj
Copy link
Member Author

stevengj commented May 5, 2014

Seems to compare well to Matlab in terms of performance:

julia> @time sprand(10^7,10^7,1e-7);
elapsed time: 2.889734603 seconds (770679184 bytes allocated)

vs.

>> tic; sprand(10^7,10^7,1e-7); toc
Elapsed time is 12.458704 seconds.

on the same machine with a single thread.

@alanedelman
Copy link
Contributor

The random matrix literature on sparse matrices may not yet be ready to be relevant
for computation -- though it's an interesting question.

One could ask for random matrices given a specific sparsity pattern, or an approximate nnz number.
Very cute is the algorithm that generates a random sparse matrix with given singular values
or condition number, by picking jacobi rotations carefully --- though you only have approximate
control over this so you get not great answers in small cases.

@StefanKarpinski
Copy link
Member

Any comments on whether this particular sparse random matrix generation is a decent default choice?

@stevengj
Copy link
Member Author

stevengj commented May 5, 2014

I just spoke to Alan. Matlab's sprand has three variants: sprand(S), which keeps the sparsity pattern of S but randomizes the entries; sprand(m,n,density), which is kinda like ours but has a poorly specified distribution; and sprand(m,n,density,rc), in which you specify the desired reciprocal condition number rc (or the desired singular values). Alan only seems to have strong feelings that the third variant is useful (for testing) and would be good to have, and isn't much interested one way or the other in sprand(m,n,density).

@stevengj
Copy link
Member Author

stevengj commented May 5, 2014

If we are going to have an sprand(m,n,density) function, it seems like this is a reasonable default choice (possibly the most reasonable), and is a clear improvement over the previous version. Okay to merge?

@JeffBezanson
Copy link
Member

Ok by me.

@StefanKarpinski
Copy link
Member

Go for it.

On May 5, 2014, at 5:06 PM, Jeff Bezanson notifications@github.com wrote:

Ok by me.


Reply to this email directly or view it on GitHub.

stevengj added a commit that referenced this pull request May 6, 2014
faster sprand, new randsubseq (fixes #6714 and #6708)
@stevengj stevengj merged commit 593a9b8 into JuliaLang:master May 6, 2014
@stevengj stevengj deleted the sprand branch May 6, 2014 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants