What are typical problem sizes? #6

Merged: 5 commits from benchmark-problem-sizes into GeoStat-Framework:main, Oct 12, 2021
Conversation

adamreichold (Contributor)

Considering my experience benchmarking the summator and krige functions so far, and having found two significantly different problem sizes in the benchmark input generation script, I wonder what typical real-world problem sizes are.

This is important insofar as the best strategy for parallelizing the algorithms depends on the problem sizes. For example, using the larger of the two problem sizes, not parallelizing the second loop level in the summator_incompr function and parallelizing it in the krige functions both improve performance, as indicated in the commit messages.

Using larger problem sizes does increase the benchmark runtime significantly, but using Criterion's --save-baseline and --baseline features as well as reducing the --sample-size during iterative development would make this feasible IMHO, if the larger problems are actually more realistic, that is.
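
For illustration, the workflow I have in mind (the baseline name is just an example):

> cargo bench -- --save-baseline full                # record the expensive reference run once
> cargo bench -- --baseline full --sample-size 10    # cheap comparisons during development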

Further, if problem sizes are expected to vary significantly, it might be preferable to always limit parallelization to the outermost loops as a conservative choice that should always yield acceptable performance; see the sketch below.
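
As a sketch of what I mean (hypothetical shapes and names, not the actual crate code; requires ndarray's rayon feature):

use ndarray::{Array1, ArrayView1, ArrayView2, Zip};

// Hypothetical kriging-like kernel: only the outermost loop over the
// target positions is parallelized; the inner reduction stays serial.
// Assumes krige_mat.nrows() == cond.len().
fn krige_outer_only(krige_mat: ArrayView2<f64>, cond: ArrayView1<f64>) -> Array1<f64> {
    let mut field = Array1::<f64>::zeros(krige_mat.ncols());
    Zip::from(&mut field)
        .and(krige_mat.columns())
        .par_for_each(|f, col| {
            // serial inner loop over the conditioning values
            *f = col.dot(&cond);
        });
    field
}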

Comment on lines -70 to -80
pos_no = 10000
cond_no = 500
krige_mat, k_vec, krige_cond = prepare_data(pos_no, cond_no)

np.savetxt(path / 'krige_vs_krige_mat.txt', krige_mat)
np.savetxt(path / 'krige_vs_k_vec.txt', k_vec)
np.savetxt(path / 'krige_vs_krige_cond.txt', krige_cond)

pos_no = 1000
cond_no = 50
krige_mat, k_vec, krige_cond = prepare_data(pos_no, cond_no)
Member

See above comment

Member

The question about typical problem sizes is hard to answer, as you might already have suspected.

I just checked a publication of mine, where I used a very old C++ implementation of the summator_incompr function. I used a 2d grid of 4600 x 1800 ~ 8*10^6 cells, which I would consider a larger problem size (for a 2d problem). I also actually used 6400 modes (len of z1, z2), which is probably a bit much (nowadays I always use exactly 1000 modes). Running one calculation took 48 minutes on JURECA.

I guess @MuellerSeb can make much better estimates on typical sizes of kriging problems, as he has much more experience with it.

And just for the record: in this publication, I estimated a variogram on an unstructured grid with 412 cells and a bin length of 28, which is computationally rather small, but I think some variogram estimations are much smaller.
In this example, which uses non-synthetic data, I used 2000 data points and about 30 bins to estimate the variogram. But we actually only sampled a small part of the data due to computational costs.

But all in all, for small problem sizes, the run times are hardly noticeable, so I support the idea of using larger problem sizes and reducing the criterion sample sizes.

Regarding your final point about maybe only parallelizing the outermost loops: as long as we are not working on 64 or more cores, that would probably at least be the more conservative solution and avoid some potentially surprising declines in performance.

adamreichold (Author)

So

But all in all, for small problem sizes, the run times are hardly noticeable, so I support the idea of using larger problem sizes and reducing the criterion sample sizes.

would suggest fixing up the first commit to just drop the *_bench_* variants, whereas

Regarding your final point about maybe only parallelizing the outermost loops: as long as we are not working on 64 or more cores, that would probably at least be the more conservative solution and avoid some potentially surprising declines in performance.

would imply dropping the last commit, i.e. removing all parallelization except for the outermost loops.

adamreichold (Author)

would imply dropping the last commit, i.e. removing all parallelization except for the outermost loops.

Will wait for @MuellerSeb to chime in before making that change.

…tion

When switching to larger problem sizes, this yields a measurable speed-up

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 149.7s, or reduce sample count to 10.
field summate incompr   time:   [1.4950 s 1.4954 s 1.4958 s]
                        change: [-24.384% -23.691% -23.212%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

When switching to larger problem sizes, this does slightly improve performance:

Benchmarking krige error: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 350.8s, or reduce sample count to 10.
krige error             time:   [1.2209 s 1.2643 s 1.3303 s]
                        change: [-6.9990% -3.7794% +0.8052%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

Benchmarking krige: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 121.4s, or reduce sample count to 10.
krige                   time:   [1.2105 s 1.2118 s 1.2134 s]
                        change: [-8.6930% -8.4186% -8.1473%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
@adamreichold (Author) commented Oct 2, 2021

As a more literary thought: it strikes me that a memory-safe and thereby concurrency-safe language lets one spend mental energy on interesting questions like "how much parallelism is actually beneficial for my typical problem sizes?" instead of investing that same energy into merely achieving parallelism without mistakes.

@LSchueler (Member)

As a more literary thought: it strikes me that a memory-safe and thereby concurrency-safe language lets one spend mental energy on interesting questions like "how much parallelism is actually beneficial for my typical problem sizes?" instead of investing that same energy into merely achieving parallelism without mistakes.

That's a nice thought! Although, in a perfect world, I think there shouldn't be any performance penalties for inserting Rayon parallel iterators, as they actually decide how to divide the data.
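
To illustrate what I mean with a generic example (plain Rayon, not our code): the serial and parallel versions differ only in the iterator constructor, and Rayon decides how to split the work.

use rayon::prelude::*;

fn sum_of_squares(v: &[f64]) -> f64 {
    v.iter().map(|x| x * x).sum()
}

fn sum_of_squares_par(v: &[f64]) -> f64 {
    // the only change: iter() -> par_iter(); Rayon picks the splits
    v.par_iter().map(|x| x * x).sum()
}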

@LSchueler (Member)

I just stumbled upon a small project where I actually did some Kriging. It had 103 data points (len of cond) and I did the interpolation on a 2d mesh with 500 by 500 grid cells.
But Sebastian will soon also join the discussion.

@adamreichold (Author)

Although, in a perfect world, I think there shouldn't be any performance penalties for inserting Rayon parallel iterators, as they actually decide how to divide the data.

Well, they do, but some amount of overhead is often unavoidable: even if Rayon's heuristic decides not to parallelize at all, just making that decision has a runtime cost. Furthermore, parallel algorithms often need intermediate buffers to avoid synchronizing writes, which is an overhead paid even by effectively single-threaded instantiations.
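
To make the buffer point concrete (a generic Rayon sketch, not our code): each thread folds byte counts into its own histogram and the buffers are merged at the end; those allocations are paid even if a single thread ends up doing all the work.

use rayon::prelude::*;

fn histogram(data: &[u8]) -> Vec<u64> {
    data.par_iter()
        .fold(
            || vec![0u64; 256], // per-thread intermediate buffer
            |mut hist, &byte| {
                hist[byte as usize] += 1;
                hist
            },
        )
        .reduce(
            || vec![0u64; 256],
            |mut lhs, rhs| {
                // merge the per-thread buffers
                for (l, r) in lhs.iter_mut().zip(&rhs) {
                    *l += r;
                }
                lhs
            },
        )
}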

Another thing to note is that while Rayon does make the decision, it does not know the workload, i.e. whether it is beneficial to put every item into a separate task or only batches of 1000. For indexed parallel iterators this can be controlled via with_min_len, but ndarray's Zip seems to be limited to geometric splits (i.e. it has a split but no split_at method like ArrayBase and friends). Not sure yet how this can be worked around...
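
For slices, where indexed parallel iterators come for free, that knob looks like this (the numbers are made up):

use rayon::prelude::*;

fn main() {
    let data: Vec<f64> = (0..1_000).map(f64::from).collect();
    // each task processes at least 125 items, bounding the number of
    // splits regardless of how many worker threads are idle
    let sum: f64 = data.par_iter().with_min_len(125).map(|x| x * x).sum();
    println!("{sum}");
}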

…atedly computing k_2.

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.7s.
field summate incompr   time:   [1.0707 s 1.0732 s 1.0761 s]
                        change: [-28.910% -28.377% -27.966%] (p = 0.00 < 0.05)
                        Performance has improved.
@adamreichold (Author) commented Oct 5, 2021

Not sure yet how this can be worked around...

So instead of doing nothing like I am supposed to at the moment, I banged my head against this a few times. I was able to improve the performance of the summator_incompr function by a further 25% by switching the loop nesting so that the k_2 computation is not repeated redundantly, which is the last commit on this branch. However, while the slowdown is not as bad as before, parallelizing this implementation still makes it slower than the serial version.
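
A serial sketch of the switched nesting (hypothetical shapes, not the literal commit): the outer loop runs over the modes, so k_2 is computed once per mode instead of once per position.

use ndarray::{Array2, ArrayView1, ArrayView2, Axis, Zip};

// Hypothetical shapes: cov_samples is (dim, modes), pos is (dim, positions).
fn summator_incompr_serial(
    cov_samples: ArrayView2<f64>,
    z1: ArrayView1<f64>,
    z2: ArrayView1<f64>,
    pos: ArrayView2<f64>,
) -> Array2<f64> {
    let mut summed_modes = Array2::<f64>::zeros(pos.dim());
    for (i, cs) in cov_samples.axis_iter(Axis(1)).enumerate() {
        // hoisted: one k_2 per mode instead of one per (mode, position)
        let k_2 = cs[0] / cs.dot(&cs);
        Zip::from(pos.columns())
            .and(summed_modes.columns_mut())
            .for_each(|p, mut out| {
                let phase = cs.dot(&p);
                let z12 = z1[i] * phase.cos() + z2[i] * phase.sin();
                for (d, sum) in out.iter_mut().enumerate() {
                    // e1 is the unit vector along the first axis
                    let e1 = if d == 0 { 1.0 } else { 0.0 };
                    *sum += (e1 - cs[d] * k_2) * z12;
                }
            });
    }
    summed_modes
}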

I think the main culprit here is https://github.com/rust-ndarray/ndarray/blob/566177b2e4d4556f841eebe78cafdfb1632235bf/src/zip/mod.rs#L878 which is used to determine whether a Zip can be split further. This leads to a large number of intermediate temporary arrays being allocated: e.g. if 1_000 modes are used, then on average 500 temporaries are allocated on my 8-hardware-thread CPU. (This was significantly worse before switching the nesting order, yielding millions of temporaries and hence allocations, even though each one was just a 3x1 array instead of a 3x100_000 array.)

If I use the slightly contrived version

cov_samples
    .axis_iter(Axis(1))          // indexed iterator over the modes
    .into_par_iter()
    .zip(z1.axis_iter(Axis(0)))  // pair each mode with its z1 and z2
    .zip(z2.axis_iter(Axis(0)))
    .with_min_len(125)           // at least 125 modes per task
...

to get an IndexedParallelIterator and thereby access to with_min_len, setting the optimal block size of 1_000 / 8 = 125 for my system, I do get measurable speed-ups:

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.0s.
field summate incompr   time:   [987.01 ms 990.54 ms 994.57 ms]                                
                        change: [-8.0567% -7.6046% -7.1657%] (p = 0.00 < 0.05)
                        Performance has improved.

but I feel both the implementation and the fine-tuning are wrong. I am not sure how to best fix this, whether in Rayon (should it be possible to limit the number of jobs a Fold creates?) or in ndarray (should NdProducer: IntoIterator and thereby Zip: IndexedParallelIterator?).

@adamreichold (Author)

I am not sure how to best fix this

rayon-rs/rayon#857 improves things considerably, and I opened rust-ndarray/ndarray#1081, which would allow limiting the splitting depending on workload characteristics.

Using only the patch to ndarray and the following parallel implementation

Zip::from(cov_samples.columns())
    .and(z1)
    .and(z2)
    .into_par_iter()   // needs the rust-ndarray/ndarray#1081 patch
    .with_min_len(100) // at least 100 modes per task
    .fold(
        || Array2::<f64>::zeros(pos.dim()), // per-thread partial sum
        |mut summed_modes, (cov_samples, z1, z2)| {
            // computed once per mode, not once per position
            let k_2 = cov_samples[0] / cov_samples.dot(&cov_samples);

            Zip::from(pos.columns())
                .and(summed_modes.columns_mut())
                .par_for_each(|pos, mut summed_modes| {
                    let phase = cov_samples.dot(&pos);
                    let z12 = z1 * phase.cos() + z2 * phase.sin();

                    Zip::from(&mut summed_modes)
                        .and(&e1)
                        .and(cov_samples)
                        .for_each(|sum, e1, cs| {
                            *sum += (*e1 - cs * k_2) * z12;
                        });
                });

            summed_modes
        },
    )
    .reduce_with(|mut lhs, rhs| {
        // merge the per-thread partial sums
        lhs += &rhs;
        lhs
    })
    .unwrap()

I get small but consistent speed-ups, e.g.

Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.1s.
field summate incompr   time:   [1.0061 s 1.0096 s 1.0120 s]                                   
                        change: [-6.3574% -5.9467% -5.7063%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild

@LSchueler (Member)

Those are some really cool insights into Rayon!
Somehow summator_incompr is a tough nut, right?

@adamreichold (Author) commented Oct 8, 2021

Somehow summator_incompr is a tough nut, right?

Actually, I think this is more of an ecosystem "issue": Rayon makes it almost trivial to parallelize algorithms once libraries like ndarray provide the necessary plumbing for their data types. Most of the time this is more than enough to achieve useful speed-ups, but remembering how the same algorithms would be handled using e.g. manual threading or even OpenMP, it would be rather surprising if some tuning were not beneficial or sometimes even required to achieve speed-ups. Rayon is often described as "magic" by developers first making use of it. Alas, there is no such thing, and when the defaults do not work out, one has to look deeper into the stack.

Since I do not know how long it will take rust-ndarray/ndarray#1081 to land (if it lands at all), I think I will commit the "iterate along the first axis" trick as a workaround for now. I think we should be in a position to merge this afterwards? The problems you looked up are as large as or larger than the benchmark problems now, and all algorithms (except the variograms, which are not yet index-free) would benefit from parallelisation.

This is a speed-up due to the call to `with_min_len`, which however requires
indexed parallel iterators, which `ndarray` provides only via `axis_iter` and
`axis_chunks_iter`, so this workaround is necessary until [1] is merged.

[1] rust-ndarray/ndarray#1081
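
To illustrate the workaround (a hypothetical example, not the crate code): iterating along an axis yields an exact-length parallel iterator, which exposes with_min_len.

use ndarray::{Array2, Axis};
use rayon::prelude::*;

// Hypothetical example: per-column norms of a (dim, positions) array.
// axis_iter gives an indexed parallel iterator via ndarray's rayon
// bridge, unlike Zip's purely geometric splitting.
fn column_norms(m: &Array2<f64>) -> Vec<f64> {
    m.axis_iter(Axis(1))
        .into_par_iter()
        .with_min_len(100) // at least 100 columns per task
        .map(|col| col.dot(&col).sqrt())
        .collect()
}
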
@adamreichold (Author)

Just for old times' sake:

I just checked a publication of mine, where I used a very old C++ implementation of the summator_incompr function. I used a 2d grid of 4600 x 1800 ~ 8*10^6 cells, which I would consider a larger problem size (for a 2d problem). I also actually used 6400 modes (len of z1, z2), which is probably a bit much (nowadays I always use exactly 1000 modes). Running one calculation took 48 minutes on JURECA.

diff --git a/benches/gen_benchmark_inputs.py b/benches/gen_benchmark_inputs.py
index 142581d..8485d5f 100755
--- a/benches/gen_benchmark_inputs.py
+++ b/benches/gen_benchmark_inputs.py
@@ -8,8 +8,8 @@ from gstools.field.generator import RandMeth, IncomprRandMeth
 
 
 def gen_field_summate(path, seed):
-    pos_no = 100_000
-    mode_no = 1_000
+    pos_no = 4600 * 1800
+    mode_no = 6400
     x = np.linspace(0.0, 10.0, pos_no)
     y = np.linspace(-5.0, 5.0, pos_no)
     z = np.linspace(-6.0, 8.0, pos_no)
> /usr/bin/time ./target/release/examples/incompr
4151.12user 19.38system 8:53.53elapsed 781%CPU (0avgtext+0avgdata 6405948maxresident)k
0inputs+0outputs (0major+3149538minor)pagefaults 0swaps

@LSchueler merged commit 8561a04 into GeoStat-Framework:main on Oct 12, 2021
@adamreichold deleted the benchmark-problem-sizes branch on October 12, 2021