What are typical problem sizes? #6
```python
pos_no = 10000
cond_no = 500
krige_mat, k_vec, krige_cond = prepare_data(pos_no, cond_no)

np.savetxt(path / 'krige_vs_krige_mat.txt', krige_mat)
np.savetxt(path / 'krige_vs_k_vec.txt', k_vec)
np.savetxt(path / 'krige_vs_krige_cond.txt', krige_cond)

pos_no = 1000
cond_no = 50
krige_mat, k_vec, krige_cond = prepare_data(pos_no, cond_no)
```
See above comment
The question about typical problem sizes is hard to answer, as you might have already thought.
I just checked a publication of mine, where I used a very old C++ implementation of the `summator_incompr` function. I used a 2d grid of 4600 x 1800 ~ 8*10^6 cells, which I would consider a larger problem size (for a 2d problem). I also actually used 6400 modes (the length of `z1` and `z2`), which is probably a bit much (nowadays I always use exactly 1000 modes). Running one calculation took 48 minutes on JURECA.
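As a rough, hedged back-of-envelope check of what "larger problem size" means here: in the randomization method every grid cell is summed over every mode, so the work grows with cells times modes. A minimal sketch of that arithmetic (illustrative only, not code from the project):

```python
# Back-of-envelope work estimate for the randomization method:
# each of the grid cells is summed over every Fourier mode.
cells = 4600 * 1800        # the 2d grid from the publication
modes = 6400               # number of modes (length of z1/z2)

assert cells == 8_280_000  # matches the ~8*10^6 cells quoted above

pairwise_evals = cells * modes
print(f"{pairwise_evals:.2e} cell-mode evaluations")
```

At roughly 5.3e10 cell-mode evaluations, a 48-minute runtime on a single node is plausible, which supports treating this as the upper end of realistic benchmark sizes.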
I guess @MuellerSeb can make much better estimates on typical sizes of kriging problems, as he has much more experience with it.
And just for the record: in this publication, I estimated a variogram on an unstructured grid with 412 cells and a bin length of 28, which, computation-wise, is rather small, but I think some variogram estimations are much smaller.
In this example, which uses non-synthetic data, I used 2000 data points and about 30 bins to estimate the variogram. But we actually only sampled a small part of the data due to computational costs.
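To make the shape of such a workload concrete, here is a minimal, purely illustrative sketch of an empirical (Matheron-style) variogram estimate in 1d; the data, bin edges, and variable names are all made up and are not from the publication or the library:

```python
import numpy as np

# Hypothetical 1d data set, purely for illustration.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.0, 10.0, 200))       # sample locations
v = np.sin(x) + 0.1 * rng.normal(size=x.size)  # sampled values

bins = np.linspace(0.0, 2.0, 31)  # 30 lag bins, as in the example above

# All pairwise lags and half squared differences (Matheron estimator).
lags = np.abs(x[:, None] - x[None, :])
sqdiff = 0.5 * (v[:, None] - v[None, :]) ** 2

gamma = np.empty(bins.size - 1)
for i in range(bins.size - 1):
    # Average the half squared differences of all pairs in this lag bin.
    mask = (lags > bins[i]) & (lags <= bins[i + 1])
    gamma[i] = sqdiff[mask].mean() if mask.any() else np.nan
```

The O(n^2) pairwise-lag matrix is exactly why subsampling (as described above) becomes necessary for larger point sets.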
But all in all, for small problem sizes, the run times are hardly noticeable, so I support the idea of using larger problem sizes and reducing the criterion sample sizes.
Regarding your final point about maybe only parallelizing the outermost loops: as long as we are not working on 64 or more cores, that would probably at least be the more conservative solution and avoid some potential and surprising declines in performance.
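The "parallelize only the outermost loop" strategy can be sketched in a few lines of Python (an analogy, not the crate's Rust code): one task per outer iteration, with the inner work kept serial and vectorized inside each task.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

a = np.arange(12.0).reshape(4, 3)  # toy input, one task per row

def row_norm(row):
    # The inner loop stays serial (and vectorized) within each task.
    return np.sqrt((row ** 2).sum())

# Outermost loop only: dispatch one task per row.
with ThreadPoolExecutor(max_workers=4) as ex:
    norms = list(ex.map(row_norm, a))
```

With few cores, each outer task already carries enough work to keep all workers busy, so parallelizing the inner loop as well would only add scheduling overhead.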
So

> But all in all, for small problem sizes, the run times are hardly noticeable, so I support the idea of using larger problem sizes and reducing the criterion sample sizes.

would suggest to just fix up the first commit to drop the `*_bench_*` variants, whereas

> Regarding your final point to maybe only parallelize the outer most loops: as long as we are not working on 64 or more cores, that would probably at least be the more conservative solution and save some potential and surprising declines in performance.

would imply to drop the last commit, i.e. remove all parallelization except for the outermost loops.
> would imply to drop the last commit, i.e. remove all parallelization except for the outermost loops.
Will wait for @MuellerSeb to chime in before making that change.
…sm at the cost of runtime.
…tion

When switching to larger problem sizes, this yields a measurable speed-up:

```
Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 149.7s, or reduce sample count to 10.
field summate incompr   time:   [1.4950 s 1.4954 s 1.4958 s]
                        change: [-24.384% -23.691% -23.212%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
```
When switching to larger problem sizes, this does slightly improve performance:

```
Benchmarking krige error: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 350.8s, or reduce sample count to 10.
krige error   time:   [1.2209 s 1.2643 s 1.3303 s]
              change: [-6.9990% -3.7794% +0.8052%] (p = 0.07 > 0.05)
              No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

Benchmarking krige: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 121.4s, or reduce sample count to 10.
krige   time:   [1.2105 s 1.2118 s 1.2134 s]
        change: [-8.6930% -8.4186% -8.1473%] (p = 0.00 < 0.05)
        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
```
As a more literary thought: it appears noticeable to me that a memory-safe, and thereby concurrency-safe, language enables one to spend time thinking about interesting problems like "how much parallelism is actually beneficial for my typical problem sizes" instead of having to invest the same mental energy into achieving parallelism without making mistakes at all.
That's a nice thought! Although, in a perfect world, I think there shouldn't be any performance penalties for inserting Rayon parallel iterators, as they actually decide how to divide the data.
I just stumbled upon a small project where I actually did some Kriging. It had 103 data points (len of …
Well, they do, but some amount of overhead is often unavoidable: even if Rayon's heuristic makes the decision not to parallelize at all, just making that decision has a runtime cost. Furthermore, parallel algorithms often need intermediate buffers to avoid synchronizing writes, which is an overhead paid even by effectively single-threaded instantiations. Another thing to note is that while Rayon does make the decision, it does not know the workload, i.e. whether it is beneficial to put every item into a separate task or just batches of 1000. For indexed parallel iterators this can be controlled via `with_min_len`.
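The batching point has a close analogue in Python's standard library, which may make it easier to see: `multiprocessing.pool.ThreadPool.map` takes a `chunksize` argument that plays roughly the role of Rayon's `with_min_len`, trading scheduling overhead against load balancing. A small illustrative sketch (toy workload, not project code):

```python
from multiprocessing.pool import ThreadPool

def work(x):
    # Stand-in for a cheap per-item computation.
    return x * x

items = list(range(10_000))
with ThreadPool(4) as pool:
    # chunksize=1: one task per item -> maximal scheduling overhead.
    fine = pool.map(work, items, chunksize=1)
    # chunksize=1000: items dispatched in batches -> far fewer tasks.
    coarse = pool.map(work, items, chunksize=1000)
```

Both calls compute the same result; only the number of dispatched tasks differs, which is exactly the knob the scheduler cannot choose well without knowing the per-item cost.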
…atedly computing k_2.

```
Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.7s.
field summate incompr   time:   [1.0707 s 1.0732 s 1.0761 s]
                        change: [-28.910% -28.377% -27.966%] (p = 0.00 < 0.05)
                        Performance has improved.
```
So instead of doing nothing like I am supposed to at the moment, I banged my head into this a few times. I was able to further improve the performance of the …

I think the main culprit here is https://github.com/rust-ndarray/ndarray/blob/566177b2e4d4556f841eebe78cafdfb1632235bf/src/zip/mod.rs#L878 which is used to determine if a …

If I use the slightly contrived version

```rust
cov_samples
    .axis_iter(Axis(1))
    .into_par_iter()
    .zip(z1.axis_iter(Axis(0)))
    .zip(z2.axis_iter(Axis(0)))
    .with_min_len(125)
```

… to get a …, but I feel both the implementation and the fine tuning are wrong. I am not sure how to best fix this, whether in Rayon (should it be possible to limit the number of jobs a …
rayon-rs/rayon#857 improves things considerably and I opened rust-ndarray/ndarray#1081 which would allow limiting the splitting depending on workload characteristics. Using only the patch to …

```rust
Zip::from(cov_samples.columns())
    .and(z1)
    .and(z2)
    .into_par_iter()
    .with_min_len(100)
    .fold(
        || Array2::<f64>::zeros(pos.dim()),
        |mut summed_modes, (cov_samples, z1, z2)| {
            let k_2 = cov_samples[0] / cov_samples.dot(&cov_samples);
            Zip::from(pos.columns())
                .and(summed_modes.columns_mut())
                .par_for_each(|pos, mut summed_modes| {
                    let phase = cov_samples.dot(&pos);
                    let z12 = z1 * phase.cos() + z2 * phase.sin();
                    Zip::from(&mut summed_modes)
                        .and(&e1)
                        .and(cov_samples)
                        .for_each(|sum, e1, cs| {
                            *sum += (*e1 - cs * k_2) * z12;
                        });
                });
            summed_modes
        },
    )
    .reduce_with(|mut lhs, rhs| {
        lhs += &rhs;
        lhs
    })
    .unwrap()
```

I get small but consistent speed-ups, e.g.
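The fold/reduce pattern used here, where each worker owns a private accumulator that is merged only once at the end instead of synchronizing every write, can be sketched in Python as an analogy (toy data and names, not the crate's code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for summing contributions of many modes: each row of
# `modes` is one mode's contribution to 16 positions.
rng = np.random.default_rng(0)
modes = rng.normal(size=(1000, 16))

def partial_sum(chunk):
    # fold step: each worker accumulates into its own private buffer,
    # so no synchronized writes are needed.
    acc = np.zeros(modes.shape[1])
    for row in chunk:
        acc += row
    return acc

chunks = np.array_split(modes, 4)  # one chunk of modes per worker
with ThreadPoolExecutor(max_workers=4) as ex:
    partials = list(ex.map(partial_sum, chunks))

# reduce step: merge the per-worker accumulators exactly once.
total = sum(partials)
```

The cost of this pattern is one accumulator buffer per worker, which is the "intermediate buffers" overhead mentioned earlier; the benefit is that the hot inner loop is entirely free of synchronization.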
Those are some really cool insights into Rayon!
Actually, I think this is more of an ecosystem "issue": Rayon makes it almost trivial to parallelize algorithms once libraries like …

Since I do not know how long it will take to land rust-ndarray/ndarray#1081 (if it lands at all), I think I will commit the "iterate along the first axis of a single axis" trick as a workaround for now. I think we should be in a position to merge this afterwards? The problems you looked up are as large as or larger than the benchmarks now, and all algorithms (except variograms, which are not yet index-free) would benefit from parallelisation.
This is a speed-up due to the call to `with_min_len` which however requires indexed parallel iterators which `ndarray` provides only via `axis_iter` and `axis_chunk_iter`, so this workaround is necessary until [1] is merged. [1] rust-ndarray/ndarray#1081
Just for old times' sake:

```diff
diff --git a/benches/gen_benchmark_inputs.py b/benches/gen_benchmark_inputs.py
index 142581d..8485d5f 100755
--- a/benches/gen_benchmark_inputs.py
+++ b/benches/gen_benchmark_inputs.py
@@ -8,8 +8,8 @@ from gstools.field.generator import RandMeth, IncomprRandMeth
 def gen_field_summate(path, seed):
-    pos_no = 100_000
-    mode_no = 1_000
+    pos_no = 4600 * 1800
+    mode_no = 6400
     x = np.linspace(0.0, 10.0, pos_no)
     y = np.linspace(-5.0, 5.0, pos_no)
     z = np.linspace(-6.0, 8.0, pos_no)
```
Considering my experience benchmarking the summator and krige functions so far, and having found two significantly different problem sizes in the benchmark input generation script, I wonder what typical problem sizes in the real world are.

This is important insofar as the best strategy for parallelizing the algorithms does depend on the problem sizes. For example, using the larger of the two problem sizes, not parallelizing the second loop level in the `summator_incompr` function and parallelizing it in the krige functions both improve performance, as indicated in the commit messages.

Using larger problem sizes does increase the benchmark runtime significantly, but using Criterion's `--save-baseline` and `--baseline` features, as well as reducing the `--sample-size` during iterative development, would make this feasible IMHO, if the larger problems are actually more realistic, that is.

Further, if problem size is expected to vary significantly, it might be preferable to always limit parallelization to the outermost loops as a conservative choice that should always yield acceptable performance.