What are typical problem sizes? #6
```python
pos_no = 10000
cond_no = 500
krige_mat, k_vec, krige_cond = prepare_data(pos_no, cond_no)

np.savetxt(path / 'krige_vs_krige_mat.txt', krige_mat)
np.savetxt(path / 'krige_vs_k_vec.txt', k_vec)
np.savetxt(path / 'krige_vs_krige_cond.txt', krige_cond)

pos_no = 1000
cond_no = 50
krige_mat, k_vec, krige_cond = prepare_data(pos_no, cond_no)
```
See above comment
The question about typical problem sizes is hard to answer, as you might have already thought.
I just checked a publication of mine, where I used a very old C++ implementation of the `summator_incompr` function. I used a 2d grid of 4600 x 1800 ~ 8*10^6 cells, which I would consider a larger problem size (for a 2d problem). I also actually used 6400 modes (the length of `z1` and `z2`), which is probably a bit much (nowadays I always use exactly 1000 modes). Running one calculation took 48 minutes on JURECA.
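As a rough, hedged back-of-envelope check of what "larger problem size" means here: in the randomization method every grid cell is summed over every mode, so the work grows with cells times modes. A minimal sketch of that arithmetic (illustrative only, not code from the project):

```python
# Back-of-envelope work estimate for the randomization method:
# each of the grid cells is summed over every Fourier mode.
cells = 4600 * 1800        # the 2d grid from the publication
modes = 6400               # number of modes (length of z1/z2)

assert cells == 8_280_000  # matches the ~8*10^6 cells quoted above

pairwise_evals = cells * modes
print(f"{pairwise_evals:.2e} cell-mode evaluations")
```

At roughly 5.3e10 cell-mode evaluations, a 48-minute runtime on a single node is plausible, which supports treating this as the upper end of realistic benchmark sizes.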
I guess @MuellerSeb can make much better estimates on typical sizes of kriging problems, as he has much more experience with it.
And just for the record: in this publication, I estimated a variogram on an unstructured grid with 412 cells and a bin length of 28, which, computation-wise, is rather small, but I think some variogram estimations are much smaller.
In this example, which uses non-synthetic data, I used 2000 data points and about 30 bins to estimate the variogram. But we actually only sampled a small part of the data due to computational costs.
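To make the shape of such a workload concrete, here is a minimal, purely illustrative sketch of an empirical (Matheron-style) variogram estimate in 1d; the data, bin edges, and variable names are all made up and are not from the publication or the library:

```python
import numpy as np

# Hypothetical 1d data set, purely for illustration.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.0, 10.0, 200))       # sample locations
v = np.sin(x) + 0.1 * rng.normal(size=x.size)  # sampled values

bins = np.linspace(0.0, 2.0, 31)  # 30 lag bins, as in the example above

# All pairwise lags and half squared differences (Matheron estimator).
lags = np.abs(x[:, None] - x[None, :])
sqdiff = 0.5 * (v[:, None] - v[None, :]) ** 2

gamma = np.empty(bins.size - 1)
for i in range(bins.size - 1):
    # Average the half squared differences of all pairs in this lag bin.
    mask = (lags > bins[i]) & (lags <= bins[i + 1])
    gamma[i] = sqdiff[mask].mean() if mask.any() else np.nan
```

The O(n^2) pairwise-lag matrix is exactly why subsampling (as described above) becomes necessary for larger point sets.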
But all in all, for small problem sizes, the run times are hardly noticeable, so I support the idea of using larger problem sizes and reducing the criterion sample sizes.
Regarding your final point about maybe only parallelizing the outermost loops: as long as we are not working on 64 or more cores, that would probably at least be the more conservative solution and avoid some potential and surprising declines in performance.
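The "parallelize only the outermost loop" strategy can be sketched in a few lines of Python (an analogy, not the crate's Rust code): one task per outer iteration, with the inner work kept serial and vectorized inside each task.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

a = np.arange(12.0).reshape(4, 3)  # toy input, one task per row

def row_norm(row):
    # The inner loop stays serial (and vectorized) within each task.
    return np.sqrt((row ** 2).sum())

# Outermost loop only: dispatch one task per row.
with ThreadPoolExecutor(max_workers=4) as ex:
    norms = list(ex.map(row_norm, a))
```

With few cores, each outer task already carries enough work to keep all workers busy, so parallelizing the inner loop as well would only add scheduling overhead.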
So

> But all in all, for small problem sizes, the run times are hardly noticeable, so I support the idea of using larger problem sizes and reducing the criterion sample sizes.

would suggest to just fix up the first commit to drop the `*_bench_*` variants, whereas

> Regarding your final point to maybe only parallelize the outer most loops: as long as we are not working on 64 or more cores, that would probably at least be the more conservative solution and save some potential and surprising declines in performance.

would imply to drop the last commit, i.e. remove all parallelization except for the outermost loops.
> would imply to drop the last commit, i.e. remove all parallelization except for the outermost loops.
Will wait for @MuellerSeb to chime in before making that change.
…sm at the cost of runtime.
…tion

When switching to larger problem sizes, this yields a measurable speed-up:

```
Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 149.7s, or reduce sample count to 10.
field summate incompr   time:   [1.4950 s 1.4954 s 1.4958 s]
                        change: [-24.384% -23.691% -23.212%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
```
When switching to larger problem sizes, this does slightly improve performance:

```
Benchmarking krige error: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 350.8s, or reduce sample count to 10.
krige error   time:   [1.2209 s 1.2643 s 1.3303 s]
              change: [-6.9990% -3.7794% +0.8052%] (p = 0.07 > 0.05)
              No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

Benchmarking krige: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 121.4s, or reduce sample count to 10.
krige   time:   [1.2105 s 1.2118 s 1.2134 s]
        change: [-8.6930% -8.4186% -8.1473%] (p = 0.00 < 0.05)
        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
```
As a more literary thought: it appears noticeable to me that a memory-safe, and thereby concurrency-safe, language enables one to spend time thinking about interesting problems like "how much parallelism is actually beneficial for my typical problem sizes" instead of having to invest the same mental energy into achieving parallelism without making mistakes at all.
That's a nice thought! Although, in a perfect world, I think there shouldn't be any performance penalties for inserting Rayon parallel iterators, as they actually decide how to divide the data.
I just stumbled upon a small project where I actually did some Kriging. It had 103 data points (len of …
Well, they do, but some amount of overhead is often unavoidable: even if Rayon's heuristic makes the decision not to parallelize at all, just making that decision has a runtime cost. Furthermore, parallel algorithms often need intermediate buffers to avoid synchronizing writes, which is an overhead paid even by effectively single-threaded instantiations. Another thing to note is that while Rayon does make the decision, it does not know the workload, i.e. whether it is beneficial to put every item into a separate task or just batches of 1000. For indexed parallel iterators this can be controlled via `with_min_len`.
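The batching point has a close analogue in Python's standard library, which may make it easier to see: `multiprocessing.pool.ThreadPool.map` takes a `chunksize` argument that plays roughly the role of Rayon's `with_min_len`, trading scheduling overhead against load balancing. A small illustrative sketch (toy workload, not project code):

```python
from multiprocessing.pool import ThreadPool

def work(x):
    # Stand-in for a cheap per-item computation.
    return x * x

items = list(range(10_000))
with ThreadPool(4) as pool:
    # chunksize=1: one task per item -> maximal scheduling overhead.
    fine = pool.map(work, items, chunksize=1)
    # chunksize=1000: items dispatched in batches -> far fewer tasks.
    coarse = pool.map(work, items, chunksize=1000)
```

Both calls compute the same result; only the number of dispatched tasks differs, which is exactly the knob the scheduler cannot choose well without knowing the per-item cost.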
…atedly computing k_2.

```
Benchmarking field summate incompr: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 10.7s.
field summate incompr   time:   [1.0707 s 1.0732 s 1.0761 s]
                        change: [-28.910% -28.377% -27.966%] (p = 0.00 < 0.05)
                        Performance has improved.
```
So instead of doing nothing like I am supposed to at the moment, I banged my head into this a few times. I was able to further improve the performance of the …

I think the main culprit here is https://github.com/rust-ndarray/ndarray/blob/566177b2e4d4556f841eebe78cafdfb1632235bf/src/zip/mod.rs#L878 which is used to determine if a …

If I use the slightly contrived version

```rust
cov_samples
    .axis_iter(Axis(1))
    .into_par_iter()
    .zip(z1.axis_iter(Axis(0)))
    .zip(z2.axis_iter(Axis(0)))
    .with_min_len(125)
```

… to get a …, but I feel both the implementation and the fine tuning are wrong. I am not sure how to best fix this, whether in Rayon (should it be possible to limit the number of jobs a …
rayon-rs/rayon#857 improves things considerably and I opened rust-ndarray/ndarray#1081 which would allow limiting the splitting depending on workload characteristics. Using only the patch to …

```rust
Zip::from(cov_samples.columns())
    .and(z1)
    .and(z2)
    .into_par_iter()
    .with_min_len(100)
    .fold(
        || Array2::<f64>::zeros(pos.dim()),
        |mut summed_modes, (cov_samples, z1, z2)| {
            let k_2 = cov_samples[0] / cov_samples.dot(&cov_samples);
            Zip::from(pos.columns())
                .and(summed_modes.columns_mut())
                .par_for_each(|pos, mut summed_modes| {
                    let phase = cov_samples.dot(&pos);
                    let z12 = z1 * phase.cos() + z2 * phase.sin();
                    Zip::from(&mut summed_modes)
                        .and(&e1)
                        .and(cov_samples)
                        .for_each(|sum, e1, cs| {
                            *sum += (*e1 - cs * k_2) * z12;
                        });
                });
            summed_modes
        },
    )
    .reduce_with(|mut lhs, rhs| {
        lhs += &rhs;
        lhs
    })
    .unwrap()
```

I get small but consistent speed-ups, e.g.
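The fold/reduce pattern used here, where each worker owns a private accumulator that is merged only once at the end instead of synchronizing every write, can be sketched in Python as an analogy (toy data and names, not the crate's code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for summing contributions of many modes: each row of
# `modes` is one mode's contribution to 16 positions.
rng = np.random.default_rng(0)
modes = rng.normal(size=(1000, 16))

def partial_sum(chunk):
    # fold step: each worker accumulates into its own private buffer,
    # so no synchronized writes are needed.
    acc = np.zeros(modes.shape[1])
    for row in chunk:
        acc += row
    return acc

chunks = np.array_split(modes, 4)  # one chunk of modes per worker
with ThreadPoolExecutor(max_workers=4) as ex:
    partials = list(ex.map(partial_sum, chunks))

# reduce step: merge the per-worker accumulators exactly once.
total = sum(partials)
```

The cost of this pattern is one accumulator buffer per worker, which is the "intermediate buffers" overhead mentioned earlier; the benefit is that the hot inner loop is entirely free of synchronization.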
Those are some really cool insights into Rayon!
Actually, I think this is more of an ecosystem "issue": Rayon makes it almost trivial to parallelize algorithms once libraries like …

Since I do not know how long it will take to land rust-ndarray/ndarray#1081 (if it lands at all), I think I will commit the "iterate along the first axis of a single axis" trick as a workaround for now. I think we should be in a position to merge this afterwards? The problems you looked up are as large as or larger than the benchmarks now, and all algorithms (except variograms, which are not yet index-free) would benefit from parallelisation.
This is a speed-up due to the call to `with_min_len` which however requires indexed parallel iterators which `ndarray` provides only via `axis_iter` and `axis_chunk_iter`, so this workaround is necessary until [1] is merged. [1] rust-ndarray/ndarray#1081
Just for old times' sake:

```diff
diff --git a/benches/gen_benchmark_inputs.py b/benches/gen_benchmark_inputs.py
index 142581d..8485d5f 100755
--- a/benches/gen_benchmark_inputs.py
+++ b/benches/gen_benchmark_inputs.py
@@ -8,8 +8,8 @@ from gstools.field.generator import RandMeth, IncomprRandMeth
 def gen_field_summate(path, seed):
-    pos_no = 100_000
-    mode_no = 1_000
+    pos_no = 4600 * 1800
+    mode_no = 6400
     x = np.linspace(0.0, 10.0, pos_no)
     y = np.linspace(-5.0, 5.0, pos_no)
     z = np.linspace(-6.0, 8.0, pos_no)
```
Considering my experience benchmarking the summator and krige functions so far, and having found two significantly different problem sizes in the benchmark input generation script, I wonder what typical problem sizes in the real world are.

This is important insofar as the best strategy for parallelizing the algorithms does depend on the problem sizes. For example, using the larger of the two problem sizes, not parallelizing the second loop level in the `summator_incompr` function and parallelizing it in the krige functions both improve performance, as indicated in the commit messages.

Using larger problem sizes does increase the benchmark runtime significantly, but using Criterion's `--save-baseline` and `--baseline` features, as well as reducing the `--sample-size` during iterative development, would make this feasible IMHO, if the larger problems are actually more realistic, that is.

Further, if problem size is expected to vary significantly, it might be preferable to always limit parallelization to the outermost loops as a conservative choice that should always yield acceptable performance.