
r.univar: Add parallel support #1634

Merged 23 commits into main on Aug 30, 2022

Conversation

@aaronsms (Contributor) commented Jun 11, 2021

This PR implements parallelization for all r.univar options except for when the "extended statistics" flag is set. That path involves dynamic allocation and sorting, which is trickier to parallelize, and is left as future work.
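For readers unfamiliar with the approach, here is a minimal sketch of the kind of OpenMP reduction being described; it is not the actual r.univar code. The flat cells array, the function name, and the parameters are stand-ins, and the real module additionally reads raster rows, skips NULL cells, and handles zones, all of which are omitted here.

/* Sketch only: per-cell statistics accumulated with OpenMP reductions. */
#include <stddef.h>
#include <float.h>

void accumulate(const double *cells, size_t n, int nprocs,
                double *sum, double *sumsq, double *min, double *max)
{
    double s = 0.0, sq = 0.0, mn = DBL_MAX, mx = -DBL_MAX;
    size_t i;

#pragma omp parallel for num_threads(nprocs) \
    reduction(+ : s, sq) reduction(min : mn) reduction(max : mx)
    for (i = 0; i < n; i++) {
        double v = cells[i];

        s += v;
        sq += v * v;
        if (v < mn)
            mn = v;
        if (v > mx)
            mx = v;
    }

    *sum = s;
    *sumsq = sq;
    *min = mn;
    *max = mx;
}

The min and max reduction operators shown here require OpenMP 3.1 or newer (GCC 4.7+).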

Checklists before merging:

  • code review
  • CI passes
  • performance section in documentation
  • confirm values in test are from the old version (run new tests with old code)
  • run tests without OpenMP (runs in CI)
  • visual check of results with custom data ("looks good" with non NC SPM dataset)
  • check that it works with really large data (16B cells)
  • run multi-core benchmark (no degraded performance with many threads)
  • run one-core benchmark on many resolutions or many cell counts
  • rebase to main
  • run valgrind

@wenzeslaus added the gsoc label (Reserved for Google Summer of Code student(s)) on Jun 12, 2021
@wenzeslaus (Member) left a comment:

Just some initial things:

Fails on Ubuntu 18.04 in CI, but not on 20.04. Too old a version of OpenMP?

2021-06-12T13:19:51.6016965Z r.univar_main.c: In function ‘process_raster_threaded’:
2021-06-12T13:19:51.6019743Z r.univar_main.c:500:11: error: ‘value_sz’ is predetermined ‘shared’ for ‘shared’
2021-06-12T13:19:51.6020739Z      shared(stats, fd, fdz, raster_row, zoneraster_row, n, sum, sumsq, sum_abs, min, max, size, region, \
2021-06-12T13:19:51.6021406Z            ^
2021-06-12T13:19:51.6022224Z r.univar_main.c:500:11: error: ‘map_type’ is predetermined ‘shared’ for ‘shared’
2021-06-12T13:19:51.6025019Z r.univar_main.c:500:11: error: ‘n_zones’ is predetermined ‘shared’ for ‘shared’
2021-06-12T13:19:51.6027487Z r.univar_main.c:500:11: error: ‘cols’ is predetermined ‘shared’ for ‘shared’
2021-06-12T13:19:51.6028607Z r.univar_main.c:500:11: error: ‘rows’ is predetermined ‘shared’ for ‘shared’

@@ -0,0 +1,59 @@
"""Benchmarking of r.univar
Member:

test_r_univar.py should not be deleted. This benchmark_r_univar.py file is extra.

Member:

Fails on my Linux machine too with the same message. What is the min version of GCC that this parallel code supports?


Member:

FYI, gcc 5.5.0

Member:

OK, all those variables are const and removing them from shared worked. I think consts are shared by default.
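For reference, a small standalone example of the failure mode, assuming GCC 5.x semantics; it is not taken from the r.univar sources. Older GCC follows the OpenMP rule that a const-qualified variable is predetermined shared and therefore may not be listed in an explicit shared() clause, which is exactly the "predetermined 'shared' for 'shared'" error in the CI log. Omitting such variables from the clause compiles on both old and new compilers, since they are shared either way.

#include <stdio.h>

int main(void)
{
    const int rows = 100;   /* const, so predetermined shared on older GCC */
    long total = 0;
    int i;

    /* GCC 5.x rejects listing "rows" explicitly:
     *   #pragma omp parallel for shared(rows) reduction(+ : total)
     * gives "'rows' is predetermined 'shared' for 'shared'".
     * Leaving it out of the clause works everywhere: */
#pragma omp parallel for reduction(+ : total)
    for (i = 0; i < rows; i++)
        total += i;

    printf("%ld\n", total);

    return 0;
}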

Contributor Author:

I think I must've accidentally deleted the file when separating the scripts into their respective directories; will fix it.

@@ -52,6 +57,14 @@ void set_params()
_("Percentile to calculate (requires extended statistics flag)");
param.percentile->guisection = _("Extended");

param.threads = G_define_option();
param.threads->key = "nprocs";
Member:

Should we add an "nprocs" standard option to the parser (like, e.g., G_OPT_M_NPROCS)?
nprocs is used in several Python scripts as well, so a standard option would give us a harmonized way to handle it...
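A sketch of how the suggested standard option could be wired in, assuming the G_OPT_M_NPROCS option proposed here and added in #1644; the function and variable names are placeholders, not the final r.univar code. Converting the answer with atoi() and falling back to one thread when GRASS is built without OpenMP keeps the module usable on non-OpenMP builds.

#include <stdlib.h>
#include <grass/gis.h>
#include <grass/glocale.h>
#if defined(_OPENMP)
#include <omp.h>
#endif

static struct Option *nprocs_opt;

void set_params(void)
{
    /* replaces the hand-rolled G_define_option() block in the diff above */
    nprocs_opt = G_define_standard_option(G_OPT_M_NPROCS);
}

int set_threads(void)
{
    int threads = atoi(nprocs_opt->answer);

#if defined(_OPENMP)
    omp_set_num_threads(threads);
#else
    if (threads > 1)
        G_warning(_("GRASS GIS was compiled without OpenMP support; "
                    "ignoring nprocs > 1"));
    threads = 1;
#endif

    return threads;
}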

Member:

@ninsbl That's a good idea.

Member:

Please check #1644.

Contributor Author:

Right, I'll modify the code to use the standard option.

@marisn (Contributor) left a comment:

My OpenMP knowledge is too weak to judge if this is the right approach :-(

(Resolved review comments on raster/r.univar/r.univar_main.c, now outdated.)
@HuidaeCho (Member) commented Jun 19, 2021

I have an i5-7300U with 2 cores and 4 threads on my laptop. What does this module do with nprocs > 4? The benchmark script runs fine with up to 12 threads, but I thought my CPU could only do up to 4 threads; I might be wrong. Interestingly, more threads didn't always mean faster. See my results below. Is it possible to determine the maximum number of threads in the code and print a warning if nprocs is greater than that? Also, please consider implementing a fallback terminal size of 80 for redirected benchmarking output; os.get_terminal_size() raises OSError: [Errno 25] Inappropriate ioctl for device on python3 benchmark_r_univar.py >& benchmark_r_univar.log.

r.univar map=elevation,elevation,elevation,elevation,elevation,elevation,elevation,elevation,elevation,elevation percentile=90.0 nprocs=1 separator=pipe -g

 1 thread(s): 0.1885438919067383 s
 2 thread(s): 0.1510328769683838 s
 3 thread(s): 0.18701410293579102 s
 4 thread(s): 0.2988687038421631 s
 5 thread(s): 0.14902639389038086 s
 6 thread(s): 0.16875758171081542 s
 7 thread(s): 0.16923255920410157 s
 8 thread(s): 0.20600790977478028 s
 9 thread(s): 0.1692127227783203 s
10 thread(s): 0.13888721466064452 s
11 thread(s): 0.16875271797180175 s
12 thread(s): 0.17763447761535645 s

r.univar map=elevation,elevation,elevation,elevation,elevation,elevation,elevation,elevation,elevation,elevation zones=basin_50K percentile=90.0 nprocs=1 separator=pipe -g

 1 thread(s): 0.27266292572021483 s
 2 thread(s): 0.17659420967102052 s
 3 thread(s): 0.2019331455230713 s
 4 thread(s): 0.42590484619140623 s
 5 thread(s): 0.2039196491241455 s
 6 thread(s): 0.1847921848297119 s
 7 thread(s): 0.1929023265838623 s
 8 thread(s): 0.22778925895690919 s
 9 thread(s): 0.3162210464477539 s
10 thread(s): 0.2332921504974365 s
11 thread(s): 0.2835871696472168 s
12 thread(s): 0.24240994453430176 s

@wenzeslaus (Member):
I tested on a 4-core/8-thread processor and made some additions to the benchmark script. Here is a test up to nprocs=16 with 10 runs and two additional scenarios. The data is still too small, I think, so that's still a todo.

[screenshot: benchmark plot, 2021-06-19 15-57-35]

@wenzeslaus (Member):
What does this module do with nprocs > 4? ... Is it possible to determine the maximum number of threads in the code and print a warning if nprocs is greater than that?

I don't think that needs a warning. You have to explicitly ask for n threads, and you likely know the number of cores/threads on the machine you are using, or a look into the process manager or the specs can tell you. So a warning is really not necessary, since what you asked for is most likely what you meant. Additionally, whether the result is an improvement or a degradation depends on the specific setup, so the warning could not claim it will be worse anyway.

Well, now, automatically detecting the number of cores with nprocs=auto or by default, that's a different story!

@HuidaeCho (Member) commented Jun 19, 2021

I don't think that needs a warning. You have to explicitly ask for n threads, and you likely know the number of cores/threads on the machine

I know I have 4 threads, but I don't know what it's doing with 12 threads when I only have 4. What does it even mean? It needs to be explained at least.

https://forum.openmp.org/viewtopic.php?t=209

@HuidaeCho (Member):

I tested on a 4-core/8-thread processor and made some additions to the benchmark script. Here is a test up to nprocs=16 with 10 runs and two additional scenarios. The data is still too small, I think, so that's still a todo.

[screenshot: benchmark plot, 2021-06-19 15-57-35]

That looks nice. Interesting that it's not happening on my machine.

@HuidaeCho (Member):

omp_get_num_threads()?
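A hypothetical sketch of the check being discussed, not code from this PR. Note that omp_get_num_threads() only reports the size of the current team (1 outside a parallel region), so omp_get_num_procs() is the call that actually reports the available logical processors; the function name check_thread_count and the plain fprintf warning are stand-ins for whatever the module would use.

#include <stdio.h>
#include <omp.h>

void check_thread_count(int requested)
{
    /* logical processors visible to the program */
    int available = omp_get_num_procs();

    if (requested > available)
        fprintf(stderr,
                "WARNING: %d threads requested, but only %d logical "
                "processors are available; extra threads will just "
                "time-share the cores\n", requested, available);

    omp_set_num_threads(requested);
}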

@petrasovaa (Contributor):

r.univar on a desktop with 28 cores, slope map, 16832104560 cells:
[benchmark plot]
Any idea what is going on?

@HuidaeCho (Member):

@aaronsms Please check this first module. That jump at 14 processes (half of the 28 cores) is interesting. @petrasovaa What is the name of the CPU? Does it have 28 cores with 56 threads, or 14 cores with 28 threads (usually 2 threads per core)?

@HuidaeCho (Member) commented Jul 30, 2021

OK, in the weekly meeting with @aaronsms, @petrasovaa confirmed that it has 14 physical cores and 28 threads. Maybe it's related to this fact. The only question is why it jumps at 14, not at 14+1=15, where it starts to fully occupy one core for the first time (?).

Not sure about this actually. Is it

  • core 1 thread 1 => core 1 thread 2 => core 2 thread 1 => core 2 thread 2 => ... (depth first?), or
  • core 1 thread 1 => core 2 thread 1 => core 3 thread 1 => ... => core 1 thread 2 => core 2 thread 2 => ... (breadth first?)

(A small OpenMP placement probe is sketched below.)
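Which of those two orders you actually get is decided by the OpenMP runtime and by the OMP_PLACES / OMP_PROC_BIND environment variables (a "close" binding tends to pack threads onto neighbouring hardware threads, "spread" distributes them across cores). The following stand-alone probe, which is not part of this PR, can show the mapping on a given machine; it needs OpenMP 4.5 (GCC 6 or newer), and with no binding policy set omp_get_place_num() simply returns -1 and the OS scheduler decides.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("places: %d\n", omp_get_num_places());

#pragma omp parallel
    {
        /* each thread reports which place (e.g. core) it landed on */
        printf("thread %d -> place %d\n",
               omp_get_thread_num(), omp_get_place_num());
    }

    return 0;
}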

@marisn (Contributor) previously requested changes, Aug 1, 2021:

I could not run the benchmark, as it ended with an OOM in r.surf.fractal with 100000000 cells in the region. Is there something that can be done to adapt to the RAM size of the machine instead of failing with OOM and thus losing all results?

(Resolved review comments on raster/r.univar/r.univar_main.c, now outdated.)
@HuidaeCho added the raster (Related to raster data processing) and enhancement (New feature or request) labels on Aug 1, 2021
@wenzeslaus (Member):

Is there something that can be done to adapt to the RAM size of the machine instead of failing with OOM and thus losing all results?

Unlike the tests, which should just run everywhere (although skipped in some cases), for benchmarks we don't have any notion of portability conceptualized. For example, the grass.benchmark package helps you write a benchmark, but it does not tell you how to write it (i.e., you can write a benchmark without using grass.benchmark and that's perfectly fine). So far, we have been adapting the benchmark scripts to the test we wanted to run. Suggestions welcome.

As for this particular case (OOM), do you envision the Python code doing some heuristics on the memory requirements of r.surf.fractal versus the size of your RAM and running the benchmarks accordingly?

@marisn (Contributor) commented Aug 2, 2021

As for this particular case (OOM), do you envision the Python code doing some heuristics on the memory requirements of r.surf.fractal versus the size of your RAM and running the benchmarks accordingly?

Heuristics would be good but might need too much work. Probably the easiest solution would be to add a try/except around the r.surf.fractal calls so the benchmark does not fail if an OOM situation is encountered.

@petrasovaa (Contributor):

r.univar on a desktop with 28 cores, slope map, 16832104560 cells:
[benchmark plot]
Any idea what is going on?

Repeated my benchmark with the latest code; this looks much better!
[benchmark plot]

@aaronsms (Contributor, Author) commented Aug 18, 2021

@petrasovaa Yes, I suspect the earlier problem was due to memory bandwidth limits or false sharing caused by cache inefficiencies. I made an effort to refactor so that threads are now less likely to share the same cache lines when accessing variables, which should give a higher cache hit rate. So I will no longer include the issue of performance degrading when the threads are overloaded. I believe we need to check this for other modules as well.
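To illustrate the false-sharing point in general terms (this is a generic sketch, not the refactored r.univar code; the names and sizes are made up): when per-thread accumulators sit next to each other in memory they share cache lines, and every update by one thread invalidates that line for the others. Padding each slot to a full cache line, or better, accumulating into a thread-local variable and writing shared memory only once at the end, avoids the ping-pong.

#include <omp.h>

#define MAX_THREADS 64          /* assumes at most 64 threads */
#define CACHE_LINE  64

/* Prone to false sharing: neighbouring slots share a cache line. */
double sums_packed[MAX_THREADS];

/* One mitigation: pad each per-thread slot to its own cache line. */
struct padded {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
};
static struct padded sums_padded[MAX_THREADS];

void accumulate(const double *cells, long n)
{
#pragma omp parallel
    {
        int t = omp_get_thread_num();
        double local = 0.0;     /* hot counter stays in a register */
        long i;

#pragma omp for
        for (i = 0; i < n; i++)
            local += cells[i];

        sums_padded[t].value = local;   /* touch shared memory once */
    }
}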

@aaronsms marked this pull request as ready for review on August 18, 2021 07:38
@wenzeslaus added this to the 8.2.0 milestone on Aug 24, 2021
@petrasovaa (Contributor):

With extended statistics (with nprocs=1) valgrind is getting mad:

==90673== Invalid write of size 2
==90673==    at 0x4842B33: memmove (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==90673==    by 0x10C225: process_raster._omp_fn.0 (r.univar_main.c:441)
==90673==    by 0x4A988E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==90673==    by 0x10B6F8: process_raster (r.univar_main.c:342)
==90673==    by 0x10B1FC: main (r.univar_main.c:240)
==90673==  Address 0xee42980 is 0 bytes after a block of size 36,000 alloc'd
==90673==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==90673==    by 0x4899F82: G__realloc (alloc.c:126)
==90673==    by 0x10C344: process_raster._omp_fn.0 (r.univar_main.c:431)
==90673==    by 0x4A988E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==90673==    by 0x10B6F8: process_raster (r.univar_main.c:342)
==90673==    by 0x10B1FC: main (r.univar_main.c:240)
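For readers decoding the valgrind output: "Invalid write of size 2 ... 0 bytes after a block" means a 2-byte store landed immediately past the end of a buffer that had just been grown with realloc. The following is a generic illustration of that failure pattern, not the actual r.univar code or its fix; the function, types, and names are invented.

#include <stdlib.h>
#include <string.h>

/* grow "buf" by "len" shorts and append "row" */
short *grow_and_append(short *buf, size_t *n, const short *row, size_t len)
{
    buf = realloc(buf, (*n + len) * sizeof(short));
    if (!buf)
        exit(EXIT_FAILURE);

    /* Overrun: copies one element more than was just allocated, so the
     * last 2-byte write lands right after the block, matching valgrind's
     * "Invalid write of size 2 ... 0 bytes after a block" report. */
    memmove(buf + *n, row, (len + 1) * sizeof(short));

    /* Correct version: memmove(buf + *n, row, len * sizeof(short)); */

    *n += len;
    return buf;
}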

@petrasovaa (Contributor):

The valgrind problem was fixed. When extended stats are requested, only a single thread is used.
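A hypothetical guard showing how such a fallback can look in a GRASS module; the names (nprocs, extended_requested) are stand-ins, and this is not necessarily the code that was merged.

#include <grass/gis.h>
#include <grass/glocale.h>

/* Return the thread count the module should actually use. */
static int effective_threads(int nprocs, int extended_requested)
{
    if (extended_requested && nprocs > 1) {
        G_verbose_message(_("Extended statistics are not parallelized; "
                            "falling back to a single thread"));
        return 1;
    }
    return nprocs;
}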

@petrasovaa merged commit 7a51911 into OSGeo:main on Aug 30, 2022
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Oct 26, 2022
Co-authored-by: Anna Petrasova <kratochanna@gmail.com>
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Feb 17, 2023
Co-authored-by: Anna Petrasova <kratochanna@gmail.com>
neteler pushed a commit to nilason/grass that referenced this pull request Nov 7, 2023
Co-authored-by: Anna Petrasova <kratochanna@gmail.com>