You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Discovered by Axiom in the process of porting Scroll/Geometry LogUp (Multivariate Lookups) to their own lib and kindly pointed out to us when we're preparing to upstream the work.
The MV Lookup PR assumes that parallelize chunks the iteration domain into same sized chunks.
This has been removed in #186 to fix load imbalance:
Currently if we take 40 items divided into 12 threads (AMD Ryzen 7800X, Apple M2 Pro or Intel i5-12600) the partitioning will lead to 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 7 = 3*11 + 7 = 40. The “remainder” thread will have 2.33x more work to do.
This can be quite extreme for example if we have 351 items to split on 32 cores, 351/32 = 10.96, rounded to 10 with integer division. 10*31 = 310, the last core needs to process 41 items, 4.1x more than the others.
Instead we need a load-balanced prefix sum algorithm / parallel scan.
This is a common parallelization problem due to being well constrained but difficult enough with several approaches to teach CS student about parallelization algorithms and also cache concerns.
Also, unsure yet of how MV-lookups are implemented but it seems like currently the code builds a big table and do a prefix sum once everything is there. An alternative, streaming, approach could be to use Fenwick Trees so the prefix sum is updated on-the-fly. Note that this only make sense if the data isn't available at once but cumulated over many parts.
The text was updated successfully, but these errors were encountered:
So it can be used as a stop-gap (but then again we can re-add the old parallelize as a stopgap as well) but if this is a bottleneck, we likely might still want an optimized implementation.
Discovered by Axiom in the process of porting Scroll/Geometry LogUp (Multivariate Lookups) to their own lib and kindly pointed out to us when we're preparing to upstream the work.
Scroll+Geometry PRs:
Axiom / @jonathanpwang comment:
The MV Lookup PR assumes that
parallelize
chunks the iteration domain into same sized chunks.This has been removed in #186 to fix load imbalance:
Instead we need a load-balanced prefix sum algorithm / parallel scan.
This is a common parallelization problem due to being well constrained but difficult enough with several approaches to teach CS student about parallelization algorithms and also cache concerns.
There is an excellent overview here on 3 different algorithms https://github.com/matteomazza91/parallel-prefix-sum/blob/master/report.pdf
An another:
And High-Performance Computing courses:
And GPU techniques (which are quite relevant now that CPUs have 100+ cores)
Also, unsure yet of how MV-lookups are implemented but it seems like currently the code builds a big table and do a prefix sum once everything is there. An alternative, streaming, approach could be to use Fenwick Trees so the prefix sum is updated on-the-fly. Note that this only make sense if the data isn't available at once but cumulated over many parts.
The text was updated successfully, but these errors were encountered: