
Add polynomial benchmark infra, switch poly eval to horners methods #114

Merged: 13 commits merged into master from bench_dense_poly on Dec 7, 2020

Conversation

@ValarDragon (Member) commented Dec 6, 2020

Description

This adds infrastructure for benchmarking polynomial operations across polynomial degrees. It also switches polynomial evaluation to use Horner's method, which both makes evaluation take constant memory and yields a 2x speed improvement on my laptop. Benchmark results for serial polynomial evaluation with Horner's method (outlier messages trimmed):

"bls12_381" - evaluate_polynomial/32768
                        time:   [962.88 us 965.60 us 968.56 us]
                        change: [-52.089% -50.962% -50.033%] (p = 0.00 < 0.05)
                        Performance has improved.

 "bls12_381" - evaluate_polynomial/65536
                        time:   [1.9306 ms 1.9472 ms 1.9699 ms]
                        change: [-55.475% -55.072% -54.638%] (p = 0.00 < 0.05)
                        Performance has improved.

"bls12_381" - evaluate_polynomial/131072
                        time:   [3.8705 ms 3.8858 ms 3.9019 ms]
                        change: [-60.158% -59.669% -59.202%] (p = 0.00 < 0.05)
                        Performance has improved.
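
For reference, Horner's evaluation loop amounts to the fold below. This is a minimal sketch over a generic ark_ff::Field with a hypothetical horner_evaluate helper, not the PR's actual code; coefficients are stored lowest degree first, matching DensePolynomial's layout.

```rust
use ark_ff::Field;

// Minimal sketch of Horner's method over a generic field (illustration only,
// not the exact code in this PR). coeffs[i] is the coefficient of x^i.
fn horner_evaluate<F: Field>(coeffs: &[F], point: F) -> F {
    // Fold from the highest-degree coefficient downwards:
    //   ((c_n * x + c_{n-1}) * x + ...) * x + c_0
    // This uses O(1) extra memory and one mul + one add per coefficient.
    coeffs
        .iter()
        .rfold(F::zero(), |acc, &coeff| acc * point + coeff)
}
```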

Additionally, this PR implements fully parallelized single-point evaluation.
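
For illustration, one way to parallelize single-point evaluation along these lines is to chunk the coefficients, run Horner's method on each chunk in parallel, and recombine each partial result with the power of the point at which its chunk starts. The sketch below uses rayon's par_chunks directly and a hypothetical parallel_evaluate helper; it is not the PR's exact code, which goes through ark-std's cfg_ helpers so the serial build still compiles.

```rust
use ark_ff::Field;
use rayon::prelude::*;

// Illustrative parallel single-point evaluation (a sketch, not the PR's exact
// code): split the coefficients into chunks, run Horner's method on each chunk
// in parallel, then recombine as
//     sum_j Horner(chunk_j, x) * x^(j * chunk_size).
// `chunk_size` must be nonzero; rayon's par_chunks panics otherwise.
fn parallel_evaluate<F: Field>(coeffs: &[F], point: F, chunk_size: usize) -> F {
    coeffs
        .par_chunks(chunk_size)
        .enumerate()
        .map(|(i, chunk)| {
            // Horner within the chunk (coefficients are lowest degree first).
            let partial = chunk
                .iter()
                .rfold(F::zero(), |acc, &coeff| acc * point + coeff);
            // Shift by the power of x at which this chunk starts.
            partial * point.pow([(i * chunk_size) as u64])
        })
        .reduce(F::zero, |a, b| a + b)
}
```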

closes: #85

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (main)
  • Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
  • Wrote unit tests - existing tests cover correctness
  • Updated relevant documentation in the code
  • Added a relevant changelog entry to the Pending section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer

@ValarDragon (Member, Author) commented Dec 6, 2020

Ah, I didn't realize cfg_into_iter handled parallelization; I forgot to test with the parallelization feature. I'll add parallelization in.

@ValarDragon (Member, Author):

The parallel implementation is an even bigger speedup! On my 16-core laptop, parallel Horner's method is a 5x speedup over the current parallel method.

"bls12_381" - evaluate_polynomial/32768
                        time:   [202.18 us 204.50 us 207.14 us]
                        change: [-79.199% -78.846% -78.385%] (p = 0.00 < 0.05)
                        Performance has improved.
"bls12_381" - evaluate_polynomial/65536
                        time:   [381.50 us 386.26 us 391.48 us]
                        change: [-80.146% -79.783% -79.437%] (p = 0.00 < 0.05)
                        Performance has improved.
 "bls12_381" - evaluate_polynomial/131072
                        time:   [738.91 us 746.82 us 755.92 us]
                        change: [-80.641% -80.355% -79.991%] (p = 0.00 < 0.05)
                        Performance has improved

@ValarDragon requested a review from Pratyush on December 7, 2020 at 00:42
@ValarDragon (Member, Author) commented Dec 7, 2020

Looking into why the parallel speedup was so high, it turns out that the prior polynomial.evaluate was only half parallelized. The computation of all powers {x^i} was done sequentially. (Only the multiplication by the coefficients and the summation were parallelized before.)
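
For context, the half-parallelized shape described above looks roughly like this. This is a sketch of the pattern with a hypothetical old_style_evaluate helper, not the literal prior code.

```rust
use ark_ff::Field;
use rayon::prelude::*;

// Rough sketch of the previous shape (not the literal prior code): the powers
// of x form a serial dependency chain and are computed sequentially, while
// only the coefficient multiplications and the summation run in parallel.
// Note the length-n `powers` vector, which is the allocation issue #85 asks
// to remove.
fn old_style_evaluate<F: Field>(coeffs: &[F], point: F) -> F {
    // Sequential: powers[i] = x^i.
    let mut powers = Vec::with_capacity(coeffs.len());
    let mut cur = F::one();
    for _ in 0..coeffs.len() {
        powers.push(cur);
        cur *= point;
    }
    // Parallel: multiply each coefficient by its power of x and sum.
    coeffs
        .par_iter()
        .zip(powers.par_iter())
        .map(|(c, p)| *c * *p)
        .reduce(F::zero, |a, b| a + b)
}
```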

@ValarDragon (Member, Author):

Locally, using par_chunks resulted in a slowdown, but that could be due to system noise. I'll check on the benchmark server.

@ValarDragon (Member, Author) commented Dec 7, 2020

On the benchmark server there was essentially no speed difference, so the par_chunks impl is good to use.

@Pratyush (Member) commented Dec 7, 2020

Awesome, thanks! Final point: should we extract the common core of both versions of the internal_evaluate algorithms into a separate method?

@ValarDragon (Member, Author):

Sure

@Pratyush (Member) commented Dec 7, 2020

Oh and finally, do you think if we use cfg_par_chunks we can unify the two versions? (if you'd like to keep it separate for clarity that's fine too)

@ValarDragon (Member, Author):

We'd have to add a cfg_par_chunks!(vector, min_num_elements_per_thread) to ark-std, but if we did that we probably could unify them. (Same for batch_inversion).

I'd prefer to merge this as is, and then update usages here after such an update to ark-std
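
For illustration only, such a macro could mirror the cfg_into_iter! pattern mentioned above. The sketch below is hypothetical (cfg_par_chunks! does not exist in ark-std at this point), and it takes a plain chunk size rather than the min_num_elements_per_thread parameter suggested above, which would additionally require deriving the chunk size from the number of threads.

```rust
// Hypothetical sketch of a cfg_par_chunks! macro for ark-std, following the
// same shape as the existing cfg_ macros: with the "parallel" feature it
// expands to rayon's par_chunks, otherwise to the standard library's chunks
// iterator.
#[macro_export]
macro_rules! cfg_par_chunks {
    ($slice:expr, $chunk_size:expr) => {{
        #[cfg(feature = "parallel")]
        let chunks = $slice.par_chunks($chunk_size);

        #[cfg(not(feature = "parallel"))]
        let chunks = $slice.chunks($chunk_size);

        chunks
    }};
}
```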

@ValarDragon merged commit 70ebfa6 into master on Dec 7, 2020
@ValarDragon deleted the bench_dense_poly branch on December 7, 2020 at 21:25
Merging this pull request closes the linked issue: Make polynomial evaluation only take constant memory (#85)