Unroll loops and use in-place ops for faster `ff` and `ec` arithmetic #199

Pratyush · 2021-02-04T18:30:37Z

Description

BigInteger: unroll loops in add_nocarry, sub_noborrow, mul2, div2, and cmp
SW & TE: use in-place ops in mixed addition and doubling
Extension fields: use in-place ops in mul and square

Overall, these changes provide a ~10% speedup to the relevant ops.
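As a rough illustration of the BigInteger change, here is a minimal sketch of a loop-based versus manually unrolled limb addition. The `BigInt4` type, the `adc` helper, and both method names are hypothetical stand-ins chosen for this example; they are not the code in this PR.

```rust
/// Hypothetical 4-limb (256-bit) big integer, little-endian limbs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BigInt4(pub [u64; 4]);

/// Add with carry: returns the low 64 bits and stores the carry-out (0 or 1).
#[inline(always)]
fn adc(a: u64, b: u64, carry: &mut u64) -> u64 {
    let tmp = (a as u128) + (b as u128) + (*carry as u128);
    *carry = (tmp >> 64) as u64;
    tmp as u64
}

impl BigInt4 {
    /// Loop version: the compiler may or may not unroll this on its own.
    pub fn add_nocarry_loop(&mut self, other: &Self) -> bool {
        let mut carry = 0;
        for (a, b) in self.0.iter_mut().zip(other.0.iter()) {
            *a = adc(*a, *b, &mut carry);
        }
        carry != 0
    }

    /// Manually unrolled version: one `adc` per limb, written out explicitly.
    pub fn add_nocarry_unrolled(&mut self, other: &Self) -> bool {
        let mut carry = 0;
        self.0[0] = adc(self.0[0], other.0[0], &mut carry);
        self.0[1] = adc(self.0[1], other.0[1], &mut carry);
        self.0[2] = adc(self.0[2], other.0[2], &mut carry);
        self.0[3] = adc(self.0[3], other.0[3], &mut carry);
        carry != 0
    }
}

fn main() {
    let b = BigInt4([1, 0, 0, 0]);
    let mut x = BigInt4([u64::MAX, 0, 0, 0]);
    let mut y = x;
    let carry_loop = x.add_nocarry_loop(&b);
    let carry_unrolled = y.add_nocarry_unrolled(&b);
    // Both versions propagate the carry from limb 0 into limb 1.
    assert_eq!(x, BigInt4([0, 1, 0, 0]));
    assert_eq!(x, y);
    assert!(!carry_loop && !carry_unrolled);
}
```

Whether the unrolled form is actually faster depends on the compiler and target, which is why the discussion below relies on benchmarks.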

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

Targeted PR against correct branch (master)
Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
Wrote unit tests
Updated relevant documentation in the code
Added a relevant changelog entry to the Pending section in CHANGELOG.md
Re-reviewed Files changed in the Github PR explorer

ValarDragon · 2021-02-04T20:22:23Z

LGTM. For the extension fields, inverse_in_place could have this optimization applied to it, if the inverse logic is moved to inverse_in_place.
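One way to read this suggestion, sketched on a toy prime field rather than the extension fields in this repository: the inversion logic lives in `inverse_in_place`, and `inverse` just copies and delegates. The `Fp` type, the modulus 101, and the Fermat-based inversion are assumptions made for this example only.

```rust
/// Toy prime modulus (an assumption for this sketch).
const P: u64 = 101;

/// Hypothetical field element, always stored reduced modulo `P`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Fp(u64);

impl Fp {
    pub fn new(v: u64) -> Self {
        Fp(v % P)
    }

    /// All inversion logic lives here and mutates `self` directly.
    /// Uses Fermat's little theorem: a^(P-2) = a^(-1) mod P for a != 0.
    pub fn inverse_in_place(&mut self) -> Option<&mut Self> {
        if self.0 == 0 {
            return None;
        }
        let mut base = self.0;
        let mut exp = P - 2;
        let mut acc = 1u64;
        while exp > 0 {
            if exp & 1 == 1 {
                acc = acc * base % P;
            }
            base = base * base % P;
            exp >>= 1;
        }
        self.0 = acc;
        Some(self)
    }

    /// The non-assigning version makes one copy and delegates.
    pub fn inverse(&self) -> Option<Self> {
        let mut tmp = *self;
        tmp.inverse_in_place()?;
        Some(tmp)
    }
}

fn main() {
    let two = Fp::new(2);
    // 2 * 51 = 102 = 1 (mod 101)
    assert_eq!(two.inverse(), Some(Fp::new(51)));
    assert_eq!(Fp::new(0).inverse(), None);
}
```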

Pratyush · 2021-02-04T22:23:28Z

This is ready for review.

ValarDragon · 2021-02-05T00:02:36Z

ec/src/models/short_weierstrass_jacobian.rs

@@ -260,7 +265,6 @@ impl<P: Parameters> Default for GroupAffine<P> {
 #[derivative(
    Copy(bound = "P: Parameters"),
    Clone(bound = "P: Parameters"),
-    Eq(bound = "P: Parameters"),


Why did this get removed?

This got moved to a proper impl, because you shouldn't have a manual impl of PartialEq and a derived impl of Eq.
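A minimal sketch of the pattern described here, pairing a hand-written `PartialEq` with an explicit (empty) `Eq` impl that carries the same bound. The `AffinePoint` type, its fields, and the equality rule are hypothetical and only stand in for the real `GroupAffine`.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for the curve `Parameters` trait.
pub trait Parameters {}

/// Hypothetical affine point; coordinates are plain integers for brevity.
pub struct AffinePoint<P: Parameters> {
    pub x: u64,
    pub y: u64,
    pub infinity: bool,
    _params: PhantomData<P>,
}

impl<P: Parameters> AffinePoint<P> {
    pub fn new(x: u64, y: u64, infinity: bool) -> Self {
        AffinePoint { x, y, infinity, _params: PhantomData }
    }
}

/// Manual equality: points at infinity compare equal regardless of coordinates.
impl<P: Parameters> PartialEq for AffinePoint<P> {
    fn eq(&self, other: &Self) -> bool {
        if self.infinity || other.infinity {
            return self.infinity && other.infinity;
        }
        self.x == other.x && self.y == other.y
    }
}

/// Explicit `Eq` impl next to the manual `PartialEq`, instead of a derive.
impl<P: Parameters> Eq for AffinePoint<P> {}

/// Dummy parameter type used only by this example.
pub struct Dummy;
impl Parameters for Dummy {}

fn main() {
    let a = AffinePoint::<Dummy>::new(0, 1, true);
    let b = AffinePoint::<Dummy>::new(7, 9, true);
    assert!(a == b);
}
```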

jon-chuang · 2021-02-05T00:56:02Z

Hi @Pratyush, thanks for this quick follow-up.

I benchmarked this PR (name: assign) against applying only the unroll to the big-integer ops (just_unroll); here are the results:

Thanks for catching cmp as being a helpful change. This improved things further over my original changes.

However, I question the changes to the semantics of the non-assigning versions of the ops, which are changed to mutate the underlying variable. From an API standpoint, I think this is confusing; furthermore, it achieves nothing over just `x.op_assign(); let y = x;`. At most, I would remove .clone(), as was done for modulus. Further, it appears, according to the benchmark, that changing the formulas to use only assigning versions of the ops has little or possibly even a negative effect.

@Pratyush Would you mind giving me partial access to this repo so I could push commits to a branch on the repo? It would simplify things for me so I wouldn't have to keep switching my git remote to target different urls.

Pratyush · 2021-02-05T01:11:10Z

> However, I question the changes to the semantics of the non-assigning versions of the ops. At the most, I would remove .clone(), as was done for modulus.

The semantics didn't change; I only removed unnecessary copies. In particular, the non-assigning versions take self by value, not by reference.

> Further, it appears, according to the benchmark, that changing the formulas to use only assigning versions of the ops has little or possibly even negative effect.

That's surprising; I definitely benchmarked the assign versions against the old versions, and found a non-negligible difference.

> @Pratyush Would you mind giving me partial access to this repo so I could push commits to a branch on the repo? It would simplify things for me so I wouldn't have to keep switching my git remote to target different urls.

Let me figure that out. In the meantime, a simpler way to handle it would be to just change the patch location.

jon-chuang · 2021-02-05T01:28:10Z

> Let me figure that out. In the meantime, a simpler way to handle it would be to just change the patch location.

Hmm, actually this is orthogonal to that issue (since deps target master, and one still has to make changes to target branches). I'm just thinking in terms of friction in PR workflow. It's a small thing.

Wrt the cross-dependency issue, I think patch doesn't fix the problem. I'll investigate if there is a good way to fix it.

> That's surprising, I definitely benchmarked the assign versions against the old versions, and found a non-negligible difference.

Are you saying that the assign changes without the biginteger unroll changes produced an improvement?

> The semantics didn't change; I only removed unnecessary copies. In particular, the non-assigning versions take self by value, not by reference.

Could you clarify this? I meant that the semantics have changed due to mutating the underlying variable now.

To clarify:

> I question the changes to the semantics of the non-assigning versions of the ops, which are changed to mutate the underlying variable. From an API standpoint, I think this is confusing; furthermore, it achieves nothing over just `x.op_assign(); let y = x;`

My point essentially is that the old non-assigning APIs don't have to be changed, even if one were to convert all group formulas to use assigning versions.

Orthogonally, those conversions don't appear to help, at least according to the benchmarks I performed.

Pratyush · 2021-02-05T01:30:41Z

> Hmm actually this is orthogonal to that issue (since deps target master, and one still has to make changes to target branches). I'm just thinking in terms of friction in PR workflow. It's a small thing.

Ah I just point the patch to a local clone of the relevant repos, so that I don't have to change branches and such. That way I can switch out the paths easily.

> Are you saying that the assign changes without the biginteger unroll changes produced an improvement?

Yes

> Could you clarify this? I meant that the semantics have changed due to mutating the underlying variable now.

The semantics to users are the same:

use std::ops::Add;

let a: u64 = 2;
let b: u64 = 3;
let c = a.add(&b); // c = 5
println!("{}", a); // will print 2
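To spell out why taking `self` by value keeps the caller-visible semantics, here is a minimal sketch on a hypothetical `Copy` field type (`Fp` and the modulus 101 are assumptions for this example, not the library's types). The non-assigning `add` mutates its own copy of `self` through `add_assign`, so the caller's `a` is untouched.

```rust
use std::ops::{Add, AddAssign};

/// Toy prime modulus (an assumption for this sketch).
const P: u64 = 101;

/// Hypothetical `Copy` field element.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Fp(pub u64);

impl<'a> AddAssign<&'a Fp> for Fp {
    fn add_assign(&mut self, rhs: &'a Fp) {
        self.0 = (self.0 + rhs.0) % P;
    }
}

impl<'a> Add<&'a Fp> for Fp {
    type Output = Fp;

    // `mut self` is the callee's own copy of the operand: mutating it in
    // place avoids an extra temporary without affecting the caller.
    fn add(mut self, rhs: &'a Fp) -> Fp {
        self += rhs;
        self
    }
}

fn main() {
    let a = Fp(2);
    let b = Fp(3);
    let c = a + &b;
    assert_eq!(c, Fp(5));
    // `a` is `Copy`, so it is still usable and still 2.
    assert_eq!(a, Fp(2));
    println!("{:?}", a); // prints Fp(2)
}
```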

ValarDragon · 2021-02-05T01:33:40Z

Here is my patch in curves, which works quite nicely for me:

[patch.'https://github.com/arkworks-rs/algebra']
ark-ff = { path = '../algebra/ff' }
ark-ec = { path = '../algebra/ec' }
ark-serialize = { path = '../algebra/serialize' }

ValarDragon · 2021-02-05T01:35:56Z

Also I re-benchmarked this, and confirmed that I only saw operation times go down for all operations benchmarked. (Between 2 and 10% reduction depending on the operation)

ValarDragon

LGTM sans comment requests. Thanks for updating this!

ValarDragon · 2021-02-05T02:56:05Z

wait what happened in this force push

Pratyush · 2021-02-05T02:57:50Z

Sorry, I just rebased on master. I will also clean up the history soon, to separate the commits that change just the unrolling from the commits that additionally use in_place ops.

(Don't worry, I added your comment requests =P)

Pratyush · 2021-02-05T04:22:00Z

Ok @jon-chuang @ValarDragon, it seems like the in-place ops don't actually get you a big benefit; I can't reproduce whatever I had earlier (which might well have been wrong). In light of that, I think it makes sense to keep only the first three commits (loop unrolling and reducing copies), as the in-place ops make the code more difficult to read.

Fields ops

 name                                    only_unroll ns/iter  in_place_ops ns/iter  diff ns/iter   diff %  speedup 
 bls12_381::fq12::add_assign             144                  142                             -2   -1.39%   x 1.01 
 bls12_381::fq12::deser                  835                  859                             24    2.87%   x 0.97 
 bls12_381::fq12::deser_unchecked        835                  859                             24    2.87%   x 0.97 
 bls12_381::fq12::double                 123                  128                              5    4.07%   x 0.96 
 bls12_381::fq12::inverse                17,497               18,103                         606    3.46%   x 0.97 
 bls12_381::fq12::mul_assign             3,960                4,039                           79    1.99%   x 0.98 
 bls12_381::fq12::negate                 141                  137                             -4   -2.84%   x 1.03 
 bls12_381::fq12::ser                    506                  504                             -2   -0.40%   x 1.00 
 bls12_381::fq12::ser_unchecked          520                  509                            -11   -2.12%   x 1.02 
 bls12_381::fq12::square                 2,788                2,744                          -44   -1.58%   x 1.02 
 bls12_381::fq12::sub_assign             146                  140                             -6   -4.11%   x 1.04 
 bls12_381::fq2::add_assign              13                   13                               0    0.00%   x 1.00 
 bls12_381::fq2::deser                   127                  127                              0    0.00%   x 1.00 
 bls12_381::fq2::deser_unchecked         126                  124                             -2   -1.59%   x 1.02 
 bls12_381::fq2::double                  12                   13                               1    8.33%   x 0.92 
 bls12_381::fq2::inverse                 11,427               11,210                        -217   -1.90%   x 1.02 
 bls12_381::fq2::mul_assign              132                  131                             -1   -0.76%   x 1.01 
 bls12_381::fq2::negate                  15                   13                              -2  -13.33%   x 1.15 
 bls12_381::fq2::ser                     87                   84                              -3   -3.45%   x 1.04 
 bls12_381::fq2::ser_unchecked           85                   84                              -1   -1.18%   x 1.01 
 bls12_381::fq2::sqrt                    78,645               77,164                      -1,481   -1.88%   x 1.02 
 bls12_381::fq2::square                  105                  107                              2    1.90%   x 0.98 
 bls12_381::fq2::sub_assign              15                   15                               0    0.00%   x 1.00 
 bls12_381::fq::add_assign               8                    8                                0    0.00%   x 1.00 
 bls12_381::fq::deser                    61                   61                               0    0.00%   x 1.00 
 bls12_381::fq::deser_unchecked          61                   61                               0    0.00%   x 1.00 
 bls12_381::fq::double                   6                    6                                0    0.00%   x 1.00 
 bls12_381::fq::from_repr                40                   39                              -1   -2.50%   x 1.03 
 bls12_381::fq::into_repr                29                   29                               0    0.00%   x 1.00 
 bls12_381::fq::inverse                  10,880               10,880                           0    0.00%   x 1.00 
 bls12_381::fq::mul_assign               37                   37                               0    0.00%   x 1.00 
 bls12_381::fq::negate                   7                    7                                0    0.00%   x 1.00 
 bls12_381::fq::repr_add_nocarry         5                    5                                0    0.00%   x 1.00 
 bls12_381::fq::repr_div2                2                    2                                0    0.00%   x 1.00 
 bls12_381::fq::repr_mul2                2                    2                                0    0.00%   x 1.00 
 bls12_381::fq::repr_num_bits            2                    2                                0    0.00%   x 1.00 
 bls12_381::fq::repr_sub_noborrow        2                    3                                1   50.00%   x 0.67 
 bls12_381::fq::ser                      43                   43                               0    0.00%   x 1.00 
 bls12_381::fq::ser_unchecked            43                   43                               0    0.00%   x 1.00 
 bls12_381::fq::sqrt                     19,079               19,263                         184    0.96%   x 0.99 
 bls12_381::fq::square                   37                   37                               0    0.00%   x 1.00 
 bls12_381::fq::sub_assign               9                    8                               -1  -11.11%   x 1.12 
 ed_on_bls12_381::fq::add_assign         4                    4                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::deser              30                   31                               1    3.33%   x 0.97 
 ed_on_bls12_381::fq::deser_unchecked    30                   31                               1    3.33%   x 0.97 
 ed_on_bls12_381::fq::double             4                    4                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::from_repr          24                   24                               0    0.00%   x 1.00 
 ed_on_bls12_381::fq::into_repr          4                    4                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::inverse            5,405                5,098                         -307   -5.68%   x 1.06 
 ed_on_bls12_381::fq::mul_assign         21                   21                               0    0.00%   x 1.00 
 ed_on_bls12_381::fq::negate             5                    5                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::repr_add_nocarry   4                    4                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::repr_div2          2                    2                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::repr_mul2          2                    3                                1   50.00%   x 0.67 
 ed_on_bls12_381::fq::repr_num_bits      2                    3                                1   50.00%   x 0.67 
 ed_on_bls12_381::fq::repr_sub_noborrow  2                    2                                0    0.00%   x 1.00 
 ed_on_bls12_381::fq::ser                18                   18                               0    0.00%   x 1.00 
 ed_on_bls12_381::fq::ser_unchecked      19                   18                              -1   -5.26%   x 1.06 
 ed_on_bls12_381::fq::sqrt               12,672               12,640                         -32   -0.25%   x 1.00 
 ed_on_bls12_381::fq::square             21                   20                              -1   -4.76%   x 1.05 
 ed_on_bls12_381::fq::sub_assign         6                    5                               -1  -16.67%   x 1.20

Group ops

 name                             only_unroll_g1 ns/iter  in_place_ops_g1 ns/iter  diff ns/iter  diff %  speedup 
 bls12_381::g1::add_assign        655                     661                                 6   0.92%   x 0.99 
 bls12_381::g1::add_assign_mixed  486                     481                                -5  -1.03%   x 1.01 
 bls12_381::g1::deser             148,522                 149,175                           653   0.44%   x 1.00 
 bls12_381::g1::deser_unchecked   131                     132                                 1   0.76%   x 0.99 
 bls12_381::g1::double            333                     343                                10   3.00%   x 0.97 
 bls12_381::g1::msm_131072        1,342,187,320           1,334,606,002              -7,581,318  -0.56%   x 1.01 
 bls12_381::g1::mul_assign        166,342                 175,056                         8,714   5.24%   x 0.95 
 bls12_381::g1::rand              103,346                 105,790                         2,444   2.36%   x 0.98 
 bls12_381::g1::ser               105                     108                                 3   2.86%   x 0.97 
 bls12_381::g1::ser_unchecked     84                      86                                  2   2.38%   x 0.98
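
For reading the tables above: the derived columns appear to follow diff = new - old, diff % = diff / old, and speedup = old / new, so a speedup below 1.00 means the in-place variant was slower. The helper below is a hypothetical sketch of that arithmetic, not the tool that produced the tables.

```rust
/// Returns (diff in ns, diff in percent, speedup factor) for two timings.
pub fn compare(old_ns: f64, new_ns: f64) -> (f64, f64, f64) {
    let diff = new_ns - old_ns;
    (diff, diff / old_ns * 100.0, old_ns / new_ns)
}

fn main() {
    // bls12_381::fq12::add_assign: 144 ns (only_unroll) vs 142 ns (in_place_ops)
    let (diff, pct, speedup) = compare(144.0, 142.0);
    println!("{:.0} {:.2}% x {:.2}", diff, pct, speedup); // prints "-2 -1.39% x 1.01"
}
```

Note that for operations measured at 2 to 3 ns, a 1 ns change shows up as 50%, so those rows are dominated by measurement granularity.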

Pratyush mentioned this pull request Feb 4, 2021

Investigate ways to improve basic primitive speed #198

Open

ValarDragon reviewed Feb 4, 2021

ec/src/models/short_weierstrass_jacobian.rs

ValarDragon mentioned this pull request Feb 4, 2021

Benchmark elliptic curve double_in_place_neg #200

Open

ValarDragon reviewed Feb 4, 2021

ff/src/fields/macros.rs

ValarDragon reviewed Feb 4, 2021

ff/src/fields/models/cubic_extension.rs

Pratyush force-pushed the faster-arithmetic branch from 6ead802 to c1c52d8 Compare February 4, 2021 20:34

ValarDragon reviewed Feb 5, 2021

ValarDragon approved these changes Feb 5, 2021

Pratyush mentioned this pull request Feb 5, 2021

Use intrinsics for bigint ops. #202

Closed

6 tasks

Pratyush force-pushed the faster-arithmetic branch from 68eb904 to f86bb91 Compare February 5, 2021 02:53

Pratyush added 8 commits February 4, 2021 18:59

Unroll biginteger loops

6499e97

Reduce field arithmetic copies

e62bb64

Reduce ec arithmetic copies

41eb72f

in-place ops in quadratic extension

b7c8a5c

in-place ops in cubic extension

cd1ff8b

in-place ops in sw

6693716

in-place ops in te

e40ea60

Update CHANGELOG.md

63a56ef

Pratyush force-pushed the faster-arithmetic branch from f86bb91 to 63a56ef Compare February 5, 2021 03:07

jon-chuang mentioned this pull request Feb 5, 2021

Improve primitive biginteger operations #204

Closed

6 tasks

Pratyush mentioned this pull request Feb 5, 2021

Unroll biginteger loops and reduce copies #205

Merged

6 tasks

Pratyush closed this Feb 5, 2021

Pratyush mentioned this pull request Sep 12, 2022

Field and curve optimizations #475

Merged

6 tasks

Pratyush deleted the faster-arithmetic branch October 26, 2022 21:00
