Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uint: faster mul by u64 #125

Merged
merged 16 commits into from
May 7, 2019
Merged

uint: faster mul by u64 #125

merged 16 commits into from
May 7, 2019

Conversation

ordian
Copy link
Member

@ordian ordian commented Apr 22, 2019

on top of #123 (last three commits)
changes:

  • implements multiplication by u64 (needed for a faster division)
  • removes multiplication by u32
  • removes pub free-standing functions (technically this is a breaking change, but they're not supposed to be used by a user with #[doc(hidden)])
  • adds specialization for multiplication by primitive number types up to u64 (if the type is not specified, it's inferred as i32)

U512 multiplication by primitive type is 2-5x faster.

@ordian ordian mentioned this pull request Apr 22, 2019
@@ -512,6 +553,15 @@ macro_rules! construct_uint {
return true;
}

// Whether this fits u64.
#[inline]
fn fits_word(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the reverse of https://github.com/paritytech/parity-common/blob/master/uint/src/uint.rs#L215 – maybe we can save some code there? And I wonder if the unroll! is still warranted, and if the comment about inlining is still relevant on modern rustc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can save some code there?

There is also a difference in that in it starts from 1nd index, not 0th, so I'm not sure how are you suggesting to reuse it. Initially I wrote self.bits() <= 64, but there was a small (~2%), but consistently measurable difference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

starts from 1nd index, not 0th

Ah, missed that. Sorry.

Still, we should verify if the unroll! and the inlining advice is needed here too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a change to unroll! to work with non-zero indices here, I'm curious if it makes any difference? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unroll v0.2.2 is out so I checked if it makes a difference (went back to commit 0b3844e) and it looks like unrolling is quite a significant win for U256:

     Running /Users/dvd/dev/parity/parity-common/target/release/deps/bigint-abf18d123360ed8c
u256_mul//(115792089237316195423570985008687907853269984665640564039457584007913129639935, 1)
                        time:   [3.3480 ns 3.3644 ns 3.3843 ns]
                        change: [+36.959% +41.024% +44.814%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
u256_mul//(3, 18446744073709551615)
                        time:   [3.3805 ns 3.3993 ns 3.4202 ns]
                        change: [+38.603% +43.937% +49.933%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
u256_mul//(21674844646682989462120101885968193938394323990565507610662749, 173)
                        time:   [3.4109 ns 3.4509 ns 3.4963 ns]
                        change: [+34.791% +38.484% +41.754%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

u512_mul//(1, 0)     time:   [8.5398 ns 8.5914 ns 8.6427 ns]
                        change: [+12.422% +16.167% +20.151%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe
u512_mul//(18446744073709551615, 4294967296)
                        time:   [8.4936 ns 8.5363 ns 8.5772 ns]
                        change: [+9.2618% +13.348% +17.802%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
u512_mul//(3759751734479964094783137206182536765532905409829204647089173492, 25097755455336819385...
                        time:   [24.957 ns 25.029 ns 25.111 ns]
                        change: [-4.7843% -1.5953% +1.9514%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low mild
  9 (9.00%) high mild
  6 (6.00%) high severe
u512_mul//(452312848583266388373324160190187140051835877600158453279131187530910662655, 452312848...
                        time:   [25.200 ns 25.326 ns 25.461 ns]
                        change: [-7.0673% -4.0808% -1.6396%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David, will add a commit with unroll 0.2.2.

@@ -321,15 +368,9 @@ macro_rules! impl_mulassign_from {

#[inline(always)]
#[doc(hidden)]
pub fn mul_u32(a: (u64, u64), b: u64, carry: u64) -> (u64, u64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public function, so we probably need to support it (with a deprecation warning?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this is true, but it is only used for construct_uint macro, and doc(hidden) indicate it shouldn't be a part of the public interface.

uint/src/uint.rs Outdated

let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
(res2, carry)
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be public?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this function is defined in uint crate, but the uint type itself is instantiated in the module it is defined (via macro).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-thinking about this. Not sure I follow the reasoning. This is a free function, a utility, called only from overflowing_mul_u64 I think. It is always inlined, so why can't it be a part of the impl $name for each type? (and be private)? Sorry if dumb q!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the same q applies to the split_128 as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I don't know the reasoning that was used before, maybe it will affect generated binary size somehow, but it seems like a good idea to me. 👍

uint/src/uint.rs Outdated
let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
(res2, carry)
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
let (hi, lo) = split_u128(u128::from(a) * u128::from(b) + u128::from(carry));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is u128 available on wasm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Seems to compile on playground, but I haven't measured perf for wasm. This method is needed for a faster division though.

Copy link

@eira-fransham eira-fransham May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check, the playground actually doesn't compile that function at all, it's not in the output. You can check what it looks like under Wasm using WebAssembly Studio. It compiles, but if the function isn't inlined then it does a (relatively costly in Wasm) stack return instead of using 2 return values like one would expect:

(func $split_u128 (export "split_u128") (type $t2) (param $p0 i32) (param $p1 i64) (param $p2 i64)
    get_local $p0
    get_local $p1
    i64.store offset=8
    get_local $p0
    get_local $p2
    i64.store)

So yes, this is available on Wasm, but I would be cautious. The assembly is actually fine under FireFox, although laughably bad for how simple a function this is (if LLVM had generated better Wasm it could just be 2x mov r64, r64), but assuming that you don't use the return value the next cycle after the call the cost will likely be hidden by the write being executed in parallel with other instructions:

wasm-function[1]:
  sub rsp, 8                            ; 0x000000 48 83 ec 08
  mov qword ptr [r15 + rdi + 8], rsi    ; 0x000004 49 89 74 3f 08
  mov qword ptr [r15 + rdi], rdx        ; 0x000009 49 89 14 3f
  nop                                   ; 0x00000d 66 90
  add rsp, 8                            ; 0x00000f 48 83 c4 08
  ret                                   ; 0x000013 c3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jack for the detailed info. This function is only used for parsing string (from_dec_str), multiplication by primitive types and (in #126) by a faster div. In div it's not a perf bottleneck, so I guess it's fine if it's currently not optimal for Wasm.

let o = other as u64;
/// Overflowing multiplication by u64.
/// Returns the result and carry.
fn overflowing_mul_u64(mut self, other: u64) -> (Self, u64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be API change, no? We went from u32 to u64?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a private method

@@ -950,56 +1003,18 @@ macro_rules! construct_uint {
}
}

// specialization for u32
impl $crate::core_::ops::Mul<u32> for $name {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a breaking API change. I think the reason why it wasn't implemented for all numbers was type interference. Now if you try to write U256::from(2) * 2 it's going to complain, so you need to specify the correct type (brace for fixing the tests in dependent projects).
That said I think it's a good change, the previous behaviour was weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's generated via impl_mul_for_primitive now (also for other primitive types)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's my point. Existing code will break cause there are now multiple (potentially conflicting) implementations of Mul if the type is not strictly specified (i.e. you just write * 2 instead of * 2_u8 or * 2_u32)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a breaking API change.

Indeed. We'll need a 0.7 release for sure.

Copy link
Member Author

@ordian ordian Apr 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomusdrw
Copy link
Contributor

Nice work!

@@ -512,6 +553,15 @@ macro_rules! construct_uint {
return true;
}

// Whether this fits u64.
#[inline]
fn fits_word(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

starts from 1nd index, not 0th

Ah, missed that. Sorry.

Still, we should verify if the unroll! and the inlining advice is needed here too.

uint/src/uint.rs Outdated
(res2, carry)
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
let (hi, lo) = split_u128(u128::from(a) * u128::from(b) + u128::from(carry));
(lo, hi)
}

#[inline(always)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove or deprecate split() now?

uint/src/uint.rs Outdated

let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
(res2, carry)
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-thinking about this. Not sure I follow the reasoning. This is a free function, a utility, called only from overflowing_mul_u64 I think. It is always inlined, so why can't it be a part of the impl $name for each type? (and be private)? Sorry if dumb q!

uint/src/uint.rs Outdated

let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
(res2, carry)
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the same q applies to the split_128 as well?

@ordian ordian force-pushed the ao-are-you-ready-for-a-good-mul branch from 3a452d8 to a82d2c6 Compare April 25, 2019 11:26
@ordian ordian force-pushed the ao-are-you-ready-for-a-good-mul branch from a82d2c6 to 0b3844e Compare April 30, 2019 09:33
@ordian
Copy link
Member Author

ordian commented Apr 30, 2019

@dvdplm @tomusdrw I've updated the PR description with some numbers, let me know if you have any questions/concerns.

@ordian
Copy link
Member Author

ordian commented May 1, 2019

I've removed the fast path with fits_word check, so now there are no trade-offs to make. We can add it back in a separate PR if we want.

ordian added 2 commits May 6, 2019 15:33
* master:
  Bump fixed-hash to 0.3.2 (#134)
  Use EntropyRng instead of OsRng (#130)
  Bump parity-crypto to 0.3.1 (#132)
  rust-crypto dependency removal (#85)
  Use newly stablized TryFrom trait for primitive-types (#127)
* master:
  uint: convert benchmarks to criterion  (#123)
@ordian ordian merged commit a1ea210 into master May 7, 2019
@ordian ordian deleted the ao-are-you-ready-for-a-good-mul branch May 7, 2019 13:54
ordian added a commit that referenced this pull request May 7, 2019
* master:
  uint: faster mul by u64 (#125)
  fix rocksdb block cache size (#122)
  uint: convert benchmarks to criterion  (#123)
  Bump fixed-hash to 0.3.2 (#134)
  Use EntropyRng instead of OsRng (#130)
  Bump parity-crypto to 0.3.1 (#132)
  rust-crypto dependency removal (#85)
  Use newly stablized TryFrom trait for primitive-types (#127)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants