uint: faster mul by u64 #125

ordian · 2019-04-22T11:17:03Z

on top of #123 (last three commits)
changes:

implements multiplication by u64 (needed for a faster division)
removes multiplication by u32
removes pub free-standing functions (technically this is a breaking change, but they're not supposed to be used by a user with #[doc(hidden)])
adds specialization for multiplication by primitive number types up to u64 (if the type is not specified, it's inferred as i32)

U512 multiplication by primitive type is 2-5x faster.

dvdplm · 2019-04-22T16:32:13Z

uint/src/uint.rs

@@ -512,6 +553,15 @@ macro_rules! construct_uint {
 				return true;
 			}

+			// Whether this fits u64.
+			#[inline]
+			fn fits_word(&self) -> bool {


This looks like the reverse of https://github.com/paritytech/parity-common/blob/master/uint/src/uint.rs#L215 – maybe we can save some code there? And I wonder if the unroll! is still warranted, and if the comment about inlining is still relevant on modern rustc.

maybe we can save some code there?

There is also a difference in that in it starts from 1nd index, not 0th, so I'm not sure how are you suggesting to reuse it. Initially I wrote self.bits() <= 64, but there was a small (~2%), but consistently measurable difference.

starts from 1nd index, not 0th

Ah, missed that. Sorry.

Still, we should verify if the unroll! and the inlining advice is needed here too.

I made a change to unroll! to work with non-zero indices here, I'm curious if it makes any difference? :)

unroll v0.2.2 is out so I checked if it makes a difference (went back to commit 0b3844e) and it looks like unrolling is quite a significant win for U256:

Running /Users/dvd/dev/parity/parity-common/target/release/deps/bigint-abf18d123360ed8c u256_mul//(115792089237316195423570985008687907853269984665640564039457584007913129639935, 1) time: [3.3480 ns 3.3644 ns 3.3843 ns] change: [+36.959% +41.024% +44.814%] (p = 0.00 < 0.05) Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe u256_mul//(3, 18446744073709551615) time: [3.3805 ns 3.3993 ns 3.4202 ns] change: [+38.603% +43.937% +49.933%] (p = 0.00 < 0.05) Performance has regressed. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe u256_mul//(21674844646682989462120101885968193938394323990565507610662749, 173) time: [3.4109 ns 3.4509 ns 3.4963 ns] change: [+34.791% +38.484% +41.754%] (p = 0.00 < 0.05) Performance has regressed. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe u512_mul//(1, 0) time: [8.5398 ns 8.5914 ns 8.6427 ns] change: [+12.422% +16.167% +20.151%] (p = 0.00 < 0.05) Performance has regressed. Found 8 outliers among 100 measurements (8.00%) 8 (8.00%) high severe u512_mul//(18446744073709551615, 4294967296) time: [8.4936 ns 8.5363 ns 8.5772 ns] change: [+9.2618% +13.348% +17.802%] (p = 0.00 < 0.05) Performance has regressed. Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe u512_mul//(3759751734479964094783137206182536765532905409829204647089173492, 25097755455336819385... time: [24.957 ns 25.029 ns 25.111 ns] change: [-4.7843% -1.5953% +1.9514%] (p = 0.40 > 0.05) No change in performance detected. Found 17 outliers among 100 measurements (17.00%) 2 (2.00%) low mild 9 (9.00%) high mild 6 (6.00%) high severe u512_mul//(452312848583266388373324160190187140051835877600158453279131187530910662655, 452312848... time: [25.200 ns 25.326 ns 25.461 ns] change: [-7.0673% -4.0808% -1.6396%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe

Thanks David, will add a commit with unroll 0.2.2.

dvdplm · 2019-04-22T16:33:22Z

uint/src/uint.rs

@@ -321,15 +368,9 @@ macro_rules! impl_mulassign_from {

 #[inline(always)]
 #[doc(hidden)]
-pub fn mul_u32(a: (u64, u64), b: u64, carry: u64) -> (u64, u64) {


This is a public function, so we probably need to support it (with a deprecation warning?).

Technically, this is true, but it is only used for construct_uint macro, and doc(hidden) indicate it shouldn't be a part of the public interface.

dvdplm · 2019-04-22T16:33:44Z

uint/src/uint.rs

-
-	let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
-	(res2, carry)
+pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {


Does it need to be public?

Yes, this function is defined in uint crate, but the uint type itself is instantiated in the module it is defined (via macro).

re-thinking about this. Not sure I follow the reasoning. This is a free function, a utility, called only from overflowing_mul_u64 I think. It is always inlined, so why can't it be a part of the impl $name for each type? (and be private)? Sorry if dumb q!

I guess the same q applies to the split_128 as well?

TBH, I don't know the reasoning that was used before, maybe it will affect generated binary size somehow, but it seems like a good idea to me. 👍

tomusdrw · 2019-04-23T08:04:36Z

uint/src/uint.rs

-	let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
-	(res2, carry)
+pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
+	let (hi, lo) = split_u128(u128::from(a) * u128::from(b) + u128::from(carry));


Is u128 available on wasm?

That's a good point. Seems to compile on playground, but I haven't measured perf for wasm. This method is needed for a faster division though.

If you check, the playground actually doesn't compile that function at all, it's not in the output. You can check what it looks like under Wasm using WebAssembly Studio. It compiles, but if the function isn't inlined then it does a (relatively costly in Wasm) stack return instead of using 2 return values like one would expect:

(func $split_u128 (export "split_u128") (type $t2) (param $p0 i32) (param $p1 i64) (param $p2 i64) get_local $p0 get_local $p1 i64.store offset=8 get_local $p0 get_local $p2 i64.store)

So yes, this is available on Wasm, but I would be cautious. The assembly is actually fine under FireFox, although laughably bad for how simple a function this is (if LLVM had generated better Wasm it could just be 2x mov r64, r64), but assuming that you don't use the return value the next cycle after the call the cost will likely be hidden by the write being executed in parallel with other instructions:

wasm-function[1]: sub rsp, 8 ; 0x000000 48 83 ec 08 mov qword ptr [r15 + rdi + 8], rsi ; 0x000004 49 89 74 3f 08 mov qword ptr [r15 + rdi], rdx ; 0x000009 49 89 14 3f nop ; 0x00000d 66 90 add rsp, 8 ; 0x00000f 48 83 c4 08 ret ; 0x000013 c3

Thanks Jack for the detailed info. This function is only used for parsing string (from_dec_str), multiplication by primitive types and (in #126) by a faster div. In div it's not a perf bottleneck, so I guess it's fine if it's currently not optimal for Wasm.

tomusdrw · 2019-04-23T08:06:01Z

uint/src/uint.rs

-				let o = other as u64;
+			/// Overflowing multiplication by u64.
+			/// Returns the result and carry.
+			fn overflowing_mul_u64(mut self, other: u64) -> (Self, u64) {


This seems to be API change, no? We went from u32 to u64?

it is a private method

tomusdrw · 2019-04-23T08:08:18Z

uint/src/uint.rs

@@ -950,56 +1003,18 @@ macro_rules! construct_uint {
 			}
 		}

-		// specialization for u32
-		impl $crate::core_::ops::Mul<u32> for $name {


That's a breaking API change. I think the reason why it wasn't implemented for all numbers was type interference. Now if you try to write U256::from(2) * 2 it's going to complain, so you need to specify the correct type (brace for fixing the tests in dependent projects).
That said I think it's a good change, the previous behaviour was weird.

it's generated via impl_mul_for_primitive now (also for other primitive types)

Yeah, that's my point. Existing code will break cause there are now multiple (potentially conflicting) implementations of Mul if the type is not strictly specified (i.e. you just write * 2 instead of * 2_u8 or * 2_u32)

That's a breaking API change.

Indeed. We'll need a 0.7 release for sure.

Ah, I see your point, it's inferred as i32 by default: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=f3e5e51ddb3d1fa2ef073ae0085d6199

EDIT: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=e2d106cb3b267d716078e2db0ec55efb

tomusdrw · 2019-04-23T08:10:12Z

Nice work!

dvdplm · 2019-04-23T07:32:32Z

uint/src/uint.rs

@@ -512,6 +553,15 @@ macro_rules! construct_uint {
 				return true;
 			}

+			// Whether this fits u64.
+			#[inline]
+			fn fits_word(&self) -> bool {


starts from 1nd index, not 0th

Ah, missed that. Sorry.

Still, we should verify if the unroll! and the inlining advice is needed here too.

dvdplm · 2019-04-23T08:11:22Z

uint/src/uint.rs

-	(res2, carry)
+pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {
+	let (hi, lo) = split_u128(u128::from(a) * u128::from(b) + u128::from(carry));
+	(lo, hi)
 }

 #[inline(always)]


Can we remove or deprecate split() now?

dvdplm · 2019-04-23T08:22:13Z

uint/src/uint.rs

-
-	let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
-	(res2, carry)
+pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {


re-thinking about this. Not sure I follow the reasoning. This is a free function, a utility, called only from overflowing_mul_u64 I think. It is always inlined, so why can't it be a part of the impl $name for each type? (and be private)? Sorry if dumb q!

dvdplm · 2019-04-23T08:24:13Z

uint/src/uint.rs

-
-	let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64;
-	(res2, carry)
+pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) {


I guess the same q applies to the split_128 as well?

ordian · 2019-04-30T10:50:16Z

@dvdplm @tomusdrw I've updated the PR description with some numbers, let me know if you have any questions/concerns.

ordian · 2019-05-01T12:24:45Z

I've removed the fast path with fits_word check, so now there are no trade-offs to make. We can add it back in a separate PR if we want.

* master: Bump fixed-hash to 0.3.2 (#134) Use EntropyRng instead of OsRng (#130) Bump parity-crypto to 0.3.1 (#132) rust-crypto dependency removal (#85) Use newly stablized TryFrom trait for primitive-types (#127)

* master: uint: convert benchmarks to criterion (#123)

* master: uint: faster mul by u64 (#125) fix rocksdb block cache size (#122) uint: convert benchmarks to criterion (#123) Bump fixed-hash to 0.3.2 (#134) Use EntropyRng instead of OsRng (#130) Bump parity-crypto to 0.3.1 (#132) rust-crypto dependency removal (#85) Use newly stablized TryFrom trait for primitive-types (#127)

ordian added 9 commits April 16, 2019 15:20

uint: convert benchmarks to criterion

bdf3317

rustfmt

8256014

add u256_rem_small

726b7c6

add u512_rem_large bench

e134a1b

uint: add mulmod comparison benches

199f5e5

uint: add comparison with gmp

16358fa

uint: clean up benches

4948b8c

uint: fix windows ci

8a27b33

uint: fix gmp conversions

fe7c08a

ordian mentioned this pull request Apr 22, 2019

uint: faster div #126

Merged

dvdplm reviewed Apr 22, 2019

View reviewed changes

tomusdrw reviewed Apr 23, 2019

View reviewed changes

dvdplm reviewed Apr 23, 2019

View reviewed changes

ordian force-pushed the ao-are-you-ready-for-a-good-mul branch from 3a452d8 to a82d2c6 Compare April 25, 2019 11:26

ordian added 4 commits April 30, 2019 11:15

uint: make sure bytes are little endian for gmp conversion

acbcfe7

uint: compare mul by u32 and u64

2f2997c

uint: fast path for mul by u64

cdd017e

uint: make free-standing functions static

0b3844e

ordian force-pushed the ao-are-you-ready-for-a-good-mul branch from a82d2c6 to 0b3844e Compare April 30, 2019 09:33

ordian added the A0-pleasereview label Apr 30, 2019

uint: remove fast-path for mul by u64

e0c50c8

dvdplm approved these changes May 2, 2019

View reviewed changes

ordian added 2 commits May 6, 2019 15:33

Merge branch 'master' into ao-are-you-ready-for-a-good-mul

f7c6964

* master: Bump fixed-hash to 0.3.2 (#134) Use EntropyRng instead of OsRng (#130) Bump parity-crypto to 0.3.1 (#132) rust-crypto dependency removal (#85) Use newly stablized TryFrom trait for primitive-types (#127)

Merge branch 'master' into ao-are-you-ready-for-a-good-mul

6083084

* master: uint: convert benchmarks to criterion (#123)

dvdplm approved these changes May 6, 2019

View reviewed changes

andresilva approved these changes May 7, 2019

View reviewed changes

ordian merged commit a1ea210 into master May 7, 2019

ordian deleted the ao-are-you-ready-for-a-good-mul branch May 7, 2019 13:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

uint: faster mul by u64 #125

uint: faster mul by u64 #125

ordian commented Apr 22, 2019 •

edited

Loading

dvdplm Apr 22, 2019

ordian Apr 22, 2019

dvdplm Apr 23, 2019

dvdplm May 7, 2019

dvdplm May 7, 2019

ordian May 7, 2019

dvdplm Apr 22, 2019

ordian Apr 22, 2019

dvdplm Apr 22, 2019

ordian Apr 22, 2019

dvdplm Apr 23, 2019

dvdplm Apr 23, 2019

ordian Apr 23, 2019

tomusdrw Apr 23, 2019

ordian Apr 23, 2019

eira-fransham May 7, 2019 •

edited

Loading

ordian May 7, 2019

tomusdrw Apr 23, 2019

ordian Apr 23, 2019

tomusdrw Apr 23, 2019

ordian Apr 23, 2019

tomusdrw Apr 23, 2019

dvdplm Apr 23, 2019

ordian Apr 23, 2019 •

edited

Loading

tomusdrw commented Apr 23, 2019

dvdplm Apr 23, 2019

dvdplm Apr 23, 2019

dvdplm Apr 23, 2019

dvdplm Apr 23, 2019

ordian commented Apr 30, 2019

ordian commented May 1, 2019

uint: faster mul by u64 #125

uint: faster mul by u64 #125

Conversation

ordian commented Apr 22, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eira-fransham May 7, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ordian Apr 23, 2019 • edited Loading

Choose a reason for hiding this comment

tomusdrw commented Apr 23, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ordian commented Apr 30, 2019

ordian commented May 1, 2019

ordian commented Apr 22, 2019 •

edited

Loading

eira-fransham May 7, 2019 •

edited

Loading

ordian Apr 23, 2019 •

edited

Loading