-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
uint: faster mul by u64 #125
Conversation
@@ -512,6 +553,15 @@ macro_rules! construct_uint { | |||
return true; | |||
} | |||
|
|||
// Whether this fits u64. | |||
#[inline] | |||
fn fits_word(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the reverse of https://github.com/paritytech/parity-common/blob/master/uint/src/uint.rs#L215 – maybe we can save some code there? And I wonder if the unroll!
is still warranted, and if the comment about inlining is still relevant on modern rustc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can save some code there?
There is also a difference in that in it starts from 1nd index, not 0th, so I'm not sure how are you suggesting to reuse it. Initially I wrote self.bits() <= 64
, but there was a small (~2%), but consistently measurable difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
starts from 1nd index, not 0th
Ah, missed that. Sorry.
Still, we should verify if the unroll!
and the inlining advice is needed here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a change to unroll!
to work with non-zero indices here, I'm curious if it makes any difference? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unroll
v0.2.2 is out so I checked if it makes a difference (went back to commit 0b3844e) and it looks like unrolling is quite a significant win for U256
:
Running /Users/dvd/dev/parity/parity-common/target/release/deps/bigint-abf18d123360ed8c
u256_mul//(115792089237316195423570985008687907853269984665640564039457584007913129639935, 1)
time: [3.3480 ns 3.3644 ns 3.3843 ns]
change: [+36.959% +41.024% +44.814%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
u256_mul//(3, 18446744073709551615)
time: [3.3805 ns 3.3993 ns 3.4202 ns]
change: [+38.603% +43.937% +49.933%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
u256_mul//(21674844646682989462120101885968193938394323990565507610662749, 173)
time: [3.4109 ns 3.4509 ns 3.4963 ns]
change: [+34.791% +38.484% +41.754%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
u512_mul//(1, 0) time: [8.5398 ns 8.5914 ns 8.6427 ns]
change: [+12.422% +16.167% +20.151%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
8 (8.00%) high severe
u512_mul//(18446744073709551615, 4294967296)
time: [8.4936 ns 8.5363 ns 8.5772 ns]
change: [+9.2618% +13.348% +17.802%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
u512_mul//(3759751734479964094783137206182536765532905409829204647089173492, 25097755455336819385...
time: [24.957 ns 25.029 ns 25.111 ns]
change: [-4.7843% -1.5953% +1.9514%] (p = 0.40 > 0.05)
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
2 (2.00%) low mild
9 (9.00%) high mild
6 (6.00%) high severe
u512_mul//(452312848583266388373324160190187140051835877600158453279131187530910662655, 452312848...
time: [25.200 ns 25.326 ns 25.461 ns]
change: [-7.0673% -4.0808% -1.6396%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks David, will add a commit with unroll
0.2.2.
@@ -321,15 +368,9 @@ macro_rules! impl_mulassign_from { | |||
|
|||
#[inline(always)] | |||
#[doc(hidden)] | |||
pub fn mul_u32(a: (u64, u64), b: u64, carry: u64) -> (u64, u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a public function, so we probably need to support it (with a deprecation warning?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, this is true, but it is only used for construct_uint
macro, and doc(hidden)
indicate it shouldn't be a part of the public interface.
uint/src/uint.rs
Outdated
|
||
let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64; | ||
(res2, carry) | ||
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this function is defined in uint crate, but the uint type itself is instantiated in the module it is defined (via macro).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-thinking about this. Not sure I follow the reasoning. This is a free function, a utility, called only from overflowing_mul_u64
I think. It is always inlined, so why can't it be a part of the impl $name
for each type? (and be private)? Sorry if dumb q!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the same q applies to the split_128
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I don't know the reasoning that was used before, maybe it will affect generated binary size somehow, but it seems like a good idea to me. 👍
uint/src/uint.rs
Outdated
let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64; | ||
(res2, carry) | ||
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) { | ||
let (hi, lo) = split_u128(u128::from(a) * u128::from(b) + u128::from(carry)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is u128
available on wasm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Seems to compile on playground, but I haven't measured perf for wasm. This method is needed for a faster division though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you check, the playground actually doesn't compile that function at all, it's not in the output. You can check what it looks like under Wasm using WebAssembly Studio. It compiles, but if the function isn't inlined then it does a (relatively costly in Wasm) stack return instead of using 2 return values like one would expect:
(func $split_u128 (export "split_u128") (type $t2) (param $p0 i32) (param $p1 i64) (param $p2 i64)
get_local $p0
get_local $p1
i64.store offset=8
get_local $p0
get_local $p2
i64.store)
So yes, this is available on Wasm, but I would be cautious. The assembly is actually fine under FireFox, although laughably bad for how simple a function this is (if LLVM had generated better Wasm it could just be 2x mov r64, r64
), but assuming that you don't use the return value the next cycle after the call the cost will likely be hidden by the write being executed in parallel with other instructions:
wasm-function[1]:
sub rsp, 8 ; 0x000000 48 83 ec 08
mov qword ptr [r15 + rdi + 8], rsi ; 0x000004 49 89 74 3f 08
mov qword ptr [r15 + rdi], rdx ; 0x000009 49 89 14 3f
nop ; 0x00000d 66 90
add rsp, 8 ; 0x00000f 48 83 c4 08
ret ; 0x000013 c3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jack for the detailed info. This function is only used for parsing string (from_dec_str
), multiplication by primitive types and (in #126) by a faster div. In div it's not a perf bottleneck, so I guess it's fine if it's currently not optimal for Wasm.
let o = other as u64; | ||
/// Overflowing multiplication by u64. | ||
/// Returns the result and carry. | ||
fn overflowing_mul_u64(mut self, other: u64) -> (Self, u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be API change, no? We went from u32
to u64
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a private method
@@ -950,56 +1003,18 @@ macro_rules! construct_uint { | |||
} | |||
} | |||
|
|||
// specialization for u32 | |||
impl $crate::core_::ops::Mul<u32> for $name { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a breaking API change. I think the reason why it wasn't implemented for all numbers was type interference. Now if you try to write U256::from(2) * 2
it's going to complain, so you need to specify the correct type (brace for fixing the tests in dependent projects).
That said I think it's a good change, the previous behaviour was weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's generated via impl_mul_for_primitive
now (also for other primitive types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's my point. Existing code will break cause there are now multiple (potentially conflicting) implementations of Mul
if the type is not strictly specified (i.e. you just write * 2
instead of * 2_u8
or * 2_u32
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a breaking API change.
Indeed. We'll need a 0.7 release for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see your point, it's inferred as i32 by default: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=f3e5e51ddb3d1fa2ef073ae0085d6199
Nice work! |
@@ -512,6 +553,15 @@ macro_rules! construct_uint { | |||
return true; | |||
} | |||
|
|||
// Whether this fits u64. | |||
#[inline] | |||
fn fits_word(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
starts from 1nd index, not 0th
Ah, missed that. Sorry.
Still, we should verify if the unroll!
and the inlining advice is needed here too.
uint/src/uint.rs
Outdated
(res2, carry) | ||
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) { | ||
let (hi, lo) = split_u128(u128::from(a) * u128::from(b) + u128::from(carry)); | ||
(lo, hi) | ||
} | ||
|
||
#[inline(always)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove or deprecate split()
now?
uint/src/uint.rs
Outdated
|
||
let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64; | ||
(res2, carry) | ||
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-thinking about this. Not sure I follow the reasoning. This is a free function, a utility, called only from overflowing_mul_u64
I think. It is always inlined, so why can't it be a part of the impl $name
for each type? (and be private)? Sorry if dumb q!
uint/src/uint.rs
Outdated
|
||
let carry = (upper >> 32) + overflow1 as u64 + overflow2 as u64; | ||
(res2, carry) | ||
pub fn mul_u64(a: u64, b: u64, carry: u64) -> (u64, u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the same q applies to the split_128
as well?
3a452d8
to
a82d2c6
Compare
a82d2c6
to
0b3844e
Compare
I've removed the fast path with |
* master: uint: convert benchmarks to criterion (#123)
* master: uint: faster mul by u64 (#125) fix rocksdb block cache size (#122) uint: convert benchmarks to criterion (#123) Bump fixed-hash to 0.3.2 (#134) Use EntropyRng instead of OsRng (#130) Bump parity-crypto to 0.3.1 (#132) rust-crypto dependency removal (#85) Use newly stablized TryFrom trait for primitive-types (#127)
on top of #123 (last three commits)
changes:
#[doc(hidden)]
)U512
multiplication by primitive type is 2-5x faster.