Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apd: optimize Quo, remove per-digit long division #114

Merged

Conversation

nvanbenschoten
Copy link
Member

Fixes #98.

This commit speeds up Context.Quo, replacing its per-digit long
division with a call to BigInt.QuoRem. This avoids per-digit
iteration. It also allows Context.Quo to benefit further from the
inline fast-path of BigInt.QuoRem for small coefficient values that
fit in a uint64.

This is a partial revert of 1eddda3, at least in spirit. Before that
commit, Context.Quo did try to use big.Int.Quo. Unfortunately, it
was inaccurate for certain values because it was not scaling the
coefficients correctly before dividing. It tried to compensate for this
by using a very large precision (c.Precision*2 + 8), but this was
insufficient on its own because it was not taking the size of the values
into account. So the commit abandoned that approach.

However, that commit, which was based on the description given on the
GDA site, did demonstrate how to scale the coefficients in a way that
would permit this kind of division. Specifically, it began adjusting the
operands so that the coefficient of the dividend was greater than or
equal to the coefficient of the divisor and was also less than ten times
the coefficient of the divisor.

This commit uses the coefficient adjustment introduced in 1eddda3 to
revive the call into BigInt.QuoRem. With the two operands adjusted, it
now is possible to scale the coefficient of the dividend by the desired
precision and then perform a direct division on the operands.

Microbenchmarks

name                 old time/op    new time/op    delta
GDA/exp-10              116ms ± 0%      21ms ± 0%  -82.28%  (p=0.000 n=9+9)
GDA/cuberoot-apd-10    1.76ms ± 0%    0.40ms ± 0%  -77.47%  (p=0.000 n=9+10)
GDA/log10-10            102ms ± 0%      25ms ± 0%  -74.99%  (p=0.000 n=10+9)
GDA/ln-10              78.3ms ± 0%    19.8ms ± 1%  -74.69%  (p=0.000 n=10+10)
GDA/power-10            212ms ± 0%      56ms ± 0%  -73.46%  (p=0.000 n=9+9)
GDA/divide-10           229µs ± 0%      65µs ± 1%  -71.72%  (p=0.000 n=10+10)
GDA/powersqrt-10        190ms ± 0%      74ms ± 0%  -61.19%  (p=0.000 n=10+10)
GDA/squareroot-10      12.2ms ± 0%     5.2ms ± 0%  -57.32%  (p=0.000 n=8+9)
GDA/rounding-10         249µs ± 0%     166µs ± 0%  -33.35%  (p=0.000 n=8+10)
GDA/randoms-10         1.70ms ± 0%    1.36ms ± 0%  -20.06%  (p=0.000 n=9+10)
GDA/reduce-10          10.3µs ± 1%    10.2µs ± 1%   -1.24%  (p=0.000 n=10+10)
GDA/remainder-10       26.5µs ± 0%    26.4µs ± 0%   -0.42%  (p=0.000 n=10+10)
GDA/base-10             112µs ± 0%     112µs ± 0%   -0.21%  (p=0.004 n=10+10)
GDA/abs-10             4.48µs ± 1%    4.48µs ± 3%     ~     (p=0.154 n=9+9)
GDA/add-10              550µs ± 1%     551µs ± 1%     ~     (p=0.400 n=10+9)
GDA/compare-10         25.6µs ± 1%    25.7µs ± 1%     ~     (p=0.101 n=10+8)
GDA/comparetotal-10    23.9µs ± 0%    23.9µs ± 1%     ~     (p=0.234 n=9+10)
GDA/divideint-10       13.3µs ± 1%    13.3µs ± 0%     ~     (p=0.363 n=10+10)
GDA/minus-10           5.98µs ± 0%    5.98µs ± 1%     ~     (p=0.196 n=10+10)
GDA/multiply-10        50.1µs ± 0%    50.1µs ± 0%     ~     (p=0.481 n=10+10)
GDA/plus-10            33.1µs ± 0%    33.1µs ± 1%     ~     (p=0.315 n=10+10)
GDA/quantize-10        68.2µs ± 0%    68.5µs ± 1%     ~     (p=0.497 n=9+10)
GDA/subtract-10        81.0µs ± 1%    80.8µs ± 1%     ~     (p=0.219 n=9+10)
GDA/tointegral-10      16.0µs ± 1%    16.0µs ± 1%     ~     (p=0.888 n=9+10)
GDA/tointegralx-10     16.5µs ± 1%    16.6µs ± 1%     ~     (p=0.447 n=9+10)

name                 old alloc/op   new alloc/op   delta
GDA/exp-10             60.6MB ± 0%     9.6MB ± 0%  -84.21%  (p=0.000 n=10+10)
GDA/squareroot-10       307kB ± 0%     145kB ± 0%  -52.90%  (p=0.000 n=9+10)
GDA/cuberoot-apd-10     129kB ± 0%     122kB ± 0%   -5.66%  (p=0.000 n=10+10)
GDA/abs-10              0.00B          0.00B          ~     (all equal)
GDA/add-10              459kB ± 0%     459kB ± 0%     ~     (p=0.504 n=10+10)
GDA/base-10            26.2kB ± 0%    26.2kB ± 0%     ~     (all equal)
GDA/compare-10          0.00B          0.00B          ~     (all equal)
GDA/comparetotal-10     0.00B          0.00B          ~     (all equal)
GDA/divideint-10        96.0B ± 0%     96.0B ± 0%     ~     (all equal)
GDA/minus-10            0.00B          0.00B          ~     (all equal)
GDA/multiply-10        15.7kB ± 0%    15.7kB ± 0%     ~     (all equal)
GDA/plus-10            41.0kB ± 0%    41.0kB ± 0%     ~     (all equal)
GDA/quantize-10        9.20kB ± 0%    9.20kB ± 0%     ~     (all equal)
GDA/randoms-10         48.5kB ± 0%    48.5kB ± 0%     ~     (all equal)
GDA/reduce-10           0.00B          0.00B          ~     (all equal)
GDA/remainder-10        96.0B ± 0%     96.0B ± 0%     ~     (all equal)
GDA/rounding-10        3.07kB ± 0%    3.07kB ± 0%     ~     (all equal)
GDA/subtract-10        35.2kB ± 0%    35.2kB ± 0%     ~     (all equal)
GDA/tointegral-10      6.58kB ± 0%    6.58kB ± 0%     ~     (all equal)
GDA/tointegralx-10     6.58kB ± 0%    6.58kB ± 0%     ~     (all equal)
GDA/powersqrt-10       6.01MB ± 0%    6.27MB ± 0%   +4.28%  (p=0.000 n=10+10)
GDA/power-10           20.7MB ± 0%    22.6MB ± 0%   +9.20%  (p=0.000 n=10+10)
GDA/divide-10          9.70kB ± 0%   10.59kB ± 0%   +9.24%  (p=0.000 n=10+10)
GDA/ln-10              7.16MB ± 0%    7.91MB ± 0%  +10.42%  (p=0.000 n=10+10)
GDA/log10-10           9.12MB ± 0%   10.27MB ± 0%  +12.58%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
GDA/exp-10               696k ± 0%      127k ± 0%  -81.71%  (p=0.000 n=10+10)
GDA/squareroot-10       5.81k ± 0%     3.86k ± 0%  -33.57%  (p=0.000 n=10+10)
GDA/cuberoot-apd-10     2.62k ± 0%     2.43k ± 0%   -7.03%  (p=0.000 n=10+10)
GDA/abs-10               0.00           0.00          ~     (all equal)
GDA/add-10              4.21k ± 0%     4.21k ± 0%     ~     (all equal)
GDA/base-10             2.52k ± 0%     2.52k ± 0%     ~     (all equal)
GDA/compare-10           0.00           0.00          ~     (all equal)
GDA/comparetotal-10      0.00           0.00          ~     (all equal)
GDA/divideint-10         2.00 ± 0%      2.00 ± 0%     ~     (all equal)
GDA/minus-10             0.00           0.00          ~     (all equal)
GDA/multiply-10           312 ± 0%       312 ± 0%     ~     (all equal)
GDA/plus-10               260 ± 0%       260 ± 0%     ~     (all equal)
GDA/quantize-10          53.0 ± 0%      53.0 ± 0%     ~     (all equal)
GDA/randoms-10            704 ± 0%       704 ± 0%     ~     (all equal)
GDA/reduce-10            0.00           0.00          ~     (all equal)
GDA/remainder-10         2.00 ± 0%      2.00 ± 0%     ~     (all equal)
GDA/rounding-10          96.0 ± 0%      96.0 ± 0%     ~     (all equal)
GDA/subtract-10           210 ± 0%       210 ± 0%     ~     (all equal)
GDA/tointegral-10        48.0 ± 0%      48.0 ± 0%     ~     (all equal)
GDA/tointegralx-10       48.0 ± 0%      48.0 ± 0%     ~     (all equal)
GDA/powersqrt-10         205k ± 0%      210k ± 0%   +2.33%  (p=0.000 n=10+10)
GDA/divide-10             202 ± 0%       213 ± 0%   +5.45%  (p=0.000 n=10+10)
GDA/power-10             423k ± 0%      450k ± 0%   +6.51%  (p=0.000 n=10+10)
GDA/ln-10                146k ± 0%      157k ± 0%   +7.43%  (p=0.000 n=10+10)
GDA/log10-10             186k ± 0%      204k ± 0%   +9.61%  (p=0.000 n=10+10)

From the benchmark referenced in #98:

The reported results were:

Benchmark/1_Ops:_apd.Decimal128.Add-16         	 5285344	       218 ns/op
Benchmark/1_Ops:_apd.Decimal128.Sub-16         	 5665453	       208 ns/op
Benchmark/1_Ops:_apd.Decimal128.Mul-16         	 9138602	       138 ns/op
Benchmark/1_Ops:_apd.Decimal128.Quo-16         	  162537	      7790 ns/op

After #103 and this PR, they are now:

Benchmark/1_Ops:_Decimal128.Add-10 	11128542	       109.5 ns/op
Benchmark/1_Ops:_Decimal128.Sub-10 	12875046	        94.40 ns/op
Benchmark/1_Ops:_Decimal128.Mul-10 	16563775	        75.21 ns/op
Benchmark/1_Ops:_Decimal128.Quo-10 	 3055980	       398.4 ns/op

Code cleanup. Should not change behavior, but makes the code's purpose
clearer and may be slightly faster.
Fixes cockroachdb#98.

This commit speeds up `Context.Quo`, replacing its per-digit long
division with a call to `BigInt.QuoRem`. This avoids per-digit
iteration. It also allows `Context.Quo` to benefit further from the
inline fast-path of `BigInt.QuoRem` for small coefficient values that
fit in a uint64.

This is a partial revert of 1eddda3, at least in spirit. Before that
commit, `Context.Quo` did try to use `big.Int.Quo`. Unfortunately, it
was inaccurate for certain values because it was not scaling the
coefficients correctly before dividing. It tried to compensate for this
by using a very large precision (`c.Precision*2 + 8`), but this was
insufficient on its own because it was not taking the size of the values
into account. So the commit abandoned that approach.

However, that commit, which was based on the description given on the
GDA site, did demonstrate how to scale the coefficients in a way that
would permit this kind of division. Specifically, it began adjusting the
operands so that the coefficient of the dividend was greater than or
equal to the coefficient of the divisor and was also less than ten times
the coefficient of the divisor.

This commit uses the coefficient adjustment introduced in 1eddda3 to
revive the call into `BigInt.QuoRem`. With the two operands adjusted, it
now is possible to scale the coefficient of the dividend by the desired
precision and then perform a direct division on the operands.
This commit replaces the looping in Quo when adjusting coefficients to
be integer divisible with a fixed set of operations that have the same
effect.
These all pass now. I don't know why.
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @nvanbenschoten)


context.go, line 287 at r2 (raw file):

	}

	if c.Precision > 5000 {

nit: can/should this check now be removed?

Now that we removed the long division, we don't directly iterate per
digit in Quo, so we can remove the precision limit.
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 2 of 3 files reviewed, all discussions resolved (waiting on @yuzefovich)


context.go, line 287 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: can/should this check now be removed?

Done.

@nvanbenschoten nvanbenschoten merged commit 94925d2 into cockroachdb:master Jan 31, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/quoFast branch January 31, 2022 18:15
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 1, 2022
Picks up two PRs that improved the performance of `Quo`, `Sqrt`,
`Cbrt`, `Exp`, `Ln`, `Log`, and `Pow`:
- cockroachdb/apd#114
- cockroachdb/apd#115

Almost all of the testing changes here are due to the rounding behavior
in cockroachdb/apd#115. This brings us closer
to PG's behavior, but also creates a lot of noise in this diff.

Release note (performance improvement): The performance of many DECIMAL
arithmetic operators has been improved by as much as 60%. These operators
include division (`/`), `sqrt`, `cbrt`, `exp`, `ln`, `log`, and `pow`.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 2, 2022
Picks up two PRs that improved the performance of `Quo`, `Sqrt`,
`Cbrt`, `Exp`, `Ln`, `Log`, and `Pow`:
- cockroachdb/apd#114
- cockroachdb/apd#115

Almost all of the testing changes here are due to the rounding behavior
in cockroachdb/apd#115. This brings us closer
to PG's behavior, but also creates a lot of noise in this diff.

Release note (performance improvement): The performance of many DECIMAL
arithmetic operators has been improved by as much as 60%. These operators
include division (`/`), `sqrt`, `cbrt`, `exp`, `ln`, `log`, and `pow`.
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Feb 2, 2022
74612: rangefeedcache,settingswatcher,systemcfgwatcher: lose gossip deps r=ajwerner a=ajwerner

This is a pile of commits which supersedes #69269 and pretty much puts in place all of the pieces to move on #70560. 

75726: sql: refactor pg_has_role to remove privilege.GRANT usage r=RichardJCai a=ecwall

refs #73129

Also combines some layers of privilege checking code.

Release note: None

75770: vendor: bump cockroachdb/apd to v3.1.0, speed up decimal division r=nvanbenschoten a=nvanbenschoten

Picks up two PRs that improved the performance of `Quo`, `Sqrt`, `Cbrt`, `Exp`, `Ln`, `Log`, and `Pow`:
- cockroachdb/apd#114
- cockroachdb/apd#115

Almost all of the testing changes here are due to the rounding behavior in cockroachdb/apd#115. This brings us closer to PG's behavior, but also creates a lot of noise in this diff. To verify that this noise wasn't hiding any correctness regressions caused by the rewrite of `Context.Quo` in the first PR, I created #75757, which only includes the first PR. #75757 passes CI with minimal testing changes. The testing changes that PR did require all have to do with trailing zeros, and most of them are replaced in this PR.

Release note (performance improvement): The performance of many DECIMAL arithmetic operators has been improved by as much as 60%. These operators include division (`/`), `sqrt`, `cbrt`, `exp`, `ln`, `log`, and `pow`.

----

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (web_sales, row count = 7,197,566):

```sql
# q1
select sum(ws_wholesale_cost / ws_ext_list_price) from web_sales;

# q2
select sum(ws_wholesale_cost / ws_ext_list_price - sqrt(ws_net_paid_inc_tax)) from web_sales;
```

Here's the difference in runtime of these two queries before and after this change on an `n2-standard-8` instance:

```
name              old s/op   new s/op   delta
TPC-DS/custom/q1  22.4 ± 0%   8.6 ± 0%  -61.33%  (p=0.002 n=6+6)
TPC-DS/custom/q2  91.4 ± 0%  32.1 ± 0%  -64.85%  (p=0.008 n=5+5)
```

75771: colexec: close the ordered sync used by the external sorter r=yuzefovich a=yuzefovich

**colexec: close the ordered sync used by the external sorter**

Previously, we forgot to close the ordered synchronizer that is used by
the external sorter to merge already sorted partitions. This could
result in some tracing spans never being finished and is now fixed.

Release note: None

**colexec: return an error rather than logging it on Close in some cases**

This error eventually will be logged anyway, but we should try to
propagate it up the stack as much as possible.

Release note: None

75807: ui: Add CircleFilled component r=ericharmeling a=ericharmeling

Previously, there was no component for a filled circle icon in the `ui` package.
Recent product designs have requested this icon for the DB Console (see #67510, #73463).
This PR adds a `CircleFilled` component to the `ui` package.

Release note: None

75813: sql: fix flakey TestShowRangesMultipleStores r=ajwerner a=ajwerner

Fixes #75699

Release note: None

75836: dev: add generate protobuf r=ajwerner a=ajwerner

Convenient, fast.

Release note: None

Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add small precision fast-path to Quo
3 participants