exec: overflow handling for vectorized arithmetic #38967

Merged: 2 commits into cockroachdb:master from aggregator-overflow-handling on Jul 23, 2019

Collaborator

@rafiss rafiss commented Jul 18, 2019

The overflow checks are emitted as part of the code generation in
overloads.go. For performance reasons, the checks are done inline rather
than by calling the functions in the arith package.

The checks are only done for integer math. Float math is already
well-defined, since overflow results in +Inf or -Inf as appropriate.
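
To illustrate the float behavior, here is a small Go sketch. The helper name `productIsInf` is hypothetical and not part of this PR. Multiplying past the float64 range yields an infinity instead of wrapping around, which is why no explicit check is generated for floats.

```go
package main

import (
	"fmt"
	"math"
)

// productIsInf reports whether x*y is +Inf or -Inf.
// Hypothetical helper, for illustration only.
func productIsInf(x, y float64) bool {
	return math.IsInf(x*y, 0)
}

func main() {
	// big is a variable, so the multiplication happens at run time.
	big := math.MaxFloat64
	fmt.Println(big * 2)              // +Inf
	fmt.Println(-big * 2)             // -Inf
	fmt.Println(productIsInf(big, 2)) // true
}
```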

These checks are relevant to the SUM_INT aggregator and to projection
operations. In the future, AVG will also benefit from these overflow
checks.

This also changes the error message produced by overflows in the
non-vectorized SUM_INT aggregator so that the vectorized and
non-vectorized messages are consistent. This should be fine in terms of
Postgres compatibility, since SUM_INT is unique to CRDB and we will
eventually get rid of it anyway.

resolves #38775

Release note: None
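
As a rough illustration of the kind of inline integer check described above, here is a minimal Go sketch. It is not the code generated by overloads.go: the names `addInt64`, `subInt64`, and `errIntOutOfRange` are hypothetical, and the sketch returns an error where the generated code panics.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errIntOutOfRange is a hypothetical stand-in for the error raised on overflow.
var errIntOutOfRange = errors.New("integer out of range")

// addInt64 returns a+b, or an error if the sum does not fit in an int64.
// Signed overflow wraps in Go, so the wrapped result can be inspected:
// adding a non-negative b must not decrease a, and adding a negative b must.
func addInt64(a, b int64) (int64, error) {
	result := a + b
	if (result < a) != (b < 0) {
		return 0, errIntOutOfRange
	}
	return result, nil
}

// subInt64 returns a-b, or an error if the difference does not fit in an int64.
// Subtracting a positive b must decrease a; anything else must not.
func subInt64(a, b int64) (int64, error) {
	result := a - b
	if (result < a) != (b > 0) {
		return 0, errIntOutOfRange
	}
	return result, nil
}

func main() {
	fmt.Println(addInt64(1, 2))             // 3 <nil>
	fmt.Println(addInt64(math.MaxInt64, 1)) // 0 integer out of range
	fmt.Println(subInt64(math.MinInt64, 1)) // 0 integer out of range
}
```

The comparison costs one extra branch per element, which is consistent with the moderate slowdowns reported for the plus and minus projections in the benchmarks.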

@rafiss rafiss requested review from solongordon and a team July 18, 2019 20:03
@rafiss
Collaborator Author

rafiss commented Jul 18, 2019

The SUM aggregator results are not too bad overall: the slowdown is at most about 14%. The multiplication projection is affected most, with time per op growing by roughly 10x to 22x.

name                                                                              old time/op    new time/op    delta
Aggregator/SUM/ordered/Int64/groupSize=1/hasNulls=false/numInputBatches=64-24        383µs ± 0%     387µs ± 0%   +0.99%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=1/hasNulls=true/numInputBatches=64-24         707µs ± 1%     724µs ± 1%   +2.42%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=2/hasNulls=false/numInputBatches=64-24        309µs ± 2%     309µs ± 1%     ~     (p=0.684 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=2/hasNulls=true/numInputBatches=64-24         471µs ± 1%     491µs ± 1%   +4.16%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=512/hasNulls=false/numInputBatches=64-24      240µs ± 0%     272µs ± 0%  +13.58%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=512/hasNulls=true/numInputBatches=64-24       346µs ± 0%     382µs ± 0%  +10.49%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=1024/hasNulls=false/numInputBatches=64-24     239µs ± 0%     271µs ± 0%  +13.24%  (p=0.000 n=10+9)
Aggregator/SUM/ordered/Int64/groupSize=1024/hasNulls=true/numInputBatches=64-24      348µs ± 1%     383µs ± 0%  +10.07%  (p=0.000 n=10+10)

name                                                                              old speed      new speed      delta
Aggregator/SUM/ordered/Int64/groupSize=1/hasNulls=false/numInputBatches=64-24     1.37GB/s ± 0%  1.35GB/s ± 0%   -0.98%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=1/hasNulls=true/numInputBatches=64-24       741MB/s ± 1%   724MB/s ± 1%   -2.36%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=2/hasNulls=false/numInputBatches=64-24     1.70GB/s ± 2%  1.70GB/s ± 1%     ~     (p=0.684 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=2/hasNulls=true/numInputBatches=64-24      1.11GB/s ± 1%  1.07GB/s ± 1%   -3.99%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=512/hasNulls=false/numInputBatches=64-24   2.19GB/s ± 0%  1.93GB/s ± 0%  -11.96%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=512/hasNulls=true/numInputBatches=64-24    1.52GB/s ± 0%  1.37GB/s ± 0%   -9.50%  (p=0.000 n=10+10)
Aggregator/SUM/ordered/Int64/groupSize=1024/hasNulls=false/numInputBatches=64-24  2.19GB/s ± 0%  1.94GB/s ± 0%  -11.69%  (p=0.000 n=10+9)
Aggregator/SUM/ordered/Int64/groupSize=1024/hasNulls=true/numInputBatches=64-24   1.51GB/s ± 1%  1.37GB/s ± 0%   -9.15%  (p=0.000 n=10+10)

name                                                            old time/op    new time/op    delta
ProjOp/op=projMinusInt64Int64Op/useSel=true/hasNulls=true-24      1.34µs ± 0%    1.89µs ± 1%    +40.64%  (p=0.000 n=10+9)
ProjOp/op=projMinusInt64Int64Op/useSel=true/hasNulls=false-24      957ns ± 0%    1468ns ± 2%    +53.48%  (p=0.000 n=9+10)
ProjOp/op=projMinusInt64Int64Op/useSel=false/hasNulls=true-24     1.01µs ± 0%    1.58µs ± 1%    +56.61%  (p=0.000 n=8+10)
ProjOp/op=projMinusInt64Int64Op/useSel=false/hasNulls=false-24     646ns ± 0%    1212ns ± 3%    +87.69%  (p=0.000 n=8+10)
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=true-24       1.35µs ± 0%   13.31µs ± 0%   +884.40%  (p=0.000 n=8+10)
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=false-24       958ns ± 0%   12831ns ± 0%  +1239.87%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=true-24      1.02µs ± 1%   14.44µs ± 1%  +1309.27%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=false-24      646ns ± 1%   13970ns ± 1%  +2061.84%  (p=0.000 n=10+10)
ProjOp/op=projDivInt64Int64Op/useSel=true/hasNulls=true-24        12.0µs ± 0%    12.5µs ± 2%     +3.88%  (p=0.000 n=9+10)
ProjOp/op=projDivInt64Int64Op/useSel=true/hasNulls=false-24       11.5µs ± 0%    11.9µs ± 1%     +3.16%  (p=0.000 n=9+10)
ProjOp/op=projDivInt64Int64Op/useSel=false/hasNulls=true-24       12.1µs ± 0%    12.7µs ± 0%     +5.01%  (p=0.000 n=10+10)
ProjOp/op=projDivInt64Int64Op/useSel=false/hasNulls=false-24      11.7µs ± 0%    12.3µs ± 1%     +4.71%  (p=0.000 n=8+9)
ProjOp/op=projPlusInt64Int64Op/useSel=true/hasNulls=true-24       1.35µs ± 1%    1.83µs ± 0%    +35.43%  (p=0.000 n=10+10)
ProjOp/op=projPlusInt64Int64Op/useSel=true/hasNulls=false-24       966ns ± 1%    1413ns ± 0%    +46.24%  (p=0.000 n=10+9)
ProjOp/op=projPlusInt64Int64Op/useSel=false/hasNulls=true-24      1.17µs ± 1%    1.57µs ± 2%    +33.99%  (p=0.000 n=10+10)
ProjOp/op=projPlusInt64Int64Op/useSel=false/hasNulls=false-24      819ns ± 2%    1220ns ± 0%    +48.93%  (p=0.000 n=10+8)

name                                                            old speed      new speed      delta
ProjOp/op=projMinusInt64Int64Op/useSel=true/hasNulls=true-24    12.2GB/s ± 0%   8.7GB/s ± 1%    -28.88%  (p=0.000 n=10+9)
ProjOp/op=projMinusInt64Int64Op/useSel=true/hasNulls=false-24   17.1GB/s ± 0%  11.2GB/s ± 2%    -34.83%  (p=0.000 n=9+10)
ProjOp/op=projMinusInt64Int64Op/useSel=false/hasNulls=true-24   16.2GB/s ± 0%  10.3GB/s ± 1%    -36.08%  (p=0.000 n=10+10)
ProjOp/op=projMinusInt64Int64Op/useSel=false/hasNulls=false-24  25.3GB/s ± 0%  13.5GB/s ± 3%    -46.70%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=true-24     12.1GB/s ± 0%   1.2GB/s ± 0%    -89.83%  (p=0.000 n=9+10)
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=false-24    17.1GB/s ± 0%   1.3GB/s ± 0%    -92.53%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=true-24    16.0GB/s ± 1%   1.1GB/s ± 1%    -92.90%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=false-24   25.3GB/s ± 1%   1.2GB/s ± 1%    -95.37%  (p=0.000 n=10+10)
ProjOp/op=projDivInt64Int64Op/useSel=true/hasNulls=true-24      1.37GB/s ± 0%  1.31GB/s ± 2%     -3.71%  (p=0.000 n=9+10)
ProjOp/op=projDivInt64Int64Op/useSel=true/hasNulls=false-24     1.42GB/s ± 0%  1.38GB/s ± 1%     -3.06%  (p=0.000 n=9+10)
ProjOp/op=projDivInt64Int64Op/useSel=false/hasNulls=true-24     1.35GB/s ± 0%  1.29GB/s ± 0%     -4.77%  (p=0.000 n=10+10)
ProjOp/op=projDivInt64Int64Op/useSel=false/hasNulls=false-24    1.40GB/s ± 0%  1.34GB/s ± 1%     -4.49%  (p=0.000 n=8+9)
ProjOp/op=projPlusInt64Int64Op/useSel=true/hasNulls=true-24     12.1GB/s ± 1%   9.0GB/s ± 0%    -26.15%  (p=0.000 n=10+10)
ProjOp/op=projPlusInt64Int64Op/useSel=true/hasNulls=false-24    16.9GB/s ± 1%  11.6GB/s ± 0%    -31.59%  (p=0.000 n=10+9)
ProjOp/op=projPlusInt64Int64Op/useSel=false/hasNulls=true-24    13.9GB/s ± 1%  10.4GB/s ± 2%    -25.36%  (p=0.000 n=10+10)
ProjOp/op=projPlusInt64Int64Op/useSel=false/hasNulls=false-24   20.0GB/s ± 2%  13.4GB/s ± 0%    -32.85%  (p=0.000 n=10+8)

@solongordon
Contributor

Similar to my suggestion from yesterday, I bet we can make multiplication much faster in the typical case by skipping the overflow check when the ints are sufficiently small. Maybe something clever like checking if int64(int32(x)) == x (in the Int64 case). I don't know if that's exactly right but you get the idea.
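
The cast-based test suggested here can be sketched in Go as follows. This is only an illustration of the idea; the function names `fitsInInt32` and `canSkipMulCheck` are hypothetical.

```go
package main

import "fmt"

// fitsInInt32 reports whether x survives a round trip through int32,
// that is, whether x lies in [-2^31, 2^31-1].
func fitsInInt32(x int64) bool {
	return int64(int32(x)) == x
}

// canSkipMulCheck reports whether a*b is guaranteed not to overflow int64.
// If both operands fit in int32, the magnitude of the product is at most
// 2^31 * 2^31 = 2^62, which is below 2^63.
func canSkipMulCheck(a, b int64) bool {
	return fitsInInt32(a) && fitsInInt32(b)
}

func main() {
	fmt.Println(canSkipMulCheck(1000, -1000)) // true
	fmt.Println(canSkipMulCheck(1<<31, 1))    // false
}
```

The test is conservative: it never skips a check that is needed, but it sends some safe products (for example `1<<31 * 1`) through the slow path.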

@rafiss rafiss force-pushed the aggregator-overflow-handling branch from 6956af2 to 9653a99 Compare July 18, 2019 21:48
@petermattis
Collaborator

I wonder if math/bits.Mul64 and then checking whether the high-bits are non-zero would be faster for multiplication.

@rafiss
Collaborator Author

rafiss commented Jul 19, 2019

Thanks for the pointers, both the casting idea and math/bits.Mul64 sound promising! Mul64 only works with unsigned ints, and there are only 64- and 32-bit variants, but it certainly is worth looking into more. I'll do tests to see if the performance is better if we special case positive ints of that size, and check if the algorithm in that function can be adapted for our use case more generally.
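
One way the unsigned `bits.Mul64` could be adapted to signed operands is to multiply the magnitudes and then check the result against the signed range. The sketch below is an assumption about how that might look, not something taken from this PR; `absUint64` and `mulInt64` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// absUint64 returns |x| as a uint64. It is well defined even for
// math.MinInt64, whose magnitude (2^63) does not fit in an int64.
func absUint64(x int64) uint64 {
	if x < 0 {
		return uint64(-(x+1)) + 1
	}
	return uint64(x)
}

// mulInt64 returns a*b and true if the product fits in an int64,
// or 0 and false if it overflows.
func mulInt64(a, b int64) (int64, bool) {
	hi, lo := bits.Mul64(absUint64(a), absUint64(b))
	if hi != 0 {
		// The magnitude needs more than 64 bits.
		return 0, false
	}
	negative := (a < 0) != (b < 0)
	if negative && lo > 1<<63 {
		// A negative product may have magnitude up to 2^63.
		return 0, false
	}
	if !negative && lo > math.MaxInt64 {
		// A non-negative product may have magnitude up to 2^63-1.
		return 0, false
	}
	return a * b, true
}

func main() {
	fmt.Println(mulInt64(3, -4))             // -12 true
	fmt.Println(mulInt64(math.MinInt64, -1)) // 0 false
}
```

Whether this beats a bounds comparison would have to be measured; the sign handling adds branches that the unsigned case does not need.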

Member

@jordanlewis jordanlewis left a comment

:lgtm: assuming that test failure is expected.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @solongordon)

@rafiss rafiss force-pushed the aggregator-overflow-handling branch from 9653a99 to a11b89f Compare July 19, 2019 20:28
@rafiss
Collaborator Author

rafiss commented Jul 19, 2019

I ended up changing multiplication to do something similar to Solon's idea, but I'm explicitly comparing to the upper and lower bounds so that things are a little more legible. There's a much smaller hit on latency now (although it's still substantial).

name                                                           old time/op    new time/op    delta
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=true-24      1.37µs ± 1%    3.58µs ± 0%  +160.55%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=false-24      976ns ± 1%    3205ns ± 0%  +228.34%  (p=0.000 n=9+8)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=true-24     1.04µs ± 1%    2.72µs ± 0%  +162.45%  (p=0.000 n=10+9)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=false-24     662ns ± 1%    2366ns ± 0%  +257.21%  (p=0.000 n=10+10)

name                                                           old speed      new speed      delta
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=true-24    11.9GB/s ± 1%   4.6GB/s ± 0%   -61.61%  (p=0.000 n=10+10)
ProjOp/op=projMultInt64Int64Op/useSel=true/hasNulls=false-24   16.8GB/s ± 1%   5.1GB/s ± 0%   -69.53%  (p=0.000 n=9+8)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=true-24   15.8GB/s ± 1%   6.0GB/s ± 0%   -61.89%  (p=0.000 n=10+9)
ProjOp/op=projMultInt64Int64Op/useSel=false/hasNulls=false-24  24.7GB/s ± 1%   6.9GB/s ± 0%   -71.99%  (p=0.000 n=10+10)
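
The bounds comparison described in this comment can be sketched roughly as below. The bound values (`math.MaxInt32` and `math.MinInt32` for int64 operands), the division-based slow path, and the names `mulInt64` and `errIntOutOfRange` are all assumptions made for illustration; the PR text does not spell them out, and the sketch returns an error where the generated code panics.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errIntOutOfRange is a hypothetical stand-in for the overflow error.
var errIntOutOfRange = errors.New("integer out of range")

// Assumed bounds: operands inside this range cannot overflow an int64 product.
const (
	upperBound = math.MaxInt32
	lowerBound = math.MinInt32
)

// mulInt64 multiplies first and only runs the expensive check when
// an operand falls outside the safe bounds.
func mulInt64(a, b int64) (int64, error) {
	result := a * b
	if a > upperBound || a < lowerBound || b > upperBound || b < lowerBound {
		if a != 0 && b != 0 {
			sameSign := (a < 0) == (b < 0)
			// A correct product is positive for same-signed operands and
			// negative otherwise, and dividing it by b gives back a.
			if (result < 0) == sameSign || result/b != a {
				return 0, errIntOutOfRange
			}
		}
	}
	return result, nil
}

func main() {
	fmt.Println(mulInt64(6, 7))              // 42 <nil>
	fmt.Println(mulInt64(math.MinInt64, -1)) // 0 integer out of range
}
```

The sign test matters for `math.MinInt64 * -1`: the wrapped product divided by -1 would otherwise equal the original operand and slip past the division test.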

Member

@jordanlewis jordanlewis left a comment

:lgtm:, nice. I have a suggestion to make the templates a little more legible. I'm also a little anxious that we don't have particularly great edge case testing of this still, even with the added tests from you and Matt. Is there a way we could write a quickcheck style random test that's specifically just testing edge behavior for this stuff? Or is it overkill?
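
A quickcheck-style edge test along these lines might look like the following sketch. It is an assumption about one possible shape, not the test that was added to the PR: it draws operands near interesting boundaries and compares a hypothetical `addInt64` against an arbitrary-precision oracle from `math/big`. Values are generated by hand because uniformly random int64 pairs rarely land near the overflow boundary.

```go
package main

import (
	"fmt"
	"math"
	"math/big"
	"math/rand"
)

// addInt64 is the (hypothetical) function under test.
func addInt64(a, b int64) (int64, bool) {
	result := a + b
	if (result < a) != (b < 0) {
		return 0, false
	}
	return result, true
}

// anchors are the boundaries around which test values are generated.
var anchors = []int64{math.MinInt64, math.MinInt32, -1, 0, 1, math.MaxInt32, math.MaxInt64}

// edgeValue returns a value within 2 of a randomly chosen anchor.
// Wrapping at the extremes is harmless: it still yields an edge value.
func edgeValue(rng *rand.Rand) int64 {
	anchor := anchors[rng.Intn(len(anchors))]
	offset := int64(rng.Intn(5)) - 2
	return anchor + offset
}

// checkAdd compares addInt64 with the exact sum computed by math/big.
func checkAdd(a, b int64) bool {
	want := new(big.Int).Add(big.NewInt(a), big.NewInt(b))
	got, ok := addInt64(a, b)
	if !want.IsInt64() {
		return !ok // the exact sum overflows, so an overflow must be reported
	}
	return ok && got == want.Int64()
}

// countFailures runs the property on random edge pairs.
func countFailures(seed int64, iterations int) int {
	rng := rand.New(rand.NewSource(seed))
	failures := 0
	for i := 0; i < iterations; i++ {
		if !checkAdd(edgeValue(rng), edgeValue(rng)) {
			failures++
		}
	}
	return failures
}

func main() {
	fmt.Println("failures:", countFailures(1, 10000))
}
```

The same oracle pattern extends to subtraction and multiplication by swapping `Add` for `Sub` or `Mul`.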

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss and @solongordon)


pkg/sql/exec/execgen/cmd/execgen/overloads.go, line 363 at r2 (raw file):

						panic(tree.ErrIntOutOfRange)
					}
					%[1]s = result

Did you find these numbers confusing while writing this code? As a suggestion, it might be easier to read if you used a text template:

m := map[string]interface{}{"Target": target, "L": l, "R": r}
buf := strings.Builder{}
t := template.Must(template.New("").Parse(`
{
  result := {{.L}} + {{.R}}
...`))
t.Execute(&buf, m)
return buf.String()

(disclaimer: not tested)


pkg/sql/exec/execgen/cmd/execgen/overloads.go, line 389 at r2 (raw file):

			case 8:
				upperBound = "10"
				lowerBound = "-10"

Do we need the int8 case? Oh right, problem is that the type exists in our package, like you were saying. We should probably just delete that type.

Collaborator Author

@rafiss rafiss left a comment

I think you're definitely right that this isn't super well tested. The failing test in the previous revision of the PR was due to a bug/typo I had in the subtraction template, and we got lucky that an unrelated logic test caught it. I was trying to see if I could unit test this overflow handling since there are a lot of edge cases to check, but maybe I should just go ahead and write logic tests for all of the cases.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @solongordon)


pkg/sql/exec/execgen/cmd/execgen/overloads.go, line 363 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Did you find these numbers confusing while writing this code? As a suggestion, it might be easier to read if you used a text template:

m := map[string]interface{}{"Target": target, "L": l, "R": r}
buf := strings.Builder{}
t := template.Must(template.New("").Parse(`
{
  result := {{.L}} + {{.R}}
...`))
t.Execute(&buf, m)
return buf.String()

(disclaimer: not tested)

I'll look into this idea. I did find the %[1]s syntax annoying; I wish there were named format strings like in Python.


pkg/sql/exec/execgen/cmd/execgen/overloads.go, line 389 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Do we need the int8 case? Oh right, problem is that the type exists in our package, like you were saying. We should probably just delete that type.

yeah, would make sense to remove int8 later

@jordanlewis
Member

Writing unit tests seems like the way to go to me. Logic tests are very "bulky" in the sense that they take a lot of lines to do relatively little work. I think a unit test could probably manage to test quite a bit of these edge cases without as much typing.

@solongordon solongordon requested a review from a team July 22, 2019 18:27
@rafiss rafiss force-pushed the aggregator-overflow-handling branch from a11b89f to d7cff34 Compare July 23, 2019 16:19
@rafiss rafiss requested a review from a team July 23, 2019 16:19
@rafiss
Collaborator Author

rafiss commented Jul 23, 2019

I updated the PR so that the overloads use templates with named arguments, and with a more robust unit test suite for all the overflow checks. Please look again :)
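
For readers unfamiliar with the approach, templates with named arguments can be sketched with `text/template` as below. The field names `.Left`, `.Right`, `.UpperBound`, and `.LowerBound` match the excerpt quoted in the review; `.Target`, the `mulArgs` type, the `renderMul` helper, and the sample argument values are hypothetical, and the slow-path body is left as a placeholder because it is not shown in the review.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// mulTemplate generates the multiplication snippet from named fields
// instead of positional %[1]s-style format verbs.
var mulTemplate = template.Must(template.New("mul").Parse(`{
	result := {{.Left}} * {{.Right}}
	if {{.Left}} > {{.UpperBound}} || {{.Left}} < {{.LowerBound}} || {{.Right}} > {{.UpperBound}} || {{.Right}} < {{.LowerBound}} {
		// ... slow-path overflow check (not shown in the review excerpt) ...
	}
	{{.Target}} = result
}`))

// mulArgs holds the expressions substituted into the template.
type mulArgs struct {
	Target, Left, Right, UpperBound, LowerBound string
}

// renderMul executes the template and returns the generated Go source.
func renderMul(args mulArgs) (string, error) {
	var buf strings.Builder
	if err := mulTemplate.Execute(&buf, args); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderMul(mulArgs{
		Target:     "projCol[i]",
		Left:       "col1[i]",
		Right:      "col2[i]",
		UpperBound: "math.MaxInt32",
		LowerBound: "math.MinInt32",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because each placeholder is named, the same expression can appear several times in the template without repeating positional indexes.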

Member

@jordanlewis jordanlewis left a comment

The third :lgtm: is the charm. Excellent work!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss and @solongordon)


pkg/sql/exec/overloads_test.go, line 122 at r4 (raw file):

}

func assertIntegerEquals(t *testing.T, expected, actual int) {

You can use the testify package for these helpers if you want. assert.Equal and assert.PanicsWithValue.


pkg/sql/exec/execgen/cmd/execgen/overloads.go, line 414 at r4 (raw file):

				{
					result := {{.Left}} * {{.Right}}
					if {{.Left}} > {{.UpperBound}} || {{.Left}} < {{.LowerBound}} || {{.Right}} > {{.UpperBound}} || {{.Right}} < {{.LowerBound}} {

This is readable! Yay!

Collaborator Author

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @rafiss, and @solongordon)


pkg/sql/exec/overloads_test.go, line 122 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

You can use the testify package for these helpers if you want. assert.Equal and assert.PanicsWithValue.

ah nice, switched to the library

@rafiss rafiss force-pushed the aggregator-overflow-handling branch from d7cff34 to 1dc97ee Compare July 23, 2019 16:55
@rafiss
Collaborator Author

rafiss commented Jul 23, 2019

thanks for helping me iterate!

bors r+

craig bot pushed a commit that referenced this pull request Jul 23, 2019
38967: exec: overflow handling for vectorized arithmetic r=rafiss a=rafiss

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@craig
Contributor

craig bot commented Jul 23, 2019

Build succeeded

@craig craig bot merged commit 1dc97ee into cockroachdb:master Jul 23, 2019
@rafiss rafiss deleted the aggregator-overflow-handling branch July 23, 2019 17:35