feat: Speed up symbol parsing by minimizing allocations #258
I tried running the benchmarks against

```diff
@@ -175,7 +175,11 @@ func TestUtf8Validation(t *testing.T) {
 	var sym Symbol
 	for i := 0; i < b.N; i++ {
 		occ := allOccurrences[i]
-		_ = parsePartialSymbolV2(occ.Symbol, true, &sym)
+		err = parsePartialSymbolV2(occ.Symbol, true, &sym)
+		if err != nil {
+			panic(fmt.Sprintf("Failed to parse '%s' with %s", occ.Symbol, err))
+			// fmt.Printf("Failed to parse '%s' with %s", occ.Symbol, err)
+		}
 	}
 }
 stdUtf8ValidationOnly := func(b *simpleBenchmark) {
```
```go
package internal
```

We can remove this code after the new symbol parser doesn't show any problems in practice.
```go
package shared

func IsSimpleIdentifierCharacter(c rune) bool {
	return c == '_' || c == '+' || c == '-' || c == '$' || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9')
}
```

Changing the ordering of the comparisons doesn't seem to have any noticeable effect in benchmarks, so I'm leaving this code as-is to match the order in scip.proto.
```go
}

func (x *Package) ID() string {
	return fmt.Sprintf("%s %s %s", x.Manager, x.Name, x.Version)
}

func ValidateSymbolUTF8(symbol beaut.UTF8String) error {
```

Adding this helper function for better readability at call-sites, since upload ingestion requires filtering out occurrences/SymbolInformation values with malformed symbols.
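To illustrate the point about call-site readability, here is a minimal sketch of what a validation-only helper might look like. The function name `validateSymbol` and the structural check are hypothetical stand-ins; the real `ValidateSymbolUTF8` delegates to the shared parser in validation mode.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// validateSymbol is a hypothetical stand-in for ValidateSymbolUTF8:
// call-sites that only need to filter out malformed symbols can call
// this one-liner instead of parsing and discarding the result.
func validateSymbol(symbol string) error {
	if !utf8.ValidString(symbol) {
		return fmt.Errorf("symbol is not valid UTF-8: %q", symbol)
	}
	// Toy structural check (the real parser is much stricter):
	// local symbols look like "local <id>"; non-local symbols have
	// "<scheme> <manager> <name> <version> <descriptors>".
	if strings.HasPrefix(symbol, "local ") {
		return nil
	}
	if len(strings.SplitN(symbol, " ", 5)) < 5 {
		return fmt.Errorf("symbol has too few space-separated fields: %q", symbol)
	}
	return nil
}

func main() {
	fmt.Println(validateSymbol("scip-go gomod example 1.0 Foo#")) // <nil>
	fmt.Println(validateSymbol("\xff") != nil)                    // true
}
```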
```go
	"strings"

	"github.com/cockroachdb/errors"
```

This file is best reviewed in 'Split diff' view.
```go
//
// Unlike ParseSymbol, this skips UTF-8 validation. To customize
// parsing behavior, use ParseSymbolUTF8With.
func ParseSymbolUTF8(symbol beaut.UTF8String) (*Symbol, error) {
```

I've decided to add a new function here instead of modifying the signature of the existing ParseSymbol function, to avoid gratuitously breaking callers, since it's not hard to maintain back-compat.
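The back-compat approach described here can be sketched as follows: keep the old entry point with its original signature and have it validate, then delegate to the new UTF-8-assuming variant. The `Symbol` type and parser bodies below are simplified stand-ins, not the real implementation.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Symbol is a simplified stand-in for scip.Symbol.
type Symbol struct{ Raw string }

// parseSymbolUTF8 assumes its input is already well-formed UTF-8,
// mirroring ParseSymbolUTF8 in spirit (the real parser does far more).
func parseSymbolUTF8(symbol string) (*Symbol, error) {
	if symbol == "" {
		return nil, errors.New("empty symbol")
	}
	return &Symbol{Raw: symbol}, nil
}

// parseSymbol keeps the old signature: validate first, then delegate.
// Existing callers are untouched; callers that already hold validated
// strings can skip straight to parseSymbolUTF8.
func parseSymbol(symbol string) (*Symbol, error) {
	if !utf8.ValidString(symbol) {
		return nil, errors.New("symbol is not valid UTF-8")
	}
	return parseSymbolUTF8(symbol)
}

func main() {
	s, err := parseSymbol("scip-go gomod example 1.0 Foo#")
	fmt.Println(s.Raw, err)
}
```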
```go
	if s.current() == r {
		s.index++
		return nil

func ParseSymbolUTF8With(symbol beaut.UTF8String, options ParseSymbolOptions) error {
```

This function takes a ParseSymbolOptions struct so that we can add more options in the future without breaking back-compat.
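The options-struct pattern mentioned here can be sketched briefly. The struct and field names below are hypothetical; the point is that callers construct the struct with named fields, so adding a field later is not a breaking change.

```go
package main

import "fmt"

// parseSymbolOptions mirrors the ParseSymbolOptions idea (names and
// fields here are hypothetical): because callers pass a struct rather
// than positional booleans, new fields can be added later without
// breaking existing call-sites, and zero values keep old behavior.
type parseSymbolOptions struct {
	IncludeDescriptors bool
	// Future knobs go here.
}

// countFields is a toy stand-in for the real parser: it "parses" a
// space-separated symbol, optionally skipping the descriptor field.
func countFields(symbol string, opts parseSymbolOptions) int {
	n := 0
	inField := false
	for _, c := range symbol {
		if c == ' ' {
			inField = false
		} else if !inField {
			inField = true
			n++
		}
	}
	if !opts.IncludeDescriptors && n > 0 {
		n-- // pretend the last field holds the descriptors
	}
	return n
}

func main() {
	sym := "scip-go gomod example 1.0 Foo#"
	fmt.Println(countFields(sym, parseSymbolOptions{}))                         // descriptors skipped
	fmt.Println(countFields(sym, parseSymbolOptions{IncludeDescriptors: true})) // all fields
}
```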
See PR for benchmarks.
Very nice! I think the performance wins are worth the extra complexity in the parser.
I added a couple comments/questions/suggestions, but nothing major.
```diff
@@ -1,4 +1,4 @@
-golang 1.20.14
+golang 1.22.0
```
Should we go to 1.23.x (1.23.3 at the moment) while we're at it?
That's potentially too restrictive since we're providing a library. The Go toolchain typically provides support for ~2 major versions, and a bunch of OSS libraries do the same.
In principle, we could split the code into different modules so that we can aggressively bump the version for the CLI and leave the version bound for the bindings lower, but that would make things more complicated, so not doing that for now.
```go
	SymbolString    string
	byteIndex       int
	currentRune     rune
	bytesToNextRune int32
```

This would be cheap to derive from currentRune; is there a particular reason why you're storing it separately?
We don't expect to be constructing lots of parser objects, so I'm not concerned about potential memory usage. However, I thought it didn't make sense to have an extra branch (even if it's well predicted) in the common case to identify the length from the rune, since we already have the value computed anyways.
```go
// Pre-condition: string is well-formed UTF-8
// Pre-condition: byteIndex is in bounds
func findRuneAtIndex(s string, byteIndex int) (r rune, bytesRead int32) {
```

I was surprised to see this function, given that Go supports string slicing. Does manually tracking the byte offset, rather than continuously slicing the input string, have a noticeable performance impact? Otherwise we could be using https://pkg.go.dev/unicode/utf8#DecodeRune
If you look at the function here, it's much more complicated than what we have, as it also tries to handle the invalid UTF-8 case.
https://sourcegraph.com/github.com/golang/go/-/blob/src/unicode/utf8/utf8.go?L205-243
I suspect it's probably slower given that it's doing more work (we just have 1 indexing operation + 1 comparison on the fastest path), but I have not benchmarked it.
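The fast path under discussion can be sketched like this: for well-formed UTF-8, an ASCII byte needs one indexing operation plus one comparison, and only multi-byte sequences fall back to the stdlib decoder. The function below is an illustrative stand-in for `findRuneAtIndex`, not the actual implementation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRuneAt sketches the fast path: ASCII is decoded with a single
// index + comparison, and only non-ASCII lead bytes pay for the full
// stdlib decoder (which also handles invalid input we don't need here).
//
// Pre-condition: s is well-formed UTF-8 and byteIndex is in bounds.
func decodeRuneAt(s string, byteIndex int) (r rune, width int32) {
	b := s[byteIndex]
	if b < utf8.RuneSelf { // ASCII fast path: one index, one comparison
		return rune(b), 1
	}
	rr, w := utf8.DecodeRuneInString(s[byteIndex:])
	return rr, int32(w)
}

func main() {
	r, w := decodeRuneAt("a€b", 1)
	fmt.Printf("%q spans %d bytes\n", r, w)
}
```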
```go
	errorCaseByteNotFound
)

// TODO: Enable https://github.com/nishanths/exhaustive in CI
```

Intentional TODO for the future, or something you meant to do as part of this PR?
For the future, not this PR.
```go
		occ := allOccurrences[i]
		_, err = internal.ParsePartialSymbolV1ToBeDeleted(occ.Symbol, true)
		if err != nil {
			//panic(fmt.Sprintf("v1: index path: %v: error: %v", path, err))
```

Should we at least collect these errors, and check that both parsers errored on the same symbols?
That is handled by TestParseCompat; we're deliberately dropping them here. Added a comment explaining that.
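A TestParseCompat-style check, as described here, boils down to running both parsers over the same inputs and requiring that they agree on which ones are errors. The parser stand-ins below are hypothetical placeholders for the v1 and v2 entry points.

```go
package main

import "fmt"

// parseV1 and parseV2 are toy stand-ins for the old and new parsers.
func parseV1(s string) error {
	if s == "" {
		return fmt.Errorf("v1: empty symbol")
	}
	return nil
}

func parseV2(s string) error {
	if s == "" {
		return fmt.Errorf("v2: empty symbol")
	}
	return nil
}

// checkCompat requires both parsers to agree on error-vs-success for
// every symbol, which is the compat property the test enforces.
func checkCompat(symbols []string) error {
	for _, s := range symbols {
		e1, e2 := parseV1(s), parseV2(s)
		if (e1 == nil) != (e2 == nil) {
			return fmt.Errorf("parsers disagree on %q: v1=%v v2=%v", s, e1, e2)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkCompat([]string{"scip-go gomod example 1.0 Foo#", ""}))
}
```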
Speeeeeeed! :)
NOTE: The diff looks large, but most of it comes from scheme changes caused by a small comment that I added.
What
Changes the symbol parsing logic to minimize allocations. In particular, when we only care about validating symbols (e.g. during document canonicalization when ingesting uploads), there is really no need to allocate any strings at all. Validation and parsing share most of the underlying code -- the only change is that we create "writer" types which discard writes (and hence any internal buffer growth) when we're only in validation mode.
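The discarding-writer idea can be sketched as follows. The names here (`symbolWriter`, `discardWriter`) are illustrative, not the PR's actual types; the point is that the shared parsing code writes through an interface and never knows whether its output is kept.

```go
package main

import (
	"fmt"
	"strings"
)

// symbolWriter abstracts over "parse" vs "validate-only" modes.
type symbolWriter interface {
	WriteString(s string)
	String() string
}

// buildingWriter keeps the output (parse mode).
type buildingWriter struct{ b strings.Builder }

func (w *buildingWriter) WriteString(s string) { w.b.WriteString(s) }
func (w *buildingWriter) String() string       { return w.b.String() }

// discardWriter drops every write, so validation mode never grows a
// buffer and never allocates for output.
type discardWriter struct{}

func (discardWriter) WriteString(string) {}
func (discardWriter) String() string     { return "" }

// copyName stands in for the shared parsing code: identical in both
// modes, oblivious to whether the output is retained.
func copyName(w symbolWriter, name string) {
	w.WriteString(name)
}

func main() {
	keep := &buildingWriter{}
	copyName(keep, "Foo#")
	copyName(discardWriter{}, "Foo#") // validation mode: output dropped
	fmt.Println(keep.String())
}
```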
Why
Ideally, we want to validate all symbols that we enter into the DB (and we also want fast splitting of symbols), so it's valuable to keep the overhead as low as possible. In the validation case, we only make minimal heap allocations in the error case (there is a test which ensures we don't allocate in the non-error cases).
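A zero-allocation test of the kind mentioned here is typically written with `testing.AllocsPerRun`. The sketch below uses `utf8.ValidString` as a stand-in for the real validation path; the assertion shape is what matters.

```go
package main

import (
	"fmt"
	"testing"
	"unicode/utf8"
)

// validate stands in for the zero-allocation validation path.
func validate(s string) bool { return utf8.ValidString(s) }

func main() {
	sym := "scip-go gomod example 1.0 Foo#"
	// AllocsPerRun reports the average number of heap allocations per
	// invocation; the happy path should report exactly 0.
	allocs := testing.AllocsPerRun(100, func() {
		if !validate(sym) {
			panic("unexpected invalid symbol")
		}
	})
	fmt.Println("allocs per run:", allocs)
}
```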
Benchmarks
I ran some benchmarks with sample SCIP indexes located here (Sourcegraph-internal):
https://drive.google.com/drive/folders/1z62Se7eHaa5T89a16-y7s0Z1qbRY4VCg
Once the indexes are decompressed into dev/sample_indexes, you can run the benchmarks. Symbol parse (v1) represents the older symbol parsing logic; symbol parse (v2) represents the newer symbol parsing logic. I also added a validation helper function on top of the newer parser; that is also benchmarked separately.
Symbol parse (v2) is noticeably slower than validation because of allocations needed so that symbol parse (v1) and (v2) can be compared fairly -- the old parsing logic would always return a new *scip.Symbol, so it'd be an unfair comparison if we just pre-allocated everything for benchmarking (v2). It is possible to get symbol parse (v2) to validation-level speed by pre-allocating arrays of scip.Package and scip.Descriptor values up-front and passing pointers into those in successive calls to ParseSymbolUTF8With.
There is one surprising case, marked with a (*), where parsing with the newer parser is much slower when only testing against 1000 occurrences -- that seems to be somewhat reproducible. However, I haven't spent time investigating it because the speed improvements are present at higher occurrence counts.
Test plan
Added compatibility tests for the old parser vs the new parser and ran them against a bunch of existing indexes.