feat: Speed up symbol parsing by minimizing allocations #258
I tried running the benchmarks against

```diff
@@ -175,7 +175,11 @@ func TestUtf8Validation(t *testing.T) {
 	var sym Symbol
 	for i := 0; i < b.N; i++ {
 		occ := allOccurrences[i]
-		_ = parsePartialSymbolV2(occ.Symbol, true, &sym)
+		err = parsePartialSymbolV2(occ.Symbol, true, &sym)
+		if err != nil {
+			panic(fmt.Sprintf("Failed to parse '%s' with %s", occ.Symbol, err))
+			// fmt.Printf("Failed to parse '%s' with %s", occ.Symbol, err)
+		}
 	}
 }
 stdUtf8ValidationOnly := func(b *simpleBenchmark) {
```
```go
package internal
```

We can remove this code after the new symbol parser doesn't show any problems in practice.
```go
package shared

func IsSimpleIdentifierCharacter(c rune) bool {
	return c == '_' || c == '+' || c == '-' || c == '$' || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9')
}
```

Changing the ordering of the comparisons doesn't seem to have any noticeable effect in benchmarks, so I'm leaving this code as-is to match the order in scip.proto.
```go
}

func (x *Package) ID() string {
	return fmt.Sprintf("%s %s %s", x.Manager, x.Name, x.Version)
}

func ValidateSymbolUTF8(symbol beaut.UTF8String) error {
```

Adding this helper function for better readability at call-sites, since upload ingestion requires filtering out occurrences/SymbolInformation values with malformed symbols.
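To illustrate the point about call-site readability, here is a minimal sketch of what a validation-only helper might look like. The function name `validateSymbol` and the structural check are hypothetical stand-ins; the real `ValidateSymbolUTF8` delegates to the shared parser in validation mode.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// validateSymbol is a hypothetical stand-in for ValidateSymbolUTF8:
// call-sites that only need to filter out malformed symbols can call
// this one-liner instead of parsing and discarding the result.
func validateSymbol(symbol string) error {
	if !utf8.ValidString(symbol) {
		return fmt.Errorf("symbol is not valid UTF-8: %q", symbol)
	}
	// Toy structural check (the real parser is much stricter):
	// local symbols look like "local <id>"; non-local symbols have
	// "<scheme> <manager> <name> <version> <descriptors>".
	if strings.HasPrefix(symbol, "local ") {
		return nil
	}
	if len(strings.SplitN(symbol, " ", 5)) < 5 {
		return fmt.Errorf("symbol has too few space-separated fields: %q", symbol)
	}
	return nil
}

func main() {
	fmt.Println(validateSymbol("scip-go gomod example 1.0 Foo#")) // <nil>
	fmt.Println(validateSymbol("\xff") != nil)                    // true
}
```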
```go
	"strings"

	"github.com/cockroachdb/errors"
```

This file is best reviewed in 'Split diff' view.
```go
//
// Unlike ParseSymbol, this skips UTF-8 validation. To customize
// parsing behavior, use ParseSymbolUTF8With.
func ParseSymbolUTF8(symbol beaut.UTF8String) (*Symbol, error) {
```

I've decided to add a new function here instead of modifying the signature of the existing ParseSymbol function, to avoid gratuitously breaking callers, since it's not hard to maintain back-compat.
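The back-compat approach described here can be sketched as follows: keep the old entry point with its original signature and have it validate, then delegate to the new UTF-8-assuming variant. The `Symbol` type and parser bodies below are simplified stand-ins, not the real implementation.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Symbol is a simplified stand-in for scip.Symbol.
type Symbol struct{ Raw string }

// parseSymbolUTF8 assumes its input is already well-formed UTF-8,
// mirroring ParseSymbolUTF8 in spirit (the real parser does far more).
func parseSymbolUTF8(symbol string) (*Symbol, error) {
	if symbol == "" {
		return nil, errors.New("empty symbol")
	}
	return &Symbol{Raw: symbol}, nil
}

// parseSymbol keeps the old signature: validate first, then delegate.
// Existing callers are untouched; callers that already hold validated
// strings can skip straight to parseSymbolUTF8.
func parseSymbol(symbol string) (*Symbol, error) {
	if !utf8.ValidString(symbol) {
		return nil, errors.New("symbol is not valid UTF-8")
	}
	return parseSymbolUTF8(symbol)
}

func main() {
	s, err := parseSymbol("scip-go gomod example 1.0 Foo#")
	fmt.Println(s.Raw, err)
}
```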
```go
	if s.current() == r {
		s.index++
		return nil

func ParseSymbolUTF8With(symbol beaut.UTF8String, options ParseSymbolOptions) error {
```

This function takes a ParseSymbolOptions struct so that we can add more options in the future without breaking back-compat.
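The options-struct pattern mentioned here can be sketched briefly. The struct and field names below are hypothetical; the point is that callers construct the struct with named fields, so adding a field later is not a breaking change.

```go
package main

import "fmt"

// parseSymbolOptions mirrors the ParseSymbolOptions idea (names and
// fields here are hypothetical): because callers pass a struct rather
// than positional booleans, new fields can be added later without
// breaking existing call-sites, and zero values keep old behavior.
type parseSymbolOptions struct {
	IncludeDescriptors bool
	// Future knobs go here.
}

// countFields is a toy stand-in for the real parser: it "parses" a
// space-separated symbol, optionally skipping the descriptor field.
func countFields(symbol string, opts parseSymbolOptions) int {
	n := 0
	inField := false
	for _, c := range symbol {
		if c == ' ' {
			inField = false
		} else if !inField {
			inField = true
			n++
		}
	}
	if !opts.IncludeDescriptors && n > 0 {
		n-- // pretend the last field holds the descriptors
	}
	return n
}

func main() {
	sym := "scip-go gomod example 1.0 Foo#"
	fmt.Println(countFields(sym, parseSymbolOptions{}))                         // descriptors skipped
	fmt.Println(countFields(sym, parseSymbolOptions{IncludeDescriptors: true})) // all fields
}
```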
See PR for benchmarks.
Very nice! I think the performance wins are worth the extra complexity in the parser.
I added a couple comments/questions/suggestions, but nothing major.
```diff
@@ -1,4 +1,4 @@
-golang 1.20.14
+golang 1.22.0
```
Should we go to 1.23.x (1.23.3 at the moment) while we're at it?
That's potentially too restrictive since we're providing a library. The Go toolchain typically provides support for ~2 major versions, and a bunch of OSS libraries do the same.
In principle, we could split the code into different modules so that we can aggressively bump the version for the CLI and leave the version bound for the bindings lower, but that would make things more complicated, so not doing that for now.
```go
	SymbolString    string
	byteIndex       int
	currentRune     rune
	bytesToNextRune int32
```

This would be cheap to derive from currentRune; is there a particular reason why you're storing it separately?
We don't expect to be constructing lots of parser objects, so I'm not concerned about potential memory usage. However, I thought it didn't make sense to have an extra branch (even if it's well predicted) in the common case to identify the length from the rune, since we already have the value computed anyways.
```go
// Pre-condition: string is well-formed UTF-8
// Pre-condition: byteIndex is in bounds
func findRuneAtIndex(s string, byteIndex int) (r rune, bytesRead int32) {
```

I was surprised to see this function, given that Go supports string slicing. Does manually tracking the byte offset, rather than continuously slicing the input string, have a noticeable performance impact? Otherwise we could be using https://pkg.go.dev/unicode/utf8#DecodeRune
If you look at the function here, it's much more complicated than what we have, as it also tries to handle the invalid UTF-8 case.
https://sourcegraph.com/github.com/golang/go/-/blob/src/unicode/utf8/utf8.go?L205-243
I suspect it's probably slower given that it's doing more work (we just have 1 indexing operation + 1 comparison on the fastest path), but I have not benchmarked it.
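The fast path under discussion can be sketched like this: for well-formed UTF-8, an ASCII byte needs one indexing operation plus one comparison, and only multi-byte sequences fall back to the stdlib decoder. The function below is an illustrative stand-in for `findRuneAtIndex`, not the actual implementation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRuneAt sketches the fast path: ASCII is decoded with a single
// index + comparison, and only non-ASCII lead bytes pay for the full
// stdlib decoder (which also handles invalid input we don't need here).
//
// Pre-condition: s is well-formed UTF-8 and byteIndex is in bounds.
func decodeRuneAt(s string, byteIndex int) (r rune, width int32) {
	b := s[byteIndex]
	if b < utf8.RuneSelf { // ASCII fast path: one index, one comparison
		return rune(b), 1
	}
	rr, w := utf8.DecodeRuneInString(s[byteIndex:])
	return rr, int32(w)
}

func main() {
	r, w := decodeRuneAt("a€b", 1)
	fmt.Printf("%q spans %d bytes\n", r, w)
}
```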
```go
	errorCaseByteNotFound
)

// TODO: Enable https://github.com/nishanths/exhaustive in CI
```

Intentional TODO for the future, or something you meant to do as part of this PR?
For the future, not this PR.
```go
		occ := allOccurrences[i]
		_, err = internal.ParsePartialSymbolV1ToBeDeleted(occ.Symbol, true)
		if err != nil {
			//panic(fmt.Sprintf("v1: index path: %v: error: %v", path, err))
```

Should we at least collect these errors, and check that both parsers errored on the same symbols?
That is handled by TestParseCompat; we're deliberately dropping them here. Added a comment explaining that.
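A TestParseCompat-style check, as described here, boils down to running both parsers over the same inputs and requiring that they agree on which ones are errors. The parser stand-ins below are hypothetical placeholders for the v1 and v2 entry points.

```go
package main

import "fmt"

// parseV1 and parseV2 are toy stand-ins for the old and new parsers.
func parseV1(s string) error {
	if s == "" {
		return fmt.Errorf("v1: empty symbol")
	}
	return nil
}

func parseV2(s string) error {
	if s == "" {
		return fmt.Errorf("v2: empty symbol")
	}
	return nil
}

// checkCompat requires both parsers to agree on error-vs-success for
// every symbol, which is the compat property the test enforces.
func checkCompat(symbols []string) error {
	for _, s := range symbols {
		e1, e2 := parseV1(s), parseV2(s)
		if (e1 == nil) != (e2 == nil) {
			return fmt.Errorf("parsers disagree on %q: v1=%v v2=%v", s, e1, e2)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkCompat([]string{"scip-go gomod example 1.0 Foo#", ""}))
}
```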
Speeeeeeed! :)
NOTE: The diff looks large, but most of it comes from scheme changes caused by a small comment that I added.
What
Changes the symbol parsing logic to minimize allocations. In particular, when we only care about validating symbols (e.g. during document canonicalization when ingesting uploads), there is really no need to allocate any strings at all. Validation and parsing share most of the underlying code -- the only change is that we create "writer" types which discard writes (and hence any internal buffer growth) when we're only in validation mode.
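The discarding-writer idea can be sketched as follows. The names here (`symbolWriter`, `discardWriter`) are illustrative, not the PR's actual types; the point is that the shared parsing code writes through an interface and never knows whether its output is kept.

```go
package main

import (
	"fmt"
	"strings"
)

// symbolWriter abstracts over "parse" vs "validate-only" modes.
type symbolWriter interface {
	WriteString(s string)
	String() string
}

// buildingWriter keeps the output (parse mode).
type buildingWriter struct{ b strings.Builder }

func (w *buildingWriter) WriteString(s string) { w.b.WriteString(s) }
func (w *buildingWriter) String() string       { return w.b.String() }

// discardWriter drops every write, so validation mode never grows a
// buffer and never allocates for output.
type discardWriter struct{}

func (discardWriter) WriteString(string) {}
func (discardWriter) String() string     { return "" }

// copyName stands in for the shared parsing code: identical in both
// modes, oblivious to whether the output is retained.
func copyName(w symbolWriter, name string) {
	w.WriteString(name)
}

func main() {
	keep := &buildingWriter{}
	copyName(keep, "Foo#")
	copyName(discardWriter{}, "Foo#") // validation mode: output dropped
	fmt.Println(keep.String())
}
```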
Why
Ideally, we want to validate all symbols that we enter into the DB (and we also want fast splitting of symbols), so it's valuable to keep the overhead as low as possible. In the validation case, we only make minimal heap allocations in the error case (there is a test which ensures we don't allocate in the non-error cases).
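A zero-allocation test of the kind mentioned here is typically written with `testing.AllocsPerRun`. The sketch below uses `utf8.ValidString` as a stand-in for the real validation path; the assertion shape is what matters.

```go
package main

import (
	"fmt"
	"testing"
	"unicode/utf8"
)

// validate stands in for the zero-allocation validation path.
func validate(s string) bool { return utf8.ValidString(s) }

func main() {
	sym := "scip-go gomod example 1.0 Foo#"
	// AllocsPerRun reports the average number of heap allocations per
	// invocation; the happy path should report exactly 0.
	allocs := testing.AllocsPerRun(100, func() {
		if !validate(sym) {
			panic("unexpected invalid symbol")
		}
	})
	fmt.Println("allocs per run:", allocs)
}
```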
Benchmarks
I ran some benchmarks with sample SCIP indexes located here (Sourcegraph-internal):
https://drive.google.com/drive/folders/1z62Se7eHaa5T89a16-y7s0Z1qbRY4VCg
Once the indexes are decompressed into dev/sample_indexes, you can run the benchmarks. Symbol parse (v1) represents the older symbol parsing logic; symbol parse (v2) represents the newer symbol parsing logic. I also added a validation helper function on top of the newer parser; that is also benchmarked separately.
Symbol parse (v2) is noticeably slower than validation because of allocations needed so that symbol parse (v1) and (v2) can be compared fairly -- the old parsing logic would always return a new *scip.Symbol, so it'd be an unfair comparison if we just pre-allocated everything for benchmarking (v2). It is possible to get symbol parse (v2) to validation-level speed by pre-allocating arrays of scip.Package and scip.Descriptor values up-front and passing pointers into those in successive calls to ParseSymbolUTF8With.
There is one surprising case, marked with a (*), where parsing with the newer parser is much slower when only testing against 1000 occurrences -- that seems to be somewhat reproducible. However, I haven't spent time investigating it because the speed improvements are present at higher occurrence counts.
Test plan
Added compatibility tests for the old parser vs the new parser and ran them against a bunch of existing indexes.