Optimize serialization of []any and map[string]any #617

Open · wants to merge 7 commits into 5.0
Conversation

@zolstein (Author)

Improve serialization performance by specializing slices or maps of any, and optimizing reflective looping through maps.

Special-casing []any and especially map[string]any, which are common parameter types, eliminates the allocation overhead of looping through these values.

Using MapRange rather than MapKeys and MapIndex (and Value.SetIterKey/SetIterValue rather than MapIter.Key/Value) significantly reduces allocations when looping through other map types.
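
For illustration, a minimal sketch of both ideas, reusing the names from the snippets quoted later in this thread (x is the incoming value; v and t are its reflect.Value and reflect.Type in the fallback). Length/header writes are elided; this is not the PR's exact code:

	// Fast path: a concrete type switch handles the common parameter
	// types without any reflection.
	switch m := x.(type) {
	case map[string]any:
		for k, v := range m {
			o.packer.String(k)
			o.packX(v)
		}
		return
	}

	// Reflective fallback: MapRange plus reusable key/value slots replaces
	// MapKeys/MapIndex, which allocate on every entry.
	key := reflect.New(t.Key()).Elem()
	value := reflect.New(t.Elem()).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		value.SetIterValue(r)
		o.packer.String(key.String())
		o.packX(value.Interface())
	}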

@robsdedude (Member)

Hello, thank you for taking the time to contribute to a Neo4j project, we appreciate all community engagement.
Before we review a change, we require that users have signed our Contributor License Agreement (CLA).
For more details on signing our CLA please see: https://neo4j.com/developer/cla/

@zolstein (Author)

zolstein commented Jan 29, 2025

@robsdedude I did send an email agreeing to the license last month. Is there some reason it didn't go through? Is it associated with the wrong email address? In any case, I've sent another agreement from a different email address; hopefully that resolves the issue.

@robsdedude (Member)

One might think our CLA process is automated... Well, it's only partially 😅 Thanks, I'll look into performing the steps to get CI to pass and will review your PR in the next few days.

@robsdedude (Member)

robsdedude commented Jan 31, 2025

FYI, I've been having a look and the speed-up looks very promising! I've not spotted any logic issues either 💪 great job! But you've piqued my curiosity. I'm currently looking into whether there's a way to generalize this optimization so that it applies to even more cases.

@robsdedude (Member)

I didn't succeed :/ Everything I tried either didn't improve the situation or made it even worse. What I've done, however, is add all the map[string]xyz short paths that also exist for []xyz.

Here are some benchmarks I quickly threw together to validate that this PR makes things faster:

### BEFORE ###

BenchmarkOutgoing
BenchmarkOutgoing/pack_[]any
BenchmarkOutgoing/pack_[]any-16                                      	 4242537	       250.7 ns/op
BenchmarkOutgoing/pack_[]bool
BenchmarkOutgoing/pack_[]bool-16                                     	 2663608	       455.2 ns/op
BenchmarkOutgoing/pack_map[string]any
BenchmarkOutgoing/pack_map[string]any-16                             	  943662	      1342 ns/op
BenchmarkOutgoing/pack_map[string]string
BenchmarkOutgoing/pack_map[string]string-16                          	 3308748	       375.1 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_string)
BenchmarkOutgoing/pack_map[string]any_(any_always_string)-16         	  873051	      1281 ns/op
BenchmarkOutgoing/pack_map[string][]string
BenchmarkOutgoing/pack_map[string][]string-16                        	  803454	      1547 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)-16       	  825868	      1558 ns/op
BenchmarkOutgoing/pack_[]byte
BenchmarkOutgoing/pack_[]byte-16                                     	45481484	        34.66 ns/op


### AFTER ###

BenchmarkOutgoing
BenchmarkOutgoing/pack_[]any
BenchmarkOutgoing/pack_[]any-16                                      	 5936780	       178.7 ns/op
BenchmarkOutgoing/pack_[]bool
BenchmarkOutgoing/pack_[]bool-16                                     	 2564455	       472.9 ns/op
BenchmarkOutgoing/pack_map[string]any
BenchmarkOutgoing/pack_map[string]any-16                             	 3001713	       407.4 ns/op
BenchmarkOutgoing/pack_map[string]string
BenchmarkOutgoing/pack_map[string]string-16                          	 3258835	       386.6 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_string)
BenchmarkOutgoing/pack_map[string]any_(any_always_string)-16         	 3530245	       374.7 ns/op
BenchmarkOutgoing/pack_map[string][]string
BenchmarkOutgoing/pack_map[string][]string-16                        	  857400	      1437 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)-16       	 1801963	       584.2 ns/op
BenchmarkOutgoing/pack_[]byte
BenchmarkOutgoing/pack_[]byte-16                                     	35928717	        39.06 ns/op

@robsdedude (Member) left a comment

🏇💨

Thanks again for taking the time to craft this PR 🙇

@zolstein (Author)

zolstein commented Feb 1, 2025

Unfortunately, my experience is that it's quite hard to optimize iterating a map with reflection - it's always much faster if you can statically convert to a known map type.

I did find a couple other optimizations, though I'm not sure if you'd want to take them.

The first is that I noticed that MapIter.Value only allocates when the map's values are not pointer types. Therefore, you can branch on the map's value type: if the values are pointer types, skip allocating a value slot and read them via MapIter.Value directly; otherwise, use the code as written.

	// isPointerType is a hypothetical helper reporting whether values of
	// this type are pointer-shaped, in which case r.Value() won't allocate.
	if isPointerType(t.Elem()) {
		r := v.MapRange()
		for r.Next() {
			key.SetIterKey(r)
			o.packer.String(key.String())
			o.packX(r.Value().Interface()) // no allocation for pointer-shaped values
		}
	} else {
		// Preallocate one value slot outside the loop and copy each entry
		// into it, instead of letting r.Value() allocate per entry.
		value := reflect.New(t.Elem()).Elem()
		r := v.MapRange()
		for r.Next() {
			key.SetIterKey(r)
			value.SetIterValue(r)
			o.packer.String(key.String())
			o.packX(value.Interface())
		}
	}

The second is that you can put an extra string field on the outgoing struct and use it as the place to store the key value. This avoids an allocation per map, which is especially useful when there are nested structures. However, the naive version breaks if the map's keys are a named string type, and the only workaround I found requires a bit of unsafe code.

	key := reflect.NewAt(t.Key(), unsafe.Pointer(&o.stringSlot)).Elem()
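
For illustration, a sketch of that workaround, assuming a hypothetical stringSlot string field were added to the outgoing struct (not part of this PR as-is):

	// key aliases o.stringSlot's memory but carries the map's actual key
	// type, so SetIterKey also accepts named string types; every key is
	// written into the same reused slot instead of a fresh allocation.
	key := reflect.NewAt(t.Key(), unsafe.Pointer(&o.stringSlot)).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		o.packer.String(o.stringSlot) // same bytes key now points at
		o.packX(r.Value().Interface())
	}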

@zolstein (Author)

zolstein commented Feb 1, 2025

Also, FYI, it's often helpful to include the -benchmem flag when benchmarking, so you also get data on allocations. That's especially true here, since allocations drive most of the performance difference.
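
For example (package path assumed):

	go test -run '^$' -bench BenchmarkOutgoing -benchmem ./neo4j/internal/bolt/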

@robsdedude (Member)

Great pointers 👏 🚀 Thank you very much once again 🙇

I actually managed to squeeze a bit more performance out of the packing code by avoiding some unnecessary jumping back and forth between reflect.Value and any.
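
Presumably something of this shape (a sketch, not the actual diff; the PR's packV is referenced in the review below):

	// packX unwraps the interface once; packV then stays in reflect.Value
	// space, so nested containers avoid any -> reflect.Value round trips.
	func (o *outgoing) packX(x any) {
		o.packV(reflect.ValueOf(x))
	}

	func (o *outgoing) packV(v reflect.Value) {
		switch v.Kind() {
		case reflect.Interface:
			o.packV(v.Elem())
		case reflect.Slice:
			for i := 0; i < v.Len(); i++ {
				o.packV(v.Index(i)) // no Interface()/ValueOf per element
			}
		// ... remaining kinds elided ...
		}
	}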

neo4j/internal/bolt/outgoing.go (review thread, since outdated)
o.packer.Bytes(v.Bytes())
return
case reflect.Int, reflect.Int64, reflect.String, reflect.Float64, reflect.Interface:
if v.Len() > 5 {
@zolstein (Author)

This probably bears actually measuring, at least on one machine. My intuition is that the crossover point at which it's worth taking the allocation is likely to be either "v.Len() > 0" or "basically never."

If I remember correctly what reflection is doing under the hood, there's actually not too much overhead reflectively iterating over and indexing into a slice, since it can just track the values as pointers into the slice until you convert to an interface. So the fast-paths mostly just save you looking up the type of the value. (But the branch predictor might make that easy anyway?) Maybe the better solution here is to have fast-paths here that loop through the slices reflectively, but handle the types directly?
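
A sketch of that last idea (packer method names beyond String/Bytes are assumptions, not confirmed API):

	// Loop reflectively but dispatch on the element kind once, handling
	// elements directly instead of boxing each one into any for packX.
	switch t.Elem().Kind() {
	case reflect.String:
		for i := 0; i < v.Len(); i++ {
			o.packer.String(v.Index(i).String())
		}
	case reflect.Int, reflect.Int64:
		for i := 0; i < v.Len(); i++ {
			o.packer.Int64(v.Index(i).Int()) // Int64 is an assumed method
		}
	}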

I had looked a while ago at replacing the packX logic with something that worked predominantly on reflect.Value objects and avoided converting back to interfaces. If you're looking at having a packV anyway, it might be better to make packX a really thin wrapper and move most of the logic and fast-pathing directly into packV, rather than duplicating most of the logic and moving back and forth between them.

@zolstein (Author)

Actually, I took a crack at profiling this, and, at least on my machine:

  1. Going through the fast-path in packX is, in fact, notably faster.
  2. It works out that "v.Len() > 0" is the closer bet than "never" or even "v.Len() > 5".
  3. The Interface() call didn't allocate, which I honestly can't quite explain.
  4. The version of the optimization I suggested didn't seem to help.

So... nice find. Personally, I'd stick with it and probably change it to "v.Len() > 0" unless there's some other pathological case. In my testing, the actual crossover point was more like "v.Len() > 1", but at that point the difference was ~1 ns.

@robsdedude (Member) commented Feb 5, 2025

"arbitrary guess" is a bit of a lie 😅 I actually did some benchmarking to determine a rough value that seemed to be close to the cross-over point where the allocation gets amortized (at least on my machine).The

Interface() call didn't allocate, which I honestly can't quite explain.

Now this is most peculiar 😮 because in my testing, that call did allocate. But I assume the fact that it didn't for you explains why the cross-over point is > 0 for you, but was higher for me. For a slice with only 1 element, taking the v.Interface() route makes the packing more than 2 as slow on my system. So I'd rather not 🙃.

For the record: I tested on Go 1.23.0 on linux amd64

@robsdedude (Member) commented Feb 5, 2025

For that reason, I decided to stick with the minimum size check being some small number, as apparently on some systems (like mine) not having it makes things considerably worse, while on other systems (like yours) it doesn't make much of a difference.

This avoids choosing the would-be fast-path in `packV` when the slice type
is not actually one that benefits from running through `packX`.

See also neo4j#617 (comment)
@robsdedude (Member)

robsdedude commented Feb 5, 2025

Here are some benchmarks of the current state of the PR

# +++++ BEFORE +++++
BenchmarkOutgoing
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________-16         	  632931	      1592 ns/op	     541 B/op	       6 allocs/op
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_-16         	  795153	      1551 ns/op	     539 B/op	       6 allocs/op
BenchmarkOutgoing/pack_[]bool__________________________________________
BenchmarkOutgoing/pack_[]bool__________________________________________-16         	 2363528	       467.3 ns/op	      98 B/op	      13 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]string____________________
BenchmarkOutgoing/pack_map[string]map[string]string____________________-16         	 1648632	       738.9 ns/op	     378 B/op	       3 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)-16         	  648488	      1777 ns/op	     742 B/op	      17 allocs/op
BenchmarkOutgoing/pack_[]*int__________________________________________
BenchmarkOutgoing/pack_[]*int__________________________________________-16         	 4342869	       272.4 ns/op	     145 B/op	       4 allocs/op
PASS


# +++++ AFTER +++++
BenchmarkOutgoing
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________-16         	  896643	      1296 ns/op	     494 B/op	       6 allocs/op
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_-16         	  896656	      1484 ns/op	     350 B/op	       0 allocs/op
BenchmarkOutgoing/pack_[]bool__________________________________________
BenchmarkOutgoing/pack_[]bool__________________________________________-16         	 8330124	       133.1 ns/op	      73 B/op	       0 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]string____________________
BenchmarkOutgoing/pack_map[string]map[string]string____________________-16         	 1588008	       748.1 ns/op	     333 B/op	       2 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)-16         	 1493149	       804.4 ns/op	     353 B/op	       2 allocs/op
BenchmarkOutgoing/pack_[]*int__________________________________________
BenchmarkOutgoing/pack_[]*int__________________________________________-16         	 8745207	       136.0 ns/op	     109 B/op	       0 allocs/op
PASS
neo4j/internal/bolt/outgoing_bench_test.go
package bolt

import (
	"github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/packstream"
	"testing"
)

func BenchmarkOutgoing(outer *testing.B) {
	type workload struct {
		description string
		data        any
	}

	type NewAny any

	someInt := 123456789

	workloads := []workload{
		{
			"pack [][]any (any always int)                        ",
			[][]any{
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
			},
		},

		{
			"pack [][]NewAny (NewAny new-type of any, always int) ",
			[][]NewAny{
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
			},
		},

		{
			"pack []bool                                          ",
			[]bool{true, false, true, false, true, true, true, true, false, false, true, false, false},
		},

		{
			"pack map[string]map[string]string                    ",
			map[string]map[string]string{
				"hello": {"world": "world", "oh": "oh", "hello": "world"},
				"int":   {"1": "1", "2": "2", "3": "3"},
			},
		},

		{
			"pack map[string]map[string]any (any is always string)",
			map[string]map[string]any{
				"hello": {"world": "world", "oh": "oh", "hello": "world"},
				"int":   {"1": "1", "2": "2", "3": "3"},
			},
		},

		{
			"pack []*int                                          ",
			[]*int{&someInt, &someInt, &someInt, &someInt},
		},
	}

	for _, load := range workloads {
		outer.Run(load.description, func(inner *testing.B) {
			out := &outgoing{
				chunker:   newChunker(),
				packer:    packstream.Packer{},
				onPackErr: func(err error) { inner.Error(err) },
			}
			for _i := 0; _i < inner.N; _i++ {
				out.packX(load.data)
			}
		})
	}
}

@zolstein (Author)

zolstein commented Feb 5, 2025

@robsdedude A note on your benchmark code: you're not resetting the buffer inside the Packer between benchmark iterations, which means the work done per iteration isn't consistent - some iterations will need to resize the buffer, others won't. Bytes allocated per op theoretically average out, but the number of allocations won't. IMO it's better to pre-allocate a buffer big enough to avoid resizes and reset it between runs. That way, you can isolate the performance of the serialization code. You also might want to trigger a GC (and reset the timer) between benchmarks.

buffer := make([]byte, 0, 1024*1024) // pre-sized so packing never needs to grow it
...
	out := &outgoing{
		chunker:   newChunker(),
		packer:    packstream.Packer{},
		onPackErr: func(err error) { inner.Error(err) },
	}
	runtime.GC()       // start each benchmark from a clean heap ("runtime" import needed)
	inner.ResetTimer() // exclude setup and GC from the measurement
	for _i := 0; _i < inner.N; _i++ {
		out.packer.Begin(buffer) // reset the packer onto the reusable buffer
		out.packX(load.data)
	}
