Optimize serialization of []any and map[string]any #617

Open · wants to merge 7 commits into 5.0
Conversation

@zolstein (Author)

Improve serialization performance by specializing slices or maps of any, and optimizing reflective looping through maps.

Special-casing []any and especially map[string]any, which are common parameter types, eliminates the allocation overhead of looping through these values.

Using MapRange rather than MapKeys and MapIndex (and Value.SetIterKey/SetIterValue rather than MapIter.Key/Value) significantly reduces allocations when looping through other map types.
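
For illustration, a minimal sketch of both ideas, reusing the names from the snippets quoted later in this thread (x is the incoming value; v and t are its reflect.Value and reflect.Type in the fallback). Length/header writes are elided; this is not the PR's exact code:

	// Fast path: a concrete type switch handles the common parameter
	// types without any reflection.
	switch m := x.(type) {
	case map[string]any:
		for k, v := range m {
			o.packer.String(k)
			o.packX(v)
		}
		return
	}

	// Reflective fallback: MapRange plus reusable key/value slots replaces
	// MapKeys/MapIndex, which allocate on every entry.
	key := reflect.New(t.Key()).Elem()
	value := reflect.New(t.Elem()).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		value.SetIterValue(r)
		o.packer.String(key.String())
		o.packX(value.Interface())
	}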

@robsdedude (Member)

Hello, thank you for taking the time to contribute to a Neo4j project, we appreciate all community engagement.
Before we review a change, we require that users have signed our Contributor License Agreement (CLA).
For more details on signing our CLA please see: https://neo4j.com/developer/cla/

@zolstein (Author)

zolstein commented Jan 29, 2025

@robsdedude I did send an email agreeing to the license last month. Is there some reason it didn't go through? Is it associated with the wrong email address? In any case, I've sent another agreement from a different email address; hopefully that resolves the issue.

@robsdedude (Member)

One might think our CLA process is automated... Well, it's only partially 😅 Thanks, I'll look into performing the steps to get CI to pass and will review your PR in the next few days.

@robsdedude (Member)

robsdedude commented Jan 31, 2025

FYI, I've been having a look and the speed-up looks very promising! I've not spotted any logic issues either 💪 great job! But you've piqued my curiosity. I'm currently looking into whether there's a way to generalize this optimization so that it applies to even more cases.

@robsdedude (Member)

I didn't succeed :/ Everything I tried either didn't improve the situation or made it even worse. What I've done, however, is add all the map[string]xyz short paths that also exist for []xyz.

Here are some benchmarks I quickly threw together to validate that this PR makes things faster:

### BEFORE ###

BenchmarkOutgoing
BenchmarkOutgoing/pack_[]any
BenchmarkOutgoing/pack_[]any-16                                      	 4242537	       250.7 ns/op
BenchmarkOutgoing/pack_[]bool
BenchmarkOutgoing/pack_[]bool-16                                     	 2663608	       455.2 ns/op
BenchmarkOutgoing/pack_map[string]any
BenchmarkOutgoing/pack_map[string]any-16                             	  943662	      1342 ns/op
BenchmarkOutgoing/pack_map[string]string
BenchmarkOutgoing/pack_map[string]string-16                          	 3308748	       375.1 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_string)
BenchmarkOutgoing/pack_map[string]any_(any_always_string)-16         	  873051	      1281 ns/op
BenchmarkOutgoing/pack_map[string][]string
BenchmarkOutgoing/pack_map[string][]string-16                        	  803454	      1547 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)-16       	  825868	      1558 ns/op
BenchmarkOutgoing/pack_[]byte
BenchmarkOutgoing/pack_[]byte-16                                     	45481484	        34.66 ns/op


### AFTER ###

BenchmarkOutgoing
BenchmarkOutgoing/pack_[]any
BenchmarkOutgoing/pack_[]any-16                                      	 5936780	       178.7 ns/op
BenchmarkOutgoing/pack_[]bool
BenchmarkOutgoing/pack_[]bool-16                                     	 2564455	       472.9 ns/op
BenchmarkOutgoing/pack_map[string]any
BenchmarkOutgoing/pack_map[string]any-16                             	 3001713	       407.4 ns/op
BenchmarkOutgoing/pack_map[string]string
BenchmarkOutgoing/pack_map[string]string-16                          	 3258835	       386.6 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_string)
BenchmarkOutgoing/pack_map[string]any_(any_always_string)-16         	 3530245	       374.7 ns/op
BenchmarkOutgoing/pack_map[string][]string
BenchmarkOutgoing/pack_map[string][]string-16                        	  857400	      1437 ns/op
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)
BenchmarkOutgoing/pack_map[string]any_(any_always_[]string)-16       	 1801963	       584.2 ns/op
BenchmarkOutgoing/pack_[]byte
BenchmarkOutgoing/pack_[]byte-16                                     	35928717	        39.06 ns/op

@robsdedude (Member) left a comment

🏇💨

Thanks again for taking the time to craft this PR 🙇

@zolstein (Author)

zolstein commented Feb 1, 2025

Unfortunately, my experience is that it's quite hard to optimize iterating a map with reflection - it's always much faster if you can statically convert to a known map type.

I did find a couple other optimizations, though I'm not sure if you'd want to take them.

The first is that I noticed that MapIter.Value only allocates when the map's values are not pointer types. Therefore, you can branch on the map's value type: if the values are pointer types, skip allocating a value slot and read them via MapIter.Value directly; otherwise, use the code as written.

	// isPointerType is a hypothetical helper reporting whether values of
	// this type are pointer-shaped, in which case r.Value() won't allocate.
	if isPointerType(t.Elem()) {
		r := v.MapRange()
		for r.Next() {
			key.SetIterKey(r)
			o.packer.String(key.String())
			o.packX(r.Value().Interface()) // no allocation for pointer-shaped values
		}
	} else {
		// Preallocate one value slot outside the loop and copy each entry
		// into it, instead of letting r.Value() allocate per entry.
		value := reflect.New(t.Elem()).Elem()
		r := v.MapRange()
		for r.Next() {
			key.SetIterKey(r)
			value.SetIterValue(r)
			o.packer.String(key.String())
			o.packX(value.Interface())
		}
	}

The second is that you can put an extra string field on the outgoing struct and use it as the place to store the key value. This avoids an allocation per map, which is especially useful when there are nested structures. However, the naive version breaks if the map's keys are a named string type, and the only workaround I found requires a bit of unsafe code.

	key := reflect.NewAt(t.Key(), unsafe.Pointer(&o.stringSlot)).Elem()
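
For illustration, a sketch of that workaround, assuming a hypothetical stringSlot string field were added to the outgoing struct (not part of this PR as-is):

	// key aliases o.stringSlot's memory but carries the map's actual key
	// type, so SetIterKey also accepts named string types; every key is
	// written into the same reused slot instead of a fresh allocation.
	key := reflect.NewAt(t.Key(), unsafe.Pointer(&o.stringSlot)).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		o.packer.String(o.stringSlot) // same bytes key now points at
		o.packX(r.Value().Interface())
	}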

@zolstein (Author)

zolstein commented Feb 1, 2025

Also, FYI, it's often helpful to include the -benchmem flag when benchmarking, so you also get data on allocations. That's especially true here, since allocations drive most of the performance difference.
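
For example (package path assumed):

	go test -run '^$' -bench BenchmarkOutgoing -benchmem ./neo4j/internal/bolt/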

@robsdedude (Member)

Great pointers 👏 🚀 Thank you very much once again 🙇

I actually managed to squeeze a bit more performance out of the packing code by avoiding some unnecessary jumping back and forth between reflect.Value and any.
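
Presumably something of this shape (a sketch, not the actual diff; the PR's packV is referenced in the review below):

	// packX unwraps the interface once; packV then stays in reflect.Value
	// space, so nested containers avoid any -> reflect.Value round trips.
	func (o *outgoing) packX(x any) {
		o.packV(reflect.ValueOf(x))
	}

	func (o *outgoing) packV(v reflect.Value) {
		switch v.Kind() {
		case reflect.Interface:
			o.packV(v.Elem())
		case reflect.Slice:
			for i := 0; i < v.Len(); i++ {
				o.packV(v.Index(i)) // no Interface()/ValueOf per element
			}
		// ... remaining kinds elided ...
		}
	}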

neo4j/internal/bolt/outgoing.go (review thread, since outdated)
o.packer.Bytes(v.Bytes())
return
case reflect.Int, reflect.Int64, reflect.String, reflect.Float64, reflect.Interface:
if v.Len() > 5 {
@zolstein (Author)

This probably bears actually measuring, at least on one machine. My intuition is that the crossover point at which it's worth taking the allocation is likely to be either "v.Len() > 0" or "basically never."

If I remember correctly what reflection is doing under the hood, there's actually not too much overhead reflectively iterating over and indexing into a slice, since it can just track the values as pointers into the slice until you convert to an interface. So the fast-paths mostly just save you looking up the type of the value. (But the branch predictor might make that easy anyway?) Maybe the better solution here is to have fast-paths here that loop through the slices reflectively, but handle the types directly?
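
A sketch of that last idea (packer method names beyond String/Bytes are assumptions, not confirmed API):

	// Loop reflectively but dispatch on the element kind once, handling
	// elements directly instead of boxing each one into any for packX.
	switch t.Elem().Kind() {
	case reflect.String:
		for i := 0; i < v.Len(); i++ {
			o.packer.String(v.Index(i).String())
		}
	case reflect.Int, reflect.Int64:
		for i := 0; i < v.Len(); i++ {
			o.packer.Int64(v.Index(i).Int()) // Int64 is an assumed method
		}
	}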

I had looked a while ago at replacing the packX logic with something that worked predominantly on reflect.Value objects and avoided converting back to interfaces. If you're looking at having a packV anyway, it might be better to make packX a really thin wrapper and move most of the logic and fast-pathing directly into packV, rather than duplicating most of the logic and moving back and forth between them.

@zolstein (Author)

Actually, I took a crack at profiling this, and, at least on my machine:

  1. Going through the fast-path in packX is, in fact, notably faster.
  2. It works out that "v.Len() > 0" is the closer bet than "never" or even "v.Len() > 5".
  3. The Interface() call didn't allocate, which I honestly can't quite explain.
  4. The version of the optimization I suggested didn't seem to help.

So... nice find. Personally, I'd stick with it and probably change it to "v.Len() > 0" unless there's some other pathological case. In my testing, the actual crossover point was more like "v.Len() > 1", but at that point the difference was ~1 ns.

@robsdedude (Member) commented Feb 5, 2025

"arbitrary guess" is a bit of a lie 😅 I actually did some benchmarking to determine a rough value that seemed to be close to the cross-over point where the allocation gets amortized (at least on my machine).The

Interface() call didn't allocate, which I honestly can't quite explain.

Now this is most peculiar 😮 because in my testing, that call did allocate. But I assume the fact that it didn't for you explains why the cross-over point is > 0 for you, but was higher for me. For a slice with only 1 element, taking the v.Interface() route makes the packing more than 2 as slow on my system. So I'd rather not 🙃.

For the record: I tested on Go 1.23.0 on linux amd64

@robsdedude (Member) commented Feb 5, 2025

For that reason, I decided to stick with the minimum size check being some small number, as apparently on some systems (like mine) not having it makes things considerably worse, while on other systems (like yours) it doesn't make much of a difference.

This avoids choosing the would-be fast-path in `packV` when the slice type
is not actually one that benefits from running through `packX`.

See also neo4j#617 (comment)
@robsdedude (Member)

robsdedude commented Feb 5, 2025

Here are some benchmarks of the current state of the PR

# +++++ BEFORE +++++
BenchmarkOutgoing
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________-16         	  632931	      1592 ns/op	     541 B/op	       6 allocs/op
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_-16         	  795153	      1551 ns/op	     539 B/op	       6 allocs/op
BenchmarkOutgoing/pack_[]bool__________________________________________
BenchmarkOutgoing/pack_[]bool__________________________________________-16         	 2363528	       467.3 ns/op	      98 B/op	      13 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]string____________________
BenchmarkOutgoing/pack_map[string]map[string]string____________________-16         	 1648632	       738.9 ns/op	     378 B/op	       3 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)-16         	  648488	      1777 ns/op	     742 B/op	      17 allocs/op
BenchmarkOutgoing/pack_[]*int__________________________________________
BenchmarkOutgoing/pack_[]*int__________________________________________-16         	 4342869	       272.4 ns/op	     145 B/op	       4 allocs/op
PASS


# +++++ AFTER +++++
BenchmarkOutgoing
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________
BenchmarkOutgoing/pack_[][]any_(any_always_int)________________________-16         	  896643	      1296 ns/op	     494 B/op	       6 allocs/op
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_
BenchmarkOutgoing/pack_[][]NewAny_(NewAny_new-type_of_any,_always_int)_-16         	  896656	      1484 ns/op	     350 B/op	       0 allocs/op
BenchmarkOutgoing/pack_[]bool__________________________________________
BenchmarkOutgoing/pack_[]bool__________________________________________-16         	 8330124	       133.1 ns/op	      73 B/op	       0 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]string____________________
BenchmarkOutgoing/pack_map[string]map[string]string____________________-16         	 1588008	       748.1 ns/op	     333 B/op	       2 allocs/op
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)
BenchmarkOutgoing/pack_map[string]map[string]any_(any_is_always_string)-16         	 1493149	       804.4 ns/op	     353 B/op	       2 allocs/op
BenchmarkOutgoing/pack_[]*int__________________________________________
BenchmarkOutgoing/pack_[]*int__________________________________________-16         	 8745207	       136.0 ns/op	     109 B/op	       0 allocs/op
PASS
neo4j/internal/bolt/outgoing_bench_test.go
package bolt

import (
	"github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/packstream"
	"testing"
)

func BenchmarkOutgoing(outer *testing.B) {
	type workload struct {
		description string
		data        any
	}

	type NewAny any

	someInt := 123456789

	workloads := []workload{
		{
			"pack [][]any (any always int)                        ",
			[][]any{
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
			},
		},

		{
			"pack [][]NewAny (NewAny new-type of any, always int) ",
			[][]NewAny{
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
				{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
			},
		},

		{
			"pack []bool                                          ",
			[]bool{true, false, true, false, true, true, true, true, false, false, true, false, false},
		},

		{
			"pack map[string]map[string]string                    ",
			map[string]map[string]string{
				"hello": {"world": "world", "oh": "oh", "hello": "world"},
				"int":   {"1": "1", "2": "2", "3": "3"},
			},
		},

		{
			"pack map[string]map[string]any (any is always string)",
			map[string]map[string]any{
				"hello": {"world": "world", "oh": "oh", "hello": "world"},
				"int":   {"1": "1", "2": "2", "3": "3"},
			},
		},

		{
			"pack []*int                                          ",
			[]*int{&someInt, &someInt, &someInt, &someInt},
		},
	}

	for _, load := range workloads {
		outer.Run(load.description, func(inner *testing.B) {
			out := &outgoing{
				chunker:   newChunker(),
				packer:    packstream.Packer{},
				onPackErr: func(err error) { inner.Error(err) },
			}
			for _i := 0; _i < inner.N; _i++ {
				out.packX(load.data)
			}
		})
	}
}

@zolstein (Author)

zolstein commented Feb 5, 2025

@robsdedude A note on your benchmark code: you're not resetting the buffer inside the Packer between benchmark iterations, which means the work done per iteration isn't consistent - some iterations will need to resize the buffer, others won't. Bytes allocated per op theoretically average out, but the number of allocations won't. IMO it's better to pre-allocate a buffer big enough to avoid resizes and reset it between runs. That way, you can isolate the performance of the serialization code. You also might want to trigger a GC (and reset the timer) between benchmarks.

buffer := make([]byte, 0, 1024*1024) // pre-sized so packing never needs to grow it
...
	out := &outgoing{
		chunker:   newChunker(),
		packer:    packstream.Packer{},
		onPackErr: func(err error) { inner.Error(err) },
	}
	runtime.GC()       // start each benchmark from a clean heap ("runtime" import needed)
	inner.ResetTimer() // exclude setup and GC from the measurement
	for _i := 0; _i < inner.N; _i++ {
		out.packer.Begin(buffer) // reset the packer onto the reusable buffer
		out.packX(load.data)
	}
