Optimize serialization of []any and map[string]any #617
base: 5.0
Conversation
Hello, thank you for taking the time to contribute to a Neo4j project; we appreciate all community engagement.
@robsdedude I did send an email agreeing to the license last month. Is there some reason it didn't go through? Is it associated with the wrong email address? In any case, I've sent another agreement associated with a different email address; hopefully that resolves the issue.
One might think our CLA process is automated... Well, it's only partially 😅 Thanks, I'll look into performing the steps to get CI to pass and review your PR in the next days.
FYI, I've been having a look and the speed-up looks very promising! I've not spotted any logic issues either 💪 great job! But you've piqued my curiosity. I'm currently looking into whether there's a way to generalize this optimization so that it applies to even more cases. |
I didn't succeed :/ Everything I tried either didn't improve the situation or made it even worse. What I've done, however, is added all […] Here are some benchmarks I quickly threw together to validate that this PR makes things faster:
🏇💨
Thanks again for taking the time to craft this PR 🙇
Unfortunately, my experience is that it's quite hard to optimize iterating a map with reflection - it's always much faster if you can statically convert to a known map type. I did find a couple of other optimizations, though I'm not sure if you'd want to take them.

The first is that I noticed that using MapIter.Value only allocates when the map values are not pointer-types. Therefore, you can branch on the map value type and avoid allocating a slot for the value if it has pointers, otherwise use the code as written.

if isPointerType(t.Elem()) {
	// Pointer-typed values: MapIter.Value doesn't allocate, so use it directly.
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		o.packer.String(key.String())
		o.packX(r.Value().Interface())
	}
} else {
	// Non-pointer values: reuse a single preallocated slot via SetIterValue.
	value := reflect.New(t.Elem()).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		value.SetIterValue(r)
		o.packer.String(key.String())
		o.packX(value.Interface())
	}
}

The second is that you can put an extra string field in the outgoing struct and back the key value with it, so the key slot doesn't have to be allocated either:

key := reflect.NewAt(t.Key(), unsafe.Pointer(&o.stringSlot)).Elem()
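For illustration, here is a minimal, self-contained sketch of that second suggestion; the type and field names (out, stringSlot, packMap) are assumptions for this sketch, not code from the PR:

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// out mimics the driver's outgoing struct with an extra scratch field.
type out struct {
	stringSlot string // scratch space reused for every map key (assumed name)
}

// packMap iterates a map[string]T reflectively; the key Value is backed by
// o.stringSlot, so SetIterKey reuses the same storage on every iteration
// instead of allocating a new slot per key.
func (o *out) packMap(v reflect.Value) {
	key := reflect.NewAt(v.Type().Key(), unsafe.Pointer(&o.stringSlot)).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		fmt.Println(key.String(), r.Value().Interface()) // stand-in for the packer calls
	}
}

func main() {
	o := &out{}
	o.packMap(reflect.ValueOf(map[string]any{"a": 1, "b": "x"}))
}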
Also, FYI, it's often helpful to include the […]
Great pointers 👏 🚀 Thank you very much once again 🙇 I actually managed to squeeze a bit more performance out of the packing code by avoiding some unnecessary jumping back and forth between […]
neo4j/internal/bolt/outgoing.go (Outdated)

	o.packer.Bytes(v.Bytes())
	return
case reflect.Int, reflect.Int64, reflect.String, reflect.Float64, reflect.Interface:
	if v.Len() > 5 {
This probably bears actually measuring, at least on one machine. My intuition is that the crossover point at which it's worth taking the allocation is likely to be either "v.Len() > 0" or "basically never."

If I remember correctly what reflection is doing under the hood, there's actually not too much overhead in reflectively iterating over and indexing into a slice, since it can just track the values as pointers into the slice until you convert to an interface. So the fast-paths mostly just save you looking up the type of the value. (But the branch predictor might make that easy anyway?) Maybe the better solution here is to have fast-paths that loop through the slices reflectively, but handle the types directly?

I had looked a bit ago at replacing the packX logic with something that worked predominantly on reflect.Value objects and avoided converting back to interfaces. If you're looking at having a packV anyway, it might be better to make packX a really thin wrapper and try to move most of the logic and fast-pathing directly here, rather than duplicating most of the logic and moving back and forth between them.
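As a rough illustration of "loop reflectively, but handle the types directly" (a sketch only; the function name and the Println stand-ins are assumptions, not the driver's code): the kind switch happens once per slice, and the per-element calls avoid converting back to an interface except in the fallback.

package main

import (
	"fmt"
	"reflect"
)

// packSlice dispatches on the element kind once, then reads elements via the
// typed reflect accessors (Int, String) instead of boxing them as interfaces.
func packSlice(v reflect.Value) {
	switch v.Type().Elem().Kind() {
	case reflect.Int, reflect.Int64:
		for i := 0; i < v.Len(); i++ {
			fmt.Println("int:", v.Index(i).Int()) // stand-in for packer.Int64
		}
	case reflect.String:
		for i := 0; i < v.Len(); i++ {
			fmt.Println("string:", v.Index(i).String()) // stand-in for packer.String
		}
	default:
		for i := 0; i < v.Len(); i++ {
			fmt.Println("other:", v.Index(i).Interface()) // fall back to interface boxing
		}
	}
}

func main() {
	packSlice(reflect.ValueOf([]int{1, 2, 3}))
	packSlice(reflect.ValueOf([]string{"a", "b"}))
}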
Actually, I took a crack at profiling this, and, at least on my machine:
- Going through the fast-path in packX is, in fact, notably faster.
- It works out that "v.Len() > 0" is the closer bet than "never" or even "v.Len() > 5".
- The Interface() call didn't allocate, which I honestly can't quite explain.
- The version of the optimization I suggested didn't seem to help.

So... nice find. Personally, I'd stick with it and probably change it to "v.Len() > 0" unless there's some other pathological case. In my testing, the actual crossover point was more like "v.Len() > 1", but at that point the difference was ~1ns.
"arbitrary guess" is a bit of a lie 😅 I actually did some benchmarking to determine a rough value that seemed to be close to the cross-over point where the allocation gets amortized (at least on my machine).The
Interface() call didn't allocate, which I honestly can't quite explain.
Now this is most peculiar 😮 because in my testing, that call did allocate. But I assume the fact that it didn't for you explains why the cross-over point is > 0
for you, but was higher for me. For a slice with only 1 element, taking the v.Interface()
route makes the packing more than 2 as slow on my system. So I'd rather not 🙃.
For the record: I tested on Go 1.23.0 on linux amd64
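For what it's worth, whether the Interface() call allocates on a given Go version/platform can be checked directly with testing.AllocsPerRun; the following is a generic standalone sketch, not code from the PR, and results will differ across setups (which may explain the differing crossover points above):

package main

import (
	"fmt"
	"reflect"
	"testing"
)

// Measures how many allocations per run a reflective Index(i).Interface()
// loop costs for a small []any on the local machine.
func main() {
	v := reflect.ValueOf([]any{1, 2, 3, 4, 5})
	var sink any
	allocs := testing.AllocsPerRun(1000, func() {
		for i := 0; i < v.Len(); i++ {
			sink = v.Index(i).Interface()
		}
	})
	fmt.Println("allocs per run:", allocs, sink != nil)
}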
For that reason, I decided to stick with the minimum size check being some small number, as apparently on some systems (like mine) not having it makes things considerably worse, while it doesn't make much of a difference on other systems (like yours).
This avoids choosing the would-be fast-path in `packV` when the slice type is not actually one that benefits from running through `packX`. See also neo4j#617 (comment)
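A minimal sketch of what such a guard could look like (the helper name is an assumption for this sketch, and the kind list simply mirrors the fast-path kinds in the diff above; it is not the PR's actual code):

package main

import (
	"fmt"
	"reflect"
)

// sliceBenefitsFromPackX reports whether a slice's element kind is one of the
// kinds the packX fast-path handles directly, so packV can skip delegating
// for slice types that wouldn't benefit.
func sliceBenefitsFromPackX(t reflect.Type) bool {
	switch t.Elem().Kind() {
	case reflect.Int, reflect.Int64, reflect.String, reflect.Float64, reflect.Interface:
		return true
	}
	return false
}

func main() {
	fmt.Println(sliceBenefitsFromPackX(reflect.TypeOf([]int{})))      // true
	fmt.Println(sliceBenefitsFromPackX(reflect.TypeOf([]struct{}{}))) // false
}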
Here are some benchmarks of the current state of the PR
neo4j/internal/bolt/outgoing_bench_test.go

package bolt
import (
	"testing"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/packstream"
)
func BenchmarkOutgoing(outer *testing.B) {
type workload struct {
description string
data any
}
type NewAny any
someInt := 123456789
workloads := []workload{
{
"pack [][]any (any always int) ",
[][]any{
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
},
},
{
"pack [][]NewAny (NewAny new-type of any, always int) ",
[][]NewAny{
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
},
},
{
"pack []bool ",
[]bool{true, false, true, false, true, true, true, true, false, false, true, false, false},
},
{
"pack map[string]map[string]string ",
map[string]map[string]string{
"hello": {"world": "world", "oh": "oh", "hello": "world"},
"int": {"1": "1", "2": "2", "3": "3"},
},
},
{
"pack map[string]map[string]any (any is always string)",
map[string]map[string]any{
"hello": {"world": "world", "oh": "oh", "hello": "world"},
"int": {"1": "1", "2": "2", "3": "3"},
},
},
{
"pack []*int ",
[]*int{&someInt, &someInt, &someInt, &someInt},
},
}
for _, load := range workloads {
outer.Run(load.description, func(inner *testing.B) {
out := &outgoing{
chunker: newChunker(),
packer: packstream.Packer{},
onPackErr: func(err error) { inner.Error(err) },
}
for _i := 0; _i < inner.N; _i++ {
out.packX(load.data)
}
})
}
}
@robsdedude A note on your benchmark code - you're not resetting the buffer inside the Packer between benchmark iterations, which means the work done between iterations isn't consistent - some iterations will need to resize the buffer, others won't. Bytes allocated theoretically average out, though the number of allocations won't. IMO it's better to pre-allocate a buffer big enough to avoid resizes and reset it between runs. That way, you can isolate the performance of the serialization code. You also might want to trigger a GC (and reset the timer) between benchmarks.

buffer := make([]byte, 0, 1024*1024)
...
out := &outgoing{
	chunker:   newChunker(),
	packer:    packstream.Packer{},
	onPackErr: func(err error) { inner.Error(err) },
}
runtime.GC()
inner.ResetTimer()
for _i := 0; _i < inner.N; _i++ {
	out.packer.Begin(buffer)
	out.packX(load.data)
}
Improve serialization performance by specializing slices and maps of any, and by optimizing reflective looping through maps.

Special-casing []any and especially map[string]any, which are common parameter types, eliminates the allocation overhead of looping through these values.

Using MapRange rather than MapKeys and MapIndex (and using Value.SetIterKey/SetIterValue rather than MapIter.Key/Value) significantly reduces allocations when looping through other map types.
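For readers unfamiliar with the difference between those reflect APIs, here is a small self-contained sketch contrasting the two iteration styles (illustrative only; not the driver's code):

package main

import (
	"fmt"
	"reflect"
)

// iterateWithMapKeys allocates a []reflect.Value holding every key up front
// and boxes each key/value on access via MapIndex.
func iterateWithMapKeys(v reflect.Value) {
	for _, k := range v.MapKeys() {
		fmt.Println(k.String(), v.MapIndex(k).Interface())
	}
}

// iterateWithMapRange reuses two preallocated slots via SetIterKey and
// SetIterValue, reducing per-entry allocations compared to MapIter.Key/Value.
func iterateWithMapRange(v reflect.Value) {
	t := v.Type()
	key := reflect.New(t.Key()).Elem()
	val := reflect.New(t.Elem()).Elem()
	r := v.MapRange()
	for r.Next() {
		key.SetIterKey(r)
		val.SetIterValue(r)
		fmt.Println(key.String(), val.Interface())
	}
}

func main() {
	m := reflect.ValueOf(map[string]int{"a": 1, "b": 2})
	iterateWithMapKeys(m)
	iterateWithMapRange(m)
}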