-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add string interning to SyncIterator
#3411
Conversation
pkg/parquetquery/iters.go
Outdated
@@ -933,7 +937,7 @@ func (c *SyncIterator) makeResult(t RowNumber, v *pq.Value) *IteratorResult { | |||
r := GetResult() | |||
r.RowNumber = t | |||
if c.selectAs != "" { | |||
r.AppendValue(c.selectAs, v.Clone()) | |||
r.AppendValue(c.selectAs, c.interner.UnsafeClone(v)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a very high cardinality column will this be slower than just cloning?
should we compare the dictionary length to the page length and decide whether to intern or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice suggestion. Added logic to skip UUIDs. I don't think it's worth going beyond that. The overhead of using the interner when the column has high cardinality is very little.
This comment was marked as outdated.
This comment was marked as outdated.
pkg/util/intern/intern.go
Outdated
a.ptr = addressOfBytes(i.internBytes(v.ByteArray())) | ||
return *(*pq.Value)(unsafe.Pointer(&a)) | ||
default: | ||
return *v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this incur an extra value copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we pass pq.Value
as values (🥁) in Entries
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we cloning values unnecessarily? For any non-byte array/fixed len byte array by cloning here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but values are used everywhere. Changing this to pointers is a big change in the codebase, that I'd prefer to keep out from this PR.
FWIW, v.Clone()
is/was creating a copy of pq.Value
too. This is unchanged in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the line be v.Clone()
to make it consistent with the way it used to be? i'm honestly not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all other types (int, float, etc) the struct doesn't contain external pointers so a copy of the contents is fine. That's what Clone() does: https://github.com/parquet-go/parquet-go/blob/main/value.go#L792 Avoiding it is fine because it would incur another type check unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with this leaving as is in the PR, but it does feel safer to call .Clone()
in case that internal behavior changes.
SyncIterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good on this change. Loving some of the improvements in allocations.
As discussed I would like to see a benchmark that purposefully references a column that is 1) common and 2) high cardinality to confirm there are not regressions in this case.
Great work!
The highest cardinality attribute that I could find was Benchmark results
|
nice! lgtm. |
What this PR does:
Adds string (
[]byte
actually) interning in theSyncIterator
forpq.Value
s of kindByteArray
andFixedLenByteArray
. This saves allocations when the same string is repeated multiple times, which is very frequent.The interning logic is contained in
/pkg/parquetquery/intern
.Interner
is not concurrent-safe and uses unsafe code for converting between types (string
,[]byte
,*byte
) with minimal overhead.Benchmark results
main
vsbranch
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]