
VReplication: Handle large binlog compressed transactions more efficiently #16328

Merged
26 commits merged into main from uncompressed_trx_to_file on Jul 10, 2024

Conversation

mattlord (Contributor) commented Jul 3, 2024

Description

We added support for MySQL binlog transaction compression in VReplication in #12950. This initial implementation performed all processing of the compressed events in memory.

When uncompressed, these compressed transaction payload events (the compressed event itself is limited by max_allowed_packet) can be tens or even hundreds of GiB in size. Think of updating 1 million rows in a wide table with JSON, BLOB, and TEXT fields, and keep in mind that with binlog_row_image=full (the default) the binlog ROW event for each changed row contains both a before and an after image.

So keeping the entire uncompressed transaction in memory can become infeasible and problematic (the vttablet process is likely to get OOM-killed).

This PR changes the processing of compressed transaction payload events so that when the uncompressed size is above 128MiB we stream the decompression work. Whether we use the slower streaming method or the faster stateless in-memory buffer method, we read the events within the uncompressed transaction one at a time. This allows us to process larger payloads without allocating more than the 128MiB maximum we allow for in-memory processing.
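
At a high level, the new flow looks roughly like the sketch below, written against the klauspost/compress zstd API. The 128MiB threshold matches the description above, but the package, function, and variable names, the event-framing helper, and the error handling are illustrative assumptions rather than the PR's actual code.

package example

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"

	"github.com/klauspost/compress/zstd"
)

// Illustrative constant matching the 128MiB threshold described above.
const zstdInMemoryDecompressorMaxSize = 128 << 20

// eventIterator returns a function that yields the raw binlog events contained
// in a compressed transaction payload one at a time, returning io.EOF when the
// payload has been fully consumed.
func eventIterator(compressed []byte, uncompressedSize uint64) (func() ([]byte, error), error) {
	var stream io.Reader
	if uncompressedSize > zstdInMemoryDecompressorMaxSize {
		// Large payload: stream the decompression so the whole uncompressed
		// transaction is never held in memory. (A real implementation would
		// also arrange to release the decoder once the payload is consumed.)
		dec, err := zstd.NewReader(bytes.NewReader(compressed),
			zstd.WithDecoderMaxMemory(zstdInMemoryDecompressorMaxSize))
		if err != nil {
			return nil, err
		}
		stream = dec
	} else {
		// Small payload: the faster stateless in-memory path.
		dec, err := zstd.NewReader(nil)
		if err != nil {
			return nil, err
		}
		defer dec.Close()
		uncompressed, err := dec.DecodeAll(compressed, nil)
		if err != nil {
			return nil, err
		}
		stream = bytes.NewReader(uncompressed)
	}

	header := make([]byte, 19) // standard binlog event common header size
	return func() ([]byte, error) {
		if _, err := io.ReadFull(stream, header); err != nil {
			return nil, err // io.EOF once all events have been read
		}
		// Bytes 9-12 of the common header hold the total event size (little endian).
		size := binary.LittleEndian.Uint32(header[9:13])
		if size < 19 {
			return nil, fmt.Errorf("invalid binlog event size %d", size)
		}
		ev := make([]byte, size)
		copy(ev, header)
		if _, err := io.ReadFull(stream, ev[19:]); err != nil {
			return nil, err
		}
		return ev, nil
	}, nil
}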

Memory Usage Comparisons

Below are memory usage comparisons using this manual test on macOS: https://gist.github.com/mattlord/74a18bb96dbcb1157fb72eba07730d49

Main

Physical footprint:         18.6G
Physical footprint (peak):  18.6G

Fetching profile over HTTP from http://localhost:15100/debug/pprof/heap
Saved profile in /Users/matt/pprof/pprof.vttablet.alloc_objects.alloc_space.inuse_objects.inuse_space.017.pb.gz
File: vttablet
Type: inuse_space
Time: Jul 4, 2024 at 11:59pm (EDT)
Showing nodes accounting for 12GB, 99.89% of 12.01GB total
Dropped 105 nodes (cum <= 0.06GB)
      flat  flat%   sum%        cum   cum%
      12GB 99.89% 99.89%       12GB 99.89%  bytes.growSlice
         0     0% 99.89%       12GB 99.89%  bytes.(*Buffer).Write
         0     0% 99.89%       12GB 99.89%  bytes.(*Buffer).grow
         0     0% 99.89%       12GB 99.89%  github.com/klauspost/compress/zstd.(*Decoder).WriteTo
         0     0% 99.89%       12GB 99.89%  google.golang.org/grpc.(*Server).handleStream
         0     0% 99.89%       12GB 99.89%  google.golang.org/grpc.(*Server).processStreamingRPC
         0     0% 99.89%       12GB 99.89%  google.golang.org/grpc.(*Server).serveStreams.func2.1
         0     0% 99.89%       12GB 99.89%  io.Copy (inline)
         0     0% 99.89%       12GB 99.89%  io.copyBuffer
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/mysql.(*TransactionPayload).Decode
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/mysql.(*TransactionPayload).decode
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/mysql.(*TransactionPayload).decompress
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/mysql.binlogEvent.TransactionPayload
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/proto/queryservice._Query_VStream_Handler
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/grpcqueryservice.(*query).VStream
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).VStream
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*Engine).Stream
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*uvstreamer).Stream
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).Stream
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).parseEvent
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).parseEvents
         0     0% 99.89%       12GB 99.89%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).replicate

PR Branch

Physical footprint:         9.8G
Physical footprint (peak):  11.6G

Sampling process 34131 for 180 seconds with 1 millisecond of run time between samples
Fetching profile over HTTP from http://localhost:15100/debug/pprof/heap
Saved profile in /Users/matt/pprof/pprof.vttablet.alloc_objects.alloc_space.inuse_objects.inuse_space.104.pb.gz
File: vttablet
Type: inuse_space
Time: Jul 9, 2024 at 8:36pm (EDT)
Showing nodes accounting for 5067.72MB, 99.72% of 5081.87MB total
Dropped 128 nodes (cum <= 25.41MB)
      flat  flat%   sum%        cum   cum%
 5004.77MB 98.48% 98.48%  5004.77MB 98.48%  vitess.io/vitess/go/sqltypes.RowToProto3Inplace
   62.95MB  1.24% 99.72%    62.95MB  1.24%  vitess.io/vitess/go/mysql.(*TransactionPayload).decode.func1
         0     0% 99.72%  5068.22MB 99.73%  google.golang.org/grpc.(*Server).handleStream
         0     0% 99.72%  5067.72MB 99.72%  google.golang.org/grpc.(*Server).processStreamingRPC
         0     0% 99.72%  5068.22MB 99.73%  google.golang.org/grpc.(*Server).serveStreams.func2.1
         0     0% 99.72%    62.95MB  1.24%  vitess.io/vitess/go/mysql.(*TransactionPayload).GetNextEvent
         0     0% 99.72%  5004.77MB 98.48%  vitess.io/vitess/go/sqltypes.RowToProto3 (inline)
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/proto/queryservice._Query_VStream_Handler
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/grpcqueryservice.(*query).VStream
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).VStream
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*Engine).Stream
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*uvstreamer).Stream
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).Stream
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).parseEvent
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).parseEvents
         0     0% 99.72%  5004.77MB 98.48%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).processRowEvent
         0     0% 99.72%  5067.72MB 99.72%  vitess.io/vitess/go/vt/vttablet/tabletserver/vstreamer.(*vstreamer).replicate

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

@mattlord added the Type: Bug, Type: Enhancement (logical improvement, somewhere between a bug and feature), and Component: VReplication labels on Jul 3, 2024
vitess-bot (bot) commented Jul 3, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test; enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot added the NeedsBackportReason (a justification is required if backport labels have been applied to a PR), NeedsDescriptionUpdate (the description is not clear or comprehensive enough and needs work), NeedsIssue (a linked issue is missing for this Pull Request), and NeedsWebsiteDocsUpdate labels on Jul 3, 2024
@github-actions added this to the v21.0.0 milestone on Jul 3, 2024
@mattlord removed the NeedsBackportReason, NeedsWebsiteDocsUpdate, and NeedsIssue labels on Jul 3, 2024
codecov (bot) commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 67.74194% with 30 lines in your changes missing coverage. Please review.

Project coverage is 68.69%. Comparing base (d236ccd) to head (4e6206a).
Report is 18 commits behind head on main.

Files                                                Patch %   Lines
go/mysql/binlog_event_compression.go                  75.90%   20 Missing ⚠️
go/vt/vttablet/tabletserver/vstreamer/vstreamer.go     0.00%    8 Missing ⚠️
go/mysql/binlog_event_filepos.go                       0.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16328      +/-   ##
==========================================
- Coverage   68.72%   68.69%   -0.04%     
==========================================
  Files        1547     1548       +1     
  Lines      198290   198487     +197     
==========================================
+ Hits       136279   136350      +71     
- Misses      62011    62137     +126     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mattlord changed the title from "[WiP] VReplication: Handle large compressed transaction payloads more efficiently" to "[WiP] VReplication: Handle large binlog compressed transactions more efficiently" on Jul 3, 2024
// within the compressed transaction.
-TransactionPayload(BinlogFormat) ([]BinlogEvent, error)
+TransactionPayload(BinlogFormat) (func() (BinlogEvent, error), error)
Member

Looks like you're changing this to be compatible with the new Go v1.23 iterators, right? 💯

mattlord (Contributor, Author)

Yeah, we will almost certainly be using 1.23 on main before the release and I'd like to use the new standard interface if I can.
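
For illustration, a caller of the function-returning signature might look roughly like this; the io.EOF end-of-payload sentinel and the helper name are assumptions rather than the PR's exact code.

package example

import (
	"errors"
	"io"

	"vitess.io/vitess/go/mysql"
)

// drainTransactionPayload shows one way a caller could consume the
// iterator-style API: keep calling the returned function until it signals
// the end of the compressed transaction (assumed here to be io.EOF).
func drainTransactionPayload(ev mysql.BinlogEvent, format mysql.BinlogFormat,
	process func(mysql.BinlogEvent) error) error {
	getNextEvent, err := ev.TransactionPayload(format)
	if err != nil {
		return err
	}
	for {
		tpEv, err := getNextEvent()
		if err != nil {
			if errors.Is(err, io.EOF) {
				return nil // every event within the payload has been read
			}
			return err
		}
		if err := process(tpEv); err != nil {
			return err
		}
	}
}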

// temporary file.
if tp.uncompressedSize > zstdInMemoryDecompressorMaxSize {
	// Create a temporary file to stream the uncompressed payload to.
	tmpFile, err := os.CreateTemp("", "binlog-transaction-payload-*")
derekperkins (Member) commented Jul 3, 2024

Not sure if it's worth the complexity or cleanup synchronization, but if we named the file based on header information, couldn't multiple streams share the same file?

mattlord (Contributor, Author)

It could prove to be worth it in the long run, but at this point I would say no. I say that because the file management and I/O would become much more complex if we went from a single RW "owner" / single FD to having 1 RW FD and N RO FDs, as we'd have to coordinate on I/O and file cleanup. The cleanup could be an issue too: the streams are not aligned, so when would we clean the files up? We could also move the problem from exhausting RAM to exhausting block device space, as these files can be 10s or 100s of GBs in size, so we'd want to clean them up ASAP. I'm also assuming that these VERY large compressed transactions (many rows changed on wide tables, with before and after images in the row events) will be relatively rare. But we'll see. That's definitely a possible optimization if it proves warranted.

Member

That was my thought too. For my specific issue, and assuming that 60 GB RAM translates roughly into disk space, we wouldn't have had sufficient disk either to write all these files, but that would have been easier to track down than vttablet being repeatedly OOMKilled.

I know there are already mutexes all over the place, but maybe we could add another one to lock here inside the if statement, which would prevent multiple large files / transactions from being processed concurrently. That would have no impact on regular operations, but would prevent amplification of the same huge transaction taking all RAM / disk.

mattlord (Contributor, Author)

Yeah, that's a good idea too. I probably won't do that here but it's something I'll keep in mind if disk usage does end up being a problem. Thanks for the good suggestions/ideas!
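
For reference, the serialization idea floated in this thread, which was not implemented in this PR, could look roughly like the sketch below; the type and names are illustrative stand-ins for the real ones in go/mysql.

package example

import "sync"

// Illustrative stand-in for the real payload type in go/mysql.
type transactionPayload struct{ uncompressedSize uint64 }

const zstdInMemoryDecompressorMaxSize = 128 << 20

// largeTxMu would serialize the expensive large-payload path so that multiple
// concurrent vstreams cannot each decompress a huge transaction at once.
var largeTxMu sync.Mutex

func processCompressedTransaction(tp *transactionPayload, drain func() error) error {
	if tp.uncompressedSize > zstdInMemoryDecompressorMaxSize {
		largeTxMu.Lock()
		// Hold the lock until every event in this payload has been handled,
		// so only one very large transaction is in flight at a time.
		defer largeTxMu.Unlock()
	}
	return drain()
}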

@mattlord force-pushed the uncompressed_trx_to_file branch 3 times, most recently from 28e7363 to 35b340c on July 5, 2024 16:59
@mattlord force-pushed the uncompressed_trx_to_file branch 2 times, most recently from be3c879 to 6ee167c on July 6, 2024 15:27
@arthurschreiber
Contributor

Nice work! Agree that we should be able to use memory map to avoid copying large file contents into memory.

@shlomi-noach This no longer copies anything to disk, so the mmap part is irrelevant now. 😄

statefulDecodersPool = sync.Pool{
	New: func() any {
		d, err := zstd.NewReader(nil, zstd.WithDecoderMaxMemory(zstdInMemoryDecompressorMaxSize))
		if err != nil {
			log.Errorf("Error creating stateful decoder: %v", err)
		}
Member

I saw this a couple of days ago, which helps propagate the real error instead of checking for nil, but what you have is probably sufficient. It doesn't seem like that New function should error out in practice:
https://www.reddit.com/r/golang/comments/1dww9bi/how_to_handle_errors_on_synvpool/lbxinal
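
The pattern described in that linked thread amounts to having the pool's New return both the value and the construction error so the caller of Get sees the real error; a rough sketch (illustrative names, not this PR's code):

package example

import (
	"sync"

	"github.com/klauspost/compress/zstd"
)

const zstdInMemoryDecompressorMaxSize = 128 << 20

// decoderAndError bundles the pooled decoder with any construction error so
// that the error can be propagated to the caller of Get instead of only
// being logged inside New.
type decoderAndError struct {
	d   *zstd.Decoder
	err error
}

var statefulDecodersPool = sync.Pool{
	New: func() any {
		d, err := zstd.NewReader(nil, zstd.WithDecoderMaxMemory(zstdInMemoryDecompressorMaxSize))
		return &decoderAndError{d: d, err: err}
	},
}

func getDecoder() (*zstd.Decoder, error) {
	de := statefulDecodersPool.Get().(*decoderAndError)
	if de.err != nil {
		return nil, de.err
	}
	// A corresponding put helper would wrap the decoder again before
	// returning it to the pool.
	return de.d, nil
}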

mattlord (Contributor, Author) commented Jul 9, 2024

I'm going to revert the sync.Pool-related code, as there seems to be no good way to use a pool of stream decoders: the goroutines the reader spawns (on NewReader and Reset) prevent the finalizer from ever running, so we never call Close, and the 4 goroutines created for each reader leak (they use a Background context). For it to be safe, you have to use zstd.WithDecoderConcurrency(1) to make it fully synchronous (no goroutines), which then makes it ~2-3x slower: klauspost/compress#498

I can see in my tests that we use significantly more memory with the sync.Pool related code too (~ 1GiB or more).

noleak.go:56: found unexpected goroutines:
[Goroutine 23 in state chan receive, with vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers.func1 on top of the stack:
vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers.func1()
  /home/runner/work/vitess/vitess/go/mysql/auth_server_static.go:273 +0x6d
created by vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers in goroutine 11
  /home/runner/work/vitess/vitess/go/mysql/auth_server_static.go:272 +0x165
Goroutine 74 in state chan receive, with vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers.func1 on top of the stack:
vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers.func1()
  /home/runner/work/vitess/vitess/go/mysql/auth_server_static.go:273 +0x6d
created by vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers in goroutine 24
  /home/runner/work/vitess/vitess/go/mysql/auth_server_static.go:272 +0x165
Goroutine 75 in state chan receive, with vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers.func2 on top of the stack:
vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers.func2()
  /home/runner/work/vitess/vitess/go/mysql/auth_server_static.go:282 +0xc7
created by vitess.io/vitess/go/mysql.(*AuthServerStatic).installSignalHandlers in goroutine 24
  /home/runner/work/vitess/vitess/go/mysql/auth_server_static.go:281 +0x25d

It's also not relevant anymore as we're not managing a pool of
zstd.Decoders.

mattlord (Contributor, Author) commented Jul 10, 2024

I'm going to revert the sync.Pool-related code, as there seems to be no good way to use a pool of stream decoders: the goroutines the reader spawns (on NewReader and Reset) prevent the finalizer from ever running, so we never call Close, and the 4 goroutines created for each reader leak (they use a Background context). For it to be safe, you have to use zstd.WithDecoderConcurrency(1) to make it fully synchronous (no goroutines), which then makes it ~2-3x slower: klauspost/compress#498

After further testing and reading, I can see that I was perfectly wrong here. 😆 What I described above was true prior to klauspost/compress#498; that PR changed a key behavior, as noted in its description:

Goroutines exit when streams have finished reading (either error or EOF).

And we can see that calling Reset with a nil reader — as we do in the TransactionPayload's Close — causes the goroutines to terminate by setting an internal error and it then cleans up the previous reader's resources here: https://github.com/klauspost/compress/blob/v1.17.9/zstd/decoder.go#L166-L186

I confirmed through testing that no goroutines are in fact leaked, using both the unit tests' utils.LeakCheckContext and goroutine dumps of the source/commerce primary vttablet (where the vstreamer runs) during and after the manual test.

I can see in my tests that we use significantly more memory with the sync.Pool related code too (~ 1GiB or more).

That was in part due to the unnecessary finalizer and I no longer see any increase in memory usage in my tests.

SO, after all this I'm going to re-apply the sync.Pool work that allows us to use a pool of stateful decoders when we have concurrent large payload decoding work to do.
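
Taken together, the lifecycle of a pooled stateful decoder as described in this thread would look roughly like the sketch below: Get a decoder, Reset it onto the compressed stream, read from it, then Reset(nil) so its goroutines exit, and return it to the pool. The pool setup and helper names are illustrative rather than the PR's actual code.

package example

import (
	"io"
	"sync"

	"github.com/klauspost/compress/zstd"
)

// decoderPool holds reusable stateful zstd decoders for large payloads.
var decoderPool = sync.Pool{
	New: func() any {
		d, _ := zstd.NewReader(nil) // error ignored purely for brevity in this sketch
		return d
	},
}

// streamDecompress resets a pooled decoder onto the compressed stream, lets
// the caller consume the uncompressed bytes, then detaches the decoder with
// Reset(nil) (so its goroutines finish) and returns it to the pool.
func streamDecompress(compressed io.Reader, consume func(io.Reader) error) error {
	dec := decoderPool.Get().(*zstd.Decoder)
	defer func() {
		dec.Reset(nil) // releases the previous stream's resources
		decoderPool.Put(dec)
	}()
	if err := dec.Reset(compressed); err != nil {
		return err
	}
	// The decoder now acts as an io.Reader over the uncompressed payload.
	return consume(dec)
}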

go/mysql/binlog_event_compression.go: 3 resolved review conversations (outdated)
Labels
Component: VReplication, Type: Bug, Type: Enhancement (logical improvement, somewhere between a bug and feature)
5 participants