
Bug Report: Race condition on uploading backup and manifest #16825

Open
rvrangel opened this issue Sep 23, 2024 · 3 comments

Overview of the Issue

We have encountered a rare issue where a backup completed with only the MANIFEST written to S3, but not the actual backup file!

Looking at the logs, it seems we tried to complete the upload of backup.xbstream.gz after the MANIFEST was written, breaking the contract highlighted here.

Reproduction Steps

Not easy to reproduce since it depends on S3 throttling us, but it might be possible to write a test that simulates this kind of behaviour.
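
As a rough idea of what such a test could look like, here is a minimal Go sketch with a fake storage backend whose data-file upload fails asynchronously, simulating S3's "Throttling: Rate exceeded". The fakeStorage type and every other name in it are hypothetical stand-ins, not existing Vitess test helpers.

```go
package backup_test

import (
	"errors"
	"io"
	"sync"
	"testing"
)

// fakeStorage records which objects "reached" remote storage. Uploads run on
// goroutines, so Close() on the returned writer can succeed even when the
// upload later fails.
type fakeStorage struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	stored map[string]bool
	err    error
}

func (s *fakeStorage) AddFile(name string, failUpload bool) io.WriteCloser {
	pr, pw := io.Pipe()
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		io.Copy(io.Discard, pr) // drain the writer, as a real backend would
		s.mu.Lock()
		defer s.mu.Unlock()
		if failUpload {
			s.err = errors.New("MultipartUpload: Throttling: Rate exceeded")
			return
		}
		s.stored[name] = true
	}()
	return pw
}

func TestManifestNotWrittenWhenBackupUploadFails(t *testing.T) {
	s := &fakeStorage{stored: map[string]bool{}}

	// The data-file upload fails asynchronously, after Close() has returned.
	w := s.AddFile("backup.xbstream.gz", true)
	w.Write([]byte("backup bytes"))
	w.Close()

	// A correct engine waits for pending uploads and checks the error before
	// it writes the MANIFEST; otherwise the MANIFEST is left behind alone.
	s.wg.Wait()
	if s.err == nil {
		m := s.AddFile("MANIFEST", false)
		m.Close()
		s.wg.Wait()
	}

	if s.stored["MANIFEST"] {
		t.Fatal("MANIFEST must not exist when the backup file upload failed")
	}
}
```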

Binary Version

Running v15 from our Slack branch.

Operating System and Environment details

Not OS related.

Log Fragments

I0920 19:04:40.356237 3431799 xtrabackupengine.go:357] xtrabackup stderr: 2024-09-20T19:04:40.355928-07:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!
I0920 19:04:40.750777 3431799 xtrabackupengine.go:709] Found position: <<redacted>>
I0920 19:04:40.750838 3431799 xtrabackupengine.go:146] Closing backup file backup.xbstream.gz
I0920 19:04:40.750850 3431799 xtrabackupengine.go:201] Writing backup MANIFEST
I0920 19:04:40.751397 3431799 xtrabackupengine.go:237] Backup completed
I0920 19:04:40.751421 3431799 xtrabackupengine.go:146] Closing backup file MANIFEST
W0920 19:04:41.425825 3431799 rpc_server.go:80] TabletManager.Backup(concurrency:4)(on us_east_1c-0169388481 from ) error: MultipartUpload: upload multipart failed
    upload id: <<redacted>>
caused by: Throttling: Rate exceeded
    status code: 400, request id: <<redacted>>
rvrangel added the Type: Bug and Needs Triage labels Sep 23, 2024
deepthi (Member) commented Sep 23, 2024

@frouioui looks like you might have run into a similar issue and fixed it in #16806?

EDIT: Taking another look, they look like different race conditions.

deepthi added the Component: Backup and Restore label and removed the Needs Triage label Sep 23, 2024
frouioui self-assigned this Sep 25, 2024
frouioui (Member) commented Dec 2, 2024

Hello @rvrangel, thanks for reporting this. I spent some time investigating, and the issue is rooted in how we do AddFile on S3 and Ceph. To make it short, when backing up Vitess, both the builtin and xtrabackup engine code call AddFile to upload each individual file (or stripe); this function returns a writer to which we write. However, the S3 and Ceph AddFile implementations write to the remote storage asynchronously, meaning that we may return from backupFiles once we are done writing everything to the buffer, but before everything has been written/processed by S3/Ceph. To fix this, we must call bh.EndBackup() at the end of backupFiles to make sure S3/Ceph are done processing the files before moving on to the MANIFEST.
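
To illustrate the race and the barrier described above, here is a minimal, self-contained Go sketch. The backupHandle type, its AddFile/waitForUploads methods, and the file names are illustrative, not the actual Vitess backupstorage API; the point is only that Close() on the returned writer can succeed while the remote upload is still in flight, so the engine must wait for all pending uploads before it writes the MANIFEST.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
	"time"
)

type backupHandle struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	stored []string
	err    error
}

// AddFile returns a writer; the upload runs on a goroutine, so Close()
// returning does not mean the object has reached remote storage yet.
func (bh *backupHandle) AddFile(name string) io.WriteCloser {
	pr, pw := io.Pipe()
	bh.wg.Add(1)
	go func() {
		defer bh.wg.Done()
		var buf bytes.Buffer
		if _, err := io.Copy(&buf, pr); err != nil {
			bh.fail(err)
			return
		}
		time.Sleep(100 * time.Millisecond) // simulate remote-storage latency/throttling
		bh.mu.Lock()
		bh.stored = append(bh.stored, name)
		bh.mu.Unlock()
	}()
	return pw
}

func (bh *backupHandle) fail(err error) {
	bh.mu.Lock()
	defer bh.mu.Unlock()
	if bh.err == nil {
		bh.err = err
	}
}

// waitForUploads is the EndBackup-style barrier: block until every pending
// upload has finished, then report the first failure, if any.
func (bh *backupHandle) waitForUploads() error {
	bh.wg.Wait()
	bh.mu.Lock()
	defer bh.mu.Unlock()
	return bh.err
}

func main() {
	bh := &backupHandle{}

	// Write the data file; Close() returns while the upload may still be in flight.
	w := bh.AddFile("backup.xbstream.gz")
	fmt.Fprintln(w, "compressed backup bytes")
	w.Close()

	// Without a barrier here, the MANIFEST could be written even though the
	// data-file upload later fails (e.g. S3 throttling), which is the bug above.
	if err := bh.waitForUploads(); err != nil {
		fmt.Println("not writing MANIFEST, backup failed:", err)
		return
	}

	// Only now write the MANIFEST, preserving the "MANIFEST is written last" contract.
	m := bh.AddFile("MANIFEST")
	fmt.Fprintln(m, `{"BackupMethod": "xtrabackup"}`)
	m.Close()
	if err := bh.waitForUploads(); err != nil {
		fmt.Println("MANIFEST upload failed:", err)
		return
	}
	fmt.Println("uploaded:", bh.stored)
}
```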

The refactor I did in #17271 will fix this issue for the builtin backup engine. The xtrabackup engine fix should be part of another PR.

rvrangel (Contributor, Author):

Thanks for looking into it @frouioui! Could you link it here when you have a PR for the xtrabackup engine as well?
