
RFC: incremental backup and point-in-time recovery #11227

Closed
shlomi-noach opened this issue Sep 15, 2022 · 16 comments

@shlomi-noach
Contributor

We wish to implement a native solution for (offline) incremental backup and compatible point-in-time recovery in Vitess. There is already a work-in-progress PR, but let's first describe the problem, the proposed solution, and how it differs from an existing prior implementation.

Background

Point-in-time recoveries make it possible to recover a database to a specific or approximate timestamp or position. The classic use case is a catastrophic change to the data, e.g. an unintentional DELETE FROM <table> or similar. Normally the damage only applies to a subset of the data; the database is generally still valid and the app is still able to function. As such, we want to fix the specific damage inflicted. The flow is to restore the data on an offline/non-serving server, to a point in time immediately before the damage was done. It's then typically a manual process of salvaging the specific damaged records.

It's also possible to just throw away everything and roll back the entire database to that point in time, though that is an uncommon use case.

A point in time can be either an actual timestamp or, more accurately, a position. Specifically, in MySQL 5.7 and above this will be a GTID set: the @@gtid_executed just before the damage. Since every transaction gets its own GTID value, it should be possible to restore up to single-transaction granularity (whereas a timestamp is a coarser measurement).

A point in time recovery is possible by combining a full backup recovery, followed by an incremental stream of changes since that backup. There are two main techniques in three different forms:

  1. Using binary logs, stored offline
  2. Using a binary log live stream
  3. Using Xtrabackup incremental backup

This RFC wishes to address (1). There is already prior work for (2). Right now we do not wish to address (3).

The existing prior work addresses (2), and specifically assumes:

  • You have a binlog server in your topology
  • The binlog server still has all the required binary logs to perform the recovery
  • You are able to join your server into the live replication stream

Suggested solution, backup

We wish to implement a more general solution by actually backing up binary logs as part of the backup process. These can be stored on local disk, in S3, etc., the same way as any Vitess backup is stored. In fact, an incremental backup will be listed just like any other backup, and this listing is also the key to performing a restore.

The user will take an incremental backup similarly to how they take a full backup:

  • Full backup: vtctlclient -- Backup zone1-0000000102
  • Incremental backup: vtctlclient -- Backup --incremental_from_pos "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-615" zone1-0000000102
  • or, auto incremental backup: vtctlclient -- Backup --incremental_from_pos "auto" zone1-0000000102

An incremental backup needs a starting point, given by the --incremental_from_pos flag. The incremental backup must cover that position, but does not have to start exactly at that position: it can start at an earlier position. See the diagram below. The backup ends at roughly the position current at the time the backup was requested: it will cover the exact point in time where the request was made, and possibly extend slightly beyond that.

An incremental backup is taken by copying binary logs. To do that, there is no need to shut down the MySQL server; it is free to be fully operational and serve traffic while the backup takes place. The backup process will rotate the binary logs (FLUSH BINARY LOGS) so as to ensure the files it is backing up are safely immutable.
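For illustration, here is a minimal sketch of that rotate-then-copy idea, assuming direct access to the MySQL server through Go's database/sql. The helper name and connection handling are hypothetical and not Vitess's actual backup-engine code; the real engine also has to decide which of the listed files are needed to cover --incremental_from_pos.

package backup

import (
    "database/sql"
    "fmt"

    _ "github.com/go-sql-driver/mysql"
)

// binlogFilesForBackup rotates the binary logs so that all previously written
// files become immutable, then returns the names of those closed files.
// Sketch only: a real incremental backup would additionally inspect each
// file's GTID contents to keep only the files needed for the backup.
func binlogFilesForBackup(db *sql.DB) ([]string, error) {
    // Rotate: everything written so far now lives in closed, immutable files.
    if _, err := db.Exec("FLUSH BINARY LOGS"); err != nil {
        return nil, err
    }

    rows, err := db.Query("SHOW BINARY LOGS")
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    // SHOW BINARY LOGS returns 2 columns on MySQL 5.7 and 3 on 8.0, so scan generically.
    cols, err := rows.Columns()
    if err != nil {
        return nil, err
    }
    var files []string
    for rows.Next() {
        raw := make([]sql.RawBytes, len(cols))
        dest := make([]any, len(cols))
        for i := range raw {
            dest[i] = &raw[i]
        }
        if err := rows.Scan(dest...); err != nil {
            return nil, err
        }
        files = append(files, string(raw[0])) // first column is Log_name
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }
    if len(files) == 0 {
        return nil, fmt.Errorf("no binary logs found; is log_bin enabled?")
    }
    // The last file is the freshly created, still-active one: skip it and copy the rest.
    return files[:len(files)-1], nil
}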

A manifest of an incremental backup may look like so:

{
  "BackupMethod": "builtin",
  "Position": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-883",
  "FromPosition": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867",
  "Incremental": true,
  "BackupTime": "2022-08-25T12:55:05Z",
  "FinishedTime": "2022-08-25T12:55:05Z",
  "ServerUUID": "1ea0631b-22b6-11ed-933f-0a43f95f28a3",
  "TabletAlias": "zone1-0000000102",
  "CompressionEngine": "pargzip",
  "FileEntries": [
     ..
  ]
}
  • The above is an incremental backup's manifest, clearly indicated by "Incremental": true
  • "FileEntries" will list binary log files
  • "FromPosition" indicates the first position covered by the backup. It is smaller than or equal to the requested --incremental_from_pos. This value is empty for a full backup.
  • ServerUUID is new and self-explanatory, added for convenience
  • TabletAlias is new and self-explanatory, added for convenience
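For illustration only, a manifest like the above could be read with a struct along these lines. The field names are taken from the JSON sample; this is a hypothetical sketch, not Vitess's actual manifest type, and the layout of FileEntries is elided in the sample above.

package backup

import (
    "encoding/json"
    "os"
    "time"
)

// IncrementalBackupManifest mirrors the fields shown in the sample manifest above.
type IncrementalBackupManifest struct {
    BackupMethod      string          `json:"BackupMethod"`
    Position          string          `json:"Position"`     // last GTID set covered by this backup
    FromPosition      string          `json:"FromPosition"` // first position covered; empty for a full backup
    Incremental       bool            `json:"Incremental"`
    BackupTime        time.Time       `json:"BackupTime"`
    FinishedTime      time.Time       `json:"FinishedTime"`
    ServerUUID        string          `json:"ServerUUID"`
    TabletAlias       string          `json:"TabletAlias"`
    CompressionEngine string          `json:"CompressionEngine"`
    FileEntries       json.RawMessage `json:"FileEntries"` // entry layout elided in the sample above
}

// loadManifest reads and parses a manifest file from local disk.
func loadManifest(path string) (*IncrementalBackupManifest, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var m IncrementalBackupManifest
    if err := json.Unmarshal(data, &m); err != nil {
        return nil, err
    }
    return &m, nil
}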

Suggested solution, restore/recovery

Again, riding the familiar Restore command, a restore looks like:

vtctlclient -- RestoreFromBackup  --restore_to_pos  "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000" zone1-0000000102

Vitess will attempt to find a path that recovers the database to that point in time. The path consists of exactly one full backup, followed by zero or more incremental restores. There could be exactly one such path, there could be multiple paths, or there could be no path. Consider the following scenarios:

Recovery scenario 1

[diagram: point-in-time-recovery-path-1]

This is the classic scenario. A full backup takes place at e.g. 12:10, then an incremental backup is taken from exactly that point and is valid up to 13:20, then the next one from exactly that point, valid up to 16:15, etc.

To restore the database to e.g. 20:00 (let's assume that's at position 16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000), we will restore the full backup, followed by incrementals 1 -> 2 -> 3 -> 4. Note that 4 exceeds 20:00; Vitess will only apply changes up to 20:00, or to be more precise, up to 16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000.

Recovery scenario 2

[diagram: point-in-time-recovery-path-2]

The above is actually identical to the first scenario. Notice how the first incremental backup precedes the full backup, and how backups 2 & 3 overlap. This is fine! We take strong advantage of MySQL's GTIDs. Because the overlapping transactions in 2 and 3 are consistently identified by the same GTIDs, MySQL is able to ignore the duplicates as we apply both restores one after the other.

Recovery scenario 3

[diagram: point-in-time-recovery-path-3]

In the above we have four different paths for recovery!

  • 1 -> 2 -> 3 -> 4
  • 1 -> 2 -> 6
  • 1 -> 5 -> 3 -> 4
  • 1 -> 5 -> 6

Any of these is valid; Vitess may choose whichever it pleases, ideally using as few backups as possible (hence preferring the 2nd or 4th option).

Recovery scenario 4

If we wanted to restore up to 22:15, there is no incremental backup that can take us there, and the operation must fail before it even begins.

Finding paths

Vitess should be able to determine the recovery path before actually applying anything. It can do so by reading the available manifests and finding the shortest valid path to the requested point in time. Using a greedy algorithm, it will seek the most recent full backup at or before the requested time, and then the shortest sequence of incremental backups to take us to that point.
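A minimal sketch of that greedy search, with GTID sets simplified to a single sequence number (think "uuid:1-N"). The types and function are hypothetical; the real implementation would use proper GTID-set containment checks rather than integer comparisons.

package backup

import "fmt"

// pathEntry is a simplified view of a backup for path finding: positions are
// reduced to a single sequence number; real code would use GTID-set math.
type pathEntry struct {
    name string
    full bool
    from int64 // first position covered (exclusive); 0 for a full backup
    to   int64 // last position covered (inclusive)
}

// findRestorePath picks the most recent full backup at or before the target
// position, then greedily chains incremental backups until the target is
// covered. Illustrative sketch only, not Vitess's implementation.
func findRestorePath(backups []pathEntry, target int64) ([]pathEntry, error) {
    var full *pathEntry
    for i := range backups {
        b := &backups[i]
        if b.full && b.to <= target && (full == nil || b.to > full.to) {
            full = b
        }
    }
    if full == nil {
        return nil, fmt.Errorf("no full backup found at or before position %d", target)
    }

    path := []pathEntry{*full}
    covered := full.to

    // Greedily extend coverage: at each step, pick the incremental backup that
    // starts within the already-covered range (overlaps are fine, GTIDs dedupe)
    // and reaches the furthest.
    for covered < target {
        var best *pathEntry
        for i := range backups {
            b := &backups[i]
            if !b.full && b.from <= covered && b.to > covered && (best == nil || b.to > best.to) {
                best = b
            }
        }
        if best == nil {
            return nil, fmt.Errorf("covered up to %d, but no incremental backup continues toward %d", covered, target)
        }
        path = append(path, *best)
        covered = best.to
    }
    return path, nil
}

In scenario 3 above, a search like this would naturally land on one of the two-incremental paths (1 -> 2 -> 6 or 1 -> 5 -> 6), since each step takes the incremental that reaches furthest.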

Backups from multiple sources

Scenario (3) looks imaginary, until you consider that backups may be taken from different tablets. These have different binary logs with different rotation times -- but all share the same sequence of GTIDs. Since an incremental backup consists of full binary log copies, there could be overlaps between binary logs backed up from different tablets/MySQL servers.

Vitess should not care about the identity of the sources, should not care about the binary log names (one server's binlog.0000289 may come before another server's binlog.0000101), and should not care about binary log count. It should only care about the GTID range an incremental backup covers: from (exclusive) and to (inclusive).

Restore time

It should be noted that an incremental restore based on binary logs means sequentially applying changes to a server. This may take minutes or hours, depending on how many binary log events we need to apply.

Testing

As usual, testing is to take place in:

  • Unit tests (e.g. validate recovery path logic)
  • endtoend (validate incremental backup, validate point in time restore)

Thoughts welcome. Please see #11097 for Work In Progress.

@deepthi
Member

deepthi commented Sep 15, 2022

Nicely written proposal. A few questions/comments:

  • In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?
  • Assuming that is correct, the restored tablet will have its replication stopped at the desired position and will not attempt to connect to the shard primary, right?
  • Restored tablet will not be serving, and most likely will be lagging.
  • Any alerts that might be generated by this situation are the responsibility of the (human) operator to work around.

@shlomi-noach
Contributor Author

In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?

Correct, and we need the logic to prevent it from auto-replicating.

Restored tablet will not be serving, and most likely will be lagging.

👍

Any alerts that might be generated by this situation are the responsibility of the (human) operator to work around.

👍

@shlomi-noach
Contributor Author

In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?

Thinking more on this, I'm not sure which is the preferred way: use the same keyspace or create a new keyspace. Using the same keyspace leads to the risk of the server unintentionally getting attached to the replication stream. In fact, that's what's happening in my dev env right now: I make a point in time restore, and then vitess auto-configures the restored server to replicate -- even though I skip the replication configuration in the restore process.

Is there a way to forcefully prevent the restored server from joining the replication stream?

@shlomi-noach
Contributor Author

shlomi-noach commented Sep 22, 2022

The current implementation now reuses the same keyspace, sets the tablet type to DRAINED and ensures not to start replication.

@mattlord
Contributor

The current implementation now reuses the same keyspace, sets the tablet type to DRAINED and ensures not to start replication.

Re-using the same keyspace seems more logical to me at first thought. You can prevent tablet repair in a number of standard ways:

  1. Set the tablet type to BACKUP or RESTORE; RESTORE seems more relevant than DRAINED
  2. touching tabletmanager.replicationStoppedFile
  3. I think using --disable_active_reparents

@GuptaManan100 would know better

@GuptaManan100
Member

GuptaManan100 commented Sep 27, 2022

BACKUP is meant for tablets that are in the midst of taking backups and RESTORE for the ones that are being restored. I am not sure how we use DRAINED. From the VTOrc perspective, all three are ignored, so if we use any of the three, VTOrc won't repair replication on them. The same goes for the replication manager: it won't fix replication either. So we shouldn't need to add the replication-stopped file.

If we do want to disable the replication manager explicitly (even though in my opinion it shouldn't be required), then there is a new flag that was added recently - disable-replication-manager.

One thing that could be an issue is the setting of replication parameters by the tablet manager when it first starts. We can prevent that from happening with disable_active_reparents as @mattlord pointed out. We could also fix this step by checking the tablet type as we do for the other two, so we won't need this flag either. I can make that change if we decide to go with this alternative.

I looked at the linked PR and I think it has all the changes that should be needed. There is already code to stop vtctld from setting up replication after the restore is complete, and also code in the restore flow itself on the vttablets to prevent starting replication. Since we set the type to DRAINED in the end, neither VTOrc nor the replication manager should be repairing replication.

@shlomi-noach Do you know where the replication is fixed by vitess in your tests? I don't think there is any other place, other than the 3 mentioned ☝️ that repair replication. I can help debug if we are seeing that replication is being repaired after the restore.

EDIT: Looked at the test in the PR and it is using a replica tablet that is already running to run the recovery process, so the initialization code shouldn't matter either.

@shlomi-noach
Contributor Author

Do you know where the replication is fixed by vitess in your tests?

@GuptaManan100 I don't think they are? The PITR tests are all good and validate that replication does not get fixed.

@mattlord like @GuptaManan100 said, I think BACKUP and RESTORE types are for actively-being-backed-up-or-restored tablets. I think DRAINED makes most sense because by default vitess will not serve any traffic from DRAINED, but will allow explicit connections to read from the tablet mysql -h @drained ...

@GuptaManan100
Member

GuptaManan100 commented Sep 28, 2022

@shlomi-noach Okay great! I was looking at

I make a point in time restore, and then vitess auto-configures the restored server to replicate -- even though I skip the replication configuration in the restore process.

which made me think that replication was being repaired by something in Vitess, even though I wasn't expecting it to be.
Maybe you made code changes after that comment which resolved it.

And I agree that DRAINED should be the ideal type to use given our alternatives.

@shlomi-noach
Contributor Author

which made me think that replication was being repaired by something in Vitess

Sorry I wasn't clear. There was this problem, and I found where it was that forced replication to start: it was part of the Restore process itself, in vtctld.

@GuptaManan100
Member

GuptaManan100 commented Sep 28, 2022

Oh! I see. I had added that in response to an issue wherein, if there was a PRS while a tablet was in restore state, its semi-sync settings weren't set up correctly when it finally transitioned back to replica. The changes in your PR, as far as this flow is concerned, are perfect 💯

@derekperkins
Member

derekperkins commented Nov 1, 2022

Link to a prior discussion

@shlomi-noach
Contributor Author

A few words about #13156: this PR supports incremental backup/recovery for Xtrabackup. It does use binary logs, as with the builtin engine. It does not use Xtrabackup's incremental backup where it copies InnoDB pages.

With #13156, it is possible to run full & incremental backups using Xtrabackup engine, without taking down a MySQL server.

The incremental restore process is similar to that of builtin: take down a server, restore a full backup, start the server, then apply binary logs as appropriate.

#13156 is merged in release-17.0.

Note that this still only supports --restore-to-pos, which means:

  • You need to know the specific GTID position to which you want to restore
  • And that is per-shard, as each shard will have completely different GTID sets

Support for a point in time (as in, restore to a given timestamp) will be added next.

@shlomi-noach
Contributor Author

Supporting a point-in-time recovery:

We want to be able to recover one or all shards up to a specified point in time, i.e. a timestamp. We want to be able to restore to any point in time at a 1-second resolution. We will technically be able to restore at microsecond resolution, but for now let's discuss 1-second resolution.

Whether we restore a single shard or multiple shards, the operation will take place independently on each shard.

When restoring multiple shards to the same point in time, the user should be aware that the shards may not be in full sync with each other. Time granularity, clock skews etc., can all mean that the restored shards may not be 100% consistent with an actual historical point-in-time.

As for the algorithm we will go by: it's a bit different than a restore-to-position because:

  • Positions are discrete, whereas time isn't. More accurately, positions are uniquely identifiable, while two or more transactions can share the same timestamp
  • Positions are logical, and are independent of the hardware clock
  • It is therefore impossible, or at least undesirable, to claim "this full backup is true to this precise time". If it's an online backup, then we'd have to freeze writes so as to get the current time, and even then we'd be susceptible to clock skew. If we take a backup from a replica, who knows what exact time on the primary the replica image reflects? We can estimate with e.g. heartbeats, but the fact remains the value is inaccurate.

The most reliable information we have is the original committed timestamp value in the binary log. This event header remains true to the primary, even if read from a replica.

The way to run a point-in-time recovery is a bit upside-down compared to restore-to-pos (a sketch follows the list below):

  1. We will first find an incremental backup whose binlog entry range (the first and last binlog entries in the backup) represents a timestamp range that includes our desired point-in-time to restore
  2. Then, we will work backwards to find previous incremental backups, until we hit a full backup, such that all are recoverable in sequence (i.e. no GTID gaps)
  3. But, also, since we support incremental backups with Xtrabackup, it is possible that the full backup overlaps with one or more incremental backups' binary logs (binary logs are not rotated in an Xtrabackup backup). We also require that the full backup must have been completed before our desired point-in-time to restore.
  4. Once we have such a sequence -- a full backup, then incremental backups, leading to a timestamp that is higher than our desired point-in-time -- we apply as follows:
  • Restore the full backup. We know that it is true to before our desired point-in-time
  • Restore all incremental backups in the sequence. In each incremental backup we restore all binary logs
  • As a reminder, we allow binlog overlaps, as we rely on MySQL GTIDs to skip duplicate transactions
  • In extracting any of the binary logs, we will use mysqlbinlog --stop-datetime to ensure no event gets applied that is later than our desired point-in-time
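A sketch of that backwards search, again with GTID positions simplified to sequence numbers and with hypothetical types; the real code would use GTID-set math and the timestamps recorded in the manifests.

package backup

import (
    "fmt"
    "time"
)

// timedBackup is a simplified manifest view for the timestamp-based search:
// positions are reduced to sequence numbers, and an incremental backup carries
// the timestamps of its first and last binlog entries.
type timedBackup struct {
    name      string
    full      bool
    from, to  int64     // covered positions: from (exclusive, 0 for a full backup) .. to (inclusive)
    firstTime time.Time // timestamp of the first binlog entry (incremental only)
    lastTime  time.Time // timestamp of the last binlog entry, or completion time for a full backup
}

// findPointInTimePath works backwards as described above: find an incremental
// backup whose binlog timestamp range contains the target time, then chain
// earlier incrementals (overlaps are fine, gaps are not) until a full backup
// that completed before the target time. Illustrative sketch only.
func findPointInTimePath(backups []timedBackup, target time.Time) ([]timedBackup, error) {
    // 1. The incremental backup whose binlog entries straddle the target time.
    var last *timedBackup
    for i := range backups {
        b := &backups[i]
        if !b.full && !b.firstTime.After(target) && !b.lastTime.Before(target) {
            last = b
            break
        }
    }
    if last == nil {
        return nil, fmt.Errorf("no incremental backup covers time %v", target)
    }

    // 2. Walk backwards until a full backup, completed before the target time,
    //    plugs the remaining gap.
    path := []timedBackup{*last}
    need := last.from // we still need everything up to and including this position
    for {
        var full, prev *timedBackup
        for i := range backups {
            b := &backups[i]
            if b.full && b.to >= need && b.lastTime.Before(target) {
                full = b
            }
            if !b.full && b.to >= need && b.from < need && (prev == nil || b.from < prev.from) {
                prev = b
            }
        }
        if full != nil {
            return append([]timedBackup{*full}, path...), nil
        }
        if prev == nil {
            return nil, fmt.Errorf("no backup chain reaches back from position %d", need)
        }
        path = append([]timedBackup{*prev}, path...)
        need = prev.from
    }
}

When the final incremental backup is applied, passing the target time through mysqlbinlog --stop-datetime (as in step 4 above) is what clips the restore at the desired point in time.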

I tend to require the user to supply the point-in-time strictly in UTC, but we can work this out. Everything is possible, of course, but I wonder what is more correct UX-wise.

@shlomi-noach
Contributor Author

WIP for restore-to-time: #13270

@shlomi-noach
Contributor Author

It's been pointed out by the community (Vitess slack, #feat-backup channel) as well as by @deepthi that the current flow for PITR requires taking out an existing replica. The request is to be able to initialize a new replica with PITR flags, so that it is created and restored from backup with PITR all in one go.

PR incoming.

@shlomi-noach
Contributor Author

Closing this RFC as the work was done.
