-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding new mysql shell backup engine #16295
Conversation
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #16295 +/- ##
==========================================
- Coverage 69.51% 69.43% -0.08%
==========================================
Files 1569 1570 +1
Lines 202529 202945 +416
==========================================
+ Hits 140784 140913 +129
- Misses 61745 62032 +287 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this submission! Some initial general thoughts, before a deeper code review. PErhaps these questions are more appropriate on #16294 but I did not want to then split the discussion so let's keep it here.
I have not used MySQL Shell backups before. Some questions and notes:
-
This PR adds dependencies on
mysqlsh
andmysqlshell
binaries. This is just an observation, but points for consideration are:- Neither are included in a standard MySQL build. What are version dependencies between mysqlsh/mysqlshell and the MySQL server?
- Neither are included in the MySQL docker images, to the best of my understanding. This means this backup method will not be available on kubernetes deployments via
vitess-operator
.
-
Re: GTID not being available in the manifest file, this means we will not be able to run point in time recoveries with a mysqlshell-based full backup. Point in time recoveries require GTID information. As mentioned in Feature Request: MySQL Shell Logical Backups #16294 (comment), the mysqlshell method is the first and only (thus far) logical backup solution, so it's unfortunate that this solution will not support logical point in time recoveries.
Is it not possible to read thegtidExecuted
field from the@.json
dump file immediately after the backup is complete, and update the manifest file? E.g. if the dump is into a directory, isn't that directory available for us to read?
// This is usually run in a background goroutine, so there's no point | ||
// returning an error. Just log it. | ||
logger.Warningf("error scanning lines from %s: %v", prefix, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice we do not have a unit test for this function. Having moved it around, perhaps now is a good opportunity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I can probably add it there 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit test added for this function
These are good questions, thanks Shlomi!
|
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
ce12fdd
to
c51ee4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, thank you for this contribution.
Since we have not fully finished the deletion of mysqld
in the vitess/lite
Docker Images, the mysqlsh
binary will have to be included in the vitess/lite
image regardless if it's included in the official MySQL Docker Images or not. Since we are letting people choose between using an official MySQL image or the vitess/lite
image for their Docker/K8S deployment we must have the binary in both.
Regarding vitess-operator
, a follow-up PR is needed on the vitess-operator to allow this new backup engine. In our CRDs we have an enumeration that restrict what backup engines are allowed, we just need to add a new entry in the enumeration. This can be done here.
FYI, I can handle the vitess-operator changes.
Have you looked at |
Oh, that's nice! The reason I thought it wasn't included is that
I feel like it's OK to have some solution "with limitations". We should strive to support as much functionality as possible though. So IMHO we should strive to include the GTID when the backup goes into a directory. This should be possible to do, which then means the backup should fail if for some reason we can't fetch the GTID or validate it (correct GTID form). i.e. return I'd like @deepthi to weigh in her opinion. Assuming we do decide to move forward, then I'd next expect a CI/endtoend test please, as follows:
When these are all added, a new CI job will run to test If this test passes, then you will have validated the full cycle of backup and resotre, as well as correctness of the captured GTID. Edit: since S3, Azure, GCP can be left without GTID support for now. We'd need a documentation PR that clearly indicates the limitations of this method. |
var ( | ||
// location to store the mysql shell backup | ||
mysqlShellBackupLocation = "" | ||
// flags passed to the mysql shell utility, used both on dump/restore | ||
mysqlShellFlags = "--defaults-file=/dev/null --js -h localhost" | ||
// flags passed to the Dump command, as a JSON string | ||
mysqlShellDumpFlags = `{"threads": 2}` | ||
// flags passed to the Load command, as a JSON string | ||
mysqlShellLoadFlags = `{"threads": 4, "updateGtidSet": "replace", "skipBinlog": true, "progressFile": ""}` | ||
// drain a tablet when taking a backup | ||
mysqlShellBackupShouldDrain = false | ||
// disable redo logging and double write buffer | ||
mysqlShellSpeedUpRestore = false | ||
|
||
MySQLShellPreCheckError = errors.New("MySQLShellPreCheckError") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have correctly followed the existing design. I'm just taking the opportunity to say at some point we will want to move away from these global variables.
// flags passed to the mysql shell utility, used both on dump/restore | ||
mysqlShellFlags = "--defaults-file=/dev/null --js -h localhost" | ||
// flags passed to the Dump command, as a JSON string | ||
mysqlShellDumpFlags = `{"threads": 2}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to choose 2
rather than something based on runtime.NumCPU()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say because MySQL Shell backups can be a bit more intensive - we are requesting a bunch of data off MySQL which needs to be fetched, parsed and compressed - and in our particular use case we were taking backups online so we didn't wan't it to cause that much disruption. It is also part of the reason why I made sure ShouldDrainForBackup()
was configurable in case it is more suitable for the use case.
I am fine with changing the default to runtime.NumCPU()
though since it is configurable and leave this up to the user to decide based on their environment requirements, although I am also conscious that it might cause some issues in Kube where it will show the number of CPUs of the underlying node despite the pod being possibly limited on how much CPU it can use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we already have a --concurrency
flag for the Backup
command. Should we reuse that flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so looking this up again today, the default in MySQL Shell if not passed is 4
. what I have been considering is that if we do something like:
mysqlShellDumpFlags = fmt.Sprintf("{\"threads\": %d}", runtime.NumCPU())
it might give the user the wrong impression that this is the default for MySQL Shell and that if they use the flag to change other settings and they don't add thread
to it, they will use all CPUs by default.
so perhaps we should just set it to {"threads": 4}
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might give the user the wrong impression that this is the default for MySQL Shell and that if they use the flag to change other settings and they don't add thread to it, they will use all CPUs by default.
This is a documentation concern. We can't expect any user expectation. I can say that I would not assume as much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still, we can try to make it more difficult for users to do the wrong thing. perhaps we can remove threads
from the default value and leave it up to the user to modify it if the desire so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can remove threads from the default value and leave it up to the user to modify it if the desire so?
Would that mean mysql shell would then use 4
? I think this is fine. So you'd give the user the option for setting "threads"
, but that option is by default empty? I like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's correct. the docs mention the default is 4
fs.StringVar(&mysqlShellFlags, "mysql_shell_flags", mysqlShellFlags, "execution flags to pass to mysqlsh binary to be used during dump/load") | ||
fs.StringVar(&mysqlShellDumpFlags, "mysql_shell_dump_flags", mysqlShellDumpFlags, "flags to pass to mysql shell dump utility. This should be a JSON string and will be saved in the MANIFEST") | ||
fs.StringVar(&mysqlShellLoadFlags, "mysql_shell_load_flags", mysqlShellLoadFlags, "flags to pass to mysql shell load utility. This should be a JSON string") | ||
fs.BoolVar(&mysqlShellBackupShouldDrain, "mysql_shell_should_drain", mysqlShellBackupShouldDrain, "decide if we should drain while taking a backup or continue to serving traffic") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing the choice of draining vs not draining is due to the increased workload on the server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, exactly. in fact I have been meaning to propose this to be modifiable for the xtrabackup
engine as well, so a tablet won't be service traffic when it is taking a backup
fs.StringVar(&mysqlShellDumpFlags, "mysql_shell_dump_flags", mysqlShellDumpFlags, "flags to pass to mysql shell dump utility. This should be a JSON string and will be saved in the MANIFEST") | ||
fs.StringVar(&mysqlShellLoadFlags, "mysql_shell_load_flags", mysqlShellLoadFlags, "flags to pass to mysql shell load utility. This should be a JSON string") | ||
fs.BoolVar(&mysqlShellBackupShouldDrain, "mysql_shell_should_drain", mysqlShellBackupShouldDrain, "decide if we should drain while taking a backup or continue to serving traffic") | ||
fs.BoolVar(&mysqlShellSpeedUpRestore, "mysql_shell_speedup_restore", mysqlShellSpeedUpRestore, "speed up restore by disabling redo logging and double write buffer during the restore process") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels risky. Please indicate caveats in this flag's description. Otherwise this looks "too good", why wouldn't anyone want to speed up the restore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless you need the redo log/double write buffer to be disable once the instance has completed the restore, there shouldn't be much risk in enabling this. for some setups there might be an interest in disabling this (I can see as a possible case, somebody running on zfs
and wanting to keep the double write buffer disabled), so I didn't want to force it if the user has a similar scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so for the duration of the restore this is fine, because who cares, the server is not serving, it's just being restored, right? But then, at the end of the process, as we mentioned elsewhere, these settings must return to "persistent" values, or else the backup should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, most users will only want to run with these settings during the restore and go back to the having redo log/doublewrite buffer enabled before the tablets starts serving. This is the use case described in the MySQL docs:
As of MySQL 8.0.21, you can disable redo logging using the ALTER INSTANCE DISABLE INNODB REDO_LOG statement. This functionality is intended for loading data into a new MySQL instance. Disabling redo logging speeds up data loading by avoiding redo log writes and doublewrite buffering.
WarningThis feature is intended only for loading data into a new MySQL instance. Do not disable redo logging on a production system.
One thing that I realise now is that this will likely fail if the user is running <8.0.21
. Is this something we should validate in case this flag is set or do we already have a minimum version for MySQL with Vitess? In any case, it would just cause the backup to fail and the user would have to disable passing this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that I realise now is that this will likely fail if the user is running <8.0.21. Is this something we should validate in case this flag is set or do we already have a minimum version for MySQL with Vitess?
We should validate that based on the actual MySQL connection. This is an already recognized "capability" in
DisableRedoLogFlavorCapability // supported in MySQL 8.0.21 and above: https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-21.html |
Here is an example usage:
vitess/go/vt/vttablet/onlineddl/executor.go
Lines 851 to 852 in ae7214d
capableOf := mysql.ServerVersionCapableOf(conn.ServerVersion) | |
capable, err := capableOf(capabilities.PerformanceSchemaDataLocksTableCapability) |
We should fail the operation if the flags are set and the MySQL server versions is "incapable of" the functionality. Ideally, as close as possible to the flag parsing, but also just fine if just as part of the restore process, but do try to move that up as much as possible, and before doing actual substantial resotre work (e.g. before destorying the existing data...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has been added on 6e59215
args = append(args, strings.Fields(mysqlShellFlags)...) | ||
} | ||
|
||
args = append(args, "-e", fmt.Sprintf("util.dumpSchemas([\"vt_%s\"], %q, %s)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the keyspace/schema names be escaped here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean escape like in MySQL as `vt_keyspace`
? I am not sure MySQL Shell supports or expects it, I will verify that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't assume that the database name is vt_keyspace
. It is determined by the --init_db_name_override
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This brings up another point. Typically when we do the regular kind of backup we also backup the sidecar db, which defaults to _vt
. Does the mysqlshell backup specifically backup only the actual keyspace/database? What are the implications of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, it is possible to backup multiple db/schemas, we just need to specify them. is this the right place to get the name of the sidecar db?
vitess/go/constants/sidecar/name.go
Lines 23 to 25 in c51ee4b
const ( | |
DefaultName = "_vt" | |
) |
also, I am curious what are the consequences if _vt
is not in place, we tried some of these restores without any issues but maybe we are losing some metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shlomi-noach would it make sense to do it in a separate PR though, so it is easier to revert if needed? I would hate for us to have to revert this PR once merged in case it causes any side effects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeleteBeforeRestore
aside, would you all be comfortable with the idea adding a flag that lets makes us run DELETE USER
on all users except the current one and than letting the user define loadUsers
on the mysql shell restore flags on their own so these are reloaded as appropriate if they rely on replication for propagating the grants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to just get rid of DeleteBeforeRestore if not being used, rather than having the backup engines ignore the intended value.
I interpreted that in the sense that you prefer to remove the variable in this PR. It's fine if we did that on another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to do it in a separate PR though
One thing to take into account is that if we do it in a separate PR, after merging this current PR the new engine won't work on vitess-operator (the restore part only). I don't mind waiting for the second PR to be ready/merged in order to merge the vitess-operator one. But, I think it would be nice to flag it somewhere, either in this PR description or else, that this engine is only available if not using vitess-operator, in the event we do two PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, lets do it in this PR then to avoid the admin work of having it separate and putting up notices, I will push a change later today
defer func() { // re-enable once we are done with the restore. | ||
err := params.Mysqld.ExecuteSuperQueryList(ctx, []string{"ALTER INSTANCE ENABLE INNODB REDO_LOG"}) | ||
if err != nil { | ||
params.Logger.Errorf("unable to re-enable REDO_LOG: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fail the restore process? I'm not sure if I have a good answer here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good question. the original intention here is, since we were able to successfully disable Redo logs/double write buffer, it would be better to fail the restore than potentially put it in service with a potentially dangerous configuration without the user realising it.
if the user wishes to run with redo log/double write disabled buffer they can avoid setting --mysql_shell_speedup_restore
and handle it outside of vitess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we should fail the restore when these flags are provided and if the above errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just verifying that this is still to be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there something you still wanted me to address? currently we will only attempt to disable/enable if the flags are passed, otherwise there is no change in behaviour and if we can't run the ALTER INSTANCE
we will fail the backup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently we will only attempt to disable/enable if the flags are passed, otherwise there is no change in behaviour and if we can't run the ALTER INSTANCE we will fail the backup
All right! Then there is nothing else to be addressed.
@shlomi-noach yeah, we looked into I think the proposal to have the GTID read when backing up to a directory while not when using an object store is fair, let me look into it and make the necessary changes. If all looks good I will proceed with working on the CI/endtoend tests, I just wanted to get some initial feedback before doing so. Also curious what @deepthi thinks about this approach.
Is that something that needs to happen as part of this PR or something separate? |
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
@frouioui I have added the missing grant, hopefully that will fix it on the vitess operator. I have also removed |
@rvrangel, the latest changes look good to me, and the vtop tests are all passing! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think about marking this new workflow as required when merging the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We'll give it a couple weeks until all open PRs are in sync, then we'll make it Required
.
changelog/21.0/21.0.0/summary.md
Outdated
We are introducing a the backup engine supporting logical backups starting on v21 to support use cases that require something else besides physical backups. This is experimental and is based on the | ||
[MySQL Shell](https://dev.mysql.com/doc/mysql-shell/8.0/en/). | ||
|
||
The new engine is enabled by using `--backup_engine_implementation=mysqlshell`. There are other options that are required, so check the docs on which options are required and how to use it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a link to the docs would be useful here.
@@ -179,6 +179,7 @@ func (p *RestoreParams) IsIncrementalRecovery() bool { | |||
// Returns the manifest of a backup if successful, otherwise returns an error | |||
type RestoreEngine interface { | |||
ExecuteRestore(ctx context.Context, params RestoreParams, bh backupstorage.BackupHandle) (*BackupManifest, error) | |||
ShouldStartMySQLAfterRestore() bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved on my side, pending final comments from @frouioui
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Merged |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good to me besides #16295 (comment).
The vitess-operator PR also looks good to me and is ready to go.
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
@frouioui thanks, I missed that comment. I have added the GRANT on those files now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working with us through this big change! I don't think any of us anticipated the amount of work related to adding a new backup engine.
Vitess-operator, Docker, and the Vitess parts look good to me!
Thanks everyone for the support, it was a long PR but it definitely looks a much better solution than when it started! I appreciate all the suggestions feedback provided and great to see it finally merged 🎉 |
Signed-off-by: Renan Rangel <rrangel@slack-corp.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Description
This is a PR that implements a new backup engine for use with MySQL Shell, which is mentioned in the feature request here: #16294
It works a bit differently than the existing engines in vitess in which it only stores the metadata of how the backup was created (location + parameters used) and during the restore uses the location plus other parameters (mysql shell are different if you are doing a dump vs a restore, so we can't use exactly the same ones)
Related Issue(s)
Fixes #16294
Checklist
Deployment Notes