Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue: support Amazon RDS #163

Closed
shlomi-noach opened this issue Aug 18, 2016 · 53 comments
Closed

Tracking issue: support Amazon RDS #163

shlomi-noach opened this issue Aug 18, 2016 · 53 comments

Comments

@shlomi-noach
Copy link
Contributor

shlomi-noach commented Aug 18, 2016

EDIT: Documentation for RDS is available

This is a tracking issue to discuss all things RDS and what it takes to be able to run gh-ost on RDS.

@shlomi-noach
Copy link
Contributor Author

First limitation is the SUPER privilege. This is defined in #132 and addressed in #156

When you know you are using RBR (Row Based Replication), you can specify --assume-rbr and this will skip the step of STOP SLAVE; START SLAVE which requires SUPER privilege.

@shlomi-noach
Copy link
Contributor Author

I should note that I am not using RDS at the moment; will need community feedback for production issues.

@pbitty
Copy link
Contributor

pbitty commented Aug 18, 2016

I am using RDS and have been trying out gh-ost with success so far. I am happy to give feedback and contribute where possible.

I've succeeded in running a test with --test-on-replica (using the workaround described in #162) and table checksums matched. And I've done a migration with --migrate-on-replica and did not have any significant issues.

@pbitty
Copy link
Contributor

pbitty commented Aug 18, 2016

As far as I can tell, the only other blocker is the STOP SLAVE command that is issued in --test-on-replica mode (to be solved via #162).

In RDS, it is possible to start/stop replication with stored procedures they provide, ie: CALL mysql.rds_stop_replication. So we could support automatic replication stop/start.

The only other thing that is not supported (that I can tell) is setting global variables, so we wouldn't be able to support auto-configuring RBR from mysql itself. It would be possible to do it via the RDS API, but there are non-trivial concerns since multiple RDS instances can share the same configuration, so setting binlog_format=ROW on a shared configuration could change it on multiple RDS instances at the same time. This could be left as a manual step and included in a doc, or a plugin from #62.

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Aug 18, 2016

Thank you! You are now assigned the role of Chief RDS Officer at Gh-Ost Corp.

Changing a statement replica to row can certainly be delegated to the user, no need to trouble gh-ost on this if this is an issue.

@pbitty
Copy link
Contributor

pbitty commented Aug 19, 2016

Ha, thanks!

I am working on documenting the steps required for using gh-ost in our RDS environment. I should be able to make it generic enough to help others. When I do, I can submit a PR if you are interested. Hopefully we can get other RDS users to provide some feedback.

@shlomi-noach
Copy link
Contributor Author

Hooks have been merged, though not yet released.

@pbitty
Copy link
Contributor

pbitty commented Aug 31, 2016

Great! I will try them out with our next --test-on-replica run either today or tomorrow.

@pbitty
Copy link
Contributor

pbitty commented Sep 1, 2016

The gh-ost-on-stop-replication worked like a charm. I was able to stop replication the RDS way before continuing with cut-over, using the hook along with --test-on-replica and --test-on-replica-skip-replica-stop.

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Sep 5, 2016

At this time (v.1.0.17), the way to run gh-ost on RDS is as follows:

Real migration

  • You need to make sure replication is RBR. gh-ost has no privilege to set binlog_format=ROW for you.
  • RDS will not honor the parameter group config for the binlog unless you enable automated backups on the read replica (for at least 1 day).
  • Execute gh-ost --assume-rbr ...
    This will make gh-ost skip the stop slave; start slave step, which it does not have privileges for.
  • 🍕

Test on replica

In addition to the above:

  • gh-ost --assume-rbr --test-on-replica --test-on-replica-skip-replica-stop
    The --test-on-replica-skip-replica-stop avoids issuing a stop slave at the end of migration
  • However replication must be stopped. Using hooks you need to supply gh-ost-on-stop-replication hook -- where you will invoke your own RDS script to stop replication
  • You must make read_only=0 on the replica's Parameter Group. By default read_only={TrueIfReplica} and therefore enabled
  • 🍕

Once the above is tested more by the community, I will put this in documentation

@pbitty
Copy link
Contributor

pbitty commented Sep 20, 2016

This is the hook we use.

At /hooks/gh-ost-on-stop-replication-rds-command.sh:

#!/usr/bin/env bash

statement="call mysql.rds_stop_replication"
echo "Issuing statement to stop replication: ${statement}" 1>&2
exec mysql -h $DB_HOST -u $DB_USER -p$DB_PASSWORD -e "${statement}"

where --hooks-path=/hooks.

The DB_... env vars are set when running the gh-ost process, and then passed by gh-ost onto the hook process.

@pbitty
Copy link
Contributor

pbitty commented Sep 20, 2016

This is the minimum privilege required for that:

GRANT EXECUTE ON PROCEDURE mysql.rds_stop_replication to '<user>'@'<host_or_wildcard>';

@shlomi-noach
Copy link
Contributor Author

@pbitty thank you for this info!

To clarify, in: GRANT EXECUTE ON PROCEDURE mysql.rds_stop_replication to '<your_ghost_user>'@... you don't actually need the user to be the gh-ost user you use for running the migration. This was your personal choice.

@pbitty
Copy link
Contributor

pbitty commented Sep 22, 2016

Good point. I've updated the comment above.

@shlomi-noach
Copy link
Contributor Author

Cool. Eventually we'll make a doc page out of this.

@drmmr763
Copy link

hey all

Just arriving here after researching tools like percona's online schema change and even @shlomi-noach's old open oak project. This project has become highly relevant to me on the back of a terrible alter-table experience last week :)

I'm curious what is the status of Amazon RDS support at this point? it looks like the biggest issue that even Percona's toolkit had with regard to the super user was resolved in #156. Is this the only known limiting issue or are there more known issues?

@pbitty
Copy link
Contributor

pbitty commented Sep 26, 2016

@drmmr763 , I have been using gh-ost in RDS with no real issues. Any issues I've had have not been related to RDS and are being addressed as we go.

@pbitty
Copy link
Contributor

pbitty commented Sep 26, 2016

I should note, for --test-on-replica mode, you must make read_only=0 on the replica's Parameter Group. By default read_only={TrueIfReplica} and therefore enabled.

@shandrew
Copy link

Has anyone tested gh-ost on RDS-Aurora?

@drmmr763
Copy link

Thanks for the reply @pbitty! Actually I have a followup based on what @shandrew said -- is Amazon RDS not the same as RDS-Aurora? Or are they the same infrastructure / architecture?

@pbitty
Copy link
Contributor

pbitty commented Sep 29, 2016

@drmmr763 Amazon Relational Database Service (RDS) is the umbrella name for their relational DB offerings. It's basically a managed DB service. They offer Mysql, Postgres, MS-SQL server instances and maybe others. RDS Aurora is a new offering that is Mysql wire-compatible but has a different architecture on the backend (it's not actually a mysql server). We haven't used it yet, but they claim performance benefits in a lot of cases.

@fern4lvarez
Copy link
Contributor

@shlomi-noach could you please update this comment to reflect that replica's Parameter Group requires read_only=0 set for --test-on-replica mode, as @pbitty pointed out in #163 (comment)?

It can help others to have all needed information in one place, until this makes it into the official docs.

@shlomi-noach
Copy link
Contributor Author

@fern4lvarez done, thank you!

@pbitty
Copy link
Contributor

pbitty commented Oct 19, 2016

I will set aside some time before end of week to whip up a PR for RDS documentation.

@fern4lvarez
Copy link
Contributor

@pbitty did you try to throttle a migration during a long period of time? I'm concerned about expire_logs_days value set as 0 by default on RDS instances. Furthermore, this variable is not editable via Paremeter Groups, which makes it a bit more painful to change every time I want to start new replicas. See https://forums.aws.amazon.com/thread.jspa?threadID=152786

@pbitty
Copy link
Contributor

pbitty commented Oct 20, 2016

@fern4lvarez , I don't believe I've throttled one for a long period - however, we do stop replication for about 8 hours during backups on some of our instances.

RDS hides this setting behind a config option that you can set using their mysql.rds_set_configuration stored procedure. eg. To set it for 24 hours, you call:

call mysql.rds_set_configuration('binlog retention hours', 24);

It's not very well-documented, you can find it at the bottom of this doc.

@jessicalu2016
Copy link

Hi, I am trying to do "Connect to replica, migrate on master" or "Migrate/test on replica" on RDS but having an issue complaining "<rbr_replica_host> must have binary logs enabled". The parameter group does have "log-bin" set up and "binlog_format" is set to "ROW". I would be very appreciated any suggestions how to solve it. Thank you.

@jessicalu2016
Copy link

@pbitty Thanks so much! this solved the problem.

@pbitty
Copy link
Contributor

pbitty commented Oct 28, 2016

@fern4lvarez Is the replica configured withbinlog_format=ROW ?

@shlomi-noach
Copy link
Contributor Author

@pbitty thank you -- and I think you're on the spot.

@fern4lvarez may I suggest you should ask these questions in new issues; this issue is for tracking down changes to support RDS, and we cannot handle multiple problems in the same thread. Opening a new issue and linking to this issue would be the right thing to do.

@Ahmadposten
Copy link

@shlomi-noach
I can see that the binlog_format parameter in rds parameter groups is non-modifiable .. How did you manage to make it "ROW"?
currently -asume-rbr does me no good and the binlog_format is set to Mixed

@shlomi-noach
Copy link
Contributor Author

@Ahmadposten I myself am not using RDS in the first place. Hopefully @pbitty can help out -- but do see his earlier notes on the matter.

@pbitty
Copy link
Contributor

pbitty commented Nov 17, 2016

@Ahmadposten What version of mysql are you using? In our account I can only change that setting on PGs for mysql 5.6 (or higher?). On 5.5 and 5.1 it is non-modifiable as you said. Is this the case for you?

You would have to upgrade to 5.6 to use gh-ost. In RDS, only mysql 5.6 or higher have support for "external replicas". With 5.5 and lower, they don't give you the ability to grant REPLICATION SLAVE to a user.

@Ahmadposten
Copy link

@pbitty @shlomi-noach
Yes this seems to be my case .. I use mysql 5.5
I will try upgrading ... Thank you guys

@lperrin
Copy link

lperrin commented Dec 1, 2016

@shlomi-noach one thing to note is that RDS will not honor the parameter group config for the binlog unless you enable automated backups on the read replica (for at least 1 day). Could you add it in your comment?

@shlomi-noach
Copy link
Contributor Author

@lperrin done. thank you!

@jessicalu2016
Copy link

Hi guys, we are thinking to move from RDS mysql to Aurora and is curious if anyone here knows if gh-ost works for Aurora as well or not. Thanks!

@HasMikeW
Copy link

I'm testing it with Aurora and so far so good.

@dgtized
Copy link

dgtized commented Jan 24, 2017

I've run several migrations on RDS MySQL using the directions above and had no problems.

@kamaljoshi
Copy link

I ran around 7-8 migrations on MySQL in December using this. Everything seemed to have worked well.
Some minor changes I had to do with gh-ost were related to making --assume-master-host work without Tungsten. As RDS replica was using internal IPs(172.17.X.X) to communicate with master and the machine we were running gh-ost on couldn't access that, I had to manually pass the master hostname.

@jacobbednarz
Copy link
Contributor

I've been testing this for the last couple of months using RDS and aurora as the engine. The steps here related to RDS are spot on 👍 , the only differing pain point I've found is that the aurora parameter group doesn't allow modifying read_only. It still uses the same value of {TrueIfReplica} however the aurora parameters are marked as a non-modifiable field which prevents connecting to replica but running on the master.

We're looking into a couple of work arounds for this and I'll report back if we get anything worth while but for now, best to just be aware you'll need to run it on the master.

@pbitty
Copy link
Contributor

pbitty commented Mar 10, 2017

@jacobbednarz That's great to hear.

From what I understand of Aurora, all nodes share the same underlying storage layer, so only one node is allowed to write to it. You may be able to get around this by replicating to another Aurora cluster. From gh-ost's point of view, it would still be connecting to a "master", but this master would be on a 2nd Aurora cluster (that is receiving binlog updates from the 1st one).

I haven't tried this, just theorizing. :)

@jacobbednarz
Copy link
Contributor

You may be able to get around this by replicating to another Aurora cluster.

This is a possibility however I don't believe it solves the issue of running on a replica but migrating the master (which is the intention here). Your comment further down explains why this is the case and right now I can't see a viable solution.

The best you can expect from RDS on Aurora at the moment is that you can run it on the cluster but only directly on the master. If you want to test, you need to stand up additional clusters - probably based on a recent snapshot.

@HasMikeW
Copy link

I migrated a 1.5 billion record table on a highly concurrent system in Aurora. It took a couple attempts, because the inserts would slow even with no constraints added to the base command. We had to scale back some of the load on the server to be able to complete the task in a timely manner.

@shlomi-noach
Copy link
Contributor Author

@HasMikeW which version of gh-ost were you using? More specifically, does your version support --dml-batch-size? Introduction of this functionality (indicated by the existence of this flag) significantly improved write throughput. Generally speaking I prefer to just have some sane configuration and not try to tune too much, but if you suffer from low throughput you may try a higher value for that flag (default: 10, try perhaps 50 or 100).

@HasMikeW
Copy link

@shlomi-noach The version is 1.0.32 I had looked into using the --dml-batch-size but was unable to get it to work with this version. I didn't want to go back and have to run more tests that a new version of gh-ost would have dictated at that point. That was the next step if reducing the load did not work :-)

@jacobbednarz
Copy link
Contributor

Just an update to my comment above - I've managed to find a nice workaround for testing/migrating on a replica in AWS RDS using aurora as the engine.

@pbitty was pretty close to the money however you don't use Aurora replication, you need to use standard MySQL-style replication. The AWS documentation page does an OK-ish job at explaining this but it's probably better to follow non-AWS docs and just setup replication as if you would elsewhere using a separate aurora cluster in the same region.

@shlomi-noach
Copy link
Contributor Author

@jacobbednarz do you think it would be good to write some documentation for Aurora users, and if so, would you be able and willing to write it?

@jacobbednarz
Copy link
Contributor

It would definitely be worth while! I'm happy to write it (since I've already got a bunch of notes from our internal documentation that would help here too).

@jacobbednarz
Copy link
Contributor

I've opened a PR to add in AWS RDS docs at #385 for anyone following along at home.

@jacobbednarz
Copy link
Contributor

@shlomi-noach I think everything in this issue has been addressed either via docs or changes. We've also managed to run ~20 successful test + production migrations on RDS so I'm happy most of the annoyances are out of the way. Would it be time to close this off and expect new issues to be in a thread of their own?

@shlomi-noach
Copy link
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests