Ensure replication running after replication restart #1422

Merged
andyedison merged 7 commits into master from adjust-wait-post-repl-restart
Dec 18, 2024

Conversation

andyedison
Contributor

@andyedison andyedison commented Jun 11, 2024

A Pull Request should be associated with an Issue.

We wish to have discussions in Issues. A single issue may be targeted by multiple PRs.
If you're offering a new feature or fixing anything, we'd like to know beforehand in Issues,
and potentially we'll be able to point development in a particular direction.

Related issue:

Further notes in https://github.com/github/gh-ost/blob/master/.github/CONTRIBUTING.md
Thank you! We are open to PRs, but please understand if for technical reasons we are unable to accept each and any PR

Description

This PR adds a check to the restartReplication function that ensures replication has started before continuing. Before this check was added, there was a hard-coded 500ms wait (added in #337), after which the program assumed that the replication threads were started and running.

We encountered situations in different environments where this wait time wasn't sufficient. As an experiment, we doubled the wait time and deployed it to our live environments to see whether that resolved the issue. It did, so we are now coming back to find a better permanent fix.

Example output from our logs:

2024-06-11 08:17:41 FATAL Replication on <replica-hostname>:3306 is broken: Slave_IO_Running: Connecting, Slave_SQL_Running: Yes. Please make sure replication runs before using gh-ost.
Old description:

This PR increases the `startSlavePostWaitMilliseconds` as we are seeing an error when running `gh-ost` in some cloud environments that the `Slave_IO_Running` is `Connecting` rather than `Yes` as expected.

We found an old PR, #337, that describes the issue we're having. As a first step, we are doubling the value. If this test is successful, we'll look into making it configurable.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

meiji163
meiji163 previously approved these changes Jun 11, 2024
@timvaillancourt
Collaborator

@andyedison I wonder if this fix is good enough for all cases. The sleep that exists now is a bit hacky

> in some cloud environments that the Slave_IO_Running is Connecting rather than Yes as expected.

What should we be waiting for? I think you're saying the IO thread should be running. Whatever the answer, it would be safer if gh-ost waited and checked that what we want has been achieved, rather than relying on a time.Sleep().

@andyedison
Contributor Author

No, I agree; I doubt this is good enough for all cases. This was a bit of an experiment to see whether increasing this time would prevent the errors we were seeing in a particular environment over a period of time. I believe it has. We just haven't had time to swing back to this and dig in to find a more permanent solution.

@andyedison andyedison changed the title from "Adjust wait time after replication restart" to "Ensure replication running after replication restart" Oct 14, 2024
@andyedison
Contributor Author

I've updated the PR, title, and description to better represent what this change is. Instead of assuming that replication has resumed successfully after the timeout, the code now checks whether it is running. If it is not, it waits an interval before checking again, and returns an error if a maximum wait time is exceeded.

@@ -22,7 +22,8 @@ import (
 	"github.com/openark/golib/sqlutils"
 )

-const startSlavePostWaitMilliseconds = 500 * time.Millisecond
+const startReplicationPostWait = 250 * time.Millisecond
+const startReplicationMaxWait = 2 * time.Second
Contributor Author

2 seconds was somewhat arbitrary. I could be convinced to adjust this if others have strong opinions.

@meiji163
Copy link
Contributor

Internal testing results look good, feel free to merge when ready 👍

@andyedison andyedison merged commit 19cf183 into master Dec 18, 2024
7 checks passed
@andyedison andyedison deleted the adjust-wait-post-repl-restart branch December 19, 2024 19:12