Avoid failover delays in twemproxy reconfig when using master targets #277

Merged
merged 1 commit into from
Oct 17, 2023

Conversation

roivaz
Member

@roivaz roivaz commented Oct 16, 2023

It seems that even though the "sentinel master <shard_name>" command returns updated information about the master as soon as a slave is promoted, there is an exception: the sentinel instance that acts as "failover leader". The failover leader is the instance that actually performs the shard reconfiguration, and its "sentinel master <shard_name>" command only returns updated information once all the slaves have also been reconfigured to point to the new master. This delays twemproxy reconfiguration if the leader is the instance the twemproxyconfig controller uses to "discover" the shard.
To detect this situation, add a step that checks the master address with the command "sentinel get-master-addr-by-name <shard_name>". This command always returns updated master information, even if we are querying the leader sentinel instance.
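
For illustration only, the check could look roughly like the sketch below. This is not the code from this PR: the function name and the way the two replies are passed in are hypothetical. It only assumes that the "sentinel master" reply has already been parsed into a mapping with "ip" and "port" fields, and that "sentinel get-master-addr-by-name" returns an [ip, port] pair.

```python
from typing import Mapping, Sequence, Tuple


def resolve_master_address(
    master_info: Mapping[str, str],
    master_addr: Sequence[str],
) -> Tuple[str, int, bool]:
    """Hypothetical helper, not taken from the PR.

    master_info: parsed reply of "sentinel master <shard_name>"
    master_addr: reply of "sentinel get-master-addr-by-name <shard_name>"

    Returns (host, port, mismatch). The address always comes from
    get-master-addr-by-name; mismatch is True when "sentinel master"
    still reports a different address, as can happen on the failover
    leader while slaves are being reconfigured.
    """
    reported = (master_info["ip"], int(master_info["port"]))
    current = (master_addr[0], int(master_addr[1]))
    return current[0], current[1], reported != current
```

When the mismatch flag is set, a caller would use the returned address rather than the one from "sentinel master", so the proxy configuration does not have to wait for the leader to finish reconfiguring the slaves.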

This must be merged before #276 as it should be included in the new release.

/kind bug
/priority important-soon
/assign

@3scale-robot 3scale-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 16, 2023
@3scale-robot 3scale-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next sprint. label Oct 16, 2023
@3scale-robot 3scale-robot added needs-size Indicates a PR or issue lacks a `size/foo` label and requires one. size/M Requires about a day to complete the PR or the issue. and removed needs-size Indicates a PR or issue lacks a `size/foo` label and requires one. labels Oct 16, 2023
@roivaz roivaz force-pushed the avoid-failover-delays branch from b89c11a to d9cab77 Compare October 17, 2023 09:08
@3scale-robot 3scale-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2023
@3scale-robot
Contributor

LGTM label has been added.

Git tree hash: 578f6ba3019e17ab6a204433a2579a70b46a1fdf

It seems that even though the "sentinel master <shard_name>" command
returns updated information about the master as soon as a slave is
promoted, there is an exception with the sentinel instance that acts as
"failover leader". The failover leader is the instance that actually
performs the shard reconfigurations, and in this case, its "sentinel
master <shard_name>" command only returns updated information when all
the slaves have been reconfigured to point to the new master. This
causes delays in twemproxy reconfiguration if this is the instance being
used to "discover" the shard.
To detect this situation, add a step that checks the master address with
the command "sentinel get-master-addr-by-name <shard_name>". This
command always returns updated master information, even if we are
querying the leader sentinel instance.
@roivaz roivaz force-pushed the avoid-failover-delays branch from d9cab77 to 79e9e89 Compare October 17, 2023 09:14
@3scale-robot 3scale-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2023
@3scale-robot 3scale-robot requested a review from slopezz October 17, 2023 09:14
@roivaz roivaz added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2023
@roivaz
Member Author

roivaz commented Oct 17, 2023

/approve

@3scale-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: roivaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@3scale-robot 3scale-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2023
@3scale-robot 3scale-robot merged commit e081080 into main Oct 17, 2023
4 checks passed
@3scale-robot 3scale-robot deleted the avoid-failover-delays branch October 17, 2023 09:32
@roivaz
Member Author

roivaz commented Nov 3, 2023
