Orchestrator Recovery Time (version 3.0.11) #648
Comments
@rafael both these times fall under our acceptable failover time expectation, but we got more on the
A As you work with GTID, and time is spent on
(To contrast, with Pseudo-GTID So: the time it takes to run Easy answer (which I have done before): standard binary log size is Also, do you happen to have a delayed replica among those replicas which were regrouped? I incidentally optimized that in #641 (not yet merged).
@shlomi-noach - Correct, I updated all the servers. I will check oh I was actually thinking the same about the GTID issue. Could we apply the same trick we did in Vitess in vitessio/vitess#4161 to flush binary logs before we try to regroup GTIDs?
In the tests I did, the replicas were not delayed. But I think the binlogs were big. I will try to reproduce this situation.
That won't help you at the time of failover, since there are no incoming writes to populate the newly rotated binary logs.
Ah yes. Forgot about that detail.
You can either configure the binary logs to be 100MB, or to routinely flush them via cron/daemon as soon as they extend 100MB (just make sure to |
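The cron/daemon option can be sketched as a small size check. This is only a minimal illustration, not part of orchestrator or MySQL tooling: the function names are hypothetical, the current binary log size is assumed to be read elsewhere (for example from the `Position` column of `SHOW MASTER STATUS`), and connecting to MySQL and running the statement are left out.

```python
from typing import Optional

# 100MB, the threshold suggested in the comment above.
THRESHOLD_BYTES = 100 * 1024 * 1024


def needs_rotation(current_binlog_bytes: int,
                   threshold: int = THRESHOLD_BYTES) -> bool:
    """Return True once the active binary log has reached the threshold."""
    return current_binlog_bytes >= threshold


def rotation_statement(current_binlog_bytes: int,
                       threshold: int = THRESHOLD_BYTES) -> Optional[str]:
    """Return the SQL a cron job or daemon would run, or None if no
    rotation is needed yet.

    Assumption: the caller obtained current_binlog_bytes from the server,
    e.g. the Position column of SHOW MASTER STATUS.
    """
    if needs_rotation(current_binlog_bytes, threshold):
        return "FLUSH BINARY LOGS"
    return None
```

A periodic job would call `rotation_statement(...)` with the current size and execute the returned statement only when it is not `None`.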
BTW you can ask |
Thanks @shlomi-noach. Yes! We've used that one in the past. Our We are in the process of rolling out a change to set Still thinking about what to do about the binlogs issue. I'll make sure to update here if we see improvements the next time we get a failure!
Looking forward to your update. I got to thinking. Since the binlog size is a read-only variable in MySQL (need to restart server to apply it), it would be reasonable in my opinion to have
Description
I discussed this a long time ago with @shlomi-noach in Slack and I finally got the chance to start looking into it. I'd like to better understand how long orchestrator is expected to take to recover a dead master.
In our setup, I'm seeing two patterns in how long it takes to recover a dead master and wanted to check if these are the expected times or if there are ways to improve this and shorten this time.
Pattern 1:
Total Time: 23s
Pattern 2:
Total Time: 12s
From my observations in these tests, the time that it takes orchestrator to declare a DeadMaster is pretty stable (around 10s).
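As a rough breakdown of those figures, the sketch below subtracts the detection time from each total. It assumes the reported totals include the roughly 10s detection phase, which is not stated explicitly above; the variable and function names are illustrative only.

```python
# Approximate time to declare DeadMaster, as observed above.
DETECTION_SECONDS = 10

# Total times reported for the two patterns.
TOTAL_SECONDS = {"Pattern 1": 23, "Pattern 2": 12}


def recovery_seconds(total: int, detection: int = DETECTION_SECONDS) -> int:
    """Time left for the recovery phase, assuming total = detection + recovery."""
    return total - detection


# Estimated recovery time per pattern, in seconds.
breakdown = {name: recovery_seconds(total) for name, total in TOTAL_SECONDS.items()}

for name, seconds in breakdown.items():
    print(f"{name}: ~{seconds}s spent in recovery")
```

Under that assumption, almost all of the difference between the two patterns comes from the recovery phase rather than from detection.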
The time that it takes to recover seems to vary and I would love to understand this a bit better. When I look in the audit log, when it takes longer, the time is being spent in these steps:
Other notes:
I tried what we discussed in Slack:
But that didn't make a difference in the detection time.