feedback: Monitoring a replica set | Tarantool #2604
Found #355. Now my understanding is the following. Say we have a master and a replica. Let me highlight: this is just my understanding of the docs, the issue above, and the related commit message, without even looking into the code. Every statement below may be mistaken or inaccurate.

How the master tracks the regularity of communications

Let's imagine the following timeline:
TBD: What if the master doesn't see heartbeats for a long time?

How the replica tracks network latency

Next, another timeline:
And another case, when the master doesn't serve write requests for a long time:
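Both cases reduce to the same bookkeeping, which can be sketched as a toy model (the class and field names here are my own, hypothetical ones; this is not how Tarantool implements it):

```python
class UpstreamState:
    """Toy model of a replica's view of one upstream (hypothetical names,
    not Tarantool internals)."""

    def __init__(self, now):
        self.last_message_at = now  # any message (heartbeat or row) resets idle
        self.lag = 0.0              # delay estimate, taken from row timestamps

    def on_heartbeat(self, now):
        # A heartbeat proves the link is alive: idle drops to zero...
        self.last_message_at = now

    def on_row(self, now, sent_at):
        # ...but only a real replicated row carries a write timestamp,
        # so only rows update the lag estimate.
        self.last_message_at = now
        self.lag = now - sent_at

    def idle(self, now):
        return now - self.last_message_at


# The "master serves no writes for a long time" case: heartbeats keep idle
# near zero, while lag simply stops being updated -- it goes stale rather
# than growing.
u = UpstreamState(now=0.0)
u.on_row(now=1.0, sent_at=0.9)  # lag becomes ~0.1
u.on_heartbeat(now=100.0)       # idle resets; lag is still ~0.1
```

Under this model, a quiet-but-connected master is indistinguishable from a laggy one if you only watch lag, which is why both numbers matter below.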
So TBD: Or not? If the network is broken, will we see it somehow differently?

How the replica tracks the regularity of communications

I don't see good options for how the replica could track this. So TBD: What if the replica doesn't see heartbeats for a long time?

What am I doing here? I'm looking for some 'common sense' criteria of a healthy instance, to implement tracking in connectors and prevent a user from seeing stale data.

Healthy master

A master is the bleeding edge of our data; it is not stale by definition. However, with automatic leader election we can meet a situation where the leader loses connectivity to a quorum of instances, another leader is elected, and the old one doesn't know about this. We should mark the old leader as unhealthy if another leader was elected in a newer term.

Healthy replica

A replica is okay if it is on track with its master(s). We should look at the maximal idle. However, that does NOT reveal a large-latency situation: we can receive updates from a master regularly, but with a large delay. So we should also look at the maximal lag.

What is a replica

We can look at it from different angles:
However, say, the instance acts as a master (and also has an upstream; say the replica set is in full mesh). The connectivity with the upstream becomes broken. So what? The instance anyway contains the freshest data. So, maybe:
(I need to think about pitfalls here.)

Master-master

Here each instance acts as both a replica and a master, so we should apply both criteria. Since our automatic leader election does not support master-master, effectively we'll apply the replica's criteria here.

Any instance health

There are points that apply to any instance: to a master as well as to a replica. At the very least, we generally should not execute requests until the database is fully bootstrapped (recovered from disk or from a master). We should look at the instance status; other status values indicate that the instance is not fully operational yet. In fact, it's strange that an instance serves requests before a full bootstrap: it leads to problems like the following:
And extra code is needed to handle it. Of course, some service requests should be processed before bootstrap: monitoring requests, replication join requests, and likely some others. But allowing access to data (or app logic) in this state by default was a mistake, I think.
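The "refuse data requests before full bootstrap" idea could look roughly like this on the connector or app-wrapper side. The status values mimic box.info.status, but which ones should let which requests through is my assumption, not documented behavior:

```python
# Hypothetical request gate: serve everything once bootstrapped, and only
# service requests (monitoring, replication join, ...) before that.

READY = {"running"}
BOOTSTRAPPING = {"loading", "orphan"}

def allow_request(status, is_service_request):
    if status in READY:
        return True
    # Before bootstrap completes, only service requests should pass.
    return is_service_request and status in BOOTSTRAPPING
```

For example, `allow_request("loading", is_service_request=False)` would refuse a data request on a still-loading instance, while a monitoring probe would still get through.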
https://www.tarantool.io/en/doc/latest/book/replication/repl_monitoring/
Do I understand correctly that replication idle and replication lag are the same, except that replication lag tracks only WAL writes and is not updated by heartbeats?
On https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_info/replication/ I see that both lag and idle are in the upstream object (on the replica), but the downstream object (on the master) has only idle.
To be honest, the documentation does not give me a predicate I could use to decide whether an instance is healthy. It also does not reveal the details of how exactly these two metrics work, so I can't construct such a predicate myself.
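A sketch of what such a predicate might look like, assembled from the criteria discussed above. The field names ("status", "is_master", "upstreams") and the thresholds are assumptions for illustration, not something the docs confirm:

```python
# Hypothetical connector-side health predicate over a box.info-like snapshot.

def is_healthy(info, max_idle=1.0, max_lag=0.5):
    # Any instance: must be fully bootstrapped.
    if info["status"] != "running":
        return False
    # A master is never stale by definition (modulo the stale-leader case,
    # which would additionally need a term check under leader election).
    if info["is_master"]:
        return True
    # A replica: communication must be regular (small idle) AND the data
    # must be recent (small lag) for every upstream.
    return all(u["idle"] <= max_idle and u["lag"] <= max_lag
               for u in info["upstreams"])
```

So an instance with `status = "loading"`, or a replica whose worst upstream exceeds either threshold, would be reported as unhealthy.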
(Filed by @Totktonada.)