Track replication upstream.idle when calling on replica #453
Can't we just lean on …? NB: my old findings regarding replication monitoring: tarantool/doc#2604.
I assume you meant …
While fixing the initial problem, I got the following upstream info in my case:
So monitoring …
No, but I got the answer :) In my imagination, Tarantool should have some … Now we are going to solve the same monitoring tasks in each client, on the application level.
It should be a combination of both lag and idle, whichever is greater. Or, as @Totktonada suggests, we may look at … Or it can be …
After discussing with @sergepetrenko, it looks like we should go with the following condition for an unhealthy replica: the status of the upstream to the master is not … I see this feature as an automatic (temporary) lowering of the replica's priority on the router if the replica is not healthy. We'll increase the priority back once the node considers itself healthy. No additional cfg parameters for now. @Gerold103, your opinion on this?
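The condition discussed above could look roughly like the following. This is a minimal sketch, not vshard's actual code: the `'follow'` status value and the 30-second threshold are assumptions based on the discussion, and `is_replica_healthy` is a hypothetical helper name.

```lua
-- Hypothetical health check for a replica, based on the fields of
-- box.info.replication[n].upstream. The threshold is an assumption.
local TIMEOUT = 30

local function is_replica_healthy(upstream)
    if upstream == nil then
        -- No connection to the master at all.
        return false
    end
    if upstream.status ~= 'follow' then
        -- E.g. 'disconnected' or 'stopped' - assumed unhealthy.
        return false
    end
    if (upstream.idle or 0) > TIMEOUT then
        -- Dead connection: no activity from the master for too long.
        return false
    end
    if (upstream.lag or 0) > TIMEOUT then
        -- Too slow: the replica applies the master's changes with a big delay.
        return false
    end
    return true
end
```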
What is "priority"? If it's something like the probability of sending a request to this host, then I would prefer the probability to be zero, if we agreed that too-laggy replicas are not suitable for processing requests at all. Otherwise we will receive "flashing" data depending on which replica the request lands on.
First we try to make the request to the most prioritized replica, and then to the other ones if the most prioritized one fails. If you don't want a replica to serve any kind of requests, you can just disable the storage. But most users would prefer requests not to fail, so we go to the unhealthy replica if requests to the other instances fail. Here's the conflict between consistency and availability.
Sounds all good to me.
Before this patch, the router didn't take the state of the storage's box.info.replication into account when routing requests to it. From now on, the router automatically lowers the priority of a replica when it supposes that the connection from the master to the replica is dead (bad status, or idle > 30 sec) or too slow (lag > 30 sec). We also change REPLICA_NOACTIVITY_TIMEOUT from 5 minutes to 30 seconds. This is needed to speed up how quickly a replica notices the master's change. Before the patch, a non-master never knew where the master currently was. Now, since we try to check the status of the master's upstream, we need to find this master in service_info via conn_manager. Since after that the replica doesn't make any requests to the master, the connection is collected by conn_manager in collect_idle_conns after 30 seconds. Then the router's failover calls service_info one more time and the non-master locates the master, which may have already changed. This patch increases the consistency of read requests and decreases the probability of reading stale data. Closes tarantool#453 Closes tarantool#487 NO_DOC=<bugfix>
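The temporary priority lowering described in the commit message might be sketched as follows. All names here (`is_healthy`, `effective_priority`, the demotion offset) are illustrative assumptions; vshard's real internals differ.

```lua
-- Hypothetical sketch: an unhealthy replica is temporarily pushed to the
-- end of the candidate list, and regains its configured priority once it
-- reports itself healthy again. The offset 1000 is arbitrary.
local function effective_priority(replica)
    if replica.is_healthy then
        return replica.priority
    end
    return replica.priority + 1000
end

-- Order candidates so that healthy, high-priority replicas come first
-- (lower number = higher priority, as an assumption of this sketch).
local function sort_candidates(replicas)
    table.sort(replicas, function(a, b)
        return effective_priority(a) < effective_priority(b)
    end)
    return replicas
end
```

Note that requests can still fall through to a demoted replica if every healthier candidate fails, matching the availability-over-consistency choice discussed above.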
Currently, any read call routed to a replica will be executed regardless of whether the replica is connected to any other instance. In my case:

- (A) is the master (`box.info.ro == false`), the others, (B) and (C), are replicas (`box.info.ro == true`);
- on the replicas, `box.info.replication[n].upstream.status` is `disconnected`.

(N.B.: don't use DML in triggers on ro instances if the row is received via the applier.) We could prevent such dirty reads using the available data, as stated in our docs.

It would be great if we could prevent those dirty reads by not routing any request to replicas (ro instances) which are disconnected from everyone, or maybe even give the ability to configure some `max_replication_idle` or `max_replication_lag` and return errors when trying to process a request on a replica whose current `max(replication[n].upstream.idle/lag)` exceeds the configured value.
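The proposed options could work roughly like this. The option names `max_replication_idle` and `max_replication_lag` come from the issue text; the `check_upstreams` function itself is a hypothetical sketch, not an existing API.

```lua
-- Hypothetical sketch: refuse to serve a read if any upstream exceeds the
-- configured idle/lag limits. The table shape mirrors box.info.replication.
local cfg = {
    max_replication_idle = 30,
    max_replication_lag = 30,
}

local function check_upstreams(replication, cfg)
    for _, r in pairs(replication) do
        local u = r.upstream
        if u ~= nil then
            if (u.idle or 0) > cfg.max_replication_idle then
                return nil, 'replication idle exceeds the configured limit'
            end
            if (u.lag or 0) > cfg.max_replication_lag then
                return nil, 'replication lag exceeds the configured limit'
            end
        end
    end
    return true
end
```

A read handler on the replica would then call this check first and return the error to the router instead of serving possibly stale data.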