
Track replication upstream.idle when calling on replica #453

Closed
grafin opened this issue Dec 6, 2023 · 9 comments
@grafin
Member

grafin commented Dec 6, 2023

Currently any read call routed to a replica is executed regardless of whether the replica is connected to any other instance. In my case:

  1. I have a shard with 3 instances: one (A) being the master (box.info.ro == false), the others (B) and (C) being replicas (box.info.ro == true).
  2. For a reason unrelated to vshard, all appliers on replica (B) broke down and could not recover without a manual instance restart. All box.info.replication[n].upstream.status values are 'disconnected'. (N.B.: don't use DML in triggers on ro instances if the row is received via the applier.)
  3. All reads routed to replica (B) returned very old data (instance (B) had been broken for several hours).

We could prevent such dirty reads using already available data, as stated in our docs:

Therefore, in a healthy replication setup, idle should never exceed replication_timeout: if it does, either the replication is lagging seriously behind, because the master is running ahead of the replica, or the network link between the instances is down.

It would be great if we could prevent those dirty reads by not routing any requests to replicas (ro instances) which are disconnected from everyone, or maybe even give the ability to configure some max_replication_idle or max_replication_lag and return errors when trying to process a request on a replica whose current max(replication[n].upstream.idle/lag) exceeds the configured value.
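A minimal sketch of the proposed check, assuming hypothetical max_replication_idle / max_replication_lag values; the function name and thresholds are made up and this is not vshard API, only the box.info fields are real:

    -- Hypothetical freshness check on a storage; not vshard code.
    local function replica_is_fresh(max_idle, max_lag)
        if not box.info.ro then
            return true -- the master serves its own data
        end
        for _, r in pairs(box.info.replication) do
            local u = r.upstream
            if u ~= nil then
                if u.status ~= 'follow' then
                    return false, 'upstream to ' .. r.uuid .. ' is ' .. tostring(u.status)
                end
                if u.idle > max_idle or (u.lag or 0) > max_lag then
                    return false, 'upstream to ' .. r.uuid .. ' exceeds the idle/lag limit'
                end
            end
        end
        return true
    end

    -- Usage before serving a read on a replica (thresholds are examples):
    -- local ok, err = replica_is_fresh(30, 30)
    -- if not ok then error(err) end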

@grafin grafin added the feature A new functionality label Dec 6, 2023
@Totktonada
Member

Can't we lean just on box.info.status?

NB: My old findings regarding replication monitoring: tarantool/doc#2604.

@sergepetrenko
Contributor

Can't we lean just on box.info.status?

I assume you meant upstream.status?
Still, no. upstream.status may be 'follow', but the replica may still have a huge lag to this upstream, for example when the master performs a batch of changes which the replica can't apply as fast as the master did.

@grafin
Member Author

grafin commented Dec 8, 2023

While fixing the initial problem, I got the following upstream info in my case:

id: 1
    uuid: bbbbbbbb-bbbb-4000-b000-000000000001
    lsn: 60719
    upstream:
      peer: admin@localhost:48809
      lag: 0.0002281665802002
      status: stopped
      idle: 169.65149075794
      message: Can't modify data on a read-only instance - box.cfg.read_only is true

So monitoring upstream.lag doesn't solve the issue, and as @sergepetrenko mentioned, upstream.status is not very effective either. It looks like it is best to check upstream.idle.
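For monitoring, the relevant number is the worst-case idle across all upstreams; in the 'stopped' state above the lag stays tiny while idle keeps growing. A sketch, not vshard code:

    -- Largest upstream.idle on this instance; grows while replication is broken.
    local function max_upstream_idle()
        local worst = 0
        for _, r in pairs(box.info.replication) do
            if r.upstream ~= nil and r.upstream.idle > worst then
                worst = r.upstream.idle
            end
        end
        return worst
    end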

@grafin grafin changed the title from "Track replication status/lag when calling on replica" to "Track replication upstream.idle when calling on replica" Dec 8, 2023
@Totktonada
Member

Totktonada commented Dec 8, 2023

Can't we lean just on box.info.status?

I assume you meant upstream.status?

No, but I got the answer :)

In my imagination, tarantool should have some max_data_lag option to go into a 'stale data' state in such cases and broadcast the new state to clients.

For now, we are going to solve the same monitoring task in each client at the application level.

@sergepetrenko
Contributor

sergepetrenko commented Dec 11, 2023

Looks like it is best to check upstream.idle.

It should be a combination of both lag and idle, whichever is greater. Or, as @Totktonada suggests, we may look at box.info.status == 'orphan', but then we have to make replicas go orphan if they see they stopped catching up with the master (this is a breaking change).

Or it can be box.info.status == 'stale_data'.
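The "whichever is greater" idea can be expressed as follows (illustrative only; the threshold value is an example):

    -- Consider an upstream stale if either metric exceeds the limit.
    local function upstream_delay(u)
        return math.max(u.lag or 0, u.idle or 0)
    end
    -- e.g. upstream_delay(box.info.replication[1].upstream) > 30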

@Serpentian
Contributor

After discussing with @sergepetrenko, it looks like we should go with the following condition for an unhealthy replica: if the status of the upstream to the master is not 'follow', or upstream.lag is > N (where N will be a constant for now), then we consider the replica unhealthy. If somebody asks, we may introduce an additional callback to mark a replica as unhealthy, e.g. not to route requests to a replica if the cartridge config has not been properly applied.

I see this feature as an automatic (temporary) lowering of the replica's priority on the router if the replica is not healthy. We'll raise the priority back once the node considers itself healthy again. No additional cfg parameters for now. @Gerold103, your opinion on this?
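A sketch of that condition as it could be evaluated on a storage; the master_id lookup and the constant N are assumptions, not the actual vshard implementation:

    local N = 30 -- seconds, a constant for now
    -- Unhealthy if the upstream to the master is not following or lags too much.
    local function replica_is_unhealthy(master_id)
        local r = box.info.replication[master_id]
        local u = r and r.upstream
        if u == nil then
            return true -- no upstream to the master at all
        end
        return u.status ~= 'follow' or (u.lag or 0) > N
    end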

@R-omk
Contributor

R-omk commented Sep 20, 2024

What is "priority"? If it's something like the probability of sending a request to this host, then I would prefer that probability to be zero if we agreed that too-laggy replicas are not suitable for processing requests at all. Otherwise we will receive "flashing" data depending on which replica the request lands on.

@Serpentian
Contributor

Serpentian commented Sep 20, 2024

what is "priority"?

First we try to make a request to the most prioritized replica and then to the other ones, if the most prioritized one failed. If you don't want a replica to serve any requests at all, you can just disable the storage. But most users would prefer the request not to fail, so we go to the unhealthy replica if requests to the other instances fail.

Here is the conflict between consistency and availability.

@Gerold103
Collaborator

Sounds all good to me.

@Serpentian Serpentian self-assigned this Sep 24, 2024
@sergepetrenko sergepetrenko added bug Something isn't working and removed feature A new functionality labels Dec 3, 2024
Serpentian added a commit to Serpentian/vshard that referenced this issue Dec 3, 2024
Before this patch the router didn't take the state of the storage's
box.info.replication into account when routing requests to it.
From now on the router automatically lowers the priority of a replica
when it supposes that the connection from the master to the replica
is dead (bad upstream status or idle > 30 sec) or too slow (lag > 30 sec).

We also change REPLICA_NOACTIVITY_TIMEOUT from 5 minutes to 30 seconds.
This is needed to speed up how quickly a replica notices the master's
change. Before the patch a non-master never knew where the master
currently is. Now, since we try to check the status of the master's
upstream, we need to find this master in service_info via conn_manager.
Since the replica doesn't make any further requests to the master, the
connection is collected by conn_manager in collect_idle_conns after
30 seconds. Then the router's failover calls service_info one more time
and the non-master locates the master, which may have already changed.

This patch increases the consistency of read requests and decreases
the probability of reading stale data.

Closes tarantool#453
Closes tarantool#487

NO_DOC=<bugfix>
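The router-side effect can be pictured roughly as the following ordering; this is purely illustrative and not vshard's actual failover code:

    -- Pick the replica for a read: healthy replicas first, then by configured
    -- priority (a lower number means a higher priority).
    local function pick_replica(replicas)
        table.sort(replicas, function(a, b)
            if a.is_healthy ~= b.is_healthy then
                return a.is_healthy
            end
            return a.priority < b.priority
        end)
        return replicas[1]
    end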
Serpentian added a commit to Serpentian/vshard that referenced this issue Dec 3, 2024
Gerold103 pushed a commit that referenced this issue Dec 3, 2024