Implement retries for availability faults #261

Closed
ivansenic opened this issue Mar 14, 2023 · 4 comments · Fixed by #309

@ivansenic
Contributor

ivansenic commented Mar 14, 2023

The spec defines the Availability Faults. We need to add an implementation for this.

Currently, Stargate V2 allows retries based on gRPC status codes; we need to see if this is enough. It could be that we need to extend the bridge so that we can distinguish client-side and server-side timeouts. Alternatively, client-side timeouts could be disabled for now.

@ivansenic ivansenic self-assigned this Mar 14, 2023
@ivansenic
Contributor Author

ivansenic commented Mar 16, 2023

I'll try to explain the current situation here and draw conclusions on what has to be done.

Per the Availability Faults, we said we want a single retry for the Unavailable, WriteTimeout, and ReadTimeout exceptions that we receive.

Unavailable

Stargate already retries all the UNAVAILABLE gRPC status codes on the client side (see the gRPC CONFIGURATION.md). We receive UNAVAILABLE as the gRPC status in the following cases (a minimal retry sketch follows below):

  • UnavailableException (includes explicit unavailable trailer)
  • UnhandledClientException
  • in case the C* error code is IS_BOOTSTRAPPING (not sure exactly which exception carries this)
  • in case the bridge endpoint is not reachable

Can you @amorton confirm that retrying these cases once, using the existing mechanics we have in the Stargate gRPC client, is OK?
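For illustration only, here's a minimal sketch of what a single client-side retry on UNAVAILABLE could look like with plain grpc-java; the wrapper class and the callBridge supplier are hypothetical and do not represent the actual Stargate gRPC client mechanics:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.function.Supplier;

// Sketch only: retry a bridge call exactly once when the gRPC status is UNAVAILABLE.
final class SingleRetryOnUnavailable {

    static <T> T execute(Supplier<T> callBridge) {
        try {
            return callBridge.get();
        } catch (StatusRuntimeException e) {
            // Only the availability fault is retried, and only once;
            // any other status propagates to the caller unchanged.
            if (e.getStatus().getCode() == Status.Code.UNAVAILABLE) {
                return callBridge.get();
            }
            throw e;
        }
    }
}
```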

Timeouts

When timeouts occur, we receive the gRPC status code DEADLINE_EXCEEDED. We need to distinguish between client and server timeouts: they carry the same status code, but server-side timeouts will have explicit trailers.
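As a sketch of that distinction (the trailer key name below is a placeholder, not the bridge's actual trailer), a client could classify a DEADLINE_EXCEEDED roughly like this:

```java
import io.grpc.Metadata;
import io.grpc.StatusRuntimeException;

// Sketch only: "server-timeout" is a made-up trailer name standing in for
// whatever the bridge actually attaches to server-side timeout responses.
final class TimeoutClassifier {

    private static final Metadata.Key<String> SERVER_TIMEOUT_TRAILER =
            Metadata.Key.of("server-timeout", Metadata.ASCII_STRING_MARSHALLER);

    // Call this only for exceptions whose status code is DEADLINE_EXCEEDED.
    static boolean isServerSideTimeout(StatusRuntimeException e) {
        Metadata trailers = e.getTrailers();
        // No explicit trailer means the deadline fired on the client before
        // any server response arrived, i.e. a client-side timeout.
        return trailers != null && trailers.containsKey(SERVER_TIMEOUT_TRAILER);
    }
}
```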

Client side

The client timeout defaults to 30 sec per the gRPC CONFIGURATION.md. I think we can agree client timeouts should never be retried. However, the question is whether we want client timeouts at all. Receiving a client timeout means it's unclear how the operation finished on the server; it could have been successful. Here I see a few options:

  • disable client timeouts
  • clearly distinguish between client timeouts and server timeouts and report to the user in the error message
Server side

The gRPC Bridge defines a default retry strategy in DefaultRetryPolicy.java. By default, read and write timeouts are already retried once in the coordinator if certain conditions are met (see the sketch after this list):

  • read - based on the ReadTimeoutException rte state: rte.received >= rte.blockFor && !rte.dataPresent
  • writes - only for write type BATCH_LOG
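Expressed as code, the two conditions amount to the following (a paraphrase of the list above, not the actual DefaultRetryPolicy.java source):

```java
// Paraphrase of the coordinator-side retry conditions listed above.
final class CoordinatorRetryConditions {

    // Read timeout: enough replicas acknowledged, but the data read never arrived.
    static boolean retryReadTimeout(int received, int blockFor, boolean dataPresent) {
        return received >= blockFor && !dataPresent;
    }

    // Write timeout: only the batch-log write type is retried by default.
    static boolean retryWriteTimeout(String writeType) {
        return "BATCH_LOG".equals(writeType);
    }
}
```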

Thus, from the client's point of view, some timeouts are already retried. But since we don't even do batch logs, maybe we should simply retry all write timeouts?

However, this again opens the question of client timeouts and how we should handle them. Since a single CQL execution can end up with a timeout on the server side, the client-side timeout must be long enough for us to actually receive that server-side timeout.

Furthermore, in some scenarios we have a proxy between the client and the bridge. Depending on what we agree on, we must ensure we don't hit the proxy's timeout as well: the proxy responds with an HTML error page, which we must avoid.

I would like the whole team to contribute here, as this issue is a concern in OSS as well. Pinging @jeffreyscarpenter @tatu-at-datastax @kathirsvn @maheshrajamani for discussion.

@amorton
Contributor

amorton commented Mar 28, 2023

Can you @amorton confirm that retrying these cases once, using the existing mechanics we have in the Stargate gRPC client, is OK?

Assuming the "Stargate already retries all the UNAVAILABLE gRPC status codes " is the client side configuration for retries, this makes sense for the actual Cassandra server raised UnavailableException. There others look like application errors from the bridge or gRPC .

UnhandledClientException
I could not find this in the C* code base; when is this raised?

in case the C* error code is IS_BOOTSTRAPPING (not sure exactly which exception carries this)
Do you have a link to where this is raised from? Bootstrapping is something that happens when a new node joins the cluster and streams data from other nodes.

However, the question is whether we want client timeouts at all. Receiving a client timeout means it's unclear how the operation finished on the server; it could have been successful. Here I see a few options:

For client timeouts I am assuming we are talking about TCP socket timeouts. We should have client timeouts, and they should be large enough that the C* timeouts fail first. We need the client timeouts to handle cases where the C* coordinator / node fails to take the frame off the TCP stack and start working on it.

Client-side timeouts should be considered internal application errors: no retry, just fail. Yes, we don't know the state of the request, but if we fail to time out our client connection we will end up hanging the client calling us. If the client calling us does not have a socket timeout, then we could, in theory, have a lot of hung requests that consume resources and result in denial-of-service-type errors.

The gRPC Bridge defines a default retry strategy in DefaultRetryPolicy.java. By default, read and write timeouts are already retried once in the coordinator if certain conditions are met:

Not a big fan of this; it is behaviour that runs counter to the way C* works and IMHO should not be in the coordinator tier. We now have retry logic in two places (client and coordinator), and the policy for the coordinator is (I assume) applied to all requests, so it cannot be tailored. Can it be removed from the coordinator? This logic should live in the client.

Thus, from the client's point of view, some timeouts are already retried. But since we don't even do batch logs, maybe we should simply retry all write timeouts?

Unsure on the ask; we want to retry write and read timeouts once each.

Furthermore, in some scenarios we have a proxy between the client and the bridge.
Is this something we have in Astra or for OSS deployments?

In general:

  • we need socket (client) timeouts as a safety valve; they should be set to 2X what we know or guess the C* server timeouts to be (see the sketch after this list).
  • retry logic should be in the client (not the coordinator), as that is where it is expected, and we can change it there with the least impact on other services.
  • anything that is not Unavailable or a (read, write, or CAS) timeout can be considered an application error, and we can fail without retry.
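One way to express the first point as code (a sketch using the generic grpc-java stub API; the helper below is not an existing project class):

```java
import io.grpc.stub.AbstractStub;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Sketch only: derive the per-call client deadline from the known (or guessed)
// C* server-side timeout, so server timeouts surface before the client one fires.
final class ClientDeadlines {

    static <S extends AbstractStub<S>> S withSafetyDeadline(S stub, Duration serverTimeout) {
        long deadlineMs = serverTimeout.multipliedBy(2).toMillis();
        return stub.withDeadlineAfter(deadlineMs, TimeUnit.MILLISECONDS);
    }
}
```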

@sync-by-unito

sync-by-unito bot commented Mar 29, 2023

➤ Ivan Senic commented:

Needed impl in the OSS done here: stargate/stargate#2517 (https://github.com/stargate/stargate/pull/2517)

@sync-by-unito

sync-by-unito bot commented Mar 29, 2023

➤ Ivan Senic commented:

jsonapi PR: #309 (https://github.com/stargate/jsonapi/pull/309)
