Implement retries for availability faults #261

Closed
ivansenic opened this issue Mar 14, 2023 · 4 comments · Fixed by #309

@ivansenic
Contributor

ivansenic commented Mar 14, 2023

The spec defines the Availability Faults. We need to add an implementation for this.

Currently, Stargate V2 allows retries based on gRPC status codes; we need to see if this is enough. It could be that we need to extend the bridge so that we can distinguish client-side and server-side timeouts. Alternatively, client-side timeouts could be disabled for now.

@ivansenic ivansenic self-assigned this Mar 14, 2023
@ivansenic
Contributor Author

ivansenic commented Mar 16, 2023

I'll try to explain the current situation here and draw conclusions on what has to be done.

Per the Availability Faults, we said we want a single retry for the Unavailable, WriteTimeout, and ReadTimeout exceptions that we receive.

Unavailable

Stargate already retries all the UNAVAILABLE gRPC status codes on the client side (see the gRPC CONFIGURATION.md). We receive UNAVAILABLE as the gRPC status in the following cases (a minimal retry sketch follows below):

  • UnavailableException (includes explicit unavailable trailer)
  • UnhandledClientException
  • in case the C* error code is IS_BOOTSTRAPPING (not sure exactly which exception carries this)
  • in case the bridge endpoint is not reachable

Can you @amorton confirm that retrying these cases once, using the existing mechanics we have in the Stargate gRPC client, is OK?
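For illustration only, here's a minimal sketch of what a single client-side retry on UNAVAILABLE could look like with plain grpc-java; the wrapper class and the callBridge supplier are hypothetical and do not represent the actual Stargate gRPC client mechanics:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.function.Supplier;

// Sketch only: retry a bridge call exactly once when the gRPC status is UNAVAILABLE.
final class SingleRetryOnUnavailable {

    static <T> T execute(Supplier<T> callBridge) {
        try {
            return callBridge.get();
        } catch (StatusRuntimeException e) {
            // Only the availability fault is retried, and only once;
            // any other status propagates to the caller unchanged.
            if (e.getStatus().getCode() == Status.Code.UNAVAILABLE) {
                return callBridge.get();
            }
            throw e;
        }
    }
}
```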

Timeouts

When timeouts occur, we receive the gRPC status code DEADLINE_EXCEEDED. We need to distinguish between client and server timeouts: they carry the same status code, but server-side timeouts will have explicit trailers.
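As a sketch of that distinction (the trailer key name below is a placeholder, not the bridge's actual trailer), a client could classify a DEADLINE_EXCEEDED roughly like this:

```java
import io.grpc.Metadata;
import io.grpc.StatusRuntimeException;

// Sketch only: "server-timeout" is a made-up trailer name standing in for
// whatever the bridge actually attaches to server-side timeout responses.
final class TimeoutClassifier {

    private static final Metadata.Key<String> SERVER_TIMEOUT_TRAILER =
            Metadata.Key.of("server-timeout", Metadata.ASCII_STRING_MARSHALLER);

    // Call this only for exceptions whose status code is DEADLINE_EXCEEDED.
    static boolean isServerSideTimeout(StatusRuntimeException e) {
        Metadata trailers = e.getTrailers();
        // No explicit trailer means the deadline fired on the client before
        // any server response arrived, i.e. a client-side timeout.
        return trailers != null && trailers.containsKey(SERVER_TIMEOUT_TRAILER);
    }
}
```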

Client side

The client timeout defaults to 30 sec per the gRPC CONFIGURATION.md. I think we can agree client timeouts should never be retried. However, the question is whether we want client timeouts at all. Receiving a client timeout means it's unclear how the operation finished on the server; it could have been successful. Here I see a few options:

  • disable client timeouts
  • clearly distinguish between client timeouts and server timeouts and report to the user in the error message
Server side

The gRPC Bridge defines a default retry strategy in DefaultRetryPolicy.java. By default, read and write timeouts are already retried once in the coordinator if certain conditions are met (see the sketch after this list):

  • read - based on the ReadTimeoutException rte state: rte.received >= rte.blockFor && !rte.dataPresent
  • writes - only for write type BATCH_LOG
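Expressed as code, the two conditions amount to the following (a paraphrase of the list above, not the actual DefaultRetryPolicy.java source):

```java
// Paraphrase of the coordinator-side retry conditions listed above.
final class CoordinatorRetryConditions {

    // Read timeout: enough replicas acknowledged, but the data read never arrived.
    static boolean retryReadTimeout(int received, int blockFor, boolean dataPresent) {
        return received >= blockFor && !dataPresent;
    }

    // Write timeout: only the batch-log write type is retried by default.
    static boolean retryWriteTimeout(String writeType) {
        return "BATCH_LOG".equals(writeType);
    }
}
```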

Thus, from the client's point of view, some timeouts are already retried. But since we don't even do batch logs, maybe we should simply retry all write timeouts?

However, this again opens the question of client timeouts and how we should handle them. Since a single CQL execution can end up with a timeout on the server side, the client-side timeout must be long enough for us to actually receive that server-side timeout.

Furthermore, in some scenarios we have a proxy between the client and the bridge. Depending on what we agree on, we must ensure we don't hit the proxy's timeout as well: the proxy responds with an HTML error page, which we must avoid.

I would like the whole team to contribute here, as this issue is a concern in OSS as well. Pinging @jeffreyscarpenter @tatu-at-datastax @kathirsvn @maheshrajamani for discussion.

@amorton
Contributor

amorton commented Mar 28, 2023

Can you @amorton confirm that retrying these cases once, using the existing mechanics we have in the Stargate gRPC client, is OK?

Assuming the "Stargate already retries all the UNAVAILABLE gRPC status codes " is the client side configuration for retries, this makes sense for the actual Cassandra server raised UnavailableException. There others look like application errors from the bridge or gRPC .

UnhandledClientException
I could not find this in the C* code base; when is this raised?

in case the C* error code is IS_BOOTSTRAPPING (not sure exactly which exception carries this)
Do you have a link to where this is raised from? Bootstrapping is something that happens when a new node joins the cluster and streams data from other nodes.

However, the question is whether we want client timeouts at all. Receiving a client timeout means it's unclear how the operation finished on the server; it could have been successful. Here I see a few options:

For client timeouts I am assuming we are talking about TCP socket timeouts. We should have client timeouts, and they should be large enough that the C* timeouts fail first. We need the client timeouts to handle cases where the C* coordinator / node fails to take the frame off the TCP stack and start working on it.

Client-side timeouts should be considered internal application errors: no retry, just fail. Yes, we don't know the state of the request, but if we fail to time out our client connection we will end up hanging the client calling us. If the client calling us does not have a socket timeout, then we could, in theory, have a lot of hung requests that consume resources and result in denial-of-service-type errors.

The gRPC Bridge defines a default retry strategy in DefaultRetryPolicy.java. By default, read and write timeouts are already retried once in the coordinator if certain conditions are met:

Not a big fan of this; it is behaviour that runs counter to the way C* works and IMHO should not be in the coordinator tier. We now have retry logic in two places (client and coordinator), and the policy for the coordinator is (I assume) applied to all requests, so it cannot be tailored. Can it be removed from the coordinator? This logic should live in the client.

Thus, from the client's point of view, some timeouts are already retried. But since we don't even do batch logs, maybe we should simply retry all write timeouts?

Unsure on the ask; we want to retry write and read timeouts once each.

Furthermore, in some scenarios we have a proxy between the client and the bridge.
Is this something we have in Astra or for OSS deployments?

In general:

  • we need socket (client) timeouts as a safety valve; they should be set to 2X what we know or guess the C* server timeouts to be (see the sketch after this list).
  • retry logic should be in the client (not the coordinator), as that is where it is expected, and we can change it there with the least impact on other services.
  • anything that is not Unavailable or a (read, write, or CAS) timeout can be considered an application error, and we can fail without retry.
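One way to express the first point as code (a sketch using the generic grpc-java stub API; the helper below is not an existing project class):

```java
import io.grpc.stub.AbstractStub;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Sketch only: derive the per-call client deadline from the known (or guessed)
// C* server-side timeout, so server timeouts surface before the client one fires.
final class ClientDeadlines {

    static <S extends AbstractStub<S>> S withSafetyDeadline(S stub, Duration serverTimeout) {
        long deadlineMs = serverTimeout.multipliedBy(2).toMillis();
        return stub.withDeadlineAfter(deadlineMs, TimeUnit.MILLISECONDS);
    }
}
```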

@sync-by-unito

sync-by-unito bot commented Mar 29, 2023

➤ Ivan Senic commented:

Needed impl in the OSS done here: stargate/stargate#2517 (https://github.com/stargate/stargate/pull/2517)

@sync-by-unito

sync-by-unito bot commented Mar 29, 2023

➤ Ivan Senic commented:

jsonapi PR: #309 (https://github.com/stargate/jsonapi/pull/309)
