Implement retries for availability faults #261
Comments
I'll try to explain the current situation here and come up with conclusions on what has to be done. From the Availability Faults, we said we want to have a single retry for Unavailable. Stargate already retries all the UNAVAILABLE gRPC status codes.
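For reference, here is a minimal sketch, assuming plain grpc-java, of what a single retry on UNAVAILABLE could look like; the service name, attempt count and backoff values below are illustrative assumptions, not the actual Stargate client defaults.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import java.util.List;
import java.util.Map;

public final class BridgeChannelSketch {

  // Illustrative gRPC service config: at most one retry, only on UNAVAILABLE.
  // Service name and backoff values are assumptions, not Stargate defaults.
  private static Map<String, Object> singleRetryOnUnavailable() {
    Map<String, Object> retryPolicy = Map.of(
        "maxAttempts", 2.0,                      // original attempt + one retry
        "initialBackoff", "0.1s",
        "maxBackoff", "1s",
        "backoffMultiplier", 2.0,
        "retryableStatusCodes", List.of("UNAVAILABLE"));
    Map<String, Object> methodConfig = Map.of(
        "name", List.of(Map.of("service", "stargate.StargateBridge")), // assumed service name
        "retryPolicy", retryPolicy);
    return Map.of("methodConfig", List.of(methodConfig));
  }

  public static ManagedChannel build(String host, int port) {
    return ManagedChannelBuilder.forAddress(host, port)
        .defaultServiceConfig(singleRetryOnUnavailable())
        .enableRetry() // grpc-java retries are disabled unless explicitly enabled
        .usePlaintext()
        .build();
  }
}
```

On the call side, a per-call deadline (e.g. `withDeadlineAfter(...)` on the stub) is what produces the client-side timeout discussed below.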
Can you @amorton confirm that retrying these cases once, using the existing mechanics we have in the Stargate gRPC client, is OK?

Timeouts

When timeouts occur, we receive gRPC status code

Client side

The client timeout is set by default to
Server side

The gRPC Bridge defines a default retry strategy in DefaultRetryPolicy.java. Here, by default, read and write timeouts are already retried once in the coordinator, if certain conditions are met.
Thus, from the client's point of view, some timeouts are already retried. But since we don't even use batch logs, maybe we should only retry all the write timeouts? However, this again opens the question of client timeouts and how we should handle them. Since a single CQL query can time out on the server side, the client-side timeout needs to be long enough for us to receive the server-side timeout. Furthermore, in some scenarios we have a proxy between the client and the bridge. Depending on what we agree on, we must ensure we do not receive the timeout from the proxy either, as the proxy responds with HTML and we must avoid that. I would like the whole team to contribute to this, as this issue is a concern in the OSS as well. Pinging @jeffreyscarpenter @tatu-at-datastax @kathirsvn @maheshrajamani for discussion.
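To make the "retried once, if certain conditions are met" part more concrete, below is a minimal sketch written against the DataStax Java driver 4.x RetryPolicy interface. It reflects the classic default behaviour (retry a read timeout once when enough replicas responded but the data replica didn't, retry a write timeout once only for batch-log writes); it is an illustration, not the actual contents of the bridge's DefaultRetryPolicy.java.

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.retry.RetryDecision;
import com.datastax.oss.driver.api.core.retry.RetryPolicy;
import com.datastax.oss.driver.api.core.servererrors.CoordinatorException;
import com.datastax.oss.driver.api.core.servererrors.WriteType;
import com.datastax.oss.driver.api.core.session.Request;

public class OnceOnlyTimeoutRetryPolicy implements RetryPolicy {

  @Override
  public RetryDecision onReadTimeout(
      Request request, ConsistencyLevel cl, int blockFor, int received, boolean dataPresent, int retryCount) {
    // Retry once when enough replicas acknowledged but the data replica did not answer.
    return (retryCount == 0 && received >= blockFor && !dataPresent)
        ? RetryDecision.RETRY_SAME
        : RetryDecision.RETHROW;
  }

  @Override
  public RetryDecision onWriteTimeout(
      Request request, ConsistencyLevel cl, WriteType writeType, int blockFor, int received, int retryCount) {
    // The classic default only retries batch-log writes; the discussion above asks
    // whether all write timeouts should instead be retried once.
    return (retryCount == 0 && writeType == WriteType.BATCH_LOG)
        ? RetryDecision.RETRY_SAME
        : RetryDecision.RETHROW;
  }

  @Override
  public RetryDecision onUnavailable(
      Request request, ConsistencyLevel cl, int required, int alive, int retryCount) {
    // Try the next node once; it may not see the same replicas as unavailable.
    return retryCount == 0 ? RetryDecision.RETRY_NEXT : RetryDecision.RETHROW;
  }

  @Override
  public RetryDecision onRequestAborted(Request request, Throwable error, int retryCount) {
    return RetryDecision.RETHROW;
  }

  @Override
  public RetryDecision onErrorResponse(Request request, CoordinatorException error, int retryCount) {
    return RetryDecision.RETHROW;
  }

  @Override
  public void close() {}
}
```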
Assuming the "Stargate already retries all the UNAVAILABLE gRPC status codes " is the client side configuration for retries, this makes sense for the actual Cassandra server raised
For client timeouts I am assuming we are talking about TCP socket timeouts. We should have client timeouts, and they should be large enough so the C* timeouts fail first. We need the client timeouts to handle cases where the C* coordinator / node fails to take the frame off the TCP stack and start working on it. Client-side timeouts should be considered internal application errors: no retry, just fail. Yes, we don't know the state of the request; if we fail to time out our client connection we will end up hanging the client calling us. If the client calling us does not have a socket timeout then we could, in theory, have a lot of hung requests that consume resources and result in denial-of-service type errors.
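A small sketch of that fail-fast behaviour, assuming a grpc-java client: the per-call deadline is set longer than the coordinator timeouts, and a client-side DEADLINE_EXCEEDED is surfaced as an internal failure instead of being retried. The helper and the wrapping exception are hypothetical, for illustration only.

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.function.Supplier;

public final class ClientTimeoutHandlingSketch {

  // The stub for the actual call would be created elsewhere with something like
  //   stub.withDeadlineAfter(clientTimeoutMillis, TimeUnit.MILLISECONDS)
  // where clientTimeoutMillis is larger than the coordinator read/write timeouts,
  // so server-side timeouts always fail first.
  public static <T> T callWithoutClientRetry(Supplier<T> bridgeCall) {
    try {
      return bridgeCall.get();
    } catch (StatusRuntimeException e) {
      if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
        // Client-side timeout: the request state on the server is unknown, so fail fast,
        // do not retry, and release the caller instead of hanging it.
        throw new IllegalStateException("Bridge call exceeded the client-side deadline", e);
      }
      throw e;
    }
  }
}
```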
Not a big fan of this; it is behaviour that runs counter to the way C* works and IMHO should not be in the coordinator tier. We now have retry logic in two places (client and coordinator), and the policy for the coordinator is (I assume) applied to all requests, so it cannot be tailored. Can it be removed from the coordinator? This logic should live in the client.
Unsure on the ask; we want to retry write and read timeouts once each.
In general:
➤ Ivan Senic commented: Needed impl in the OSS done here: stargate/stargate#2517 (https://github.com/stargate/stargate/pull/2517)
➤ Ivan Senic commented: jsonapi PR: #309 (https://github.com/stargate/jsonapi/pull/309)
The spec defines the Availability Faults. We need to add an implementation for this.
Currently, Stargate V2 allows retries based on gRPC status codes; we need to see if this is enough. It could be that we need to extend the bridge so that we can recognize client-side and server-side timeouts. Alternatively, client-side timeouts can be disabled for now.