-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JAVA-3145: Implement some notion of retry logic around call to metadata service for Astra clients #1902
base: 4.x
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good so far!
connection.setSSLSocketFactory(sslContext.getSocketFactory()); | ||
connection.setRequestMethod("GET"); | ||
connection.setRequestProperty("host", "localhost"); | ||
attempt++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a problem logically but I'd argue it's a bit cleaner to increment the number of attempts at the very top of this loop. You can probably do it even before going into the try block honestly.
// if this is the last attempt, throw | ||
// else if the response code is not 200, retry | ||
// else, throw | ||
if (attempt != METADATA_RETRY_MAX_ATTEMPTS && connection.getResponseCode() != 200) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be attempt <= METADATA_RETRY_MAX_ATTEMPTS
. You're incrementing "attempt" at the very top of this block so it will always point to the current attempt (i.e. it'll be one on the first attempt, two on the second, etc.). You want to check here that the current attempt is less than or equal to the max number of attempts allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want it to throw the error at the last attempt. Otherwise, a FileNotFoundException
will become a DriverTimeoutException
(the dead code at the end).
Thread.sleep(METADATA_RETRY_INITIAL_DELAY_MS); | ||
continue; | ||
} catch (InterruptedException interruptedException) { | ||
Thread.currentThread().interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need to do an explicit interrupt of the current thread here; that's already happened (in order to get the InterruptedException to you). :)
// if this is the last attempt, throw | ||
// else if the response code is not 200, retry | ||
// else, throw | ||
if (attempt != METADATA_RETRY_MAX_ATTEMPTS && connection.getResponseCode() != 200) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to make this a bit more granular. For example: 300 errors (3xx) indicate that a resource has been moved (or that we're being redirected someplace else). We should probably follow those if URLConnection doesn't do so already.
4xx errors are all client errors; my guess is that in most cases we won't want to retry these, although we have seen at least one case with a 421 (misdirected request) response which could benefit from a retry. JAVA-3033 also makes reference to a case in which 401 errors can be returned apparently due to a hibernated Astra instance.
5xx errors indicate a failure server-side; these should probably be retried in case there's a transient server error of some kind.
It might be worth changing this to an if test only here and then having a set of if tests underneath that to check for the various error codes; that way you can handle each case independently without making the code too complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3xx should be redirected as long as the redirection is in the same protocol. How about if (connection.getResponseCode() == 401 || connection.getResponseCode() == 421 || connection.getResponseCode() >= 500)
, retry?
"a set of if tests underneath that" do you mean extracting the conditions into a function, like bool shouldRetryGivenStatusCode(int)
?
// if this is the last attempt, throw | ||
// else if the response code is not 200, retry | ||
// else, throw | ||
if (attempt != METADATA_RETRY_MAX_ATTEMPTS && connection.getResponseCode() != 200) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HttpURLConnection.HTTP_OK is probably preferred here to the literal 200
} | ||
} | ||
return new BufferedReader( | ||
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We know that connection.getInputStream() can throw an IOException under various cases (including some non-200 response codes). Presumably if we add checks above to make sure we only hit this code if we have a 200 response code we can avoid that issue... but any such code is probably a bit more brittle than we'd like. While we're here it may be best to just wrap this line of code in it's own try/catch and either (a) return an IllegalStateException if we get an IOException (like we do in the other catch blocks here) or (b) fold any such error into our retry logic.
Note that this issue is really what JAVA-3033 is about.
@@ -58,6 +59,8 @@ | |||
@ThreadSafe | |||
public class CloudConfigFactory { | |||
private static final Logger LOG = LoggerFactory.getLogger(CloudConfigFactory.class); | |||
private static final int METADATA_RETRY_MAX_ATTEMPTS = 4; | |||
private static final int METADATA_RETRY_INITIAL_DELAY_MS = 250; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: per an earlier conversation @SiyaoIsHiding and I decided to leave these as constants for now. We can move them to configurable values if an explicit need prevents itself (or perhaps as a future enhancement).
Regarding your alternate proposed implementation above @SiyaoIsHiding; I think we probably should prefer the version actually contained in this PR. We know some non-200 response codes can trigger an IOException but I don't think we have that guaranteed across the board. The code in your existing impl explicitly checks the response code for everybody rather than relying on an IOException to hit that logic; I'm guessing that prolly what we want to go with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @SiyaoIsHiding , this is coming along nicely!
} | ||
|
||
// last attempt, still 421 or 401 | ||
if (connection.getResponseCode() == 421 || connection.getResponseCode() == 401) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want to throw an exception of some kind in this case as long as the response code isn't a 200? I'm trying to imagine a situation in which we would want to proceed with trying to get an InputStream from the connection if we got a 3xx, 4xx or 5xx response.
There are other 2xx responses besides 200 (the spec has some more details) but none of those seem to apply in our case.
// if this is the last attempt, throw | ||
// else if the response code is 401, 421, or 5xx, retry | ||
// else, throw | ||
if (attempt < METADATA_RETRY_MAX_ATTEMPTS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'd be better off handling the attempt >= METADATA_RETRY_MAX_ATTEMPTS first. The benefit with that approach is that you're guaranteed that everything afterwards has not yet hit the max number of attempts (and is therefore subject to retry):
int rc = connection.getResponseCode();
if (attempt >= METADATA_RETRY_MAX_ATTEMPTS) {
if (rc == HttpURLConnection.HTTP_OK) {
try {
return new BufferedReader(
new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
} catch (ConnectException e) {
throw new IllegalStateException(
"Unable to connect to cloud metadata service. Please make sure your cluster is not parked or terminated",
e);
} catch (UnknownHostException e) {
throw new IllegalStateException(
"Unable to resolve host for cloud metadata service. Please make sure your cluster is not terminated",
e);
}
}
else {
// TODO: Might also want to add some logging here (or maybe above) about what the observed return code was
throw new IllegalStateException(
"Unable to fetch metadata from cloud metadata service. Please make sure your cluster is not parked or terminated");
}
}
}
assert attempt < METADATA_RETRY_MAX_ATTEMPTS;
if (attempt < METADATA_RETRY_MAX_ATTEMPTS | ||
&& (connection.getResponseCode() == 401 | ||
|| connection.getResponseCode() == 421 | ||
|| connection.getResponseCode() >= 500)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here (and in the other spots throughout this code) you'd be better off using the constants defined by HttpURLConnection such as HTTP_OK for 200 (and so on). It makes the code a bit more descriptive about what's going on.
if (attempt < METADATA_RETRY_MAX_ATTEMPTS | ||
&& (connection.getResponseCode() == 401 | ||
|| connection.getResponseCode() == 421 | ||
|| connection.getResponseCode() >= 500)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic should also probably be pulled out into a simple helper method, something like this:
private boolean shouldRetry(int rc) {
return connection.getResponseCode() == 401
|| connection.getResponseCode() == 421
|| connection.getResponseCode() >= 500;
}
if (attempt < METADATA_RETRY_MAX_ATTEMPTS | ||
&& (connection.getResponseCode() == 401 | ||
|| connection.getResponseCode() == 421 | ||
|| connection.getResponseCode() >= 500)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not immediately clear to me that this is the extent of response codes we want to handle here.
I agree with retrying on any 5xx error. It also probably does make sense to retry on a 401 and 421 given what we've seen from this service. But there are a whole set of 4xx response codes all of which indicate that the client did something wrong (from the server's perspective). We could see any of these codes coming off of the metadata service... what do we want to do in that case? We may need to spend some more time looking at this list and thinking about that.
There are also a lot of 3xx errors but most of those involve the server telling the client that the thing it's asking for is somewhere else. In that case retrying that specific request probably doesn't make sense... but I would expect a fully-formed HTTP client to send another (slightly different) request based on the feedback it got from the server. We're not building a fully-formed HTTP client, though, so I think I'm okay with not retrying 3xx errors for now.
There is part of me wondering if we shouldn't put a real HTTP client in here for this...
I implemented the retry mechanism in two ways. Apart from the current one, the following also works:
We could discuss those design decisions in our next meeting :)