Fix websocket connection retries logic #771
We have observed cases where the websocket connection failed temporarily with a handshake error that was not retried, causing the agent to fail. Also, in CloudBees CI HA, there are some cases where an agent is disconnected on purpose and we expect a reconnect attempt as soon as possible. The default 10-second delay appeared too long in that case. To reconcile this use case with the typical failure scenario, I have implemented exponential backoff for retries (immediate, 1 second, 3 seconds, 7 seconds, 10 seconds).
I'm also considering a bigger refactoring (to be filed in a separate PR) to reduce duplication across the inbound TCP and websocket connection flows and make them more similar.
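The delay sequence listed above can be reproduced with a small sketch. It assumes the rule `next = min(current * factor + increment, maxDelay)` with an initial delay of 0, a factor of 2, an increment of 1 second and a cap of 10 seconds. That reading is an interpretation of the constructor arguments shown in the diff, not something the PR states; `BackoffSketch`, `nextDelay` and `delays` are hypothetical names used only for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class BackoffSketch {
    // Hypothetical helper: grows the delay geometrically, adds a fixed increment,
    // and caps the result at maxDelay.
    static Duration nextDelay(Duration current, int factor, Duration increment, Duration maxDelay) {
        Duration next = current.multipliedBy(factor).plus(increment);
        return next.compareTo(maxDelay) > 0 ? maxDelay : next;
    }

    // Returns the delay applied before each of the first `attempts` connection attempts.
    // The constants are assumptions chosen to match the sequence in the PR description.
    static List<Duration> delays(int attempts) {
        List<Duration> result = new ArrayList<>();
        Duration delay = Duration.ZERO; // first attempt is immediate
        for (int i = 0; i < attempts; i++) {
            result.add(delay);
            delay = nextDelay(delay, 2, Duration.ofSeconds(1), Duration.ofSeconds(10));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [PT0S, PT1S, PT3S, PT7S, PT10S, PT10S]
        System.out.println(delays(6));
    }
}
```

Starting at zero keeps the first reconnect immediate, which suits the intentional-disconnect case, while the cap keeps the steady-state delay at the previous 10-second default.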
            return true;
        }
    } catch (Exception x) {
        events.status("Failed to connect: " + x.getMessage());
We are losing the stack trace here; is that potentially important for diagnosis?
In my experience these stack traces are noisy. I could also log them in parallel through the standard JUL logger (at FINE), but I don't really see the point of keeping them in the status output.
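A minimal sketch of the option mentioned above: keep the short message in the status output and send the full stack trace to a `java.util.logging` logger at `FINE`. The `StatusListener` interface and `reportFailure` method are hypothetical stand-ins for the `events.status(...)` call in the diff, not the project's actual types.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ConnectFailureLogging {
    private static final Logger LOGGER = Logger.getLogger(ConnectFailureLogging.class.getName());

    // Hypothetical stand-in for the listener that exposes status(String) in the diff.
    interface StatusListener {
        void status(String message);
    }

    static void reportFailure(StatusListener events, Exception x) {
        // Short, user-facing message without the stack trace.
        events.status("Failed to connect: " + x.getMessage());
        // Full stack trace, only emitted when FINE logging is enabled for this logger.
        LOGGER.log(Level.FINE, "Failed to connect", x);
    }
}
```

With the default JUL configuration, `FINE` records are not published, so the trace stays out of normal output unless someone raises the log level for diagnosis.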
    final Duration maxDelay;

    ExponentialRetry(Duration timeout) {
        this(Duration.ofSeconds(0), timeout, 2, Duration.ofSeconds(1), Duration.ofSeconds(10));
Compare the user-configurable version in #676.
Jitter could be useful, but exposing these as user-level settings seems overkill to me.
Agreed, just noting for reference.