Automatically reconnect Kubernetes watcher when closed exceptionally #6023
Conversation
Motivation:

A watcher in `KubernetesEndpointGroup` automatically reconnects when it fails to connect to the remote peer. However, it does not reconnect when a `WatcherException` is raised.

```
io.fabric8.kubernetes.client.WatcherException: too old resource version: 573375490 (573377297)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:401)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:369)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:52)
	at com.linecorp.armeria.client.kubernetes.ArmeriaWebSocket.onNext(ArmeriaWebSocket.java:106)
	at com.linecorp.armeria.client.kubernetes.ArmeriaWebSocket.onNext(ArmeriaWebSocket.java:37)
	at com.linecorp.armeria.common.stream.DefaultStreamMessage.notifySubscriberWithElements(DefaultStreamMessage.java:412)
	...
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 573375490 (573377297)
	... 62 common frames omitted
```

I don't know why `too old resource version` was raised, but watchers should not be stopped until `KubernetesEndpointGroup` is closed.

Modifications:

- Refactor `KubernetesEndpointGroup` to start watchers asynchronously.
- Automatically restart `Watcher`s when `onClose(WatcherException)` is invoked.
- Add more logs that I think might be useful.
  - Also make the log formats consistent.
- Debounce the update of endpoints to prevent `EndpointGroup.whenReady()` from completing when only a small number of endpoints have been received.
  - The purpose is to avoid a small number of endpoints receiving too much traffic when a watcher is newly created.

Result:

`KubernetesEndpointGroup` now automatically reconnects a `Watcher` when a `WatcherException` is raised.
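For context, the fabric8 `Watcher` interface reports abnormal closures through its `onClose(WatcherException)` overload. Below is a minimal sketch of the restart idea described above; `RestartingEndpointsWatcher`, `start()`, and `close()` are hypothetical names for illustration, not the actual `KubernetesEndpointGroup` code:

```java
import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

final class RestartingEndpointsWatcher implements Watcher<Endpoints> {

    private final KubernetesClient client;
    private final String namespace;
    private final String serviceName;
    private volatile boolean closed;

    RestartingEndpointsWatcher(KubernetesClient client, String namespace, String serviceName) {
        this.client = client;
        this.namespace = namespace;
        this.serviceName = serviceName;
    }

    Watch start() {
        return client.endpoints().inNamespace(namespace).withName(serviceName).watch(this);
    }

    @Override
    public void eventReceived(Action action, Endpoints endpoints) {
        // Translate the watch event into an endpoint update here.
    }

    @Override
    public void onClose(WatcherException cause) {
        // fabric8 invokes this overload on abnormal closure, e.g.
        // "too old resource version". Restart unless the group was closed.
        if (!closed) {
            start();
        }
    }

    void close() {
        closed = true;
    }
}
```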
```java
} else {
    podWatch = podWatch0;
}
watchPodAsync(service0.getSpec().getSelector());
```
Question) What do you think of keeping the selector as a volatile field in `KubernetesEndpointGroup`, and having the pod thread continuously watch this field? I'm imagining the case where:

1. `watchPodAsync(newSelector)` is called.
2. Slightly later, an exception is thrown and `watchPodAsync(oldSelector)` is called.
3. The new selector is not used.

If the above case is not possible, the current implementation seems fine.
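For illustration, a minimal sketch of the volatile-selector approach the question describes; `SelectorHolder` and its methods are hypothetical names, not code from this PR:

```java
import java.util.Map;

// Every (re)connect reads the latest selector from a volatile field,
// so a stale reconnect cannot resurrect an old selector.
final class SelectorHolder {

    private volatile Map<String, String> selector = Map.of();

    // Called when the service spec changes.
    void updateSelector(Map<String, String> newSelector) {
        selector = newSelector;
    }

    // Invoked on every (re)connect; always observes the most recent selector.
    Map<String, String> currentSelector() {
        return selector;
    }
}
```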
Left some minor suggestions. 👍
```java
/**
 * The debounce millis for the update of the endpoints.
 * A short delay would be enough because the initial events are delivered sequentially.
 */
private static final int DEBOUNCE_MILLIS = 10;
```
Isn't it too short? Maybe 100 millis? (I'm worried about making another flaky test. 😉)
Changed to 100 ms.
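For reference, a minimal sketch of how such a debounce can be implemented with a scheduled executor: each incoming event cancels the previously scheduled update, so only the last update within the window is published. All names here are illustrative, not the actual implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class DebouncingUpdater {

    private static final int DEBOUNCE_MILLIS = 100;

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private ScheduledFuture<?> pendingUpdate;

    // Called for every watch event; only the last call within the
    // debounce window actually publishes the endpoints.
    synchronized void scheduleUpdate(Runnable publishEndpoints) {
        if (pendingUpdate != null) {
            pendingUpdate.cancel(false);
        }
        pendingUpdate = scheduler.schedule(publishEndpoints, DEBOUNCE_MILLIS, TimeUnit.MILLISECONDS);
    }
}
```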
logger.warn("Pod watcher for {}/{} is closed.", namespace, serviceName, cause); | ||
logger.warn("[{}/{}] Pod watcher is closed.", namespace, serviceName, cause); | ||
logger.info("[{}/{}] Reconnecting the pod watcher...", namespace, serviceName); | ||
// TODO(ikhoon): Add a backoff strategy to prevent rapid reconnections when the pod watcher |
Can we at least add a delay (e.g. 3 seconds) to prevent the situation?
Refactored to count errors that occur consecutively; a sketch follows this list.
- Retry immediately on the first failure.
- Delay with a backoff from the second failure onwards.
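A minimal sketch of that consecutive-failure counting with an exponential backoff; `ReconnectScheduler` and its methods are hypothetical names, and the actual delays and cap used in the PR may differ:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ReconnectScheduler {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger numConsecutiveFailures = new AtomicInteger();

    // Retry immediately on the first failure; back off exponentially
    // (capped at 30 seconds) from the second failure onwards.
    void scheduleReconnect(Runnable reconnect) {
        final int failures = numConsecutiveFailures.getAndIncrement();
        if (failures == 0) {
            scheduler.execute(reconnect);
        } else {
            final long delayMillis = Math.min(1000L << Math.min(failures - 1, 5), 30_000);
            scheduler.schedule(reconnect, delayMillis, TimeUnit.MILLISECONDS);
        }
    }

    // Call when a watcher connects and starts receiving events successfully.
    void resetFailures() {
        numConsecutiveFailures.set(0);
    }
}
```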
👍
👍 👍 👍 Thanks!