Automatically reconnect Kubernetes watcher when closed exceptionally #6023

Merged
merged 6 commits into line:main on Dec 11, 2024

Conversation


@ikhoon ikhoon commented Dec 7, 2024

Motivation:

A watcher in `KubernetesEndpointGroup` automatically reconnects when it fails to connect to the remote peer. However, it does not reconnect when a `WatcherException` is raised.

```
io.fabric8.kubernetes.client.WatcherException: too old resource version: 573375490 (573377297)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:401)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:369)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:52)
	at com.linecorp.armeria.client.kubernetes.ArmeriaWebSocket.onNext(ArmeriaWebSocket.java:106)
	at com.linecorp.armeria.client.kubernetes.ArmeriaWebSocket.onNext(ArmeriaWebSocket.java:37)
	at com.linecorp.armeria.common.stream.DefaultStreamMessage.notifySubscriberWithElements(DefaultStreamMessage.java:412)
        ...
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 573375490 (573377297)
	... 62 common frames omitted
```

I don't know why `too old resource version` was raised, but the important point is that watchers should not be stopped until `KubernetesEndpointGroup` itself is closed.

Modifications:

- Refactor `KubernetesEndpointGroup` to start watchers asynchronously.
- Automatically restart `Watcher`s when `onClose(WatcherException)` is invoked.
- Add more logs that may be useful for debugging.
  - Also make the log formats consistent.
- Debounce endpoint updates to prevent `EndpointGroup.whenReady()` from completing with only a small number of endpoints.
  - This prevents a few endpoints from receiving too much traffic when a watcher is newly created.

Result:

`KubernetesEndpointGroup` automatically reconnects a `Watcher` when a `WatcherException` is raised.
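The restart-on-close behavior can be sketched as follows. This is a minimal illustration using hypothetical stand-in types, not the actual `KubernetesEndpointGroup` code; in the real PR the reconnect is triggered from fabric8's `Watcher.onClose(WatcherException)` callback.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for fabric8's WatcherException, for illustration only.
class WatcherException extends Exception {
    WatcherException(String message) { super(message); }
}

// Minimal sketch: restart the watch when it closes exceptionally, unless the
// enclosing endpoint group has been closed by the user.
class ReconnectingWatcher {
    private final AtomicBoolean closed = new AtomicBoolean();
    private int restartCount;

    // Invoked by the watch machinery when the watch terminates abnormally.
    void onClose(WatcherException cause) {
        if (closed.get()) {
            return; // The endpoint group was closed; do not reconnect.
        }
        restartCount++;
        startWatch(); // Restart instead of giving up.
    }

    // In the real code this would (re)issue the Kubernetes watch request.
    void startWatch() {}

    void close() { closed.set(true); }

    int restartCount() { return restartCount; }
}
```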

@ikhoon ikhoon added the defect label Dec 7, 2024
@ikhoon ikhoon added this to the 1.31.3 milestone Dec 7, 2024
```java
} else {
    podWatch = podWatch0;
}
watchPodAsync(service0.getSpec().getSelector());
```
Question) What do you think of keeping the selector as a volatile field in KubernetesEndpointGroup and having the pod watcher thread continuously watch this field?

I'm imagining the case where

  1. watchPodAsync(newSelector) is called
  2. slightly later, an exception is thrown and watchPodAsync(oldSelector) is called
  3. the new selector is not used

If the above case is not possible, the current implementation seems fine.
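If the race is possible, one way to realize the suggestion above is to keep the latest selector in a shared mutable reference that every (re)connect reads, so a reconnect triggered by an old failure can never resurrect a stale selector. A sketch with hypothetical names (`PodWatchState` is not a class in the PR):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: the latest selector lives in one shared field; a reconnect always
// reads the current value instead of reusing the selector it was started with.
class PodWatchState {
    private final AtomicReference<Map<String, String>> selector = new AtomicReference<>();

    // Called when the service's selector changes.
    void updateSelector(Map<String, String> newSelector) {
        selector.set(newSelector);
    }

    // Called whenever the pod watch (re)connects; always the latest value.
    Map<String, String> selectorForReconnect() {
        return selector.get();
    }
}
```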

@minwoox minwoox left a comment
Left some minor suggestions. 👍

```java
/**
 * The debounce millis for the update of the endpoints.
 * A short delay would be enough because the initial events are delivered sequentially.
 */
private static final int DEBOUNCE_MILLIS = 10;
```
Isn't it too short? Maybe 100 millis? (I'm worried about making another flaky test. 😉)

Changed to 100 ms.
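The debounce discussed here can be sketched with a `ScheduledExecutorService`: each incoming endpoint event cancels the pending flush and reschedules it, so a burst of initial events yields a single endpoints update once the burst goes quiet. This is an illustrative shape, not the PR's actual implementation; `EndpointUpdateDebouncer` is a hypothetical name.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of a debouncer: coalesces rapid endpoint events into one flush.
class EndpointUpdateDebouncer {
    private static final long DEBOUNCE_MILLIS = 100;

    private final ScheduledExecutorService executor;
    private final Runnable flush;
    private ScheduledFuture<?> pending;

    EndpointUpdateDebouncer(ScheduledExecutorService executor, Runnable flush) {
        this.executor = executor;
        this.flush = flush;
    }

    // Each event pushes the flush back by DEBOUNCE_MILLIS, so a burst of
    // sequential initial events produces a single endpoints update.
    synchronized void onEndpointEvent() {
        if (pending != null) {
            pending.cancel(false); // Coalesce with the previous event.
        }
        pending = executor.schedule(flush, DEBOUNCE_MILLIS, TimeUnit.MILLISECONDS);
    }
}
```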

```java
logger.warn("Pod watcher for {}/{} is closed.", namespace, serviceName, cause);
logger.warn("[{}/{}] Pod watcher is closed.", namespace, serviceName, cause);
logger.info("[{}/{}] Reconnecting the pod watcher...", namespace, serviceName);
// TODO(ikhoon): Add a backoff strategy to prevent rapid reconnections when the pod watcher
```
Can we at least add a delay (e.g. 3 seconds) to prevent the situation?

Refactored to count errors that occur consecutively:

  • Retry immediately on the first failure.
  • Apply a backoff delay from the second failure onward.
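That retry policy (immediate retry on the first failure, exponential backoff afterwards) can be sketched as below; the base and maximum delays are assumed values, not taken from the PR.

```java
// Sketch of the reconnect backoff: the first failure retries immediately,
// subsequent consecutive failures back off exponentially up to a cap.
final class WatcherBackoff {
    private static final long BASE_DELAY_MILLIS = 1000;  // assumed value
    private static final long MAX_DELAY_MILLIS = 30_000; // assumed value

    private int consecutiveFailures;

    // Records a failure and returns how long to wait before reconnecting.
    long nextDelayMillis() {
        consecutiveFailures++;
        if (consecutiveFailures == 1) {
            return 0; // Immediate retry for the first failure.
        }
        // 1s, 2s, 4s, ... capped at MAX_DELAY_MILLIS (shift is bounded to
        // avoid overflow for long failure streaks).
        final long delay = BASE_DELAY_MILLIS << Math.min(consecutiveFailures - 2, 30);
        return Math.min(delay, MAX_DELAY_MILLIS);
    }

    // Resets the counter once a watch connection succeeds.
    void reset() {
        consecutiveFailures = 0;
    }
}
```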

@trustin trustin left a comment
👍

@minwoox minwoox left a comment
👍 👍 👍 Thanks!

@ikhoon ikhoon merged commit 1c9a22d into line:main Dec 11, 2024
13 of 14 checks passed
@ikhoon ikhoon deleted the kubernetes-endpoints-rewatch branch December 11, 2024 02:36
ikhoon added a commit to ikhoon/armeria that referenced this pull request Dec 11, 2024