
Watcher handle graceful onclose #56

Merged: 24 commits into main, Feb 13, 2024
Conversation

@NiklasJonsson6 commented Dec 18, 2023

We mistakenly did not implement any behavior for when our watch for kubernetes endpoints closed gracefully, since it has a default implementation that only logs a message on the debug level.

  • Log at DEBUG level for the Kubernetes client.

  • Handle graceful watcher close the same way we handle exceptional close (log and exit the application with code 11); see the sketch below.
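For reference, a minimal sketch of what overriding the graceful onClose can look like with the fabric8 Watcher interface; the class name, logger and log messages are illustrative rather than the repository's actual code:

```java
import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;
import org.jboss.logging.Logger;

// Sketch only: treat a graceful close the same as an exceptional close,
// so the application exits (code 11) and the orchestrator restarts it.
class EndpointsWatcher implements Watcher<Endpoints> {

  private static final Logger log = Logger.getLogger(EndpointsWatcher.class);

  @Override
  public void eventReceived(Action action, Endpoints resource) {
    // react to endpoint changes here
  }

  @Override
  public void onClose() {
    // The fabric8 default implementation of this method only logs at debug,
    // which is why a graceful close previously went unnoticed.
    log.error("Watch closed gracefully, exiting");
    System.exit(11);
  }

  @Override
  public void onClose(WatcherException cause) {
    log.error("Watch closed exceptionally, exiting", cause);
    System.exit(11);
  }
}
```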

NiklasJonsson6 and others added 4 commits December 18, 2023 15:48
the same way as we do when the watch closes exceptionally. The graceful onClose is default-implemented with a debug log message so we had mistakenly not overridden it.
We had no sign of graceful watcher onClose since it only logged a debug message in its default implementation
@solsson left a comment

Do we want to test any of these, from client reference?

[Screenshot: 2023-12-18 at 16 43 24]

@@ -50,6 +50,8 @@ quarkus:
         level: DEBUG
       "org.apache.kafka.clients.Metadata":
         level: DEBUG
+      "io.fabric8.kubernetes.client":
+        level: DEBUG

We should use an env here. This is great for troubleshooting stuff like ECONNRESET but we should probably default to info.


Let's do a single-arch build with DEBUG, so we don't have to update lots of yaml.

@NiklasJonsson6 (author)
> Do we want to test any of these, from client reference?
>
> [Screenshot: 2023-12-18 at 16 43 24]

I think no, since the default reconnect behavior seems quite aggressive. The issue we had was (most likely) a graceful close of the watch according to the client, so in that case I would assume no reconnect is done. Otherwise, unlimited retries at a 1-second interval (the default reconnect behavior) would surely have succeeded eventually.
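For context, the defaults referred to here are the fabric8 client's watch reconnect settings. A hedged sketch of making them explicit via ConfigBuilder (the values shown mirror the defaults described above and are not settings taken from this repository):

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class WatchReconnectDefaults {
  public static void main(String[] args) {
    // Sketch: spell out the reconnect behaviour instead of relying on defaults.
    Config config = new ConfigBuilder()
        .withWatchReconnectInterval(1000) // interval between reconnect attempts, in milliseconds
        .withWatchReconnectLimit(-1)      // -1 means retry without an upper bound
        .build();

    KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build();
    // ... set up watches with this client ...
  }
}
```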

solsson commented Dec 27, 2023

Watching is never logged, despite debug level. Probably a configuration error on our side. WebSocket is also never logged.

solsson commented Dec 28, 2023

Evaluating docker.io/yolean/kafka-keyvalue:509399df255d03f7fe4fb63a0b5c653b7ddf8aab@sha256:6eec474b2a58870bc404523ddabd92cbdefbe78bcb29e85891042f7721bd7aa5 with GKE 1.27 API-server. It seems to recover from "too old resource version" and keep endpoints unique and up-to-date.

I haven't added test coverage because I think we might want to switch to an informer (or are they always watching all namespaces?) in which case none of these hacks will be needed, at least not in their current form.

solsson commented Dec 28, 2023

I still don't know how API-server restarts behave now. We do, however, have debug logging for the websocket connections since 252f46d.
fabric8io/kubernetes-client#5189

solsson commented Dec 28, 2023

fabric8io/kubernetes-client#5372 (comment)

> This is handled automatically by Informers, but not for Watches. Bookmarks will be used by default, which for newer api servers make this exception much more infrequent.
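For orientation, a rough sketch of what an informer-based approach could look like with the fabric8 client; the namespace, resync period and handler bodies are assumptions for illustration, not code from this PR:

```java
import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class EndpointsInformerSketch {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // The informer re-lists and re-watches internally, so connection drops and
    // "too old resource version" are handled without hand-written reconnect logic.
    SharedIndexInformer<Endpoints> informer = client.endpoints()
        .inNamespace("my-namespace") // hypothetical; informers can be namespace-scoped
        .inform(new ResourceEventHandler<Endpoints>() {
          @Override
          public void onAdd(Endpoints obj) {
            // initial list and later additions
          }

          @Override
          public void onUpdate(Endpoints oldObj, Endpoints newObj) {
            // endpoint changes
          }

          @Override
          public void onDelete(Endpoints obj, boolean deletedFinalStateUnknown) {
            // removals
          }
        }, 60_000L); // resync period in milliseconds; 0 disables resync

    Runtime.getRuntime().addShutdownHook(new Thread(informer::stop));
  }
}
```

As far as I know this also answers the "always watching all namespaces" question: fabric8 informers can be scoped to a single namespace as above, or created for all namespaces via inAnyNamespace().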

@solsson left a comment

My interpretation is that we restart watches all the time. Correct?

@NiklasJonsson6 (author)

After the latest changes, the integration test validates that we reconnect our watch after being disconnected. Running the test confirms that we actually get onClose events after reconnect fails. This might be because we're running a newer client version now, where issues like fabric8io/kubernetes-client#5189 are fixed.

If the integration test failed to reproduce the issue, we're back to square one...
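To make this concrete, here is a rough sketch of the kind of reconnect-on-close watch logic being exercised; the class name, delay range and structure are assumptions rather than the repository's actual code, and the commits below replace this approach with an informer:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Sketch only: re-establish the watch after any close, with a randomized delay.
class ReconnectingEndpointsWatch {

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final KubernetesClient client;
  private final String namespace;
  private final String serviceName;

  ReconnectingEndpointsWatch(KubernetesClient client, String namespace, String serviceName) {
    this.client = client;
    this.namespace = namespace;
    this.serviceName = serviceName;
  }

  void start() {
    client.endpoints().inNamespace(namespace).withName(serviceName).watch(new Watcher<Endpoints>() {
      @Override
      public void eventReceived(Action action, Endpoints resource) {
        // update the in-memory endpoints view here
      }

      @Override
      public void onClose() {
        scheduleReconnect(); // graceful close: reconnect rather than exit
      }

      @Override
      public void onClose(WatcherException cause) {
        scheduleReconnect();
      }
    });
  }

  private void scheduleReconnect() {
    // randomized delay so replicas do not hammer the API server in lockstep
    long delayMs = ThreadLocalRandom.current().nextLong(1_000, 5_000);
    scheduler.schedule(this::start, delayMs, TimeUnit.MILLISECONDS);
  }
}
```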

and select a random value between to use
the informer handles reconnects internally, which makes it unnecessary for us to write our own logic for that. Also, the replaced watch reconnect behaviour we wrote ourselves was not perfect and regularly re-watched without a need to do so
we modify two collections in sequence and the method is invoked by random threadpool-threads
@@ -1,7 +1,7 @@
kkv:
  namespace: ${NAMESPACE}
@solsson Jan 31, 2024

By convention we use POD_NAMESPACE when depending on the downward API for envs.

Alternatively I think we should use WATCH_NAMESPACE or something like TARGET_NAMESPACE if we think that watching a different namespace is a reasonable use case.

@NiklasJonsson6 (author)

I'll move namespace and resync-period to kkv.target.service (next to service name and port), since they are directly related to this.

I think that TARGET_NAMESPACE, or the even more descriptive TARGET_SERVICE_NAMESPACE, is a good idea. Even though we always run kkv in the same namespace as the service, I really don't think there are any technical restrictions to allowing other namespaces as well.
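To visualize the proposal, a small sketch of how kkv.target.service.* keys could be injected in a Quarkus bean; the property names follow the comment above and are otherwise assumptions, not the repository's actual configuration:

```java
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.config.inject.ConfigProperty;

// Sketch only: property names mirror the proposal above and are not confirmed.
@ApplicationScoped
public class TargetServiceConfig {

  @ConfigProperty(name = "kkv.target.service.namespace")
  String namespace; // could be fed from a TARGET_SERVICE_NAMESPACE env in application.yaml

  @ConfigProperty(name = "kkv.target.service.name")
  String serviceName;

  @ConfigProperty(name = "kkv.target.service.port")
  int port;
}
```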

@NiklasJonsson6 merged commit 44affa5 into main on Feb 13, 2024