
Watcher handle graceful onclose #56

Merged: 24 commits into main, Feb 13, 2024
Conversation

@NiklasJonsson6 commented Dec 18, 2023

We mistakenly did not implement any behavior for when our watch for kubernetes endpoints closed gracefully, since it has a default implementation that only logs a message on the debug level.

  • Log at DEBUG level for the Kubernetes client.

  • Handle graceful watcher close the same way we handle exceptional close (log and exit the application with code 11); see the sketch below.
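For reference, a minimal sketch of what overriding the graceful onClose can look like with the fabric8 Watcher interface; the class name, logger and log messages are illustrative rather than the repository's actual code:

```java
import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;
import org.jboss.logging.Logger;

// Sketch only: treat a graceful close the same as an exceptional close,
// so the application exits (code 11) and the orchestrator restarts it.
class EndpointsWatcher implements Watcher<Endpoints> {

  private static final Logger log = Logger.getLogger(EndpointsWatcher.class);

  @Override
  public void eventReceived(Action action, Endpoints resource) {
    // react to endpoint changes here
  }

  @Override
  public void onClose() {
    // The fabric8 default implementation of this method only logs at debug,
    // which is why a graceful close previously went unnoticed.
    log.error("Watch closed gracefully, exiting");
    System.exit(11);
  }

  @Override
  public void onClose(WatcherException cause) {
    log.error("Watch closed exceptionally, exiting", cause);
    System.exit(11);
  }
}
```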

NiklasJonsson6 and others added 4 commits December 18, 2023 15:48
the same way as we do when the watch closes exceptionally. The graceful onClose is default-implemented with a debug log message so we had mistakenly not overridden it.
We had no sign of graceful watcher onClose since it only logged a debug message in its default implementation
@solsson left a comment

Do we want to test any of these, from client reference?

[Screenshot: 2023-12-18 at 16 43 24]

@@ -50,6 +50,8 @@ quarkus:
         level: DEBUG
       "org.apache.kafka.clients.Metadata":
         level: DEBUG
+      "io.fabric8.kubernetes.client":
+        level: DEBUG

We should use an env here. This is great for troubleshooting stuff like ECONNRESET but we should probably default to info.


Let's do a single-arch build with DEBUG, so we don't have to update lots of yaml.

@NiklasJonsson6 (author)
> Do we want to test any of these, from client reference?
>
> [Screenshot: 2023-12-18 at 16 43 24]

I think no, since the default reconnect behavior seems quite aggressive. The issue we had was (most likely) a graceful close of the watch according to the client, so in that case I would assume no reconnect is done. Otherwise, unlimited retries at a 1-second interval (the default reconnect behavior) would surely have succeeded eventually.
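For context, the defaults referred to here are the fabric8 client's watch reconnect settings. A hedged sketch of making them explicit via ConfigBuilder (the values shown mirror the defaults described above and are not settings taken from this repository):

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class WatchReconnectDefaults {
  public static void main(String[] args) {
    // Sketch: spell out the reconnect behaviour instead of relying on defaults.
    Config config = new ConfigBuilder()
        .withWatchReconnectInterval(1000) // interval between reconnect attempts, in milliseconds
        .withWatchReconnectLimit(-1)      // -1 means retry without an upper bound
        .build();

    KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build();
    // ... set up watches with this client ...
  }
}
```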

solsson commented Dec 27, 2023

Watching is never logged, despite debug level. Probably a configuration error on our side. WebSocket is also never logged.

solsson commented Dec 28, 2023

Evaluating docker.io/yolean/kafka-keyvalue:509399df255d03f7fe4fb63a0b5c653b7ddf8aab@sha256:6eec474b2a58870bc404523ddabd92cbdefbe78bcb29e85891042f7721bd7aa5 with GKE 1.27 API-server. It seems to recover from "too old resource version" and keep endpoints unique and up-to-date.

I haven't added test coverage because I think we might want to switch to an informer (or are they always watching all namespaces?) in which case none of these hacks will be needed, at least not in their current form.

solsson commented Dec 28, 2023

I still don't know how API-server restarts behave now. We do, however, have debug logging for the websocket connections since 252f46d.
fabric8io/kubernetes-client#5189

solsson commented Dec 28, 2023

fabric8io/kubernetes-client#5372 (comment)

> This is handled automatically by Informers, but not for Watches. Bookmarks will be used by default, which for newer api servers make this exception much more infrequent.
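For orientation, a rough sketch of what an informer-based approach could look like with the fabric8 client; the namespace, resync period and handler bodies are assumptions for illustration, not code from this PR:

```java
import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class EndpointsInformerSketch {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // The informer re-lists and re-watches internally, so connection drops and
    // "too old resource version" are handled without hand-written reconnect logic.
    SharedIndexInformer<Endpoints> informer = client.endpoints()
        .inNamespace("my-namespace") // hypothetical; informers can be namespace-scoped
        .inform(new ResourceEventHandler<Endpoints>() {
          @Override
          public void onAdd(Endpoints obj) {
            // initial list and later additions
          }

          @Override
          public void onUpdate(Endpoints oldObj, Endpoints newObj) {
            // endpoint changes
          }

          @Override
          public void onDelete(Endpoints obj, boolean deletedFinalStateUnknown) {
            // removals
          }
        }, 60_000L); // resync period in milliseconds; 0 disables resync

    Runtime.getRuntime().addShutdownHook(new Thread(informer::stop));
  }
}
```

As far as I know this also answers the "always watching all namespaces" question: fabric8 informers can be scoped to a single namespace as above, or created for all namespaces via inAnyNamespace().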

@solsson left a comment

My interpretation is that we restart watches all the time. Correct?

@NiklasJonsson6 (author)

After the latest changes, the integration test validates that we reconnect our watch after being disconnected. Running the test confirms that we actually get onClose events after reconnect fails. This might be because we're running a newer client version now, where issues like fabric8io/kubernetes-client#5189 are fixed.

If the integration test failed to reproduce the issue, we're back to square one...
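To make this concrete, here is a rough sketch of the kind of reconnect-on-close watch logic being exercised; the class name, delay range and structure are assumptions rather than the repository's actual code, and the commits below replace this approach with an informer:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Endpoints;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Sketch only: re-establish the watch after any close, with a randomized delay.
class ReconnectingEndpointsWatch {

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final KubernetesClient client;
  private final String namespace;
  private final String serviceName;

  ReconnectingEndpointsWatch(KubernetesClient client, String namespace, String serviceName) {
    this.client = client;
    this.namespace = namespace;
    this.serviceName = serviceName;
  }

  void start() {
    client.endpoints().inNamespace(namespace).withName(serviceName).watch(new Watcher<Endpoints>() {
      @Override
      public void eventReceived(Action action, Endpoints resource) {
        // update the in-memory endpoints view here
      }

      @Override
      public void onClose() {
        scheduleReconnect(); // graceful close: reconnect rather than exit
      }

      @Override
      public void onClose(WatcherException cause) {
        scheduleReconnect();
      }
    });
  }

  private void scheduleReconnect() {
    // randomized delay so replicas do not hammer the API server in lockstep
    long delayMs = ThreadLocalRandom.current().nextLong(1_000, 5_000);
    scheduler.schedule(this::start, delayMs, TimeUnit.MILLISECONDS);
  }
}
```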

and select a random value between to use
the informer handles reconnects internally, which makes it unnecessary for us to write our own logic for that. Also, the replaced watch reconnect behaviour we wrote ourselves was not perfect and regularly re-watched without a need to do so
we modify two collections in sequence and the method is invoked by random threadpool-threads
@@ -1,7 +1,7 @@
kkv:
  namespace: ${NAMESPACE}
@solsson Jan 31, 2024

By convention we use POD_NAMESPACE when depending on the downward API for envs.

Alternatively I think we should use WATCH_NAMESPACE or something like TARGET_NAMESPACE if we think that watching a different namespace is a reasonable use case.

@NiklasJonsson6 (author)

I'll move namespace and resync-period to kkv.target.service (next to service name and port), since they are directly related to this.

I think that TARGET_NAMESPACE, or the even more descriptive TARGET_SERVICE_NAMESPACE, is a good idea. Even though we always run kkv in the same namespace as the service, I really don't think there are any technical restrictions to allowing other namespaces as well.
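To visualize the proposal, a small sketch of how kkv.target.service.* keys could be injected in a Quarkus bean; the property names follow the comment above and are otherwise assumptions, not the repository's actual configuration:

```java
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.config.inject.ConfigProperty;

// Sketch only: property names mirror the proposal above and are not confirmed.
@ApplicationScoped
public class TargetServiceConfig {

  @ConfigProperty(name = "kkv.target.service.namespace")
  String namespace; // could be fed from a TARGET_SERVICE_NAMESPACE env in application.yaml

  @ConfigProperty(name = "kkv.target.service.name")
  String serviceName;

  @ConfigProperty(name = "kkv.target.service.port")
  int port;
}
```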

@NiklasJonsson6 merged commit 44affa5 into main on Feb 13, 2024