
Kubernetes websocket watches silently dying #5189

Closed
duke-bartholomew opened this issue May 31, 2023 · 10 comments

Comments

@duke-bartholomew

Describe the bug

This might be related to #5152

I'm using Strimzi (v0.35.0) as the Kafka operator, which uses the kubernetes-client (httpclient-jdk:v6.6.1) under the hood to watch k8s resources.
I'm noticing websocket connections going stale / not being reconnected quite often. By "often" I mean that, in my case, the websockets die approximately 30 minutes after the operator starts.
The only way to recover from that, and have the operator functioning properly again, is to restart the kafka-operator container.

Is there a way to work around this problem, or to find out what the root cause is?
AKS might also be at fault by not responding to the close frame sent by the client, but how can I be 100% sure about this?

Fabric8 Kubernetes Client version

other (please specify in additional context)

Steps to reproduce

  1. start the strimzi user-operator
  2. wait approx. 30 minutes
  3. try to add a resource (KafkaUser): it's not picked up; the logs show the ws connection being closed

Expected behavior

This is a very good question.
I would expect the watches/informers to keep working instead of silently dying.

On the other hand: why is there a need to close the websocket connections every 'n' minutes anyway?
WebSockets have a keep-alive mechanism, so why is this not sufficient?

Runtime

other (please specify in additional context)

Kubernetes API Server version

other (please specify in additional context)

Environment

Azure

Fabric8 Kubernetes Client Logs

$ kubectl logs -f deployment/kafka-entity-operator -c user-operator -n infra-kafka | grep Watch

2023-05-30 15:55:14 DEBUG AbstractWatchManager:283 - Watching https://10.0.0.1:443/apis/kafka.strimzi.io/v1beta2/namespaces/infra-kafka/kafkausers?labelSelector=strimzi.io%2Fcluster%3Dkafka&resourceVersion=6763909&timeoutSeconds=600&allowWatchBookmarks=true&watch=true...
2023-05-30 15:55:14 DEBUG AbstractWatchManager:283 - Watching https://10.0.0.1:443/api/v1/namespaces/infra-kafka/secrets?labelSelector=strimzi.io%2Fcluster%3Dkafka%2Cstrimzi.io%2Fkind%3DKafkaUser&resourceVersion=6763754&timeoutSeconds=600&allowWatchBookmarks=true&watch=true...
2023-05-30 15:55:15 DEBUG WatcherWebSocketListener:41 - WebSocket successfully opened
2023-05-30 15:55:15 DEBUG Reflector:129 - Watch started for v1/namespaces/infra-kafka/secrets
2023-05-30 15:55:15 DEBUG WatcherWebSocketListener:41 - WebSocket successfully opened
2023-05-30 15:55:15 DEBUG Reflector:129 - Watch started for kafka.strimzi.io/v1beta2/namespaces/infra-kafka/kafkausers
2023-05-30 16:01:03 DEBUG WatchConnectionManager:60 - Closing websocket io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl@5aa9c893
2023-05-30 16:01:03 DEBUG WatcherWebSocketListener:71 - WebSocket close received. code: 1000, reason: 
2023-05-30 16:01:03 DEBUG AbstractWatchManager:200 - Scheduling reconnect task in 1000 ms
2023-05-30 16:01:04 DEBUG AbstractWatchManager:283 - Watching https://10.0.0.1:443/api/v1/namespaces/infra-kafka/secrets?labelSelector=strimzi.io%2Fcluster%3Dkafka%2Cstrimzi.io%2Fkind%3DKafkaUser&resourceVersion=6768294&timeoutSeconds=600&allowWatchBookmarks=true&watch=true...
2023-05-30 16:01:04 DEBUG WatcherWebSocketListener:41 - WebSocket successfully opened
2023-05-30 16:03:12 DEBUG WatchConnectionManager:60 - Closing websocket io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl@7dfe1f34
2023-05-30 16:03:12 DEBUG WatcherWebSocketListener:71 - WebSocket close received. code: 1000, reason: 
2023-05-30 16:03:12 DEBUG AbstractWatchManager:200 - Scheduling reconnect task in 1000 ms
2023-05-30 16:03:13 DEBUG AbstractWatchManager:283 - Watching https://10.0.0.1:443/apis/kafka.strimzi.io/v1beta2/namespaces/infra-kafka/kafkausers?labelSelector=strimzi.io%2Fcluster%3Dkafka&resourceVersion=6768293&timeoutSeconds=600&allowWatchBookmarks=true&watch=true...
2023-05-30 16:03:14 DEBUG WatcherWebSocketListener:41 - WebSocket successfully opened
2023-05-30 16:09:17 DEBUG WatchConnectionManager:60 - Closing websocket io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl@32c848aa
2023-05-30 16:10:31 DEBUG WatchConnectionManager:60 - Closing websocket io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl@3f1a48f4

// ... nothing happening anymore ...
// Now (UTC) = 2023-05-30 17:39:00

Additional context

kubernetes-client version: 6.6.1
Kubernetes API Server version: AKS v1.25.5
Java version inside Strimzi operator:

openjdk version "17.0.7" 2023-04-18 LTS
OpenJDK Runtime Environment (Red_Hat-17.0.7.0.7-1.el8_7) (build 17.0.7+7-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-17.0.7.0.7-1.el8_7) (build 17.0.7+7-LTS, mixed mode, sharing)

related discussion on Strimzi GitHub: strimzi/strimzi-kafka-operator#8586

@shawkins
Contributor

This might be related to #5152

Yes, this should be addressed by that discussion / issue in 6.7.

Is there a way to work around this problem, or to find out what the root cause is?

I'm not aware of a great workaround. It may be worth trying a different fabric8 http client if possible - if this happens with jetty or vertx with 6.6.1, then it's likely that we're not seeing a close back from the remote side. OkHttp does kill the websocket after 1 minute and should guarantee that we'll see onClose called.
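
For reference, a rough sketch of pinning the client to a specific HTTP client implementation rather than relying on classpath discovery - this assumes the kubernetes-httpclient-okhttp module has been added as a dependency, and the exact factory class name should be double checked for your version:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory;

public class ClientSelection {
  public static void main(String[] args) {
    // explicitly pick the OkHttp-based implementation instead of whichever
    // HttpClient.Factory happens to be discovered first on the classpath
    try (KubernetesClient client = new KubernetesClientBuilder()
        .withHttpClientFactory(new OkHttpClientFactory()) // assumes kubernetes-httpclient-okhttp is on the classpath
        .build()) {
      System.out.println("Connected to " + client.getKubernetesVersion().getGitVersion());
    }
  }
}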

AKS might also be at fault by not responding to the close frame sent by the client, but how can I be 100% sure about this?

Yes, the most likely problem is that we are never seeing the websocket as truly closed - we send a close, but neither onClose nor onError is ever invoked. It is not clear exactly what the root cause is. There is one possible bad path in the fabric8 client addressed by #5153 - if we don't see a close back from the remote side within a minute, we'll abort the websocket. However, once aborted, neither onError nor onClose will be invoked. To guard against this more generally, with #5153 the watch manager will forcibly start another watch if it doesn't see the previous one die as expected.

@shawkins
Contributor

@andreaTP @manusa to recap from our call today - the behavior in 6.5 and before (before adding our client-side and url-based timeouts) was that it was up to the server to terminate the watch, which typically should have happened after 1-2 hours. Reports from strimzi confirm that those long-running watches would at some point seem to stop receiving events, and we would see neither a close nor an error event. To mitigate that we introduced timeouts on the watch - similar to what is in the go client, but ours will generally be triggered from the client side.

The issues strimzi is seeing are likely largely due to the jdk httpclient. The closure handling for the okhttp client was already more tolerant even without #5153, so the changes for the timeout handling should be fine there.

The changes in #5153 should make the jetty and jdk httpclients tolerant of problematic closure as well - however vertx does not support the notion of aborting / terminating the websocket, so it will just log if it can't complete a close as expected.

#5153 will forcibly start a new watch should the last one not end appropriately. That will emit an error log:

logger.error("The last watch has yet to terminate as expected, will force start another watch. "
Depending upon how frequently and for which clients we see this (it won't be expected for okhttp, vertx, or jdk because they should forcibly close the connection after 1 minute), we can evaluate next steps, which include:

  1. creating an upstream kubernetes / go client issue noting that we're missing close messages and the connection seems to remain open.
  2. shortening the timeouts - waiting 2 minutes from the time we try to send close may be too long. For the clients that support it, once the close future is completed we could proactively abort / terminate - which would mimic the go client (a rough sketch follows the list).
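
To illustrate the second option, a close-then-abort watchdog on a raw JDK websocket could look roughly like this (plain java.net.http, not the actual fabric8 internals; the one-minute grace period is only an example value):

import java.net.http.WebSocket;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class CloseWatchdog {
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  // 'closed' should be completed by the listener's onClose / onError callbacks
  void closeWithDeadline(WebSocket webSocket, CompletableFuture<Void> closed) {
    webSocket.sendClose(WebSocket.NORMAL_CLOSURE, "")
        .exceptionally(t -> {
          webSocket.abort(); // write error: terminate immediately
          return null;
        });
    // if the server never answers our close, force the connection down
    scheduler.schedule(() -> {
      if (!closed.isDone()) {
        webSocket.abort();
      }
    }, 1, TimeUnit.MINUTES);
  }
}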

@duke-bartholomew
Author

@shawkins I've done some more testing to try to find out what could actually be the root cause of this issue with the strimzi-operator, but I'm getting more and more puzzled, so I thought I'd just share what I've found so far, in the hope it helps you tackle this issue.

I created a little test application that, like the strimzi-user-operator, has 2 informers: one normal informer that just logs the resources seen, and one hooked to a Lister on which I list() all entries from another thread every 30 seconds. This is just to roughly mimic the watching behaviour of the informers in the strimzi-user-operator (see the sketch below).
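
The core of the test application looks roughly like this (Secret is used here as a stand-in for the resources the user-operator actually watches, and the resync / polling values are just what I happened to use):

import io.fabric8.kubernetes.api.model.Secret;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.cache.Lister;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class InformerTest {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();
    String ns = "infra-kafka";

    // informer 1: just log every event it sees
    client.secrets().inNamespace(ns).inform(new ResourceEventHandler<Secret>() {
      public void onAdd(Secret obj) { System.out.println("ADD " + obj.getMetadata().getName()); }
      public void onUpdate(Secret oldObj, Secret newObj) { System.out.println("UPDATE " + newObj.getMetadata().getName()); }
      public void onDelete(Secret obj, boolean deletedFinalStateUnknown) { System.out.println("DELETE " + obj.getMetadata().getName()); }
    });

    // informer 2: hooked to a Lister, with list() called from another thread every 30 seconds
    SharedIndexInformer<Secret> informer = client.secrets().inNamespace(ns).runnableInformer(0);
    Lister<Secret> lister = new Lister<>(informer.getIndexer(), ns);
    informer.start();

    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        () -> System.out.println("lister sees " + lister.list().size() + " secrets"),
        30, 30, TimeUnit.SECONDS);
  }
}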

I have been running this test application for a couple of days straight in 4 different configurations:

  • kubernetes-client v6.6.1, running off cluster (connecting through a local kube-config)
  • kubernetes-client v6.6.1, running on cluster (connecting through the service account)
  • kubernetes-client v6.6.2, running off cluster (connecting through a local kube-config)
  • kubernetes-client v6.6.2, running on cluster (connecting through the service account)

From this test application I have not seen a single occurrence of any of the informers dying. Everything kept working as expected, properly refreshing the websockets every 5 to 10 minutes without missing any close frame from the kube-API.
When launching the strimzi-operator, I see the informers hang within 30 minutes after startup in almost all cases, so there has to be something causing this, and I doubt it's the kube-API server ....

Of course, they are different applications, but from a watcher/informer point of view they should behave similarly.
The only thing that differs between the strimzi-operator and my test application is the threading model.
Strimzi uses multiple threads while my application uses a single thread as the executor for the kubernetes-client (and the underlying http client), so I'm starting to think the issue might be caused by some kind of race/deadlock situation.

So the next thing I'm going to try out is the multi-threaded approach ...

@shawkins
Contributor

shawkins commented Jun 1, 2023

so I'm starting to think the issue might be caused by some kind of race/deadlock situation

Can you take a thread dump of the strimzi application to see if any of the jdk worker threads are deadlocked or otherwise not able to continue working?

We have seen potential issues with the vertx client with respect to its worker threads, but have not seen the same behavior from jdk, as its workers share a work queue. It would seem to require exhausting all of the workers - the pool size should be based upon the available cores.
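
If attaching jstack / jcmd to the container is awkward, even a programmatic dump from inside the JVM would be enough to check this; something along these lines (standard java.lang.management, nothing fabric8-specific):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
  public static void dump() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // report any threads deadlocked on monitors or ownable synchronizers
    long[] deadlocked = mx.findDeadlockedThreads();
    if (deadlocked != null) {
      for (ThreadInfo info : mx.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
        System.err.println("DEADLOCKED: " + info);
      }
    }
    // full dump of all threads with their stack traces
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      System.out.print(info);
    }
  }
}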

@shawkins
Contributor

shawkins commented Jun 2, 2023

@duke-bartholomew @scholzj continuing from the previous comment - I don't see any usage of informOnCondition, waitOnCondition, or ExecListener outside of test code in the strimzi operator. Those are the places in fabric8 where user-provided methods are called directly by the IO threads, which could easily be identified as problematic. So it's best to get a thread dump for final confirmation of whether there is a thread utilization issue with the jdk client. My expectation is that all of the jdk client worker threads would need to be tied up, and thus all IO handling, not just close events, would be inhibited.

@scholzj
Contributor

scholzj commented Jun 2, 2023

I'm not aware of any use of informOnCondition, waitOnCondition, or ExecListener in the production code. We use just the indexer and event handler as far as I know.

@duke-bartholomew
Author

@shawkins Further testing and debugging made me conclude it's (probably) not a threading issue.
I also noticed some dying informers in my test application, although not as soon as with the strimzi-operator.
(My test application ran fine for several days, while the strimzi-operator has issues already 30 minutes after startup.)

I'm currently running 6.7.0 in test on the strimzi-operator to see how that behaves ....

@shawkins
Contributor

shawkins commented Jun 5, 2023

@shawkins Further testing and debugging made me conclude it's (probably) not a threading issue.

Were you able to get a thread dump to confirm?

I also noticed some dying Informers in my test application, although not as soon as with the strimzi-operator.

And to clarify, these informers are simply waiting on a watch closure, but have not completed the stopped future and are reporting isRunning=true.

I'm currently running 6.7.0 in test on the strimzi-operator to see how that behaves ....

If a similar situation arises in 6.7, you may see

logger.error("The last watch has yet to terminate as expected, will force start another watch. "

However the jdk client should be able to force a resolution sooner via

LOG.warn("Aborting WebSocket due to a write error or failure with sendClose");

@duke-bartholomew
Author

duke-bartholomew commented Jun 6, 2023

@shawkins I was not able to capture a thread dump on that container, but I came to the conclusion it was not a threading issue by inspecting /proc/x/task/y, the logs (to which I added the thread name/id), and the code from closer up.

Running my latest test with the latest strimzi-operator (with the 6.7.0 kubernetes-client) on jdk-http indeed shows a major improvement, if not a fix, for the issue I had been experiencing (the dying informers).
In the last 24-hour test run I did not notice a single lockup of the informers anymore. I saw the failure with sendClose log 10 times, but all of them recovered fine and kept working as expected.

{"stream":"stdout","logtag":"F","message":"2023-06-05 09:31:36.49960 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-05 09:41:01.39990 [CachedSingleThreadScheduler-566891420-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-05 10:39:35.41767 [CachedSingleThreadScheduler-566891420-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-05 14:24:20.78835 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-05 18:51:04.72397 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-05 20:23:31.80758 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-05 20:23:59.72860 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-06 00:43:37.82030 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-06 04:55:19.83255 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}
{"stream":"stdout","logtag":"F","message":"2023-06-06 06:35:46.83686 [CachedSingleThreadScheduler-846718105-pool-4-thread-1] WARN  io.fabric8.kubernetes.client.jdkhttp.JdkWebSocketImpl - Aborting WebSocket due to a write error or failure with sendClose"}

@shawkins
Contributor

shawkins commented Jun 6, 2023

In the last 24-hour test run I did not notice a single lockup of the informers anymore. I saw the failure with sendClose log 10 times, but all of them recovered fine and kept working as expected.

@manusa @andreaTP

Yes, each of those logs would previously have resulted in the websocket being effectively dead for the jdk client, with no notification downstream that there was an issue. It's still not clear just from this what the root cause is; I'll add another log that might help shed more light. I would expect to see the same type of behavior with the other clients - in particular jetty / vertx, as there would be similar logs. If not, then ideally we can narrow this down to some one-off problem with the jdk client. If we do see the behavior with the other clients, then we'll need to determine whether that points toward an api server issue (if it appears from our side that a close has been sent, but we're not receiving a close back), or whether this is due to the lack of robust closure handling that we're now trying to compensate for.

To recap - in 6.6 and before, a failed close attempt by the jdk client would result in an abort call 1 minute later if the remote side did not send a close or terminate the connection before then. Since the server side timeout is always 10 minutes and the client side timeout is between 5 and 10 minutes, that was quite possible. Since that jdk abort call did not cause a call to onClose or onError, the websocket was effectively dead, but downstream this was not known, so no new watch was started. For jetty, vertx, and okhttp the expectation would be that the server side timeout would ultimately cause us to be notified that the websocket was closed / terminated. OkHttp also already has close handling that is robust enough that we'd get a notification of onClose or onError within a minute of calling sendClose, regardless of the outcome.

To go further back: prior to 6.6 there were no explicit informer watch timeouts (client side or in the url), but we'd still get reports of scenarios where watches would stop getting events - unfortunately there was never a reproducer nor sufficient logs to determine why. Without those timeouts the default server side timeout would apply (typically 1-2 hours), but that of course is too long should there be an issue with receiving events. It's possible there are still api server issues related to event delivery / timeouts, as we previously saw the api server mishandle the timeout - kubernetes/kubernetes#102464. So one way to help mitigate the possibility that long-running watches were becoming non-functional was to add timeouts similar to those in the go client - however that means in 6.6 we're quite frequently relying on closure behavior that, especially for the jdk client, was not sufficient.
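
For context, the timeoutSeconds seen in the watch URLs in the logs above comes from the list options applied to the watch request; a standalone watch with an explicit timeout would look roughly like this (the resource type and the 600 second value are only illustrative):

import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Secret;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class TimedWatch {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();
    client.secrets().inNamespace("infra-kafka").watch(
        // mirrors the timeoutSeconds=600 visible in the watch URLs above
        new ListOptionsBuilder().withTimeoutSeconds(600L).build(),
        new Watcher<Secret>() {
          @Override
          public void eventReceived(Action action, Secret resource) {
            System.out.println(action + " " + resource.getMetadata().getName());
          }

          @Override
          public void onClose(WatcherException cause) {
            // the place to observe how / why the watch ended
            System.out.println("watch closed: " + cause);
          }
        });
  }
}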

Some other next steps:

  • How to use only http in watch? #4624 - to further align with the go client, provide a config option to use http watches instead of websockets
  • consider toning down the frequency with which we enforce the client side timeout, i.e. make it based upon usage as well.
