Watching for events stops working after ~10 min stale time #228

Closed · audunsol opened this issue Nov 8, 2021 · 3 comments

@audunsol

audunsol commented Nov 8, 2021

Hello,

We are using Reflector to copy secrets into dynamically created namespaces (for review apps in a GitOps setup) that follow a naming pattern, so we want it to trigger as soon as possible when a new namespace is created (and, of course, when the source secrets change).
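
(For context, a minimal sketch of how such a source secret is annotated, assuming the standard Reflector auto-reflection annotations from its README; the secret name and the "review-app-.*" namespace pattern are only examples:)

# annotate a source secret so Reflector mirrors it into matching namespaces
# (annotation keys as documented in the Reflector README; names are examples)
kubectl annotate secret app-secrets --namespace default \
  reflector.v1.k8s.emberstack.com/reflection-allowed="true" \
  reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces="review-app-.*" \
  reflector.v1.k8s.emberstack.com/reflection-auto-enabled="true" \
  reflector.v1.k8s.emberstack.com/reflection-auto-namespaces="review-app-.*"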

We upgraded from 5.4.17 to 6.0.21 on Nov 5 via Helm in our AKS cluster, by just running:

helm uninstall reflector
helm repo add emberstack https://emberstack.github.io/helm-charts
helm repo update
helm upgrade --install reflector emberstack/reflector

(This was to get around some OutOfMemory exceptions that we were seeing in the logs, randomly at first but more and more often in the previous version; version 6 aims to improve performance, as I understand from the release notes.)

Did some testing just after the upgrade, and it seemed to work more or less as before.

But after a while, the sync seems to trigger very late (somewhat randomly, sometimes not until 15 to 30 minutes after creating a new namespace). Log sample here:

2021-11-08 15:30:52.833 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Created review-app-mytest/app-secrets as a reflection of default/app-secrets
2021-11-08 15:30:52.845 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Created review-app-mytest/other-secret as a reflection of redis/other-secret
2021-11-08 15:41:09.712 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Session closed. Duration: 00:55:20.1649435. Faulted: False.
2021-11-08 15:41:09.712 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
2021-11-08 15:41:29.893 +00:00 [INF] (ES.Kubernetes.Reflector.Core.NamespaceWatcher) Session closed. Duration: 00:51:04.5208406. Faulted: False.
2021-11-08 15:41:29.893 +00:00 [INF] (ES.Kubernetes.Reflector.Core.NamespaceWatcher) Requesting V1Namespace resources
2021-11-08 16:06:14.365 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretWatcher) Session closed. Duration: 00:48:52.4417812. Faulted: False.
2021-11-08 16:06:14.365 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretWatcher) Requesting V1Secret resources
2021-11-08 16:06:14.488 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Auto-reflected redis/other-secret where permitted. Created 1 - Updated 0 - Deleted 0 - Validated 91.
2021-11-08 16:06:14.498 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Created review-app-mytest/other-secret as a reflection of redis/other-secret
2021-11-08 16:06:14.598 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Auto-reflected default/yet-another-secret where permitted. Created 1 - Updated 0 - Deleted 0 - Validated 90.
2021-11-08 16:06:14.606 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Created review-app-mytest/yet-another-secret as a reflection of default/yet-another-secret
2021-11-08 16:06:14.758 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Auto-reflected default/app-secrets where permitted. Created 1 - Updated 0 - Deleted 0 - Validated 96.
2021-11-08 16:06:14.765 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Created review-app-mytest/app-secrets as a reflection of default/app-secrets
...
(lots of other similar log entries that do not seem to be problematic in this context).
...

(Secret and namespace names altered for the sake of this ticket.)

For instance, in the above log, the namespace review-app-mytest was created at 2021-11-08T15:42:26Z, so in this case it took ~24 minutes before the secret reflection kicked in.

After several attempts, it seems that log events like [INF] (ES.Kubernetes.Reflector.Core.SecretWatcher) Session closed. Duration: 00:55:59.9519413. Faulted: False. reset the watcher and make the reflector work again. If nothing happens for 10 minutes, it has to wait until the next such event before anything happens.
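
(A quick way to check, independently of Reflector, whether kube-api delivers watch events promptly; a minimal sketch, where the namespace name is just an example:)

# terminal 1: open a plain watch on namespaces and leave it idle for a while
kubectl get namespaces --watch

# terminal 2: create a namespace and note how quickly the new row shows up in terminal 1
kubectl create namespace review-app-watch-test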

I have looked at the config options to see if I could/should tweak some setting, but could not find any. This is the closest thing to something relevant that I found: kubernetes-client/csharp#533 (but I might be looking in the wrong direction).

Kubernetes version:

Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"8ab00ff68a1763b5a506a7073cb8e67b12dfbcd7", GitTreeState:"clean", BuildDate:"2021-03-10T23:40:01Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

Please let me know if you want me to provide some particular info/logs/setting details in addition to the above.

PS: thanks for providing and maintaining this simple tool, by the way. We have used it for 9 months or so now, without any big hiccups (other than the ones mentioned above).

@winromulus
Contributor

Hi @audunsol

You have a very weird scenario. Let me explain why:
Reflector does not query/scan for changes. It subscribes to the kube-api web socket channel and gets notified by k8s when a change occurs. Kubernetes SHOULD issue the change event as soon as the resource change occurs. If it gets delayed by 24 minutes, it means there might be an issue with your control-plane (old "master") nodes, where they are either bogged down with events or not dispatching them when they need to.

The session close occurs on most cloud-provided k8s installations at around 40-45 minutes (plus or minus 10 minutes), so from the logs you sent it seems that the socket timeout works fine. There were versions of k8s which had issues with issuing events on time (basically subscribers that idled were still connected but weren't getting the events), which seems to be your scenario here, but what's weird is that they usually got disconnected after 20-30 minutes.
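
(For illustration, the watch sessions in those logs correspond to long-lived HTTP requests against kube-api; a minimal sketch of the same mechanism using kubectl proxy and curl, where timeoutSeconds=300 is only an example value:)

# expose kube-api locally on the default proxy port
kubectl proxy --port=8001 &

# stream namespace change events; the server closes the connection after
# timeoutSeconds, which is roughly what the "Session closed" log lines reflect
curl -N "http://127.0.0.1:8001/api/v1/namespaces?watch=true&timeoutSeconds=300"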

My suggestion is to upgrade k8s to a newer version, because something is definitely wrong there and I'm suspecting it's the older k8s issue (for the love of me I can't remember which versions had the issue).
I would also look at the metrics for kube-api in Prometheus (or whatever you have) and see if there are any spikes in traffic or high load.

@audunsol
Author

audunsol commented Nov 8, 2021

Thanks for your swift reply @winromulus!

I might get around to upgrading the cluster at some point, and I can report back to you then. But 1.19 is not that old, and the previous versions of the reflector have survived quite a few older Kubernetes versions than that as well, before we upgraded to v6.

I have sifted through metrics in Azure and in Grafana now, but I could not spot anything unusual.

Even though it felt very random at first, the issue is now 100% reproducible on my end, even this late in the evening when we have a lot less traffic on our cluster (this is our TEST cluster used by our devs/testers, and right now very few of them are online).

Here is the gist of it:

  • As long as SecretMirror has logged something recently, it creates my 6 secrets in a new namespace almost instantaneously (and logs about it).
  • If nothing has been reported by SecretMirror for at least 10 minutes, nothing happens until the SecretMirror "Session closed" event occurs (which then seems to trigger a full sync of everything); a rough reproduction sketch follows below.
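
(A rough reproduction sketch, with example names: let the cluster sit idle for more than 10 minutes, then create a matching namespace and time how long it takes for a reflected secret to appear:)

# assumes no SecretMirror activity for >10 minutes beforehand; names are examples
kubectl create namespace review-app-stale-test

# poll until the reflected secret shows up and note the elapsed time
time sh -c 'until kubectl get secret app-secrets --namespace review-app-stale-test >/dev/null 2>&1; do sleep 5; done'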

Anyway, thanks for confirming how it works. I understood from browsing the code that you were using a few watch: true calls, and I completely agree that it SHOULD issue and trigger on these change events, but the ticket I linked to suggests that the watches silently time out and stop working after some time, in particular on AKS (which we are using). Did you get a chance to look at it yet?

@winromulus
Contributor

@audunsol yup, that's the exact issue I've seen in the past; it's just the version of k8s that I can't confirm (I don't want to influence you, but I think it was 1.18 and 1.19).
The latest version of Reflector was tested with k8s 1.21.2 and newer (on AKS) and that problem was solved (it gets events even after a long idle period). Reflector is running on a few hundred clusters (reported) and this issue hasn't come up yet on newer versions of k8s.
I know the upgrade from 1.19 to 1.21 is quite a bit of effort (especially for storage and ingress), but it's well worth it.
