Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

porch: Fix watcher duplication bug #4067

Merged
merged 2 commits into from
Oct 12, 2023
Merged

Conversation

johnbelamaric
Copy link
Contributor

As mentioned in #4050, it probably wasn't the root cause of the porch-server issues. It did help a lot, but after about 8 days we saw CPU spike and the API server became unavailable.

I found something curious. I noticed way way more "stopping watcher" messages than there should have been watchers. Thousands. Poking around I found something interesting. The WatcherManager maintains an array of watchers, which it sends watch events to. When one is stopped, it sets the array entry to nil, see here:

for i, watcher := range r.watchers {

When a new watcher is created, it goes through the array and attempts to fill an empty spot, appending if it doesn't find one. That is done in this this little gem:

for i, watcher := range r.watchers {

Notice what's missing in that loop? The break statement. Without that, this new watcher gets copied to EVERY empty spot. I suspect this is the root cause. Basically, as watchers come and go, the spots get freed, but they all get refilled immediately with duplicates. The array thus doesn't grow just to the peak number of watchers, but in fact grows continuously with at least one new entry for every two new watchers. This means that eventually the list gets full of many duplicates, so any watch event is sent a zillion times to the watchers.

Hard to say if this is the cause of the CPU spike, but it is likely. And then, I wonder if the K8s APF bug means that the API server does not recover well after the spike. But that's just speculation.

In any case, this PR should fix it. I don't break but instead continue looping so I can count all filled slots in the array, just to log it. An array isn't really the right data structure here...

Signed-off-by: John Belamaric <jbelamaric@google.com>
Copy link
Contributor

@natasha41575 natasha41575 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find! Makes sense to me.

Signed-off-by: John Belamaric <jbelamaric@google.com>
@johnbelamaric johnbelamaric merged commit c1f963f into kptdev:main Oct 12, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants