Reduce watch updates #4050

Merged · 8 commits into kptdev:main · Sep 27, 2023
Conversation

@johnbelamaric

Fixes #4023

This replaces the previous PR. The fixes in here are a mitigation based on the current theory of the problem, but there are some things I am not sure about yet. I am submitting this PR so we at least have a mitigation in place, and we can continue working on the problem.

Current hypothesis:

  1. Something is leaking resources such that CPU continuously increases. This PR eliminates the watch over-notification, which seems to mitigate this, but it is not clearly a root-cause fix for the CPU problem.
  2. As CPU gets tight, the polling for repo cache updates takes longer and longer. At some point, it starts taking longer than one minute.
  3. The ticker goes off every minute, so once runs exceed one minute, attempts at concurrent polls start to accumulate. The mutex prevents actual concurrent polls, but the attempts back up behind it and start to increase memory use.
  4. Once enough waiting poll goroutines accumulate, they fill up memory and the pod gets OOMKilled (see the sketch after the next paragraph).

So, that all sounds great, except that for 3) I don't see a new goroutine per tick, but rather one per `newRepository` call. So maybe the issue is that somewhere we are re-creating repositories? Maybe the fix in #4049 is sufficient in that case. But we need a mitigation now; we can check that later.
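
To make 2)–4) concrete, here is a minimal sketch of the hypothesized pileup, assuming each poll attempt runs in its own goroutine. The names (`pollRepoCache`) are illustrative, not the actual porch code:

```go
package main

import (
	"sync"
	"time"
)

// pollRepoCache stands in for the repo cache poll. The mutex serializes
// polls, so a late arrival parks here until earlier polls finish.
func pollRepoCache(mu *sync.Mutex) {
	mu.Lock()
	defer mu.Unlock()
	time.Sleep(90 * time.Second) // a poll that now takes longer than one tick
}

func main() {
	var mu sync.Mutex
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		// One goroutine per tick: with 90s polls and 60s ticks, goroutines
		// park on the mutex faster than the mutex drains them, so the backlog
		// (and the memory it holds) grows without bound until OOMKill.
		go pollRepoCache(&mu)
	}
}
```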

Signed-off-by: John Belamaric <jbelamaric@google.com>
@johnbelamaric

FYI, here is a run without this PR (though also without #4049). Note that memory is stable but CPU climbs over the course of the day.

[Screenshot from 2023-09-27 13-42-29]

@johnbelamaric

OK, getting a little closer. It looks like I am probably fixing the wrong ticker here. This ticker won't launch new goroutines, but this one will:

https://github.com/kptdev/kpt/blob/main/porch/pkg/registry/porch/background.go#L116

because runOnce calls OpenRepository, which in turn calls newRepository, thus launching a new polling goroutine for that repository.
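
For illustration, here is a hypothetical sketch of that leak pattern; the function names mirror the ones above, but the bodies are made up. If every background run constructs a fresh repository instead of reusing a cached one, each run leaks one polling goroutine:

```go
package main

import (
	"fmt"
	"time"
)

type repository struct{ name string }

// newRepository launches a background poller for the repository, as the
// real newRepository is described to do; the body here is made up.
func newRepository(name string) *repository {
	r := &repository{name: name}
	go r.poll() // one new goroutine per construction
	return r
}

func (r *repository) poll() {
	// time.Tick is fine for a demo; the ticker is never stopped.
	for range time.Tick(time.Minute) {
		fmt.Println("polling", r.name)
	}
}

// openRepository should return a cached instance; if it constructs a new
// repository each time, every background run leaks another poller.
func openRepository(name string) *repository {
	return newRepository(name) // bug: no cache lookup
}

func main() {
	for i := 0; i < 3; i++ {
		openRepository("example-repo") // goroutine count grows by one per call
	}
	select {} // block so the leaked pollers keep running
}
```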

@johnbelamaric

I will merge this now anyway, since it does help the issue. But we have more work to do.

@johnbelamaric merged commit 9f1149a into kptdev:main on Sep 27, 2023 (11 checks passed)
@johnbelamaric

As an FYI: as expected, shortly after poll runs started exceeding 60 seconds, the pod was OOMKilled (exit code 137 is 128 + SIGKILL, i.e. the container was killed for exceeding its memory limit):

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 26 Sep 2023 13:32:01 -0700
      Finished:     Wed, 27 Sep 2023 16:09:07 -0700

[Screenshot from 2023-09-27 17-06-34]

I will now restart this pod with the latest build that includes this PR, and see how it does.

@johnbelamaric

With the latest build that includes this PR, CPU has been stable for the last 13 hours:

[Screenshot from 2023-09-28 06-49-34]

Linked issue: porch: porch-server crash with "Unable to derive new concurrency limits"