Configuration, limitations and impact of VPACheckpointing (not) working correctly #4498

Closed
voelzmo opened this issue Dec 6, 2021 · 8 comments
Labels: area/vertical-pod-autoscaler, kind/feature, lifecycle/rotten


voelzmo commented Dec 6, 2021

Which component are you using?
VPA v0.9.2

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.
Hey, I'm currently looking at a few issues involving the VPA in some of our installations where the given memory recommendations are somewhat unexpected. In these cases we also found indications that the VPA checkpointing mechanism, which usually runs every 60 seconds, doesn't work correctly. We regularly see messages that checkpoints could not be written successfully within the given time. Here are a few numbers/observations:

  • We didn't change any default configuration around intervals, QPS limits, or other things.
  • We're currently working with ~3800 VPA objects. There's a mismatch between the number of VPACheckpoints and the number of VPA objects I'm seeing in the system – there are ~3800 VPAs and ~4500 VPACheckpoints. Maybe this also indicates a different issue with garbage collection of the checkpoints?
  • The warnings about checkpoints not being written successfully appear every ~11 minutes. Maybe this is related to the 10 minute garbage collection interval?

Describe the solution you'd like.
I'd like to see more documentation of this area in the VPA. If they don't exist yet, I'd also like additional indicators for issues with writing the checkpoints.

Specifically, I'd like to get an idea of

  • Rough sizing guidelines (maximum number of VPA objects) and the tradeoffs involved in pushing beyond that size, e.g. increasing the checkpoint interval in exchange for different behavior when the recommender restarts
  • Indicators, other than skimming through logs, that there is an issue with writing checkpoints and how "big" that issue is (e.g. how many checkpoints can actually be written successfully before the step is cancelled? How many checkpoint objects have not been updated for the last y intervals?)
  • What impact might arise from the recommender not being able to write a checkpoint for x intervals. Given the decaying nature of the histogram used for memory, I guess a recommender restart will result in unhelpful recommendations after a certain number of missed checkpoints?
  • If someone has this: how does using Prometheus compare to using checkpoints? Does it scale better? Are there other considerations when comparing the two approaches?

Thanks!

voelzmo added the kind/feature label on Dec 6, 2021

voelzmo commented Dec 6, 2021

/label vertical-pod-autoscaler

@k8s-ci-robot

@voelzmo: The label(s) /label vertical-pod-autoscaler cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor

In response to this:

/label vertical-pod-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jbartosik

Effect of time between VPA checkpoints

When VPA restarts, it loads historical data for each VPA object from the corresponding checkpoint (here). Data from the time between when the checkpoint was saved and the VPA restart is effectively lost to the VPA.

VPA bases its recommendations on ~8 days of data, so losing a few minutes shouldn't affect recommendations much. Losing more is more problematic for memory recommendations (VPA looks at daily memory peaks to react quickly to growing memory usage).
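To give a feel for why the most recent data matters most: the histograms the recommender keeps are decaying ones, so old samples are down-weighted exponentially. The sketch below is purely illustrative; the 24 h half-life is an assumption on my side, not a value taken from this issue.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// sampleWeight returns the relative weight of a sample of the given age under
// exponential decay: weight = 2^(-age/halfLife).
func sampleWeight(age, halfLife time.Duration) float64 {
	return math.Pow(2, -age.Hours()/halfLife.Hours())
}

func main() {
	halfLife := 24 * time.Hour // assumed half-life, for illustration only
	for _, age := range []time.Duration{0, 24 * time.Hour, 3 * 24 * time.Hour, 8 * 24 * time.Hour} {
		fmt.Printf("sample age %v -> relative weight %.4f\n", age, sampleWeight(age, halfLife))
	}
}
```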

Estimating time between VPA checkpoints

By default the VPA recommender will limit itself to 5 QPS (flag).

So even if the only thing it did was saving checkpoints, then with 3800 VPA objects it would save a checkpoint for each object every 12-13 minutes (3800 / 5 = 760 s ≈ 12.7 min). I expect things to be much slower than that (IIRC, with 500 VPA objects VPA spent about half its time doing things other than writing checkpoints).

For a more realistic estimate you can look at the recommender's execution_latency_seconds metrics, broken down by step.

Then you can calculate (a small sketch of this arithmetic follows the list):

  • average VPA checkpoints per iteration (average duration of the MaintainCheckpoints step × 5, the QPS limit),
  • average rate of writing checkpoints (average VPA checkpoints per iteration / average duration of the total step),
  • average time between checkpoints for a VPA (3800, the number of VPA objects, / average rate of writing checkpoints).
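A back-of-the-envelope sketch of that arithmetic (Go, just for illustration; the two average durations are made-up placeholders to be replaced with values read from your execution_latency_seconds metrics, while 3800 and the 5 QPS limit come from the discussion above):

```go
package main

import "fmt"

func main() {
	const (
		qpsLimit                = 5.0    // default recommender QPS limit
		avgMaintainCheckpointsS = 30.0   // avg MaintainCheckpoints step duration, seconds (placeholder)
		avgTotalIterationS      = 760.0  // avg duration of one full loop iteration, seconds (placeholder)
		numVPAObjects           = 3800.0 // from the issue description
	)

	checkpointsPerIteration := avgMaintainCheckpointsS * qpsLimit
	checkpointsPerSecond := checkpointsPerIteration / avgTotalIterationS
	secondsBetweenCheckpoints := numVPAObjects / checkpointsPerSecond

	fmt.Printf("~%.0f checkpoints written per iteration\n", checkpointsPerIteration)
	fmt.Printf("~%.2f checkpoints written per second on average\n", checkpointsPerSecond)
	fmt.Printf("~%.0f s (~%.1f h) between checkpoint writes for any given VPA\n",
		secondsBetweenCheckpoints, secondsBetweenCheckpoints/3600)
}
```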


voelzmo commented Jan 19, 2022

Thanks @jbartosik, following your advice I was able to create some visualizations which help guide me through what's happening here!

I created heatmaps for the execution_latency_seconds metric and looked at the individual steps. The most interesting ones seem to be MaintainCheckpoints and UpdateVPAs. A few things stood out when looking at the graphs:

  • Although the main loop is configured to run every 60s, we only get new measurements every ~10 minutes. Is there something blocking in the main loop such that a new run cannot start until the previous one finished?
  • The UpdateVPAs step consistently takes more than 5 minutes (I suppose it takes ~10 minutes, which would explain the next loop not starting until then). This would make sense given that 3800/5=760 seconds.
  • The MaintainCheckpoints step seems to finish in anywhere from 500 ms to 50 s, but given that it is executed after the UpdateVPAs step and the cancellation timeout considers all previous steps as well, does that mean this step is always cancelled after the minimum number of VPACheckpoints has been written?

Screenshots for your amusement:
[Two screenshots: execution_latency_seconds heatmaps for the steps discussed above, taken 2022-01-19]

So I guess our best option is to adjust the QPS setting to a higher value, e.g. 70, which should allow finishing within the 60s, and see what breaks next?
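(Rough arithmetic behind that number: at 70 QPS, updating 3800 VPA objects takes about 3800 / 70 ≈ 54 seconds of API calls, which just fits into the 60 s interval as long as nothing else competes for the QPS budget.)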

@jbartosik

  • Although the main loop is configured to run every 60s, we only get new measurements every ~10 minutes. Is there something blocking in the main loop such that a new run cannot start until the previous one finished?

Yes. The recommender runs in a loop (here): first it waits for a round of processing to finish, then it waits for a tick before the next iteration of the loop starts. In each iteration of the loop it processes all VPA objects (here). So if it takes longer than --recommender-interval (default 1 minute) to process all VPAs, the loop will effectively run once per however long that processing takes.
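A minimal sketch of that loop pattern (not the actual recommender code; the interval and the simulated work are placeholders):

```go
package main

import (
	"log"
	"time"
)

// processEverything stands in for one full recommender iteration
// (load pods, update VPAs, maintain checkpoints, garbage collect, ...).
func processEverything() {
	time.Sleep(3 * time.Second) // pretend the work takes longer than one tick
}

func main() {
	interval := 1 * time.Second // stands in for --recommender-interval
	for range time.Tick(interval) {
		start := time.Now()
		processEverything()
		// When processing takes longer than the interval, the next tick is
		// already pending, so the loop effectively runs once per processing time.
		log.Printf("iteration took %v (configured interval: %v)", time.Since(start), interval)
	}
}
```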

  • The UpdateVPAs step consistently takes more than 5 minutes (I suppose it takes ~10 minutes, which would explain the next loop not starting until then). This would make sense given that 3800/5=760 seconds.

Looks right.

  • The MaintainCheckpoints step seems to finish in anywhere from 500 ms to 50 s, but given that it is executed after the UpdateVPAs step and the cancellation timeout considers all previous steps as well, does that mean this step is always cancelled after the minimum number of VPACheckpoints has been written?

Yes, that's what I'd expect to happen. StoreCheckpoints saves a minimum number of checkpoints and only then starts checking whether its deadline has passed (the deadline is --checkpoints-timeout after the main loop iteration starts). If the deadline has already passed by that point, it will just save the minimum number of checkpoints.
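A small illustration of that behavior, assuming the step simply writes a guaranteed minimum and then respects a shared deadline (names and numbers are mine, not the actual implementation):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// storeCheckpoints writes a guaranteed minimum number of checkpoints and only
// keeps going while the shared per-iteration deadline has not passed.
func storeCheckpoints(ctx context.Context, vpas []string, minCheckpoints int, write func(string) error) int {
	written := 0
	for _, vpa := range vpas {
		if written >= minCheckpoints {
			select {
			case <-ctx.Done():
				return written // budget already spent by earlier steps: stop at the minimum
			default:
			}
		}
		if write(vpa) == nil {
			written++
		}
	}
	return written
}

func main() {
	// The deadline stands in for --checkpoints-timeout measured from the start
	// of the main loop iteration; here it is nearly exhausted already.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	vpas := make([]string, 100) // 100 dummy VPA names
	slowWrite := func(string) error { time.Sleep(5 * time.Millisecond); return nil }
	fmt.Println("checkpoints written:", storeCheckpoints(ctx, vpas, 10, slowWrite))
}
```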

So I guess our best option is to adjust the settings for QPS to a higher value, e.g. 70, which should allow finishing within the 60s and see what breaks next?

Yes. I think you should try increasing the flag gradually (to avoid problems with the cluster API server).

If that doesn't work, the next thing you can try is using --vpa-object-namespace and running separate recommenders for different namespaces. It won't protect the cluster API server (it still needs to send all the same info and answer all the same queries), but if there is a bottleneck inside VPA it might help. It will be more difficult than just increasing QPS, though.
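For example (illustrative namespace names, not taken from this issue): one recommender deployment started with --vpa-object-namespace=team-a and a second one started with --vpa-object-namespace=team-b, so that each instance only updates and checkpoints its own subset of VPA objects.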

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 26, 2022

voelzmo commented May 30, 2022

We've adjusted the QPS to much higher values and are close to the ~60s cycle time. Thanks for the pointers!
