Configuration, limitations and impact of VPACheckpointing (not) working correctly #4498

Closed
voelzmo opened this issue Dec 6, 2021 · 8 comments
Labels: area/vertical-pod-autoscaler, kind/feature, lifecycle/rotten


voelzmo commented Dec 6, 2021

Which component are you using?
VPA v0.9.2

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.
Hey, I'm currently looking at a few issues involving the VPA in some of our installations where the given memory recommendations are somewhat unexpected. In these cases we also found indications that the VPA checkpointing mechanism, which usually runs every 60 seconds, doesn't work correctly. We regularly see messages that checkpoints could not be written successfully within the given time. Here are a few numbers/observations:

  • We didn't change any default configuration around intervals, QPS limits, or other things.
  • We're currently working with ~3800 VPA objects. There's a mismatch between the number of VPACheckpoints and the number of VPA objects I'm seeing in the system – there are ~3800 VPAs and ~4500 VPACheckpoints. Maybe this also indicates a different issue with garbage collection of the checkpoints?
  • The warnings about checkpoints not being written successfully appear every ~11 minutes. Maybe this is related to the 10 minute garbage collection interval?

Describe the solution you'd like.
I'd like to see more documentation of this area in the VPA. If they don't exist yet, I'd also like additional indicators for issues with writing the checkpoints.

Specifically, I'd like to get an idea of

  • Rough sizing guidelines (maximum number of VPA objects) and the tradeoffs involved in pushing beyond that size, e.g. increasing the checkpoint interval in exchange for different behavior when the recommender restarts
  • Indicators, other than skimming through logs, that there is an issue with writing checkpoints and how "big" that issue is (e.g. how many checkpoints can actually be written successfully before the step is cancelled? How many checkpoint objects have not been updated for the last y intervals?)
  • What impact might arise from the recommender not being able to write a checkpoint for x intervals. Given the decaying nature of the histogram used for memory, I guess a recommender restart will result in unhelpful recommendations after a certain number of missed checkpoints?
  • If someone has this: how does using Prometheus compare to using checkpoints? Does it scale better? Are there other considerations when comparing the two approaches?

Thanks!

voelzmo added the kind/feature label on Dec 6, 2021

voelzmo commented Dec 6, 2021

/label vertical-pod-autoscaler

@k8s-ci-robot

@voelzmo: The label(s) /label vertical-pod-autoscaler cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor

In response to this:

/label vertical-pod-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jbartosik

Effect of time between VPA checkpoints

When VPA restarts, it loads historical data for each VPA object from the corresponding checkpoint (here). Data from the time between when the checkpoint was saved and the VPA restart is effectively lost to the VPA.

VPA bases its recommendations on ~8 days of data, so losing a few minutes shouldn't affect recommendations much. Losing more is more problematic for memory recommendations (VPA looks at daily memory peaks to react quickly to growing memory usage).
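To give a feel for why the most recent data matters most: the histograms the recommender keeps are decaying ones, so old samples are down-weighted exponentially. The sketch below is purely illustrative; the 24 h half-life is an assumption on my side, not a value taken from this issue.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// sampleWeight returns the relative weight of a sample of the given age under
// exponential decay: weight = 2^(-age/halfLife).
func sampleWeight(age, halfLife time.Duration) float64 {
	return math.Pow(2, -age.Hours()/halfLife.Hours())
}

func main() {
	halfLife := 24 * time.Hour // assumed half-life, for illustration only
	for _, age := range []time.Duration{0, 24 * time.Hour, 3 * 24 * time.Hour, 8 * 24 * time.Hour} {
		fmt.Printf("sample age %v -> relative weight %.4f\n", age, sampleWeight(age, halfLife))
	}
}
```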

Estimating time between VPA checkpoints

By default the VPA recommender will limit itself to 5 QPS (flag).

So even if the only thing it did was saving checkpoints, then with 3800 VPA objects it would save a checkpoint for each object every 12-13 minutes (3800 / 5 = 760 s ≈ 12.7 min). I expect things to be much slower than that (IIRC, with 500 VPA objects VPA spent about half its time doing things other than writing checkpoints).

For a more realistic estimate you can look at the recommender's execution_latency_seconds metrics, broken down by step.

Then you can calculate (a small sketch of this arithmetic follows the list):

  • average VPA checkpoints per iteration (average duration of the MaintainCheckpoints step × 5, the QPS limit),
  • average rate of writing checkpoints (average VPA checkpoints per iteration / average duration of the total step),
  • average time between checkpoints for a VPA (3800, the number of VPA objects, / average rate of writing checkpoints).
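A back-of-the-envelope sketch of that arithmetic (Go, just for illustration; the two average durations are made-up placeholders to be replaced with values read from your execution_latency_seconds metrics, while 3800 and the 5 QPS limit come from the discussion above):

```go
package main

import "fmt"

func main() {
	const (
		qpsLimit                = 5.0    // default recommender QPS limit
		avgMaintainCheckpointsS = 30.0   // avg MaintainCheckpoints step duration, seconds (placeholder)
		avgTotalIterationS      = 760.0  // avg duration of one full loop iteration, seconds (placeholder)
		numVPAObjects           = 3800.0 // from the issue description
	)

	checkpointsPerIteration := avgMaintainCheckpointsS * qpsLimit
	checkpointsPerSecond := checkpointsPerIteration / avgTotalIterationS
	secondsBetweenCheckpoints := numVPAObjects / checkpointsPerSecond

	fmt.Printf("~%.0f checkpoints written per iteration\n", checkpointsPerIteration)
	fmt.Printf("~%.2f checkpoints written per second on average\n", checkpointsPerSecond)
	fmt.Printf("~%.0f s (~%.1f h) between checkpoint writes for any given VPA\n",
		secondsBetweenCheckpoints, secondsBetweenCheckpoints/3600)
}
```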


voelzmo commented Jan 19, 2022

Thanks @jbartosik, following your advice I was able to create some visualizations which help guide me through what's happening here!

I created heatmaps for the execution_latency_seconds metric and looked at the individual steps. The most interesting ones seem to be MaintainCheckpoints and UpdateVPAs. A few things stood out when looking at the graphs:

  • Although the main loop is configured to run every 60s, we only get new measurements every ~10 minutes. Is there something blocking in the main loop such that a new run cannot start until the previous one finished?
  • The UpdateVPAs step consistently takes more than 5 minutes (I suppose it takes ~10 minutes, which would explain the next loop not starting until then). This would make sense given that 3800/5=760 seconds.
  • The MaintainCheckpoints step seems to finish in anywhere from 500 ms to 50 s, but given that it is executed after the UpdateVPAs step and the cancellation timeout considers all previous steps as well, does that mean this step is always cancelled after the minimum number of VPACheckpoints has been written?

Screenshots for your amusement:
[Two screenshots: execution_latency_seconds heatmaps for the steps discussed above, taken 2022-01-19]

So I guess our best option is to adjust the QPS setting to a higher value, e.g. 70, which should allow finishing within the 60s, and see what breaks next?
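(Rough arithmetic behind that number: at 70 QPS, updating 3800 VPA objects takes about 3800 / 70 ≈ 54 seconds of API calls, which just fits into the 60 s interval as long as nothing else competes for the QPS budget.)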

@jbartosik

  • Although the main loop is configured to run every 60s, we only get new measurements every ~10 minutes. Is there something blocking in the main loop such that a new run cannot start until the previous one finished?

Yes. The recommender runs in a loop (here): first it waits for a round of processing to finish, then it waits for a tick before the next iteration of the loop starts. In each iteration of the loop it processes all VPA objects (here). So if it takes longer than --recommender-interval (default 1 minute) to process all VPAs, the loop will effectively run once per however long that processing takes.
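A minimal sketch of that loop pattern (not the actual recommender code; the interval and the simulated work are placeholders):

```go
package main

import (
	"log"
	"time"
)

// processEverything stands in for one full recommender iteration
// (load pods, update VPAs, maintain checkpoints, garbage collect, ...).
func processEverything() {
	time.Sleep(3 * time.Second) // pretend the work takes longer than one tick
}

func main() {
	interval := 1 * time.Second // stands in for --recommender-interval
	for range time.Tick(interval) {
		start := time.Now()
		processEverything()
		// When processing takes longer than the interval, the next tick is
		// already pending, so the loop effectively runs once per processing time.
		log.Printf("iteration took %v (configured interval: %v)", time.Since(start), interval)
	}
}
```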

  • The UpdateVPAs step consistently takes more than 5 minutes (I suppose it takes ~10 minutes, which would explain the next loop not starting until then). This would make sense given that 3800/5=760 seconds.

Looks right.

  • The MaintainCheckpoints step seems to finish in anywhere from 500 ms to 50 s, but given that it is executed after the UpdateVPAs step and the cancellation timeout considers all previous steps as well, does that mean this step is always cancelled after the minimum number of VPACheckpoints has been written?

Yes, that's what I'd expect to happen. StoreCheckpoints saves a minimum number of checkpoints and only then starts checking whether its deadline has passed (the deadline is --checkpoints-timeout after the main loop iteration starts). If the deadline has already passed by that point, it will just save the minimum number of checkpoints.
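A small illustration of that behavior, assuming the step simply writes a guaranteed minimum and then respects a shared deadline (names and numbers are mine, not the actual implementation):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// storeCheckpoints writes a guaranteed minimum number of checkpoints and only
// keeps going while the shared per-iteration deadline has not passed.
func storeCheckpoints(ctx context.Context, vpas []string, minCheckpoints int, write func(string) error) int {
	written := 0
	for _, vpa := range vpas {
		if written >= minCheckpoints {
			select {
			case <-ctx.Done():
				return written // budget already spent by earlier steps: stop at the minimum
			default:
			}
		}
		if write(vpa) == nil {
			written++
		}
	}
	return written
}

func main() {
	// The deadline stands in for --checkpoints-timeout measured from the start
	// of the main loop iteration; here it is nearly exhausted already.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	vpas := make([]string, 100) // 100 dummy VPA names
	slowWrite := func(string) error { time.Sleep(5 * time.Millisecond); return nil }
	fmt.Println("checkpoints written:", storeCheckpoints(ctx, vpas, 10, slowWrite))
}
```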

So I guess our best option is to adjust the settings for QPS to a higher value, e.g. 70, which should allow finishing within the 60s and see what breaks next?

Yes. I think you should try increasing the flag gradually (to avoid problems with the cluster API server).

If that doesn't work, the next thing you can try is using --vpa-object-namespace and running separate recommenders for different namespaces. It won't protect the cluster API server (it still needs to send all the same info and answer all the same queries), but if there is a bottleneck inside VPA it might help. It will be more difficult than just increasing QPS, though.
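For example (illustrative namespace names, not taken from this issue): one recommender deployment started with --vpa-object-namespace=team-a and a second one started with --vpa-object-namespace=team-b, so that each instance only updates and checkpoints its own subset of VPA objects.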

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 26, 2022

voelzmo commented May 30, 2022

We've adjusted the QPS to much higher values and are close to the ~60s cycle time. Thanks for the pointers!
