Configuration, limitations and impact of VPACheckpointing (not) working correctly #4498
/label vertical-pod-autoscaler
@voelzmo: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Effect of time between VPA checkpoints
When VPA restarts, it loads historical data for each VPA object from the corresponding checkpoint (here). Data from the time between when the checkpoint was saved and the VPA restart is effectively lost for VPA. VPA bases its recommendations on ~8 days of data, so losing a few minutes shouldn't affect recommendations much. Losing more is more problematic for memory recommendations (VPA looks at daily memory peaks to react quickly to growing memory usage).

Estimating time between VPA checkpoints
By default the VPA recommender limits itself to 5 qps (flag). So even if the only thing it did was save checkpoints, with 3800 VPA objects it would save a checkpoint for each object every 12-13 minutes. I expect things to be much slower than that (IIRC with 500 VPA objects, VPA spent about half its time doing things other than writing checkpoints). For a more realistic estimate you can:
Then you can calculate:
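To make the estimate above concrete, here's a minimal sketch (my own illustration, not VPA code) of the best-case interval between checkpoint writes for a single object, assuming the recommender spends its entire client-side API budget on checkpoint writes:

```go
// Back-of-the-envelope estimate: with a client-side limit of qps requests per
// second and one checkpoint write per VPA object, the best case is one write
// per object every numVPAs/qps seconds. Real gaps are longer, because the
// recommender also spends API budget on other calls.
package main

import "fmt"

func bestCaseCheckpointGap(numVPAs int, qps float64) float64 {
	return float64(numVPAs) / qps // seconds between writes for one object
}

func main() {
	gap := bestCaseCheckpointGap(3800, 5) // 3800 objects at the default 5 qps
	fmt.Printf("~%.0f s (~%.1f min) between checkpoint writes per object\n", gap, gap/60)
	// Prints roughly 760 s, i.e. about 12.7 min, matching the 12-13 minutes above.
}
```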
Thanks @jbartosik, following your advice I was able to create some visualizations which can help guide me through what's happening here! I created heatmaps for the
Screenshots for your amusement: So I guess our best option is to adjust the QPS setting to a higher value, e.g. 70, which should allow finishing within the 60s, and then see what breaks next?
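As a rough sanity check on the proposed value (same simplifying assumption as the sketch above: every API call is a checkpoint write), the minimum QPS needed to touch 3800 objects within one 60s cycle works out like this:

```go
package main

import "fmt"

func main() {
	numVPAs := 3800.0
	cycleSeconds := 60.0

	// Minimum client-side QPS needed to write one checkpoint per VPA object
	// within a single cycle, ignoring all other API calls the recommender
	// makes (so the real requirement is somewhat higher).
	minQPS := numVPAs / cycleSeconds
	fmt.Printf("need at least ~%.0f qps; at 70 qps the writes alone take ~%.0f s\n",
		minQPS, numVPAs/70)
}
```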
Yes. The recommender runs in a loop (here): first it waits for a round of processing to finish, then it waits for a tick before the next iteration of the loop starts. In each iteration of the loop it processes all VPA objects (here); see the sketch after this comment for the general shape. So if it takes longer than
Looks right.
Yes, that's what I'd expect to happen.
Yes. I think you should try increasing the flag gradually (to avoid problems with the cluster API server). If things don't work, the next thing you can try is using
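For readers following along, here is a much-simplified sketch of the loop shape described above (not the actual recommender source; the work done per round is a placeholder, and the 60-second interval is taken from the numbers in this issue):

```go
package main

import (
	"fmt"
	"time"
)

// processAllVPAs stands in for one round of work: load pods and metrics,
// update recommendations for every VPA object, and write a rate-limited
// batch of checkpoints.
func processAllVPAs() {
	time.Sleep(2 * time.Second) // placeholder for real work
}

func main() {
	// Analogue of the recommender's 1-minute cycle mentioned in the issue.
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for {
		start := time.Now()
		processAllVPAs()
		fmt.Printf("iteration took %s\n", time.Since(start).Round(time.Second))
		// Only after a round finishes does the loop wait for the next tick,
		// so rounds longer than the interval stretch the effective cycle time.
		<-ticker.C
	}
}
```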
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
We've adjusted the QPS to much higher values and are close to the ~60s cycle time. Thanks for the pointers!
Which component are you using?
VPA v0.9.2
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.
Hey, I'm currently looking at a few issues involving the VPA in some of our installations where the given memory recommendations are somewhat unexpected. In these cases, we also found indications that the VPACheckpointing mechanism, which usually runs every 60 seconds, doesn't seem to work correctly. We regularly see messages that checkpoints could not be written successfully within the given time. Here are a few numbers/observations
Describe the solution you'd like.
I'd like to see more documentation of that area in the VPA. If it doesn't exist yet, I'd also like to have additional indicators of issues with writing the checkpoints.
Specifically, I'd like to get an idea of
Thanks!
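On the "additional indicators" point: one low-tech way to watch checkpoint freshness from outside the recommender is to list the checkpoint objects and look at how old their last update is. A sketch with the dynamic client follows; the resource plural and the status.lastUpdateTime field are assumptions based on the VerticalPodAutoscalerCheckpoint CRD, so adjust to whatever your cluster actually serves.

```go
// Sketch: list VerticalPodAutoscalerCheckpoint objects in all namespaces and
// print how long ago each one was last written, as a rough staleness signal.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the checkpoint CRD served by the VPA installation.
	gvr := schema.GroupVersionResource{
		Group:    "autoscaling.k8s.io",
		Version:  "v1",
		Resource: "verticalpodautoscalercheckpoints",
	}
	list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, item := range list.Items {
		// Assumed status field recording the last successful checkpoint write.
		ts, found, _ := unstructured.NestedString(item.Object, "status", "lastUpdateTime")
		if !found {
			continue
		}
		t, err := time.Parse(time.RFC3339, ts)
		if err != nil {
			continue
		}
		fmt.Printf("%s/%s last checkpoint write %s ago\n",
			item.GetNamespace(), item.GetName(), time.Since(t).Round(time.Second))
	}
}
```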