
Stop filtering aggregatedContainerStates without samples #5326

Merged

Conversation

@voelzmo (Contributor) commented Nov 23, 2022

Which component this PR applies to?

vertical-pod-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

Stop filtering aggregateContainerStates with TotalSamplesCount==0. These aggregateContainerStates can contain OOM samples, which don't increase TotalSamplesCount but make it possible to give a better memory recommendation. This change also increases the timeout for the OOMKill e2e test, as it takes ~10 minutes for the recommendations to reach the expected range.
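For illustration, the kind of filter this removes looks conceptually like the sketch below (simplified, hypothetical names, not the exact recommender code):

package main

import "fmt"

// Illustrative stand-ins for the recommender types (hypothetical, simplified).
type aggregateState struct {
    TotalSamplesCount int
    HasOOMPeaks       bool
}

// Conceptually, the filter this PR removes: aggregate states with zero samples
// were skipped, even if they carried OOM memory peaks that could inform a
// better memory recommendation.
func statesWithRecommendations(states map[string]*aggregateState) []string {
    var result []string
    for name, s := range states {
        if s.TotalSamplesCount == 0 {
            continue // previously filtered out; this PR drops this condition
        }
        result = append(result, name)
    }
    return result
}

func main() {
    states := map[string]*aggregateState{
        "oom-killed-container": {TotalSamplesCount: 0, HasOOMPeaks: true},
    }
    // With the old filter in place this prints an empty list, so no
    // recommendation would be produced for the OOM-killed container.
    fmt.Println(statesWithRecommendations(states))
}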

Which issue(s) this PR fixes:

Fixes #4981

Special notes for your reviewer:

I'm not entirely sure whether there is a good reason why aggregateContainerStates without PodMetric samples were ignored, other than as an optimization. Does anyone have context on an additional reason why we shouldn't remove this condition?

I looked at other options to solve this, but they seem much more complicated.

Alternative A: Adding separate SampleCounts for CPU and memory

The VPA is built on the assumption that a sample contains both CPU and memory data, so there is only a single TotalSamplesCount in aggregateContainerState. We could push this further down and give each resource its own counter, but confidence is computed from TotalSamplesCount: this would mean that CPU and memory can have different confidence and therefore different upper and lower bounds. This seems far too intrusive.
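A rough sketch of what Alternative A would imply (field names are hypothetical, not actual VPA code):

package main

import "fmt"

// Hypothetical sketch of Alternative A: each resource gets its own sample
// counter, so OOM samples could bump only the memory counter. Because
// confidence is derived from the sample count, CPU and memory would then get
// different confidence values and, with them, different upper and lower bounds.
type perResourceSampleCounts struct {
    CPUSamplesCount    int // grows only with CPU usage samples
    MemorySamplesCount int // grows with memory samples, including OOM samples
}

func main() {
    c := perResourceSampleCounts{CPUSamplesCount: 0, MemorySamplesCount: 12}
    // Diverging counts imply per-resource confidence, which is the intrusive part.
    fmt.Printf("cpu samples=%d, memory samples=%d\n", c.CPUSamplesCount, c.MemorySamplesCount)
}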

Alternative B: Increase TotalSampleCount when an OOMSample is added

Do something similar to the quick hack I discussed in #4981, but only for OOMSamples, not all memory samples.

diff --git a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
index 3facbe37e..7accd072e 100644
--- a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
+++ b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
@@ -184,6 +184,7 @@ func (a *AggregateContainerState) AddSample(sample *ContainerUsageSample) {
        case ResourceCPU:
                a.addCPUSample(sample)
        case ResourceMemory:
+               a.TotalSamplesCount++
                a.AggregateMemoryPeaks.AddSample(BytesFromMemoryAmount(sample.Usage), 1.0, sample.MeasureStart)
        default:
                panic(fmt.Sprintf("AddSample doesn't support resource '%s'", sample.Resource))

Downside: when many OOM samples are recorded in a short period (e.g. when a Pod suddenly spikes in memory or has far too low initial resources defined), the confidence for CPU recommendations increases as well, which is the opposite of what is actually happening: in such a situation we don't get PodMetrics and therefore no CPU metrics either, so this also seems incorrect.
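For reference, confidence is derived from the single shared sample count, roughly along these lines (paraphrased and simplified, not verbatim VPA code):

package main

import (
    "fmt"
    "math"
)

// Paraphrased sketch of how confidence is derived: a single shared sample
// count drives the confidence used for both resources, so inflating it with
// OOM-only samples would also tighten the CPU bounds even though no CPU
// metrics were observed in that window.
func getConfidence(totalSamplesCount int, lifespanInDays float64) float64 {
    // Sample count normalized so that one sample per minute over a day equals 1.0.
    samplesAmount := float64(totalSamplesCount) / (60 * 24)
    return math.Min(lifespanInDays, samplesAmount)
}

func main() {
    // A burst of OOM samples counted into TotalSamplesCount would raise this
    // value, and with it the apparent confidence of the CPU recommendation.
    fmt.Println(getConfidence(3000, 2.0))
}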

Does this PR introduce a user-facing change?

Fixed a bug which could lead to the recommender not giving any recommendations for Pods in OOMKill CrashLoopBackoff

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 23, 2022
@jbartosik (Collaborator) left a comment

Sorry for the slow reply, I'm not sure why I missed this PR.

The general approach looks good to me; there are some things we need to do to make sure this works properly.

We use TotalSamplesCount to decide if isEmpty should return true, which we later use to decide if we should delete the container state (here directly and inside the call to isExpired).

I'm worried that this will result in OOMs only briefly having an effect and then being cleaned up.
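For reference, the check in question is essentially of this shape (paraphrased sketch, not verbatim VPA code):

package main

import "fmt"

// Paraphrased sketch: an aggregate that holds only OOM samples still reports
// itself as empty, because OOM samples do not increase TotalSamplesCount.
type aggregateContainerState struct {
    TotalSamplesCount int
}

func (a *aggregateContainerState) isEmpty() bool {
    return a.TotalSamplesCount == 0
}

func main() {
    oomOnly := &aggregateContainerState{TotalSamplesCount: 0}
    // Looks empty despite potentially carrying OOM memory peaks, so cleanup
    // paths that key off isEmpty (directly or via isExpired) could remove it.
    fmt.Println(oomOnly.isEmpty())
}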

@jbartosik (Collaborator)

Also, I wonder why you increased the timeout for the OOM test?

@voelzmo (Contributor, Author) commented Dec 19, 2022

Also, I wonder why you increased the timeout for the OOM test?

It's buried in the long PR description, sorry:

This change also increases the timeout for the OOMKill e2e test, as it takes ~10 minutes for the recommendations to reach the expected range.

When I watched the VPA status changes during the test, I could see that the recommendations were increasing; it just took longer than the currently configured timeout. Without the change, the recommendations didn't move at all.

@voelzmo (Contributor, Author) commented Dec 19, 2022

Sorry for the slow reply, I'm not sure why I missed this PR.

The general approach looks good to me; there are some things we need to do to make sure this works properly.

We use TotalSamplesCount to decide if isEmpty should return true, which we later use to decide if we should delete the container state (here directly and inside the call to isExpired).

I'm worried that this will result in OOMs only briefly having an effect and then being cleaned up.

That's a good point! For the aggregate state to be garbage collected, the Pod needs to be detected as 'non-contributive' in addition to TotalSamplesCount being 0, which shouldn't be the case as long as the Deployment still exists, right?

I wasn't sure how likely this would happen, but it definitely could! Do you have concrete suggestions on how to improve the solution so we don't run into this case?
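To make the gating I have in mind explicit, a rough sketch (hypothetical names; the actual cluster-state condition may differ):

package main

import "fmt"

// Rough sketch of the garbage-collection gating as I understand it: an
// OOM-only aggregate looks empty, but it should only be cleaned up once its
// workload is also considered non-contributive, e.g. after the Deployment is gone.
func shouldGarbageCollect(contributive bool, totalSamplesCount int) bool {
    empty := totalSamplesCount == 0
    return !contributive && empty
}

func main() {
    // While the Deployment still exists, the state stays around even with zero samples.
    fmt.Println(shouldGarbageCollect(true, 0))  // false
    fmt.Println(shouldGarbageCollect(false, 0)) // true
}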

@jbartosik (Collaborator)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 23, 2022
@voelzmo (Contributor, Author) commented Jan 25, 2023

ping @jbartosik: is there anything missing before merging this?

@jbartosik (Collaborator) left a comment

/approve

Thanks for the reminder.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik, voelzmo

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
