
Stop filtering aggregatedContainerStates without samples #5326

Merged

Conversation

@voelzmo (Contributor) commented Nov 23, 2022

Which component this PR applies to?

vertical-pod-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

Stop filtering aggregateContainerStates with TotalSamplesCount==0. These aggregateContainerStates can contain OOM samples, which don't increase TotalSamplesCount but make it possible to give a better memory recommendation. This change also increases the timeout for the OOMKill e2e test, as it takes ~10 minutes for the recommendations to reach the expected range.
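For illustration, the kind of filter this removes looks conceptually like the sketch below (simplified, hypothetical names, not the exact recommender code):

package main

import "fmt"

// Illustrative stand-ins for the recommender types (hypothetical, simplified).
type aggregateState struct {
    TotalSamplesCount int
    HasOOMPeaks       bool
}

// Conceptually, the filter this PR removes: aggregate states with zero samples
// were skipped, even if they carried OOM memory peaks that could inform a
// better memory recommendation.
func statesWithRecommendations(states map[string]*aggregateState) []string {
    var result []string
    for name, s := range states {
        if s.TotalSamplesCount == 0 {
            continue // previously filtered out; this PR drops this condition
        }
        result = append(result, name)
    }
    return result
}

func main() {
    states := map[string]*aggregateState{
        "oom-killed-container": {TotalSamplesCount: 0, HasOOMPeaks: true},
    }
    // With the old filter in place this prints an empty list, so no
    // recommendation would be produced for the OOM-killed container.
    fmt.Println(statesWithRecommendations(states))
}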

Which issue(s) this PR fixes:

Fixes #4981

Special notes for your reviewer:

I'm not entirely sure whether there is a good reason why aggregateContainerStates without PodMetric samples were ignored, other than as an optimization. Does anyone have context on an additional reason why we shouldn't remove this condition?

I looked at other options to solve this, but they seem much more complicated.

Alternative A: Adding separate SampleCounts for CPU and memory

The VPA is built on the assumption that a sample contains both CPU and memory data, so there is only a single TotalSamplesCount in aggregateContainerState. We could push this further down and give each resource its own counter, but confidence is computed from TotalSamplesCount: this would mean that CPU and memory can have different confidence and therefore different upper and lower bounds. This seems far too intrusive.
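A rough sketch of what Alternative A would imply (field names are hypothetical, not actual VPA code):

package main

import "fmt"

// Hypothetical sketch of Alternative A: each resource gets its own sample
// counter, so OOM samples could bump only the memory counter. Because
// confidence is derived from the sample count, CPU and memory would then get
// different confidence values and, with them, different upper and lower bounds.
type perResourceSampleCounts struct {
    CPUSamplesCount    int // grows only with CPU usage samples
    MemorySamplesCount int // grows with memory samples, including OOM samples
}

func main() {
    c := perResourceSampleCounts{CPUSamplesCount: 0, MemorySamplesCount: 12}
    // Diverging counts imply per-resource confidence, which is the intrusive part.
    fmt.Printf("cpu samples=%d, memory samples=%d\n", c.CPUSamplesCount, c.MemorySamplesCount)
}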

Alternative B: Increase TotalSampleCount when an OOMSample is added

Do something similar to the quick hack I discussed in #4981, but only for OOMSamples, not all memory samples.

diff --git a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
index 3facbe37e..7accd072e 100644
--- a/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
+++ b/vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go
@@ -184,6 +184,7 @@ func (a *AggregateContainerState) AddSample(sample *ContainerUsageSample) {
        case ResourceCPU:
                a.addCPUSample(sample)
        case ResourceMemory:
+               a.TotalSamplesCount++
                a.AggregateMemoryPeaks.AddSample(BytesFromMemoryAmount(sample.Usage), 1.0, sample.MeasureStart)
        default:
                panic(fmt.Sprintf("AddSample doesn't support resource '%s'", sample.Resource))

Downside: when many OOM samples are recorded in a short period (e.g. when a Pod suddenly spikes in memory or has far too low initial resources defined), the confidence for CPU recommendations increases as well, which is the opposite of what is actually happening: in such a situation we don't get PodMetrics and therefore no CPU metrics either, so this also seems incorrect.
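For reference, confidence is derived from the single shared sample count, roughly along these lines (paraphrased and simplified, not verbatim VPA code):

package main

import (
    "fmt"
    "math"
)

// Paraphrased sketch of how confidence is derived: a single shared sample
// count drives the confidence used for both resources, so inflating it with
// OOM-only samples would also tighten the CPU bounds even though no CPU
// metrics were observed in that window.
func getConfidence(totalSamplesCount int, lifespanInDays float64) float64 {
    // Sample count normalized so that one sample per minute over a day equals 1.0.
    samplesAmount := float64(totalSamplesCount) / (60 * 24)
    return math.Min(lifespanInDays, samplesAmount)
}

func main() {
    // A burst of OOM samples counted into TotalSamplesCount would raise this
    // value, and with it the apparent confidence of the CPU recommendation.
    fmt.Println(getConfidence(3000, 2.0))
}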

Does this PR introduce a user-facing change?

Fixed a bug which could lead to the recommender not giving any recommendations for Pods in OOMKill CrashLoopBackoff

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 23, 2022
@jbartosik (Collaborator) left a comment

Sorry for the slow reply, I'm not sure why I missed this PR.

The general approach looks good to me; there are some things we need to do to make sure this works properly.

We use TotalSamplesCount to decide if isEmpty should return true, which we later use to decide if we should delete the container state (here directly and inside the call to isExpired).

I'm worried that this will result in OOMs only briefly having an effect and then being cleaned up.
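For reference, the check in question is essentially of this shape (paraphrased sketch, not verbatim VPA code):

package main

import "fmt"

// Paraphrased sketch: an aggregate that holds only OOM samples still reports
// itself as empty, because OOM samples do not increase TotalSamplesCount.
type aggregateContainerState struct {
    TotalSamplesCount int
}

func (a *aggregateContainerState) isEmpty() bool {
    return a.TotalSamplesCount == 0
}

func main() {
    oomOnly := &aggregateContainerState{TotalSamplesCount: 0}
    // Looks empty despite potentially carrying OOM memory peaks, so cleanup
    // paths that key off isEmpty (directly or via isExpired) could remove it.
    fmt.Println(oomOnly.isEmpty())
}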

@jbartosik (Collaborator)

Also, I wonder why you increased the timeout for the OOM test?

@voelzmo (Contributor, Author) commented Dec 19, 2022

Also, I wonder why you increased the timeout for the OOM test?

It's buried in the long PR description, sorry:

This change also increases the timeout for the OOMKill e2e test, as it takes ~10 minutes for the recommendations to reach the expected range.

When I watched the VPA status changes during the test, I could see that the recommendations were increasing; it just took longer than the currently configured timeout. Without the change, the recommendations didn't move at all.

@voelzmo (Contributor, Author) commented Dec 19, 2022

Sorry for the slow reply, I'm not sure why I missed this PR.

The general approach looks good to me; there are some things we need to do to make sure this works properly.

We use TotalSamplesCount to decide if isEmpty should return true, which we later use to decide if we should delete the container state (here directly and inside the call to isExpired).

I'm worried that this will result in OOMs only briefly having an effect and then being cleaned up.

That's a good point! For the aggregate state to be garbage collected, the Pod needs to be detected as 'non-contributive' in addition to TotalSamplesCount being 0, which shouldn't be the case as long as the Deployment still exists, right?

I wasn't sure how likely this would happen, but it definitely could! Do you have concrete suggestions on how to improve the solution so we don't run into this case?
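To make the gating I have in mind explicit, a rough sketch (hypothetical names; the actual cluster-state condition may differ):

package main

import "fmt"

// Rough sketch of the garbage-collection gating as I understand it: an
// OOM-only aggregate looks empty, but it should only be cleaned up once its
// workload is also considered non-contributive, e.g. after the Deployment is gone.
func shouldGarbageCollect(contributive bool, totalSamplesCount int) bool {
    empty := totalSamplesCount == 0
    return !contributive && empty
}

func main() {
    // While the Deployment still exists, the state stays around even with zero samples.
    fmt.Println(shouldGarbageCollect(true, 0))  // false
    fmt.Println(shouldGarbageCollect(false, 0)) // true
}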

@jbartosik (Collaborator)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 23, 2022
@voelzmo (Contributor, Author) commented Jan 25, 2023

ping @jbartosik: is there anything missing before merging this?

@jbartosik (Collaborator) left a comment

/approve

Thanks for the reminder.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik, voelzmo

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
