
OCPBUGS-12210: Prevent partially filled HPA behaviors from crashing kube-controller-manager #1876

Merged

Conversation


@jkyros jkyros commented Jan 31, 2024

The short version here is that:

  • If you supply partial HPA behaviors (e.g. ScaleUp but not ScaleDown, etc.) in kube < 1.27, it will send the kube-controller-manager into CrashLoopBackOff
  • This is fixed in kube 1.27+ by defaulting to autoscaling v2 ("Autoscaling: advance v2 as the preferred API version over v1", kubernetes/kubernetes#114358), but we can't backport that type of change
  • So, since we're storing as v1 but consuming as v2 in the controller, we need to make sure that the behaviors aren't nil in the v2 object when someone creates or edits a v1 object to have partially filled behaviors (a sketch of how that happens follows the lists below)

This PR:

  • Defaults any nil behaviors when converting from v1 -> internal
  • Makes the controller fill in missing behavior with defaults
  • Removes the "defaulter cheating" in the unit test that was masking the crash
  • Adds a test case to verify that it works
  • Is targeted straight to 4.13 because it's useless after that (if it were preexisting carry, it would have been dropped in 4.14)

Upstream details:

  • I did inquire upstream but we're already outside the n-3 supported versions and the fix is useless post 1.27, so the "juice wasn't worth the squeeze". We still have at least one customer that needs this fixed so we'd just need to get this into 4.13 and 4.12 since they're still supported.
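To make the v1 -> v2 mechanism above concrete, here is a rough, self-contained sketch (not the actual conversion code, which targets the internal autoscaling types) of how a partially filled behavior annotation decodes into a v2 behavior whose ScaleUp stays nil, which is the shape the controller later trips over. It uses the public k8s.io/api/autoscaling/v2 types and the same annotation payload as the crasher manifest below:

package main

import (
	"encoding/json"
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

func main() {
	// Payload of the autoscaling.alpha.kubernetes.io/behavior annotation from
	// the crasher manifest below: only ScaleDown is specified.
	raw := `{"ScaleDown":{"StabilizationWindowSeconds":600,"SelectPolicy":"Max","Policies":[{"type":"Pods","value":1,"periodSeconds":1}]}}`

	var behavior autoscalingv2.HorizontalPodAutoscalerBehavior
	if err := json.Unmarshal([]byte(raw), &behavior); err != nil {
		panic(err)
	}

	// ScaleDown came through, but ScaleUp stays nil -- exactly the partially
	// filled shape described in the bullets above.
	fmt.Printf("ScaleDown: %+v\n", behavior.ScaleDown)
	fmt.Printf("ScaleUp:   %v\n", behavior.ScaleUp) // <nil>
}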

Here is a straightforward crasher (you might have to wait a little bit until the HPA touches it, but you should be able to see the kube-controller-manager pods go into CrashLoopBackOff):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: crasher
  namespace: test 
  labels:
    app: test
  annotations:
    autoscaling.alpha.kubernetes.io/behavior: '{"ScaleDown":{"StabilizationWindowSeconds":600,"SelectPolicy":"Max","Policies":[{"type":"Pods","value":1,"periodSeconds":1}]}}'
spec:
  scaleTargetRef:
    kind: Deployment
    name: test
    apiVersion: apps/v1
  minReplicas: 8
  maxReplicas: 25
  targetCPUUtilizationPercentage: 120
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: test
  namespace: test
  labels:
    app: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
        - resources:
            limits:
              cpu: 500m
              memory: 128Mi
            requests:
              cpu: 25m
              memory: 128Mi
          readinessProbe:
            httpGet:
              path: /
              port: 8080
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: nginx
          livenessProbe:
            httpGet:
              path: /
              port: 8080
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          ports:
            - containerPort: 8080
              protocol: TCP
          imagePullPolicy: Always
          image: 'nginxinc/nginx-unprivileged:latest'
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

Fixes: OCPBUGS-12210

@openshift-ci-robot openshift-ci-robot added backports/validated-commits Indicates that all commits come to merged upstream PRs. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 31, 2024
@openshift-ci-robot

@jkyros: This pull request references Jira Issue OCPBUGS-12210, which is invalid:

  • expected the bug to target the "4.13.z" version, but no target version was set
  • expected Jira Issue OCPBUGS-12210 to depend on a bug targeting a version in 4.14.0, 4.14.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot

@jkyros: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@jkyros
Author

jkyros commented Jan 31, 2024

/jira refresh

@openshift-ci-robot

@jkyros: This pull request references Jira Issue OCPBUGS-12210, which is invalid:

  • expected Jira Issue OCPBUGS-12210 to depend on a bug targeting a version in 4.14.0, 4.14.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@joelsmith

/lgtm

@aravindhp

/lgtm

I am fine with this approach

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2024
@deads2k

deads2k commented Jan 31, 2024

the referenced upstream doesn't exist: 11358

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2024
@openshift-ci-robot

@jkyros: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@jkyros
Author

jkyros commented Jan 31, 2024

the referenced upstream doesn't exist: 11358

Fixed, I'm a clown, I left out the 4, should have been 114358: kubernetes#114358

@@ -449,6 +450,20 @@ func Convert_v1_HorizontalPodAutoscaler_To_autoscaling_HorizontalPodAutoscaler(i
// drop round-tripping annotations after converting to internal
out.Annotations, _ = autoscaling.DropRoundTripHorizontalPodAutoscalerAnnotations(out.Annotations)

// Until kube 1.27 we're still storing autoscaling as v1, but the HPA controller is consuming it as v2. Behaviors are an annotation in v1 and can be partially

what does this do to the annotations in question in a flow like

  1. write object to API/etcd, scaleup gets set
  2. read object as v1, do annotations make sense?

Author


They get converted back out properly, but I assume that happens as part of the round-trip to the "internal" version when the controller updates it as v2 (I didn't think it round-tripped v1 -> internal -> v1 on the way in on initial creation, but maybe I'm wrong and it does).

e.g. if I omit ScaleUp:

cat << EOF | oc create -f - 
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: crasher
  namespace: test 
  labels:
    app: test
  annotations:
    autoscaling.alpha.kubernetes.io/behavior: '{"ScaleDown":{"StabilizationWindowSeconds":600,"SelectPolicy":"Max","Policies":[{"type":"Pods","value":1,"periodSeconds":1}]}}'
spec:
  scaleTargetRef:
    kind: Deployment
    name: test
    apiVersion: apps/v1
  minReplicas: 8
  maxReplicas: 25
  targetCPUUtilizationPercentage: 120
EOF

When I ask for it back, ScaleUp in the autoscaling.alpha.kubernetes.io/behavior annotation has been filled in with the default values.

[jkyros@jkyros-thinkpadp1gen5 ocpbugs-12210]$ oc get hpa.v1.autoscaling -n test crasher -o yaml 
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/behavior: '{"ScaleUp":{"StabilizationWindowSeconds":0,"SelectPolicy":"Max","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":15},{"Type":"Percent","Value":100,"PeriodSeconds":15}]},"ScaleDown":{"StabilizationWindowSeconds":600,"SelectPolicy":"Max","Policies":[{"Type":"Pods","Value":1,"PeriodSeconds":1}]}}'
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2024-01-31T20:54:22Z","reason":"ScaleDownStabilized","message":"recent
      recommendations were higher than current one, applying the highest recent recommendation"},{"type":"ScalingActive","status":"True","lastTransitionTime":"2024-01-31T20:54:22Z","reason":"ValidMetricFound","message":"the
      HPA was able to successfully calculate a replica count from cpu resource utilization
      (percentage of request)"},{"type":"ScalingLimited","status":"False","lastTransitionTime":"2024-01-31T20:54:22Z","reason":"DesiredWithinRange","message":"the
      desired count is within the acceptable range"}]'
    autoscaling.alpha.kubernetes.io/current-metrics: '[{"type":"Resource","resource":{"name":"cpu","currentAverageUtilization":0,"currentAverageValue":"0"}}]'
    converted.jkyros.io: autoscaling -> v1
  creationTimestamp: "2024-01-31T20:54:20Z"
  labels:
    app: test
  name: crasher
  namespace: test
  resourceVersion: "92141"
  uid: 3b0646ec-ed98-4b2f-a989-8bc201d7a56f
spec:
  maxReplicas: 25
  minReplicas: 8
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test
  targetCPUUtilizationPercentage: 120
status:
  currentCPUUtilizationPercentage: 0
  currentReplicas: 8
  desiredReplicas: 8

TL;DR yep the converters work, the annotations make sense

@deads2k

deads2k commented Jan 31, 2024

where does the crash happen? is it in conversion code? if the failure happens in the kube-controller-manager, I'd rather see a kube-controller-manager patch because the blast radius is much smaller and doesn't have any persistent effect in the cluster.

@openshift-ci-robot

@jkyros: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

The description in the autoscaling API for the HorizontalPodAutoscaler
suggests that HorizontalPodAutoscalerSpec's Behavior field (and its
ScaleUp and ScaleDown members) is optional, and that if not supplied,
defaults will be used.

That's true if the entire Behavior is nil because we go
through "normalizeDesiredReplicas" instead of
"normalizeDesiredReplicasWithBehaviors", but if the structure is only
partially supplied, leaving some members nil, it results in nil
dereferences when we end up going through
normalizeDesiredReplicasWithBehaviors.

So we end up in a situation where:
- If Behavior is entirely absent (nil) we use defaults (good)
- If Behavior is partially specified we panic (very bad)
- If stabilizationWindowSeconds is nil in either ScaleUp or ScaleDown,
  we panic (also very bad)

In general, this only happens with pre-v2 HPA objects because v2 does
properly fill in the default values.

This commit prevents the panic by using the defaulters to ensure that
unpopulated fields in the behavior objects get filled in with their
defaults before they are used, so they can safely be dereferenced by
later code that performs calculations on them.
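As a rough illustration of the failure mode this commit message describes (not the controller's actual code), here is a minimal sketch of a behavior-aware calculation that assumes the pointer fields are populated and therefore panics on a partially specified behavior:

package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

// stabilizationWindow mimics the assumption made on the behavior-aware path:
// it dereferences pointer fields without nil checks.
func stabilizationWindow(rules *autoscalingv2.HPAScalingRules) int32 {
	// nil pointer dereference if this scaling direction (or its
	// stabilization window) was omitted from the v1 behavior annotation
	return *rules.StabilizationWindowSeconds
}

func main() {
	window := int32(600)
	behavior := &autoscalingv2.HorizontalPodAutoscalerBehavior{
		// Only ScaleDown is specified, as in the crasher manifest above.
		ScaleDown: &autoscalingv2.HPAScalingRules{StabilizationWindowSeconds: &window},
	}

	fmt.Println(stabilizationWindow(behavior.ScaleDown)) // 600
	fmt.Println(stabilizationWindow(behavior.ScaleUp))   // panics: nil pointer dereference
}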
@openshift-ci-robot

@jkyros: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@jkyros
Author

jkyros commented Feb 1, 2024

Eads and I talked over Slack (thanks, David).

In summary, since the crash is in the controller, not the converter, we'd prefer to fix it in the controller to reduce the blast radius since this fix is only for two old versions.

I've retooled this PR such that:

  • The fix is in the controller now
  • It runs the HPA object through the behavior defaulters before we do any calculations on it (so the defaults are filled by the time it gets there; see the sketch below this list)
  • I also added a test case for partial behaviors
  • I took out a "cheat" in the unit tests that was preventing us from finding the crash before now

This does come with the side effect that if the controller modifies the object, it gets written back out with the missing values filled in, but that seems better to me than refusing to scale until someone tediously fills them in?
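For reference, a minimal sketch of the defaulting idea, assuming a hypothetical defaultBehavior helper (the PR itself reuses the existing autoscaling v2 behavior defaulters rather than hand-rolling this); the ScaleUp values below mirror what the defaulters wrote back into the annotation in the oc get output earlier in the thread:

package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

// defaultBehavior is a hypothetical stand-in for running the HPA through the
// v2 behavior defaulters before any calculations: nil pieces get filled in so
// later code can dereference them safely.
func defaultBehavior(b *autoscalingv2.HorizontalPodAutoscalerBehavior) *autoscalingv2.HorizontalPodAutoscalerBehavior {
	if b == nil {
		b = &autoscalingv2.HorizontalPodAutoscalerBehavior{}
	}
	if b.ScaleUp == nil {
		window := int32(0)
		selectPolicy := autoscalingv2.MaxChangePolicySelect
		// Default scale-up rules, matching the values shown in the earlier
		// oc get output.
		b.ScaleUp = &autoscalingv2.HPAScalingRules{
			StabilizationWindowSeconds: &window,
			SelectPolicy:               &selectPolicy,
			Policies: []autoscalingv2.HPAScalingPolicy{
				{Type: autoscalingv2.PodsScalingPolicy, Value: 4, PeriodSeconds: 15},
				{Type: autoscalingv2.PercentScalingPolicy, Value: 100, PeriodSeconds: 15},
			},
		}
	}
	// ScaleDown (and nil fields inside an already-present ScaleUp/ScaleDown)
	// would get the analogous treatment from the real defaulters.
	return b
}

func main() {
	window := int32(600)
	partial := &autoscalingv2.HorizontalPodAutoscalerBehavior{
		ScaleDown: &autoscalingv2.HPAScalingRules{StabilizationWindowSeconds: &window},
	}
	fmt.Printf("%+v\n", defaultBehavior(partial).ScaleUp) // no longer nil
}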

@openshift-ci-robot

@jkyros: This pull request references Jira Issue OCPBUGS-12210, which is invalid:

  • expected Jira Issue OCPBUGS-12210 to depend on a bug targeting a version in 4.14.0, 4.14.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@jkyros
Author

jkyros commented Feb 1, 2024

/test e2e-aws-ovn-serial

@deads2k

deads2k commented Feb 1, 2024

This looks like a great low-risk solution for 4.13 & 4.12. Thanks!

Agreed, thank you

/approve


openshift-ci bot commented Feb 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aravindhp, deads2k, jkyros, joelsmith


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2024
@joelsmith

/label backport-risk-assessed


openshift-ci bot commented Feb 1, 2024

@joelsmith: Can not set label backport-risk-assessed: Must be member in one of these teams: [openshift-staff-engineers]


@aravindhp

/label backport-risk-assessed

@aravindhp aravindhp added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 1, 2024
@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Feb 1, 2024
@jkyros
Author

jkyros commented Feb 5, 2024

@weinliu I know this isn't an actual "cherry pick" since we're stuffing it straight into 4.13, and it's not in a part of the repo that we "own", but would you be able to take a look here for the cherry-pick-approved label? (Or at least sign off from the QE side so we can have someone add the label?). Thank you much!

@weinliu

weinliu commented Feb 6, 2024

/label cherry-pick-approved


openshift-ci bot commented Feb 6, 2024

@weinliu: Can not set label cherry-pick-approved: Must be member in one of these teams: [openshift-staff-engineers]


@weinliu

weinliu commented Feb 6, 2024

@sunilcio could you help?

@sunilcio

sunilcio commented Feb 6, 2024

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Feb 6, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8f85140 into openshift:release-4.13 Feb 6, 2024
18 checks passed
@openshift-ci-robot

@jkyros: Jira Issue OCPBUGS-12210: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-12210 has been moved to the MODIFIED state.


@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-pod-container-v4.13.0-202402060538.p0.g8f85140.assembly.stream for distgit openshift-enterprise-pod.
All builds following this will include this PR.

@jkyros
Author

jkyros commented Feb 6, 2024

Thanks everyone! I'm only going to take this back one more to 4.12 (since it's EUS):
/cherry-pick release-4.12

@openshift-cherrypick-robot

@jkyros: #1876 failed to apply on top of branch "release-4.12":

Applying: UPSTREAM: 114358: Default missing fields in HPA behaviors
Using index info to reconstruct a base tree...
M	pkg/controller/podautoscaler/horizontal.go
M	pkg/controller/podautoscaler/horizontal_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/podautoscaler/horizontal_test.go
CONFLICT (content): Merge conflict in pkg/controller/podautoscaler/horizontal_test.go
Auto-merging pkg/controller/podautoscaler/horizontal.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 UPSTREAM: 114358: Default missing fields in HPA behaviors
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@openshift-merge-robot

Fix included in accepted release 4.13.0-0.nightly-2024-02-06-120750
