Resolve deployment upgrade with RWO PVCs #280

montaguethomas · 2024-01-30T00:24:54Z

When OLM tries to upgrade the operator, it will fail to rollout the new pod for the deployment due to RWO volume mounts.

The #141 PR resolved this by setting a preference to schedule the new pod on the name node as the existing pod. However this change was reverted in #145 for some unknown/undocumented reason.

maskarb · 2024-01-30T02:02:06Z

@montaguethomas Can you explain in more detail what issue this PR resolves? We have had no reports that the current versions of this operator are having any issues deploying during an upgrade.

montaguethomas · 2024-01-30T16:36:13Z

RH Case 03716933

The Cost Management Metrics Operator fails to upgrade due to PVC mounting issues.

multi-attach error for volume costmanagement-metrics-operator-reports already used by pod

We expect either the old pod to be deleted so a new pod can be created, or for the new pod to mount the same node to prevent this issue.

This PR seemed to address the issue:
#141

But it was reverted here:
#145

Cost Management Metrics Operator cannot update to the latest version unless the new pod luckily gets scheduled to the same node or manual intervention.

Expansion on the above, the costmanagement-metrics-operator consists of two main things:

a k8s deployment for the actual operator
a PV + PVC for its storage of cost data

The k8s deployment references the PV via:

    volumes:
    - name: costmanagement-metrics-operator-reports
      persistentVolumeClaim:
        claimName: costmanagement-metrics-operator-data

and mounts it via:

      volumeMounts:
      - mountPath: /tmp/costmanagement-metrics-operator-reports
        name: costmanagement-metrics-operator-reports

Given the volume reference the scheduler makes sure to schedule the pod on a node that can mount the volume. For example in AWS deployments the scheduler will assign the pod on a node that is in the same AZ as the volume (EBS drives can only span a single AZ).

When the deployment needs to spin up a new pod the scheduler will make the same decision: put the new pod on a node in the same AZ. The issue is that the EBS drives do not support multi-attach across machines so we're stuck in a place where the kubelet is not able to run the new pod.

A fix was pushed to add pod affinities to the k8s deployment: #141. That fix instructed the scheduler that it needs to place the new pod on the same node where a pod already exists that satisfy the specified labels. In that case it will resolve to a single node and the EBS drive is already attached.

The fix was reverted in #145 without any explanation.

The impact it had on our clusters is that a new version of the cost management metrics operator rolled out and OLM tried to upgrade it. The upgrade got stuck because the new pod would not run and OLM marked the CSV of the cost management metrics operator as failed.

maskarb · 2024-01-30T17:30:46Z

The pod affinities in #141 caused attach errors which is why it was reverted.

We can merge this and try to release it with the 3.2.0 version.

codecov · 2024-01-30T17:33:52Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d4acdd1) 85.12% compared to head (f848ca0) 85.12%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #280   +/-   ##
=======================================
  Coverage   85.12%   85.12%           
=======================================
  Files          13       13           
  Lines        2730     2730           
=======================================
  Hits         2324     2324           
  Misses        325      325           
  Partials       81       81

Flag	Coverage Δ
unittests	`85.12% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4acdd1...f848ca0. Read the comment docs.

maskarb · 2024-01-30T17:34:36Z

Thanks for the PR!

Resolve deployment upgrade with RWO PVCs

f848ca0

maskarb approved these changes Jan 30, 2024

View reviewed changes

maskarb merged commit 764287e into project-koku:main Jan 30, 2024
7 of 8 checks passed

montaguethomas deleted the patch-1 branch February 8, 2024 00:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolve deployment upgrade with RWO PVCs #280

Resolve deployment upgrade with RWO PVCs #280

montaguethomas commented Jan 30, 2024

maskarb commented Jan 30, 2024

montaguethomas commented Jan 30, 2024 •

edited

Loading

maskarb commented Jan 30, 2024

codecov bot commented Jan 30, 2024

maskarb commented Jan 30, 2024

Resolve deployment upgrade with RWO PVCs #280

Resolve deployment upgrade with RWO PVCs #280

Conversation

montaguethomas commented Jan 30, 2024

maskarb commented Jan 30, 2024

montaguethomas commented Jan 30, 2024 • edited Loading

maskarb commented Jan 30, 2024

codecov bot commented Jan 30, 2024

Codecov Report

maskarb commented Jan 30, 2024

montaguethomas commented Jan 30, 2024 •

edited

Loading