Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve deployment upgrade with RWO PVCs #280

Merged
merged 1 commit into from
Jan 30, 2024

Conversation

montaguethomas
Copy link
Contributor

When OLM tries to upgrade the operator, it will fail to rollout the new pod for the deployment due to RWO volume mounts.

The #141 PR resolved this by setting a preference to schedule the new pod on the name node as the existing pod. However this change was reverted in #145 for some unknown/undocumented reason.

@maskarb
Copy link
Member

maskarb commented Jan 30, 2024

@montaguethomas Can you explain in more detail what issue this PR resolves? We have had no reports that the current versions of this operator are having any issues deploying during an upgrade.

@montaguethomas
Copy link
Contributor Author

montaguethomas commented Jan 30, 2024

RH Case 03716933

The Cost Management Metrics Operator fails to upgrade due to PVC mounting issues.

multi-attach error for volume costmanagement-metrics-operator-reports already used by pod

We expect either the old pod to be deleted so a new pod can be created, or for the new pod to mount the same node to prevent this issue.

This PR seemed to address the issue:
#141

But it was reverted here:
#145

Cost Management Metrics Operator cannot update to the latest version unless the new pod luckily gets scheduled to the same node or manual intervention.

Expansion on the above, the costmanagement-metrics-operator consists of two main things:

  • a k8s deployment for the actual operator
  • a PV + PVC for its storage of cost data

The k8s deployment references the PV via:

    volumes:
    - name: costmanagement-metrics-operator-reports
      persistentVolumeClaim:
        claimName: costmanagement-metrics-operator-data

and mounts it via:

      volumeMounts:
      - mountPath: /tmp/costmanagement-metrics-operator-reports
        name: costmanagement-metrics-operator-reports

Given the volume reference the scheduler makes sure to schedule the pod on a node that can mount the volume. For example in AWS deployments the scheduler will assign the pod on a node that is in the same AZ as the volume (EBS drives can only span a single AZ).

When the deployment needs to spin up a new pod the scheduler will make the same decision: put the new pod on a node in the same AZ. The issue is that the EBS drives do not support multi-attach across machines so we're stuck in a place where the kubelet is not able to run the new pod.

A fix was pushed to add pod affinities to the k8s deployment: #141. That fix instructed the scheduler that it needs to place the new pod on the same node where a pod already exists that satisfy the specified labels. In that case it will resolve to a single node and the EBS drive is already attached.

The fix was reverted in #145 without any explanation.

The impact it had on our clusters is that a new version of the cost management metrics operator rolled out and OLM tried to upgrade it. The upgrade got stuck because the new pod would not run and OLM marked the CSV of the cost management metrics operator as failed.

@maskarb
Copy link
Member

maskarb commented Jan 30, 2024

The pod affinities in #141 caused attach errors which is why it was reverted.

We can merge this and try to release it with the 3.2.0 version.

Copy link

codecov bot commented Jan 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d4acdd1) 85.12% compared to head (f848ca0) 85.12%.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #280   +/-   ##
=======================================
  Coverage   85.12%   85.12%           
=======================================
  Files          13       13           
  Lines        2730     2730           
=======================================
  Hits         2324     2324           
  Misses        325      325           
  Partials       81       81           
Flag Coverage Δ
unittests 85.12% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4acdd1...f848ca0. Read the comment docs.

@maskarb maskarb merged commit 764287e into project-koku:main Jan 30, 2024
7 of 8 checks passed
@maskarb
Copy link
Member

maskarb commented Jan 30, 2024

Thanks for the PR!

@montaguethomas montaguethomas deleted the patch-1 branch February 8, 2024 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants