-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve deployment upgrade with RWO PVCs #280
Conversation
@montaguethomas Can you explain in more detail what issue this PR resolves? We have had no reports that the current versions of this operator are having any issues deploying during an upgrade. |
RH Case 03716933
Expansion on the above, the costmanagement-metrics-operator consists of two main things:
The k8s deployment references the PV via: volumes:
- name: costmanagement-metrics-operator-reports
persistentVolumeClaim:
claimName: costmanagement-metrics-operator-data and mounts it via: volumeMounts:
- mountPath: /tmp/costmanagement-metrics-operator-reports
name: costmanagement-metrics-operator-reports Given the volume reference the scheduler makes sure to schedule the pod on a node that can mount the volume. For example in AWS deployments the scheduler will assign the pod on a node that is in the same AZ as the volume (EBS drives can only span a single AZ). When the deployment needs to spin up a new pod the scheduler will make the same decision: put the new pod on a node in the same AZ. The issue is that the EBS drives do not support multi-attach across machines so we're stuck in a place where the kubelet is not able to run the new pod. A fix was pushed to add pod affinities to the k8s deployment: #141. That fix instructed the scheduler that it needs to place the new pod on the same node where a pod already exists that satisfy the specified labels. In that case it will resolve to a single node and the EBS drive is already attached. The fix was reverted in #145 without any explanation. The impact it had on our clusters is that a new version of the cost management metrics operator rolled out and OLM tried to upgrade it. The upgrade got stuck because the new pod would not run and OLM marked the CSV of the cost management metrics operator as failed. |
The pod affinities in #141 caused attach errors which is why it was reverted. We can merge this and try to release it with the 3.2.0 version. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #280 +/- ##
=======================================
Coverage 85.12% 85.12%
=======================================
Files 13 13
Lines 2730 2730
=======================================
Hits 2324 2324
Misses 325 325
Partials 81 81
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report in Codecov by Sentry.
|
Thanks for the PR! |
When OLM tries to upgrade the operator, it will fail to rollout the new pod for the deployment due to RWO volume mounts.
The #141 PR resolved this by setting a preference to schedule the new pod on the name node as the existing pod. However this change was reverted in #145 for some unknown/undocumented reason.