Protect CSI driver node pods to avoid storage workload scheduling #122

Merged
alikdell merged 8 commits into main from protect-driver-pod on May 19, 2022

Conversation

@alikdell (Collaborator) commented on May 17, 2022

Description

When the CSI driver node pod is not running on a worker node, the CSI controller should detect this and taint that node so that storage workload pods are not scheduled on it.
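For illustration, a minimal sketch of how a controller could add such a NoSchedule taint with client-go; the package name, function name, and taint key below are assumptions, not necessarily the PR's actual code:

package monitor

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintNode is a hypothetical helper: it adds a NoSchedule taint to the given
// node so that new workloads are not scheduled there while the CSI driver
// node pod is absent.
func taintNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	taint := v1.Taint{
		Key:    "podmon.dellemc.com/driver", // assumed taint key
		Effect: v1.TaintEffectNoSchedule,
	}
	// Skip the update if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return nil
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

Removing the taint once the driver pod is healthy again would follow the same pattern, filtering the assumed key out of node.Spec.Taints before updating.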

GitHub Issues

List the GitHub issues impacted by this PR:

GitHub Issue #: dell/csm#145

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • Backward compatibility is not broken

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please also list any relevant details for your test configuration.

  • Unit test added
  • Integration test run

time="2022-05-19T14:00:25-04:00" level=info msg="Node-mode test finished"
--- PASS: TestNodeMode (1.84s)
=== RUN TestMapEqualsMap
--- PASS: TestMapEqualsMap (0.00s)
=== RUN TestPowerFlexShortCheck
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping short integration test. To enable short integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestPowerFlexShortCheck (0.00s)
=== RUN TestUnityShortCheck
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping short integration test. To enable short integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestUnityShortCheck (0.00s)
=== RUN TestPowerFlexShortIntegration
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping integration test. To enable integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestPowerFlexShortIntegration (0.00s)
=== RUN TestUnityShortIntegration
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping integration test. To enable integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestUnityShortIntegration (0.00s)
PASS
coverage: 92.9% of statements
status 0
ok podmon/internal/monitor 8.292s coverage: 92.9% of statements

@rbo54 (Collaborator) previously requested changes on May 18, 2022 and left a comment:

A few small changes I think would make it better, but overall very good work!

podKey := getPodKey(pod)
// Clean up pod key to PodInfo and CrashLoopBackOffCount mappings if deleting.
if eventType == watch.Deleted {
cm.PodKeyToControllerPodInfo.Delete(podKey)
rbo54 (Collaborator) commented:

I don't think the calls to cm.PodKeyToControllerPodInfo.Delete and cm.PodKeyToCrashLoopBackOffCount.Delete are needed, because you correctly intercept the pod before it is added to these maps. Look at controllerModePodHandler and you will see your driver pods get broken out into a separate function before they can be added.
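A minimal sketch of the dispatch described above (package layout, namespace value, and signatures are simplified assumptions): driver pods are routed to their own handler before any entries reach the controller maps, so deleting their keys on watch.Deleted would be a no-op.

package monitor

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// Assumed values and names for illustration only.
var (
	driverNamespace           = "vxflexos"
	podKeyToControllerPodInfo sync.Map
)

func getPodKey(pod *v1.Pod) string { return pod.Namespace + "/" + pod.Name }

func controllerModePodHandler(pod *v1.Pod, eventType watch.EventType) {
	if pod.Namespace == driverNamespace {
		// Driver pods branch off here, before any map entries are created.
		controllerModeDriverPodHandler(pod, eventType)
		return
	}
	if eventType != watch.Deleted {
		podKeyToControllerPodInfo.Store(getPodKey(pod), pod)
	}
}

func controllerModeDriverPodHandler(pod *v1.Pod, eventType watch.EventType) {
	// taint/untaint handling for the node hosting this driver pod goes here
}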

alikdell (author) replied:

Fixed

// Determine pod status
ready := false
initialized := true
conditions := pod.Status.Conditions
rbo54 (Collaborator) commented:

Maybe you want to break lines 679 through 688 into a subroutine that, given a pod, returns booleans for each of the conditions, since this code is cut and pasted from controllerModePodHandler.
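A minimal sketch of the suggested helper (the function name is an assumption): given a pod, it derives the Ready and Initialized booleans from pod.Status.Conditions so both handlers can share one implementation.

package monitor

import v1 "k8s.io/api/core/v1"

// podConditions reports whether the pod is Ready and Initialized, based on
// its status conditions. Defaults match the original snippet: ready=false,
// initialized=true.
func podConditions(pod *v1.Pod) (ready, initialized bool) {
	initialized = true
	for _, condition := range pod.Status.Conditions {
		switch condition.Type {
		case v1.PodReady:
			ready = condition.Status == v1.ConditionTrue
		case v1.PodInitialized:
			initialized = condition.Status == v1.ConditionTrue
		}
	}
	return ready, initialized
}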

alikdell (author) replied:

fixed

@@ -174,6 +179,7 @@ func podMonitorHandler(eventType watch.EventType, object interface{}) error {
pm := &PodMonitor
switch PodMonitor.Mode {
case "controller":
// driver-namespace == pod.spec.namespace call different function.
rbo54 (Collaborator) commented:

This comment is incorrect now?

alikdell (author) replied:

fixed

And a driver pod for node <podnode> with condition <condition>
And I induce error <error>
When I call controllerModeDriverPodHandler with event <eventtype>
And the <podnode> is tainted
rbo54 (Collaborator) commented:

Do you check the untaint conditions? You could have a boolean indicating whether the node is expected to be tainted or not. In the examples, wouldn't the "Ready" line remove the taint?

alikdell (author) replied:

fixed

}

func (f *feature) theIsTainted(node string) error {
return nil
rbo54 (Collaborator) commented:

Pass the boolean in here and then you can check both the addition of the taint as well as the removal of the taint.
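A sketch of this suggestion (function name and taint key are assumptions, not the PR's code): the step fetches the node from the fake K8s API and passes in the expected state, so one check covers both adding and removing the taint.

package monitor

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// expectNodeTaint verifies whether the assumed driver taint is present on the
// node and compares that against the expectation passed in by the step.
func expectNodeTaint(node *v1.Node, key string, wantTainted bool) error {
	tainted := false
	for _, t := range node.Spec.Taints {
		if t.Key == key && t.Effect == v1.TaintEffectNoSchedule {
			tainted = true
			break
		}
	}
	if tainted != wantTainted {
		return fmt.Errorf("node %s: expected tainted=%v, got %v", node.Name, wantTainted, tainted)
	}
	return nil
}

The godog step could then call expectNodeTaint(node, "podmon.dellemc.com/driver", true) after a failure scenario and expectNodeTaint(node, "podmon.dellemc.com/driver", false) once the driver pod is Ready again.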

alikdell (author) replied:

fixed

@@ -291,6 +298,7 @@ func (pm *PodMonitorType) nodeModeCleanupPods(node *v1.Node) bool {

// Check containers to make sure they're not running. This uses the containerInfos map obtained above.
pod := podInfo.Pod
namespace, name := splitPodKey(podKey)
rbo54 (Collaborator) commented:

I'm not sure why you moved this from below the for loop; looks like it doesn't make any difference. Doesn't really matter.

alikdell (author) replied:

I moved it when I thought I also needed to check in this function whether the pod is a driver node pod; I forgot to move it back to its original place.

alikdell (author) added:

reverted

@rbo54 (Collaborator) left a comment:

Thanks for the changes Alik! Approved.

@alikdell merged commit 82a3ff9 into main on May 19, 2022
@alikdell deleted the protect-driver-pod branch on May 19, 2022 at 19:40