Protect CSI driver node pods to avoid storage workload scheduling #122
Conversation
A few small changes that I think would make it better, but overall very good work!
internal/monitor/controller.go
Outdated
podKey := getPodKey(pod)
// Clean up pod key to PodInfo and CrashLoopBackOffCount mappings if deleting.
if eventType == watch.Deleted {
	cm.PodKeyToControllerPodInfo.Delete(podKey)
I don't think the calls to cm.PodKeyToControllerPodInfo.Delete and cm.PodKeyToCrashLoopBackOffCount.Delete are needed, because you correctly intercept the pod before it is added into these maps. Look at controllerModePodHandler and you will see your driver pods get broken out into a separate function before they can be added.
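For illustration, a minimal sketch of that break-out, assuming field and helper names from this thread rather than the actual podmon source:

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// Sketch only: the type layout is assumed from the discussion above.
type PodMonitorType struct {
	DriverNamespace               string
	PodKeyToControllerPodInfo     sync.Map
	PodKeyToCrashLoopBackOffCount sync.Map
}

func (cm *PodMonitorType) controllerModePodHandler(pod *v1.Pod, eventType watch.EventType) error {
	// Driver pods break out to a separate handler before any entries are
	// added to the maps, so no Delete() cleanup is needed on watch.Deleted.
	if pod.Namespace == cm.DriverNamespace {
		return cm.controllerModeDriverPodHandler(pod, eventType)
	}
	podKey := getPodKey(pod)
	cm.PodKeyToControllerPodInfo.Store(podKey, pod)
	// ... normal controller-mode handling continues here ...
	return nil
}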
Fixed
internal/monitor/controller.go
Outdated
// Determine pod status
ready := false
initialized := true
conditions := pod.Status.Conditions
Maybe you want to break lines 679 through 688 into a subroutine that, given a pod, returns booleans for each of the conditions, since this code is cut and pasted from controllerModePodHandler.
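For example, a hypothetical helper along these lines (the signature is a guess, not the merged code) would remove the duplication:

import v1 "k8s.io/api/core/v1"

// podConditions reports whether the pod is Ready and Initialized, using the
// same defaults as the snippet above (ready=false, initialized=true).
func podConditions(pod *v1.Pod) (ready, initialized bool) {
	initialized = true
	for _, cond := range pod.Status.Conditions {
		switch cond.Type {
		case v1.PodReady:
			ready = cond.Status == v1.ConditionTrue
		case v1.PodInitialized:
			initialized = cond.Status == v1.ConditionTrue
		}
	}
	return ready, initialized
}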
fixed
internal/monitor/monitor.go
Outdated
@@ -174,6 +179,7 @@ func podMonitorHandler(eventType watch.EventType, object interface{}) error {
	pm := &PodMonitor
	switch PodMonitor.Mode {
	case "controller":
		// driver-namespace == pod.spec.namespace call different function.
Is this comment incorrect now?
fixed
And a driver pod for node <podnode> with condition <condition>
And I induce error <error>
When I call controllerModeDriverPodHandler with event <eventtype>
And the <podnode> is tainted
Do you check the untaint conditions? You could have a boolean indicating whether it is expected to be tainted or not. In the examples, wouldn't the "Ready" line remove the taint?
fixed
}

func (f *feature) theIsTainted(node string) error {
	return nil
Pass the boolean in here and then you can check both the addition of the taint as well as the removal of the taint.
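A rough sketch of such a step, assuming a placeholder k8sapi accessor and taint key rather than the real test harness:

// godog passes step arguments as strings, so the expected taint state
// arrives as "true"/"false".
func (f *feature) theNodeTaintedStatusIs(node string, tainted string) error {
	expectTaint := tainted == "true"
	n, err := f.k8sapi.GetNode(context.Background(), node) // placeholder accessor
	if err != nil {
		return err
	}
	hasTaint := false
	for _, t := range n.Spec.Taints {
		if t.Key == podmonTaintKey { // placeholder taint key
			hasTaint = true
			break
		}
	}
	if hasTaint != expectTaint {
		return fmt.Errorf("node %s tainted=%v, expected %v", node, hasTaint, expectTaint)
	}
	return nil
}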
fixed
internal/monitor/node.go
Outdated
@@ -291,6 +298,7 @@ func (pm *PodMonitorType) nodeModeCleanupPods(node *v1.Node) bool {

	// Check containers to make sure they're not running. This uses the containerInfos map obtained above.
	pod := podInfo.Pod
	namespace, name := splitPodKey(podKey)
I'm not sure why you moved this from below the for loop; looks like it doesn't make any difference. Doesn't really matter.
I moved it when I thought I also needed to check in this function whether the pod is a driver node pod; I forgot to move it back to its original place.
reverted
Thanks for the changes, Alik! Approved.
Description
When the CSI node driver is not running on a worker node (WN), the CSI controller should detect this and taint that node so that pods are not scheduled on it.
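As a rough illustration of the mechanism (not this PR's exact code; the taint key below is made up), adding a NoSchedule taint with client-go looks roughly like:

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const driverTaintKey = "podmon.example.com/driver-pod-down" // illustrative key

// taintNode adds a NoSchedule taint to nodeName if it is not already present,
// which keeps the scheduler from placing new pods on that node.
func taintNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == driverTaintKey {
			return nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    driverTaintKey,
		Effect: v1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}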
GitHub Issues
List the GitHub issues impacted by this PR:
Checklist:
How Has This Been Tested?
Please describe the tests that you ran to verify your changes, and list any relevant details of your test configuration.
time="2022-05-19T14:00:25-04:00" level=info msg="Node-mode test finished"
--- PASS: TestNodeMode (1.84s)
=== RUN TestMapEqualsMap
--- PASS: TestMapEqualsMap (0.00s)
=== RUN TestPowerFlexShortCheck
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping short integration test. To enable short integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestPowerFlexShortCheck (0.00s)
=== RUN TestUnityShortCheck
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping short integration test. To enable short integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestUnityShortCheck (0.00s)
=== RUN TestPowerFlexShortIntegration
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping integration test. To enable integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestPowerFlexShortIntegration (0.00s)
=== RUN TestUnityShortIntegration
time="2022-05-19T14:00:25-04:00" level=info msg="Skipping integration test. To enable integration test: export RESILIENCY_SHORT_INT_TEST=true"
--- PASS: TestUnityShortIntegration (0.00s)
PASS
coverage: 92.9% of statements
status 0
ok podmon/internal/monitor 8.292s coverage: 92.9% of statements