Resolve errors for StatefulSet restart with updateStrategy: OnDelete #876
Conversation
Force-pushed from 60205ee to f81a87a
```diff
@@ -27,11 +30,10 @@ def deploy_succeeded?
             "Consider switching to rollingUpdate.")
           @success_assumption_warning_shown = true
         end
-      else
-        success &= desired_replicas == status_data['currentReplicas'].to_i
```
This property is not supported for STS with `OnDelete`:

```
kubectl rollout status sts/foo
error: rollout status is only available for RollingUpdate strategy type
```

...therefore this condition can never be true in this case and krane will stall.

Edit: It turns out `currentReplicas` can disappear sometimes, so this check can sometimes cause krane to stall; see below.
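To make the stall concrete, here is a minimal plain-Ruby sketch (the `status_data` hash is hypothetical, mirroring the status pasted later in this thread) of why a missing `currentReplicas` keeps this comparison false forever:

```ruby
# Hypothetical status hash for a StatefulSet whose currentReplicas field
# has disappeared (mirroring the status shown later in this thread).
status_data = {
  'replicas'        => 2,
  'readyReplicas'   => 2,
  'currentRevision' => 'stateful-busybox-6b5489d7d6',
  'updateRevision'  => 'stateful-busybox-96676bd54',
  # note: no 'currentReplicas' key
}
desired_replicas = 2

# Hash#[] returns nil for the missing key, and nil.to_i is 0, so the
# comparison below is false on every poll; deploy_succeeded? never
# returns true and krane polls until it times out (stalls).
puts desired_replicas == status_data['currentReplicas'].to_i # => false
```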
How was the CI test in `restart_task_test` passing if this is the case?
@timothysmith0609 Great question!
I had to look into this a bit. The brief answer is that certain combinations of operations can cause the `currentReplicas` field to become `nil`. The simple test as-is does not hit these conditions, so it passes.
I've pushed a version of this test that hangs (in the same way I saw in real life) here: 5c4acd0
You can verify locally that this sequence of events leaves the stateful set in the following state:
```yaml
status:
  collisionCount: 0
  currentRevision: stateful-busybox-6b5489d7d6
  observedGeneration: 2
  readyReplicas: 2
  replicas: 2
  updateRevision: stateful-busybox-96676bd54
  updatedReplicas: 2
```
Notice the absence of `currentReplicas`.
I did not push this version of the test here because it hangs due to the previously mentioned bug, so the only way to run it is with `verify_result: false`, which doesn't make for a good test.
If you prefer, I can drop these changes from this PR since IRL I will have to run without verify result anyway for now.
Thank you for the explanation!
> If you prefer, I can drop these changes from this PR since IRL I will have to run without verify result anyway for now.
May as well keep the changes for now, to avoid repeating the same work later.
```diff
@@ -213,7 +213,7 @@ def patch_daemonset_with_restart(record)
       def delete_statefulset_pods(record)
         pods = kubeclient.get_pods(namespace: record.metadata.namespace)
         pods.select! do |pod|
-          pod.metadata.ownerReferences.find { |ref| ref.uid == record.metadata.uid }
+          pod.metadata&.ownerReferences&.find { |ref| ref.uid == record.metadata.uid }
```
Fixes the `nil` errors encountered if bare pods are found in the namespace. At Shopify nearly all app namespaces will have such a pod (e.g. a secret seeder).
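A minimal sketch of that failure mode, using `OpenStruct` stand-ins for kubeclient objects (the pod and uid values here are made up for illustration):

```ruby
require 'ostruct'

sts_uid = 'sts-uid-123' # hypothetical StatefulSet uid

# A pod owned by the StatefulSet carries an ownerReferences array...
owned_pod = OpenStruct.new(
  metadata: OpenStruct.new(ownerReferences: [OpenStruct.new(uid: sts_uid)])
)
# ...while a bare pod (e.g. a secret seeder) has none, so the field is nil.
bare_pod = OpenStruct.new(metadata: OpenStruct.new(ownerReferences: nil))

pods = [owned_pod, bare_pod]

# Without safe navigation, bare_pod.metadata.ownerReferences.find raises
# NoMethodError (undefined method `find' for nil:NilClass). With &., the
# chain short-circuits to nil and select! simply drops the bare pod.
pods.select! do |pod|
  pod.metadata&.ownerReferences&.find { |ref| ref.uid == sts_uid }
end
puts pods.size # => 1 (only the owned pod remains)
```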
🤦 Thank you for fixing that
Force-pushed from f81a87a to 6e8bf2b
```diff
@@ -19,6 +19,9 @@ def status
       end

+      def deploy_succeeded?
+        success = observed_generation == current_generation &&
+          desired_replicas == status_data['readyReplicas'].to_i &&
+          status_data['currentRevision'] == status_data['updateRevision']
```
This check is currently broken in upstream Kubernetes for `OnDelete`, but if we remove the check the method will return immediately, before the pods have actually been restarted successfully. We could (re)implement some of the controller logic to check individual pods, but I believe running with `--no-verify-result` until the upstream fix lands is the better option.

Upstream fix: kubernetes/kubernetes#106059
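For the record, the controller-logic reimplementation alluded to above could look roughly like the sketch below. It leans on the `controller-revision-hash` label the StatefulSet controller puts on its pods; the method name and surrounding wiring are hypothetical, and this is not part of the PR:

```ruby
# Hypothetical sketch (not part of this PR): verify a restart at the pod
# level by comparing each owned pod's controller-revision-hash label to
# the StatefulSet's updateRevision.
def all_pods_on_update_revision?(sts, pods)
  update_revision = sts.status.updateRevision
  owned = pods.select do |pod|
    pod.metadata&.ownerReferences&.any? { |ref| ref.uid == sts.metadata.uid }
  end
  owned.all? do |pod|
    pod.metadata.labels['controller-revision-hash'] == update_revision
  end
end
```

A real implementation would also need to track pod readiness and replacement ordering, which is part of why deferring to the upstream fix is simpler.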
Force-pushed from 1b2d9ae to 4fb61d8
```diff
@@ -36,7 +36,8 @@ def test_restart_by_annotation
       end

       def test_restart_statefulset_on_delete_restarts_child_pods
-        result = deploy_fixtures("hello-cloud", subset: "stateful_set.yml") do |fixtures|
+        result = deploy_fixtures("hello-cloud", subset: ["configmap-data.yml", "unmanaged-pod-1.yml.erb",
```
Deploying a bare pod in the integration test without the fix produces the expected error:
```
Error:
RestartTaskTest#test_restart_statefulset_on_delete_restarts_child_pods:
NoMethodError: undefined method `find' for nil:NilClass
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:216:in `block in delete_statefulset_pods'
    /opt/hostedtoolcache/Ruby/3.0.3/x64/lib/ruby/3.0.0/delegate.rb:349:in `select!'
    /opt/hostedtoolcache/Ruby/3.0.3/x64/lib/ruby/3.0.0/delegate.rb:349:in `block in delegating_block'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:215:in `delete_statefulset_pods'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:239:in `block in restart_statefulsets!'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:233:in `each'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:233:in `restart_statefulsets!'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:74:in `run!'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:49:in `run'
    /home/runner/work/krane/krane/test/integration/restart_task_test.rb:49:in `test_restart_statefulset_on_delete_restarts_child_pods'
```
❤️ Thank you for fixing this, just one small question
What are you trying to accomplish with this PR?

Resolve some logic errors in StatefulSet restart handling introduced in #840.

- If bare pods are present in the namespace, `delete_statefulset_pods` will crash. (See below)
- An upstream Kubernetes bug prevents `currentRevision` from matching `updateRevision` for `OnDelete`.
- For `OnDelete` the `currentReplicas` field is sometimes nil.

How is this accomplished?
- Stop checking `currentReplicas`, avoiding the `nil` error condition
- Guard against the `nil` dereference on bare pods
- Add integration test coverage for restarting a StatefulSet with `OnDelete`

What could go wrong?

Until the upstream fix lands we'll still have to deploy with `--no-verify-result` for STS with `OnDelete`.

Acknowledgments
Thanks to @epk for helping to debug these issues.