-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provisioner fails with "error syncing claim: node not found" after "final error received, removing pvc" #152
Comments
/sig storage |
Let's continue the conversation here. @msau42 @jsafrane If the provisioning has started, shouldn't the state changed to ProvisioningInBackground? If it is ProvisioningInBackground and node is not found, we do not go down the code path: Instead we continue to the code:
|
@sunnylovestiramisu that will work only when we expect that the missing node comes back online. Is this the case we're trying to solve here? Because if the node was permanently deleted, then the provisioner will end in an endless loop. To fix the issue for all possible library users, the library should save node somewhere (PVC annotation?!) and give it to the provisioner, so it can retry until it gets a final success / error without getting the Node from the API server. But a whole node in PVC annotations is really ugly. CSI provisioner will need just node name from it to get CSINode and driver's topology labels from it*. So... should there be an extra call, say *) There is another question what should happen when CSINode is missing here: https://github.com/kubernetes-csi/external-provisioner/blob/3739170578f68aaf0594f631cae6d270bbfdc83e/pkg/controller/topology.go#L273 |
@jsafrane we are trying to solve the case that the provisioning did start already on the backend but then the node got deleted afterwards. |
Question is, do you expect the node to come back and optimize for this case? That's manageable. |
The node with the same node name will not come back if it gets deleted. But if a node with the same node name but a different UID may come back. Or if there is any other cases that I miss. |
There's 2 main reasons why a Node about may disappear:
The original motivation for #121 was for the autoscaling case where the node never comes back. |
That's the hard case. The provisioner lib then needs to store the whole node somewhere, so it can reconstruct it and give it to the provisioner in the next Since storing the whole node is ugly, we could extend the provisioner API, so the provisioner can distill a shorter string from the Node and then the library would store it (in PVC annotations?) and give it to the provisioner in all subsequent Provision calls. |
What happens if a node with the exact name but different uid comes back? Does this shorter string provide enough information for the provisioner to distinguish the node and decide what action to take? |
Question is, if |
But how do we know if a node will come back or not? How do we know if it will be recreated from symptoms? Let say we store the nodeName in annotations.
|
I don't think if it matters if the node will come back or not. What we return depends on if the operation returned final or non-final error. Whether or not the node exists or not determines how we retry in the non-final case. And so the question is for the non-final case, should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (What the fix in #139 was trying to do). I think the challenge with getting a new node is what happens if the topology also changes? Then a fix would involve:
|
What if there is no node? Someone has deleted the Pod and PVC and provisioning is in progress. The provisioning should continue and eventually succeed or fail, but it needs some node to continue. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
/triage accepted |
/lifecycle frozen |
This is a follow-on from #121.
We are still seeing "error syncing claim: node not found" with "final error received, removing pvc x from claims in progress" @songsunny made a fix for this, however, the fix does not handle the final/non-final error. So the fix can cause the PD to leak in this scenario: #139 (comment)
The text was updated successfully, but these errors were encountered: