Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provisioner fails with "error syncing claim: node not found" after "final error received, removing pvc" #152

Open
judemars opened this issue Aug 15, 2023 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@judemars
Copy link

This is a follow-on from #121.

We are still seeing "error syncing claim: node not found" with "final error received, removing pvc x from claims in progress" @songsunny made a fix for this, however, the fix does not handle the final/non-final error. So the fix can cause the PD to leak in this scenario: #139 (comment)

@sunnylovestiramisu
Copy link
Contributor

/sig storage
/kind bug

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/bug Categorizes issue or PR as related to a bug. labels Aug 16, 2023
@sunnylovestiramisu
Copy link
Contributor

Let's continue the conversation here. @msau42 @jsafrane

If the provisioning has started, shouldn't the state changed to ProvisioningInBackground?

If it is ProvisioningInBackground and node is not found, we do not go down the code path: return ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)

Instead we continue to the code:

err = fmt.Errorf("failed to get target node: %v", err)
ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
return ProvisioningNoChange, err

@jsafrane
Copy link
Contributor

@sunnylovestiramisu that will work only when we expect that the missing node comes back online. Is this the case we're trying to solve here? Because if the node was permanently deleted, then the provisioner will end in an endless loop.

To fix the issue for all possible library users, the library should save node somewhere (PVC annotation?!) and give it to the provisioner, so it can retry until it gets a final success / error without getting the Node from the API server.

But a whole node in PVC annotations is really ugly. CSI provisioner will need just node name from it to get CSINode and driver's topology labels from it*. So... should there be an extra call, say ExtractNodeInfo, where the provisioner would get whatever it wants from the SelectedNode (and perhaps even CSINode) and return a shorter string that the library would save in the PVC? Then the library could give the string to every Provision() call until a final success/error. This is change of the library API. Is it worth the effort?

*) There is another question what should happen when CSINode is missing here: https://github.com/kubernetes-csi/external-provisioner/blob/3739170578f68aaf0594f631cae6d270bbfdc83e/pkg/controller/topology.go#L273

@sunnylovestiramisu
Copy link
Contributor

@jsafrane we are trying to solve the case that the provisioning did start already on the backend but then the node got deleted afterwards.

@jsafrane
Copy link
Contributor

Question is, do you expect the node to come back and optimize for this case? That's manageable.

@sunnylovestiramisu
Copy link
Contributor

sunnylovestiramisu commented Aug 24, 2023

The node with the same node name will not come back if it gets deleted. But if a node with the same node name but a different UID may come back. Or if there is any other cases that I miss.

@msau42
Copy link

msau42 commented Aug 24, 2023

There's 2 main reasons why a Node about may disappear:

  1. The node got autoscaled down. In this case the node is not expected to come back.
  2. The node got pre-emped and recreated. The node may come back, and often it comes back quickly.

The original motivation for #121 was for the autoscaling case where the node never comes back.

@jsafrane
Copy link
Contributor

The original motivation for #121 was for the autoscaling case where the node never comes back

That's the hard case. The provisioner lib then needs to store the whole node somewhere, so it can reconstruct it and give it to the provisioner in the next Provision() calls, until it succeeds / fails with a final error.

Since storing the whole node is ugly, we could extend the provisioner API, so the provisioner can distill a shorter string from the Node and then the library would store it (in PVC annotations?) and give it to the provisioner in all subsequent Provision calls.

@sunnylovestiramisu
Copy link
Contributor

What happens if a node with the exact name but different uid comes back? Does this shorter string provide enough information for the provisioner to distinguish the node and decide what action to take?

@jsafrane
Copy link
Contributor

Question is, if selectedNode annotation is valid when someone deletes a node and creates a new one with the same name. Assuming yes, the new node is still the right one, then I think the shorter string could be used only as a fallback, when the Node object is not in the API server (or in informer).

@sunnylovestiramisu
Copy link
Contributor

sunnylovestiramisu commented Aug 28, 2023

But how do we know if a node will come back or not? How do we know if it will be recreated from symptoms? Let say we store the nodeName in annotations.

  1. The node got autoscaled down. In this case the node is not expected to come back.
    <-- We will see apierrs.IsNotFound(err) to be true always. And we still have nodeName in annotations. What do we do next?

  2. The node got pre-emped and recreated. The node may come back, and often it comes back quickly.
    <-- We will see apierrs.IsNotFound(err) to be true in a short period of time. And we still have nodeName in annotations. And we use that nodeName to skip the ctrl.provisionVolumeErrorHandling(ctx, ProvisioningReschedule, err, claim, operation)?

@msau42
Copy link

msau42 commented Aug 28, 2023

I don't think if it matters if the node will come back or not. What we return depends on if the operation returned final or non-final error. Whether or not the node exists or not determines how we retry in the non-final case.

And so the question is for the non-final case, should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (What the fix in #139 was trying to do).

I think the challenge with getting a new node is what happens if the topology also changes? Then a fix would involve:

  • The driver being able to handle a change in requested topology for an existing operation
  • provisioner being able to detect that the returned topology doesn't match the request and coordinating a deletion + retry

@jsafrane
Copy link
Contributor

jsafrane commented Sep 6, 2023

And so the question is for the non-final case, should we retry the CreateVolume() call with the old node (Jan's proposal), or get a new node from the scheduler (What the fix in #139 was trying to do).

What if there is no node? Someone has deleted the Pod and PVC and provisioning is in progress. The provisioning should continue and eventually succeed or fail, but it needs some node to continue.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 26, 2024
@msau42
Copy link

msau42 commented Feb 27, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 27, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2024
@xing-yang
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 5, 2024
@xing-yang
Copy link

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 5, 2024
@msau42
Copy link

msau42 commented Aug 2, 2024

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

7 participants