CSI Controller Plugin Is Ignored After Scale Down #8034

Closed
tyler-domitrovich opened this issue May 21, 2020 · 5 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. stage/waiting-reply theme/storage type/bug

Comments

@tyler-domitrovich

Nomad version

Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Issue

If the controller plugin job is scaled down and a job that requires a CSI volume is then scheduled, Nomad does not call ControllerPublishVolume before calling NodePublishVolume. I am using the aws-ebs-csi-driver in particular, which produces the following error when ControllerPublishVolume is not called:

failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided
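
For readers less familiar with the CSI flow, here is a minimal sketch of why skipping the controller step could surface as this error. It is a toy model, not Nomad or aws-ebs-csi-driver code: the function names, the `devicePath` key, and the device naming are all assumptions made for illustration. The idea is that the controller publish step returns a publish context, and the node-side step fails if the device path is missing from it.

```python
# Toy model of the CSI publish sequence. Hypothetical names throughout;
# this is not the real driver or Nomad implementation.

class InvalidArgument(Exception):
    """Stands in for a gRPC InvalidArgument status."""


DEVICE_PATH_KEY = "devicePath"  # assumed key name, for illustration only


def controller_publish_volume(volume_id, node_id):
    # Controller side: attach the volume to the node and report where
    # it was attached via the publish context.
    return {DEVICE_PATH_KEY: f"/dev/xvd-{volume_id}"}


def node_publish_volume(volume_id, publish_context):
    # Node side: needs the device path produced by the controller step.
    device_path = publish_context.get(DEVICE_PATH_KEY)
    if not device_path:
        raise InvalidArgument("Device path not provided")
    return f"mounted {device_path}"


def claim_volume(volume_id, node_id, call_controller=True):
    # Orchestrator side: if the controller call is skipped, the node
    # step receives an empty publish context.
    publish_context = {}
    if call_controller:
        publish_context = controller_publish_volume(volume_id, node_id)
    return node_publish_volume(volume_id, publish_context)
```

In this model, `claim_volume("vol-1", "node-a")` succeeds, while `claim_volume("vol-1", "node-a", call_controller=False)` raises `InvalidArgument`, which mirrors the symptom described above.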

Reproduction steps

The repro steps are similar to those in the following issue, however my plugin healthy/expected counts are correct now:

#7817

Additional Notes

  • The controller plugin is only ignored after a scale down. It is called as expected if the controller is scaled up or other changes are made to the controller job config.
  • The controller plugin still appears as healthy on the plugin page and nomad plugin status after a scale down.
@galeep galeep added the CSI label May 21, 2020
@tgross tgross added theme/storage and removed CSI labels May 21, 2020
@tgross
Member

tgross commented May 21, 2020

Hi @tyler-domitrovich! Thanks for reporting this.

When you say "scaled down", are we talking about scaling down to 0 or just scaling down to a smaller number? I would expect to get an error if we'd scaled down to 0, but I'd also expect us to stop there and not continue on to placements that trigger the node publish steps.

Also, was there much of a delay between the controller scale-down and when the job was run? I'm wondering if the delay in plugin fingerprinting (see #7296) is involved somewhere in this. When a plugin allocation starts/stops, it can take up to 30s for the change to be visible to the servers.

@tyler-domitrovich
Author

Hello @tgross!

By "scaled down" I mean scaling down to a smaller number. In my particular case I deployed a controller job with 2 tasks then scaled down to one task.

I deployed the test job a minute after the controller plugin was scaled down, and allocations immediately began to fail with "device path not provided". This behavior seems to continue indefinitely: I configured the test job to keep trying to reschedule the task every 30 seconds, and allocs were still failing in the same way after an hour.

Interestingly, I seem to have found a workaround while testing this just now. If I restart the controller job, the test job is able to call the controller plugin again and the test allocations come up healthy.

@tgross
Member

tgross commented Oct 7, 2020

@tyler-domitrovich I meant to circle back on this after we released 0.12.2 and 0.12.4. You should see some improvements in the 0.12.x series around this problem.

@tgross
Member

tgross commented Nov 25, 2020

In lieu of more data, closing with #9438, which will ship in Nomad 1.0

@tgross tgross closed this as completed Nov 25, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022