Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: make readiness probe timeout configurable to support faulty plugins #13179

Closed
calvarado2004 opened this issue May 30, 2022 · 13 comments · Fixed by #13340
Closed

CSI: make readiness probe timeout configurable to support faulty plugins #13179

calvarado2004 opened this issue May 30, 2022 · 13 comments · Fixed by #13340

Comments

@calvarado2004
Copy link

Nomad version

Output from nomad version
Nomad v1.3.1 (2b054e3)

Operating system and Environment details

Ubuntu Linux 20.04

Issue

CSI Plugin on Nomad 1.3 have a very short timeout. Affects Portworx which can take 15 minutes to be installed.

Reproduction steps

Try to install Portworx on Nomad as a job following this guide:
https://docs.portworx.com/install-with-other/nomad/installation/install-as-a-nomad-job/

Expected Result

Portworx running on Nomad

Actual Result

CSI plugin timeout

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation May 31, 2022
@tgross tgross self-assigned this May 31, 2022
@tgross
Copy link
Member

tgross commented May 31, 2022

Hi @calvarado2004 can you provide the specific log line you're seeing? It sounds like you're saying that after the plugin container starts that it takes 15min for the plugin to successfully respond to probes. But if that's the case, there's almost certainly something broken with your plugin's setup (or possibly the Portworx plugin has a bad bug; it definitely should not be taking 15min to start!). If so, can you please provide the full plugin allocation logs?

@RickyGrassmuck
Copy link
Contributor

It sounds like you're saying that after the plugin container starts that it takes 15min for the plugin to successfully respond to probes. But if that's the case, there's almost certainly something broken with your plugin's setup (or possibly the Portworx plugin has a bad bug; it definitely should not be taking 15min to start!).

It's been a while since we've used Portworx but IIRC, when you run the portworx container on a machine that's never had it, it runs a full initialization process that involves unpacking an OCI image onto the host and also identifying and provisioning the disks that it was told to use for storage. I vaguely remember going through this process on our clients which were 8 core 32GB nodes and I remember the initial setup taking some minutes to actually become "ready" (it's not ready until some threshold of nodes have joined each other to create a cluster). I can totally see this taking quite a bit of time in more resource constrained environments (though, 15 minutes does seem like it would be on the far end of the spectrum lol).

That said, after the initial run on the host, subsequent runs/restarts of portworx take considerably less time to start up. It may be worth first running the Nomad job without the CSI registration set in the job spec to let it do all the initial start-up stuff and then once the cluster finishes coming online, stop the job and resubmit it with the CSI stanza specified so that it registers.

@tgross
Copy link
Member

tgross commented May 31, 2022

That sounds like a reasonable workaround @RickyGrassmuck. Thanks!

But we should probably make the plugin supervisor timeout (ref plugin_supervisor_hook.go#L254-L256) configurable by the end user. This has some unfortunate side-effects in terms of trying to reliably detect when the plugin is broken vs slow, but at least we can give the operator a knob to adjust.

@tgross tgross changed the title CSI Plugin on Nomad 1.3 have a very short timeout. Affects Portworx which can take 15 minutes to be installed. CSI: make readiness probe timeout configurable to support faulty plugins Jun 6, 2022
@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Jun 6, 2022
@tgross tgross assigned tgross and unassigned tgross Jun 6, 2022
@ggriffiths
Copy link
Contributor

@RickyGrassmuck Yes that's still accurate, thanks for adding that context here. It normally only takes a few minutes to pull the image/unpack/start PX up depending on network speed.

@tgross is anyone working on this issue? If not, I can assist here.

Where were you thinking of making the timeout configurable? In the CSI Plugin stanza?

@tgross
Copy link
Member

tgross commented Jun 10, 2022

@tgross is anyone working on this issue? If not, I can assist here.

It's on our short-term todo list but no one is immediately assigned to it. I'd be happy to review a PR if you're interested.

Where were you thinking of making the timeout configurable? In the CSI Plugin stanza?

Yeah, that seems like the right place for it and that struct is available where the readiness probe timeout happens without too much trouble.

@ggriffiths
Copy link
Contributor

Sounds good, I'll send a PR soon.

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Jun 14, 2022
@tgross tgross added this to the 1.3.2 milestone Jun 14, 2022
@tgross
Copy link
Member

tgross commented Jun 14, 2022

#13340 has been merged and will ship in Nomad 1.3.2. Thanks @ggriffiths!

@ggriffiths
Copy link
Contributor

ggriffiths commented Jun 14, 2022

Awesome! Can we backport this to 1.1.x and 1.2.x as well @tgross? We hit this on Nomad 1.1.14 originally in our lab and on 1.2.6 with a customer cluster.

@tgross
Copy link
Member

tgross commented Jun 14, 2022

We don't generally backport new features. But I get that it makes compatibility with Portworx tricky. Lemme see what it would take to get the backport thru (it's a headache because the code changes out from under us).

@ggriffiths
Copy link
Contributor

That makes sense. Hopefully we can get an exception here as the 30 second timeout was a breaking change for drivers that take longer than that to start up.

@tgross
Copy link
Member

tgross commented Jun 14, 2022

Ah yeah that's right, the timeout was backported, so that makes it a regression to fix. I'll make the automated backport, and then follow-up with a PR to fixup the changelog entry to make it clear we're fixing a regression.

@ggriffiths
Copy link
Contributor

Awesome, thank you @tgross!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

Successfully merging a pull request may close this issue.

4 participants