
Workspace is not starting, when k8s StorageClass has volumeBindingMode=WaitForFirstConsumer #12889

Closed
rhopp opened this issue Mar 14, 2019 · 11 comments

@rhopp
Contributor

rhopp commented Mar 14, 2019

Description

When the default StorageClass is configured with volumeBindingMode set to WaitForFirstConsumer, workspaces do not start. I guess that's because Che waits for the PVC to reach the "Bound" state before creating the workspace (or mkdir) pod. With this volumeBindingMode the PVC never binds on its own: it stays in the "Pending" state with the message "waiting for first consumer to be created before binding" until the workspace startup fails.

Reproduction Steps

  1. Create a default StorageClass with the WaitForFirstConsumer volumeBindingMode (more info here; a minimal example is sketched below)
  2. Try to start a workspace
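
For reference, a StorageClass of this kind looks roughly like this (the name and provisioner below are just an illustration, not the exact ones from my cluster):

```yaml
# Hypothetical default StorageClass; name and provisioner are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs        # any topology-aware provisioner
volumeBindingMode: WaitForFirstConsumer   # PVCs stay Pending until a pod uses them
```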

Workspace startup fails with message:

Error: Failed to run the workspace: "Waiting for persistent volume claim 'claim-che-workspace' reached timeout"

And error log in che-master:

2019-03-14 14:06:31,927[aceSharedPool-0]  [ERROR] [o.e.c.a.w.s.WorkspaceRuntimes 813]   - Waiting for persistent volume claim 'claim-che-workspace' reached timeout
org.eclipse.che.api.workspace.server.spi.InternalInfrastructureException: Waiting for persistent volume claim 'claim-che-workspace' reached timeout
	at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesPersistentVolumeClaims.wait(KubernetesPersistentVolumeClaims.java:225)
	at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesPersistentVolumeClaims.waitBound(KubernetesPersistentVolumeClaims.java:165)
	at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.pvc.CommonPVCStrategy.prepare(CommonPVCStrategy.java:200)
	at org.eclipse.che.workspace.infrastructure.kubernetes.KubernetesInternalRuntime.internalStart(KubernetesInternalRuntime.java:200)
	at org.eclipse.che.api.workspace.server.spi.InternalRuntime.start(InternalRuntime.java:141)
	at org.eclipse.che.api.workspace.server.WorkspaceRuntimes$StartRuntimeTask.run(WorkspaceRuntimes.java:779)
	at org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:38)
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

OS and version:
This is reproducible with CodeReady Workspaces on OCP 4 Beta and with the latest Eclipse Che on pure Kubernetes (minikube).

@rhopp rhopp added team/platform kind/bug Outline of a bug - must adhere to the bug report template. labels Mar 14, 2019
@slemeur slemeur added severity/blocker Causes system to crash and be non-recoverable or prevents Che developers from working on Che code. severity/P1 Has a major impact to usage or development of the system. and removed severity/blocker Causes system to crash and be non-recoverable or prevents Che developers from working on Che code. labels May 17, 2019
@gazarenkov gazarenkov removed their assignment May 22, 2019
@slemeur slemeur added severity/blocker Causes system to crash and be non-recoverable or prevents Che developers from working on Che code. and removed severity/P1 Has a major impact to usage or development of the system. labels May 23, 2019
@sleshchenko sleshchenko added status/in-progress This issue has been taken by an engineer and is under active development. team/platform labels May 23, 2019
@dmytro-ndp
Contributor

@sleshchenko, @skabashnyuk: do you know how much time the fix could take?

@sleshchenko
Member

@dmytro-ndp Like a couple of hours

@sleshchenko
Member

#13409 implements a workaround: a configuration property that makes it possible to manually disable waiting for PVCs.
A solution with better UX would be for the Che Server to detect volumeBindingMode=WaitForFirstConsumer itself and skip waiting for PVCs to be bound in that case.
I'm not deeply familiar with the WaitForFirstConsumer volume binding mode, and maybe the Che Server has no permission to check it at all; in that case, before waiting for PVCs, the Che Server could start a tooling pod that requests the PVCs, wait for the PVCs to be bound, and only then create all the other workspace pods.
In any case, some investigation is needed to settle on the final solution for this issue.

@sleshchenko
Member

There is a bit more info about this issue I would like to share:

Initially we did not wait for PVCs to be bound after creating them, but then we hit an issue on a slow OpenShift cluster (see #11848 for details).

So we introduced waiting for PVCs; it costs almost nothing on a fast OpenShift installation and helps slow ones.

Now we have hit an issue where the WaitForFirstConsumer volume binding mode conflicts with waiting for PVCs to be bound.
I've checked and now know for sure that the Che Server cannot check the volumeBindingMode of the storage (default or configured) that will be used for PVCs without the cluster-admin role, which is not available to it.

Possible solutions we could move forward with:

  1. Merge a PR that adds the ability to enable/disable waiting for PVCs to be bound.
    Use false (do not wait) as the default value; it should work fine for all fast-enough OpenShift installations ("fast enough" is an abstract term, maybe some investigation is needed).
    The Che Docs would describe the possible FailedScheduling issue and provide the configuration that should be applied to Che on such a slow K8s/OpenShift installation (enable waiting for PVCs to be bound).
    WaitForFirstConsumer works fine without any additional Che configuration as long as the OpenShift installation is not slow.

  2. Move forward with making the Che Server handle waitForFirstConsumer by itself.
    The only solution I see here is to start an additional pod before the workspace starts, to simulate the first consumer (a rough sketch of such a pod is shown after this list). Potential issues here:

  • it may not work for slow installations like the one Eugene had, since as he mentioned:

    This cannot be fixed by removing this event from unrecoverable events in Che conf, since the error happens on k8s level - Che is attempting to create a deployment with a pod spec that uses unbound pvc.

    But I did not investigate this topic myself.

  • it would require additional time during the first start of a workspace even if WaitForFirstConsumer is not configured. I assume it's a couple of seconds if the image is already pulled.

  • the solution sounds simple in general, but technically there are some difficulties: we already run an additional tooling pod for subpath pre-creation, and ideally it should be reused instead of running yet another tooling pod. But if subpath pre-creation is disabled, an additional pod for simulating the first consumer would still have to be run. The estimate is definitely more than 1-2 days.
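
A rough sketch of what such a first-consumer pod could look like (the pod name, image, and mount path are illustrative; it just mounts the common claim-che-workspace PVC and exits):

```yaml
# Sketch only: a short-lived pod that references the workspace PVC so that a
# WaitForFirstConsumer StorageClass binds it; names and image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: pvc-first-consumer
spec:
  restartPolicy: Never
  containers:
    - name: bind
      image: busybox
      command: ["true"]        # exit immediately; scheduling the pod is enough to trigger binding
      volumeMounts:
        - name: workspace-data
          mountPath: /projects
  volumes:
    - name: workspace-data
      persistentVolumeClaim:
        claimName: claim-che-workspace
```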

@l0rd @slemeur I would like to hear your opinion on this: do you think the first way is good enough, or should we invest time in the second solution before GA?

@dmytro-ndp
Contributor

dmytro-ndp commented May 27, 2019

IMHO, let's have CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND=false by default with as clear an error message as possible, maybe with advice to set it to true, and we will check it carefully against OCP before the release of CRW 1.2.

@rhopp
Contributor Author

rhopp commented May 28, 2019

I would go with CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND=true to keep this behavior as the default on Che installations (i.e. don't change anything "by default").

And on the CRW side (for the 1.2 release) I think we should implement logic in the operator that inspects the default StorageClass (the operator has cluster-admin rights, so it should be able to see whether WaitForFirstConsumer is configured) and, based on that, sets the CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND property to true/false. WDYT @davidfestal?

For Che 7/CRW 2 it would be nice to have this implemented in the Che Server, but I don't have any strong opinion on the proposed solution with the "artificial" pod.

@l0rd l0rd mentioned this issue May 28, 2019
@l0rd
Contributor

l0rd commented May 28, 2019

I think that setting CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND=true as suggested by @rhopp is the way to go in the short term.

Using an init container looks like overkill. And using the operator to detect the Volume Binding Mode doesn't look simple either: theoretically the wsmaster and the workspace pods can be bound to PVs with different volume binding modes (che.osio is an example we all know).

Something that I don't understand is why we can't just infer that we are in WaitForFirstConsumer mode at runtime if we intercept an event message that says waiting for first consumer to be created before binding.
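
For reference, the event in question, as the persistentvolume-controller reports it, looks roughly like this (namespace and claim name are illustrative):

```yaml
# Approximate shape of the event emitted while a WaitForFirstConsumer PVC
# has no consumer yet; only the relevant fields are shown.
apiVersion: v1
kind: Event
metadata:
  namespace: workspace-namespace       # illustrative
involvedObject:
  kind: PersistentVolumeClaim
  name: claim-che-workspace
type: Normal
reason: WaitForFirstConsumer
source:
  component: persistentvolume-controller
message: waiting for first consumer to be created before binding
```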

@sleshchenko
Member

Something that I don't understand is why we can't just infer that we are in WaitForFirstConsumer mode at runtime if we intercept an event message that says waiting for first consumer to be created before binding.

I missed this event. Checking an event message is not fully reliable, but I like this proposal and think it will improve Che Server behavior :+1:

@sleshchenko
Member

The Che Server can now be configured, via a configuration property, not to wait for PVCs to be bound, which unblocks Che on installations where the waitForFirstConsumer PV binding mode is configured.

But there is another issue[1] to improve the PVC waiting process: in addition to checking the PVC status, listen for PVC-related events with the message "waiting for first consumer to be created before binding".
That should make reconfiguring the Che Server unnecessary in the waitForFirstConsumer case.

[1] #13437

@sleshchenko sleshchenko removed the status/in-progress This issue has been taken by an engineer and is under active development. label May 29, 2019
@nickboldt nickboldt added this to the 7.0.0 milestone Aug 1, 2019
@bryantson

Where is CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND set? I cannot find it in the ConfigMap.

@sleshchenko
Member

@bryantson It should not be needed after #14239.
BTW, the default value is in the che.properties file that is bundled with the Che Server, and you can override it by providing a CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND value in the config map (a minimal sketch is shown below).
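
A minimal sketch of such an override, assuming the config map is named che and deployed in the che namespace (both depend on the installation):

```yaml
# Sketch: overriding the property through Che's config map; the map name and
# namespace are assumptions and vary per deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: che
  namespace: che
data:
  CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND: "false"   # do not wait for PVCs to be bound
```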
