
Devfile Registry deployment on minikube with helm keeps crash looping #1295

Closed
michael-valdron opened this issue Oct 18, 2023 · 3 comments · Fixed by devfile/registry-support#187
Labels: area/registry (Devfile registry for stacks and infrastructure), kind/bug (Something isn't working), severity/blocker (Issues that prevent developers from working)

Comments

michael-valdron (Member) commented Oct 18, 2023

Which area is this feature related to?

/kind bug

Which area is this bug related to?

/area registry

What versions of software are you using?

Go project

Operating System and version: N/A

Go Pkg Version: 1.18

Node.js project

Operating System and version: N/A

Node.js version: 18

Yarn version: 1.22.19

package.json: https://github.com/devfile/devfile-web/blob/91b745246e20f760efd74758022420d7302becf6/package.json

Web browser

Operating System and version: N/A

Browser name and version: N/A

Bug Summary

Describe the bug:

Deploying the devfile registry on minikube using the helm chart leaves the index server and registry viewer containers in a repeating CrashLoopBackOff state for over 10 minutes. This causes the devfile registry integration tests under registry-support, which wait at most 10 minutes for the deployment to become available, to time out.

To Reproduce:

  1. Start minikube v1.21.0 using Kubernetes v1.21.0 with default settings.
  2. If using docker, run the integration testing script: bash .ci/run_tests_minikube_linux.sh, then skip to step 5. Otherwise, follow steps 3-4 to run the deployment manually (for example, with podman or with the default next tag).
  3. Deploy devfile registry by running: helm install devfile-registry ./deploy/chart/devfile-registry --set global.ingress.domain=$(minikube ip).nip.io
    • Add --set devfileIndex.image=quay.io/<user>/devfile-index --set devfileIndex.tag=<tag_label> to specify your own image
  4. Immediately after deploying, run kubectl wait deploy/devfile-registry --for=condition=Available --timeout=600s to wait for available condition
    • Run the two commands as helm install ... & kubectl wait ... (backgrounding the install) to best simulate the timing of the script
  5. The wait process will fail with the reported error
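Taken together, steps 3-5 can be sketched as a single shell function. This is an illustrative sketch, not the actual CI script: the function name and the HELM/KUBECTL/MINIKUBE_IP overrides are assumptions added here so the timing pattern (backgrounded install, immediate wait) is explicit and the function can be exercised without a cluster.

```shell
# deploy_and_wait: mirrors the CI script's timing -- the helm install is
# backgrounded and kubectl wait starts immediately afterwards.
# HELM / KUBECTL / MINIKUBE_IP are hypothetical overrides for illustration;
# the defaults match the commands in the reproduction steps above.
deploy_and_wait() {
  helm_bin="${HELM:-helm}"
  kubectl_bin="${KUBECTL:-kubectl}"
  ip="${MINIKUBE_IP:-$(minikube ip)}"

  # Step 3: deploy the registry chart in the background.
  "$helm_bin" install devfile-registry ./deploy/chart/devfile-registry \
    --set global.ingress.domain="${ip}.nip.io" &

  # Step 4: the 600s limit is the 10-minute deployment budget CI enforces.
  "$kubectl_bin" wait deploy/devfile-registry \
    --for=condition=Available --timeout=600s
  status=$?

  wait              # reap the backgrounded helm install
  return "$status"  # step 5: non-zero here is the reported timeout failure
}
```

With the crash-looping containers, the kubectl wait call never sees the Available condition, so the function returns non-zero after the full 600 seconds.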

Expected behavior

Deploys successfully within 10 minutes without experiencing frequent CrashLoopBackOff states.

Any logs, error output, screenshots etc? Provide the devfile that sees this bug, if applicable

Full error log: devfile_registry_error.log

Error Message

+ kubectl wait deploy/devfile-registry --for=condition=Available --timeout=600s
error: timed out waiting for the condition on deployments/devfile-registry

Container State Details

Containers:
  devfile-registry:
    ...
    State:           Waiting
      Reason:        CrashLoopBackOff
    Last State:      Terminated
      Reason:        Error
      Exit Code:     137
      Started:       Wed, 18 Oct 2023 21:05:13 +0000
      Finished:      Wed, 18 Oct 2023 21:06:15 +0000
    Ready:           False
    Restart Count:   6
    ...
  registry-viewer:
    ...
    State:           Waiting
      Reason:        CrashLoopBackOff
    Last State:      Terminated
      Reason:        Error
      Exit Code:     134
      Started:       Wed, 18 Oct 2023 21:05:00 +0000
      Finished:      Wed, 18 Oct 2023 21:05:00 +0000
    Ready:           False
    Restart Count:   6
    ...

Events

  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  10m                    default-scheduler  Successfully assigned default/devfile-registry-54488859b-rzhnh to minikube
  Normal   Pulling    9m48s                  kubelet            Pulling image "quay.io/devfile/oci-registry:next"
  Normal   Pulled     9m48s                  kubelet            Successfully pulled image "quay.io/devfile/registry-viewer:next" in 12.152383211s
  Normal   Created    9m34s                  kubelet            Created container oci-registry
  Normal   Pulled     9m34s                  kubelet            Successfully pulled image "quay.io/devfile/oci-registry:next" in 14.156196105s
  Normal   Started    9m33s                  kubelet            Started container oci-registry
  Normal   Pulled     9m32s                  kubelet            Successfully pulled image "quay.io/devfile/registry-viewer:next" in 1.103481248s
  Normal   Created    9m32s (x2 over 9m48s)  kubelet            Created container registry-viewer
  Normal   Started    9m31s (x2 over 9m48s)  kubelet            Started container registry-viewer
  Warning  BackOff    9m30s (x2 over 9m31s)  kubelet            Back-off restarting failed container
  Warning  Unhealthy  9m28s (x3 over 9m30s)  kubelet            Startup probe failed: Get "http://172.17.0.4:3000/viewer": dial tcp 172.17.0.4:3000: connect: connection refused
  Normal   Killing    9m28s                  kubelet            Container devfile-registry failed startup probe, will be restarted
  Normal   Pulling    8m58s (x3 over 10m)    kubelet            Pulling image "quay.io/devfile/registry-viewer:next"
  Normal   Started    8m58s (x2 over 10m)    kubelet            Started container devfile-registry
  Normal   Created    8m58s (x2 over 10m)    kubelet            Created container devfile-registry
  Normal   Pulled     4m50s (x6 over 10m)    kubelet            Container image "devfile-index:latest" already present on machine

Additional context

Any workaround?

Increase the timeout limit of the integration tests; however, this does not solve the underlying problem of the devfile registry taking over 10 minutes to deploy.

Suggestion on how to fix the bug

Unknown at this time.

@michael-valdron michael-valdron added the severity/blocker Issues that prevent developers from working label Oct 18, 2023
@openshift-ci openshift-ci bot added kind/bug Something isn't working area/registry Devfile registry for stacks and infrastructure labels Oct 18, 2023
michael-valdron (Member, Author) commented:

Might block #1197 if this bug is not fixed by the time of review testing.

thepetk (Contributor) commented Oct 19, 2023

After some investigation, I found the following:

Cause of the Failure

The CI check was failing with a CrashLoopBackOff error, caused by the registry-viewer container never starting: Node.js aborted during startup and the container hung (related issue: nodejs/node#48444):

 1: 0x55dd0567ab94 node::Abort() [node]
 2: 0x55dd0567aed1 node::Assert(node::AssertionInfo const&) [node]
 3: 0x55dd056fb06c node::WorkerThreadsTaskRunner::WorkerThreadsTaskRunner(int) [node]
 4: 0x55dd056fb1d7 node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
 5: 0x55dd0563656c node::V8Platform::Initialize(int) [node]
 6: 0x55dd05632a0b node::InitializeOncePerProcess(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, node::ProcessFlags::Flags) [node]
 7: 0x55dd05632e8e node::Start(int, char**) [node]
 8: 0x7f0af3764eb0  [/lib64/libc.so.6]
 9: 0x7f0af3764f60 __libc_start_main [/lib64/libc.so.6]
10: 0x55dd055a0545 _start [node]

Proposed fix

A first approach would be to update the GitHub Action that sets up minikube (manusa/actions-setup-minikube) so that we can update the minikube and Kubernetes versions used, as this error is fixed in later versions.

Another improvement would be to increase the memory in the minikube start args to 4 GB.
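Both changes would land in the workflow's action configuration. A sketch of what that step might look like, where the action pin, the version numbers, and the memory flag are all assumptions for illustration (the `with:` keys follow manusa/actions-setup-minikube's documented inputs):

```yaml
- name: Setup Minikube
  uses: manusa/actions-setup-minikube@v2.9.0   # pinned tag assumed
  with:
    minikube version: 'v1.31.2'    # assumed newer release carrying the fix
    kubernetes version: 'v1.27.4'  # assumed; anything past the v1.21.0 line
    github token: ${{ secrets.GITHUB_TOKEN }}
    start args: '--memory=4gb'     # the 4 GB bump suggested above
```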

More detailed logging

Since the CrashLoopBackOff is not related to the kubectl wait command itself, I think we could add more detailed logging, using the kubectl logs command, for the case where the kubectl wait conditions are not met. An example of more detailed logging can be found here
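A minimal sketch of that logging fallback, assuming a wrapper function in the CI script (the function name and the KUBECTL override are hypothetical; the kubectl commands match those used elsewhere in this issue):

```shell
# wait_or_dump_logs: run `kubectl wait` as before, but on timeout dump
# recent logs from every container in the deployment so the CI output
# shows *why* the pod crash-looped instead of only "timed out waiting".
# KUBECTL is a hypothetical override so the function can be tested
# without a cluster; it defaults to the real kubectl binary.
wait_or_dump_logs() {
  kubectl_bin="${KUBECTL:-kubectl}"
  if "$kubectl_bin" wait deploy/devfile-registry \
      --for=condition=Available --timeout=600s; then
    echo "deployment available"
  else
    echo "wait timed out; dumping recent container logs" >&2
    "$kubectl_bin" logs deploy/devfile-registry --all-containers --tail=100 >&2
    return 1
  fi
}
```

In the failure reported here, the dumped logs would have surfaced the Node.js abort stack trace from registry-viewer directly in the CI output.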

thepetk (Contributor) commented Oct 19, 2023

I've created a PR with the proposed workaround. I've assigned the issue to myself, and since it already has a PR I've removed the refinement date, added it to the current sprint (because it is a blocker), and story-pointed it based on the time/complexity spent.

Projects
Status: Done ✅