Cortex store gateway keeps going into crash/failure during startup #4993

Closed
ajcts opened this issue Nov 24, 2022 · 7 comments

@ajcts

ajcts commented Nov 24, 2022

Issue -

On Cortex hosted on AKS in a distributed environment, during startup or a new deployment, the store-gateway keeps going into CrashLoopBackOff and never really comes up. The store-gateway has a PVC mount and its associated blob storage, and is currently running with a replication factor of 3. We tuned the readiness/liveness probe timeouts to give the ring more time to become healthy, but it's not really helping.

All 3 instances eventually go into a crash loop.
When deployed fresh with the blob storage and PVC deleted, the store-gateway comes up normally without any issues. But in a shared cluster environment, this is not really a permanent option.

K8s events don't really help narrow down what makes the store-gateway (SG) fail, nor do the SG logs.

Infra - K8S istio environment
Arch - Microservices

@alanprot
Member

Is it going OOM or terminating for another reason? If it is not OOM, can you try to fetch the logs from the dead container (the -p option on kubectl logs)?
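
One way to answer the OOM question above without reading logs is to look at the last recorded termination state of each container in the pod status. The sketch below is only an illustration: it assumes the JSON produced by `kubectl get pod <pod-name> -n <namespace> -o json` and reads the standard `status.containerStatuses[].lastState.terminated` fields. The function name and the sample data (including the container name `store-gateway`) are hypothetical.

```python
import json


def last_terminations(pod_json: str):
    """Return (container, reason, exit_code) for every container
    that has a recorded previous termination."""
    pod = json.loads(pod_json)
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            results.append(
                (status["name"], terminated.get("reason"), terminated.get("exitCode"))
            )
    return results


# Hypothetical sample shaped like `kubectl get pod ... -o json` output.
sample = json.dumps({
    "status": {
        "containerStatuses": [
            {
                "name": "store-gateway",
                "restartCount": 12,
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    }
})

for name, reason, exit_code in last_terminations(sample):
    print(f"{name}: reason={reason} exit_code={exit_code}")
```

A reason of `OOMKilled` means the container was killed for exceeding its memory limit. Any other reason, such as `Error` with a non-zero exit code, points back to the previous container's logs.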

@ajcts
Author

ajcts commented Nov 25, 2022

Yes, I checked the dead container logs and it was not due to OOM, as we recently increased memory by a fair bit. The errors were more on the side of memberlist failures (relatively minimal though), with not much info on the cause of termination.

caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed"
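
When the previous container's logs are long, it can help to tally which component and message dominate before deciding what to look at. The sketch below is a rough illustration, not a full logfmt parser: it assumes each line is made of `key=value` pairs with optional double quotes, like the line above, and it will raise on lines with unbalanced quotes. Both function names are hypothetical.

```python
import shlex
from collections import Counter


def parse_logfmt(line: str) -> dict:
    """Split a logfmt-style line into a dict of key/value pairs."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value
            fields[key] = value
    return fields


def count_by_component(lines):
    """Count log lines per (component, msg) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        counts[(fields.get("component", ""), fields.get("msg", ""))] += 1
    return counts


line = 'caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed"'
for (component, msg), n in count_by_component([line, line]).most_common():
    print(f"{n}x {component}: {msg}")
```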

@alanprot
Member

Can you try to set the lazy load config to true?

  # If enabled, store-gateway will lazily memory-map an index-header only once
  # required by a query.
  # CLI flag: -blocks-storage.bucket-store.index-header-lazy-loading-enabled
  [index_header_lazy_loading_enabled: <boolean> | default = false]

And also bucket index?

  bucket_index:
    # True to enable querier and store-gateway to discover blocks in the storage
    # via bucket index instead of bucket scanning.
    # CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
    [enabled: <boolean> | default = false]

@ajcts
Author

ajcts commented Nov 25, 2022

Yes, we have these enabled too. Also, there is no definite pattern to these failures/crashes; they occur intermittently, but more often than not.

  bucket_index:
    enabled: true
    idle_timeout: 30m
    max_stale_period: 1h
  index_header_lazy_loading_enabled: true
  index_header_lazy_loading_idle_timeout: 20m

@yeya24
Contributor

yeya24 commented Nov 28, 2022

If there are not enough logs from the pods, could you please increase the log level to debug and try again?

@stale

stale bot commented Jun 18, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 18, 2023
@yeya24
Contributor

yeya24 commented Nov 10, 2023

After taking another look at this issue, I believe it is related to thanos-io/thanos#6509.

This bug caused the SG initial sync to take too much memory, which is totally unnecessary. The fix was included in the latest release RC, so I will close this issue. Feel free to try it out and let us know whether it works. https://github.com/cortexproject/cortex/releases/tag/v1.16.0-rc.0
