Describe the bug
When a scale-up of store-gateway pods is followed by a scale-down, memberlist entries of the deleted store-gateway pods sporadically reappear after a few hours as unhealthy in the memberlist ring.
The system doesn't recover from these ghost entries; they appear and disappear at random.
In our case we scaled from 12 to 80 and back to 12, but this also happens with smaller scale-ups.
We verified that each unhealthy entry reported by the metrics references a store-gateway pod that no longer exists.
This is indicated in the logs by messages like:
msg="auto-forgetting instance from the ring because it is unhealthy for a long time" instance=store-gateway-15
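For anyone trying to confirm the same symptom, the check we did by hand can be sketched in Go as below. This is only an illustrative helper, not part of Cortex: the function name `ghostEntries` is hypothetical, the member IDs are assumed to be copied manually from the ring status page or from log lines, and it assumes StatefulSet-style names of the form `store-gateway-<ordinal>`, so that any ordinal at or above the current replica count belongs to a deleted pod.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// ghostEntries returns the ring member IDs whose ordinal suffix falls outside
// the range [0, replicas) of the current deployment.
// Assumption: IDs follow the "<prefix><ordinal>" pattern, e.g. "store-gateway-15".
// IDs that do not match the pattern are ignored.
func ghostEntries(members []string, prefix string, replicas int) []string {
	var ghosts []string
	for _, id := range members {
		if !strings.HasPrefix(id, prefix) {
			continue // not a store-gateway entry
		}
		ordinal, err := strconv.Atoi(strings.TrimPrefix(id, prefix))
		if err != nil {
			continue // suffix is not a plain number
		}
		if ordinal < 0 || ordinal >= replicas {
			ghosts = append(ghosts, id)
		}
	}
	sort.Strings(ghosts)
	return ghosts
}

func main() {
	// Hypothetical ring contents after scaling 12 -> 80 -> 12.
	members := []string{
		"store-gateway-0",
		"store-gateway-11",
		"store-gateway-15",
		"store-gateway-42",
	}
	fmt.Println(ghostEntries(members, "store-gateway-", 12))
}
```

With 12 replicas, `store-gateway-0` through `store-gateway-11` are expected ring members, so anything else the helper returns is a candidate ghost entry.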
To Reproduce
Steps to reproduce the behavior:
Start Cortex, using memberlist for store-gateway ring (efd1de4)
Expected behavior
The Cortex store-gateway ring status history, inspected over the lifetime of the cluster, shouldn't contain unhealthy entries for deleted store-gateway pods.
Environment:
Infrastructure: Kubernetes
Deployment tool: helm, custom chart
Storage Engine
Blocks
Chunks
Additional Context
#3603 was a PR to fix it, but it seems it doesn't cover some edge cases.