Unhealthy compactors do not leave the ring after rollout #1081
Comments
So we do not see this issue internally, and we roll our compactors multiple times a week. Let's see if we can figure out what differences exist between our setups. Some initial thoughts and questions:
Apologies for the barrage of questions, but I'm just hunting around for something that would be different. Like I said, we haven't seen an issue like the one you are describing in months. I'm also mildly suspicious that if the memberlist members are rolling over too fast, they are dropping the messages that indicate that a given compactor is leaving. Perhaps just try slow-rolling your pods and see if it has any impact.
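For anyone who wants to try that, here is a minimal sketch of a Deployment that rolls one compactor at a time with a pause between replacements. The name, labels, image tag, replica count, and timings are illustrative assumptions, not taken from any setup in this thread:

```yaml
# Illustrative sketch only: slow the rollout so each compactor has time to
# gossip its departure before the next pod is replaced.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: compactor                        # assumed name
spec:
  replicas: 3                            # assumed count
  minReadySeconds: 60                    # wait after each new pod is Ready before continuing
  selector:
    matchLabels:
      app: compactor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                        # bring up one replacement at a time
      maxUnavailable: 0                  # never take an extra compactor down mid-roll
  template:
    metadata:
      labels:
        app: compactor
    spec:
      terminationGracePeriodSeconds: 60  # give the process time to leave the ring
      containers:
        - name: compactor
          image: grafana/tempo:1.1.0     # assumed tag
          args:
            - -target=compactor
            - -config.file=/conf/tempo.yaml   # config volume omitted for brevity
```

With maxUnavailable: 0 and maxSurge: 1, Kubernetes replaces one pod at a time, and minReadySeconds forces a gap between replacements, which gives memberlist more time to propagate each leave.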
I am also confused about why we have this :). We should remove it.
Thank you for the quick response. Here is more info about our setup:
Our ring config for the distributors and compactors is the same:
I hit the same issue yesterday with a single-compactor deployment: I bumped resources.requests to give it extra CPU, and when the ReplicaSet deleted the old pod, it never got removed from the ring.
We also have this.
It would also be interesting to get the value of the following metric. Be sure to scope it to your Tempo cluster, as other applications emit this metric as well.
This will roughly tell us how quickly data is propagating in memberlist. If this value is very high, we can make adjustments to propagate data more aggressively. Ours sits at around 23s.
That is quite high; ours never exceeds 30s. The following settings can be adjusted to increase propagation speed and reduce delay (see the sketch below for the kind of knobs involved):
Be warned that adjusting these settings will also increase CPU usage and traffic between your Tempo components, so I would recommend keeping an eye on the relevant metrics as you make adjustments. Also, in the end it may make more sense to move to Consul or etcd, as those are more reliable methods of persisting ring state.
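For reference, here is a rough sketch of the kind of memberlist knobs being referred to, as they would appear in the Tempo config. The values shown are illustrative assumptions rather than authoritative defaults:

```yaml
memberlist:
  gossip_interval: 200ms    # lower this to gossip more often
  gossip_nodes: 3           # raise this to gossip to more peers per interval
  retransmit_factor: 4      # raise this to retransmit each broadcast more times
  pull_push_interval: 30s   # lower this to run full state syncs more frequently
```

Moving any of these in the direction noted speeds up propagation at the cost of the extra CPU and network traffic mentioned above.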
Hi there, I am having pretty much the same issue on rollouts. The graph for the mentioned metric sits around 30s. The issue of unhealthy instances not leaving the ring also happens when a machine interruption/restart occurs (our Tempo 1.1.0 cluster runs on spot instances); the only way to remove them is to port-forward and click the Forget button. Question: what would be the side effects of having a cronjob that cleans up the unhealthy ring members? We also use Loki 2.3.0 for logs, and such ring problems have never happened there during rollouts/interruptions.
This is expected: if a compactor disappears, it will not have time to gracefully exit the ring. On a normal rollout, however, this should not be occurring.
There is a newer implementation of the ring that we could move to, which allows unhealthy members to be forgotten automatically. Perhaps we could prioritize that for 1.3? I did not realize so many people were having these issues.
Hey Joe, I've been tweaking our memberlist settings slowly (our rollout takes quite a while) and am now setting the properties you suggested, plus one further change on top of that. None of this has made a difference to the average value of the propagation metric discussed above.
So we have assigned this issue to 1.3 and will explore using the new ring, but in the interest of trying to recreate it, can people share how they are deploying: jsonnet, tempo-distributed helm, or custom? I'd also be interested to know which backend people are using: s3, gcs, or azure.
We're using a custom deployment method (a homegrown templating mechanism) and gcs for storage.
Hi Joe, we are using the following configuration:
So, we have just seen this for the first time. It occurred when we rolled the compactors twice within ~60s of each other. This is not something we would normally do, and it may be the cause of the issue. Is this consistent with what others are seeing?
On our end, we see the issue when doing a single rollout of the compactors.
Describe the bug
When rolling out a new deployment of the compactors, some old instances will remain in the ring as Unhealthy. The only fix seems to be to port-forward to one of the compactors and use the /compactor/ring page to "Forget" all the unhealthy instances.
To Reproduce
Steps to reproduce the behavior:
1. Run Tempo (af34e132a1b8)
2. Roll out a new deployment of the compactors
Expected behavior
The compactors from the previous deployment leave the ring correctly.
Environment:
- Kubernetes, deployed with kubectl apply
Additional Context
We do not see this happen all the time. On one of our similarly sized but less busy clusters, old compactors rarely stay in the ring after a rollout. On the busier cluster, we had 14 unhealthy compactors from a previous deployment still in the ring, out of 30 in the deployment.
Our tempo config for memberlist:
Sample logs from a compactor that stayed in the ring as unhealthy, from the moment where shutdown was requested:
I was confused by that last "Tempo running" line, but looking at the code in main.go, this seems normal.