
Unhealthy compactors do not leave the ring after rollout #1081

Closed
gravelg opened this issue Oct 26, 2021 · 15 comments · Fixed by #1178

@gravelg

gravelg commented Oct 26, 2021

Describe the bug
When rolling out a new deployment of the compactors, some old instances will remain in the ring as Unhealthy. The only fix seems to be to port-forward one of the compactors and use the /compactor/ring page to "Forget" all the unhealthy instances.

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo in kubernetes (we have tried the 1.1 release but the issue persists with af34e132a1b8)
  2. Perform a rollout of the compactors

Expected behavior
The compactors from the previous deployment leave the ring correctly

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: kubectl apply

Additional Context
We do not see this happen all the time. On one of our similarly sized but less busy clusters, old compactors rarely stay in the ring after a rollout. On the busier cluster, we had 14 unhealthy compactors from a previous deployment still in the ring, out of 30 in the deployment.

Our tempo config for memberlist:

memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - tempo-gossip-ring
      dead_node_reclaim_time: 15s
      bind_addr: ["${POD_IP}"]

Sample logs from a compactor that stayed in the ring as unhealthy, from the moment where shutdown was requested:

level=info ts=2021-10-26T15:26:54.628754652Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2021-10-26T15:26:54.629120545Z caller=lifecycler.go:457 msg="lifecycler loop() exited gracefully" ring=compactor
level=info ts=2021-10-26T15:26:54.629162701Z caller=lifecycler.go:768 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=compactor
level=info ts=2021-10-26T15:26:55.632049563Z caller=lifecycler.go:509 msg="instance removed from the KV store" ring=compactor
level=info ts=2021-10-26T15:26:55.63212068Z caller=module_service.go:96 msg="module stopped" module=compactor
level=info ts=2021-10-26T15:26:55.632206675Z caller=module_service.go:96 msg="module stopped" module=overrides
level=info ts=2021-10-26T15:26:55.632206876Z caller=memberlist_client.go:572 msg="leaving memberlist cluster"
level=info ts=2021-10-26T15:26:55.632249778Z caller=module_service.go:96 msg="module stopped" module=store
level=warn ts=2021-10-26T15:27:05.735678769Z caller=memberlist_client.go:587 msg="broadcast messages left in queue" count=16 nodes=146
level=info ts=2021-10-26T15:27:07.192768676Z caller=module_service.go:96 msg="module stopped" module=memberlist-kv
level=info ts=2021-10-26T15:27:07.194383366Z caller=server_service.go:50 msg="server stopped"
level=info ts=2021-10-26T15:27:07.194466883Z caller=module_service.go:96 msg="module stopped" module=server
level=info ts=2021-10-26T15:27:07.19452497Z caller=app.go:271 msg="Tempo stopped"
level=info ts=2021-10-26T15:27:07.194539895Z caller=main.go:135 msg="Tempo running"

I was confused by that last Tempo running line but looking at the code in main.go, this seems normal.

@joe-elliott
Member

joe-elliott commented Oct 26, 2021

So we do not see this issue internally and we roll our compactors multiple times a week. Let's see if we can figure out what differences exist between our setups.

Some initial thoughts

  • We set neither bind_addr nor dead_node_reclaim_time. It seems unlikely, but perhaps leaving those at their defaults would help.
  • We do set abort_if_cluster_join_fails: false, but I don't think that would have an impact here.
  • If you do an nslookup of tempo-gossip-ring, does it include pods that aren't compactors? Ideally it should include ingesters, queriers and distributors.
  • What is the terminationGracePeriodSeconds for the compactors? Ours is 30s. It certainly looks like the compactor is shutting down cleanly from the logs, but it's probably worth a look.
  • Do you have a distributor ring? If you do, do you see this issue there?
  • What does your compactor ring config look like?

Apologies for the barrage of questions, but I'm just hunting around for something that would be different. Like I said, we haven't seen an issue like the one you are describing in months. I'm also mildly suspicious that if the memberlist members are rolling over too quickly, they are dropping the messages that indicate a given compactor is leaving. Perhaps try slow-rolling your pods and see if that has any impact.

I was confused by that last Tempo running line but looking at the code in main.go, this seems normal.

I am also confused why we have this :). We should remove it.

@gravelg
Author

gravelg commented Oct 26, 2021

Thank you for the quick response. Here is more info about our setup:

  • I'll try removing dead_node_reclaim_time, but we have had issues running without bind_addr: for some reason the pods crash with an error. I'll look into that more to make sure it's not causing problems.
  • The tempo-gossip-ring is a headless kubernetes service that includes all compactors, distributors, ingesters, queriers and query-frontend
  • terminationGracePeriodSeconds is unset for the deployment, which should use the default of 30 seconds for our setup.
  • We have also seen the issue very recently for the distributor ring.
  • We are doing a rollingUpdate for compactors with maxUnavailable: 1 and minReadySeconds: 10 (a sketch of the deployment is below the ring config). If there is anything else I can add to slow down the rollout, I'm happy to try it. In theory, the cluster seeing the issue more often is busier, so its compactors could already take longer to shut down than those in our less busy cluster where the issue is rare.

Our ring config for the distributors and compactors is the same:

  ring:
    kvstore:
      store: memberlist
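
For completeness, here is roughly the shape of our compactor Deployment (the name, image tag and config path below are illustrative; the rollout values are the ones described above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo-compactor                   # illustrative name
spec:
  replicas: 30
  minReadySeconds: 10                     # increase to slow the rollout further
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: tempo-compactor
  template:
    metadata:
      labels:
        app: tempo-compactor
    spec:
      terminationGracePeriodSeconds: 30   # unset in our manifests, so this is the default
      containers:
        - name: compactor
          image: grafana/tempo:1.1.0
          args: ["-target=compactor", "-config.file=/conf/tempo.yaml"]   # illustrative config path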

@james-callahan

I hit the same issue yesterday with a single-compactor deployment: I bumped resources.requests to give it extra CPU, and when the ReplicaSet deleted the old pod, it never got removed from the ring.

The tempo-gossip-ring is a headless kubernetes service that includes all compactors, distributors, ingesters, queriers and query-frontend

We also have this.

@joe-elliott
Member

It would also be interesting to get the value of the following metric. Be sure to scope this to your Tempo cluster, as other applications also emit it.

max(time() - cortex_ring_oldest_member_timestamp{state="ACTIVE", name="compactor"})

This will roughly tell us how quickly data is propagating in memberlist. If this value is very high we can make adjustments to more aggressively propagate data. Ours sits around 23s.
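
If you want to keep an eye on this over time, you can also wrap that query in a recording rule, roughly like this (the group and rule names are just placeholders; scope the selector to your cluster as noted above):

groups:
  - name: tempo-ring-propagation
    rules:
      - record: tempo:ring_propagation_delay_seconds:max
        expr: max(time() - cortex_ring_oldest_member_timestamp{state="ACTIVE", name="compactor"})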

@gravelg
Author

gravelg commented Oct 27, 2021

Seems this hovers around ~35s for us, but there are large spikes sometimes:
[screenshot: graph of the metric]

@joe-elliott
Member

joe-elliott commented Oct 27, 2021

That is quite high; ours never exceeds 30s. The following settings can be adjusted to increase propagation speed and reduce delay (defaults listed):

memberlist:
  retransmit_factor: 2        # your most powerful knob: it controls how many other nodes messages are sent to. increase to reduce propagation delay
  pull_push_interval: 30s     # decrease to reduce propagation delay
  gossip_interval: 1s         # decrease to reduce propagation delay
  gossip_nodes: 2             # increase to reduce propagation delay
  compression_enabled: false  # we turned this off because it cost CPU without giving much back in propagation speed. YMMV

Be warned that adjusting these settings will also increase CPU usage and network traffic between your Tempo components, so I would recommend keeping an eye on the relevant metrics as you make adjustments. In the end it may also make more sense to move to consul or etcd, as those are more reliable ways of persisting ring state.
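
For example, pointing the ring at etcd would look roughly like this (the endpoint is a placeholder, and since this goes through the shared kvstore config, double check the exact field names against the docs; consul is configured analogously via a consul block):

  ring:
    kvstore:
      store: etcd
      etcd:
        endpoints:
          - etcd.tempo.svc.cluster.local:2379   # placeholder endpoint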

@james-callahan

james-callahan commented Oct 27, 2021

[screenshot: graph of the metric]

also max by (job) (time() - cortex_ring_oldest_member_timestamp{namespace="tempo",state="ACTIVE"}):

[screenshot: graph of the same metric broken out by job]

@creinheimer

Hi there,

I am having pretty much the same issue here on rollouts. The graph for the mentioned metric sits around 30s.

The issue with unhealthy instances not leaving the ring also happens when a machine is interrupted or restarted (our Tempo 1.1.0 cluster runs on spot instances). The only way to remove them is to port-forward and click the Forget button.

Question: what would be the side effect of having a cronjob that cleans up the unhealthy ring members? A rough sketch of what I have in mind is below.

We also use Loki 2.3.0 for logs, and such ring problems have never happened there during rollouts/interruptions.
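
To make the question concrete, the kind of job I have in mind would look roughly like this. I am assuming the ring page accepts the same POST that the Forget button submits; the service name, port and instance id are placeholders, and discovering which instances are unhealthy is left out entirely:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: tempo-compactor-ring-cleanup      # placeholder name
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: forget
              image: curlimages/curl:7.79.1
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # POST the same form field the Forget button submits (assumption).
                  # UNHEALTHY_INSTANCE_ID would have to be discovered first, which is
                  # the part I am unsure about.
                  curl -s -X POST \
                    --data-urlencode "forget=${UNHEALTHY_INSTANCE_ID}" \
                    http://tempo-compactor:3200/compactor/ring
              env:
                - name: UNHEALTHY_INSTANCE_ID
                  value: ""                # placeholder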

@joe-elliott
Member

The issue with unhealthy instances not leaving the ring also happens when a machine interruption/restart happens (our tempo 1.1.0 cluster runs on spot instances). The only way to remove it is to port forwarding and clicking the Forget button.

This is expected. If a compactor disappears, it will not have time to gracefully exit the ring. On a normal rollout, however, this should not be occurring.

Question: What would be the side effect of having a cronjob that cleanup the unhealthy ring members?

There is a newer implementation of the ring that we could move to, which allows for automatically forgetting unhealthy members. Perhaps we could prioritize that for 1.3? I did not realize so many people were having these issues.

@joe-elliott joe-elliott added this to the v1.3 milestone Oct 28, 2021
@gravelg
Author

gravelg commented Oct 28, 2021

Hey Joe, I've been tweaking our memberlist settings slowly (our rollout takes quite a while), now setting these properties:

      retransmit_factor: 3
      pull_push_interval: 20s
      gossip_interval: 0.5s

We have also added a readinessProbe to the compactor deployment config in an effort to slow down the rollout even more.

These changes have not made a difference in the average value of cortex_ring_oldest_member_timestamp, nor in getting compactors to leave the ring automatically.

@joe-elliott
Member

So we have assigned this issue to 1.3 and will explore using the new ring, but in the interest of trying to recreate it, can people share how they are deploying: jsonnet, tempo-distributed helm, or custom?

I'd also be interested to know what backend people are using: s3, gcs, azure.

@gravelg
Author

gravelg commented Nov 1, 2021

We're using a custom deployment method (a homegrown templating mechanism) and gcs for storage.

@creinheimer

Hi Joe, we are using the following configuration:

  • helm tempo-distributed-0.9.14
  • aws s3 as storage

@joe-elliott
Member

So, we have just seen this for the first time. It occurred when we rolled the compactors twice within ~60s of each other. This is not something we would normally do, and it may be the cause of the issue. Is this consistent with what others are seeing?

@gravelg
Author

gravelg commented Dec 1, 2021

On our end, we see the issue when doing a single rollout of the compactors.
