
Unhealthy compactors do not leave the ring after rollout #1081

Closed
gravelg opened this issue Oct 26, 2021 · 15 comments · Fixed by #1178

@gravelg

gravelg commented Oct 26, 2021

Describe the bug
When rolling out a new deployment of the compactors, some old instances will remain in the ring as Unhealthy. The only fix seems to be to port-forward one of the compactors and use the /compactor/ring page to "Forget" all the unhealthy instances.

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo in kubernetes (we have tried the 1.1 release but the issue persists with af34e132a1b8)
  2. Perform a rollout of the compactors

Expected behavior
The compactors from the previous deployment leave the ring correctly

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: kubectl apply

Additional Context
We do not see this happen all the time. On one of our similarly sized but less busy clusters, old compactors rarely stay in the ring after a rollout. On the busier cluster, we had 14 unhealthy compactors from a previous deployment still in the ring, out of 30 in the deployment.

Our tempo config for memberlist:

memberlist:
      abort_if_cluster_join_fails: false
      join_members:
        - tempo-gossip-ring
      dead_node_reclaim_time: 15s
      bind_addr: ["${POD_IP}"]

Sample logs from a compactor that stayed in the ring as unhealthy, from the moment where shutdown was requested:

level=info ts=2021-10-26T15:26:54.628754652Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2021-10-26T15:26:54.629120545Z caller=lifecycler.go:457 msg="lifecycler loop() exited gracefully" ring=compactor
level=info ts=2021-10-26T15:26:54.629162701Z caller=lifecycler.go:768 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=compactor
level=info ts=2021-10-26T15:26:55.632049563Z caller=lifecycler.go:509 msg="instance removed from the KV store" ring=compactor
level=info ts=2021-10-26T15:26:55.63212068Z caller=module_service.go:96 msg="module stopped" module=compactor
level=info ts=2021-10-26T15:26:55.632206675Z caller=module_service.go:96 msg="module stopped" module=overrides
level=info ts=2021-10-26T15:26:55.632206876Z caller=memberlist_client.go:572 msg="leaving memberlist cluster"
level=info ts=2021-10-26T15:26:55.632249778Z caller=module_service.go:96 msg="module stopped" module=store
level=warn ts=2021-10-26T15:27:05.735678769Z caller=memberlist_client.go:587 msg="broadcast messages left in queue" count=16 nodes=146
level=info ts=2021-10-26T15:27:07.192768676Z caller=module_service.go:96 msg="module stopped" module=memberlist-kv
level=info ts=2021-10-26T15:27:07.194383366Z caller=server_service.go:50 msg="server stopped"
level=info ts=2021-10-26T15:27:07.194466883Z caller=module_service.go:96 msg="module stopped" module=server
level=info ts=2021-10-26T15:27:07.19452497Z caller=app.go:271 msg="Tempo stopped"
level=info ts=2021-10-26T15:27:07.194539895Z caller=main.go:135 msg="Tempo running"

I was confused by that last Tempo running line but looking at the code in main.go, this seems normal.

@joe-elliott
Member

joe-elliott commented Oct 26, 2021

So we do not see this issue internally and we roll our compactors multiple times a week. Let's see if we can figure out what differences exist between our setups.

Some initial thoughts

  • We set neither bind_addr nor dead_node_reclaim_time. It seems unlikely, but perhaps leaving those at their defaults would help.
  • We do set abort_if_cluster_join_fails: false, but I don't think that would have an impact here.
  • If you do an nslookup of tempo-gossip-ring, does it include pods that aren't compactors? Ideally it should include ingesters, queriers and distributors.
  • What is the terminationGracePeriodSeconds for the compactors? Ours is 30s. It certainly looks like the compactor is shutting down cleanly from the logs, but it's probably worth a look.
  • Do you have a distributor ring? If you do, do you see this issue there?
  • What does your compactor ring config look like?

Apologies for the barrage of questions, but I'm just hunting around for something that would be different. Like I said, we haven't seen an issue like the one you are describing in months. I'm also mildly suspicious that if the memberlist members are rolling over too quickly, they are dropping the messages that indicate a given compactor is leaving. Perhaps try slow-rolling your pods and see if that has any impact.

I was confused by that last Tempo running line but looking at the code in main.go, this seems normal.

I am also confused why we have this :). We should remove it.

@gravelg
Author

gravelg commented Oct 26, 2021

Thank you for the quick response. Here is more info about our setup:

  • I'll try removing dead_node_reclaim_time, but we have had issues running without bind_addr: for some reason the pods crash with an error. I'll look into that more to make sure it's not causing problems.
  • The tempo-gossip-ring is a headless kubernetes service that includes all compactors, distributors, ingesters, queriers and query-frontend
  • terminationGracePeriodSeconds is unset for the deployment, which should use the default of 30 seconds for our setup.
  • We have also seen the issue very recently for the distributor ring.
  • We are doing a rollingUpdate for compactors with maxUnavailable: 1 and minReadySeconds: 10 (a sketch of the deployment is below the ring config). If there is anything else I can add to slow down the rollout, I'm happy to try it. In theory, the cluster seeing the issue more often is busier, so its compactors could already take longer to shut down than those in our less busy cluster where the issue is rare.

Our ring config for the distributors and compactors is the same:

  ring:
    kvstore:
      store: memberlist
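
For completeness, here is roughly the shape of our compactor Deployment (the name, image tag and config path below are illustrative; the rollout values are the ones described above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo-compactor                   # illustrative name
spec:
  replicas: 30
  minReadySeconds: 10                     # increase to slow the rollout further
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: tempo-compactor
  template:
    metadata:
      labels:
        app: tempo-compactor
    spec:
      terminationGracePeriodSeconds: 30   # unset in our manifests, so this is the default
      containers:
        - name: compactor
          image: grafana/tempo:1.1.0
          args: ["-target=compactor", "-config.file=/conf/tempo.yaml"]   # illustrative config path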

@james-callahan

I hit the same issue yesterday with a single-compactor deployment: I bumped resources.requests to give it extra CPU, and when the ReplicaSet deleted the old pod, it never got removed from the ring.

The tempo-gossip-ring is a headless kubernetes service that includes all compactors, distributors, ingesters, queriers and query-frontend

We also have this.

@joe-elliott
Member

It would also be interesting to get the value of the following metric. Be sure to scope this to your Tempo cluster, as other applications also emit it.

max(time() - cortex_ring_oldest_member_timestamp{state="ACTIVE", name="compactor"})

This will roughly tell us how quickly data is propagating in memberlist. If this value is very high we can make adjustments to more aggressively propagate data. Ours sits around 23s.
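
If you want to keep an eye on this over time, you can also wrap that query in a recording rule, roughly like this (the group and rule names are just placeholders; scope the selector to your cluster as noted above):

groups:
  - name: tempo-ring-propagation
    rules:
      - record: tempo:ring_propagation_delay_seconds:max
        expr: max(time() - cortex_ring_oldest_member_timestamp{state="ACTIVE", name="compactor"})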

@gravelg
Author

gravelg commented Oct 27, 2021

Seems this hovers around ~35s for us, but there are large spikes sometimes:
[screenshot: graph of the metric]

@joe-elliott
Member

joe-elliott commented Oct 27, 2021

That is quite high; ours never exceeds 30s. The following settings can be adjusted to increase propagation speed and reduce delay (defaults listed):

memberlist:
  retransmit_factor: 2        # your most powerful knob: it controls how many other nodes messages are sent to. increase to reduce propagation delay
  pull_push_interval: 30s     # decrease to reduce propagation delay
  gossip_interval: 1s         # decrease to reduce propagation delay
  gossip_nodes: 2             # increase to reduce propagation delay
  compression_enabled: false  # we turned this off because it cost CPU without giving much back in propagation speed. YMMV

Be warned that adjusting these settings will also increase CPU usage and network traffic between your Tempo components, so I would recommend keeping an eye on the relevant metrics as you make adjustments. In the end it may also make more sense to move to consul or etcd, as those are more reliable ways of persisting ring state.
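
For example, pointing the ring at etcd would look roughly like this (the endpoint is a placeholder, and since this goes through the shared kvstore config, double check the exact field names against the docs; consul is configured analogously via a consul block):

  ring:
    kvstore:
      store: etcd
      etcd:
        endpoints:
          - etcd.tempo.svc.cluster.local:2379   # placeholder endpoint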

@james-callahan

james-callahan commented Oct 27, 2021

[screenshot: graph of the metric]

also max by (job) (time() - cortex_ring_oldest_member_timestamp{namespace="tempo",state="ACTIVE"}):

[screenshot: graph of the same metric broken out by job]

@creinheimer

Hi there,

I am having pretty much the same issue here on rollouts. The graph for the mentioned metric sits around 30s.

The issue with unhealthy instances not leaving the ring also happens when a machine is interrupted or restarted (our Tempo 1.1.0 cluster runs on spot instances). The only way to remove them is to port-forward and click the Forget button.

Question: what would be the side effect of having a cronjob that cleans up the unhealthy ring members? A rough sketch of what I have in mind is below.

We also use Loki 2.3.0 for logs, and such ring problems have never happened there during rollouts/interruptions.
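
To make the question concrete, the kind of job I have in mind would look roughly like this. I am assuming the ring page accepts the same POST that the Forget button submits; the service name, port and instance id are placeholders, and discovering which instances are unhealthy is left out entirely:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: tempo-compactor-ring-cleanup      # placeholder name
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: forget
              image: curlimages/curl:7.79.1
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # POST the same form field the Forget button submits (assumption).
                  # UNHEALTHY_INSTANCE_ID would have to be discovered first, which is
                  # the part I am unsure about.
                  curl -s -X POST \
                    --data-urlencode "forget=${UNHEALTHY_INSTANCE_ID}" \
                    http://tempo-compactor:3200/compactor/ring
              env:
                - name: UNHEALTHY_INSTANCE_ID
                  value: ""                # placeholder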

@joe-elliott
Member

The issue with unhealthy instances not leaving the ring also happens when a machine interruption/restart happens (our tempo 1.1.0 cluster runs on spot instances). The only way to remove it is to port forwarding and clicking the Forget button.

This is expected. If a compactor disappears, it will not have time to gracefully exit the ring. On a normal rollout, however, this should not be occurring.

Question: What would be the side effect of having a cronjob that cleanup the unhealthy ring members?

There is a newer implementation of the ring that we could move to, which allows for automatically forgetting unhealthy members. Perhaps we could prioritize that for 1.3? I did not realize so many people were having these issues.

@joe-elliott joe-elliott added this to the v1.3 milestone Oct 28, 2021
@gravelg
Author

gravelg commented Oct 28, 2021

Hey Joe, I've been tweaking our memberlist settings slowly (our rollout takes quite a while), now setting these properties:

      retransmit_factor: 3
      pull_push_interval: 20s
      gossip_interval: 0.5s

We have also added a readinessProbe to the compactor deployment config in an effort to slow down the rollout even more.

These changes have not made a difference in the average value of cortex_ring_oldest_member_timestamp, nor in getting compactors to leave the ring automatically.

@joe-elliott
Member

So we have assigned this issue to 1.3 and will explore using the new ring, but in the interest of trying to recreate it, can people share how they are deploying: jsonnet, tempo-distributed helm, or custom?

I'd also be interested to know what backend people are using: s3, gcs, azure.

@gravelg
Author

gravelg commented Nov 1, 2021

We're using a custom deployment method (a homegrown templating mechanism) and gcs for storage.

@creinheimer

Hi Joe, we are using the following configuration:

  • helm tempo-distributed-0.9.14
  • aws s3 as storage

@joe-elliott
Member

So, we have just seen this for the first time. It occurred when we rolled the compactors twice within ~60s of each other. This is not something we would normally do, and it may be the cause of the issue. Is this consistent with what others are seeing?

@gravelg
Author

gravelg commented Dec 1, 2021

On our end, we see the issue when doing a single rollout of the compactors.
