
Refinery pods using LiveReload become unresponsive and eventually consume all CPU #836

Closed
jcmorrow opened this issue Aug 8, 2023 · 2 comments
Labels
type: bug Something isn't working

Comments

jcmorrow commented Aug 8, 2023

Versions

  • Go: This is embarrassing, but I'm not sure!
  • Refinery: 2.1.2

Steps to reproduce

  1. Deploy to Kubernetes using Helm (a rough sketch of the commands follows this list)
  2. Make a rules change with LiveReload enabled
  3. Wait somewhere between 45 minutes and an hour for your servers to explode 😉
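
For reference, here's a rough sketch of the kind of deploy involved; the chart repo alias, release name, namespace, and values file below are illustrative, not our exact setup:

```shell
# Add the Honeycomb chart repo (the "honeycomb" alias is an assumption)
helm repo add honeycomb https://honeycombio.github.io/helm-charts
helm repo update

# Redeploy with an edited rules file; with LiveReload enabled the pods
# pick up the new rules without restarting
helm upgrade --install refinery honeycomb/refinery -n refinery -f values.yaml
```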

Additional context

Important preface: I'm going to give some context around an outage here, but I want to be clear: I'm not creating this issue to complain about free software! I 💝 Refinery, I'm so happy we have it, and I'm very grateful for the work y'all put into creating and maintaining it! As with all outages we learned some good lessons, and ultimately I just want to make sure we understand what went wrong and how we and others can avoid it in the future!

We had an outage yesterday and while I'm still trying to piece together exactly what happened, I'm relatively certain that LiveReload'ing refinery pods were the main driver. Here's what I know:

  • We recently upgraded from version 1 to version 2.1.2 of Refinery.
  • Since then I've noticed that in both our pre-production and production environments LiveReload does not seem to work. When we run a deploy, the pods don't restart, but they also become unreachable (HTTP requests to the root just hang rather than returning a JSON message).
  • For the past few days I've just thought "Huh, that's weird", and restarted the pods.
  • Yesterday someone ran a refinery deploy and did not restart the pods.
  • About an hour later, an instance of one of our main services stopped responding. It was clear from our monitoring that the CPU on that node was near 100% for about 10-20 minutes. I manually restarted the node, and its CPU went back to normal.
  • Other services continued to behave as if they were under heavy load, even though I could not see any high CPU usage in our monitoring tools.
  • Eventually I noticed that data had stopped coming into Honeycomb about an hour before the outage began, and started coming in again after I hard-restarted the node.
  • We run three instances of Refinery in production. Based on what I was seeing, I theorized that there might be other Refinery pods on other nodes that also needed to be restarted, so I manually restarted all Refinery instances using kubectl (roughly the commands sketched after this list). After I did this, two nodes that had not been reporting for the past hour or so suddenly reported that they had been under near-100% load for the entire duration of the outage, and they recovered as soon as I restarted the Refinery pods on them.
  • I realize this is all a little confusing (it certainly took me a long time to piece together), but it seems like good evidence that there is some failure path where LiveReload can cause runaway CPU usage. This seems like it could be related to fix: live reload deadlock #810, but shouldn't we already have that patch since we are on 2.1.2?
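
Concretely, the manual recovery was roughly the following; the namespace and deployment name are placeholders that depend on how the chart was installed:

```shell
# Force a rolling restart of the Refinery deployment (names are assumptions)
kubectl -n refinery rollout restart deployment refinery

# Watch the replacement pods come up and confirm the nodes recover
kubectl -n refinery get pods -w
```
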
jcmorrow added the type: bug label Aug 8, 2023
@TylerHelmuth (Contributor)

@jcmorrow thank you for the issue. This definitely sounds like the same problem described in #807, which was indeed fixed in Refinery v2.1.0, released on Friday.

Also released on Friday was the Refinery Helm Chart v2.2.0. Don't be tricked by the similar version number: Helm charts and the applications they install are versioned and released independently. This lets the chart be released separately from the application it manages, which is useful when the chart itself has a bug fix or a new feature.

Refinery v2.1.0 fixed the deadlock issue and is used by default in the Refinery Helm Chart v2.2.0. Can you try out Refinery Helm Chart v2.2.0 and see if the issue persists?
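
In case it helps, here is one way to check which Refinery version a given chart release ships; the repo alias, release name, and namespace below are placeholders:

```shell
# Show the application version (appVersion) pinned by the chart
helm show chart honeycomb/refinery --version 2.2.0 | grep appVersion

# Or check what an existing release is actually running (see the APP VERSION column)
helm list -n refinery

# Upgrade to the chart version that installs Refinery v2.1.0 by default
helm upgrade refinery honeycomb/refinery -n refinery --version 2.2.0
```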

jcmorrow (Author) commented Aug 9, 2023

> Helm Charts and the applications they install are independently released

Ahhh, I was afraid that this might be the answer 🤦. Our Helm chart version was 2.1.2, which means our app version was 2.0.2. I'm a Helm noob; apologies for wasting your time, but thanks for the prompt and helpful response! 🥂

@jcmorrow jcmorrow closed this as completed Aug 9, 2023