
Refinery pods using LiveReload become unresponsive and eventually consume all CPU #836

Closed
jcmorrow opened this issue Aug 8, 2023 · 2 comments
Labels
type: bug Something isn't working

Comments

jcmorrow commented Aug 8, 2023

Versions

  • Go: This is embarrassing, but I'm not sure!
  • Refinery: 2.1.2

Steps to reproduce

  1. Deploy to Kubernetes using Helm (a rough sketch of the commands follows this list)
  2. Make a rules change with LiveReload enabled
  3. Wait somewhere between 45 minutes and an hour for your servers to explode 😉
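
For reference, here's a rough sketch of the kind of deploy involved; the chart repo alias, release name, namespace, and values file below are illustrative, not our exact setup:

```shell
# Add the Honeycomb chart repo (the "honeycomb" alias is an assumption)
helm repo add honeycomb https://honeycombio.github.io/helm-charts
helm repo update

# Redeploy with an edited rules file; with LiveReload enabled the pods
# pick up the new rules without restarting
helm upgrade --install refinery honeycomb/refinery -n refinery -f values.yaml
```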

Additional context

Important preface: I'm going to give some context around an outage here, but I want to be clear: I'm not creating this issue to complain about free software! I 💝 Refinery, I'm so happy we have it, and I'm very grateful for the work y'all put into creating and maintaining it! As with all outages we learned some good lessons, and ultimately I just want to make sure we understand what went wrong and how we and others can avoid it in the future!

We had an outage yesterday and while I'm still trying to piece together exactly what happened, I'm relatively certain that LiveReload'ing refinery pods were the main driver. Here's what I know:

  • We recently upgraded from version 1 to version 2.1.2 of Refinery.
  • Since then I've noticed that in both our pre-production and production environments LiveReload does not seem to work. When we run a deploy, the pods don't restart, but they also become unreachable (HTTP requests to the root just hang rather than returning a JSON message).
  • For the past few days I've just thought "Huh, that's weird", and restarted the pods.
  • Yesterday someone ran a refinery deploy and did not restart the pods.
  • About an hour later, an instance of one of our main services stopped responding. It was clear from our monitoring that the CPU on that node was near 100% for about 10-20 minutes. I manually restarted the node, and its CPU went back to normal.
  • Other services continued to behave as if they were under heavy load, even though I could not see any high CPU usage in our monitoring tools.
  • Eventually I noticed that data had stopped coming into Honeycomb about an hour before the outage began, and started coming in again after I hard-restarted the node.
  • We run three instances of Refinery in production. Based on what I was seeing, I theorized that there might be other Refinery pods on other nodes that also needed to be restarted, so I manually restarted all Refinery instances using kubectl (roughly the commands sketched after this list). After I did this, two nodes that had not been reporting for the past hour or so suddenly reported that they had been under near-100% load for the entire duration of the outage, and they recovered as soon as I restarted the Refinery pods on them.
  • I realize this is all a little confusing (it certainly took me a long time to piece together), but it seems like good evidence that there is some failure path where LiveReload can cause runaway CPU usage. This seems like it could be related to fix: live reload deadlock #810, but shouldn't we already have that patch since we are on 2.1.2?
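
Concretely, the manual recovery was roughly the following; the namespace and deployment name are placeholders that depend on how the chart was installed:

```shell
# Force a rolling restart of the Refinery deployment (names are assumptions)
kubectl -n refinery rollout restart deployment refinery

# Watch the replacement pods come up and confirm the nodes recover
kubectl -n refinery get pods -w
```
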
jcmorrow added the type: bug label Aug 8, 2023
@TylerHelmuth (Contributor)

@jcmorrow thank you for the issue. This definitely sounds like the same problem described in #807, which was indeed fixed in Refinery v2.1.0, released on Friday.

Also released on Friday was the Refinery Helm Chart v2.2.0. Don't be tricked by the similar version number: Helm charts and the applications they install are versioned and released independently. This lets the chart be released separately from the application it manages, which is useful when the chart itself has a bug fix or a new feature.

Refinery v2.1.0 fixed the deadlock issue and is used by default in the Refinery Helm Chart v2.2.0. Can you try out Refinery Helm Chart v2.2.0 and see if the issue persists?
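
In case it helps, here is one way to check which Refinery version a given chart release ships; the repo alias, release name, and namespace below are placeholders:

```shell
# Show the application version (appVersion) pinned by the chart
helm show chart honeycomb/refinery --version 2.2.0 | grep appVersion

# Or check what an existing release is actually running (see the APP VERSION column)
helm list -n refinery

# Upgrade to the chart version that installs Refinery v2.1.0 by default
helm upgrade refinery honeycomb/refinery -n refinery --version 2.2.0
```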

jcmorrow (Author) commented Aug 9, 2023

> Helm Charts and the applications they install are independently released

Ahhh, I was afraid that this might be the answer 🤦. Our Helm chart version was 2.1.2, which means our app version was 2.0.2. I'm a Helm noob; apologies for wasting your time, but thanks for the prompt and helpful response! 🥂

@jcmorrow jcmorrow closed this as completed Aug 9, 2023