
RabbitMQ unavailability halts processing of log_message_saving without triggering liveness probe failure #86

Open
cailyoung opened this issue Aug 2, 2023 · 2 comments


@cailyoung

Hi there; I posted in Slack about this and I think this is the right place for feedback?

We recently had a routine cluster operation that took down the RabbitMQ node holding the log_message_saving queue. The jobs service retried a few times, but eventually gave up and stopped the consumer.

Because this failure wasn't surfaced to the cluster, Kubernetes left the pod running as-is, and we ended up with 16 million messages in the log_message_saving queue as producers kept publishing.

Ideally, the liveness probe responder would check that this queue consumer is still running and report failure externally so that the cluster can restart the pod.
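To illustrate the idea, here is a minimal sketch of how a consumer-aware liveness check could work: the consumer loop "heartbeats" a watchdog, and an HTTP health endpoint returns 500 once the heartbeat goes stale, which a Kubernetes liveness probe would turn into a pod restart. All names here (ConsumerWatchdog, the endpoint, the timeout) are hypothetical, not the actual jobs-service implementation.

```python
# Hypothetical sketch: surface RabbitMQ consumer health to a Kubernetes
# liveness probe. The consumer loop calls beat() on each poll/message;
# if it stops (e.g. the broker node went away and retries were exhausted),
# the health endpoint starts returning 500 and the pod gets restarted.
import http.server
import threading
import time


class ConsumerWatchdog:
    """Tracks the last time the consumer loop made progress."""

    def __init__(self, stale_after_seconds=60.0):
        self.stale_after = stale_after_seconds
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self):
        # Called from the consumer loop whenever it polls or handles a message.
        with self._lock:
            self._last_beat = time.monotonic()

    def healthy(self):
        with self._lock:
            return time.monotonic() - self._last_beat < self.stale_after


def make_health_handler(watchdog):
    """Build a request handler whose GET response reflects consumer health."""

    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # Kubernetes liveness probe hits this endpoint; 500 => restart.
            self.send_response(200 if watchdog.healthy() else 500)
            self.end_headers()

        def log_message(self, *args):
            # Keep probe traffic out of the application logs.
            pass

    return Handler
```

The matching liveness probe would then simply GET this endpoint on a fixed port, with `failureThreshold` and `periodSeconds` tuned so transient broker hiccups don't cause restart loops.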

As an aside, I'm surprised this consumer lives in this service, since the jobs service cannot run replicated (I assume because the cleanup jobs would otherwise run in parallel), so we cannot scale out to recover from a large message backlog. Is there a reason it's in this service and not somewhere else?

@DzmitryHumianiuk
Member

@cailyoung Scaling the cleanup jobs doesn't sound feasible: it creates a lot of intersections with the data being deleted and would require synchronization. So it's easier for us to leave the cleanup jobs unscaled. An alternative would be to convert them into serverless calls.
