Add connection recovery to rmq on publish #180
Is there a way to write a test for this without having to actually run it for 30 minutes?
@wigging Thanks for the review. Looking at it...
- Add consumer recovery for general connection errors.
- Propagate errors for non-recovering cases.
- TODO: Add tests.
I was able to reproduce the problem with the following steps (I also posted this on the related issue):
The errors displayed in the terminal are:
So I'm seeing the connection errors occur much sooner, after 3 minutes, compared to the 10-30 minutes reported here and in the related issue. I will pull down this pull request and see if I still get these errors after 3 minutes.
@flourishingtune I pulled down your changes and ran the steps that I listed in my previous comment. I still get the RabbitMQ errors from missed heartbeats, and the connections close after 3 minutes.
Hi @wigging, thanks for checking out the PR. The second approach (reconnection), which this PR implements, reconnects when the next campaign is launched so that the message can be relayed. We therefore still see producer timeouts in the RabbitMQ logs, but this doesn't affect Zambeze's message relays since we reconnect when required, i.e. while publishing messages from campaigns. There are a few things, like the number of reconnection attempts (in both consumer and producer), that I wish to add going forward. If we also maintain heartbeats from the producer (the first approach I described in the PR), the RabbitMQ connection can be maintained without loss. Let me know if you think that's a better approach.
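For comparison, the first approach (keeping heartbeats flowing from the producer) might look roughly like the sketch below. It is not what this PR implements. `HeartbeatKeeper`, its parameters, and the shared lock are hypothetical names for illustration; the only library call assumed is `process_data_events(time_limit=...)`, which pika's `BlockingConnection` provides. Since a blocking connection should not be used from two threads at once, the sketch assumes the producer takes the same lock around its own `basic_publish` calls.

```python
import threading


class HeartbeatKeeper:
    """Periodically services an idle connection so heartbeats keep flowing.

    `connection` is any object exposing process_data_events(time_limit=...),
    e.g. a pika BlockingConnection. `lock` must be the same lock the producer
    holds while publishing (an assumption of this sketch).
    """

    def __init__(self, connection, lock, interval=10.0):
        self._connection = connection
        self._lock = lock
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Signal the loop to exit and wait for the worker thread to finish.
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() was called.
        while not self._stop.wait(self._interval):
            with self._lock:
                # Lets the client process pending I/O, including heartbeats.
                self._connection.process_data_events(time_limit=0)
```

The trade-off matches the one noted in the description: this keeps the connection open at the cost of an extra thread per producer, whereas recovery on publish does no work while the producer is idle.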
@flourishingtune I ran this pull request again and I can confirm that the reconnection is indeed implemented. But as you noted, subsequent runs of a campaign fail even if the reconnection is made. The only way I can run another campaign is to restart zambeze. So I guess this pull request is good to merge since it fixes the reconnection issue. The problem with running subsequent campaigns can be addressed in a separate pull request.
I removed the This pull request partially fixes issue #168 regarding reconnection issues. But running subsequent campaigns still fails.
This sounds good. Thanks @flourishingtune for the implementation and @wigging for the review.
Issue description:
If campaigns are launched with long intervals between them, the RabbitMQ server closes the connection. This happens because the RabbitMQ server expects regular heartbeats from its connections but did not receive any within the default timeout. This is the log on the RabbitMQ side:
Within Zambeze, we maintain 4 different connections to the RabbitMQ server, all of which live in `message_handler`. Two of them are consumers (`recv_activity`, `recv_control`) and are able to maintain heartbeats once they call `start_consuming`, so they don't lose the connection. This can be verified by launching `tshark` and monitoring port `5672`.

With producers (`send_activity_dag`, `send_control`), when we have gaps between launches, `basic_publish` doesn't run, so RabbitMQ doesn't know whether the connection is still active and closes it after the default timeout. Unlike `start_consuming` on the consumer side, `basic_publish` on the producer is only responsible for publishing messages to channels and doesn't maintain heartbeats.

Resolution:
The issue can be resolved either by maintaining regular heartbeats from the producers, or by implementing connection recovery when `basic_publish` fails to publish a message because the connection was lost. It seems we could also turn off heartbeats altogether, but that is not recommended for reasons like monitoring, resource management, etc. I discovered that both implementations are documented in the `pika` library repository (maintaining heartbeats and recovery). I have implemented the second approach, recovery, since it doesn't require maintaining heartbeats and is a bit easier on the CPU.

Tests:
I tested the recovery approach with varying intervals (10-30 minutes) via a dummy bash script that intermittently launches jobs, and made sure the connection-lost message appeared in the RabbitMQ logs before launching new campaigns. I verified that the publisher was able to send messages by looking at the corresponding logs. I then followed the relay of the new messages from the `message_handler` to the `executor` and verified that they were relayed properly within Zambeze.

Before:
After:
I verified that the RabbitMQ received the new recovery connections:
I noticed that subsequent runs of campaigns (i.e. more than 1) are failing, and that they had issues even without the RabbitMQ connection loss. I think it would be good to address those in a different PR.
Please let me know if there are other tests you would like me to look into. Thanks!
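For reference, the recovery-on-publish idea described under Resolution can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RecoveringPublisher`, `connect`, `recoverable`, and `max_attempts` are hypothetical names. The connection factory is injected, and the set of exceptions treated as recoverable is left to the caller; with pika that would presumably be something like `pika.exceptions.AMQPConnectionError`, but that choice is an assumption here.

```python
class RecoveringPublisher:
    """Publishes messages, reopening the connection if the broker dropped it.

    `connect` is a zero-argument callable returning a connection object with
    a .channel() method (for pika, something that builds a BlockingConnection).
    `recoverable` lists the exception types that should trigger a reconnect.
    """

    def __init__(self, connect, recoverable=(ConnectionError,), max_attempts=3):
        self._connect = connect
        self._recoverable = tuple(recoverable)
        self._max_attempts = max_attempts
        self._connection = None
        self._channel = None

    def publish(self, exchange, routing_key, body):
        """Publish `body`; return the number of attempts it took."""
        for attempt in range(1, self._max_attempts + 1):
            try:
                if self._channel is None:
                    # First use, or the previous attempt discarded a stale link.
                    self._connection = self._connect()
                    self._channel = self._connection.channel()
                self._channel.basic_publish(
                    exchange=exchange, routing_key=routing_key, body=body
                )
                return attempt
            except self._recoverable:
                # Drop the stale handles so the next iteration reconnects.
                self._connection = None
                self._channel = None
                if attempt == self._max_attempts:
                    raise
```

Because the connection factory is injected, the recovery path can be exercised in a unit test with a fake connection whose first `basic_publish` raises, which avoids waiting for a real heartbeat timeout.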