Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consumer channel isn't closed in the event of unexpected disconnection #18

Closed
corcus opened this issue Sep 16, 2021 · 13 comments
Closed

Comments

@corcus
Copy link

corcus commented Sep 16, 2021

This is a simplified version of my code.

package main

import (
	amqp "github.com/rabbitmq/amqp091-go"
	"log"
)

func main() {
	amqpUri := "amqpuri"
	conn, err := amqp.Dial(amqpUri)
	if err != nil {
		return
	}
	notifyConnClose := make(chan *amqp.Error)
	conn.NotifyClose(notifyConnClose)
	log.Println("RabbitMQ client connected")

	ch, err := conn.Channel()
	if err != nil {
		return
	}
	notifyChanClose := make(chan *amqp.Error)
	ch.NotifyClose(notifyChanClose)

	queueName := "testqueue"
	_, err = ch.QueueDeclare(queueName, false, false, false, false, nil)
	if err != nil {
 		return
	}

	deliveryChan, err := ch.Consume(queueName, "", false, false, false, false, nil)
	if err != nil {
		return
    	}

	go func() {
	select {
	case <-notifyConnClose:
		log.Println("connection closed")
	case <-notifyChanClose:
		log.Println("channel closed")
	}
	}()

	for d := range deliveryChan {
	log.Println(string(d.Body))
	d.Ack(false)
	//ch.Close()  //comment out to test graceful close.
	}

	log.Println("terminating...")
}

In this code I am getting a connection and a channel and registering a notification (go)channel for both the connection and channel to be notified when they are closed.

Then I declare a queue and start consuming messages from it by ranging on the deliveryChan <-chan amqp.Delivery returned by the consume function.

The problem happens when an unexpected disconnection occurs (for example I turn off my internet) . In that case even though the notifyConnClose channel gets a message the deliveryChan is not closed, and the range loop blocks forever.

In the event of a graceful disconnection by a connection.Close() then both the notifyConnClose gets a message, and the deliveryChan is Closed.

In the event of the unexpected disconnection, given that I can't close the <-chan amqp.Delivery from my code how am I supposed to proceed and get the loop to end?

@espang
Copy link

espang commented Nov 8, 2021

Hi. I had the same problem with https://github.com/streadway/amqp.

I had to look at the channel shutdown function. It blocks on sending a notification to the notifyChanClose channel. I consumed that and the consumer channel is closed afterwards.

Instead of

	select {
	case <-notifyConnClose:
		log.Println("connection closed")
	case <-notifyChanClose:
		log.Println("channel closed")
	}

I have this:

select {
case <-s.notifyConnClose:
  log.Warn().Msg("Connection closed. Reconnecting...")
  select {
  case <-s.notifyChanClose:
    log.Warn().Msg("Channel closed(2). Re-running init...")
  case <-time.After(time.Second):
  }
  return false
case <-s.notifyChanClose:
  log.Warn().Msg("Channel closed. Re-running init...")
}

This is the line that blocks and prevents the consumer channel from being closed here

@corcus
Copy link
Author

corcus commented Nov 9, 2021

I tried your workaround solution on my repro project above and it works. Thank you very much.

Do you think that a quick fix like this in the lines you mention would be a good solution to this problem?

if e != nil {
   for _, c := range ch.closes {
	select {
	case c <- e:
	default:
	}
   }
}

Or should the blocking remain as intended functionality (we should listen on both connection and channel close channels in the event of abnormal disconnection). But it should be documented better?

@espang
Copy link

espang commented Nov 9, 2021

This would drop a notification in the lib. I have changed my reconnect code that is based on the example (here) to use buffered channels, but I need to check if it improves the behaviour.

Based on your code the change would be:

notifyConnClose := make(chan *amqp.Error)
notifyChanClose := make(chan *amqp.Error)

to

notifyConnClose := make(chan *amqp.Error, 1)
notifyChanClose := make(chan *amqp.Error, 1)

This means the above code isn't blocking anymore and I hope the mutexes don't deadlock. I still had problems when the channel or connection was closed while doing a call - in my case it was on DeclareQueue. It was basically the same problem. The shutdown handler from the channel wants to send a notification and waits there and the code in DeclareQueue waits for signals in some channels. Some other issue raises that the notifications should be handled in a different goroutine, but I hope that the buffered channels have the same result.

@andygrunwald
Copy link
Contributor

I can also reproduce this issue.

@andygrunwald
Copy link
Contributor

This might also be a useful approach: streadway/amqp#519

@DanielePalaia
Copy link
Contributor

DanielePalaia commented Mar 29, 2022

Hi,

thank you very much to have investigated on the issue both here and in #32

I can reproduce the issue with a local integration test.
As far as I understand here during an abnormal disconnection which can happen for example forcing a close connection from UI or simulating a network issue a shutdown process is invoked, and the shutdown function in channel.go is blocked here:

for _, c := range ch.closes {
           c <- e
}

because the channel notifyChanClose in the select statement of the example is not consumed anymore after the notifyConnClose get consumed causing the lock acquired from the shutdown method to never been released and so creating the deadlock

The options are:

  • Better document the usage/behaviour of the library during a shutdown event both in the documentation and in the code of the shutdown of the library
  • Let the shutdown behaves asynchronously as already suggested, causing some notifications to be lost.

This library is widely used since some time and it is used as wrapper for other libraries, modifying this behaviour could cause some issue elsewhere, so we are more inclined to proceed with the first option.

In this case we will state in the documentation that during a shutdown caused by an abnormal drop in the connection the notification channels both for connections and channels need to be consumed or a buffered channel needs to be used.

@andygrunwald
Copy link
Contributor

@DanielePalaia Thanks for the response.
Would Option 2 + a new major version (indicating a backward incompatible change) be also an option?

And can you elaborate on the bold part here a bit?

Let the shutdown behaves asynchronously as already suggested, causing some notifications to be lost.

@Gsantomaggio
Copy link
Member

We understood the problem but we'd like to avoid putting timeouts for every single channel.
Here we are analyzing one single channel, but we could extend it to all the channels.

Given a simple go program:

func handleInt(ch chan int) {
	fmt.Printf("Handle %d", <-ch)
}

func main() {
	fmt.Println("Starting...")

	ch := make(chan int)
	go handleInt(ch)
	ch <- 23

if we comment:

func handleInt(ch chan int) {
//	fmt.Printf("Handle %d", <-ch)
}

We have a deadlock, but this is how golang works :)!

Speaking with the team, we'd tend to add some documentation.
Even if this is more about how to use Golang channel instead of how to use the library.

@andygrunwald thank you for your contributions.
We'd like to avoid introducing a 2.x version.
Too many versions to handle. We also have several clients to follow :)

@andygrunwald
Copy link
Contributor

That is a fair design choice. Thanks for the context on this.

@corcus
Copy link
Author

corcus commented Mar 30, 2022

In the documentation you are going to provide please include an explanation and maybe an example of the difference in behavior between a graceful close of a connection versus an unexpected one.

When I first encountered the issue, that part was what threw me off the most.

The problem happens when an unexpected disconnection occurs (for example I turn off my internet) . In that case even though the notifyConnClose channel gets a message the deliveryChan is not closed, and the range loop blocks forever.
In the event of a graceful disconnection by a connection.Close() then both the notifyConnClose gets a message, and the deliveryChan is Closed.

Having a different behavior for different types of disconnection was counter intuitive for me.

@DanielePalaia
Copy link
Contributor

Hi,

we updated the documentation explaining this use case scenario https://pkg.go.dev/github.com/rabbitmq/amqp091-go, we also commented to the notifyClose functions of the connection and channel struct.

I think we can close this one for now.

@andygrunwald
Copy link
Contributor

@DanielePalaia Are these the commits?

@DanielePalaia
Copy link
Contributor

@andygrunwald yyes that ones!

Zerpet added a commit that referenced this issue Aug 18, 2022
The example function Example_consume() shows how to write a consumer
with reconnection support.

This commit also changes the notification channels to be buffered with
capacity of 1. This is important to avoid potential deadlocks during an
abnormal disconnection. See #32 and #18 for more details.

Both example functions now have a context with timeout, so that they
don't run forever. The QoS of 1 is set to slowdown the consumption; this
is helpful to test the reconnection capabilities, by giving enough time
to close the connection via the Management UI (or any other means).

Signed-off-by: Aitor Perez Cedres <acedres@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants