Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix - Add automatic reconnect support for STOMP producers #1099

Merged
merged 11 commits into from
Oct 8, 2020
Merged

Fix - Add automatic reconnect support for STOMP producers #1099

merged 11 commits into from
Oct 8, 2020

Conversation

atrauzzi
Copy link
Contributor

@atrauzzi atrauzzi commented Sep 29, 2020

Closes: #1097


To detail my scenario, I'm running a long-lived PHP process (laravel queue worker) that listens on one connection, but can occasionally publish to another connection. The connection used for listening doesn't suffer from any issues because enqueue successfully sets up heartbeats, anticipating the fact that consumers tend to be long lived. For producers however, there might be the assumption that they are transient (web requests) and as such, don't require connections to live any longer than it takes for the request thread to finish all its work.


@makasim - My apologies if I've missed something or if this seems like a shoehorn fix / neglects any other concerns. After doing some digging around however I am starting to wonder if this might actually be a desirable approach.

What I've noticed in the stomp-php library is that their implementation of Client::isConnected has the potential to report false positives for the state of the connection and that without getting into a more convoluted fix, can only be "tested" by attempting to use it.

With the code fix in this PR, failures to write are jogged with one last-ditch attempt to connect. My long lived mixed consumer+producer processes are able to send messages and reestablish connections on-demand with no ceremony or boilerplate.

I'd like to add the caveat that this is not my preferred strategy and would favour something more explicit, but just based on the fact that StompProducer isn't set up for configuration and the aforementioned false-positives from stomp-php, this may do well for all scenarios.

@makasim
Copy link
Member

makasim commented Sep 30, 2020

Do we not want this?

Yes, indeed. You should not commit lock file in libraries in contrast to application.

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Sep 30, 2020

Yes, indeed. You should not commit lock file in libraries in contrast to application.

Hmm, I always lock my dependencies regardless of project type, but no matter! 😄


I'm going through some additional testing right now and I'm noticing one other issue with stomp-php where it claims to have successfully written, but actually hasn't. I end up having to pump quite a few messages through before it reports a failed write.

I'll spend some more time today testing and will report back.

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Sep 30, 2020

@makasim - So after looking a little deeper into it, it looks like there's another issue with stomp-php when sending messages where it occasionally fails to detect a dead connection. I am able to reproduce this consistently.

The sequence looks roughly like:

  • Run a broker that connects to queue-1 and listens.
  • Worker receives a message on queue-1 and goes to handle it.
  • Worker requires a message to be written to queue-2
  • Worker connects to queue-2 and write; the message is successful. This leaves the connection open with no heartbeat.

    note: Up until this point, I think things are fine. If queue-1 works steadily for a while, messages sent from the worker to queue-2 are safe. This steady work simulates a heartbeat and prevents any dead connection detection on the broker from triggering.

  • If after a while (connection timeout window), a new message arrives on queue-1 and the worker goes to process it similar to before, it ends up unable to deliver any message to queue-2. The connection from the broker for queue-2 is effectively corrupted, but internally to stomp-php, it thinks it's still good. Even to the point of attempting to write bytes without triggering its exception.

Originally I thought stomp-php was simply not heartbeating a connection and not detecting dead connections, but now I'm able to see it also try to use its connection resource without throwing the "Was not possible to write frame!" exception. 😱

I've added another improvement to the PR which we can discuss further, it unfortunately has the impact of requiring one connection per message. Again though, I'm not sure how to workaround this and in fairness, I can see how this hasn't been observed until now because I could see the majority of projects using this library only dispatching from short-lived web requests.

@makasim
Copy link
Member

makasim commented Sep 30, 2020

To find out that connection is closed ( by the server) you have to read from the socket. You will get EOF or timeout.

But you cannot read before you write in client to server commutation.

There is no way to detect a closed connection by writing to it. Timeout is the only option.

Is there a way to set write timeout ? That could be used as a sign of closed connection and trigger reconnect.

@atrauzzi
Copy link
Contributor Author

Oh thanks! I'll tinker around with your suggestion and reply here with what I figure out. Definitely need to be able to come up with a way to gracefully handle this.

I'll see how long I have to go before something has to happen upstream 😆

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Sep 30, 2020

After spending a fair bit of time on it, I've been unable to come up with a way to jog or test the connection. Can you think of any concrete solutions that might get around this issue?

From my vantage point, I see only a few options:

  • Have enqueue send heartbeats for non-queue producer connections
  • Somehow figure out a way to jog the connection prior to using it
  • Allow enqueue to treat connections that don't have a heartbeat to be considered "transient" and destroy or reconnect them after each message sent.

IMO, I feel like the 3rd option is the most "predictable" philosophy and also aligns message publishing with how it's functioning right now during web requests.

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 1, 2020

Just a note in case you're at things early today, will be pushing up a new approach that I think I have working really nicely here...

*/
public function __construct($stomp, string $extensionType)
public function __construct($stomp, string $extensionType, $transient = true)
Copy link
Contributor Author

@atrauzzi atrauzzi Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connections start out as "transient" which means that they will use some new reconnect behaviour below. I have it here in the constructor as configurable if for whatever reason a connection needs to start out long-lived.

@@ -173,6 +180,8 @@ public function createConsumer(Destination $destination): Consumer
{
InvalidDestinationException::assertDestinationInstanceOf($destination, StompDestination::class);

$this->transient = false;
Copy link
Contributor Author

@atrauzzi atrauzzi Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as the first consumer is created against a context, we convert to long-lived and disable all transient functionality so that queues with heartbeats don't lose their connections while heartbeating.

Comment on lines 193 to 195
if ($this->transient && true == $this->stomp) {
$this->stomp->disconnect();
}
Copy link
Contributor Author

@atrauzzi atrauzzi Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only happens when a connection is not being used for consumption.

Triggering a disconnect prior to a producer being used is enough, as the underlying stomp-php library can safely rely on its false state around connections. Connections recover without any additional involvement from enqueue.

Comment on lines +224 to +233
private function createStomp(): BufferedStompClient
{
$stomp = call_user_func($this->stompFactory);

if (false == $stomp instanceof BufferedStompClient) {
throw new \LogicException(sprintf('The factory must return instance of BufferedStompClient. It returns %s', is_object($stomp) ? get_class($stomp) : gettype($stomp)));
}

return $stomp;
}
Copy link
Contributor Author

@atrauzzi atrauzzi Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope you don't mind, I moved this out. 😄

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 1, 2020

@makasim - Updates made and I'm really happy with this, it's working great and has greatly improved the stability and flexibility of enqueue STOMP! 😃

@atrauzzi atrauzzi requested a review from makasim October 1, 2020 14:46
@makasim
Copy link
Member

makasim commented Oct 1, 2020

To clarify:

You have to connections: one for consumers and another for publisher. The problem happens when you want to publish a message from a consumer ?

@makasim
Copy link
Member

makasim commented Oct 1, 2020

If so, It might be better to reconnect publisher connection only once before processing a message (assuming that message processing does not take long).

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 1, 2020

Hmm, let's see if I can do this justice 😆

I think part of the issue is around the current division of responsibilities. We need to do the correction at the Context because StompConsumer and StompProducer don't know if they should start out transient as well as when or whether they should stop being transient at some point.

  • StompContext only has the authority to manage connections, but never uses them
  • StompProducer and StompConsumer only have the authority to use connections , but not manage them

For the current design:

  • StompContext has responsibility for both createProducer and createConsumer
  • StompContext is responsible for triggering the creation and holding references to connections via StompContext::createStomp, this makes StompContext the first point of integration with stomp-php
  • So long as a StompContext is only ever used for producers, then it's better to preserve the one message per connection semantics by bouncing the connection before any attempt to use it.
    • This is because if a StompContext only has createProducer called, it almost definitely doesn't have a heartbeat configured
    • createProducer is called multiple times to send messages. StompProducers are not reused and do not have sole authority over a connection, so it made sense to me to reset the connection at the level where it's managed.

In theory I could add a $transient parameter to StompProducer instances, but it seems somewhat out of place as their current design is focused around using the connection and not managing it (as noted before).

Apologies, I'm hopefully explaining this correctly and in such a way that does the current design justice 😆

@makasim
Copy link
Member

makasim commented Oct 1, 2020

I am not that much into Laravel specifics. In general the idea is next

  • you've got a message
  • forced disconnect-connect of publisher connection, like $conn->reconnect();
  • consume the received message.

Once you get a new message do the same.

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 1, 2020

If I understand what you're saying correctly, that flow sounds like it's for receiving messages? Which isn't a problem in my scenario. The issue I'm observing is when I try to send to a queue (queue-2) that is not being listened on, from a long-lived process.

I have no reason to open and hold a connection open to the destination queue (queue-2) because I won't be publishing to it on every inbound message from say...queue-1, which is long lived and is working fine with a heartbeat configured.


FWIW, this could be an issue without Laravel involved, there's nothing specific to it here. Moreover, I think what I'm adding more explicitly reflects how people use Enqueue during regular web request lifecycle. It just happens to ensure that semantic is now consistently offered for long running processes as well.


Sorry, one more edit, but I'm also more than happy to do a screen share to show the difference. Sometimes a picture/video is way better for understanding what's going on.

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 2, 2020

@makasim - Is there anything specific I can do to help get this merged? At this point my testing locally has proven the fix in this MR to be quite robust for long lived processes. 😄

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 6, 2020

Hey @makasim, just checking in again. I'd love to get this merged so that we'll be able to use STOMP from PHP in some upcoming work. Everything is still working very smoothly with my local version of this fix.

@atrauzzi atrauzzi closed this Oct 6, 2020
@atrauzzi atrauzzi reopened this Oct 6, 2020
@makasim
Copy link
Member

makasim commented Oct 6, 2020

Honestly, I am not completely sold on this solution. Let's discuss the issue over a call. What about tomorrow?

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 6, 2020

Absolutely, I can definitely do that, I'm in the CDT timezone, happy to accommodate, what's your preference? 😄

It's worth noting that the transient flag is only in effect when the connection is going to be short-lived anyway. So there's very little risk of this breaking consumers. It's just causing the connection to recycle prior to use for non-consumer usage.

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 7, 2020

@makasim - If you're still available and unsure about this, I followed you on twitter if you wanna DM any details for a quick screen share or chat.

Again, I've been running this patch locally from two different queue workers that interact over the broker since figuring it out. Have made sure to come back and test dispatches. Consumers and heartbeats haven't been negatively impacted. My scenario also tests consuming and dispatching from the same queue, so the transient flag is being managed correctly.

I do feel like it's a stable change and am happy to commit long term to assisting if for some reason this causes a regression. In many ways though, I think this could improve stability. A long lived web request publishing to a queue with a short TTL could in theory encounter an issue without this fix.

@makasim
Copy link
Member

makasim commented Oct 7, 2020

@atrauzzi Sent you a message

}
$stomp = $this->config['lazy']
? function () { return $this->establishConnection(); }
: $this->establishConnection();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally I indent lines like 79, but the code style suggested this, apologies!

/**
* @param BufferedStompClient|callable $stomp
*/
public function __construct($stomp, string $extensionType)
public function __construct($stomp, string $extensionType, bool $detectTransientConnections = false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where we default to "off".

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 8, 2020

Good morning @makasim! 😆

As per our chat earlier today, I've wired this up to the configuration obtained in StompConnectionFactory. Pretty straightforward I'd say and preserves all old behaviour by default.

@makasim
Copy link
Member

makasim commented Oct 8, 2020

@atrauzzi some tests are failing

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 8, 2020

@makasim - I think the two that are failing now are okay?

@makasim makasim merged commit 272c2bc into php-enqueue:master Oct 8, 2020
@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 8, 2020

Wonderful, thank you @makasim. Will this be automatically released?

@atrauzzi atrauzzi deleted the fix/dead-non-consumer-connections branch October 8, 2020 17:47
@makasim
Copy link
Member

makasim commented Oct 8, 2020

Semi-automatically...

I'll do it.

@makasim
Copy link
Member

makasim commented Oct 8, 2020

Thank you too!

@atrauzzi
Copy link
Contributor Author

atrauzzi commented Oct 9, 2020

Hey @makasim - did the release end up going through? I noticed that I'm still seeing version 0.10.4 as the latest. 😄

@makasim
Copy link
Member

makasim commented Oct 9, 2020

done https://github.com/php-enqueue/stomp/releases/tag/0.10.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

STOMP: "Was not possible to write frame!" causing zombie connections
2 participants