
Worker restarts can cause extra ready messages in the distributor storage #978

Closed

andreasohlund opened this issue Feb 13, 2013 · 14 comments

@andreasohlund
Member
This can happen in at least the following cases:

Worker auto-subscribing at start-up

If a worker auto-subscribes to itself at start-up, the following can happen:

  1. Distributor pops a message from storage and forwards the subscription request to the worker
  2. The worker's start-up message arrives; storage is cleared and new ready messages are added
  3. Worker processes the subscription and sends back a ready message
  4. Distributor storage now contains N+1 ready messages, where N is the number of threads in the worker

Worker restarts with messages in its input queue

  1. Worker stops
  2. N messages arrive at the worker's input queue, where N is the number of threads
  3. Worker restarts and sends a control message
  4. Worker processes the old messages and sends ready messages
  5. Distributor clears the storage and adds N ready messages
  6. The ready messages from step 4 arrive at the distributor and are added to storage
  7. Storage now contains approximately N*2 ready messages for the worker
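
Both sequences come down to the same gap: nothing ties a ready message to the worker's current incarnation. A minimal model of the unguarded behavior (illustrative names, not the actual NServiceBus types):

```csharp
// Minimal model of the unguarded distributor storage (illustrative names,
// not the actual NServiceBus types).
using System.Collections.Generic;

class WorkerAvailabilityManager
{
    // One entry per free worker thread.
    readonly Queue<string> storage = new Queue<string>();

    // Worker start-up control message: clear the worker's entries,
    // then add one per thread.
    public void RegisterWorker(string worker, int threadCount)
    {
        ClearStorageFor(worker);
        for (var i = 0; i < threadCount; i++)
            storage.Enqueue(worker);
    }

    // Ready message: nothing identifies which incarnation of the worker
    // produced it, so ready messages for work handed out before a restart
    // are enqueued again, yielding the N+1 / ~2N counts described above.
    public void WorkerIsReady(string worker)
    {
        storage.Enqueue(worker);
    }

    void ClearStorageFor(string worker)
    {
        var remaining = new Queue<string>();
        while (storage.Count > 0)
        {
            var entry = storage.Dequeue();
            if (entry != worker)
                remaining.Enqueue(entry);
        }
        foreach (var entry in remaining)
            storage.Enqueue(entry);
    }
}
```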

The proposed solution is:

  1. Worker attaches a "session" id to each control/ready message. This id is reset on restart; to avoid race conditions it should probably be the worker's UtcNow
  2. Distributor stores this value in the storage queue
  3. Distributor keeps the value and the worker address in an in-memory dictionary
  4. Distributor forwards this value as a header when sending work to workers
  5. Worker flows the session id of the incoming message to the resulting ready message
  6. Distributor discards ready messages that belong to an old session id (per worker), as in the sketch below
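
A minimal sketch of that guard (hypothetical names; the real implementation may differ):

```csharp
// Sketch of the proposed session-id guard (hypothetical names).
using System;
using System.Collections.Generic;

class SessionAwareAvailabilityManager
{
    readonly Queue<string> storage = new Queue<string>();

    // Step 3: latest session id per worker address, kept in memory.
    readonly Dictionary<string, string> currentSessions = new Dictionary<string, string>();

    // Steps 1+2: the control message carries the worker's session id,
    // e.g. DateTime.UtcNow captured once at worker start-up.
    public void RegisterWorker(string worker, string sessionId, int threadCount)
    {
        currentSessions[worker] = sessionId;
        // clear the worker's old entries here, then re-add one per thread
        for (var i = 0; i < threadCount; i++)
            storage.Enqueue(worker);
    }

    // Steps 5+6: the worker flowed the session id it received with the
    // work (step 4) back on the ready message; stale ids are discarded.
    public void WorkerIsReady(string worker, string sessionId)
    {
        string current;
        if (!currentSessions.TryGetValue(worker, out current) || current != sessionId)
            return; // ready message from an old session: drop it

        storage.Enqueue(worker);
    }
}
```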
@janovesk
Contributor

Yeah, that's pretty much exactly what we've been seeing. The solution sounds good.

We're doing a quick workaround for our most frequent scenario: deploying a new version and restarting all the workers. (We do this often in our TEST and STAGE environments.) We simply clear the storage and control queues before restarting the workers. We restart them all at the same time and let them finish their local work first. This makes sure no worker has anything in its input queue when it starts up, and no worker will receive any work before its start-up message has been processed by the distributor. We can still get out of sync if a single worker restarts, but we can live with that until the issue is fixed.
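
For anyone wanting to script that, something along these lines works with System.Messaging (queue paths are examples only; substitute the names your distributor actually uses):

```csharp
// Purges the distributor's storage and control queues before a
// coordinated worker restart. Queue paths are examples only.
using System.Messaging;

class PurgeDistributorQueues
{
    static void Main()
    {
        var paths = new[]
        {
            @".\private$\myendpoint.distributor.storage",
            @".\private$\myendpoint.distributor.control"
        };

        foreach (var path in paths)
        {
            using (var queue = new MessageQueue(path))
            {
                queue.Purge(); // discards every message in the queue
            }
        }
    }
}
```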

@andreasohlund
Member Author

Yes. At least this won't cause a constant buildup of messages, just a few extra at start-up, and then it will stabilize.


@janovesk
Contributor

Yes, exactly. And that makes the issue non-critical in my humble opinion.

@johnsimons
Member

@andreasohlund is this a must have for v4.1?

@andreasohlund
Member Author

No


@andreasohlund
Member Author

This will likely be fixed as part of #1361

@andreasohlund
Member Author

The worker session id concept will be needed for #978 as well

@johnsimons
Member

Fixed as part of #1743

@JeffHenson

I'm trying to follow what was done to fix this issue (we're seeing it in our production environment), but I can't find where it was actually fixed. We're running 4.4.x and we still see extra entries in the storage queue after a worker node restarts.

Has this actually been fixed or have I missed some configuration step?

@johnsimons
Member

@JeffHenson, this is only fixed in the new distributor (https://www.nuget.org/packages/NServiceBus.Distributor.MSMQ/), which distributor are you using?

@JeffHenson

@johnsimons I didn't know there was a new one so I'm still using the old one. I can't find any documentation on that package. Is there any configuration needed to use it other than adding it to my project?

@johnsimons
Member

Hi @JeffHenson

As usual we are a bit behind on the doco 😞
Anyway, here are the instructions: Particular/docs.particular.net#178

@JeffHenson

@johnsimons excellent, thanks!

Using the new distributor code has fixed both of the issues we were seeing.
