
Worker restarts can cause extra ready messages in the distributor storage #978

Closed

andreasohlund opened this issue Feb 13, 2013 · 14 comments

@andreasohlund
Member
This can happen in at least the following cases:

Worker auto-subscribing at start-up

If a worker auto-subscribes to itself at start-up, the following can happen:

  1. Distributor pops a message from storage and forwards the subscription request to the worker
  2. The worker's start-up message arrives; storage is cleared and new ready messages are added
  3. Worker processes the subscription and sends back a ready message
  4. Distributor storage now contains N+1 ready messages, where N is the number of threads in the worker

Worker restarts with messages in its input queue

  1. Worker stops
  2. N messages arrive at the worker's input queue, where N is the number of threads
  3. Worker restarts and sends a control message
  4. Worker processes the old messages and sends ready messages
  5. Distributor clears the storage and adds N ready messages
  6. The ready messages from step 4 arrive at the distributor and are added to storage
  7. Storage now contains approximately N*2 ready messages for the worker
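
Both sequences come down to the same gap: nothing ties a ready message to the worker's current incarnation. A minimal model of the unguarded behavior (illustrative names, not the actual NServiceBus types):

```csharp
// Minimal model of the unguarded distributor storage (illustrative names,
// not the actual NServiceBus types).
using System.Collections.Generic;

class WorkerAvailabilityManager
{
    // One entry per free worker thread.
    readonly Queue<string> storage = new Queue<string>();

    // Worker start-up control message: clear the worker's entries,
    // then add one per thread.
    public void RegisterWorker(string worker, int threadCount)
    {
        ClearStorageFor(worker);
        for (var i = 0; i < threadCount; i++)
            storage.Enqueue(worker);
    }

    // Ready message: nothing identifies which incarnation of the worker
    // produced it, so ready messages for work handed out before a restart
    // are enqueued again, yielding the N+1 / ~2N counts described above.
    public void WorkerIsReady(string worker)
    {
        storage.Enqueue(worker);
    }

    void ClearStorageFor(string worker)
    {
        var remaining = new Queue<string>();
        while (storage.Count > 0)
        {
            var entry = storage.Dequeue();
            if (entry != worker)
                remaining.Enqueue(entry);
        }
        foreach (var entry in remaining)
            storage.Enqueue(entry);
    }
}
```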

The proposed solution is:

  1. Worker attaches a "session" id to each control/ready message. This id is reset on restart; to avoid race conditions it should probably be the worker's UtcNow
  2. Distributor stores this value in the storage queue
  3. Distributor keeps the value and the worker address in an in-memory dictionary
  4. Distributor forwards this value as a header when sending work to workers
  5. Worker flows the session id of the incoming message to the resulting ready message
  6. Distributor discards ready messages that belong to an old session id (per worker), as in the sketch below
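
A minimal sketch of that guard (hypothetical names; the real implementation may differ):

```csharp
// Sketch of the proposed session-id guard (hypothetical names).
using System;
using System.Collections.Generic;

class SessionAwareAvailabilityManager
{
    readonly Queue<string> storage = new Queue<string>();

    // Step 3: latest session id per worker address, kept in memory.
    readonly Dictionary<string, string> currentSessions = new Dictionary<string, string>();

    // Steps 1+2: the control message carries the worker's session id,
    // e.g. DateTime.UtcNow captured once at worker start-up.
    public void RegisterWorker(string worker, string sessionId, int threadCount)
    {
        currentSessions[worker] = sessionId;
        // clear the worker's old entries here, then re-add one per thread
        for (var i = 0; i < threadCount; i++)
            storage.Enqueue(worker);
    }

    // Steps 5+6: the worker flowed the session id it received with the
    // work (step 4) back on the ready message; stale ids are discarded.
    public void WorkerIsReady(string worker, string sessionId)
    {
        string current;
        if (!currentSessions.TryGetValue(worker, out current) || current != sessionId)
            return; // ready message from an old session: drop it

        storage.Enqueue(worker);
    }
}
```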
@janovesk
Contributor

Yeah, that's pretty much exactly what we've been seeing. The solution sounds good.

We're doing a quick workaround for our most frequent scenario: deploying a new version and restarting all the workers. (We do this often in our TEST and STAGE environments.) We simply clear the storage and control queues before restarting the workers. We restart them all at the same time and let them finish their local work first. This makes sure no worker has anything in its input queue when it starts up, and no worker will receive any work before its start-up message has been processed by the distributor. We can still get out of sync if a single worker restarts, but we can live with that until the issue is fixed.
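
For anyone wanting to script that, something along these lines works with System.Messaging (queue paths are examples only; substitute the names your distributor actually uses):

```csharp
// Purges the distributor's storage and control queues before a
// coordinated worker restart. Queue paths are examples only.
using System.Messaging;

class PurgeDistributorQueues
{
    static void Main()
    {
        var paths = new[]
        {
            @".\private$\myendpoint.distributor.storage",
            @".\private$\myendpoint.distributor.control"
        };

        foreach (var path in paths)
        {
            using (var queue = new MessageQueue(path))
            {
                queue.Purge(); // discards every message in the queue
            }
        }
    }
}
```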

@andreasohlund
Member Author

Yes. At least this won't cause a constant buildup of messages, just a few extra at start-up, and then it will stabilize.


@janovesk
Contributor

Yes, exactly. And that makes the issue non-critical in my humble opinion.

@johnsimons
Member

@andreasohlund is this a must have for v4.1?

@andreasohlund
Member Author

No


@andreasohlund
Member Author

This will likely be fixed as part of #1361

@andreasohlund
Member Author

The worker session id concept will be needed for #978 as well

@johnsimons
Member

Fixed as part of #1743

@JeffHenson

I'm trying to follow what was done to fix this issue (we're seeing it in our production environment), but I can't find where it was actually fixed. We're running 4.4.x and we still see extra entries in the storage queue after a worker node restarts.

Has this actually been fixed or have I missed some configuration step?

@johnsimons
Member

@JeffHenson, this is only fixed in the new distributor (https://www.nuget.org/packages/NServiceBus.Distributor.MSMQ/), which distributor are you using?

@JeffHenson

@johnsimons I didn't know there was a new one so I'm still using the old one. I can't find any documentation on that package. Is there any configuration needed to use it other than adding it to my project?

@johnsimons
Member

Hi @JeffHenson

As usual we are a bit behind on the doco 😞
Anyway, here are the instructions: Particular/docs.particular.net#178

@JeffHenson

@johnsimons excellent, thanks!

Using the new distributor code has fixed both of the issues we were seeing.
