Improve throttled processing strategy #546
Conversation
Codecov Report
@@ Coverage Diff @@
## master #546 +/- ##
==========================================
+ Coverage 38.12% 38.52% +0.39%
==========================================
Files 78 78
Lines 2662 2754 +92
Branches 465 487 +22
==========================================
+ Hits 1015 1061 +46
- Misses 1498 1536 +38
- Partials 149 157 +8
Continue to review full report at Codecov.
Seems legit. Only a couple of comments.
var counter = new ThreadSafeCounter();

var actions = BuildFakeIncomingMessages(messageCount, counter);

await ListenLoopExecuted(actions, messageProcessingStrategy);

fakeMonitor.DidNotReceive().IncrementThrottlingStatistic();
if (messageCount == capacity)
Can you please separate the tests where messageCount == capacity into another test method, so you can avoid the if and make what you want to assert clearer?
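For example, something along these lines: a sketch only, inside the existing test class, reusing the helpers visible in the snippet above; the method names and the CreateStrategy helper are hypothetical, not what ended up in the PR.

[Theory]
[InlineData(5, 10)]
public async Task Does_Not_Throttle_When_Message_Count_Is_Below_Capacity(int messageCount, int capacity)
{
    var messageProcessingStrategy = CreateStrategy(capacity); // hypothetical helper; build the strategy as the existing test does
    var counter = new ThreadSafeCounter();
    var actions = BuildFakeIncomingMessages(messageCount, counter);

    await ListenLoopExecuted(actions, messageProcessingStrategy);

    fakeMonitor.DidNotReceive().IncrementThrottlingStatistic();
}

[Theory]
[InlineData(10, 10)]
public async Task Behaves_As_Expected_When_Message_Count_Equals_Capacity(int messageCount, int capacity)
{
    var messageProcessingStrategy = CreateStrategy(capacity); // hypothetical helper
    var counter = new ThreadSafeCounter();
    var actions = BuildFakeIncomingMessages(messageCount, counter);

    await ListenLoopExecuted(actions, messageProcessingStrategy);

    // The assertions that previously lived inside the if (messageCount == capacity)
    // branch go here, with no conditional to obscure the intent of the test.
}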
The behaviour needs discussion overall; for now I just wanted it passing.
Changed to separate the two cases in later commits.
messageProcessingStrategy.AvailableWorkers.ShouldBeGreaterThan(0);

stopwatch.Elapsed.ShouldBeLessThanOrEqualTo(timeout,
    $"ListenLoopExecuted took longer than timeout of {timeoutSeconds}s, with {actions.Count} of {initalActionCount} messages remaining");
stopwatch.Elapsed.ShouldBeLessThanOrEqualTo((timeout * 3),
Where does the * 3 come from?
Threading means things don't take exact amounts of time, so this was just a number big enough to keep the elapsed time within some loose bounds and get the test passing. It's a placeholder pending a proper discussion of how this should work, because the overall approach to the message processing seems a bit flawed to me.
OK, I suspected it was just made as big as necessary to get it to pass. I'll wait for the (eventual) refactoring after the discussion.
Updated the test so it no longer validates the timings.
Intriguingly I just saw a talk from Daniel Marbach of Particular (NServiceBus) about writing an async message handler, where he called out the need to apply back pressure, and used a SemaphoreSlim to control the number of in-flight requests.
I suspect the video will be up shortly, though I don't think you will get much beyond confirmation of the approach.
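For reference, the shape of that approach is roughly the following. This is a generic sketch (the Message type and HandleAsync are placeholders), not code from this PR or from NServiceBus: a SemaphoreSlim capped at the maximum number of in-flight messages, awaited before each handler starts and released when it completes.

using System.Threading;
using System.Threading.Tasks;

public sealed class Message { } // placeholder for whatever the handler consumes

public sealed class BackPressureDispatcher
{
    // At most 10 handlers in flight; WaitAsync below is what applies the back pressure.
    private readonly SemaphoreSlim _inFlight = new SemaphoreSlim(10, 10);

    public async Task DispatchAsync(Message message, CancellationToken cancellationToken)
    {
        // If the limit is reached, the receive loop pauses here instead of
        // queueing ever more work onto the thread pool.
        await _inFlight.WaitAsync(cancellationToken);

        _ = Task.Run(async () =>
        {
            try
            {
                await HandleAsync(message);
            }
            finally
            {
                _inFlight.Release();
            }
        });
    }

    private static Task HandleAsync(Message message) => Task.CompletedTask; // placeholder handler
}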
This is some really nice analysis 😄
Another thing to make you aware of (you may have already read it) is this great read from Marc Gravell on the topic.
Thanks @slang25 - sounds like I did understand what was going on properly then, and that it is setting itself up for a race condition. As 7's going to be a breaking change anyway, it sounds like this is something I can try to fix properly here as part of this PR, rather than fudge it and come back to it later. Removing it and changing the "worker" design sounds like a sensible approach to make this nicer.

One thing I'd considered is, instead of:

We could change it to:

The idea being that it's easier to have many things each doing something serially than one thing trying to manage many parallel things. Thoughts? I'll also check out that blog post and the related PR later on.
@slang25 Sounds like the app was experiencing a variant of the resource exhaustion described in Marc Gravell's post, as the old code was using a TCS too, except we were still processing messages so the instance wasn't totally dead.
The TCS code is something I added when adding the semaphore implementation; I wanted an async way of waiting for the semaphore value to change and then to peek at the count. I've given this some more thought, and I can't think of a way this would get tripped up, so I should have gone with this approach first time around.

On your suggestion about changing the processing pattern, I think what you're suggesting is a better approach. What can happen now with a shared throttled strategy of, let's say, 10 workers, under queue back pressure, is that every other message fetch tries to get 1, and because of concurrency they are still fetching too many, so if you have multiple queues the fetch counts can look like: Q1 - ... 1, 9, 1, 1, 8, ... or something as random as that.

Part of me wants to smooth out these numbers by tracking a rolling average of the rate of messages being processed. This would potentially introduce some (bounded) back pressure into the service (and out to SQS), which is less desirable, but if kept minimal it is manageable (messages could time out more easily, for example). With this suggestion, the listen loop would have a steady cadence, which can also help reduce latency during quiet periods: rather than wait 20 seconds to receive a message batch containing 1 message, you could ask for 1 message and get it as soon as it's in the queue. However, all this comes at the cost of yet more complexity, which is where I always drop the idea.

I think your suggestion wins in terms of simplicity and sanity. If TFMs were no barrier, what you are suggesting would be really nice exposed as an
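To illustrate the over-fetching race described above, here is a sketch of the non-atomic "check, then fetch" that each listen loop performs against the same shared strategy. ReceiveFromSqsAsync and HandleAsync are placeholder names; the real loop lives in SqsNotificationListener.

private async Task FetchOnceAsync()
{
    int available = messageProcessingStrategy.AvailableWorkers; // both loops can read 10 here

    // The SQS call takes tens of milliseconds; by the time this batch arrives,
    // the other loop may already have claimed those same 10 slots.
    var batch = await ReceiveFromSqsAsync(Math.Min(available, 10));

    foreach (var message in batch)
    {
        await HandleAsync(message);
    }

    // Nothing makes the read and the fetch atomic, so 20 messages can end up in
    // flight against a limit of 10, and later reads bounce between small and
    // large values: the 1, 9, 1, 1, 8, ... pattern above.
}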
FWIW Brighter uses the model of multiple performers, where each thread is a message pump, and so works serially, with parallelism an explicit option. We use long-running threads, not thread pool threads, as we have a long-running loop, which is an anti-pattern for the thread pool: it expects threads to do short jobs and optimizes for that, AFAIK.

We are a bit more old school, and the loop reads from an in-memory queue, into which we can insert a quit message so that we exit after the next job is taken, instead of using a cancellation token. The queue itself can act as a shared buffer on some transports, so you can fill that buffer and have multiple performers read the same pool of work (it won't work on RMQ, for example, where there is thread affinity between a channel and the delivery tag that you ack with, but it does for something like SQS, which is just using HTTP and doesn't have open channels etc.).

Our model reflects the need to preserve ordering, which requires a single performer and lets you manage that, and is overall simpler than trying to build back pressure on top of the pool, something that is possible, but having done it a few times not something I really like over using explicit long-running threads.

The biggest problem I need to fix, though, is that we have one thread for both the pump and the handler. That makes it easy (you only get the next item when you finish work), but we want to support an async message pump (for scenarios where you don't need ordering). We want to support thread affinity on a callback, and we can support that by using our own queue to implement a synchronization context so that your callback becomes another item to execute via the loop, but that's not finished and has some tricky edge cases.

Anyway, mentioned just to compare notes on approaches, but we steered away from the thread pool right from the beginning due to less than positive experiences with it for this kind of scenario.
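A rough sketch of that "long-running thread per performer, quit message in the queue" model, written to the description above rather than taken from Brighter's actual code:

using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class Performer
{
    // A sentinel placed on the queue to tell a performer to stop after the current item.
    public static readonly object Quit = new object();

    private readonly BlockingCollection<object> _workQueue;
    private readonly Action<object> _handler;
    private readonly Thread _thread;

    public Performer(BlockingCollection<object> workQueue, Action<object> handler)
    {
        _workQueue = workQueue;
        _handler = handler;

        // A dedicated long-running thread rather than a thread-pool thread,
        // because the pump loop runs for the lifetime of the process.
        _thread = new Thread(Run) { IsBackground = true };
    }

    public void Start() => _thread.Start();

    private void Run()
    {
        // Several performers can consume the same BlockingCollection, which is how
        // the in-memory queue acts as a shared buffer on transports that allow it.
        foreach (var item in _workQueue.GetConsumingEnumerable())
        {
            if (ReferenceEquals(item, Quit))
            {
                break;
            }

            _handler(item); // handle the message serially on this thread
        }
    }
}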
Having a synchronous message pump makes a lot of sense, certainly better than blocking on thread pool threads as you say. Here the pump is per queue and asynchronous, which saves us a thread per queue when idle. I'd certainly trade that for better-performing code. This PR hopefully gives us both (or at least to some extent). Out of curiosity, are you saying that your message handler async callbacks are run on one thread? What do you want/need that for?
It's more "can be": we let you set ConfigureAwait to true so that you can have thread affinity on your callback. We have a number of users that want that, mostly for legacy reasons, and even HttpClient is not quite as thread-safe as most of us might like. In addition, in some cases you might not want to be pushed back onto the thread pool for your callbacks, which is where the ASP.NET team went in the end, as there is no synchronisation context in ASP.NET Core. I'm not entirely sure that dropping back into the thread pool is what you always want to do, as it comes with its own gotchas. But for the moment it is parked because it is hard to get right, and the fact that the ASP.NET team quit on making it work was not encouraging...
Thanks for the input everyone - when I get the chance this week I'll come back to this PR and rework it some more based on the above discussion, to change to something like this for the pump:

while (!cancellationToken.IsCancellationRequested)
{
    var messages = await GetMessagesAsync();

    foreach (var message in messages)
    {
        if (cancellationToken.IsCancellationRequested)
        {
            break;
        }

        await DoMessageStuffAsync(message);
    }
}

Then there just needs to be something opt-in to let you do that in parallel on several threads, which would let you do more in a worker that only processes one queue, so that the app makes efficient use of its resources (assuming multiple cores).
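The opt-in parallelism could then be as simple as running several copies of that same loop concurrently, something like this sketch, where RunPumpAsync stands in for the while loop above and maxConcurrency is the configured limit:

// One pump gives the sequential behaviour; N pumps give bounded parallelism
// without any shared "available workers" bookkeeping.
var pumps = Enumerable.Range(0, maxConcurrency)
    .Select(_ => RunPumpAsync(cancellationToken))
    .ToArray();

await Task.WhenAll(pumps);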
I think this is done and ready for review now.
I thought it didn't queue anything else up further down the chain, so it would actually be handling the message before the await completed. Basically turning the strategy into "Do message, do message, do message, ..., get more messages, ..." rather than "get messages, despatch messages, get more messages, ...".
Sorry, I shouldn’t try reviewing code on my phone 😳. I missed the await of
Fantastic work @martincostello, this addresses head-on some of the things that have bugged me for years, and the sequential processing option is a great addition as well. 😃
Improve the behaviour of the Throttled implementation to use the async methods on the SemaphoreSlim class rather than wrap wait handles in tasks and manually queue to the thread pool. Also makes changes to account for the fact that code using the processing strategy can race and think there are available workers, but then find none available when it attempts to process the message.
Simplify the message loop and remove the concept of AvailableWorkers from IMessageProcessingStrategy.
Refactor the message loop to make it easier to follow. Remove AvailableWorkers property. Rename MaxWorkers to MaxConcurrency. Add ThrottledOptions class. Support serial processing as well as fire-and-forget using the thread pool.
Rename UseThreadPool to ProcessMessagesSequentially to make it more explicit, and make it opt-in, with the default matching the v6 behaviour.
Add unit tests that verify that messages are processed sequentially (or not) in Throttled.
Add a unit test that verifies that Throttled does not allow more concurrent work than specified by the options.
Rename method after changes in master.
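Putting those commits together, the configuration ends up along these lines. This is only a sketch using the names from the commit messages above; the exact constructor, defaults and option shape in the PR may differ.

// MaxConcurrency replaces MaxWorkers; ProcessMessagesSequentially replaces
// UseThreadPool and is opt-in, so the default keeps the v6 fire-and-forget behaviour.
var options = new ThrottledOptions
{
    MaxConcurrency = 16,
    ProcessMessagesSequentially = false,
};

var strategy = new Throttled(options, messageMonitor); // constructor shape is a guess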
Rebased.
🎉
It looks good! Well done!
TL;DR In high-throughput scenarios, the Throttled class's implementation causes Task backlogs that consume CPU and memory which cannot be recovered when message rates drop off.

This PR attempts to improve the behaviour of the Throttled implementation to use the async methods on the SemaphoreSlim class rather than wrap wait handles in tasks and manually queue to the thread pool.

It also makes changes to account for the fact that code using the processing strategy can race: it can think there are available workers, only for none to be available when it attempts to process the message.

This last point in particular needs some thought and discussion, as the implementation seems quite baked into that assumption, which in retrospect seems to be asking for race conditions.

The implementation does not appear to take into account that reads of the "free workers" value are non-atomic with respect to the time it takes to request messages from SQS and start to process them.
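Schematically, the change in waiting style is the difference between these two patterns. This is a sketch of the idea only, not the literal before/after code in the PR, and the semaphore and cancellationToken here are assumed to be the ones the strategy already holds:

// Before (roughly): bridge the semaphore's wait handle onto the thread pool and
// wrap it in a Task via a TaskCompletionSource.
var tcs = new TaskCompletionSource<object>();
ThreadPool.RegisterWaitForSingleObject(
    semaphore.AvailableWaitHandle,
    (state, timedOut) => tcs.TrySetResult(null),
    null,
    Timeout.Infinite,
    executeOnlyOnce: true);
await tcs.Task;

// After (roughly): await SemaphoreSlim's own async API, which queues a lightweight
// continuation instead of a wait-handle registration and a thread-pool callback.
await semaphore.WaitAsync(cancellationToken).ConfigureAwait(false);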
Details
These changes stem from a production issue where a worker that processes a single SQS queue was migrated from .NET Framework 4.7.2 to .NET Core 2.2.5. This change was canary'd on a single EC2 instance for ~24 hours on a queue that has a particularly high message throughput rate.
Below is the CPU and memory consumption for a 24-hour period where the code using JustSaying 6.0.1 is in use:

As can be seen, there is a slow memory leak in the service, as well as increasing CPU usage from approximately 5pm, which climbs to ~50% by 9pm and never recovers, even when the rate of messages being delivered drops off to near zero, as shown below.

There were otherwise no obvious signs of the worker being in distress, apart from the memory and CPU. The mean time to process messages in the canary was actually more consistent, and messages were being handled faster than in the .NET Framework version:
Capturing a process dump from the canary showed that there were hundreds of tasks scheduled to be run with various stacks and start points, suggesting the thread pool was reaching its limits.
Some sniffing around in the code led to looking at the Throttled class for performance bottlenecks, as the application is small: it only consumes one queue with a single handler and makes an HTTP call to an external service based on the contents of the message, so it is quite simple and lightweight.

Digging into the history leads to some refactoring in #440, but that was merged to master after 6.0.1 was released to NuGet, with no significant changes in the rest of the commit history (as far back as 5th April 2017; it's hard to look before that due to renaming).

My working hunch is that the processing rate improved to the degree that the throttling became the bottleneck, and tasks to process messages were backing up due to the indefinite wait for the semaphore here:
https://github.com/justeat/JustSaying/blob/88f241c8a6bcb5768e2d57fae4aef906aacad53a/JustSaying/Messaging/MessageProcessingStrategies/Throttled.cs#L26
The attached refactoring is based on changes made directly to the application yesterday, which were tested overnight to validate the hypothesis (as well as increasing the throttle factor from 8 to 16). This showed the same throughput improvements but without the CPU or memory issues; in fact, the CPU used is now generally lower than with the .NET Framework version.

Below are the comparison charts to the above with the fix. The CPU is the lowest orange line (look near 8pm) and the memory is the lowest yellow line, showing that, while slightly more memory is used, there is no leak.