HystrixCollapser: Response Not Received #80

Closed
benjchristensen opened this issue Jan 8, 2013 · 6 comments

@benjchristensen

Symptoms:

  • sometimes collapsed requests do not receive their response from the executing batch
  • collapsed requests that don't receive responses would block calling threads for 15+ seconds
  • latency in batches being executed

Causes:

  • the only reproducible cause (found using real production traffic) is high load average on a box, which makes the timer thread very latent in triggering batches (seconds, not just 10s or 100s of milliseconds)
  • various concurrency bugs that made this latency worse to handle, for example by leaving calling threads blocked for long periods

This issue was originally titled "IllegalStateException: Future Not Started" and was thought to be related to this HystrixCollapser issue. It turned out to be unrelated, but this thread was taken over by the collapser problem, so the original bug was re-filed as #113. The first 2 comments, including pull request #81, relate to #113 and not to this HystrixCollapser issue.

@ghost assigned benjchristensen on Jan 8, 2013
@benjchristensen

Need to first add some more debug information and deploy into production to capture more data (as I have not been able to replicate it outside of production).

@benjchristensen

I figured out what's causing the issue ... under high traffic and high system load average, the timer thread becomes overwhelmed, so the collapser receives its ticks very late. In my canary testing, when squeezing a box, ticks were arriving 5+ seconds late, sometimes as much as 8 seconds, when they should arrive in 10ish milliseconds.

2013-02-24 01:24:46,337 ERROR com.netflix.hystrix.HystrixCollapser$CollapsedRequestFutureImpl:1012 [api.endpoint.default.pool-thread-154] [get] reqId=1361669037482r986.4960403647274eNFWII-005-08C19342-D86BF796D008 TimedOut waiting on responseReference.  responseReceived: 1 batchReceived: 0 batchFuture: com.netflix.hystrix.HystrixCollapser$BatchFutureWrapper@5d412316 batchFuture.isDone: true batchFuture.isCancelled: false batchFuture.mapResponseWork: false timeToGet: 0 batchReceivedTime: 7131 timeToNow: 10065 argument: com.netflix.dependencycommands.cinematch.CinematchPredictionRequestArguments@291f135c  countResponsesReceived: 0 from: 1  batchFuture.numGets: 0 waitedOnBatchFuture: false  timeout: 5 timeunit: SECONDS
2013-02-24 01:24:46,345 ERROR com.netflix.hystrix.HystrixCollapser$CollapsedRequestFutureImpl:1012 [api.endpoint.default.pool-thread-130] [get] reqId=1fada0b8-e08e-f350-2233-796eb6ef6e71 TimedOut waiting on responseReference.  responseReceived: 1 batchReceived: 0 batchFuture: com.netflix.hystrix.HystrixCollapser$BatchFutureWrapper@68011b48 batchFuture.isDone: true batchFuture.isCancelled: false batchFuture.mapResponseWork: false timeToGet: 0 batchReceivedTime: 7066 timeToNow: 10008 argument: com.netflix.dependencycommands.cinematch.CinematchPredictionRequestArguments@51733ef  countResponsesReceived: 0 from: 1  batchFuture.numGets: 0 waitedOnBatchFuture: false timeout: 5 timeunit: SECONDS
2013-02-24 01:24:47,913 ERROR com.netflix.hystrix.HystrixCollapser$CollapsedRequestFutureImpl:1012 [api.endpoint.default.pool-thread-187] [get] reqId=1361669058719r390.54578775539994eNFWII-005-07137C0C-002331F31882 TimedOut waiting on responseReference.  responseReceived: 1 batchReceived: 0 batchFuture: com.netflix.hystrix.HystrixCollapser$BatchFutureWrapper@1c2707e batchFuture.isDone: true batchFuture.isCancelled: false batchFuture.mapResponseWork: false timeToGet: 0 batchReceivedTime: 5741 timeToNow: 10018 argument: com.netflix.dependencycommands.cinematch.CinematchPredictionRequestArguments@4ccc5d89  countResponsesReceived: 0 from: 1  batchFuture.numGets: 0 waitedOnBatchFuture: false  timeout: 5 timeunit: SECONDS
2013-02-24 01:24:47,921 ERROR com.netflix.hystrix.HystrixCollapser$CollapsedRequestFutureImpl:1012 [api.endpoint.default.pool-thread-172] [get] reqId=d1c9684b-1d1c-3d6b-a23c-312fbd6bb195 TimedOut waiting on responseReference.  responseReceived: 1 batchReceived: 0 batchFuture: com.netflix.hystrix.HystrixCollapser$BatchFutureWrapper@438b70e9 batchFuture.isDone: true batchFuture.isCancelled: false batchFuture.mapResponseWork: false timeToGet: 0 batchReceivedTime: 5748 timeToNow: 10033 argument: com.netflix.dependencycommands.cinematch.CinematchPredictionRequestArguments@54f21c1a  countResponsesReceived: 0 from: 1  batchFuture.numGets: 0 waitedOnBatchFuture: false timeout: 5 timeunit: SECONDS

Note the 'batchReceivedTime' value, which measures how long it took from the time a batch began until it received its tick.

2013-02-24 04:49:12,153 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 5370  Listeners for this interval: 427
2013-02-24 04:49:19,377 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 7212  Listeners for this interval: 489
2013-02-24 04:49:25,453 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 6060  Listeners for this interval: 428
2013-02-24 04:49:37,421 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 11957  Listeners for this interval: 506
2013-02-24 04:49:42,546 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 5113  Listeners for this interval: 454
2013-02-24 04:49:53,374 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 10815  Listeners for this interval: 576

This shows the time for a single loop of the timer. Ideally it would be ~10ms; here it is taking multiple seconds when under load.
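To illustrate why a single timer thread degrades this way, here is a minimal hypothetical sketch of a single-threaded tick loop (illustrative names only, not the actual HystrixTimer code): one thread services every listener, so when the CPU is starved, the pass over 600+ listeners slows down and every waiting collapser falls behind together.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch, not the real HystrixTimer: one thread ticks all listeners.
class SingleThreadedTickLoop implements Runnable {
    private final Queue<Runnable> listeners = new ConcurrentLinkedQueue<>();

    void addListener(Runnable listener) {
        listeners.add(listener);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.currentTimeMillis();
            // A single pass over all (600+) listeners; if this thread is starved
            // of CPU, every listener's tick is delayed by the same amount.
            for (Runnable listener : listeners) {
                listener.run();
            }
            // This is the gap the "Duration of Loop" log lines above measure.
            long duration = System.currentTimeMillis() - start;
            try {
                Thread.sleep(Math.max(0, 10 - duration)); // target ~10ms per loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```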

When the machine is healthy (load average of 6-7 on an 8-core box, instead of approaching 20 as in the tests above), it looks more like this:

2013-02-24 05:24:17,764 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 140  Listeners for this interval: 656
2013-02-24 05:24:21,568 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 133  Listeners for this interval: 654
2013-02-24 05:24:28,204 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 183  Listeners for this interval: 655
2013-02-24 05:24:29,669 ERROR com.netflix.hystrix.util.HystrixTimer$TickThread:191 [HystrixTimer_Tick] [run] reqId=1361681126421r626.2122162152082eNFWII-005-0A464098-58BDA384F293 Duration of Loop: 122  Listeners for this interval: 655

It's still not the 10ms I want, but far healthier.

When load is <6 the loop is <100ms.

The considerations right now are:

  1. Should I shard the timer so that a single thread isn't handling the 600+ listeners (an aggregate count for our production usage; it would vary by usage)? Sharding would allow me to have, for example, a timer per 100 listeners (up to the CPU count?). See the sketch after this list.
  2. How can I get the tick() to be more efficient so the timer ticking isn't prone to backing up?
  3. Is there a completely different way to do this?
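A minimal sketch of what option 1 could look like, using ScheduledThreadPoolExecutor to spread listeners across a small pool of timer threads (class and method names here are hypothetical, not the actual HystrixTimer API):

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a sharded timer: listeners are spread across a small
// pool of threads so one slow or backed-up pass no longer delays every tick.
class ShardedTimer {
    private final ScheduledThreadPoolExecutor executor;

    ShardedTimer() {
        // One timer thread per core, per the "up to CPU count" idea above.
        int threads = Runtime.getRuntime().availableProcessors();
        executor = new ScheduledThreadPoolExecutor(threads);
    }

    // Each listener becomes its own scheduled task instead of sharing one loop.
    ScheduledFuture<?> addListener(Runnable listener, long intervalMs) {
        return executor.scheduleAtFixedRate(listener, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```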

I have changed code that I'll commit soon to improve the error handling when this latency does occur; right now a request ends up waiting twice when it only needs to wait once and then throw an exception.

@benjchristensen

I also need to build in a collapser timeout that inherits from the underlying HystrixCommand timeout, so that if the collapser scheduling itself is struggling, it sheds load instead of blocking.
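A rough sketch of the intent (a hypothetical helper, not the eventual Hystrix API): the collapsed request's wait is bounded by the underlying command's own timeout, and on expiry it cancels rather than waiting again, so callers aren't blocked for 15+ seconds.

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CollapserTimeoutSketch {
    // Hypothetical helper: wait no longer than the underlying command's timeout.
    static <T> T getWithInheritedTimeout(Future<T> collapsedResponse, long commandTimeoutMs) throws Exception {
        try {
            return collapsedResponse.get(commandTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Shed load: cancel instead of waiting a second time.
            collapsedResponse.cancel(true);
            throw e;
        }
    }
}
```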

@benjchristensen

Here are performance comparisons from an improved implementation I'm testing on a production canary:

[chart: collapser-batch-trigger-changes]

The new code has the following changes:

  • refactored HystrixTimer to be multi-threaded, using ScheduledThreadPoolExecutor to achieve this
  • refactored HystrixCollapser to:
    • time out on wait and trigger batch execution from the calling thread instead of relying only on the background timer
    • use a single CountDownLatch per batch rather than 2 per collapsed request, dramatically reducing the number of items being waited on (see the sketch after this list)
    • change how batches are defined so that creating a batch is a simpler step, without a loop over all requests to register with the batch
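A minimal sketch of the latch change (illustrative names, not the actual HystrixCollapser internals): all collapsed requests in a batch block on one shared CountDownLatch, rather than each request managing latches of its own.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: one latch per batch instead of two per collapsed request.
class Batch<R> {
    private final CountDownLatch batchCompleted = new CountDownLatch(1);
    private volatile R response;

    // Called once by whichever thread executes the batch.
    void complete(R batchResponse) {
        response = batchResponse;
        batchCompleted.countDown(); // releases every waiting collapsed request at once
    }

    // Called by each collapsed request's thread; all of them share the one latch.
    R awaitResponse(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        if (!batchCompleted.await(timeout, unit)) {
            throw new TimeoutException("batch did not complete within timeout");
        }
        return response;
    }
}
```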

I'm going to run it in production on a canary for a while before committing or merging to make sure the initial performance improvements hold true and I haven't introduced any other bugs or memory leaks.

@benjchristensen

Pull Request #114 was tested using a Netflix API production canary and by tests from 2 other applications.

@benjchristensen

I have merged this code and am closing this issue. It will be released in 1.2.10.

Notes from the pull request:


I have gotten positive test results of this from 3 teams including my own testing on production canaries so I am merging this code and releasing.

What I looked at in prod was:

  • no new errors showed up
  • the concurrency bugs and "Response Not Received" errors are now gone
  • performance numbers for application behavior under normal system load did not change
  • thread dumps look clean
  • a histogram of objects on the heap looks fine after running through peak traffic and overnight (so it doesn't look like I created any memory leaks)

The other 2 teams did stress tests in the test environment and everything passed.

abersnaze pushed a commit to abersnaze/Hystrix that referenced this issue Nov 7, 2013
I still can't replicate this so am adding error handling anywhere I can foresee something as well as adding more logs.

Netflix#80
abersnaze pushed a commit to abersnaze/Hystrix that referenced this issue Nov 7, 2013
IllegalStateException: Future Not Started Netflix#80

- refactored HystrixTimer to be multi-threaded and use ScheduledThreadPoolExecutor to achieve this
- refactored HystrixCollapser to:
  -- timeout on wait and trigger batch execution from calling thread instead of relying on only background timer
  -- use single CountDownLatch per batch rather than 2 per collapsed requests to dramatically reduce number of items being waited on
  -- changed how batches are defined so the creation of a batch is a simpler step without a loop over all requests to register with the batch

Testing on a production canary shows these changes have given significant efficiency and thus performance gains.