-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heartbeat interval: Make it configurable #2243
Conversation
When I was looking at a separate timer for async timeout evaluation < 1000ms fidelity it became apparent _most_ of the cost in a heartbeat is the timeout evaluation anyway, but also: if you are chasing a lower timeout fidelity you likely want to observe a server outage ASAP as well so I think it makes more sense to actually open up the heartbeat to make it configurable.
cc @deanward81 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. I'll re-run some of the load testing that got me to #2146 in the first place this week. This seems reasonable to me as a way of resolving our original issue.
Given the cautionary notes I've seen from developers here regarding shortening this interval, is this a measurable/meaningful impact? Decreasing the interval from 1s to 100ms will 10x the total amount of work the library will incur overall, but how much of an impact is this? Would this just be burning extra cycles in a separate thread, or does this impact request processing and throughput?
One of the workarounds we implemented in our codebase is await on a Task.Delay
with a short timeout, and just abandon the request if the timeout is exceeded. Not a great solution, since this still leaves tasks floating around (since there doesn't seem to be a cancellation interface), and also retries seem to still select the same failing connection.
If decreasing this heartbeat interval is seen as too drastic a measure, perhaps an alternative implementation to achieve a similar result is to present a cancellation interface to the client allowing the requester to abort the request early? I have no data to support either conclusion, hence the need for testing, but I don't have a great way of fixturing this library directly without digging into the code more in depth, only in the black-box method of testing we've performed so far.
@@ -20,7 +20,7 @@ internal sealed class PhysicalBridge : IDisposable | |||
|
|||
private const int ProfileLogSamples = 10; | |||
|
|||
private const double ProfileLogSeconds = (ConnectionMultiplexer.MillisecondsPerHeartbeat * ProfileLogSamples) / 1000.0; | |||
private const double ProfileLogSeconds = (1000 /* ms */ * ProfileLogSamples) / 1000.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why multiply by a thousand just to divide again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was really just showing intent of the net change without doing a bigger rename, etc. - can be simplified for sure.
@tyzoid We can't abort it early really, because the command has been sent, so we have to handle/account for it no matter what - any per-command tracking (especially out of order) would be much more costly. On the overall cost of this...it'd be awesome to experiment and see. The work of a heartbeat isn't crazy but there's a brief lock around the work queue to check timeouts. If people lower it a ton though, say to 10ms, it's probably not awesome and going to have some measurable jitter in spikes. I don't have exact numbers, by nature it's going to depend on the workload a lot as to if it'll have any impact or how much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NickCraver - So I understand, the issue here is that the event that wakes up to check the heartbeat is the same that checks all of the timeouts?
If the goal here is to allow for tighter propagation of aggressively set timeouts, this seems like a somewhat roundabout way to get there. Wouldn't it make more sense to decouple the process that checks the timeouts from the hearbeat and make that interval configurable? It strikes me that the heartbeat needs a network call and a full round trip to Redis, which, even if it's piggybacking along with other packets being multiplexed to Redis, sounds like an expense you shouldn't need to pay to check some timeouts and wakeup the hung tasks/threads associated with those timeouts?
@slorello89 We only connect to Redis if we haven't written in a long time, this shouldn't actually increase the network activity any...and most of the work left is the timeout and messages. There's no additional thread here - only effectively the timeout checks and a few if statements ahead of them. I agree it sounds roundabout and my first work was the split path, but 2 things drove me back to simpler:
|
When I was looking at a separate timer for async timeout evaluation < 1000ms fidelity it became apparent most of the cost in a heartbeat is the timeout evaluation anyway, but also: if you are chasing a lower timeout fidelity you likely want to observe a server outage ASAP as well so I think it makes more sense to actually open up the heartbeat to make it configurable.
I added some docs about how this is risky but that portion and comments could use scrutiny (if we even agree with doing this) on being strong enough warnings around this. I would prefer to leave this as code-only due to potential downsides people may not appreciate otherwise.
Thoughts?