
Start measuring Ping latency closer to sending request and remove Stopwatch instance #1714

Merged

Conversation


@eduardobr eduardobr commented Mar 28, 2021

Inspired by ValueStopwatch (https://github.com/dotnet/extensions/blob/master/src/Shared/src/ValueStopwatch/ValueStopwatch.cs), I'm removing the Stopwatch instance from TimerMessage.

There's a change of behavior here:
We no longer start the latency measurement when the message is instantiated, but right before it is written to the network.
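The ValueStopwatch idea is to hold only a start timestamp and compute elapsed time on demand, instead of keeping a heap-allocated Stopwatch object alive per message. A minimal Python analogue of that pattern (using `time.perf_counter_ns()` in place of .NET's `Stopwatch.GetTimestamp()`; this is an illustrative sketch, not code from the PR):

```python
import time

class ValueStopwatch:
    """Lightweight analogue of .NET's ValueStopwatch: store only the
    start timestamp and compute elapsed time on demand, instead of
    keeping a full stopwatch object alive."""
    __slots__ = ("_start",)

    def __init__(self):
        # time.perf_counter_ns() is a monotonic, high-resolution counter,
        # roughly analogous to Stopwatch.GetTimestamp() in .NET.
        self._start = time.perf_counter_ns()

    def elapsed_ms(self) -> float:
        return (time.perf_counter_ns() - self._start) / 1_000_000

sw = ValueStopwatch()
time.sleep(0.01)
print(f"elapsed: {sw.elapsed_ms():.2f} ms")
```

Because the "stopwatch" is just a number, there is nothing to allocate and nothing for the GC to track.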

@eduardobr eduardobr force-pushed the feature/optimize-timermessage branch from 6552038 to 16a6124 on March 28, 2021 18:16
NickCraver pushed a commit that referenced this pull request May 22, 2021
Decreasing diff on #1714 to make things a bit easier to read.
@NickCraver

Pushed up the naming changes separately to make the changes here easier to see!


@NickCraver NickCraver left a comment


As I understand it, a subtle but important difference here is sensitivity to system clock changes. For example, if a system clock is off and gets corrected by an NTP sync (this happens pretty often), that correction introduces incorrect timings: the difference of 2 absolute timestamps, as opposed to a Stopwatch (which is relative), will not be correct. It could show too much time, or even negative time, when corrections occur.

As an example, let's say a command starts just after midnight:

[System time: 00:00:05.500] Command starts (`timingMessage.CreatedTimestamp` is set)
[System time: 00:00:06.000] NTP time corrects, it's actually [00:00:04] (clock drift was 2 seconds)
[System time: 00:00:05.000] Command finishes

...according to profiling, this command took -500ms, because that's what the diff of the 2 timestamps told us.
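The scenario above is the classic wall-clock vs. monotonic-clock distinction. A small Python illustration (the "wall clock" values are simulated to mirror the timeline above; the monotonic readings are real):

```python
import time

# Wall-clock timestamps move when NTP steps the system clock, so the
# difference of two readings can be wrong or even negative.
# Simulated readings around a -2 s NTP correction, as in the timeline:
start_wall = 5.5   # command starts at 00:00:05.500
end_wall = 5.0     # command finishes at 00:00:05.000 (clock was stepped back)
print(f"wall-clock diff: {end_wall - start_wall:+.1f} s")  # -0.5 s, i.e. "-500ms"

# A monotonic clock is immune to such steps: it only ever moves forward.
t0 = time.monotonic()
time.sleep(0.01)
t1 = time.monotonic()
assert t1 >= t0  # guaranteed, regardless of NTP activity
print(f"monotonic diff: {t1 - t0:.3f} s")
```

The concern, then, is only a problem if the new code diffs wall-clock timestamps; diffing two readings of a monotonic counter is equivalent to what Stopwatch does.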

I agree on the potential efficiency win here, but I think we are trading away correctness (on systems without a high-resolution performance counter), which is a hard sell when it comes to profiling. But maybe the number of systems affected would be so low that we're okay with that.

Thoughts? Suggestions?

@eduardobr
Copy link
Contributor Author

eduardobr commented May 23, 2021

Hi Nick,
I was just checking the source of Stopwatch, and we can see that the same static GetTimestamp is used internally both when Start is called and when we read the Elapsed property of a running Stopwatch.
Considering this, the change here calculates time exactly the same way; it just moved from Stopwatch.Start() to Stopwatch.GetTimestamp().
Unless I'm missing some other bit?
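In other words, `Stopwatch.Start()` records `GetTimestamp()` internally and `Elapsed` is computed as `(GetTimestamp() - start) / Frequency`, so calling the static method directly reads the exact same monotonic counter; only the object allocation disappears. A Python sketch of that tick arithmetic (`time.perf_counter_ns()` stands in for the tick counter, returning nanoseconds directly so no Frequency division is needed; illustrative only):

```python
import time

NS_PER_MS = 1_000_000

# "Static timestamp" style, as in the PR: capture raw ticks, diff later.
start_ticks = time.perf_counter_ns()   # analogous to Stopwatch.GetTimestamp()
time.sleep(0.005)
end_ticks = time.perf_counter_ns()

# Same arithmetic a running stopwatch performs internally for Elapsed:
elapsed_ms = (end_ticks - start_ticks) / NS_PER_MS
print(f"elapsed: {elapsed_ms:.2f} ms")
```

Both styles read the same clock source, so the measured duration is identical either way; only the per-message allocation is saved.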

Other than this, what I really wanted when I was looking at this code was to understand when the Ping() latency measurement starts. In some projects I'm working on, I can see the calculated latency being much higher than the actual round trips to Redis when measured from outside the application. I suspect the reason is that the application was running in containers with extremely low CPU (0.2 of a CPU core), and the code path to issue a Ping() call was slower than the actual round trip to Redis while the application was busy processing other things in parallel.
So I wanted a latency measurement that starts as late as possible before the request leaves the application for the network, and finishes as early as possible when the response comes back.
But I think that requires a larger refactoring to achieve, and it would need to be announced so people don't think their network suddenly became faster for some reason.
I would be happy to try something if you think all of this makes sense.
If the intention of the TimerMessage class really is to measure everything, code execution included, then maybe we can have a GetLatency(), or use a different, new class (not TimerMessage) only for Ping, because it's documented to return latency.

Update: at least the start time for measuring latency could be made quite precise if we capture it in TimerMessage.WriteImpl instead, no?

Update 2: going deeper into the library code, I just noticed that TimerMessage is only used for Pings, so I've changed it there already to call GetTimestamp in WriteImpl().
I made a quick and dirty test running 10k pings in a for loop, storing the time taken from the TimerMessage constructor to WriteImpl: the max was a surprising 27ms, with the average being 0.05ms. Those delays shouldn't be present in the latency measurement anymore.
I hope it all makes sense.
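The shape of that quick test can be sketched as follows. This is a hypothetical re-creation in Python, not the actual benchmark: the `created`/`written` capture points stand in for the TimerMessage constructor and WriteImpl, and there is no real queueing work between them here.

```python
import time

# Measure the in-process delay between "message created" and "message
# written" over many iterations, then report the average and maximum.
delays_ms = []
for _ in range(10_000):
    created = time.perf_counter_ns()   # stand-in: TimerMessage constructor
    # ... in the real library, the command is queued and scheduled here ...
    written = time.perf_counter_ns()   # stand-in: WriteImpl hits the network
    delays_ms.append((written - created) / 1_000_000)

print(f"avg={sum(delays_ms) / len(delays_ms):.4f} ms, max={max(delays_ms):.2f} ms")
# Starting the latency clock at "written" excludes this in-process delay
# from the reported ping latency.
```

Under CPU starvation (e.g. a 0.2-core container), the gap between the two capture points is exactly where the reported outliers would come from.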

@eduardobr eduardobr force-pushed the feature/optimize-timermessage branch from 0b7a81c to 1eeaf4c on May 23, 2021 18:06
@eduardobr eduardobr force-pushed the feature/optimize-timermessage branch from 1eeaf4c to e4328e4 on May 23, 2021 19:12
@eduardobr eduardobr changed the title Reduce allocation on TimerMessage by reusing Message.CreatedTimestamp Start measuring Ping latency closer to sending request and remove Stopwatch instance May 23, 2021

@NickCraver NickCraver left a comment


Looking good - thank you!

@NickCraver NickCraver merged commit 69a17b1 into StackExchange:main May 25, 2021
NickCraver pushed a commit that referenced this pull request May 25, 2021
Adding #1714 to release notes