Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent panic due to overflowing our source calibration denominator. #61

Closed
tobz opened this issue Dec 23, 2021 · 5 comments
Closed

Comments

@tobz
Copy link
Member

tobz commented Dec 23, 2021

This was reported in an old issue in the metrics crate -- metrics-rs/metrics#230 -- but essentially the user was hitting the line where we hypothesized that something was amiss if we managed to wrap around when calling u64::next_power_of_two.

Thinking on this some more, one potential explanation is that the user had different TSC offsets on different cores, and somehow they were hitting a case where the source measurement happening in adjust_cal_ratio was actually on a core where the TSC value was smaller than the value taken initially via Counter::now in calibrate.

That would seemingly explain how the end - start calculation could yield a number that would cause next_power_of_two to overflow, and as long as the absolute delta was smaller than (2^64)-(2^63), we'd always end up with a value that would trigger that overflow.

The bigger question might be: why did this user's set-up somehow manage to trigger this behavior, even at the quoted "maybe once out of 20 times" rate, when quanta is used in many applications that never seem to experience this?

@tobz
Copy link
Member Author

tobz commented Dec 23, 2021

Digging a little further, and with some painfully obvious search terms, I discovered this forum discussion about unexpectedly high TSC warp/deviation specifically on Thinkpads with Ryzen mobile CPUs: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464?page=1#5155358. This matches the system description given by the user in the issue on the metrics repo.

In this case, the issue is one core (CPU #0) having a TSC offset that was much larger than the other cores, seemingly related to the BIOS doing weird things.

All of that said: having rubber ducked it when I wrote up this ticket, any calculation of end.wrapping_sub(start) where the values had an absolute difference of less than 2^63 should return a value that would trigger the overflow. So either the TSC offset was, in fact, that large, or this user was getting somewhat unlucky with thread scheduling every so often by hitting CPU #0 where the TSC warp was in place, but the rest of the cores were identically synchronized?

It's even more fuzzy because the TSC should always be starting from 0, on all cores, as I understand it... but maybe that understanding is incorrect. This could imply that we may actually want to specifically abort calls to adjust_cal_ratio where our measurement is non-monotonic, knowing that we may have hit another core that's warped in the wrong direction or otherwise not synchronized somehow.

More TSC reading required, and maybe it's time to work on #18, too.

@BrynCooke
Copy link

BrynCooke commented May 10, 2022

I think that I may be hitting this when using the mocka library and have pasted a stack trace at moka-rs/moka#113.

My Lenovo laptop is AMD based, and it happens pretty quickly.

It also occurred once on a CircleCI run.

My colleagues, who have M1 and Intel CPUs have not been able to reproduce.

@tobz
Copy link
Member Author

tobz commented May 11, 2022

Argh, yeah, that's not great.

I'll see if I can work on a change that bails out if we detect that the TSC is unstable, since falling back to the OS timing primitives is all we can reasonably do.

@tobz
Copy link
Member Author

tobz commented May 18, 2022

@tatsuya6502 I've published a new release: quanta@v0.10.0.

Ironically, I had already merged a fix for this particular panic, but forgot to cut a new release. 🤦🏻

I'm going to close this issue for now because this should definitively fix the given panic, but if anything still looks amiss.. feel free to reopen this issue, or create a new one.

@tobz tobz closed this as completed May 18, 2022
@tatsuya6502
Copy link

Thanks @tobz for fixing the issue and publishing quanta@v0.10.0! I will update my library (moka) to use v0.10.0 so that @BrynCooke can try it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants