Intermittent panic due to overflowing our source calibration denominator. #61
Comments
Digging a little further, and with some painfully obvious search terms, I discovered this forum discussion about unexpectedly high TSC warp/deviation specifically on Thinkpads with Ryzen mobile CPUs: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464?page=1#5155358. This matches the system description given by the user in the issue on the In this case, the issue is one core (CPU #0) having a TSC offset that was much larger than the other cores, seemingly related to the BIOS doing weird things. All of that said: having rubber ducked it when I wrote up this ticket, any calculation of It's even more fuzzy because the TSC should always be starting from 0, on all cores, as I understand it... but maybe that understanding is incorrect. This could imply that we may actually want to specifically abort calls to More TSC reading required, and maybe it's time to work on #18, too.
I think that I may be hitting this when using the moka library and have pasted a stack trace at moka-rs/moka#113. My Lenovo laptop is AMD based, and it happens pretty quickly. It also occurred once on a CircleCI run. My colleagues, who have M1 and Intel CPUs, have not been able to reproduce it.
Argh, yeah, that's not great. I'll see if I can work on a change that bails out if we detect that the TSC is unstable, since falling back to the OS timing primitives is all we can reasonably do.
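A minimal sketch of what "bail out and fall back" could look like, assuming calibration has two raw counter readings to compare. `TimeSource` and `pick_time_source` are hypothetical names for illustration only, not part of quanta's API, and the actual fix may detect instability differently.

```rust
/// Which time source to use after a calibration attempt (hypothetical type).
#[derive(Debug, PartialEq)]
enum TimeSource {
    /// The counter advanced between the two readings.
    Counter { delta: u64 },
    /// The counter went backwards or did not advance; use OS primitives.
    Os,
}

/// Decide on a time source from two raw counter readings.
/// `checked_sub` returns `None` when `end < start`, which is the case
/// where a plain subtraction would wrap (or panic in debug builds).
fn pick_time_source(start: u64, end: u64) -> TimeSource {
    match end.checked_sub(start) {
        Some(delta) if delta > 0 => TimeSource::Counter { delta },
        _ => TimeSource::Os,
    }
}
```

With these assumptions, `pick_time_source(100, 50)` yields `TimeSource::Os`, while `pick_time_source(100, 612)` yields `TimeSource::Counter { delta: 512 }`.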
@tatsuya6502 I've published a new release: Ironically, I had already merged a fix for this particular panic, but forgot to cut a new release. 🤦🏻 I'm going to close this issue for now because this should definitively fix the given panic, but if anything still looks amiss, feel free to reopen this issue or create a new one.
Thanks @tobz for fixing the issue and publishing
This was reported in an old issue in the `metrics` crate -- metrics-rs/metrics#230 -- but essentially the user was hitting the line where we hypothesized that something was amiss if we managed to wrap around when calling `u64::next_power_of_two`.

Thinking on this some more, one potential explanation is that the user had different TSC offsets on different cores, and somehow they were hitting a case where the source measurement happening in `adjust_cal_ratio` was actually on a core where the TSC value was smaller than the value taken initially via `Counter::now` in `calibrate`.

That would seemingly explain how the `end - start` calculation could yield a number that would cause `next_power_of_two` to overflow, and as long as the absolute delta was smaller than (2^64)-(2^63), we'd always end up with a value that would trigger that overflow.

The bigger question might be: why did this user's set-up somehow manage to trigger this behavior, even at the quoted "maybe once out of 20 times" rate, when `quanta` is used in many applications that never seem to experience this?
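The arithmetic behind that hypothesis can be sketched in isolation. This assumes the delta is computed with wrapping subtraction; `source_delta` and `rounded_delta` are hypothetical helpers for illustration, not quanta's actual code.

```rust
/// Hypothetical stand-in for the `end - start` step, using wrapping
/// subtraction so a backwards-moving counter wraps instead of panicking.
fn source_delta(start: u64, end: u64) -> u64 {
    end.wrapping_sub(start)
}

/// Round the delta up to a power of two, reporting overflow as `None`.
/// `checked_next_power_of_two` returns `None` for any input above 2^63,
/// because the next power of two (2^64) does not fit in a `u64`.
fn rounded_delta(start: u64, end: u64) -> Option<u64> {
    source_delta(start, end).checked_next_power_of_two()
}
```

If the second reading is 1,000 ticks behind the first, `source_delta` wraps to 2^64 - 1000, which is above 2^63, so `rounded_delta` returns `None`. A forward delta of 1,000 ticks rounds up to `Some(1024)`. Any backwards delta smaller than 2^63 lands in the overflowing range, which matches the bound given above.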