Intermittent panic due to overflowing our source calibration denominator. #61

tobz · 2021-12-23T03:38:51Z

This was reported in an old issue in the metrics crate -- metrics-rs/metrics#230 -- but essentially the user was hitting the line where we hypothesized that something was amiss if we managed to wrap around when calling u64::next_power_of_two.

Thinking on this some more, one potential explanation is that the user had different TSC offsets on different cores, and somehow they were hitting a case where the source measurement happening in adjust_cal_ratio was actually on a core where the TSC value was smaller than the value taken initially via Counter::now in calibrate.

That would seemingly explain how the end - start calculation could yield a number that would cause next_power_of_two to overflow: when end is smaller than start, the wrapped difference is 2^64 minus the absolute delta, so as long as the absolute delta was smaller than (2^64)-(2^63) = 2^63, we'd always end up with a value above 2^63 that would trigger that overflow.
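To make the arithmetic concrete, here is a minimal sketch using made-up tick values. The helper name `wrapping_delta` is hypothetical and is not taken from quanta's source. It uses `checked_next_power_of_two` so the overflow shows up as `None` rather than as a debug-build panic:

```rust
/// Hypothetical helper: elapsed ticks computed with wrapping arithmetic,
/// mirroring the `end - start` calculation described above.
fn wrapping_delta(start: u64, end: u64) -> u64 {
    end.wrapping_sub(start)
}

fn main() {
    // Normal case: the second reading is larger than the first.
    let ok = wrapping_delta(1_000, 5_000);
    assert_eq!(ok, 4_000);
    assert_eq!(ok.checked_next_power_of_two(), Some(4_096));

    // Skewed case: the second reading comes from a core whose counter is
    // behind, so the subtraction wraps to 2^64 - 4_000.
    let skewed = wrapping_delta(5_000, 1_000);
    assert_eq!(skewed, u64::MAX - 3_999);
    assert!(skewed > 1u64 << 63);

    // No power of two at or above this value fits in a u64.
    assert_eq!(skewed.checked_next_power_of_two(), None);
}
```

With plain `next_power_of_two`, the same skewed input panics in debug builds, which would match the intermittent panic reported here.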

The bigger question might be: why did this user's set-up somehow manage to trigger this behavior, even at the quoted "maybe once out of 20 times" rate, when quanta is used in many applications that never seem to experience this?


tobz · 2021-12-23T04:12:52Z

Digging a little further, and with some painfully obvious search terms, I discovered this forum discussion about unexpectedly high TSC warp/deviation specifically on ThinkPads with Ryzen mobile CPUs: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464?page=1#5155358. This matches the system description given by the user in the issue on the metrics repo.

In this case, the issue is one core (CPU #0) having a TSC offset that was much larger than the other cores, seemingly related to the BIOS doing weird things.

All of that said: having rubber ducked it when I wrote up this ticket, any calculation of end.wrapping_sub(start) where end was smaller than start and the values had an absolute difference of less than 2^63 should return a value that would trigger the overflow. So either the TSC offset was, in fact, that large, or this user was getting somewhat unlucky with thread scheduling every so often by hitting CPU #0 where the TSC warp was in place, but the rest of the cores were identically synchronized?

It's even more fuzzy because the TSC should always be starting from 0, on all cores, as I understand it... but maybe that understanding is incorrect. This could imply that we may actually want to specifically abort calls to adjust_cal_ratio where our measurement is non-monotonic, knowing that we may have hit another core that's warped in the wrong direction or otherwise not synchronized somehow.
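One way such a guard could look is sketched below. This illustrates the idea only and is not the fix that shipped in quanta. The function names `monotonic_delta` and `calibration_denominator` are hypothetical, and a real adjust_cal_ratio takes different inputs:

```rust
/// Hypothetical guard: return the elapsed ticks only when the second
/// reading is strictly later than the first.
fn monotonic_delta(start: u64, end: u64) -> Option<u64> {
    // `checked_sub` yields None when `end < start`, i.e. when the
    // measurement went backwards (for example after landing on another core).
    // A zero delta carries no calibration information, so it is dropped too.
    end.checked_sub(start).filter(|delta| *delta > 0)
}

/// Hypothetical calibration step: compute the power-of-two denominator,
/// or None to signal that this calibration round should be skipped.
fn calibration_denominator(start: u64, end: u64) -> Option<u64> {
    monotonic_delta(start, end)?.checked_next_power_of_two()
}

fn main() {
    // Monotonic readings produce a usable denominator.
    assert_eq!(calibration_denominator(1_000, 5_000), Some(4_096));

    // Non-monotonic readings are rejected instead of wrapping around.
    assert_eq!(calibration_denominator(5_000, 1_000), None);

    // Identical readings are rejected as well.
    assert_eq!(calibration_denominator(7, 7), None);
}
```

A caller that gets `None` could keep the previous calibration ratio or retry the measurement; which of those is appropriate is left open here.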

More TSC reading required, and maybe it's time to work on #18, too.

BrynCooke · 2022-05-10T12:00:00Z

I think that I may be hitting this when using the moka library and have pasted a stack trace at moka-rs/moka#113.

My Lenovo laptop is AMD based, and it happens pretty quickly.

It also occurred once on a CircleCI run.

My colleagues, who have M1 and Intel CPUs, have not been able to reproduce it.

tobz · 2022-05-11T14:03:09Z

Argh, yeah, that's not great.

I'll see if I can work on a change that bails out if we detect that the TSC is unstable, since falling back to the OS timing primitives is all we can reasonably do.

tobz · 2022-05-18T19:42:01Z

@tatsuya6502 I've published a new release: quanta@v0.10.0.

Ironically, I had already merged a fix for this particular panic, but forgot to cut a new release. 🤦🏻

I'm going to close this issue for now because this should definitively fix the given panic, but if anything still looks amiss, feel free to reopen this issue or create a new one.

tatsuya6502 · 2022-05-19T01:38:30Z

Thanks @tobz for fixing the issue and publishing quanta@v0.10.0! I will update my library (moka) to use v0.10.0 so that @BrynCooke can try it out.

This was referenced Dec 30, 2021

Strong documentation around guarantees, cleaning up API, better calibration, and other stuff. #62

Merged

Comparing to minstant #63

Open

BrynCooke mentioned this issue May 10, 2022

Integer overflow detected in a unit test for the FrequencySketch moka-rs/moka#113

Closed

BrynCooke mentioned this issue May 10, 2022

APQ test fails occasionally apollographql/router#868

Closed

tatsuya6502 mentioned this issue May 11, 2022

Cache::get method sometimes panics due to an integer overflow in quanta::instant::Instant::now moka-rs/moka#119

Closed

tobz closed this as completed May 18, 2022

gootorov mentioned this issue Oct 21, 2023

Does it make sense to pin Upkeep thread? #92

Closed

This was referenced Jan 2, 2025

Option::unwrap() panics in base_cache::Clocks::to_std_instant moka-rs/moka#472

Closed

Detect unsynchronized TSCs across cores and fallback to the OS time source? #111

Open
