-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Miri floating point NaN conversion issue #69532
Comments
(Miri-the-tool has its own bugtracker at https://github.com/rust-lang/miri/issues, where I see things earlier than when they are burried here.) Cc @eddyb, this could also be an apfloat issue as that's what Miri uses to implement casts. |
|
Sorry, I should have said truncated the fraction there. The For reference Note that the normal Rust compiler seems to always set the MSB in the fraction of a NaN during its conversion to avoid this problem. There |
Assuming that f64-to-f32 on NaN has fully defined semantics, we should probably have a test asserting these exact bit patterns for Miri. But what you wrote makes it sound even more like the bug is in apfloat, and not in the Miri code that feeds data to/from apfloat. FWIW, the float casting code is here. |
Yes, it does sound to me like an apfloat issue rather than something unique in Miri. Reading that casting method I can't see anything that is special casing the fraction truncation for NaNs, only for an 80bit extended precision type that has different NaN semantics. Sorry, I've not dealt with the bug workflow on Rust before - is it possible to move this bug to apfloat? Would you like me to create a separate bug on apfloat? |
apfloat does not have a separate issue tracker, no need to move anything. However, if I understood @eddyb correctly then apfloat is a direct Rust port of LLVM's apfloat -- so likely they have the same bug, and a fix would have to happen upstream as well. |
Yes, rustc_apfloat is a port of LLVM's APFloat, but LLVM folds this cast to |
Hm, that's still not the same as the |
I think the LLVM C++ code doing the same cast is here. |
Sure, but that's intentional. Optimizations need to preserve whether something is a NaN but (by default) the payload bits, quiet/signaling distinction and sign bit are not generally preserved or guaranteed to match between compile time evaluation and run time evaluation. |
So the semantics would be something like "if the result is a NaN, non-deterministically pick any legal NaN representation"? |
That sounds about right. Kind of off-topic here, though -- if you want to discuss further, we have #55131. |
The category should still be rust/src/librustc_apfloat/ieee.rs Lines 131 to 136 in 55aee8d
But neither does LLVM's APFLoat? @hanna-kruppe Wait, I just looked at your link and LLVM prints But with a |
Note that converting floats to double bit patterns happens when printing textual IR: https://github.com/llvm/llvm-project/blob/40da6be2bd393b41907588afad4253f45966fc25/llvm/lib/IR/AsmWriter.cpp#L1331-L1343 I'm reasonably sure the constant folding produces a proper float ( |
Just remembered I wanted to look at assembly or machine code to see what the LLVM |
Also isn't a |
Indeed. And I just realized that while the high half of |
So... what does this mean? Is it an LLVM bug, and do we have enough information to report it upstream? |
Yes, either cc @nikic @nagisa (I don't know who does upstream LLVM bug reports nowadays most often) |
@thomcc actually in the other discussion I misremembered what this bug is about. This is not just "surprising NaN payload", indeed the result here is not a NaN when it should be. So that is quite clearly a bug in LLVM's APfloat, or in the Rust port of it (but we seem confident the problem is upstream). It would be great if we could report this against LLVM, but I have no idea how to reproduce the problem with LLVM. Do we need to write a C++ program using APfloat, or can we somehow make this happen just with the right LLVM IR? |
Wow! That's extremely wrong! It turns a NaN into positive infinity! Poking it gives a bit more insight into the issue: https://godbolt.org/z/c59E4E shows that all bits in the NaN payload below the 29th bit are truncated here. Mercifully, I can repro this in rustc_apfloat, which is easier for me to experiment with than LLVM's (okay, I also remember what a headache it was trying to make changes to LLVM in the past, and it's now not something I'm interested in doing in my spare time) #[test]
fn convert_atypical_nan() {
let mut loses_info = false;
let test = Double::from_bits(0x7FF0000000000001);
assert_eq!(test.category(), Category::NaN);
let converted: Single = test.convert(&mut loses_info).value;
// This passes:
assert_eq!(converted.category(), Category::NaN);
// This fails w/ `'convert_atypical_nan' panicked at 'inf'`
let f = converted.to_f32();
assert!(f.is_nan(), "{}", f);
} That is, it reports Doing some further digging shows points me to rust/compiler/rustc_apfloat/src/ieee.rs Line 1491 in b5f55b7
shift_right with a number that comes from the difference between PRECISION (bits of significand) of the source and target, e.g. 53 - 22, or 29 (which all-but-confirms that this is the issue). (The branch where there's the sig::shift_left is wrong too, FWIW).
It's also pretty wrong — IEEE wants you to truncate the payload here, not shift it out. e.g. I think what you want to do is something like (this code intentionally doesn't compile, and is assuming let src_nan_mask = !(Self::QNAN.significand() | Self::SNAN.significand());
let dst_nan_mask = !(T::QNAN.significand() | T::SNAN.significand());
let unmasked_payload = source.significand() & src_nan_mask;
let dest_mask = T::largest() & dst_nan_mask;
let new_payload = unmasked_payload & dest_mask;
let mut nan = if source.is_snan() {
T::snan(Some(new_payload))
} else {
T::qnan(Some(new_payload))
};
nan.sign = source.sign;
return nan; That said, this is possibly hopelessly wrong? I don't thing so but I don't have a copy of IEEE754 around to verify. ... Okay I tried to make it compile, and if you drop this beast in before the shifts (e.g. line 1488 as of commit b5f55b7) the tests pass (Sorry, I foolishly did this in a github download of rustc_apfloat, since my rustc is in the middle of other things, and so I can't easily just give you a patch) // ⚠️ !HACKY & PROBABLY WRONG FIX! ⚠️
if r.category() == Category::NaN {
// this may or may not totally break the idea of arbitrary precision.
fn significand<T: Semantics>(s: IeeeFloat<T>) -> Limb {
// mask away int bit which is probably not relevant a lot of the time idk.
s.sig[0] & !(1 << (T::PRECISION - 1))
}
let src_nan_mask = !(
significand(Self::qnan(None)) |
significand(Self::snan(None))
);
let dst_nan_mask = !(
significand(<IeeeFloat<T>>::qnan(None)) |
significand(<IeeeFloat<T>>::snan(None))
);
let unmasked_payload = significand(r) & src_nan_mask;
let dest_mask = significand(<IeeeFloat<T>>::largest()) & dst_nan_mask;
let new_payload = unmasked_payload & dest_mask;
if new_payload != unmasked_payload {
*loses_info = true;
};
let mut nan = if r.is_signaling() {
<IeeeFloat<T>>::snan(Some(new_payload))
} else {
<IeeeFloat<T>>::qnan(Some(new_payload))
};
nan.sign = r.sign;
// FIXME: THERE'S A BUNCH OF NAN HANDLING BELOW WE SHOULD HAVE
// HERE I'M JUST LAZY. ALSO NOTE THAT SOMEONE WHO UNDERSTANDS
// X87 NAN SHOULD TAKE A LOOK AT THIS CODE PROBABLY.
return Status::OK.and(nan);
} That said it could still have other issues. And the fix could be extremely wrong for other reasons, it's the result of not-very-long spent debugging. Someone who understands the code better should audit it for this issue (@eddyb? It says you wrote E.g. Turning a NaN into an infinity is extremely bad — obviously so. Masking away the bottom bits like this is an obvious bug (nothing anywhere does that), and so even if they don't care about preserving NaNs specifically I suspect they'd want to fix this. Edit: I wrote this without carefully reading the rest of this issue. I think you uncovered some of this? Maybe? I don't know. Maybe this helped, sorry if not. |
Yeah that sounds wrong. :D I'm afraid I cannot comment on the internals of apfloat that you are getting into for the rest of your post. That stuff is way beyond me.^^ I understand you won't work on fixing this in LLVM, but could you make this a bugreport for them, in their bugzilla? I feel a bit awkward making a bugreport that basically says "look I don't know what I am talking about but I have it on good authority that what LLVM is doing is wrong". Would be better if someone who knows what they are talking about does that I think. ;) Chances are, they'll have someone who can fix it. |
I'll send them an email, but... |
Argh. :/ I think this worked fine for me though. If they don't respond in a few days, we can also use me as a proxy.^^ |
That sounds fine. In the meantime it maybe be good for someone more familiar with the rustc_apfloat code to take a look at what I wrote anyway... |
There was already a bug, but I added a comment for it: https://bugs.llvm.org/show_bug.cgi?id=43907. |
There's a fix on the LLVM side now, somewhere in https://bugs.llvm.org/show_bug.cgi?id=43907 -- there are multiple reviews, we should figure out which one actually landed. Anyone up for porting this back to the Rust apfloat? |
I guess https://github.com/llvm/llvm-project/commits/master/llvm/lib/Support/APFloat.cpp would be the page to check for what happened in apfloat upstream. |
Done: #77368 |
The fix on the LLVM side is the minimal possible fix to avoid the linked miscompilation here: https://godbolt.org/resetlayout/Zf7HMh. It doesn't fix this bug in the general case. |
https://reviews.llvm.org/D88238 is a more correct fix. But note that it's not just the change to APFloat that makes it correct (in fact, that change alone is arguably wrong), but at usage sites changing things to be (for example): if (IsSNAN) {
APInt Payload = ID.APFloatVal.bitcastToAPInt();
ID.APFloatVal = APFloat::getSNaN(ID.APFloatVal.getSemantics(),
ID.APFloatVal.isNegative(), &Payload);
} I think we should probably do this inside our |
@thomcc are you referring to this segment of your comment above?
Can you quote the part of the IEEE 754 standard which mandates this? Or alternatively, come up with an example like the one at the top where there is a mismatch between what happens at runtime vs during miri evaluation? Apparently hardware implementations shift out payloads as well: #![feature(test)]
extern crate test;
use test::black_box;
fn main() {
let nan1 = black_box(f64::from_bits(0x7FF0_0201_0000_0001u64));
let nan2 = black_box(f64::from_bits(0x7FF0_0000_0000_0001u64));
assert!(nan1.is_nan());
assert!(nan2.is_nan());
let nan1_32 = black_box(nan1 as f32);
let nan2_32 = black_box(nan2 as f32);
println!("{:0x} -> {:0x} {}", nan1.to_bits(), nan1_32.to_bits(), nan1_32.is_nan());
println!("{:0x} -> {:0x} {}", nan2.to_bits(), nan2_32.to_bits(), nan1_32.is_nan());
} prints:
|
Larger example that compares runtime and const eval (as of nightly master): #![feature(test)]
#![feature(const_fn_transmute)]
extern crate test;
use test::black_box;
const fn make_nans() -> (f64, f64, f32, f32) {
let nan1: f64 = unsafe { std::mem::transmute(0x7FF0_0201_0000_0001u64) };
let nan2: f64 = unsafe { std::mem::transmute(0x7FF0_0000_0000_0001u64) };
let nan1_32 = nan1 as f32;
let nan2_32 = nan2 as f32;
(nan1, nan2, nan1_32, nan2_32)
}
static NANS: (f64, f64, f32, f32) = make_nans();
fn main() {
let (nan1, nan2, nan1_32, nan2_32) = NANS;
assert!(nan1.is_nan());
assert!(nan2.is_nan());
println!("runtime eval:");
{
let nan1: f64 = black_box(f64::from_bits(0x7FF0_0201_0000_0001u64));
let nan2: f64 = black_box(f64::from_bits(0x7FF0_0000_0000_0001u64));
let nan1_32 = black_box(nan1 as f32);
let nan2_32 = black_box(nan2 as f32);
let nan1_32_again = black_box(nan1_32 as f64) as f32;
let nan2_32_again = black_box(nan2_32 as f64) as f32;
println!("{:0x} -> {:0x} {} -> {:0x} {}", nan1.to_bits(), nan1_32.to_bits(), nan1_32.is_nan(), nan1_32_again.to_bits(), nan1_32_again.is_nan());
println!("{:0x} -> {:0x} {} -> {:0x} {}", nan2.to_bits(), nan2_32.to_bits(), nan2_32.is_nan(), nan2_32_again.to_bits(), nan2_32_again.is_nan());
}
println!("const eval:");
println!("{:0x} -> {:0x} {}", nan1.to_bits(), nan1_32.to_bits(), nan2_32.is_nan());
println!("{:0x} -> {:0x} {}", nan2.to_bits(), nan2_32.to_bits(), nan2_32.is_nan());
} prints:
|
IEEE754-2019 6.2.3 (NaN Propagation) states that
This seems impossible if the implementation shifts the payload out. But I guess this only applies to qNaN, although in my testing earlier it seemed to happen on sNaN on my machines. Anyway, I'm a bit busy at the moment, or I'd dig further into this. |
I knew there were two patches but I have no idea which one landed... so far the plan was to only port things that have landed on the LLVM side. If there is still a bug left there, that seems worth reporting? |
@thomcc If the implementation shifts to the left on conversion to the wider format and back to the right on conversion to the narrower format, then the NaN payload will be the same. So I think that shifting is conformant with that rule of the standard. Otherwise hardware would be nonconformant as well. But maybe there's still some rule in the standard that LLVM apfloat violates. |
Hmm, fair enough. Unfortunately, I don't think I'm going to have time to dig further into this for at least another week, but #77368 (comment) sounds good to me. Sorry for any trouble. |
No trouble caused, thanks for looking into this. :) |
Yes your comments were very helpful. I wouldn't have found or thought of looking for that LLVM MR for example. |
I tried this code:
This runs correctly (i.e. doesn't assert) when run with rust nightly, but fails at
assert!(nan2_32.is_nan());
when run with miri on nightly (2020-02-20).It seems like miri just truncated the f64 to get the f32 and therefore converted the nan2 to a floating point zero instead of a floating point NaN
The text was updated successfully, but these errors were encountered: