Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miri floating point NaN conversion issue #69532

Closed
jodysankey opened this issue Feb 28, 2020 · 41 comments · Fixed by #77368
Closed

Miri floating point NaN conversion issue #69532

jodysankey opened this issue Feb 28, 2020 · 41 comments · Fixed by #77368
Labels
A-miri Area: The miri tool C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@jodysankey
Copy link

I tried this code:

fn main() {
    let nan1 = f64::from_bits(0x7FF0_0001_0000_0001u64);
    let nan2 = f64::from_bits(0x7FF0_0000_0000_0001u64);

    assert!(nan1.is_nan());
    assert!(nan2.is_nan());

    let nan1_32 = nan1 as f32;
    let nan2_32 = nan2 as f32;

    assert!(nan1_32.is_nan());
    assert!(nan2_32.is_nan());

This runs correctly (i.e. doesn't assert) when run with rust nightly, but fails at assert!(nan2_32.is_nan()); when run with miri on nightly (2020-02-20).

It seems like miri just truncated the f64 to get the f32 and therefore converted the nan2 to a floating point zero instead of a floating point NaN

@jodysankey jodysankey added the C-bug Category: This is a bug. label Feb 28, 2020
@jonas-schievink jonas-schievink added A-miri Area: The miri tool T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 28, 2020
@RalfJung
Copy link
Member

RalfJung commented Feb 28, 2020

(Miri-the-tool has its own bugtracker at https://github.com/rust-lang/miri/issues, where I see things earlier than when they are burried here.)

Cc @eddyb, this could also be an apfloat issue as that's what Miri uses to implement casts.
I don't think we are doing any truncation here. Would be interesting to see the bit patterns involved.

@RalfJung
Copy link
Member

to_bits for nan2_32 returns 0x7f800000. So, this doesn't look like just a truncated nan2 to me, it looks like apfloat did something but that something is not correct (or at least, not nan-preserving -- I am not sure if that is a guarantee).

@eddyb
Copy link
Member

eddyb commented Feb 28, 2020

cc @hanna-kruppe

@jodysankey
Copy link
Author

jodysankey commented Feb 29, 2020

Sorry, I should have said truncated the fraction there. The 0x7f8 is the floating point exponent being set to all ones, the 0x00000 (plus the empty bottom 3 bits in the 8) is a truncation of the original 0x1 fraction in the f64. By removing all the non-zero bits in the fraction it converted the NaN to an infinity.

For reference to_bits for nan1_32 returns 0x7f800008 where the 0x00008 (plus the empty bottom 3 bits in the 8) is a truncation of the original 0x0000100000001 fraction from 52 to 23 bits

Note that the normal Rust compiler seems to always set the MSB in the fraction of a NaN during its conversion to avoid this problem. There nan1_32 is 0x7fc00008 and nan2_32 is 0x7fc00000

@RalfJung
Copy link
Member

There nan1_32 is 0x7fc00008 and nan2_32 is 0x7fc00000

Assuming that f64-to-f32 on NaN has fully defined semantics, we should probably have a test asserting these exact bit patterns for Miri.

But what you wrote makes it sound even more like the bug is in apfloat, and not in the Miri code that feeds data to/from apfloat. FWIW, the float casting code is here.

@jodysankey
Copy link
Author

Yes, it does sound to me like an apfloat issue rather than something unique in Miri. Reading that casting method I can't see anything that is special casing the fraction truncation for NaNs, only for an 80bit extended precision type that has different NaN semantics.

Sorry, I've not dealt with the bug workflow on Rust before - is it possible to move this bug to apfloat? Would you like me to create a separate bug on apfloat?

@RalfJung
Copy link
Member

RalfJung commented Mar 2, 2020

apfloat does not have a separate issue tracker, no need to move anything.

However, if I understood @eddyb correctly then apfloat is a direct Rust port of LLVM's apfloat -- so likely they have the same bug, and a fix would have to happen upstream as well.

@hanna-kruppe
Copy link
Contributor

Yes, rustc_apfloat is a port of LLVM's APFloat, but LLVM folds this cast to 0x7FF0_0000 (also in older versions, back to before @eddyb started the Rust port). Note that the MSB of the significand is set, so this is indeed a NaN. It seems LLVM does not have the same bug. Something strange is up.

@RalfJung
Copy link
Member

RalfJung commented Mar 2, 2020

LLVM folds this cast to 0x7FF0_0000

Hm, that's still not the same as the 0x7fc00000 that we see at run-time though?

@RalfJung
Copy link
Member

RalfJung commented Mar 2, 2020

I think the LLVM C++ code doing the same cast is here.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Mar 2, 2020

LLVM folds this cast to 0x7FF0_0000

Hm, that's still not the same as the 0x7fc00000 that we see at run-time though?

Sure, but that's intentional. Optimizations need to preserve whether something is a NaN but (by default) the payload bits, quiet/signaling distinction and sign bit are not generally preserved or guaranteed to match between compile time evaluation and run time evaluation.

@RalfJung
Copy link
Member

RalfJung commented Mar 2, 2020

So the semantics would be something like "if the result is a NaN, non-deterministically pick any legal NaN representation"?

@hanna-kruppe
Copy link
Contributor

That sounds about right. Kind of off-topic here, though -- if you want to discuss further, we have #55131.

@eddyb
Copy link
Member

eddyb commented Mar 2, 2020

The category should still be NaN, but it looks like we don't enforce the significand is non-0:

Category::Infinity => {
// FIXME(eddyb) Maybe we should guarantee an invariant instead?
significand = 0;
Self::MAX_EXP + 1
}
Category::NaN => Self::MAX_EXP + 1,

But neither does LLVM's APFLoat?


@hanna-kruppe Wait, I just looked at your link and LLVM prints float 0x7FF0000000000000 - that's a double bitpattern, not a float, lol.

But with a bitcast to integer, it matches rustc_apfloat: https://godbolt.org/z/HHgVYL.
(if you hover over the decimal value, godbolt helpfully shows it as hex)

@hanna-kruppe
Copy link
Contributor

Note that converting floats to double bit patterns happens when printing textual IR: https://github.com/llvm/llvm-project/blob/40da6be2bd393b41907588afad4253f45966fc25/llvm/lib/IR/AsmWriter.cpp#L1331-L1343

I'm reasonably sure the constant folding produces a proper float (semIEEEsingle). But perhaps the non-zero bits in its significand are in bit positions > 32? Then the bitcast code would silently drop them.

@eddyb
Copy link
Member

eddyb commented Mar 3, 2020

Just remembered I wanted to look at assembly or machine code to see what the LLVM APFloat lowers to and show that it's buggy in LLVM even outside of constant-folded bitcasts.

@eddyb
Copy link
Member

eddyb commented Mar 3, 2020

Also isn't a double 0x7FF0000000000000 +Infinity, showing the fptrunc had already truncated the significand?

@hanna-kruppe
Copy link
Contributor

Just remembered I wanted to look at assembly or machine code to see what the LLVM APFloat lowers to and show that it's buggy in LLVM even outside of constant-folded bitcasts.

Indeed. And I just realized that while the high half of 0x7FF0000000000000 is a f32 NaN, the whole bit string as an f64 is in fact a NaN and setting just one bit more (e.g. 0x7FF8000000000000) does make llc (and presumably bitcast folding) produce 0x7FC00000 (an f32 NaN). So I guess that I interpreted float <64b hex> incorrectly and the fptrunc folding is at fault after all.

@RalfJung
Copy link
Member

RalfJung commented Mar 5, 2020

So... what does this mean? Is it an LLVM bug, and do we have enough information to report it upstream?

@eddyb
Copy link
Member

eddyb commented Mar 5, 2020

Is it an LLVM bug, and do we have enough information to report it upstream?

Yes, either APFloat truncation needs to result in a non-0 significand (e.g. significand becoming 0 when losing lower bits could be "rounded up" by setting it to 1), or extracting the IEEE representation from APFloat should handle the significand == 0 situation by setting some bit.

cc @nikic @nagisa (I don't know who does upstream LLVM bug reports nowadays most often)

@RalfJung
Copy link
Member

@thomcc actually in the other discussion I misremembered what this bug is about. This is not just "surprising NaN payload", indeed the result here is not a NaN when it should be. So that is quite clearly a bug in LLVM's APfloat, or in the Rust port of it (but we seem confident the problem is upstream).

It would be great if we could report this against LLVM, but I have no idea how to reproduce the problem with LLVM. Do we need to write a C++ program using APfloat, or can we somehow make this happen just with the right LLVM IR?
This was previously posted as an example, does this show an incorrect compilation? These bit patterns mean nothing to me.^^

@thomcc
Copy link
Member

thomcc commented Sep 14, 2020

This was previously posted as an example, does this show an incorrect compilation? These bit patterns mean nothing to me.^^

Wow! That's extremely wrong! It turns a NaN into positive infinity!

Poking it gives a bit more insight into the issue: https://godbolt.org/z/c59E4E shows that all bits in the NaN payload below the 29th bit are truncated here.

Mercifully, I can repro this in rustc_apfloat, which is easier for me to experiment with than LLVM's (okay, I also remember what a headache it was trying to make changes to LLVM in the past, and it's now not something I'm interested in doing in my spare time)

#[test]
fn convert_atypical_nan() {
    let mut loses_info = false;
    let test = Double::from_bits(0x7FF0000000000001);
    assert_eq!(test.category(), Category::NaN);
    let converted: Single = test.convert(&mut loses_info).value;
    // This passes:
    assert_eq!(converted.category(), Category::NaN);
    // This fails w/ `'convert_atypical_nan' panicked at 'inf'`
    let f = converted.to_f32();
    assert!(f.is_nan(), "{}", f);
}

That is, it reports converted has Category::NaN, but then to_f32() turns it into inf.

Doing some further digging shows points me to

sig::shift_right(&mut r.sig, &mut 0, -shift as usize)
, where we call shift_right with a number that comes from the difference between PRECISION (bits of significand) of the source and target, e.g. 53 - 22, or 29 (which all-but-confirms that this is the issue). (The branch where there's the sig::shift_left is wrong too, FWIW).

It's also pretty wrong — IEEE wants you to truncate the payload here, not shift it out. e.g. I think what you want to do is something like (this code intentionally doesn't compile, and is assuming T is the target type and that things are beautifully simple enough that there's just a significand function which returns something I can bitand...):

let src_nan_mask = !(Self::QNAN.significand() | Self::SNAN.significand());
let dst_nan_mask = !(T::QNAN.significand() | T::SNAN.significand());

let unmasked_payload = source.significand() & src_nan_mask;
let dest_mask = T::largest() & dst_nan_mask;
let new_payload = unmasked_payload & dest_mask;

let mut nan = if source.is_snan() {
    T::snan(Some(new_payload))
} else {
    T::qnan(Some(new_payload))
};
nan.sign = source.sign;
return nan;

That said, this is possibly hopelessly wrong? I don't thing so but I don't have a copy of IEEE754 around to verify.

...
...

Okay I tried to make it compile, and if you drop this beast in before the shifts (e.g. line 1488 as of commit b5f55b7) the tests pass (Sorry, I foolishly did this in a github download of rustc_apfloat, since my rustc is in the middle of other things, and so I can't easily just give you a patch)

// ⚠️ !HACKY & PROBABLY WRONG FIX! ⚠️
if r.category() == Category::NaN {
    // this may or may not totally break the idea of arbitrary precision.
    fn significand<T: Semantics>(s: IeeeFloat<T>) -> Limb {
        // mask away int bit which is probably not relevant a lot of the time idk.
        s.sig[0] & !(1 << (T::PRECISION - 1))
    }
    let src_nan_mask = !(
        significand(Self::qnan(None)) |
        significand(Self::snan(None))
    );
    let dst_nan_mask = !(
        significand(<IeeeFloat<T>>::qnan(None)) |
        significand(<IeeeFloat<T>>::snan(None))
    );

    let unmasked_payload = significand(r) & src_nan_mask;
    let dest_mask = significand(<IeeeFloat<T>>::largest()) & dst_nan_mask;
    let new_payload = unmasked_payload & dest_mask;
    if new_payload != unmasked_payload {
        *loses_info = true;
    };

    let mut nan = if r.is_signaling() {
        <IeeeFloat<T>>::snan(Some(new_payload))
    } else {
        <IeeeFloat<T>>::qnan(Some(new_payload))
    };
    nan.sign = r.sign;
    // FIXME: THERE'S A BUNCH OF NAN HANDLING BELOW WE SHOULD HAVE
    // HERE I'M JUST LAZY. ALSO NOTE THAT SOMEONE WHO UNDERSTANDS
    // X87 NAN SHOULD TAKE A LOOK AT THIS CODE PROBABLY.
    return Status::OK.and(nan);
}

That said it could still have other issues. And the fix could be extremely wrong for other reasons, it's the result of not-very-long spent debugging.

Someone who understands the code better should audit it for this issue (@eddyb? It says you wrote rustc_apfloat). Ditto for LLVM's version which I can hardly follow since my C++ is so rusty. That said, LLVM is a popular enough project that someone hopefully could take it from here?

E.g. Turning a NaN into an infinity is extremely bad — obviously so. Masking away the bottom bits like this is an obvious bug (nothing anywhere does that), and so even if they don't care about preserving NaNs specifically I suspect they'd want to fix this.


Edit: I wrote this without carefully reading the rest of this issue. I think you uncovered some of this? Maybe? I don't know. Maybe this helped, sorry if not.

@RalfJung
Copy link
Member

Wow! That's extremely wrong! It turns a NaN into positive infinity!

Yeah that sounds wrong. :D I'm afraid I cannot comment on the internals of apfloat that you are getting into for the rest of your post. That stuff is way beyond me.^^

I understand you won't work on fixing this in LLVM, but could you make this a bugreport for them, in their bugzilla? I feel a bit awkward making a bugreport that basically says "look I don't know what I am talking about but I have it on good authority that what LLVM is doing is wrong". Would be better if someone who knows what they are talking about does that I think. ;) Chances are, they'll have someone who can fix it.

@thomcc
Copy link
Member

thomcc commented Sep 14, 2020

https://bugs.llvm.org/ says

New user self-registration is disabled due to spam. For an account please email bugs-admin@lists.llvm.org with your e-mail address and full name.

I'll send them an email, but...

@RalfJung
Copy link
Member

Argh. :/ I think this worked fine for me though.

If they don't respond in a few days, we can also use me as a proxy.^^

@thomcc
Copy link
Member

thomcc commented Sep 14, 2020

That sounds fine. In the meantime it maybe be good for someone more familiar with the rustc_apfloat code to take a look at what I wrote anyway...

@thomcc
Copy link
Member

thomcc commented Sep 14, 2020

There was already a bug, but I added a comment for it: https://bugs.llvm.org/show_bug.cgi?id=43907.

@RalfJung
Copy link
Member

There's a fix on the LLVM side now, somewhere in https://bugs.llvm.org/show_bug.cgi?id=43907 -- there are multiple reviews, we should figure out which one actually landed.

Anyone up for porting this back to the Rust apfloat?

@RalfJung
Copy link
Member

I guess https://github.com/llvm/llvm-project/commits/master/llvm/lib/Support/APFloat.cpp would be the page to check for what happened in apfloat upstream.

@est31
Copy link
Member

est31 commented Sep 30, 2020

Done: #77368

@thomcc
Copy link
Member

thomcc commented Sep 30, 2020

The fix on the LLVM side is the minimal possible fix to avoid the linked miscompilation here: https://godbolt.org/resetlayout/Zf7HMh.

It doesn't fix this bug in the general case.

@thomcc
Copy link
Member

thomcc commented Sep 30, 2020

https://reviews.llvm.org/D88238 is a more correct fix. But note that it's not just the change to APFloat that makes it correct (in fact, that change alone is arguably wrong), but at usage sites changing things to be (for example):

if (IsSNAN) {
    APInt Payload = ID.APFloatVal.bitcastToAPInt();
    ID.APFloatVal = APFloat::getSNaN(ID.APFloatVal.getSemantics(),
                                     ID.APFloatVal.isNegative(), &Payload);
}

I think we should probably do this inside our convert_r function, otherwise it would be incorrect too. This is unfortunate as it causes more divergence to upstream APFloat

@est31
Copy link
Member

est31 commented Sep 30, 2020

@thomcc are you referring to this segment of your comment above?

It's also pretty wrong — IEEE wants you to truncate the payload here, not shift it out.

Can you quote the part of the IEEE 754 standard which mandates this? Or alternatively, come up with an example like the one at the top where there is a mismatch between what happens at runtime vs during miri evaluation?

Apparently hardware implementations shift out payloads as well:

#![feature(test)]
extern crate test;
use test::black_box;

fn main() {
    let nan1 = black_box(f64::from_bits(0x7FF0_0201_0000_0001u64));
    let nan2 = black_box(f64::from_bits(0x7FF0_0000_0000_0001u64));

    assert!(nan1.is_nan());
    assert!(nan2.is_nan());

    let nan1_32 = black_box(nan1 as f32);
    let nan2_32 = black_box(nan2 as f32);

    println!("{:0x} -> {:0x} {}", nan1.to_bits(), nan1_32.to_bits(), nan1_32.is_nan());
    println!("{:0x} -> {:0x} {}", nan2.to_bits(), nan2_32.to_bits(), nan1_32.is_nan());
}

prints:

7ff0020100000001 -> 7fc01008 true
7ff0000000000001 -> 7fc00000 true

@est31
Copy link
Member

est31 commented Sep 30, 2020

Larger example that compares runtime and const eval (as of nightly master):

#![feature(test)]
#![feature(const_fn_transmute)]
extern crate test;
use test::black_box;

const fn make_nans() -> (f64, f64, f32, f32) {
    let nan1: f64 = unsafe { std::mem::transmute(0x7FF0_0201_0000_0001u64) };
    let nan2: f64 = unsafe { std::mem::transmute(0x7FF0_0000_0000_0001u64) };

    let nan1_32 = nan1 as f32;
    let nan2_32 = nan2 as f32;

    (nan1, nan2, nan1_32, nan2_32)
}

static NANS: (f64, f64, f32, f32) = make_nans();

fn main() {
    let (nan1, nan2, nan1_32, nan2_32) = NANS;

    assert!(nan1.is_nan());
    assert!(nan2.is_nan());

    println!("runtime eval:");
    {
        let nan1: f64 = black_box(f64::from_bits(0x7FF0_0201_0000_0001u64));
        let nan2: f64 = black_box(f64::from_bits(0x7FF0_0000_0000_0001u64));

        let nan1_32 = black_box(nan1 as f32);
        let nan2_32 = black_box(nan2 as f32);
            
        let nan1_32_again = black_box(nan1_32 as f64) as f32;
        let nan2_32_again = black_box(nan2_32 as f64) as f32;
        println!("{:0x} -> {:0x} {} -> {:0x} {}", nan1.to_bits(), nan1_32.to_bits(), nan1_32.is_nan(), nan1_32_again.to_bits(), nan1_32_again.is_nan());
        println!("{:0x} -> {:0x} {} -> {:0x} {}", nan2.to_bits(), nan2_32.to_bits(), nan2_32.is_nan(), nan2_32_again.to_bits(), nan2_32_again.is_nan());
    }
    println!("const eval:");
    println!("{:0x} -> {:0x} {}", nan1.to_bits(), nan1_32.to_bits(), nan2_32.is_nan());
    println!("{:0x} -> {:0x} {}", nan2.to_bits(), nan2_32.to_bits(), nan2_32.is_nan());
}

prints:

runtime eval:
7ff0020100000001 -> 7fc01008 true -> 7fc01008 true
7ff0000000000001 -> 7fc00000 true -> 7fc00000 true
const eval:
7ff0020100000001 -> 7f801008 false
7ff0000000000001 -> 7f800000 false

I guess it works on the hardware because there is an internal register set or something that says the number is NaN. Edit: apparently on the hardware it always unconditionally sets the highest bit of the mantissa when a NaN truncation is performed to work around the issue. Edit2: which is nothing other than turning the sNaN into a qNaN.

@thomcc
Copy link
Member

thomcc commented Sep 30, 2020

Can you quote the part of the IEEE 754 standard which mandates this

IEEE754-2019 6.2.3 (NaN Propagation) states that

Conversion of a quiet NaN from a narrower format to a wider format in the same radix, and then back to the same narrower format, should not change the quiet NaN payload in any way except to make it canonical.

This seems impossible if the implementation shifts the payload out. But I guess this only applies to qNaN, although in my testing earlier it seemed to happen on sNaN on my machines.

Anyway, I'm a bit busy at the moment, or I'd dig further into this.

@RalfJung
Copy link
Member

It doesn't fix this bug in the general case.

I knew there were two patches but I have no idea which one landed... so far the plan was to only port things that have landed on the LLVM side. If there is still a bug left there, that seems worth reporting?

@est31
Copy link
Member

est31 commented Sep 30, 2020

@thomcc If the implementation shifts to the left on conversion to the wider format and back to the right on conversion to the narrower format, then the NaN payload will be the same. So I think that shifting is conformant with that rule of the standard. Otherwise hardware would be nonconformant as well. But maybe there's still some rule in the standard that LLVM apfloat violates.

@thomcc
Copy link
Member

thomcc commented Sep 30, 2020

Hmm, fair enough. Unfortunately, I don't think I'm going to have time to dig further into this for at least another week, but #77368 (comment) sounds good to me.

Sorry for any trouble.

@RalfJung
Copy link
Member

RalfJung commented Oct 1, 2020

No trouble caused, thanks for looking into this. :)

@est31
Copy link
Member

est31 commented Oct 1, 2020

Yes your comments were very helpful. I wouldn't have found or thought of looking for that LLVM MR for example.

@RalfJung
Copy link
Member

RalfJung commented Oct 4, 2020

This issue will be fixed by #77368 but @thomcc noted that there are some other interesting test-cases that would not have been fixed by a more naive approach... would be great if we could get those testcases into our test suite as well :)

@bors bors closed this as completed in 7f5008c Oct 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-miri Area: The miri tool C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants