Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which patterns on union fields should be considered safe? #87520

Open
LeSeulArtichaut opened this issue Jul 27, 2021 · 8 comments
Open

Which patterns on union fields should be considered safe? #87520

LeSeulArtichaut opened this issue Jul 27, 2021 · 8 comments
Assignees
Labels
T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@LeSeulArtichaut
Copy link
Contributor

LeSeulArtichaut commented Jul 27, 2021

Moved from #85263 as the PR has been merged.

Summary of safety of pattern matching against Rust unions

Written by @Smittyvb, copied from #85263 (comment)

The unsafety checker is being to rewritten operate on the THIR instead of the MIR. As a part of that, I was implementing the unsafety rules for unions, and encountered some weird edge cases with the way the MIR unsafety checker handles unions. In general, writing to a union is safe but reading is unsafe. There are some cases where writing to a union is unsafe (such as when that might cause a Drop call) that are not being implemented in this PR.

Behavior specified by RFC 1444

Unsafe code may pattern match on union fields, using the same syntax as a
struct, without the requirement to mention every field of the union in a match
or use ..:

fn f(u: MyUnion) {
    unsafe {
        match u {
            MyUnion { f1: 10 } => { println!("ten"); }
            MyUnion { f2 } => { println!("{}", f2); }
        }
    }
}

Matching a specific value from a union field makes a refutable pattern; naming
a union field without matching a specific value makes an irrefutable pattern.
Both require unsafe code.

Pattern matching may match a union as a field of a larger structure. In
particular, when using a Rust union to implement a C tagged union via FFI, this
allows matching on the tag and the corresponding field simultaneously:

#[repr(u32)]
enum Tag { I, F }

#[repr(C)]
union U {
    i: i32,
    f: f32,
}

#[repr(C)]
struct Value {
    tag: Tag,
    u: U,
}

fn is_zero(v: Value) -> bool {
    unsafe {
        match v {
            Value { tag: I, u: U { i: 0 } } => true,
            Value { tag: F, u: U { f: 0.0 } } => true,
            _ => false,
        }
    }
}

Note that a pattern match on a union field that has a smaller size than the
entire union must not make any assumptions about the value of the union's
memory outside that field. For example, if a union contains a u8 and a
u32, matching on the u8 may not perform a u32-sized comparison over the
entire union.

Actual behavior

The MIR unsafety checker doesn't implement that behavior exactly. Due to the way that the MIR is constructed and optimized, some destructuring patterns against unions that that the RFC specifies to be unsafe are allowed.

The behavior of the MIR is mostly "irrefutable pattern matching against unions without any bindings after desugaring or-patterns is safe", with some extra weird behavior when niches are involved.

Here are some examples of what the MIR considers safe and unsafe. Here is the prelude for all of these examples:

union Foo { bar: i8, zst: (), pizza: Pizza, oneval: OneVal, twoval: TwoVal, khar: char }

#[derive(Copy, Clone)]
struct Pizza { topping: Option<PizzaTopping> }

#[derive(Copy, Clone)]
enum PizzaTopping { Cheese, Pineapple }

#[derive(Copy, Clone)]
#[repr(u8)]
enum OneVal { One = 1 }

#[derive(Copy, Clone)]
#[repr(u8)]
pub enum TwoVal {
    One = 1,
    Two = 2,
}

let mut foo = Foo { bar: 5 };

Patterns considered safe

match (Foo { bar: 42 }) {
    Foo { oneval: OneVal::One } => {
        // always run
    },
}
match foo {
    Foo { bar: _ | _ } => {},
}
match u {
    Foo { pizza: Pizza { .. } } => {},
};
match u {
    Foo { pizza: Pizza { topping: _ } } => {},
};
match foo {
    Foo { zst: () } => {},
}
let Foo { bar: _ } = foo;
match Some(foo) {
    Some(Foo { bar: _ }) => 3,
    None => 4,
};

Patterns considered unsafe

All of these require an unsafe block to compile.

match (Foo { bar: 42 }) {
    Foo { twoval: TwoVal::One | TwoVal::Two } => {
        // always run
    },
}
match foo {
    Foo {
        pizza: Pizza {
            topping: Some(PizzaTopping::Cheese) | Some(PizzaTopping::Pineapple) | None
        }
    } => {},
}
match foo {
    Foo { bar: _a } => {},
}
let Foo { bar: inner } = foo;
let (Foo { bar } | Foo { bar }) = foo;
match foo.khar {
    '\0'..='\u{D7FF}' | '\u{E000}'..='\u{10FFFF}' => ()
};
match x.b {
    '\0'..='\u{10FFFF}' => 1,
};

cc @Smittyvb @RalfJung @nikomatsakis

@LeSeulArtichaut LeSeulArtichaut added I-nominated T-lang Relevant to the language team, which will review and decide on the PR/issue. labels Jul 27, 2021
@petrochenkov
Copy link
Contributor

The MIR unsafety checker doesn't implement that behavior exactly.

In the initial HIR-based implementation any field read was considered unsafe as described in the RFC, whether it's done through a dot access or pattern destructuring.
So the regression happened together with migration to the MIR unsafety checker.

@petrochenkov
Copy link
Contributor

The Foo { oneval: OneVal::One } example is strikingly similar to UB-based optimizations a la "we know that overflow is impossible, so we'll delete your code".

Here by accessing oneval we assert that it's indeed containing a valid value (and this assertion is used by the compiler to generate // always run), this is an unsafe action so the compiler shouldn't perform it on our behalf.

@RalfJung
Copy link
Member

RalfJung commented Jul 28, 2021

The Foo { oneval: OneVal::One } example is strikingly similar to UB-based optimizations a la "we know that overflow is impossible, so we'll delete your code".
Here by accessing oneval we assert that it's indeed containing a valid value (and this assertion is used by the compiler to generate // always run), this is an unsafe action so the compiler shouldn't perform it on our behalf.

OneVal has size 0. So comparing something against OneVal::One involves loading 0 bytes and comparing them; this will always succeed.

So, indeed it always contains a valid value, and that is trivially correct -- no assumption/assertion needed.

So, I don't see any parallel with UB-based optimizations here.

@syvb
Copy link
Contributor

syvb commented Jul 28, 2021

@RalfJung OneVal has a #[repr(u8)], which forces OneVal::One to be represented in memory as a single byte with a value of 1. Transmuting any other value to a OneVal would be UB. Without that attribute it would be a ZST.

@RalfJung
Copy link
Member

Oh, I missed the repr. (Again...)

So in a sense this is assuming that the union field satisfies the validity invariant without even looking at it... which does seem very dubious, yes.

@joshtriplett
Copy link
Member

We discussed this in today's @rust-lang/lang meeting.

Our initial reaction is that several of the examples currently considered safe should not be. (Some of them might potentially be in the future based on safe-transmute infrastructure, but they shouldn't be until we have that infrastructure. And some of them shouldn't ever be.)

It seems OK to match a ZST without an unsafe block. But matching a non-ZST implies that the code inside the match arm will only run if the value matches, and if the compiler doesn't enforce that semantic then that seems inordinately confusing and error-prone. For a non-union, it's UB to put an invalid value into an enum, but for a union field, it's perfectly valid to write a value to one field that would be invalid for the type of another, as long as you never read a field that doesn't have a valid value for its type.

I've started a Zulip thread to discuss the desired semantics further.

@nikomatsakis
Copy link
Contributor

Removing nomination as we are not doing much here at the moment. We do need to return to this topic!

@pnkfelix
Copy link
Member

pnkfelix commented Oct 30, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants