
Optimize enums by niche-filling even when the smaller variants also have data. #46213

Closed
eddyb opened this issue Nov 23, 2017 · 35 comments · Fixed by #94075
Labels
A-codegen — Area: Code generation
A-layout — Area: Memory layout of types
C-enhancement — Category: An issue proposing an enhancement or a PR with one.
T-compiler — Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@eddyb
Member

eddyb commented Nov 23, 2017

The current implementation only handles enums of roughly the form:

enum E {
    A(ALeft..., Niche, ARight...),
    B,
    C
}

This can be seen as a special-case, where B and C occupy 0 bytes, of:

enum E {
    A(ALeft..., Niche, ARight...),
    B(B...),
    C(C...)
}

As long as B and C can fit before or after A's Niche, we can still apply the optimization.
Also see rust-lang/rfcs#1230 (comment) for the initial description.

@kennytm kennytm added the C-enhancement label Nov 23, 2017

@est31
Member

est31 commented Nov 23, 2017

what is "niche" in this case?

@Gankra
Contributor

Gankra commented Nov 23, 2017

A niche is a location in the type where some bit patterns aren't valid.

For instance struct Foo(u32, bool, u32) has a niche where the bool is, because only 0 and 1 are valid bools. Option<Foo> can therefore use Foo.1 == 2 to represent None.
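A minimal demonstration of this, using the struct from above (sizes assume the usual `u32` alignment of 4):

```rust
use std::mem::size_of;

// The bool in the middle occupies a whole byte but only uses bit
// patterns 0 and 1, leaving 254 invalid patterns -- a niche.
struct Foo(u32, bool, u32);

fn main() {
    // u32 (4) + bool (1) + padding (3) + u32 (4)
    assert_eq!(size_of::<Foo>(), 12);
    // Option<Foo> reuses an invalid bool pattern (e.g. 2) to encode
    // None, so no extra tag byte is needed.
    assert_eq!(size_of::<Option<Foo>>(), size_of::<Foo>());
}
```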

@est31
Member

est31 commented Nov 23, 2017

@gankro thanks, that makes sense.

@eddyb
Member Author

eddyb commented Nov 23, 2017

We probably need better terminology, "invalid" values and "invalid reuse" optimization?

@Gankra
Contributor

Gankra commented Nov 23, 2017

Here is a test case we currently fail to optimize, but this issue would fix:

Result<Result<u32, u32>, u32>

The compiler sees:

enum Result<Result<u32, u32>, u32>  {
  Ok({ tag: 0u32 | 1u32, payload: u32 }),
  Err(u32),
}

which should lower to:

{ tag: 0u32 | 1u32 | 2u32, payload: u32 }

because the Err payload fits after the niche (Ok.tag).
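This can be checked with `size_of`; the outer size is printed rather than asserted, since it depends on whether the compiler performs this optimization:

```rust
use std::mem::size_of;

fn main() {
    // The inner Result packs a u32 tag next to a u32 payload: 8 bytes.
    assert_eq!(size_of::<Result<u32, u32>>(), 8);
    // Without niche-filling, the outer enum needs its own tag (12 bytes);
    // with it, the Err payload reuses the slot next to the inner tag,
    // and the whole thing fits in 8 bytes.
    println!("{}", size_of::<Result<Result<u32, u32>, u32>>());
}
```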

@leonardo-m

Before merging this PR I think we should see benchmarks of both memory saved and run-time saved for some real program, like the rust compiler and Servo.

@leonardo-m

I have yet to see similar benchmarks for #45225.

@Gankra
Contributor

Gankra commented Nov 23, 2017

Some slightly more complex cases from webrender that would also benefit from this:

// Use the Opacity.0.tag niche
pub enum FilterOp {
    Blur(f32),
    Brightness(f32),
    Contrast(f32),
    Grayscale(f32),
    HueRotate(f32),
    Invert(f32),
    Opacity(PropertyBinding<f32>, f32),
    Saturate(f32),
    Sepia(f32),
}

pub enum PropertyBinding<T> {
    Value(T),
    Binding(PropertyBindingKey<T>),
}

pub struct PropertyBindingKey<T> {
    pub id: PropertyBindingId,
    _phantom: PhantomData<T>,
}

pub struct PropertyBindingId {
    namespace: IdNamespace,
    uid: u32,
}

pub struct IdNamespace(pub u32);
// use the RoundedRect.1.mode.tag niche
pub enum LocalClip {
    Rect(LayoutRect),
    RoundedRect(LayoutRect, ComplexClipRegion),
}

#[repr(C)]
pub struct ComplexClipRegion {
    pub rect: LayoutRect,
    pub radii: BorderRadius,
    pub mode: ClipMode,
}

#[repr(C)]
pub struct BorderRadius {
    pub top_left: LayoutSize,
    pub top_right: LayoutSize,
    pub bottom_left: LayoutSize,
    pub bottom_right: LayoutSize,
}

#[repr(C)]
pub enum ClipMode {
    Clip,
    ClipOut,
}

#[repr(C)]
struct LayoutRect(f32, f32, f32, f32);

#[repr(C)]
struct LayoutSize(f32, f32);

@abonander
Contributor

abonander commented Dec 17, 2017

A simpler case that I think fits this optimization (or a specialization of it as a starting point):

// my use-case, a single value is most common
enum OccupiedSmallVec {
     Single(Foo),
     Multiple(Vec<Foo>),
}

If Foo is 2 usizes or smaller, then OccupiedSmallVec should be the same size as Vec, as if it were implemented via this union:

#![feature(untagged_unions)]
union OccupiedSmallVec {
    // where `single.0 == 0`
    single: (usize, Foo),
    multiple: Vec<Foo>,
}

However, it currently requires a separate tag and padding to align the pointers, wasting basically another usize.
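A sketch of the size comparison, using `[usize; 2]` as a hypothetical stand-in for `Foo` since its definition isn't given:

```rust
use std::mem::size_of;

type Foo = [usize; 2]; // stand-in for the unspecified `Foo`: "2 usizes or smaller"

#[allow(dead_code)]
enum OccupiedSmallVec {
    Single(Foo),
    Multiple(Vec<Foo>),
}

fn main() {
    // Vec is always (ptr, capacity, length): three words.
    assert_eq!(size_of::<Vec<Foo>>(), 3 * size_of::<usize>());
    // Without the optimization this is four words: a separate tag plus
    // pointer-alignment padding. With it, the tag could live in Vec's
    // non-null pointer niche, shrinking the enum to three words.
    println!("{}", size_of::<OccupiedSmallVec>());
}
```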

@eddyb
Member Author

eddyb commented Dec 17, 2017

@abonander Not really, that's the entire optimization, assuming Vec<Foo> starts with a non-null pointer, and you're packing Foo in the leftover space after that pointer.

@eddyb
Member Author

eddyb commented Mar 20, 2018

From IRC (cc @nox): we could extend this to always compute the size of the tagged layout (which would likely be larger than A itself) and allow using more space around A's fields up to that size, to fit other variants, if we generally want to avoid overlapping variant data where possible.

Then Result<&T, E> and similar would have the layout of (Option<&T>, E).
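The building block here is the guaranteed null-pointer niche of references; a quick check (the Result size is printed rather than asserted, since it depends on the layout the compiler picks):

```rust
use std::mem::size_of;

fn main() {
    // Guaranteed: None is encoded as the null pointer pattern, so the
    // Option adds no size.
    assert_eq!(size_of::<Option<&u32>>(), size_of::<&u32>());
    // With the proposed extension, Result<&u32, u32> would be laid out
    // like (Option<&u32>, u32): the tag lives in the pointer's niche
    // and the Err payload is stored after it, not overlapping it.
    println!("{}", size_of::<Result<&u32, u32>>());
}
```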

@nox
Contributor

nox commented Mar 20, 2018

@killercup
Member

IIUC, this optimization could also work to make Cow<str> and Cow<[T]> 3 words, right? (cf. discussion on reddit)

@nox
Contributor

nox commented Jun 4, 2018

Yes.
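For reference, the component sizes that make a 3-word `Cow` plausible; the `Cow` size itself is printed, since it depends on whether this optimization has landed:

```rust
use std::borrow::Cow;
use std::mem::size_of;

fn main() {
    let word = size_of::<usize>();
    assert_eq!(size_of::<&str>(), 2 * word);   // fat pointer: (ptr, len)
    assert_eq!(size_of::<String>(), 3 * word); // (ptr, capacity, len)
    // Four words with a separate tag; three if the tag can live in
    // String's non-null pointer niche.
    println!("{}", size_of::<Cow<str>>());
}
```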

@newpavlov
Contributor

Another example which can be optimized is:

pub enum Bar {
    A = 1,
    B = 2,
    C = 3,
}

pub enum Baz {
    D = 4,
    E = 5,
    F = 6,
}

pub enum Foo {
    Bar(Bar),
    Baz(Baz),
}

Currently Foo takes 2 bytes, while it can be trivially represented as 1.
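Both payload enums are one byte, and their discriminant ranges (1..=3 and 4..=6) are disjoint, so the combined enum could in principle be one byte too; a sketch to check the sizes (Foo's current size is printed):

```rust
use std::mem::size_of;

#[allow(dead_code)]
pub enum Bar { A = 1, B = 2, C = 3 }
#[allow(dead_code)]
pub enum Baz { D = 4, E = 5, F = 6 }
#[allow(dead_code)]
pub enum Foo { Bar(Bar), Baz(Baz) }

fn main() {
    assert_eq!(size_of::<Bar>(), 1);
    assert_eq!(size_of::<Baz>(), 1);
    // 2 today (a separate tag byte); could be 1, since the payload
    // values themselves already identify the variant.
    println!("{}", size_of::<Foo>());
}
```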

@nox
Contributor

nox commented Aug 24, 2018 via email

@nagisa
Member

nagisa commented Aug 24, 2018

Another example which can be optimized is:

This cannot happen, unless we make Bar and Baz types with a size of less than a byte, which is not something we can do (for a variety of reasons). Consider that it is possible to take an internal reference to data stored within some variant of Foo (i.e. either &Bar or &Baz), and this reference, when dereferenced, should return the original data. The only way to achieve that would be to use some of the "unused" bits to store the active variant of the enum, but that would then require bit twiddling when dereferencing any &Bar or &Baz, even if those weren't stored within the enum.

@RalfJung
Member

@nagisa Bar and Baz have disjoint sets of valid values, though. So it could be done.

It doesn't fit the "niche" scheme, though.

@newpavlov
Contributor

newpavlov commented Aug 24, 2018

@nox I think it can be handled as a special case: if all enum variants are field-less enums, check that their tags do not overlap and fit into a minimal repr. Yes, making it more generic will be significantly more involved, but even with the field-less constraint this optimization would help a lot for FFI and parsing cases (currently you often have to keep a huge number of consts, use unwieldy enums, or sacrifice performance).

@nagisa Ah, I should've explicitly mentioned that I want it foremost for field-less enums (see the linked issue before my first message). More generic handling would be nice, but less important.

@nox
Contributor

nox commented Aug 26, 2018

It makes mem::discriminant way more complex. Currently, computing the discriminant of any enum value is at most a comparison and a subtraction. Such optimisations would make the computation more complex; for example, what would if let Foo::Bar(_) = foo { ... } compile to?

@alercah
Contributor

alercah commented Sep 11, 2018

Optimizing for discriminant extraction seems, to me, far less valuable on average than optimizing for space. Optimizing for the ease of writing an intrinsic for it seems even less valuable. In the provided example, if let Foo::Bar(_) = foo { .. } is if *std::mem::transmute::<_, &u8>(&foo) <= 3 { ... }, which doesn't even have a subtraction.

@alercah
Contributor

alercah commented Sep 11, 2018

I just now realized that std::mem::discriminant is, in general, not required to correspond to the specified discriminant. The Discriminant type does not directly expose the actual value of the discriminant, and like all intrinsics, the underlying intrinsic is unstable. Only the Debug and Hash instances expose it indirectly, and I don't think those should be considered guaranteed to be stable.

In fact, I believe that the only way you can observe the discriminant of a variant of a #[repr(Rust)] enum is via numeric cast, and you can't do that for an enum with fields. So I would propose that, at least for now, explicit discriminants only be permitted on enums with other representations.

@upsuper
Contributor

upsuper commented Sep 3, 2022

IIUC, this optimization could also work to make Cow<str> and Cow<[T]> 3 words, right? (cf. discussion on reddit)

So this could potentially regress accessing the data of Cow<str> / Cow<[T]>.

Currently, for code like

pub fn read_len(x: &std::borrow::Cow<str>) -> usize {
    x.len()
}

the compiler generates optimized code:

example::read_len:
        mov     rax, qword ptr [rdi]
        mov     rax, qword ptr [rdi + 8*rax + 16]
        ret

i.e. without check of the tag.

But if we optimize Cow to take advantage of the non-nullness of ptr, the ptr and len fields would no longer align across the two variants, and it would need to always do an extra tag check for accessing data.

@SimonSapin
Contributor

SimonSapin commented Sep 3, 2022

The code above does check the tag since the length field is at different offsets in the two variants, though the 8*rax multiplication makes the offset computation branchless.

With a more compact Cow<str> representation the length will be at the same offset so the code accessing it will be even simpler. However accessing the pointer field will regress and require a branch:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=1b23594473567cc429a98932b9f8592d

playground::read_len:
	mov	rax, qword ptr [rdi + 16]
	ret

playground::read_ptr:
	mov	rax, qword ptr [rdi]
	test	rax, rax
	jne	.LBB1_2
	mov	rax, qword ptr [rdi + 8]

.LBB1_2:
	ret
pub union Cow {
    borrowed: (usize, &'static str),
    owned: std::mem::ManuallyDrop<String>,
    owned_repr: (*mut u8, usize, usize), // Hack for demo outside of std
}

impl std::ops::Deref for Cow {
    type Target = str;
    
    #[inline]
    fn deref(&self) -> &str {
        unsafe {
            if self.owned_repr.0.is_null() {
                self.borrowed.1
            } else {
                &**self.owned
            }
        }
    }
}

pub fn read_len(x: &Cow) -> usize {
    x.len()
}

pub fn read_ptr(x: &Cow) -> *const u8 {
    x.as_bytes().as_ptr()
}

@oxalica
Contributor

oxalica commented Sep 3, 2022

However accessing the pointer field will regress and require a branch:

playground::read_ptr:
	mov	rax, qword ptr [rdi]
	test	rax, rax
	jne	.LBB1_2
	mov	rax, qword ptr [rdi + 8]

.LBB1_2:
	ret

It's possible to eliminate that branch via CMOVE. Not sure why LLVM refuses to do that currently.
But it's definitely not blocked by niche-filling.

EDIT: The LLVM IR does optimize the code into a select. Thus I'd assume LLVM thinks a CMOV here would perform no better than JNE + MOV, considering it always reads the memory.
