
Add i32.popcnt and i64.popcnt to winch #6531

Merged (7 commits), Jun 14, 2023

Conversation

@itsrainy (Contributor) commented Jun 6, 2023

No description provided.

@github-actions github-actions bot added the winch Winch issues or pull requests label Jun 6, 2023
github-actions bot commented Jun 6, 2023

Subscribe to Label Action

cc @saulecabrera

This issue or pull request has been labeled: "winch"

Thus the following users have been cc'd because of the following labels:

  • saulecabrera: winch

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


@fitzgen fitzgen requested a review from saulecabrera June 7, 2023 16:38
@saulecabrera (Member) left a comment

Thanks for putting this together! The current changes look good to me. I have one thought/comment though: are you planning on providing a fallback implementation for popcnt if the has_popcnt flag is not enabled, similar to what Cranelift provides? If it's too burdensome to do as part of this PR, a TODO is also totally fine I think, but we might want to consider updating the fuzzing configuration so that it always enables the has_popcnt flag to avoid failures when fuzzing winch. Even though we only enable the fuzzer locally, it's still useful for verifying our changes.

You might also want to add <I32|I64>Popcnt here https://github.com/bytecodealliance/wasmtime/blob/main/fuzz/fuzz_targets/differential.rs#L335

@itsrainy (Contributor Author) commented Jun 9, 2023

are you planning on providing a fallback implementation for popcnt if the has_popcnt flag is not enabled, similar to what Cranelift provides?

Yeah, I can take a crack at that! I'll also add it to the fuzz targets.

@itsrainy (Contributor Author)

@saulecabrera I added two popcnt fallbacks that I'd like help choosing between. popcount_fallback1 is a naive loop that is a bit more compact and easier to read (but ends up emitting more instructions). popcount_fallback2 is my attempt at implementing the fallback that is in cranelift.

@saulecabrera (Member) left a comment

I think that it'd be nice to save a couple of instructions if possible, so I'm leaning towards the second fallback. This is also how SpiderMonkey handles this case https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h#110

I left a couple of extra comments, but feel free to ignore if those are things that you were already thinking about!
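For reference, the bit-twiddling fallback under discussion — the classic SWAR popcount used by SpiderMonkey and Cranelift — can be sketched in plain Rust. This is only an illustration of the algorithm; `popcnt64_fallback` is a made-up name, not a Winch helper, and the real implementation emits the equivalent x64 instructions instead.

```rust
/// Illustrative sketch of the SWAR popcount fallback; the constants match
/// the m1/m2/m4/h01 masks discussed in this thread.
fn popcnt64_fallback(mut x: u64) -> u64 {
    let m1 = 0x5555555555555555u64; // every other bit
    let m2 = 0x3333333333333333u64; // every other pair of bits
    let m4 = 0x0f0f0f0f0f0f0f0fu64; // low four bits of each byte
    let h01 = 0x0101010101010101u64; // one set bit per byte

    x -= (x >> 1) & m1;             // each 2-bit field holds a count 0..=2
    x = (x & m2) + ((x >> 2) & m2); // each 4-bit field holds a count 0..=4
    x = (x + (x >> 4)) & m4;        // each byte holds its own bit count
    x.wrapping_mul(h01) >> 56       // sum of all bytes lands in the top byte
}

fn main() {
    assert_eq!(popcnt64_fallback(u64::MAX), 64);
    assert_eq!(popcnt64_fallback(0b1011), 3);
}
```

The 32-bit variant is identical except the masks are half as wide and the final shift is 24 rather than 56, which is why the two lowerings can share code.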

});

let fives = regs::scratch();
self.load_constant(&0x5555555555555555, fives, size);
Member

If I'm not wrong, in the case of a fallback we'd need an extra temporary register, because the scratch is clobbered by all the calls to load_constant, right?

Member

If that's the case, perhaps we can pass in a mutable reference to the CodeGenContext to MacroAssembler::popcnt and do the dispatching at that level, and if we need to emit the fallback, we can request an extra temporary register via CodeGenContext::any_gpr? Similar to the implementation of clz https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/isa/x64/masm.rs#L343

Contributor Author

scratch is clobbered by all the calls to load_constant right?

Oh this is good to know! I didn't know that. I am still learning the relationships between the different reg types and how their different abstraction layers are used, so if it seems like I'm doing something weird, please let me know :)

I'll take the approach in your second comment!

}
}

fn popcnt_fallback2(&mut self, size: OperandSize, reg: Reg) {
Member

Could we replace the emit calls in this function with the emit functions in the assembler? That has the added benefit that it already handles constant loading based on the operand size.

Contributor Author

I think so. So, for instance, I could replace this:

self.emit(Inst::AluRmiR {
    size: size.into(),
    op: AluRmiROpcode::Sub,
    src1: reg.into(),
    src2: masked1.into(),
    dst: reg.into(),
});

with something like this:

self.sub_rr(reg, masked, size);

right? I noticed some of the instructions don't have corresponding functions on the assembler (shiftr, and), should I add functions for those?

Member

Yeah exactly.

For shift we have shift_ir and shift_rr in the assembler, and for and we have and_rr and and_ir; I think that should cover all the cases for the lowering here. But if it doesn't, feel free to make any additions to the assembler!

@itsrainy itsrainy force-pushed the rainy/winch-popcnt branch from e42626d to 103115d Compare June 13, 2023 21:39
itsrainy and others added 2 commits June 13, 2023 18:08
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
Move popcnt fallback up into the macroassembler.

Share code between 32-bit and 64-bit popcnt

Add Popcnt to winch differential fuzzing
@itsrainy itsrainy force-pushed the rainy/winch-popcnt branch from d65bd49 to 7e301f8 Compare June 13, 2023 22:11
@itsrainy (Contributor Author)

I moved the fallback logic up into the MacroAssembler. I was running into some tricky ownership issues when trying to pass the CodeGenContext into the Fn passed to unop, but I was able to implement the fallback only using the scratch register, so I don't think we actually need any more registers. I'd like to write some tests that actually use the fallback before this gets merged.

@itsrainy itsrainy marked this pull request as ready for review June 13, 2023 22:18
@itsrainy itsrainy requested review from a team as code owners June 13, 2023 22:18
@itsrainy itsrainy requested review from jameysharp and removed request for a team June 13, 2023 22:18
@itsrainy itsrainy changed the title wip - add i32.popcnt and i64.popcnt to winch Add i32.popcnt and i64.popcnt to winch Jun 13, 2023
@jameysharp (Contributor) left a comment

Seems right to me! I agree that tests are a good idea before merging but assuming those go okay then I think this is good.

self.asm.sub(dst.into(), tmp.into(), size);

// x = (x & m2) + ((x >> 2) & m2);
self.asm.mov(tmp.into(), dst.into(), size);
Contributor

Can this and the other mov below be written like this? I'm not sure it makes any difference but it seems a little more clear to me if we can avoid using .into() so much. I'm also curious if there are other methods named perhaps sub_rr or and_ir for the other instructions in this sequence.

Suggested change
self.asm.mov(tmp.into(), dst.into(), size);
self.asm.mov_rr(tmp, dst, size);

Contributor Author

I'm also curious if there are other methods named perhaps sub_rr or and_ir for the other instructions in this sequence.

There are! I'll change those.

@github-actions github-actions bot added the fuzzing Issues related to our fuzzing infrastructure label Jun 14, 2023
@github-actions

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


itsrainy added 2 commits June 14, 2023 11:19
The scratch register was getting clobbered by the calls to `and`,
so this is instead passing in a CodeGenContext to the masm's `popcnt`
and letting it handle its own registers
@itsrainy (Contributor Author)

@saulecabrera I moved the register management bit to the MacroAssembler and added a couple of filetests for the fallback. I manually tested the fallback by forcing that branch in the code (just added an ! to the has_popcnt check), and then running cargo run --features=winch -- run --compiler winch --invoke "popcount" <module> on this module with a bunch of different values:

(module
    (func $popcount (result i64)
      i64.const 0xfffffffff0ffffff
      i64.popcnt
    )
    (export "popcount" (func $popcount))
)

It seems to behave as intended! Let me know if there's anything else that needs cleaning up, otherwise I think this might be good to go!
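As a sanity check on the module above (nothing Winch-specific here), the expected result can be confirmed on the host with Rust's built-in count_ones:

```rust
fn main() {
    // 0xfffffffff0ffffff has fifteen 0xf nibbles, i.e. 15 * 4 = 60 set bits,
    // so both the hardware popcnt path and the fallback should return 60.
    let x: u64 = 0xfffffffff0ffffff;
    assert_eq!(x.count_ones(), 60);
}
```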

@saulecabrera (Member) left a comment

Left one minor comment, but overall this looks great to me, thanks!

Comment on lines 697 to 698
src: Gpr::new(src.into()).unwrap().into(),
dst: Writable::from_reg(Gpr::new(src.into()).unwrap()),
Member

We currently have impl From<Reg> for Gpr and impl From<Reg> for WritableGpr so you could use those here if you wanted to reduce the boilerplate.

self.asm.popcnt(src, size);
context.stack.push(Val::reg(src));
} else {
let tmp = context.any_gpr(self);
Contributor

Would you add a comment that the fallback is based on MacroAssembler::popcnt32 in https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h?

Comment on lines +430 to +446
let (masks, shift_amt) = match size {
OperandSize::S64 => (
[
0x5555555555555555, // m1
0x3333333333333333, // m2
0x0f0f0f0f0f0f0f0f, // m4
0x0101010101010101, // h01
],
56u8,
),
// 32-bit popcount is the same, except the masks are half as
// wide and we shift by 24 at the end rather than 56
OperandSize::S32 => (
[0x55555555i64, 0x33333333i64, 0x0f0f0f0fi64, 0x01010101i64],
24u8,
),
};
Contributor

This is perfectly reasonable as-is but it keeps bothering me that the constant masks are duplicated in this way. One alternative might be:

Suggested change
let (masks, shift_amt) = match size {
OperandSize::S64 => (
[
0x5555555555555555, // m1
0x3333333333333333, // m2
0x0f0f0f0f0f0f0f0f, // m4
0x0101010101010101, // h01
],
56u8,
),
// 32-bit popcount is the same, except the masks are half as
// wide and we shift by 24 at the end rather than 56
OperandSize::S32 => (
[0x55555555i64, 0x33333333i64, 0x0f0f0f0fi64, 0x01010101i64],
24u8,
),
};
let masks = [
0x5555555555555555, // m1
0x3333333333333333, // m2
0x0f0f0f0f0f0f0f0f, // m4
0x0101010101010101, // h01
];
let (mask, shift_amt) = match size {
OperandSize::S64 => (u64::MAX as i64, 56u8),
OperandSize::S32 => (u32::MAX as i64, 24u8),
};

Then use e.g. masks[0] & mask. (Maybe rename mask though.)

Another approach is to generate the constants using bit-twiddling tricks. I don't think this is a good idea but I spent enough time figuring out how it would work that I'm going to write it down anyway. You can divide u64::MAX or u32::MAX by certain constants to get these repeating patterns: specifically, [0x3, 0x5, 0x11, 0xff] to produce [0x55..., 0x33..., 0x0f..., 0x01...] respectively.
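The divisor trick mentioned above is easy to verify: each divisor's bit pattern tiles the word exactly, so dividing the all-ones value yields the corresponding repeating mask with no remainder.

```rust
fn main() {
    // u64::MAX / d produces the repeating mask whose period is d's bit width.
    assert_eq!(u64::MAX / 0x3, 0x5555555555555555); // m1
    assert_eq!(u64::MAX / 0x5, 0x3333333333333333); // m2
    assert_eq!(u64::MAX / 0x11, 0x0f0f0f0f0f0f0f0f); // m4
    assert_eq!(u64::MAX / 0xff, 0x0101010101010101); // h01

    // The same divisors give the half-width masks from u32::MAX.
    assert_eq!(u32::MAX / 0x3, 0x55555555);
}
```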

Contributor Author

I thiiiink I'm inclined to keep it as is, though I don't feel too strongly about that. I initially had the 32-bit and 64-bit code as completely separate branches (this is what spidermonkey and cranelift do) and wasn't sure if I should even combine them in the first place. I kind of like the different constants being explicitly in the code, but I also see that using one to generate the other is tempting. Happy to go either way here

Contributor

Yup, I don't feel strongly about it either, so I'm good with leaving it as is. 👍


// x -= (x >> 1) & m1;
self.asm.shift_ir(1u8, dst, ShiftKind::ShrU, size);
self.asm.and(RegImm::imm(masks[0]).into(), dst.into(), size);
Contributor

I was pretty confused about why you didn't use and_ir, until I dug into the implementation enough to understand that x86 only supports 32-bit immediates for and, so doing it this way lets the underlying assembler decide whether to use load_constant into the scratch register or to emit the immediate operand inline.

I don't know what to suggest but maybe there's a comment that could go here. Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆

Also, it's a little unfortunate that in the 64-bit case, masks[1] gets loaded into the scratch register twice. It would be nice to avoid that, but it would clutter the implementation here and maybe that's not worth doing.

Member

Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆

The scratch register has caused confusion, I agree. This is something that I'd like to improve, so I'm considering adding guards, which will hopefully make its usage traceable and more explicit.

Member

I just haven't gotten to it 😄

@itsrainy (Contributor Author) commented Jun 14, 2023

Yeah I did have the thought about 0x333... getting loaded in twice, so I had briefly replaced the second RegImm::imm(masks[0]).into() with regs::scratch().into(), but that wouldn't work in the 32-bit case and then trying to work around that seemed like added complexity for not a lot of gain

Contributor

I don't think it would be terrible to unconditionally call load_constant here to put the 0x333... mask in the scratch register, and use and_rr explicitly. It's one extra instruction in the 32-bit case, but it's only a move-immediate, and the generated code is shorter due to only emitting the constant once.

@itsrainy itsrainy added this pull request to the merge queue Jun 14, 2023
Merged via the queue into main with commit 7513464 Jun 14, 2023
@itsrainy itsrainy deleted the rainy/winch-popcnt branch June 14, 2023 22:20