Add i32.popcnt and i64.popcnt to winch #6531
Conversation
Thanks for putting this together! The current changes look good to me. I have one thought/comment though: are you planning on providing a fallback implementation for popcnt if the has_popcnt flag is not enabled, similar to what Cranelift provides? If it's too burdensome to do as part of this PR, a TODO is also totally fine I think, but we might want to consider updating the fuzzing configuration so that it always enables the has_popcnt flag to avoid failures when fuzzing Winch. Even though we only enable the fuzzer locally, it's still useful for verifying our changes.
Also, you might want to add <I32|I64>Popcnt here: https://github.com/bytecodealliance/wasmtime/blob/main/fuzz/fuzz_targets/differential.rs#L335
Yeah I can take a crack at that! I'll also add it to the fuzz targets.
@saulecabrera I added two …
I think that it'd be nice to save a couple of instructions if possible, so I'm leaning towards the second fallback. This is also how SpiderMonkey handles this case https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h#110
I left a couple of extra comments, but feel free to ignore if those are things that you were already thinking about!
winch/codegen/src/isa/x64/asm.rs (outdated)
let fives = regs::scratch();
self.load_constant(&0x5555555555555555, fives, size);
If I'm not wrong, in the case of a fallback we'd need an extra temporary register, because the scratch is clobbered by all the calls to load_constant, right?
If that's the case, perhaps we can pass in a mutable reference to the CodeGenContext to MacroAssembler::popcnt and do the dispatching at that level; if we need to emit the fallback, we can request an extra temporary register via CodeGenContext::any_gpr? Similar to the implementation of clz: https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/isa/x64/masm.rs#L343
scratch is clobbered by all the calls to load_constant right?
Oh this is good to know! I didn't know that. I am still learning the relationships between the different reg types and how their different abstraction layers are used, so if it seems like I'm doing something weird, please let me know :)
I'll take the approach in your second comment!
winch/codegen/src/isa/x64/asm.rs (outdated)
fn popcnt_fallback2(&mut self, size: OperandSize, reg: Reg) {
Could we replace the emit calls in this function with the emit functions in the assembler? That has the added benefit that it already handles constant loading based on the operand size.
I think so. So, for instance I could replace this:
self.emit(Inst::AluRmiR {
size: size.into(),
op: AluRmiROpcode::Sub,
src1: reg.into(),
src2: masked1.into(),
dst: reg.into(),
});
with something like this:
self.sub_rr(reg, masked, size);
right? I noticed some of the instructions don't have corresponding functions on the assembler (shiftr, and), should I add functions for those?
Yeah, exactly. For shift we have shift_ir and shift_rr in the assembler, and for and we have and_rr and and_ir; I think that should cover all the cases for the lowering here. But if it doesn't, feel free to make any additions to the assembler!
Force-pushed from e42626d to 103115d.
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
Move popcnt fallback up into the macroassembler. Share code between 32-bit and 64-bit popcnt. Add Popcnt to winch differential fuzzing.
Force-pushed from d65bd49 to 7e301f8.
I moved the fallback logic up into the MacroAssembler. I was running into some tricky ownership issues when trying to pass the …
Seems right to me! I agree that tests are a good idea before merging but assuming those go okay then I think this is good.
winch/codegen/src/isa/x64/masm.rs (outdated)
self.asm.sub(dst.into(), tmp.into(), size);

// x = (x & m2) + ((x >> 2) & m2);
self.asm.mov(tmp.into(), dst.into(), size);
Can this and the other mov below be written like this? I'm not sure it makes any difference, but it seems a little clearer to me if we can avoid using .into() so much. I'm also curious if there are other methods, named perhaps sub_rr or and_ir, for the other instructions in this sequence.
Suggested change:
- self.asm.mov(tmp.into(), dst.into(), size);
+ self.asm.mov_rr(tmp, dst, size);
I'm also curious if there are other methods named perhaps sub_rr or and_ir for the other instructions in this sequence.
There are! I'll change those.
The scratch register was getting clobbered by the calls to `and`, so this is instead passing in a CodeGenContext to the masm's `popcnt` and letting it handle its own registers
@saulecabrera I moved the register management bit to the MacroAssembler and added a couple of filetests for the fallback. I manually tested the fallback by forcing that branch in the code (just added an …).
It seems to behave as intended! Let me know if there's anything else that needs cleaning up; otherwise I think this might be good to go!
Left one minor comment, but overall this looks great to me, thanks!
winch/codegen/src/isa/x64/asm.rs (outdated)
src: Gpr::new(src.into()).unwrap().into(),
dst: Writable::from_reg(Gpr::new(src.into()).unwrap()),
We currently have impl From<Reg> for Gpr and impl From<Reg> for WritableGpr, so you could use those here if you wanted to reduce the boilerplate.
self.asm.popcnt(src, size);
context.stack.push(Val::reg(src));
} else {
let tmp = context.any_gpr(self);
Would you add a comment that the fallback is based on MacroAssembler::popcnt32 in https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h?
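(For readers following along: the fallback being discussed is the classic bit-twiddling popcount. A minimal plain-Rust sketch of the same sequence, shown purely as an illustration and not part of this PR, is given here; the function name popcnt64_fallback is invented for the sketch. The masks and the final shift by 56 correspond to the 64-bit constants quoted just below, and the 32-bit variant only narrows the masks and shifts by 24 instead.)

fn popcnt64_fallback(mut x: u64) -> u64 {
    let m1 = 0x5555555555555555u64;
    let m2 = 0x3333333333333333u64;
    let m4 = 0x0f0f0f0f0f0f0f0fu64;
    let h01 = 0x0101010101010101u64;
    // x -= (x >> 1) & m1;  each 2-bit field now holds its own popcount
    x -= (x >> 1) & m1;
    // x = (x & m2) + ((x >> 2) & m2);  each 4-bit field holds counts up to 4
    x = (x & m2) + ((x >> 2) & m2);
    // x = (x + (x >> 4)) & m4;  each byte holds counts up to 8
    x = (x + (x >> 4)) & m4;
    // Multiplying by h01 sums all byte counts into the top byte,
    // which the shift by 56 extracts (24 in the 32-bit version).
    x.wrapping_mul(h01) >> 56
}

fn main() {
    for v in [0u64, 1, 0xdead_beef, u64::MAX] {
        assert_eq!(popcnt64_fallback(v), u64::from(v.count_ones()));
    }
}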
let (masks, shift_amt) = match size {
    OperandSize::S64 => (
        [
            0x5555555555555555, // m1
            0x3333333333333333, // m2
            0x0f0f0f0f0f0f0f0f, // m4
            0x0101010101010101, // h01
        ],
        56u8,
    ),
    // 32-bit popcount is the same, except the masks are half as
    // wide and we shift by 24 at the end rather than 56
    OperandSize::S32 => (
        [0x55555555i64, 0x33333333i64, 0x0f0f0f0fi64, 0x01010101i64],
        24u8,
    ),
};
This is perfectly reasonable as-is but it keeps bothering me that the constant masks are duplicated in this way. One alternative might be:
let masks = [
    0x5555555555555555, // m1
    0x3333333333333333, // m2
    0x0f0f0f0f0f0f0f0f, // m4
    0x0101010101010101, // h01
];
let (mask, shift_amt) = match size {
    OperandSize::S64 => (u64::MAX as i64, 56u8),
    OperandSize::S32 => (u32::MAX as i64, 24u8),
};
Then use e.g. masks[0] & mask. (Maybe rename mask, though.)
Another approach is to generate the constants using bit-twiddling tricks. I don't think this is a good idea, but I spent enough time figuring out how it would work that I'm going to write it down anyway. You can divide u64::MAX or u32::MAX by certain constants to get these repeating patterns: specifically, divide by [0x3, 0x5, 0x11, 0xff] to produce [0x55..., 0x33..., 0x0f..., 0x01...] respectively.
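(A quick plain-Rust check of that divisor trick, included only to illustrate the claim above rather than as proposed Winch code:)

fn main() {
    // Divisors paired with the masks they should produce.
    let pairs: [(u64, u64); 4] = [
        (0x3, 0x5555555555555555),  // m1
        (0x5, 0x3333333333333333),  // m2
        (0x11, 0x0f0f0f0f0f0f0f0f), // m4
        (0xff, 0x0101010101010101), // h01
    ];
    for (divisor, mask) in pairs {
        // All-ones divided by the divisor yields the repeating mask...
        assert_eq!(u64::MAX / divisor, mask);
        // ...and the same identity holds for the 32-bit masks.
        assert_eq!(u32::MAX / (divisor as u32), mask as u32);
    }
}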
I thiiiink I'm inclined to keep it as is, though I don't feel too strongly about that. I initially had the 32-bit and 64-bit code as completely separate branches (this is what spidermonkey and cranelift do) and wasn't sure if I should even combine them in the first place. I kind of like the different constants being explicitly in the code, but I also see that using one to generate the other is tempting. Happy to go either way here
Yup, I don't feel strongly about it either, so I'm good with leaving it as is. 👍
// x -= (x >> 1) & m1;
self.asm.shift_ir(1u8, dst, ShiftKind::ShrU, size);
self.asm.and(RegImm::imm(masks[0]).into(), dst.into(), size);
I was pretty confused about why you didn't use and_ir, until I dug into the implementation enough to understand that x86 only supports 32-bit immediates for and, so doing it this way lets the underlying assembler decide whether to use load_constant into the scratch register or to emit the immediate operand inline.
I don't know what to suggest, but maybe there's a comment that could go here. Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆
Also, it's a little unfortunate that in the 64-bit case, masks[1] gets loaded into the scratch register twice. It would be nice to avoid that, but it would clutter the implementation here and maybe that's not worth doing.
Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆
The scratch register has caused confusion, I agree. This is something that I'd like to improve, so I'm considering adding guards, which will hopefully make its usage traceable and more explicit.
I just haven't gotten to it 😄
Yeah, I did have the thought about 0x333... getting loaded in twice, so I had briefly replaced the second RegImm::imm(masks[0]).into() with regs::scratch().into(), but that wouldn't work in the 32-bit case, and then trying to work around that seemed like added complexity for not a lot of gain.
I don't think it would be terrible to unconditionally call load_constant here to put the 0x333... mask in the scratch register and use and_rr explicitly. It's one extra instruction in the 32-bit case, but it's only a move-immediate, and the generated code is shorter due to only emitting the constant once.