Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investiage -Z mir-opt-level more. #287

Closed
vext01 opened this issue Mar 5, 2021 · 8 comments
Closed

Investiage -Z mir-opt-level more. #287

vext01 opened this issue Mar 5, 2021 · 8 comments

Comments

@vext01
Copy link
Contributor

vext01 commented Mar 5, 2021

We currently don't optimise the external workspace, as hardware tracing requires that LLVM doesn't re-order our blocks.

But that's not to say that we can't optimise the MIR before it is lowered to SIR. And this pretty much amounts to passing -Z mir-opt-level=X, where X is some value, e.g. 3.

I quickly tried passing -Z mir-opt-level=3 to see if the speedup would be dramatic. I fully expected breakage due to unimplemented codegen paths, but actually it wasn't that bad.

I had to implement comparison with constants, which I did quickly and inefficiently:

--- a/internal_ws/ykcompile/src/arch/x86_64/mod.rs
+++ b/internal_ws/ykcompile/src/arch/x86_64/mod.rs
@@ -864,6 +865,42 @@ impl TraceCompiler {
                 }
                 _ => unreachable!(),
             },
+            Location::Const{val, ..} => {
+                let val = val.i64_cast();
+                // CMP can't work on imm64 operands and the immediate operands it does support must
+                // appear on the right-hand side (which would invert our comparison condition). To
+                // work around this, we borrow RCX.
+                dynasm!(self.asm
+                    ; push rcx
+                    ; mov rcx, QWORD val
+                );
+                match ty.size() {
+                    8 => {
+                        dynasm!(self.asm
+                            ; cmp rcx, Rq(*TEMP_REG)
+                        );
+                    },
+                    4 => {
+                        dynasm!(self.asm
+                            ; cmp ecx, Rd(*TEMP_REG)
+                        );
+                    },
+                    2 => {
+                        dynasm!(self.asm
+                            ; cmp ax, Rw(*TEMP_REG)
+                        );
+                    },
+                    1 => {
+                        dynasm!(self.asm
+                            ; cmp al, Rb(*TEMP_REG)
+                        );
+                    },
+                    _ => unreachable!(),
+                }
+                dynasm!(self.asm
+                    ; pop rcx
+                );
+            }
             _ => todo!(),
         }
         dynasm!(self.asm

And then we don't lower constant structs (to get that right, I'd recommend addressing #139). This is only needed for the simple_multithreaded_interpreter test, which uses a constant ArcInner. I just commented the test.

Then the results for running the test suite (10 runs) look like:

single test thread before

1: cargo xtask test
            Mean        Std.Dev.    Min         Median      Max
real        3.114       0.046       3.032       3.115       3.226
user        15.699      1.312       13.055      15.525      17.881
sys         6.504       0.279       6.046       6.530       7.000

single test thread after

1: cargo xtask test
            Mean        Std.Dev.    Min         Median      Max
real        3.727       0.030       3.684       3.731       3.774
user        2.911       0.040       2.852       2.920       2.972
sys         0.719       0.031       0.655       0.722       0.764

many test threads before

1: cargo xtask test
            Mean        Std.Dev.    Min         Median      Max
real        3.072       0.035       3.030       3.063       3.159
user        15.313      0.485       14.307      15.278      16.120
sys         6.196       0.315       5.898       6.119       7.045

many test threads after

1: cargo xtask test
            Mean        Std.Dev.    Min         Median      Max
real        3.098       0.038       3.041       3.089       3.164
user        15.004      0.655       14.240      14.920      16.349
sys         6.649       0.247       6.398       6.570       7.244

This indicates very little performance change. We put this down to the code we trace in tests being small and probably not benefiting much from MIR (and thus SIR) optimisations.

We agreed to come back to this when we have a larger interpreter to test with.

@bjorn3
Copy link
Collaborator

bjorn3 commented Mar 5, 2021

Did you also enable -Zmir-opt-level=3 for building the sysroot? Crate metadata contains the already optimized MIR. It will not be further optimized in downstream crates.

@vext01
Copy link
Contributor Author

vext01 commented Mar 5, 2021

The sysroot is built as-per-usual (with optimisations), so I guess that will be either 2 or 3? Either way, we can't trace the std lib.

@bjorn3
Copy link
Collaborator

bjorn3 commented Mar 5, 2021

-Zmir-opt-level=1 is the default. It doesn't depend on the optimization level. rust-lang/rust#82736 will introduce a level between 1 and 2 that will be the default in release mode though.

Either way, we can't trace the std lib.

Yes, but if any #[inline] or generic method from the std lib is called, it is codegened in the current crate and as such traced. Without -Zmir-opt-level=3 when building the sysroot, it will codegen a version that never had more mir optimizations applied than the default (-Zmir-opt-level=1).

@vext01
Copy link
Contributor Author

vext01 commented Mar 5, 2021

Repeating the experiment with opt-level 2 or 3 for the compiler, I often get this crash:

munmap_chunk(): invalid pointer

I'm guessing that's our fault.

@bjorn3
Copy link
Collaborator

bjorn3 commented Mar 5, 2021

What is the backtrace?

@vext01
Copy link
Contributor Author

vext01 commented Mar 5, 2021

There's no rust backtrace, as the process crashes in a dirty way.

Getting a backtrace in gdb, the top frame is a ?? frame, so probably jitcode...

@vext01
Copy link
Contributor Author

vext01 commented Mar 5, 2021

Wait, I'm talking rubbish:

#0  0x00007ffff7b317bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7b1c535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff7b73508 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff7b79c1a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff7b7b42c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00005555557ad1b1 in ?? ()
#6  0x00005555557c1ba8 in ?? ()
#7  0x00005555557ccfb0 in ?? ()
#8  0x0000555555b30f4a in ?? ()
#9  0x00007ffff7e4cfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007ffff7bf34cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

so it's libc cutting us short. That explains why RUST_BACKTRACE doesn't work.

@vext01
Copy link
Contributor Author

vext01 commented Jun 24, 2021

Closing, as this is for the old yk design.

@vext01 vext01 closed this as completed Jun 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants