AArch64 first argument passing problem #781

Open
honggyukim opened this issue Jun 5, 2019 · 9 comments

@honggyukim
Collaborator

honggyukim commented Jun 5, 2019

As I wrote in #777 and #778, there is a problem with passing the first argument on AArch64.

The results of running the fibonacci example on both x86_64 and aarch64 are as follows.
1. x86_64

$ gcc -pg -g -o t-fib tests/s-fibonacci.c

$ file t-fib
t-fib: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=1af59904aedd31ceed7b7e3fef57738d4138ff56, not stripped

$ uftrace -a t-fib 5
# DURATION     TID     FUNCTION
   1.747 us [ 47054] | __monstartup();
   0.785 us [ 47054] | __cxa_atexit();
            [ 47054] | main(2, 0x7fff7f1bb1a8) {
 200.475 us [ 47054] |   atoi("5") = 5;
            [ 47054] |   fib(5) {
            [ 47054] |     fib(4) {
            [ 47054] |       fib(3) {
   4.075 us [ 47054] |         fib(2) = 1;
   0.153 us [ 47054] |         fib(1) = 1;
   5.551 us [ 47054] |       } = 2; /* fib */
   0.153 us [ 47054] |       fib(2) = 1;
   6.420 us [ 47054] |     } = 3; /* fib */
            [ 47054] |     fib(3) {
   0.173 us [ 47054] |       fib(2) = 1;
   0.148 us [ 47054] |       fib(1) = 1;
   1.184 us [ 47054] |     } = 2; /* fib */
   8.465 us [ 47054] |   } = 5; /* fib */
 218.079 us [ 47054] | } = 0; /* main */

2. aarch64
Unlike x86_64, the arguments are displayed incorrectly.

$ gcc -pg -g -o t-fib.aarch64 tests/s-fibonacci.c

$ file t-fib.aarch64
t-fib.aarch64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=fb4d0d23b882e32d6a477b55eb7626808d773b2b, for GNU/Linux 3.7.0, with debug_info, not stripped, too many notes (256)

$ uftrace -a a.out 5
# DURATION     TID     FUNCTION
   0.542 us [634229] | __monstartup();
   0.375 us [634229] | __cxa_atexit();
            [634229] | main(0xffffafdbf984, 0xffffc0047bb8) {
  82.958 us [634229] |   atoi("5") = 5;
            [634229] |   fib(0x4007a0) {
            [634229] |     fib(0x40072c) {
            [634229] |       fib(0x40072c) {
   0.708 us [634229] |         fib(0x40072c) = 1;
   0.083 us [634229] |         fib(0x40073c) = 1;
   1.166 us [634229] |       } = 2; /* fib */
   0.083 us [634229] |       fib(0x40073c) = 1;
   1.458 us [634229] |     } = 3; /* fib */
            [634229] |     fib(0x40073c) {
   0.084 us [634229] |       fib(0x40072c) = 1;
   0.083 us [634229] |       fib(0x40073c) = 1;
   0.375 us [634229] |     } = 2; /* fib */
   2.083 us [634229] |   } = 5; /* fib */
  86.208 us [634229] | } = 0; /* main */

I compiled s-fibonacci.c both without and with -pg, then disassembled both object files as follows:

$ aarch64-linux-gnu-gcc        -c -o s-fibonacci.o      tests/s-fibonacci.c
$ aarch64-linux-gnu-gcc -pg -g -c -o s-fibonacci.o.pg.g tests/s-fibonacci.c

$ aarch64-linux-objdump -d s-fibonacci.o      > s-fibonacci.o.asm
$ aarch64-linux-objdump -d s-fibonacci.o.pg.g > s-fibonacci.o.pg.g.asm

For easier comparison, I adjusted some offset labels in s-fibonacci.o.asm. The diff between the two dumps is as follows:

 0000000000000000 <fib>:
    0:  a9bd7bfd        stp     x29, x30, [sp,#-48]!
    4:  910003fd        mov     x29, sp
    8:  f9000bf3        str     x19, [sp,#16]
+   c:  aa1e03e1        mov     x1, x30
   10:  b9002fa0        str     w0, [x29,#44]
+  14:  aa0103e0        mov     x0, x1
+  18:  94000000        bl      0 <_mcount>
   1c:  b9402fa0        ldr     w0, [x29,#44]
   20:  7100081f        cmp     w0, #0x2
   24:  5400006c        b.gt    30 <fib+0x30>
   28:  52800020        mov     w0, #0x1                        // #1
   2c:  14000009        b       50 <fib+0x50>
   30:  b9402fa0        ldr     w0, [x29,#44]
   34:  51000400        sub     w0, w0, #0x1
   38:  94000000        bl      0 <fib>
   3c:  2a0003f3        mov     w19, w0
   40:  b9402fa0        ldr     w0, [x29,#44]
   44:  51000800        sub     w0, w0, #0x2
   48:  94000000        bl      0 <fib>
   4c:  0b000260        add     w0, w19, w0
   50:  f9400bf3        ldr     x19, [sp,#16]
   54:  a8c37bfd        ldp     x29, x30, [sp],#48
   58:  d65f03c0        ret

 000000000000005c <main>:
   5c:  a9bd7bfd        stp     x29, x30, [sp,#-48]!
   60:  910003fd        mov     x29, sp
+  64:  aa1e03e2        mov     x2, x30
   68:  b9001fa0        str     w0, [x29,#28]
   6c:  f9000ba1        str     x1, [x29,#16]
+  70:  aa0203e0        mov     x0, x2
+  74:  94000000        bl      0 <_mcount>
   78:  52800100        mov     w0, #0x8                        // #8
   7c:  b9002fa0        str     w0, [x29,#44]
   80:  b9401fa0        ldr     w0, [x29,#28]
   84:  7100041f        cmp     w0, #0x1
   88:  540000cd        b.le    a0 <main+0x44>
   8c:  f9400ba0        ldr     x0, [x29,#16]
   90:  91002000        add     x0, x0, #0x8
   94:  f9400000        ldr     x0, [x0]
   98:  94000000        bl      0 <atoi>
   9c:  b9002fa0        str     w0, [x29,#44]
   a0:  b9402fa0        ldr     w0, [x29,#44]
   a4:  94000000        bl      0 <fib>
   a8:  52800000        mov     w0, #0x0                        // #0
   ac:  a8c37bfd        ldp     x29, x30, [sp],#48
   b0:  d65f03c0        ret

I'm not sure why x0 is overwritten by other registers in both functions.

@honggyukim
Collaborator Author

honggyukim commented Jun 5, 2019

@namhyung Could you please explain the problem and how to avoid it?

@namhyung
Owner

namhyung commented Jun 7, 2019

It's to pass the return address in the parent function to _mcount.

@honggyukim
Collaborator Author

honggyukim commented Jun 8, 2019

It may not be guaranteed, but it seems that x0 is overwritten by the instruction immediately before each _mcount call. If this assumption is correct, we may be able to fix this by replacing that instruction with a nop.

$ aarch64-linux-gnu-gcc -pg -o t-arg.aarch64.pg tests/s-arg.c
$ aarch64-linux-gnu-objdump -d t-arg.aarch64.pg | grep -B1 _mcount

00000000004007e0 <_mcount@plt>:
--
  400984:       aa0203e0        mov     x0, x2
  400988:       97ffff96        bl      4007e0 <_mcount@plt>
--
  4009e0:       aa0103e0        mov     x0, x1
  4009e4:       97ffff7f        bl      4007e0 <_mcount@plt>
--
  400aa4:       aa0803e0        mov     x0, x8
  400aa8:       97ffff4e        bl      4007e0 <_mcount@plt>
--
  400bd4:       aa0203e0        mov     x0, x2
  400bd8:       97ffff02        bl      4007e0 <_mcount@plt>
--
  400cc4:       aa0103e0        mov     x0, x1
  400cc8:       97fffec6        bl      4007e0 <_mcount@plt>
--
  400db8:       aa0203e0        mov     x0, x2
  400dbc:       97fffe89        bl      4007e0 <_mcount@plt>

To do that, we have to find the address of each bl <_mcount@plt> instruction.
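A minimal sketch of how those call sites could be located, assuming the code is available as an array of 32-bit instruction words and the address of _mcount@plt is already known. The function names (is_bl, bl_target, find_bl_to) are hypothetical and are not taken from uftrace. It relies only on the A64 encoding of bl: the top six bits are 100101 and the low 26 bits are a signed word offset from the instruction's own address.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if insn is an A64 `bl` (bits 31..26 == 100101). */
int is_bl(uint32_t insn)
{
    return (insn & 0xfc000000u) == 0x94000000u;
}

/* Branch target of a `bl` located at address pc. */
uint64_t bl_target(uint64_t pc, uint32_t insn)
{
    int64_t imm26 = (int64_t)(insn & 0x03ffffffu);

    if (imm26 & 0x02000000)     /* sign bit of the 26-bit field */
        imm26 -= 0x04000000;
    return pc + (uint64_t)(imm26 * 4);
}

/*
 * Hypothetical helper: scan `count` instructions loaded at address `base`
 * and store the indexes of every `bl` that jumps to `target`.
 * Returns the number of matches (only the first max_out are stored).
 */
size_t find_bl_to(const uint32_t *code, size_t count, uint64_t base,
                  uint64_t target, size_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < count; i++) {
        uint64_t pc = base + 4 * (uint64_t)i;

        if (is_bl(code[i]) && bl_target(pc, code[i]) == target) {
            if (found < max_out)
                out[found] = i;
            found++;
        }
    }
    return found;
}
```

For the first pair in the dump above, the word 97ffff96 at 0x400988 decodes to a target of 0x4007e0, which is the _mcount@plt entry.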

@honggyukim
Collaborator Author

The encoding of the mov instruction that overwrites x0 is as follows:

1010 1010 000* **** 0000 0011 1110 0000
   a    a [01]    *    0    3    e    0

It will be interpreted as mov x0, <reg>.

The bit mask for this is ffe0ffff. If the masked instruction equals aa0003e0, the instruction has the form mov x0, <reg>.

Reference: https://static.docs.arm.com/ddi0596/a/DDI_0596_ARM_a64_instruction_set_architecture.pdf
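The mask-and-compare check described above can be sketched in C as follows. The macro and function names are hypothetical. The only values used are the mask and pattern from this comment plus the A64 NOP encoding (d503201f). Whether patching in a NOP is safe for a given binary is a separate question that this sketch does not answer.

```c
#include <stdint.h>

#define MOV_X0_REG_MASK  0xffe0ffffu  /* ignore the source register field */
#define MOV_X0_REG_BITS  0xaa0003e0u  /* mov x0, <reg> with the field zeroed */
#define AARCH64_NOP      0xd503201fu

/* Nonzero if insn has the form `mov x0, <reg>`. */
int is_mov_x0_reg(uint32_t insn)
{
    return (insn & MOV_X0_REG_MASK) == MOV_X0_REG_BITS;
}

/* Source register number (bits 20..16) of a `mov x0, <reg>`. */
unsigned mov_src_reg(uint32_t insn)
{
    return (insn >> 16) & 0x1fu;
}

/*
 * Hypothetical patch step: replace *insn with a NOP if it is
 * `mov x0, <reg>`. Returns 1 if patched, 0 if left untouched.
 */
int nop_out_mov_x0(uint32_t *insn)
{
    if (!is_mov_x0_reg(*insn))
        return 0;
    *insn = AARCH64_NOP;
    return 1;
}
```

For example, aa0203e0 (mov x0, x2) matches with source register 2, while aa0003f8 (mov x24, x0) does not match because its destination register is not x0.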

@honggyukim
Collaborator Author

honggyukim commented Jun 8, 2019

This is the gcc code that generates the _mcount call for aarch64.

$ cat gcc/config/aarch64/aarch64.h
    ...
#define MCOUNT_NAME "_mcount"

#define NO_PROFILE_COUNTERS 1

/* Emit rtl for profiling.  Output assembler code to FILE
   to call "_mcount" for profiling a function entry.  */
#define PROFILE_HOOK(LABEL)                                             \
  {                                                                     \
    rtx fun, lr;                                                        \
    lr = get_hard_reg_initial_val (Pmode, LR_REGNUM);                   \
    fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME);                      \
    emit_library_call (fun, LCT_NORMAL, VOIDmode, lr, Pmode);           \
  }

/* All the work done in PROFILE_HOOK, but still required.  */
#define FUNCTION_PROFILER(STREAM, LABELNO) do { } while (0)
    ...
$ cat gcc/function.c
    ...
/*
  `expand_function_start' is called at the beginning of a function,
   before the function body is parsed, and `expand_function_end' is
   called after parsing the body.
*/
    ...
void expand_function_start (tree subr)
{
      ...
  if (crtl->profile)
    {
#ifdef PROFILE_HOOK
      PROFILE_HOOK (current_function_funcdef_no);
#endif
    }
      ...
}
    ...
$ cat gcc/cfgexpand.c
    ...
unsigned int
pass_expand::execute (function *fun)
{
      ...
  /* Set up parameters and prepare for return, for the function.  */
  expand_function_start (current_function_decl);
      ...
}
    ...
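As a rough source-level picture of what PROFILE_HOOK amounts to, the sketch below calls a stand-in function with the caller's return address at the top of fib. This is an illustration only: fake_mcount, call_count, and last_frompc are hypothetical names, the real hook is emitted as RTL rather than C, and fib is modeled on the disassembly earlier in this issue. __builtin_return_address is a GCC/Clang builtin.

```c
#include <stddef.h>

unsigned long call_count;   /* number of instrumented entries seen */
void *last_frompc;          /* most recent return address passed in */

/* Hypothetical stand-in for _mcount: it receives the parent's return
 * address as its first argument, which is what ends up in x0 on AArch64. */
void fake_mcount(void *frompc)
{
    last_frompc = frompc;
    call_count++;
}

int fib(int n)
{
    /* Roughly what -pg adds at function entry on aarch64. */
    fake_mcount(__builtin_return_address(0));

    if (n <= 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}
```

Because the return address is passed as an ordinary first argument, the compiler has to place it in x0 before the call, and that clobbers the function's own first argument unless it has been saved elsewhere.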

@honggyukim
Collaborator Author

honggyukim commented Jun 9, 2019

There is another case where the mov x0, <reg> instruction is not placed immediately before the _mcount call.

00000000009308b0 <node::Init(...)>:
  9308b0:       a9b37bfd        stp     x29, x30, [sp,#-208]!
  9308b4:       910003fd        mov     x29, sp
  9308b8:       a90153f3        stp     x19, x20, [sp,#16]
  9308bc:       a9025bf5        stp     x21, x22, [sp,#32]
  9308c0:       a90363f7        stp     x23, x24, [sp,#48]
  9308c4:       a9046bf9        stp     x25, x26, [sp,#64]
  9308c8:       a90573fb        stp     x27, x28, [sp,#80]
  9308cc:       aa0003f8        mov     x24, x0
  9308d0:       aa1e03e0        mov     x0, x30
  9308d4:       aa0103f9        mov     x25, x1
  9308d8:       b000bf56        adrp    x22, 2119000 <v8::platform::tracing::g_category_groups+0x510>
  9308dc:       d000bf55        adrp    x21, 211a000 <node::Environment::set_debug_categories(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::available_category+0x18>
  9308e0:       97fed818        bl      8e6940 <_mcount@plt>
  9308e4:       9111e2b3        add     x19, x21, #0x478
    ...

In this case, the original x0 is kept in x24, but I cannot find anywhere nearby where it is restored after the _mcount call returns.
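A heuristic sketch for this layout, assuming the instructions are available as an array of 32-bit words: search a small window before the bl for mov x0, x30, and separately recognize mov <reg>, x0 to see where the original argument was saved. The function names and the window size are hypothetical. As the -O0 and xpaclri examples elsewhere in this thread suggest, compilers can emit other sequences that this would miss.

```c
#include <stddef.h>
#include <stdint.h>

#define MOV_X0_LR 0xaa1e03e0u   /* mov x0, x30 */

/*
 * Look at up to `window` instructions before code[bl_idx] for
 * `mov x0, x30`. Returns its index, or -1 if none is found.
 */
long find_mov_x0_lr(const uint32_t *code, size_t bl_idx, size_t window)
{
    for (size_t back = 1; back <= window && back <= bl_idx; back++) {
        if (code[bl_idx - back] == MOV_X0_LR)
            return (long)(bl_idx - back);
    }
    return -1;
}

/*
 * If insn has the form `mov <reg>, x0`, return the destination
 * register number; otherwise return -1.
 */
int saved_x0_reg(uint32_t insn)
{
    if ((insn & 0xffffffe0u) == 0xaa0003e0u)
        return (int)(insn & 0x1fu);
    return -1;
}
```

Applied to the six words from 9308cc to 9308e0 above, the bl is at index 5, mov x0, x30 is found at index 1, and the word at index 0 is reported as saving x0 into register 24.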

@ParkHanbum
Contributor

ParkHanbum commented Aug 26, 2023

Maybe we can rely on callee-saved registers instead of the parameter registers,
but this is also very restrictive.

https://developer.arm.com/documentation/102374/0101/Procedure-Call-Standard

RISC-V and arm64 results:
https://godbolt.org/z/WE41oW6eY

The instructions are generated like this:

        mov     w20, w1
        mov     w21, w2
        mov     w22, w3
        mov     w19, w0
        mov     x0, x30
        bl      _mcount

But it depends on the optimization level. At optimization level 0,
the instructions are generated like this:

        str     w0, [sp, 28]
        str     w1, [sp, 24]
        str     w2, [sp, 20]
        str     w3, [sp, 16]
        mov     x30, x4
        hint    7 // xpaclri
        mov     x0, x30
        bl      _mcount

So I think this requires a tricky tweak.
Supporting it may be much easier with dynamic tracing.

@honggyukim
Collaborator Author

@ParkHanbum Your explanation doesn’t look clear to me but yes, there is no problem accessing the first argument in dynamic tracing.

@namhyung
Owner

GNU gprof requires two arguments (frompc and selfpc) for the mcount implementation. From the documentation:

Since this is a very machine-dependant operation, mcount itself is typically a short assembly-language stub routine that extracts the required information, and then calls __mcount_internal (a normal C function) with two arguments - frompc and selfpc. __mcount_internal is responsible for maintaining the in-memory call graph, which records frompc, selfpc, and the number of times each of these call arcs was transversed.

How those arguments are passed depends on the CPU/psABI. IIRC x86 can pass them on the stack, but AArch64 uses LR for selfpc and X0 for frompc. So compilers would generate code that writes x0 before calling mcount anyway. Maybe we can find that instruction and overwrite it with a NOP, but it might be tricky if the compiler optimizes the code in a creative way. So I think it'd be hard to fix as long as the code conforms to the GNU ABI.
