Syscallbuf on AArch64

This document presents the logical steps that led to the initial implementation of syscallbuf on AArch64. The code below reflects the implementation of syscallbuf as designed and merged as of 2022/07/01 and may not be in sync with the current implementation.

Need for runtime stub

On AArch64, all instructions are the same length. This means that we can replace the syscall instruction svc 0 with any other single instruction we want. However, since every instruction is 32 bits long, there’s no way to encode an arbitrary 64-bit jump target address in the instruction, so it’s impossible for us to jump directly to the syscall hook in librrpreload.so. Thankfully, this issue affects x86_64 as well, due to the lack of a 64-bit immediate in the branch instruction (as well as the stack switching code, which will be discussed later), and the same trick used there works for AArch64. We simply need to allocate a stub at runtime close enough to the syscall site, and then we can have as many instructions and immediates as we want (within reason) in the stub to encode the jump to the syscall hook.
Register requirement

On AArch64, the glibc syscall wrapper assumes that all the visible state of the process other than x0 (used for the syscall return value) remains unchanged. This includes the low 128 bits of the vector registers (it doesn’t include the higher bits of the SVE registers, but we can’t rely on SVE being present yet, and even if it is, the registers are not required to have higher bits, e.g. on neoverse-n2). This effectively means that we can’t use any registers before we save some of them to memory.

In principle, the processor state flags (nzcv) also need to be saved. However, not saving them seems to be OK so far. If they really need to be saved, we might need to use some branches to detect the flags up front, since we need to use a branch before we can save register values (see below).
Stack requirement

In order to support things like Go, which does aggressive tricks with the stack, we want to avoid using the native stack in the syscallbuf code. This severely limits how we can save the registers to memory. On AArch64, all instructions that write to memory must use one of the 31 general purpose registers or the stack pointer as the base pointer to compute the address. (PC relative addressing also exists but is limited to
- address computation, which stores the result to a register, overwriting its previous value
- branch, which is virtually of no use for writing to memory
- prefetch and load instructions (but not store...).
It’s worth noting that this isn’t an issue on 32-bit Arm, since PC can be used as a general purpose register in both load and store.) If we cannot assume SP points to valid memory, and we don’t know the value of any other register (xzr cannot be the base register used for address computation), there is no way we can store to a known address without trashing a register.

Fortunately, even though we don’t know the exact values of any of the other registers, nor their offset to a known memory location, we do know the likely range of one register, or at least the range of that register that we care about, i.e. the syscall number register x8. If the value of x8 is larger than some pre-determined value (~1000 should be good enough since that’s the RR_CALL_BASE), we can either issue a real syscall directly or just return -ENOSYS. If the value of x8 is within a small range, we can map it to a range of addresses that we know is valid, in a reversible way, and use that to store any number of other registers to memory.

For this to work we need to do the comparison and the x8 to address mapping without using any registers. The comparison can easily be done using either cmp or tst. AFAICT, there is no easy way to do the comparison without setting the processor flags (a chain of tbnz works but is not ideal). The most straightforward mapping is to map x8 to an address in the thread local page at 0x70010000. Unfortunately, this offset cannot be encoded in an add instruction. We can, however, use movk, since we know the high bits of x8 are zero and don’t need to be saved. To summarize, the minimum stub prologue we should use is roughly
```
    cmp     x8, 1024
    b.hi    .Lnosys
    movk    x8, 0x7001, lsl 16 // x8 |= 0x70010000 (bits 16..31 of x8 are known to be zero)
    stp     xm, xn, [x8, #offset]
    // we can use xm and xn after this point

.Lnosys:
    mov     x0, -ENOSYS // or `svc 0` if we want to support that.
    b       syscall_return_address
```
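
The reversibility of the x8 to address mapping can be checked with a small model in C. This is only a sketch: the helper names (movk, syscallno_to_addr, addr_to_syscallno) and the two constants' names are made up for illustration, while the values 0x70010000 and 1024 come from the text above. It models movk as "replace one 16-bit field, keep everything else", which is why setting bits 16..31 to 0x7001 and later clearing them again recovers the syscall number.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the thread local page, as given in the text. */
#define PRELOAD_THREAD_LOCALS UINT64_C(0x70010000)
/* Largest syscall number the stub accepts (the `cmp x8, 1024` bound). */
#define MAX_SYSCALL_NO UINT64_C(1024)

/* Model of `movk xd, #imm16, lsl #shift`: replace one 16-bit field of the
 * register and keep all other bits. */
uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
    assert(shift == 0 || shift == 16 || shift == 32 || shift == 48);
    uint64_t mask = UINT64_C(0xffff) << shift;
    return (reg & ~mask) | ((uint64_t)imm16 << shift);
}

/* Forward mapping: syscall number -> address inside the thread local page. */
uint64_t syscallno_to_addr(uint64_t x8)
{
    assert(x8 <= MAX_SYSCALL_NO); /* guaranteed by the cmp/b.hi pair */
    return movk(x8, (uint16_t)(PRELOAD_THREAD_LOCALS >> 16), 16);
}

/* Reverse mapping: clear bits 16..31 again to recover the syscall number. */
uint64_t addr_to_syscallno(uint64_t addr)
{
    return movk(addr, 0, 16);
}
```

For example, syscall number 0xdc maps to 0x700100dc and back to 0xdc. Note that the address the registers are stored at depends on the syscall number, so whoever reads them back has to use the still-mapped x8 as the base.
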
Exiting from the stub

Before we return to the original code, we need to set all registers back to their original values (minus the ones changed by the syscall). However, we need to use at least one register to hold the address for the branch back, and since there are no instructions of ours after the syscall in the original code, the register values cannot be restored there and have to be restored in the runtime stub instead. We could in principle use PC relative addressing to load the original value of this register, but it’ll be easier to use a register that points to an address in the thread local area instead, so that it can more easily use the current logic for thread local storage and avoid potential conflicts between threads. The restoring logic can simply be ldp xm, xn, [xm], and we can actually reuse part of the same area we’ve allocated for the initial stash area for this purpose. This area will also be used for restoring registers from clone (see below), and it’ll need to be cloned (just the first two elements) for a cloned task.
clone handling

The clone syscall may do something to the stack that is very difficult for us to handle in the C code. Therefore, we should have a branch to check that and do a raw traced syscall directly from the syscall hook assembly code (x86 avoids this issue by failing to match and patch the syscall site). We can do this with a simple cmp x8, 0xdc and branch when entering the syscall hook code. We have to be careful to never touch sp in this branch since the clone syscall may be doing things with it that we don’t want to deal with... The branch may need to happen after we’ve restored x8 and saved the registers in the location that the stub will restore them from.
Passing information from stub to the syscall hook.

After we enter the syscall hook, we need to know how to make the syscall and return to where we came from. Since we need to return to the stub, this means we need to know the address of the stub. This is most easily done by using a blr instruction from the stub when jumping to the syscall hook. This overwrites the x30 register, so that’s one of the registers we need to save (to memory) in the stub. In order to help with unwinding, and since the stub won’t have unwind info, we should pass the address of the real callsite to the syscall hook as well. We can do this by simply storing the address in the stub and having the syscall hook code load it from there. The syscall hook code just needs to know how to find it based on the x30 value after we enter from the stub code. To make the control flow look more like normal code, we'll store the return address with an offset, past the end of all the instructions in the stub.

Full stub code

    cmp     x8, 1024
    b.hi    .Lnosys
    movk    x8, preload_thread_locals
    stp     x15, x30, [x8, stub_scratch_2 - preload_thread_locals]
    movz    x30, #:abs_g3:_syscall_hook_trampoline
    movk    x30, #:abs_g2_nc:_syscall_hook_trampoline
    movk    x30, #:abs_g1_nc:_syscall_hook_trampoline
    movk    x30, #:abs_g0_nc:_syscall_hook_trampoline // Might be shorter depending on the address
    blr     x30
    // we return from syscall hook to here
    ldp     x15, x30, [x15]
.Lreturn:
    b       syscall_return_address
.Lnosys:
    mov     x0, -ENOSYS // or `svc 0` if we want to support that.
    b       .Lreturn
    .long <syscall return address>

We save x15 in addition to x30 since it’s much easier if we have two registers to play with. Since x15 is a scratch register, hopefully it matters less if the debugger can’t restore it in unwinding. (We’ll have valid unwind info for it so this shouldn’t matter much.)
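
The movz/movk sequence in the stub builds the 64-bit hook address from four 16-bit chunks. The following C sketch shows that split and the reassembly. The function names are made up for illustration, and mov_insn_count reflects only one plausible reading of "might be shorter depending on the address", namely that all-zero chunks can be skipped because the leading movz already zeroes the rest of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit address into the four 16-bit chunks selected by the
 * abs_g0 .. abs_g3 relocations (g0 = bits 0..15, g3 = bits 48..63). */
void split_address(uint64_t addr, uint16_t chunk[4])
{
    for (int i = 0; i < 4; i++)
        chunk[i] = (uint16_t)(addr >> (16 * i));
}

/* Rebuild the address the way the movz/movk sequence does: movz writes the
 * top chunk and zeroes everything else, each movk replaces one lower chunk. */
uint64_t join_chunks(const uint16_t chunk[4])
{
    uint64_t reg = (uint64_t)chunk[3] << 48;           /* movz, g3 */
    for (int i = 2; i >= 0; i--) {                     /* movk, g2 .. g0 */
        uint64_t mask = UINT64_C(0xffff) << (16 * i);
        reg = (reg & ~mask) | ((uint64_t)chunk[i] << (16 * i));
    }
    return reg;
}

/* Number of mov-immediate instructions needed if all-zero chunks are
 * skipped (assumption; at least one instruction is needed to load 0). */
int mov_insn_count(uint64_t addr)
{
    int n = 0;
    for (int i = 0; i < 4; i++)
        if ((addr >> (16 * i)) & 0xffff)
            n++;
    assert(n <= 4);
    return n ? n : 1;
}
```

Under that assumption, an address such as 0x70000000 needs a single instruction, while a typical 48-bit userspace address needs three.
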

Update: it turns out that at least the rr tests do rely on an invalid syscall still triggering an event (which it won’t if we simply return -ENOSYS from userspace), so we need to do a syscall from within the stub, with a check in the patching code to avoid patching this syscall again.

IP range checking that involves the runtime stub

AFAICT, there is currently one place where rr checks whether the code is in the runtime jump stub. With the need to make a syscall from it (see above), we also need to add another one. These are:
1. Check whether we can deliver a signal
  
  The code also makes the assumption that if we are in the stub (more like syscallbuf) code, we’ll exit through _syscallbuf_final_exit_instruction, so that we can catch it by setting a breakpoint there. Therefore, we have to skip the first two instructions (the x8 range check) and the last four instructions (the return stub and the fallback syscall handling) in the check. We could maybe change the _syscallbuf_final_exit_instruction handling to look for the addresses in the stubs instead, but that’s a bit unnecessary...
  
  Since we are returning through the stub epilogue and the code there uses the thread local memory, we need to avoid delivering a signal before the stub finishes using the thread local memory. Otherwise, the user signal handler might clobber it, causing us to restore the registers to the wrong content. The only instruction we need to be careful of here is the ldp on the return path. It is past the normal syscall hook exit breakpoint, so we have to deal with it slightly differently. For now we can simply check and add a breakpoint on the exit of the stub and manually do the return from the stub when we hit that breakpoint.
2. Check whether we should patch the syscall.
  
  This should include the full stub range.
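
The two range checks differ only in which instructions of the stub they cover. A minimal C sketch, assuming a hypothetical stub_range record that stores the stub's start address and its length in instructions (rr's real bookkeeping is not shown in this document and may look different):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INSN_SIZE 4u /* every AArch64 instruction is 4 bytes */

/* Hypothetical description of one runtime stub. */
struct stub_range {
    uint64_t start;      /* address of the first stub instruction */
    unsigned insn_count; /* number of instructions in the stub */
};

/* Check 2: the full stub range, used to avoid patching the fallback
 * syscall inside the stub again. */
bool ip_in_stub(const struct stub_range *s, uint64_t ip)
{
    uint64_t end = s->start + (uint64_t)s->insn_count * INSN_SIZE;
    return ip >= s->start && ip < end;
}

/* Check 1: the part of the stub that is expected to exit through
 * _syscallbuf_final_exit_instruction, i.e. the stub without its first
 * two and last four instructions. */
bool ip_in_stub_body(const struct stub_range *s, uint64_t ip)
{
    assert(s->insn_count >= 6);
    uint64_t lo = s->start + 2 * INSN_SIZE;
    uint64_t hi = s->start + (uint64_t)(s->insn_count - 4) * INSN_SIZE;
    return ip >= lo && ip < hi;
}
```
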
Unpatching

Since we use the stub for the return as well, and it seems like some caller might be in the hook when we call unpatch, we need to make sure not to overwrite anything that’s used by the return path. The easiest approach seems to be to overwrite just the first two instructions with svc 0 and b syscall_return_address. Since the branch instruction encodes a relative address, this won’t be the same instruction as the original branch, and we need to make sure the address of the stub is within the right range for the unpatch jump to be encoded. (In practice though, the fact that we can jump here and jump back already guarantees that mathematically.)
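
To make the range requirement concrete, here is a C sketch of how the two unpatch instructions could be encoded. The function names are hypothetical and this is not rr's patching code. The encodings themselves are architectural: svc 0 is 0xd4000001, and b carries a signed 26-bit word offset relative to the branch instruction itself, which gives a reach of ±128 MiB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVC_0 UINT32_C(0xd4000001) /* encoding of `svc 0` */

/* Can a `b` placed at address `from` reach `to`? The offset must be a
 * multiple of 4 and fit in a signed 26-bit word offset. */
bool b_in_range(uint64_t from, uint64_t to)
{
    int64_t off = (int64_t)(to - from);
    return (off & 3) == 0 &&
           off >= -(INT64_C(1) << 27) && off < (INT64_C(1) << 27);
}

/* Encode `b to` for an instruction located at `from`. */
uint32_t encode_b(uint64_t from, uint64_t to)
{
    assert(b_in_range(from, to));
    uint64_t off = to - from; /* two's complement byte offset */
    return UINT32_C(0x14000000) | ((uint32_t)(off >> 2) & UINT32_C(0x03ffffff));
}

/* The two instructions that replace the start of the stub on unpatch.
 * The branch sits in the second slot, so it is encoded relative to
 * stub_start + 4, not relative to the original syscall site. */
void encode_unpatch(uint64_t stub_start, uint64_t syscall_return_address,
                    uint32_t out[2])
{
    out[0] = SVC_0;
    out[1] = encode_b(stub_start + 4, syscall_return_address);
}
```
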

Syscall hook prologue

Once we enter the syscall hook, we need to change how the registers are saved and restore x8. Changing the register saving address to a fixed one means that we don’t need to waste a register remembering that address anymore. We also want to restore x8 since we’ll be done using this trick and we want to enter the syscall with the right syscall number. This would also prepare us in case we got here with a clone call and we want to bail out early.

    bti     c // BTI compatible
    mov     x15, preload_thread_locals
    // Stash away x30 so that we can have two registers to use again
    // we can't use stub_scratch_2 since we might overwrite the data there
    str     x30, [x15, stub_scratch_1 - preload_thread_locals]
    // Move the saving area to the start of scratch_2
    // Do it in the forward order since we know x8 >= x15
    ldr     x30, [x8, stub_scratch_2 - preload_thread_locals]
    str     x30, [x15, stub_scratch_2 - preload_thread_locals]
    ldr     x30, [x8, stub_scratch_2 - preload_thread_locals + 8]
    str     x30, [x15, stub_scratch_2 - preload_thread_locals + 8]
    // Restore x8
    movk    x8, 0, LSL 16

By the end of the prologue, all registers are back to their original values, except for x15 and x30, which have their old values in stub_scratch_2. The stub address is saved in stub_scratch_1.

Clone handling

Most of the requirements have been laid out already:

- Do not touch sp (before or after the syscall)
- Return through _syscallbuf_final_exit_instruction (which is just a ret)
- Store the return address in stub_scratch_1 (for signal handling)
- Bonus points for keeping the unwind info valid the whole way through

    cmp     x8, 0xdc // SYS_clone
    b.eq    .Lclone

.Lclone:
    // Must not touch sp in this branch.
    // Use x15 to remember the return address since we are only copying
    // the first two elements of stub_scratch_2 for the child.
    ldr     x15, [x15, stub_scratch_1 - preload_thread_locals]
    mov     x30, 0x70000000 // RR_PAGE_SYSCALL_TRACED
    blr     x30
    // stub_scratch_2 content is maintained by rr
    // we need to put the syscall return address in stub_scratch_1
    movz    x30, #:abs_g1:stub_scratch_2 // assume 32bit address
    movk    x30, #:abs_g0_nc:stub_scratch_2
    str     x15, [x30, 16] // stash away stub address
    ldr     x15, [x15] // syscall return address
    str     x15, [x30, stub_scratch_1 - stub_scratch_2]
    mov     x15, x30
    ldr     x30, [x15, 16]
    add     x30, x30, 8 // actual return address
    b       _syscallbuf_final_exit_instruction

Stack switching

Once we know that we are not dealing with clone, we can switch to the new stack and save everything to the new one

    ldr     w30, [x15, alt_stack_nesting_level - preload_thread_locals]
    cmp     w30, 0
    add     w30, w30, 1
    str     w30, [x15, alt_stack_nesting_level - preload_thread_locals]

    b.ne    .Lnest
    ldr     x30, [x15, syscallbuf_stub_alt_stack - preload_thread_locals]
    sub     x30, x30, 48
    b       .Lstackset
.Lnest:
    sub     x30, sp, 48
.Lstackset:
    // Now x30 points to the new stack with 48 bytes of space allocated

    // Move sp into a normal register. Otherwise we can't store it
    mov     x15, sp
    // Save sp to new stack.
    str     x15, [x30, 16]
    mov     sp, x30
    // sp is switched, x15 and x30 are free to use
    // [stub_scratch_1] holds the stub address

    // Now we need to construct the stack frame, with everything
    // in the scratch area copied over so that we can nest again.
    mov     x15, preload_thread_locals
    // load runtime stub address
    ldr     x30, [x15, stub_scratch_1 - preload_thread_locals]
    // save stub return address
    str     x30, [sp]
    // load syscall return address
    ldr     x30, [x30, -8]
    str     x30, [sp, 8]
    ldr     x30, [x15, stub_scratch_2 - preload_thread_locals]
    str     x30, [sp, 24]
    ldr     x30, [x15, stub_scratch_2 - preload_thread_locals + 8]
    str     x30, [sp, 32]

    // stackframe layout
    // 32: original x30
    // 24: original x15
    // 16: original sp
    // 8: return address to syscall
    // 0: return address to stub

syscall hook epilogue

The _syscall_hook_trampoline restores all the registers to their previous values (again, minus the register for the syscall return value), so we just need to restore the registers we’ve overwritten by the end of the stack switch, i.e. x15, x30 and sp. x15 and x30 will be restored when we get back to the stub, so we don’t need to restore them here, but we do need to copy their values to stub_scratch_2 again so that the stub can restore them (since without a valid stack that is still the only memory we can use to restore things; at least this time we don’t need to hunt for a register to store the address). We also need to store the return address to stub_scratch_1, since that’ll help rr with setting the breakpoint.

    movz    x15, #:abs_g1:stub_scratch_2 // assume 32bit address
    movk    x15, #:abs_g0_nc:stub_scratch_2
    ldr     x30, [sp, 24] // x15
    str     x30, [x15]
    ldr     x30, [sp, 32] // x30
    str     x30, [x15, 8]
    ldr     x30, [sp, 8] // syscall return address
    // tell rr breakpoint handling where we are going
    str     x30, [x15, stub_scratch_1 - stub_scratch_2]
    ldr     x30, [sp] // stub return address
    ldr     x15, [sp, 16] // sp
    mov     sp, x15
    movz    x15, #:abs_g1:stub_scratch_2 // assume 32bit address
    movk    x15, #:abs_g0_nc:stub_scratch_2
_syscallbuf_final_exit_instruction:
    ret

The manual unwind info for the syscall hook is left as an exercise to the reader (see the actual implementation for the answer).
