Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lldb/aarch64] Fix unwinding when signal interrupts a leaf function #91321

Merged
merged 1 commit into from
May 9, 2024

Conversation

labath
Copy link
Collaborator

@labath labath commented May 7, 2024

A leaf function may not store the link register to stack, but we it can still end up being a non-zero frame if it gets interrupted by a signal. Currently, we were unable to unwind past this function because we could not read the link register value.

To make this work, this patch:

  • changes the function-entry unwind plan to include the fp|lr = <same> rules. This in turn necessitated an adjustment in the generic instruction emulation logic to ensure that lr=[sp-X] can override the <same> rule.
  • allows the <same> rule for pc and lr in all m_all_registers_available frames (and not just frame zero).

The test verifies that we can unwind in a situation like this, and that the backtrace matches the one we computed before getting a signal.

A leaf function may not store the link register to stack, but we it can
still end up being a non-zero frame if it gets interrupted by a signal.
Currently, we were unable to unwind past this function because we could
not read the link register value.

To make this work, this patch:
- changes the function-entry unwind plan to include the
  `fp|lr = <same>` rules. This in turn necessitated an adjustment in the
  generic instruction emulation logic to ensure that `lr=[sp-X]` can
  override the `<same>` rule.
- allows the `<same>` rule for pc and lr in all `m_all_registers_available`
  frames (and not just frame zero).

The test verifies that we can unwind in a situation like this, and that
the backtrace matches the one we computed before getting a signal.
@labath labath requested a review from jasonmolenda May 7, 2024 12:00
@labath labath requested a review from JDevlieghere as a code owner May 7, 2024 12:00
@llvmbot llvmbot added the lldb label May 7, 2024
@llvmbot
Copy link
Member

llvmbot commented May 7, 2024

@llvm/pr-subscribers-lldb

Author: Pavel Labath (labath)

Changes

A leaf function may not store the link register to stack, but we it can still end up being a non-zero frame if it gets interrupted by a signal. Currently, we were unable to unwind past this function because we could not read the link register value.

To make this work, this patch:

  • changes the function-entry unwind plan to include the fp|lr = &lt;same&gt; rules. This in turn necessitated an adjustment in the generic instruction emulation logic to ensure that lr=[sp-X] can override the &lt;same&gt; rule.
  • allows the &lt;same&gt; rule for pc and lr in all m_all_registers_available frames (and not just frame zero).

The test verifies that we can unwind in a situation like this, and that the backtrace matches the one we computed before getting a signal.


Full diff: https://github.com/llvm/llvm-project/pull/91321.diff

6 Files Affected:

  • (modified) lldb/source/Plugins/Instruction/ARM64/EmulateInstructionARM64.cpp (+2)
  • (modified) lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp (+1-3)
  • (modified) lldb/source/Target/RegisterContextUnwind.cpp (+3-3)
  • (added) lldb/test/Shell/Unwind/Inputs/signal-in-leaf-function-aarch64.c (+15)
  • (added) lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test (+24)
  • (modified) lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp (+20-4)
diff --git a/lldb/source/Plugins/Instruction/ARM64/EmulateInstructionARM64.cpp b/lldb/source/Plugins/Instruction/ARM64/EmulateInstructionARM64.cpp
index 6ca4fb052457e..62ecac3e0831d 100644
--- a/lldb/source/Plugins/Instruction/ARM64/EmulateInstructionARM64.cpp
+++ b/lldb/source/Plugins/Instruction/ARM64/EmulateInstructionARM64.cpp
@@ -444,6 +444,8 @@ bool EmulateInstructionARM64::CreateFunctionEntryUnwind(
 
   // Our previous Call Frame Address is the stack pointer
   row->GetCFAValue().SetIsRegisterPlusOffset(gpr_sp_arm64, 0);
+  row->SetRegisterLocationToSame(gpr_lr_arm64, /*must_replace=*/false);
+  row->SetRegisterLocationToSame(gpr_fp_arm64, /*must_replace=*/false);
 
   unwind_plan.AppendRow(row);
   unwind_plan.SetSourceName("EmulateInstructionARM64");
diff --git a/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp b/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
index c4a171ec7d01b..49edd40544e32 100644
--- a/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
+++ b/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
@@ -424,8 +424,6 @@ size_t UnwindAssemblyInstEmulation::WriteMemory(
     log->PutString(strm.GetString());
   }
 
-  const bool cant_replace = false;
-
   switch (context.type) {
   default:
   case EmulateInstruction::eContextInvalid:
@@ -467,7 +465,7 @@ size_t UnwindAssemblyInstEmulation::WriteMemory(
         m_pushed_regs[reg_num] = addr;
         const int32_t offset = addr - m_initial_sp;
         m_curr_row->SetRegisterLocationToAtCFAPlusOffset(reg_num, offset,
-                                                         cant_replace);
+                                                         /*can_replace=*/true);
         m_curr_row_modified = true;
       }
     }
diff --git a/lldb/source/Target/RegisterContextUnwind.cpp b/lldb/source/Target/RegisterContextUnwind.cpp
index 13e101413a477..e2d712cb72eae 100644
--- a/lldb/source/Target/RegisterContextUnwind.cpp
+++ b/lldb/source/Target/RegisterContextUnwind.cpp
@@ -1555,12 +1555,12 @@ RegisterContextUnwind::SavedLocationForRegister(
   }
 
   if (unwindplan_regloc.IsSame()) {
-    if (!IsFrameZero() &&
+    if (!m_all_registers_available &&
         (regnum.GetAsKind(eRegisterKindGeneric) == LLDB_REGNUM_GENERIC_PC ||
          regnum.GetAsKind(eRegisterKindGeneric) == LLDB_REGNUM_GENERIC_RA)) {
       UnwindLogMsg("register %s (%d) is marked as 'IsSame' - it is a pc or "
-                   "return address reg on a non-zero frame -- treat as if we "
-                   "have no information",
+                   "return address reg on a frame which does not have all "
+                   "registers available -- treat as if we have no information",
                    regnum.GetName(), regnum.GetAsKind(eRegisterKindLLDB));
       return UnwindLLDB::RegisterSearchResult::eRegisterNotFound;
     } else {
diff --git a/lldb/test/Shell/Unwind/Inputs/signal-in-leaf-function-aarch64.c b/lldb/test/Shell/Unwind/Inputs/signal-in-leaf-function-aarch64.c
new file mode 100644
index 0000000000000..9a751330623f4
--- /dev/null
+++ b/lldb/test/Shell/Unwind/Inputs/signal-in-leaf-function-aarch64.c
@@ -0,0 +1,15 @@
+#include <signal.h>
+#include <unistd.h>
+
+int __attribute__((naked)) signal_generating_add(int a, int b) {
+  asm("add w0, w1, w0\n\t"
+      "udf #0xdead\n\t"
+      "ret");
+}
+
+void sigill_handler(int) { _exit(0); }
+
+int main() {
+  signal(SIGILL, sigill_handler);
+  return signal_generating_add(42, 47);
+}
diff --git a/lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test b/lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test
new file mode 100644
index 0000000000000..0580d0cf734ae
--- /dev/null
+++ b/lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test
@@ -0,0 +1,24 @@
+# REQUIRES: target-aarch64 && native
+# UNSUPPORTED: system-windows
+
+# RUN: %clang_host %S/Inputs/signal-in-leaf-function-aarch64.c -o %t
+# RUN: %lldb -s %s -o exit %t | FileCheck %s
+
+breakpoint set -n sigill_handler
+# CHECK: Breakpoint 1: where = {{.*}}`sigill_handler
+
+run
+# CHECK: thread #1, {{.*}} stop reason = signal SIGILL
+
+thread backtrace
+# CHECK: frame #0: [[ADD:0x[0-9a-fA-F]*]] {{.*}}`signal_generating_add
+# CHECK: frame #1: [[MAIN:0x[0-9a-fA-F]*]] {{.*}}`main
+
+continue
+# CHECK: thread #1, {{.*}} stop reason = breakpoint 1
+
+thread backtrace
+# CHECK: frame #0: {{.*}}`sigill_handler
+# Unknown number of signal trampoline frames
+# CHECK: frame #{{[0-9]+}}: [[ADD]] {{.*}}`signal_generating_add
+# CHECK: frame #{{[0-9]+}}: [[MAIN]] {{.*}}`main
diff --git a/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp b/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp
index 80abeb8fae9e5..9303d6f5f3c6e 100644
--- a/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp
+++ b/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp
@@ -77,7 +77,7 @@ TEST_F(TestArm64InstEmulation, TestSimpleDarwinFunction) {
 
   // UnwindPlan we expect:
 
-  // row[0]:    0: CFA=sp +0 =>
+  // row[0]:    0: CFA=sp +0 => fp= <same> lr= <same>
   // row[1]:    4: CFA=sp+16 => fp=[CFA-16] lr=[CFA-8]
   // row[2]:    8: CFA=fp+16 => fp=[CFA-16] lr=[CFA-8]
   // row[2]:   16: CFA=sp+16 => fp=[CFA-16] lr=[CFA-8]
@@ -88,13 +88,19 @@ TEST_F(TestArm64InstEmulation, TestSimpleDarwinFunction) {
   EXPECT_TRUE(engine->GetNonCallSiteUnwindPlanFromAssembly(
       sample_range, data, sizeof(data), unwind_plan));
 
-  // CFA=sp +0
+  // CFA=sp +0 => fp= <same> lr= <same>
   row_sp = unwind_plan.GetRowForFunctionOffset(0);
   EXPECT_EQ(0ull, row_sp->GetOffset());
   EXPECT_TRUE(row_sp->GetCFAValue().GetRegisterNumber() == gpr_sp_arm64);
   EXPECT_TRUE(row_sp->GetCFAValue().IsRegisterPlusOffset() == true);
   EXPECT_EQ(0, row_sp->GetCFAValue().GetOffset());
 
+  EXPECT_TRUE(row_sp->GetRegisterInfo(gpr_fp_arm64, regloc));
+  EXPECT_TRUE(regloc.IsSame());
+
+  EXPECT_TRUE(row_sp->GetRegisterInfo(gpr_lr_arm64, regloc));
+  EXPECT_TRUE(regloc.IsSame());
+
   // CFA=sp+16 => fp=[CFA-16] lr=[CFA-8]
   row_sp = unwind_plan.GetRowForFunctionOffset(4);
   EXPECT_EQ(4ull, row_sp->GetOffset());
@@ -146,6 +152,12 @@ TEST_F(TestArm64InstEmulation, TestSimpleDarwinFunction) {
   EXPECT_TRUE(row_sp->GetCFAValue().GetRegisterNumber() == gpr_sp_arm64);
   EXPECT_TRUE(row_sp->GetCFAValue().IsRegisterPlusOffset() == true);
   EXPECT_EQ(0, row_sp->GetCFAValue().GetOffset());
+
+  EXPECT_TRUE(row_sp->GetRegisterInfo(gpr_fp_arm64, regloc));
+  EXPECT_TRUE(regloc.IsSame());
+
+  EXPECT_TRUE(row_sp->GetRegisterInfo(gpr_lr_arm64, regloc));
+  EXPECT_TRUE(regloc.IsSame());
 }
 
 TEST_F(TestArm64InstEmulation, TestMediumDarwinFunction) {
@@ -381,8 +393,12 @@ TEST_F(TestArm64InstEmulation, TestFramelessThreeEpilogueFunction) {
   EXPECT_FALSE(row_sp->GetRegisterInfo(gpr_x26_arm64, regloc));
   EXPECT_FALSE(row_sp->GetRegisterInfo(gpr_x27_arm64, regloc));
   EXPECT_FALSE(row_sp->GetRegisterInfo(gpr_x28_arm64, regloc));
-  EXPECT_FALSE(row_sp->GetRegisterInfo(gpr_fp_arm64, regloc));
-  EXPECT_FALSE(row_sp->GetRegisterInfo(gpr_lr_arm64, regloc));
+
+  EXPECT_TRUE(row_sp->GetRegisterInfo(gpr_fp_arm64, regloc));
+  EXPECT_TRUE(regloc.IsSame());
+
+  EXPECT_TRUE(row_sp->GetRegisterInfo(gpr_lr_arm64, regloc));
+  EXPECT_TRUE(regloc.IsSame());
 
   row_sp = unwind_plan.GetRowForFunctionOffset(36);
   EXPECT_TRUE(row_sp->GetCFAValue().GetRegisterNumber() == gpr_sp_arm64);

@@ -467,7 +465,7 @@ size_t UnwindAssemblyInstEmulation::WriteMemory(
m_pushed_regs[reg_num] = addr;
const int32_t offset = addr - m_initial_sp;
m_curr_row->SetRegisterLocationToAtCFAPlusOffset(reg_num, offset,
cant_replace);
/*can_replace=*/true);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be narrowed down so that it only overwrites the <same> rules, but I'm not sure it's necessary given that lines 464&465 ensure that the register can get pushed only once.

@@ -1555,12 +1555,12 @@ RegisterContextUnwind::SavedLocationForRegister(
}

if (unwindplan_regloc.IsSame()) {
if (!IsFrameZero() &&
if (!m_all_registers_available &&
(regnum.GetAsKind(eRegisterKindGeneric) == LLDB_REGNUM_GENERIC_PC ||
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pc=<same> is probably still only valid for real frame zero, so we could make the m_all_registers_available check lr-only.

.. or drop the lr check entirely, since some non-ABI-respecting functions could actually preserve the value of lr even if they are not leaf functions.

Copy link
Collaborator

@jasonmolenda jasonmolenda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I know there are other codepaths that handle this correctly, where we can backtrace out of a frameless function that faults into a trap handler and we have the entire register state available in the trap handler.

Looking at this, I'm a little uncertain why we have m_behaves_like_zeroth_frame and m_all_registers_available which are both set to true under the same conditions, and then we sometimes use m_behaves_like_zeroth_frame, sometimes m_all_registers_available , and sometimes call RegisterContextUnwind::BehavesLikeZerothFrame which has a redundant check if the frame number is 0, sigh. Looks like some accumulated nonsense that you shouldn't have to deal with in this patch.

Copy link
Collaborator

@clayborg clayborg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@labath labath merged commit fd1bd53 into llvm:main May 9, 2024
6 checks passed
@labath
Copy link
Collaborator Author

labath commented May 9, 2024

This looks good to me, I know there are other codepaths that handle this correctly, where we can backtrace out of a frameless function that faults into a trap handler and we have the entire register state available in the trap handler.

Looking at this, I'm a little uncertain why we have m_behaves_like_zeroth_frame and m_all_registers_available which are both set to true under the same conditions, and then we sometimes use m_behaves_like_zeroth_frame, sometimes m_all_registers_available , and sometimes call RegisterContextUnwind::BehavesLikeZerothFrame which has a redundant check if the frame number is 0, sigh. Looks like some accumulated nonsense that you shouldn't have to deal with in this patch.

Thanks for the review. I see that the new test is failing on green dragon https://green.lab.llvm.org/job/llvm.org/view/LLDB/job/as-lldb-cmake/3596/testReport/lldb-shell/Unwind/signal_in_leaf_function_aarch64_test/, and from the looks of it, it is failing because it is failing to unwind in the test scenario (I don't have aarch64 mac hardware to confirm). What do you want me to do about this? Disable the test on mac and file a bug?

@adrian-prantl
Copy link
Collaborator

I think that would be a good start. Also tagging @jasonmolenda for advice.

@jasonmolenda
Copy link
Collaborator

Will debug this today.

@jasonmolenda
Copy link
Collaborator

Ah, so the problem is,

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0xdead)
  * frame #0: 0x0000000100003f2c b.out`signal_generating_add + 4 at signal-in-leaf-function-aarch64.c:5
    frame #1: 0x0000000100003f7c b.out`main + 44 at signal-in-leaf-function-aarch64.c:14
    frame #2: 0x0000000185c76244 dyld`start + 2792

The bad instruction in signal_generating_add(), is sent to the program master a EXC_BAD_INSTRUCTION mach exception, not a posix signal (or maybe it was received by lldb as a mach exception and if it had continued to the process it would be recieved as a signal).

When I've done these kinds of test cases in the past, I usually add a signal handler and then send the signal to the inferior, e.g. lldb/test/API/functionalities/unwind/sigtramp/main.c -- this test is marked @skipUnlessDarwin with a comment of

    # On different platforms the "_sigtramp" and "__kill" frames are likely to be different.
    # This test could probably be adapted to run on linux/*bsd easily enough.

(the test explicitly checks the stack for the function names that should appear above the signal handler)

@labath labath deleted the unwind branch May 10, 2024 06:10
@labath
Copy link
Collaborator Author

labath commented May 10, 2024

I have fixed/worked around the mach exception issue in a followup commit with a settings set platform.plugin.darwin.ignored-exceptions EXC_BAD_INSTRUCTION. Now the process gets a SIGILL as expected, but it still fails at the backtrace step (the second one, after stopping inside the handler).

@jasonmolenda
Copy link
Collaborator

I have fixed/worked around the mach exception issue in a followup commit with a settings set platform.plugin.darwin.ignored-exceptions EXC_BAD_INSTRUCTION. Now the process gets a SIGILL as expected, but it still fails at the backtrace step (the second one, after stopping inside the handler).

Oh, that's a clever idea, I forgot about that setting.

If you add process handle -p true -s false SIGILL (lldb was stopping on the SIGILL signal before sigtramp / the registered handler ran), then sigill_handler is called,

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100003f4c a.out`sigill_handler(signo=4) + 20 at signal-in-leaf-function-aarch64.c:10
    frame #1: 0x0000000197f93584 libsystem_platform.dylib`_sigtramp + 56
    frame #2: 0x0000000100003f7c a.out`main + 44 at signal-in-leaf-function-aarch64.c:14
    frame #3: 0x0000000197bda0e0 dyld`start + 2360

@jasonmolenda
Copy link
Collaborator

Ah, I misunderstood what the nature of the failure was. I tried running the shell test, and it's failing for different reasons. I almost never touch shell tests, I find them really hard to debug so I'm not sure what the problem is. If I run it by hand,

(lldb) settings set platform.plugin.darwin.ignored-exceptions EXC_BAD_INSTRUCTION
(lldb) b sigill_handler
Breakpoint 1: where = a.out`sigill_handler + 20 at signal-in-leaf-function-aarch64.c:10:34, address = 0x0000000100003f4c
(lldb) r
Process 25854 launched: '/tmp/a.out' (arm64)
Process 25854 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGILL
    frame #0: 0x0000000100003f2c a.out`signal_generating_add at signal-in-leaf-function-aarch64.c:5:3
   2   	#include <unistd.h>
   3   	
   4   	int __attribute__((naked)) signal_generating_add(int a, int b) {
-> 5   	  asm("add w0, w1, w0\n\t"
   6   	      "udf #0xdead\n\t"
   7   	      "ret");
   8   	}
(lldb) c
Process 25854 resuming
Process 25854 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f4c a.out`sigill_handler(signo=4) at signal-in-leaf-function-aarch64.c:10:34
   7   	      "ret");
   8   	}
   9   	
-> 10  	void sigill_handler(int signo) { _exit(0); }
   11  	
   12  	int main() {
   13  	  signal(SIGILL, sigill_handler);
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100003f4c a.out`sigill_handler(signo=4) at signal-in-leaf-function-aarch64.c:10:34
    frame #1: 0x0000000197f93584 libsystem_platform.dylib`_sigtramp + 56
    frame #2: 0x0000000100003f7c a.out`main at signal-in-leaf-function-aarch64.c:14:3
    frame #3: 0x0000000197bda0e0 dyld`start + 2360
(lldb) 

which all looks good to me, but it the shell test fails with

/Volumes/work/llvm/llvm-project/lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test:26:10: error: CHECK: expected string not found in input
# CHECK: frame #{{[0-9]+}}: [[ADD]] {{.*}}`signal_generating_add
         ^
<stdin>:32:86: note: scanning from here
 frame #0: 0x0000000100003f38 signal-in-leaf-function-aarch64.test.tmp`sigill_handler
                                                                                     ^
<stdin>:32:86: note: with "ADD" equal to "0x0000000100003f2c"
 frame #0: 0x0000000100003f38 signal-in-leaf-function-aarch64.test.tmp`sigill_handler
                                                                                     ^
<stdin>:33:23: note: possible intended match here
signal-in-leaf-function-aarch64.test.tmp`sigill_handler:
                      ^

Input file: <stdin>
Check file: /Volumes/work/llvm/llvm-project/lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            .
            .
            .
           27:  frame #2: 0x0000000197bda0e0 dyld`start + 2360 
           28: (lldb) continue 
           29: Process 23467 resuming 
           30: Process 23467 stopped 
           31: * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 
           32:  frame #0: 0x0000000100003f38 signal-in-leaf-function-aarch64.test.tmp`sigill_handler 
check:26'0                                                                                          X error: no match found
check:26'1                                                                                            with "ADD" equal to "0x0000000100003f2c"
           33: signal-in-leaf-function-aarch64.test.tmp`sigill_handler: 
check:26'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:26'2                           ?                                   possible intended match
           34: -> 0x100003f38 <+0>: sub sp, sp, #0x20 
check:26'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           35:  0x100003f3c <+4>: stp x29, x30, [sp, #0x10] 
check:26'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           36:  0x100003f40 <+8>: add x29, sp, #0x10 
check:26'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           37:  0x100003f44 <+12>: stur w0, [x29, #-0x4] 
check:26'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           38: (lldb) thread backtrace 
check:26'0     ~~~~~~~~~~~~~~~~~~~~~~~~
            .
            .
            .
>>>>>>

@jasonmolenda
Copy link
Collaborator

maybe the shell test is building without debug info, I am surprised to see assembly there. If I build it like that and run it by hand,

(lldb) settings set platform.plugin.darwin.ignored-exceptions EXC_BAD_INSTRUCTION
(lldb) b sigill_handler
Breakpoint 1: where = a.out`sigill_handler, address = 0x0000000100003f38
(lldb) r
Process 25891 launched: '/tmp/a.out' (arm64)
Process 25891 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGILL
    frame #0: 0x0000000100003f2c a.out`signal_generating_add + 4
a.out`signal_generating_add:
->  0x100003f2c <+4>:  udf    #0xdead
    0x100003f30 <+8>:  ret    
    0x100003f34 <+12>: brk    #0x1

a.out`sigill_handler:
    0x100003f38 <+0>:  sub    sp, sp, #0x20
(lldb) c
Process 25891 resuming
Process 25891 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f38 a.out`sigill_handler
a.out`sigill_handler:
->  0x100003f38 <+0>:  sub    sp, sp, #0x20
    0x100003f3c <+4>:  stp    x29, x30, [sp, #0x10]
    0x100003f40 <+8>:  add    x29, sp, #0x10
    0x100003f44 <+12>: stur   w0, [x29, #-0x4]
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100003f38 a.out`sigill_handler
    frame #1: 0x0000000197f93584 libsystem_platform.dylib`_sigtramp + 56
    frame #2: 0x0000000100003f7c a.out`main + 44
    frame #3: 0x0000000197bda0e0 dyld`start + 2360
(lldb) 

@omjavaid
Copy link
Contributor

@labath this seems to have broken lldb-aarch64-windows bot with TestInterruptBacktrace.py failing on
num_frames = thread.GetNumFrames()

@labath
Copy link
Collaborator Author

labath commented May 13, 2024

Ah, I misunderstood what the nature of the failure was. I tried running the shell test, and it's failing for different reasons. I almost never touch shell tests, I find them really hard to debug so I'm not sure what the problem is. If I run it by hand,

(lldb) settings set platform.plugin.darwin.ignored-exceptions EXC_BAD_INSTRUCTION
(lldb) b sigill_handler
Breakpoint 1: where = a.out`sigill_handler + 20 at signal-in-leaf-function-aarch64.c:10:34, address = 0x0000000100003f4c
(lldb) r
Process 25854 launched: '/tmp/a.out' (arm64)
Process 25854 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGILL
    frame #0: 0x0000000100003f2c a.out`signal_generating_add at signal-in-leaf-function-aarch64.c:5:3
   2   	#include <unistd.h>
   3   	
   4   	int __attribute__((naked)) signal_generating_add(int a, int b) {
-> 5   	  asm("add w0, w1, w0\n\t"
   6   	      "udf #0xdead\n\t"
   7   	      "ret");
   8   	}
(lldb) c
Process 25854 resuming
Process 25854 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f4c a.out`sigill_handler(signo=4) at signal-in-leaf-function-aarch64.c:10:34
   7   	      "ret");
   8   	}
   9   	
-> 10  	void sigill_handler(int signo) { _exit(0); }
   11  	
   12  	int main() {
   13  	  signal(SIGILL, sigill_handler);
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100003f4c a.out`sigill_handler(signo=4) at signal-in-leaf-function-aarch64.c:10:34
    frame #1: 0x0000000197f93584 libsystem_platform.dylib`_sigtramp + 56
    frame #2: 0x0000000100003f7c a.out`main at signal-in-leaf-function-aarch64.c:14:3
    frame #3: 0x0000000197bda0e0 dyld`start + 2360
(lldb) 

which all looks good to me, but it the shell test fails with
It's not completely correct -- the frame for signal_generating_add is missing. That's what the error message is complaining about.

The test stops the process before and after injecting a signal and verifies that the relevant parts of backtrace (i.e., calls to signal_generating_add and main) are present (and have the same PC) in both places.

/Volumes/work/llvm/llvm-project/lldb/test/Shell/Unwind/signal-in-leaf-function-aarch64.test:26:10: error: CHECK: expected string not found in input
# CHECK: frame #{{[0-9]+}}: [[ADD]] {{.*}}`signal_generating_add

^ This is the most important part of the message.

Btw, at the top of the failure message lit will print the commands executed as a part of this test. For most tests (this one included), you can just yank the lldb command out of there and run it in a terminal to debug.

omjavaid added a commit that referenced this pull request May 13, 2024
…nction (#91321)"

This reverts commit fd1bd53.

TestInterruptBacktrace was broken on AArch64/Windows as a result of this change.
See lldb-aarch64-windows buildbot here:
https://lab.llvm.org/buildbot/#/builders/219/builds/11261
@omjavaid
Copy link
Contributor

LLDB became unresponsive on windows when a thread backtrace command was issued after hitting the exception
i have temporarily reverted the change to make buildbot green.

@labath
Copy link
Collaborator Author

labath commented May 13, 2024

LLDB became unresponsive on windows when a thread backtrace command was issued after hitting the exception i have temporarily reverted the change to make buildbot green.

Could you please give me some more information about the problem? I don't have access to a windows arm machine, and the failure message doesn't give me much to go on.

@jasonmolenda
Copy link
Collaborator

It's not completely correct -- the frame for signal_generating_add is missing. That's what the error message is complaining about.

Ah, thanks, I missed that! Let me debug it and comment further

@jasonmolenda
Copy link
Collaborator

Ah, so the problem here is that we're missing the eh_frame instructions for _sigtramp on arm64 with macOS 14. signal_generating_add is a frameless function (a great stress test in this instance), and _sigtramp is called with enough of a faked-up stack that a stack walk will find the last frame that set up a stack frame before faulting, main in this case. But we skip signal_generating_add. We're going to need to skip this test on macOS for now, and I'll dig in to where the eh_frame instructions went and see about getting them re-added, but it may take a while to get that into builds and on to the bots.

@omjavaid
Copy link
Contributor

LLDB became unresponsive on windows when a thread backtrace command was issued after hitting the exception i have temporarily reverted the change to make buildbot green.

Could you please give me some more information about the problem? I don't have access to a windows arm machine, and the failure message doesn't give me much to go on.

I will try to debug and get back with more information. If you need specific information or logs please let me know.

@jasonmolenda
Copy link
Collaborator

(and it turns out the reason we don't have eh_frame is because _sigtramp on arm64 is written in C, and I'm not sure how I'm going to track which callee-saved register the argument is copied into, so this is definitely not something I can get fixed quickly.)

@labath
Copy link
Collaborator Author

labath commented May 14, 2024

LLDB became unresponsive on windows when a thread backtrace command was issued after hitting the exception i have temporarily reverted the change to make buildbot green.

Could you please give me some more information about the problem? I don't have access to a windows arm machine, and the failure message doesn't give me much to go on.

I will try to debug and get back with more information. If you need specific information or logs please let me know.

Thanks. It's hard to say what exactly will help, because I don't have any idea what could be going wrong. However, since this is an unwinding problem, the unwind log will probably contain some clues. Since this is causing a hang, simply getting the backtrace of lldb in the hanging state would likely be useful. Same goes for the disassembly and the unwind plans (image show-unwind) of the functions on the target stack at the point of hanging.

I know this is quite a lot, but I'm kinda shooting in the dark here. :)

@labath
Copy link
Collaborator Author

labath commented May 14, 2024

Ah, so the problem here is that we're missing the eh_frame instructions for _sigtramp on arm64 with macOS 14. signal_generating_add is a frameless function (a great stress test in this instance), and _sigtramp is called with enough of a faked-up stack that a stack walk will find the last frame that set up a stack frame before faulting, main in this case. But we skip signal_generating_add. We're going to need to skip this test on macOS for now, and I'll dig in to where the eh_frame instructions went and see about getting them re-added, but it may take a while to get that into builds and on to the bots.

Makes sense. I'll leave the test XFAILed then (once the windows issue is sorted out). Thanks for looking into this, I'm glad the test proved to be useful.

labath added a commit to labath/llvm-project that referenced this pull request May 17, 2024
…unction (llvm#91321)"

This reapplies fd1bd53, which was
reverted due to a test failure on aarch64/windows. The failure was
caused by a combination of several factors:
- clang targeting aarch64-windows (unlike msvc, and unlike clang
  targeting other aarch64 platforms) defaults to -fomit-frame-pointers
- lldb's code for looking up register values for `<same>` unwind rules
  is recursive
- the test binary creates a very long chain of fp-less function frames
  (it manages to fit about 22k frames before it blows its stack)

Together, these things have caused lldb to recreate the same deep
recursion when unwinding through this, and blow its own stack as well.
Since lldb frames are larger, about 4k frames like this was sufficient
to trigger the stack overflow.

This version of the patch works around this problem by increasing the
frame size of the test binary, thereby causing it to blow its stack
sooner. This doesn't fix the issue -- the same problem can occur with a
real binary -- but it's not very likely, as it requires an infinite
recursion in a simple (so it doesn't use the frame pointer) function
with a very small frame (so you can fit a lot of them on the stack).

A more principled fix would be to make lldb's lookup code non-recursive,
but I believe that's out of scope for this patch.

The original patch description follows:

A leaf function may not store the link register to stack, but we it can
still end up being a non-zero frame if it gets interrupted by a signal.
Currently, we were unable to unwind past this function because we could
not read the link register value.

To make this work, this patch:
- changes the function-entry unwind plan to include the `fp|lr = <same>`
rules. This in turn necessitated an adjustment in the generic
instruction emulation logic to ensure that `lr=[sp-X]` can override the
`<same>` rule.
- allows the `<same>` rule for pc and lr in all
`m_all_registers_available` frames (and not just frame zero).

The test verifies that we can unwind in a situation like this, and that
the backtrace matches the one we computed before getting a signal.
@labath
Copy link
Collaborator Author

labath commented May 17, 2024

So, it took some effort, but I managed to reproduce the problem and figure out what's going on. You can find my analysis on #92503

labath added a commit that referenced this pull request May 21, 2024
#92503)

…unction (#91321)"

This reapplies fd1bd53, which was
reverted due to a test failure on aarch64/windows. The failure was
caused by a combination of several factors:
- clang targeting aarch64-windows (unlike msvc, and unlike clang
targeting other aarch64 platforms) defaults to -fomit-frame-pointers
- lldb's code for looking up register values for `<same>` unwind rules
is recursive
- the test binary creates a very long chain of fp-less function frames
(it manages to fit about 22k frames before it blows its stack)

Together, these things have caused lldb to recreate the same deep
recursion when unwinding through this, and blow its own stack as well.
Since lldb frames are larger, about 4k frames like this was sufficient
to trigger the stack overflow.

This version of the patch works around this problem by increasing the
frame size of the test binary, thereby causing it to blow its stack
sooner. This doesn't fix the issue -- the same problem can occur with a
real binary -- but it's not very likely, as it requires an infinite
recursion in a simple (so it doesn't use the frame pointer) function
with a very small frame (so you can fit a lot of them on the stack).

A more principled fix would be to make lldb's lookup code non-recursive,
but I believe that's out of scope for this patch.

The original patch description follows:

A leaf function may not store the link register to stack, but we it can
still end up being a non-zero frame if it gets interrupted by a signal.
Currently, we were unable to unwind past this function because we could
not read the link register value.

To make this work, this patch:
- changes the function-entry unwind plan to include the `fp|lr = <same>`
rules. This in turn necessitated an adjustment in the generic
instruction emulation logic to ensure that `lr=[sp-X]` can override the
`<same>` rule.
- allows the `<same>` rule for pc and lr in all
`m_all_registers_available` frames (and not just frame zero).

The test verifies that we can unwind in a situation like this, and that
the backtrace matches the one we computed before getting a signal.
jasonmolenda pushed a commit to jasonmolenda/llvm-project that referenced this pull request Nov 20, 2024
llvm#92503)

…unction (llvm#91321)"

This reapplies fd1bd53, which was
reverted due to a test failure on aarch64/windows. The failure was
caused by a combination of several factors:
- clang targeting aarch64-windows (unlike msvc, and unlike clang
targeting other aarch64 platforms) defaults to -fomit-frame-pointers
- lldb's code for looking up register values for `<same>` unwind rules
is recursive
- the test binary creates a very long chain of fp-less function frames
(it manages to fit about 22k frames before it blows its stack)

Together, these things have caused lldb to recreate the same deep
recursion when unwinding through this, and blow its own stack as well.
Since lldb frames are larger, about 4k frames like this was sufficient
to trigger the stack overflow.

This version of the patch works around this problem by increasing the
frame size of the test binary, thereby causing it to blow its stack
sooner. This doesn't fix the issue -- the same problem can occur with a
real binary -- but it's not very likely, as it requires an infinite
recursion in a simple (so it doesn't use the frame pointer) function
with a very small frame (so you can fit a lot of them on the stack).

A more principled fix would be to make lldb's lookup code non-recursive,
but I believe that's out of scope for this patch.

The original patch description follows:

A leaf function may not store the link register to stack, but we it can
still end up being a non-zero frame if it gets interrupted by a signal.
Currently, we were unable to unwind past this function because we could
not read the link register value.

To make this work, this patch:
- changes the function-entry unwind plan to include the `fp|lr = <same>`
rules. This in turn necessitated an adjustment in the generic
instruction emulation logic to ensure that `lr=[sp-X]` can override the
`<same>` rule.
- allows the `<same>` rule for pc and lr in all
`m_all_registers_available` frames (and not just frame zero).

The test verifies that we can unwind in a situation like this, and that
the backtrace matches the one we computed before getting a signal.

(cherry picked from commit bbd54e0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants