Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debugging segfault on aarch64 (ARM) #86

Closed
Zenexer opened this issue Feb 10, 2022 · 45 comments · Fixed by #175
Closed

Debugging segfault on aarch64 (ARM) #86

Zenexer opened this issue Feb 10, 2022 · 45 comments · Fixed by #175

Comments

@Zenexer
Copy link

Zenexer commented Feb 10, 2022

I'm encountering an occasional segfault in PHP 8.1 on aarch64. The coredumps I've collected point to pcre2, and disabling JIT seems mitigate the segfaults. However, I haven't been able to get a reliable repro because it's quite a rare issue.

I posted details at oerdnj/deb.sury.org#1721, but there probably isn't much there that will help other than a sparse backtrace from the coredump. Do you have any advice for debugging the issue further or developing a reliable repro?

Here's the backtrace copied from the other issue, for convenience--not too helpful, I'm afraid:

#0  vld1q_u8 (__a=0xffff9e000000 <error: Cannot access memory at address 0xffff9e000000>) at /usr/lib/gcc/aarch64-linux-gnu/10/include/arm_neon.h:17550
#1  ffcps_1_utf (str_end=0xffff9dfffffa "", str_ptr=0xffff9e000000 <error: Cannot access memory at address 0xffff9e000000>, offs1=8, offs2=<optimized out>,
    chars=<optimized out>) at src/pcre2_jit_neon_inc.h:190
#2  0x0000ffffa6662118 in ?? ()
#3  0x0000ffff9dfffff8 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

There isn't much else that the segfaults have in common. They're happening during arbitrary PHP requests handled by XenForo, relatively popular software. There doesn't appear to be anything that the requests have in common. Based on the timing I'm seeing, it's likely happening toward the end of each request; XenForo's PCRE usage happens during the beginning of each request, which makes developing a repro more difficult.

Installed package version is 10.39-2+0~20211122.14+debian11~1.gbp0d570b

@Zenexer Zenexer changed the title Debugging segfault on aarch64 Debugging segfault on aarch64 (ARM) Feb 10, 2022
@zherczeg
Copy link
Collaborator

Based on this trace the issue is in the simd accelerated character search. The string ends at 0xffff9dfffffa, but the code tries to load the next aligned word from 0xffff9e000000, which is probably an invalid address, and an error occurs. Maybe stopping at ffcps_1_utf when str_ptr>str_end condition happens could help.

@Seldaek
Copy link

Seldaek commented May 20, 2022

I got another one on PHP 8.1 / aarch64, not sure if related but also fairly cryptic (to me anyway..)

Core was generated by `php -dmemory_limit=512M bin/console cache:warmup --env=prod --no-debug'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000ffffad30a518 in ?? () from /lib/aarch64-linux-gnu/libpcre2-8.so.0

It does seem to be related to PCRE JIT as disabling that in PHP fixes it. I wasn't able to get more info than that despite installing libpcre2-dev, not sure what else I am missing.

@zherczeg
Copy link
Collaborator

zherczeg commented May 21, 2022

Hard to tell. The original report blamed a C function, not jit code. To do something with it, I would need a pattern/input pair. Does it happen with the latest master?

@Seldaek
Copy link

Seldaek commented May 21, 2022

Ah sorry I was misled by the pcre2_jit_neon_inc.h containing jit. I will see if I can get master compiled and run that. Getting the pattern/input is probably very tricky as it's a huge piece of code which ends up crashing, but I'll see what I can do.

@Seldaek
Copy link

Seldaek commented May 21, 2022

Running master I now get this:

#0  0x0000ffff9fd55d98 in ffcps_default () from /lib/aarch64-linux-gnu/libpcre2-8.so.0
#1  0x0000ffff98623404 in ?? ()

I had no luck figuring out what code exactly causes it sorry :/

@zherczeg
Copy link
Collaborator

pcre2_jit_neon_inc.h generates these ffcps_something functions with macros. Then it might be related to the original issue.

@Seldaek
Copy link

Seldaek commented May 21, 2022

Ok, well I don't have an easy way for you to repro unfortunately but at least I can reliably reproduce it so I can offer to try out a patch in case you have any "stabbing in the dark" fixes in mind :)

@Zenexer
Copy link
Author

Zenexer commented May 21, 2022

You have a reliable repro? I gave up because it was happening for maybe one in a thousand requests, which made debugging quite tedious. I may be able to patch it--or at least narrow it down--if I have a better repro.

@zherczeg
Copy link
Collaborator

The code is an external contribution so I don't know much about it, but it seems a buffer overrun happens, probably a while loop with a ptr<=end check instead of a ptr<end.

@svenauhagen
Copy link

Hi,

I am running into the same problem on our server with php8.1 and libpcre2.
I compiled libpcre2 without optimizations to get a better output.
It happens every now and then so it is not really consistent.

Here is the latest crash report, I will try to find the problem but help is appreciated :)

(gdb) bt full
#0 0x0000ffffbac124d0 in vld1q_u8 (__a=0xffff98600000 <error: Cannot access memory at address 0xffff98600000>) at /usr/lib/gcc/aarch64-linux-gnu/10/include/arm_neon.h:17551
No locals.
oerdnj/deb.sury.org#1 ffcps_1 (str_end=0xffff985fffff "", str_ptr=0xffff98600000 <error: Cannot access memory at address 0xffff98600000>, offs1=11, offs2=10, chars=1164258605)
at src/pcre2_jit_neon_inc.h:190
qw = {mem = "\000\000\000\377", '\000' <repeats 11 times>, dw = {4278190080, 0}}
ic = {x = 1164258605, c = {c1 = 45 '-', c2 = 45 '-', c3 = 101 'e', c4 = 69 'E'}}
compare1_type = compare_match1
compare2_type = compare_match1i
cmp1a = {45 <repeats 16 times>}
cmp1b = {0 <repeats 16 times>}
cmp2a = {101 <repeats 16 times>}
cmp2b = {32 <repeats 16 times>}
diff = 1
char1a = 45 '-'
char2a = 101 'e'
char1b = 45 '-'
char2b = 69 'E'
p1 = 0xffff98600009 <error: Cannot access memory at address 0xffff98600009>
align_offset = 10
data = {0, 0, 0, 255, 0, 0, 0, 0, 0, 255, 0, 0, 0, 0, 0, 0}
prev_data = {112, 108, 117, 103, 105, 110, 115, 47, 119, 111, 111, 99, 111, 109, 109, 101}
data2 = {255, 0, 0, 255, 0, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
eq = {0, 0, 0, 255, 0 <repeats 12 times>}
oerdnj/deb.sury.org#2 0x0000ffff9d512090 in ?? ()
No symbol table info available.
oerdnj/deb.sury.org#3 0x00ff0000ff000000 in ?? ()
No symbol table info available.

@zherczeg
Copy link
Collaborator

it looks a similar problem, str_end=0xffff985fffff and Cannot access memory at address 0xffff98600000.

@svenauhagen
Copy link

yes, looks like a one off in PHP at some point during processing

@Zenexer
Copy link
Author

Zenexer commented Nov 21, 2022

In my case, the lack of consistency remains the primary impediment to narrowing down the cause. I still can't repro it outside of production. arm_neon.h is quite dense, and I'm unfamiliar with NEON instructions, so I'm not able to easily spot any logic errors by reviewing the code. And, of course, by the time it segfaults, a lot of context leading up to the crash has been lost.

I'm willing to help in any way I can, but I don't really have the necessary expertise to handle this efficiently.

@svenauhagen
Copy link

This piece of code crashes my server in the end:

const PATH_PATTERN_SEARCH_JSON = '#(DOMAIN_PLACEHOLDER)([a-z]+)([_A-Z])(-[-_a-z0-9]+).json$#i';
$domain_replace = 'default';
$search_pattern = str_replace( 'DOMAIN_PLACEHOLDER', $domain_replace, PATH_PATTERN_SEARCH_JSON );
preg_replace("#(woocommerce-)([a-z]+)([_A-Z])(-[-_a-z0-9]+).json$#i", $search_pattern, "/var/www/wordpress/wp-content/languages/plugins/woocommerce-en_US-d5ba42
31878fa2e7d350f73c3da91138.json");

But it takes a lot of requests to do so.
I am also not very familiar with the NEON instructions and it does not look like the neon code is the problem to begin with. ffcps_1 gets wrong arguments if from php I think.

@zherczeg
Copy link
Collaborator

Since it is a memory address, you can run the application with gdb, wait for the crash, and inspect the data around the address. The issue might be that a <= is used somewhere instead of < for checking the subject end. I suspect the code is always wrong, but when the address is accessible, the issue is hidden.

@MatthewVernon
Copy link
Contributor

MatthewVernon commented Nov 22, 2022 via email

@Zenexer
Copy link
Author

Zenexer commented Nov 22, 2022

This piece of code crashes my server in the end

That code alone, or is there also other code? I can't repro it with just that code.

Since it is a memory address, you can run the application with gdb, wait for the crash, and inspect the data around the address. The issue might be that a <= is used somewhere instead of < for checking the subject end. I suspect the code is always wrong, but when the address is accessible, the issue is hidden.

I haven't been able to repro it while running with gdb. I've been stuck relying on coredumps. It probably has nothing to do with gdb, but rather the fact that it's hard to repro.

Unfortunately, by the time the coredump happens, a lot of the context is lost, and I'm not able to narrow it down.

@svenauhagen
Copy link

Hi,

in the end I always end up at the same code block at least when analyzing the core dump:

(gdb) source /opt/.gdbinit
(gdb) zbacktrace
[0xffffb8615180] preg_replace("#(woocommerce-)([a-z]+)([_A-Z]*)(-[-_a-z0-9]+).json$#i", "${1}%s${4}.json", "/var/www/wordpress/wp-content/languages/plugins/woocommerce-en_US-d5ba42
31878fa2e7d350f73c3da91138.json") [internal function]
[0xffffb86150a0] WPML_ST_Translations_File_Registration->get_file_path_pattern("/var/www/wordpress/wp-content/languages/plugins/woocommerce-en_US-d5ba4231878fa2e7d350f73c3da91138.js

The last argument is usually different.

Best
Sven

@Zenexer
Copy link
Author

Zenexer commented Nov 22, 2022

@zherczeg, str_ptr is passed through SLJIT_R1. The segfault is invariably occurring in emitted instructions that are invoked via sljit_emit_icall. Am I correct in saying that gdb is likely going to show str_ptr as the register's value at the time of the segfault rather than when the icall instruction runs?

If so, we don't actually know that the str_ptr arg is bad when ffcps_* is called. In @svenauhagen's backtrace, the original value could've been as low as 0xffff985ffff6, which is accessible and within the range of the string:

// offs1 = 11
// offs2 = 10

const sljit_u32 diff = offs1 - offs2;               // diff = 1
// ...
str_ptr += offs1;                                   // str_ptr += 11
// ...
sljit_u8 *p1 = str_ptr - diff;                      // p1 = str_ptr - 1
// ...
str_ptr = (sljit_u8 *) ((uint64_t)str_ptr & ~0xf);  // Substract between 0 and 15 from str_ptr
vect_t data = VLD1Q(str_ptr);                       // Segfault here

Assigning any number between 0xffff985ffff6 and 0xffff98600000 to str_ptr and running that logic results in str_ptr being 0xffff98600000 at the time of the segfault.

I don't think this is an issue with PHP passing bad arguments. STR_PTR (SLJIT_R1) seems to be the culprit, and access to that is abstracted away from PHP. The bug is likely within emitted instructions.

My stacktrace shows that this isn't an off-by-one: the difference in my case was 6. I'm betting on this being the result of erroneous alignment logic.

@Zenexer
Copy link
Author

Zenexer commented Nov 22, 2022

I got make check to fail by adding SLJIT_ASSERT(str_ptr <= str_end); before line 190, so PHP can be ruled out. Compiled with --enable-jit --enable-debug.

Edit 1: If I move that assertion to the top of the function, tests still fail.

Tests pass if I remove the assertion.

Edit 2: I think moving it to the top of the function doesn't actually help much; there's only one failure there, and it's inconsistent. I suspect it's just coincidentally crashing sometimes. Moving it back to line 190 causes consistent, repeatable failures. There's definitely something up with that logic.

Edit 3: There might be multiple issues. An assertion at the top of the function causes grep test 84 to fail consistently. An assertion on line 190 causes a large number of grep tests to fail.

Edit 4: Examining the situation in gdb, it looks as though it's normal for str_end to point within a string rather than at the actual end. Maybe it assumes it's safe to read up to 15 bytes past the end of string? Grep test 84 still failed with the assertion at the start of the function, but I don't understand how it would make sense to pass those args:

# LD_LIBRARY_PATH="/pcre2/.libs:$LD_LIBRARY_PATH" gdb --args .libs/pcre2grep --file-list ./testdata/grepfilelist --file-list testtemp1grep "fox|complete|t7"

(gdb) run
Starting program: /pcre2/.libs/pcre2grep --file-list ./testdata/grepfilelist --file-list testtemp1grep fox\|complete\|t7
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
testdata/grepinputv:fox jumps
Assertion failed at src/pcre2_jit_neon_inc.h:91

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffffb0343aa0 in __GI_abort () at abort.c:79
#2  0x0000ffffb050f7c8 in ffcs_2 (
    str_end=0xaaaaf5cddb17 "\nHere is the pattern again.\n\nPattern\nThat time it was on a line by itself.\n\nTo pat or not to pat, that is the question.\n\ncomplete pair\nof lines\n\nThat was a complete pair\nof lines all by themselves.\n\nc"...,
    str_ptr=0xaaaaf5cddb18 "Here is the pattern again.\n\nPattern\nThat time it was on a line by itself.\n\nTo pat or not to pat, that is the question.\n\ncomplete pair\nof lines\n\nThat was a complete pair\nof lines all by themselves.\n\nco"..., offs1=1, offs2=0, chars=14191) at src/pcre2_jit_neon_inc.h:91
#3  0x0000ffffb0322b20 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I don't think I have the necessary familiarity with pcre2 or sljit to solve this.

@svenauhagen
Copy link

@Zenexer thank you for the great testing and information
Do you know if str_end is supplied by the PHP function or is it computed by the pcre2 library?

@zherczeg
Copy link
Collaborator

Am I correct in saying that gdb is likely going to show str_ptr as the register's value at the time of the segfault rather than when the icall instruction runs?

Yes, it shows the current value of the variable. On ARM64 arguments are passed in registers, and the previous value of a register cannot be inspected in general.

Maybe it assumes it's safe to read up to 15 bytes past the end of string?

It should not, at least the x86 simd code does not need this. What it (at least x86 simd) expects though, that 16 bytes can be read from (void*)((uintptr_t)p & ~0xf), as long as p < 'str_end'. It is not true for <= of course.

str_ptr += offs1;

Am I see it correctly, that there is no check that str_ptr < str_end? This looks like a serious bug. In general the code should not run if str_ptr + offs1 >= str_end since no match can be achieved.

line 190

Is this something in php?

The ffcps_1 function is generated here (looks like ffcs_2 has this issue as well):
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_neon_inc.h#L85

Do you know if str_end is supplied by the PHP function or is it computed by the pcre2 library?

str_end should be subject_ptr + subject_size. Even with constrained search, the end of the string should be fixed.

@zherczeg
Copy link
Collaborator

I get it now that line 190 is in pcre2_jit_neon_inc.h.

Maybe adding a if (str_ptr >= str_end) return NULL right after line 174: str_ptr += IN_UCHARS(offs1); could fix this.

@Zenexer
Copy link
Author

Zenexer commented Nov 23, 2022

I don’t understand what FF_FUN is doing well enough to say whether that’s a valid fix. I’m not keen on adding fixes until I understand the problem, as I might just move the problem around. Are you sure that fix results in correct behavior? Are there tests I can add to make this issue easier to detect in the future?

Do you know if str_end is supplied by the PHP function or is it computed by the pcre2 library?

It’s not. I’m pretty confident this bug isn’t on PHP’s side.

What it (at least x86 simd) expects though, that 16 bytes can be read from (void*)((uintptr_t)p & ~0xf), as long as p < 'str_end'. It is not true for <= of course.

You can read up to 15 bytes past the end of the string with that logic—a full word past the string’s final word. I’m not too familiar with x86 internals, but that seems risky. I was under the impression that was unsafe. If the string is 1 byte, for example, you’ll read the final word of the string (8 bytes) plus an additional word (another 8 bytes), 16 bytes in total. I’m not sure whether that’s cause for concern.

Am I see it correctly, that there is no check that str_ptr < str_end? This looks like a serious bug. In general the code should not run if str_ptr + offs1 >= str_end since no match can be achieved.

Possibly, but an equivalent check may be performed elsewhere. What’s FF_FUN trying to do, and what do the offsets signify? str_end often points within a string, not at the actual end; is that intentional?

@Zenexer
Copy link
Author

Zenexer commented Nov 23, 2022

I was debugging this on the side yesterday while dealing with a more urgent issue. That debugging eventually led to this comment. I took notes as I went so I wouldn't lose my place. I'm going to copy them here since they might be useful to anyone else looking into this issue, but keep in mind I was just narrating my thought process, so they jump around a bit and are incomplete.


ffcps_1 gets wrong arguments if from php I think.

From what I can tell, ffcps_1 isn't called directly from PHP--in fact, it isn't called directly by anything. It's only referenced in emitted sljit bytecode:

pcre2/src/pcre2_jit_simd_inc.h

Lines 1100 to 1101 in 14c1327

sljit_emit_icall(compiler, SLJIT_CALL, SLJIT_ARGS4(W, W, W, W, W),
SLJIT_IMM, SLJIT_FUNC_ADDR(ffcps_1));

I'm not familiar with sljit--@zherczeg is in a better position to comment--but going from the backtrace:

  1. ffcps probably stands for "fast forward character pair"--presumably we're searching for something within a string
  2. SLJIT_R0 = 0xffff985fffff
  3. SLJIT_R1 = 0xffff98600000
  4. That doesn't make sense; R0 is supposed to be the end of the string and R1 is supposed to point within the string.

Where are R0 and R1 coming from?

First, we need some important context from pcre2_jit_compile.c:

#define STR_PTR       SLJIT_R1
#define STR_END       SLJIT_S0

/* Local space layout. */
/* These two locals can be used by the current opcode. */
#define LOCALS0          (0 * sizeof(sljit_sw))
#define LOCALS1          (1 * sizeof(sljit_sw))

#define OP1(op, dst, dstw, src, srcw) \
	sljit_emit_op1(compiler, (op), (dst), (dstw), (src), (srcw))

#define CMOV(type, dst_reg, src, srcw) \
  sljit_emit_cmov(compiler, (type), (dst_reg), (src), (srcw))

typedef struct compiler_common {
  /* ... */
  /* Pointer of the match end position. */
  sljit_s32 match_end_ptr;
  /* ... */
}

#if PCRE2_CODE_UNIT_WIDTH == 8
#  define IN_UCHARS(x) (x)
/* ... */
#endif

Now, let's take a look at fast_forward_char_pair_simd, where FF_FUN is called. Here's the definition:

static void fast_forward_char_pair_simd(
	compiler_common *common,
	sljit_s32 offs1,
	PCRE2_UCHAR char1a,
	PCRE2_UCHAR char1b,
	sljit_s32 offs2,
	PCRE2_UCHAR char2a,
	PCRE2_UCHAR char2b
)

Here's where R0 is assigned:

/* Save temporary register STR_PTR. */
OP1(SLJIT_MOV, SLJIT_MEM1(SLJIT_SP), LOCALS0, STR_PTR, 0);

/* Prepare arguments for the function call. */
if (common->match_end_ptr == 0)
   OP1(SLJIT_MOV, SLJIT_R0, 0, STR_END, 0);
else
  {
  OP1(SLJIT_MOV, SLJIT_R0, 0, SLJIT_MEM1(SLJIT_SP), common->match_end_ptr);
  OP2(SLJIT_ADD, SLJIT_R0, 0, SLJIT_R0, 0, SLJIT_IMM, IN_UCHARS(offs1 + 1));

  OP2U(SLJIT_SUB | SLJIT_SET_LESS, STR_END, 0, SLJIT_R0, 0);
  CMOV(SLJIT_LESS, SLJIT_R0, STR_END, 0);
  }

OP1(SLJIT_MOV, SLJIT_R1, 0, STR_PTR, 0);

I think the resulting logic is roughly:

LOCALS0 = R1;

if (common->match_end_ptr == 0) {  /* This condition isn't emitted; it changes what's emitted. */
	R0 = S0;
} else {
	R0 = &LOCALS0[common->match_end_ptr + offs1 + 1];

	/*
	 * I might have the operands swapped on the next line, but I think SLJIT is emitting:
	 *
	 *   subs  S0, R0
	 *   movlt R0, S0
	 *
	 * Reviewing SLJIT's code, I see SLJIT_LESS is mapped to 0x2, so it might actually be movcs instead of movlt--I can't tell.
	 */
	tmp = S0 < R0;
	S0 -= R0;
	if (tmp) {
		R0 = S0;
	}
}

R1 = R1;  /* ??? */

I'm intermingling emitted logic with non-emitted logic. The comparison against match_end_ptr isn't emitted; it changes what's emitted.

Somehow, R1 > R0, which doesn't make sense. That offs1 + 1 looks suspicious, but in my first backtrace, R1 = R0 + 6, so that doesn't add up--at this point in the code, there's a potential to be off by more than one. That also means this isn't the difference between < and <=--at least, not this close to the emitted code. There could still be an off-by-one elsewhere, but probably not here.

R1 and R0 are most affected by the values of R1 and S0 (not R0). S0 is also affected by match_end_ptr and offs1, but those are ultimately emitted as immediates, so they're not going to change every time the emitted bytecode runs. Since the crash is rare, let's assume that match_end_ptr and offs1 are unlikely to be the culpits; we'll avoid focusing on them for now.

Futhermore, we can be reasonably confident that STR_PTR is wrong since it can't even be accessed in the backtrace. That means we should focus our attention on R1 since STR_PTR is just an alias for SLJIT_R1. STR_END shows up as an empty string in the backtrace, which make sense for the pointer to the end of a string.

@zherczeg, would you be able to translate the SLJIT instructions a bit better? I'm not sure my pseudocode is accurate.

Carrying on, there are two places fast_forward_char_pair_simd gets called. One is in fast_forward_newline. For now, let's assume that's not relevant, as there aren't any newlines in play here, and we can see that our search chars are --eE, none of which are line breaks.

The second place is in check_fast_forward_char_pair_simd. check_fast_forward_char_pair_simd determines the parameters that are passed to fast_forward_char_pair_simd (including offs1), but it doesn't do anything that would affect S0 and R1, the two values we're most interested in. It doesn't emit anything itself, and the only function it calls that emits anything is fast_forward_char_pair_simd.

check_fast_forward_char_pair_simd only gets called in one place: fast_forward_first_n_chars. This is more interesting since it calls scan_prefix before calling check_fast_forward_char_pair_simd. There's nothing obviously wrong in scan_prefix, though; let's search the entirety of pcre2_jit_compile.c for anything that might compare STR_PTR and STR_END:

% grep -F STR_PTR pcre2_jit_compile.c | grep -F STR_END | grep -Fe GREATER -e LESS -e EQUAL
  add_jump(compiler, end_reached, CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0));
jump = CMP(SLJIT_LESS, STR_PTR, 0, STR_END, 0);
  add_jump(compiler, backtracks, CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0));
jump = CMP(SLJIT_LESS, STR_PTR, 0, STR_END, 0);
CMPTO(SLJIT_LESS, STR_PTR, 0, STR_END, 0, label);
      add_jump(compiler, backtracks, CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0));
  add_jump(compiler, backtracks, CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0));
buffer_end_close = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);
exit_invalid[7] = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);
exit_invalid[9] = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
  jump[0] = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
jump[0] = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
jump[4] = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
CMPTO(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0, three_byte_exit);
exit_invalid[1] = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
exit_invalid[0] = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
    end = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
    end = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
    CMPTO(SLJIT_LESS, STR_PTR, 0, STR_END, 0, mainloop);
  end = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
  jump = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
  jump = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
partial_quit = CMP(SLJIT_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
...

It looks like the consensus is that we should stop when STR_PTR >= STR_END. Are there any comparisons that deviate from that? Let's focus on the most likely culprits: SLJIT_GREATER conditions:

% grep -F STR_PTR pcre2_jit_compile.c | grep -F STR_END | grep -Fe GREATER -e LESS -e EQUAL | grep -F '_GREATER,
 STR_PTR'
buffer_end_close = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);
exit_invalid[7] = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);
add_jump(compiler, &common->failed_match, CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0));
      OP2U(SLJIT_SUB | SLJIT_SET_GREATER, STR_PTR, 0, STR_END, 0);
      CMOV(SLJIT_GREATER, STR_PTR, STR_END, 0);
      add_jump(compiler, backtracks, CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0));
  add_jump(compiler, backtracks, CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0));
  partial = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);
        OP2U(SLJIT_SUB | SLJIT_SET_GREATER, STR_PTR, 0, STR_END, 0);
        CMOV(SLJIT_GREATER, STR_PTR, STR_END, 0);
      OP2U(SLJIT_SUB | SLJIT_SET_GREATER, STR_PTR, 0, STR_END, 0);
      CMOV(SLJIT_GREATER, STR_PTR, STR_END, 0);

Most of these look benign, so let's rank these references by the chances that they're the underlying cause. Note that some statements appear multiple times.

/* Potential culprits */
add_jump(compiler, &common->failed_match, CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0));
OP2U(SLJIT_SUB | SLJIT_SET_GREATER, STR_PTR, 0, STR_END, 0);
add_jump(compiler, backtracks, CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0));
partial = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);

/* Almost certainly fine */
buffer_end_close = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);
exit_invalid[7] = CMP(SLJIT_GREATER, STR_PTR, 0, STR_END, 0);

/* Definitely fine */
CMOV(SLJIT_GREATER, STR_PTR, STR_END, 0);

Skimming through each occurrence of the potential culprits, nothing looks too bad.

Let's instead shift focus to FF_FUN. If I understand sljit's docs correctly, the parameters being passed to FF_FUN are stored in registers and might never end up on the stack; that could mean the values we're seeing in the backtrace aren't the original values passed to FF_FUN, but rather the final values when FF_FUN crashes. FF_FUN modifies STR_PTR--that seems to be its purpose--so it's possible it's ending up beyond the edge of the string. That would explain why we never see this on x86; FF_FUN has a separate implementation for each architecture.

In the backtraces, we seem to be regularly dying on line 190:

vect_t data = VLD1Q(str_ptr);

Also, we only know for certain that this happens with FFCPS. Let's take a look at everything up to that point, with some conditional macros resolved. We're assuming:

#define FFCPS
#define PCRE2_CODE_UNIT_WIDTH 8
#define IN_UCHARS(x) (x)
#undef FFCPS_CHAR1A2A

We have backtraces for ffcps_1, ffcps_1_utf, and ffcps_default.

Since we have backtraces showing both *_utf functions and non-*_utf functions, we can assume FF_UTF doesn't matter, but we'll keep those conditions in.

FFCPS_CHAR1A2A is defined for ffcps_0* but not for ffcps_1* or ffcps_default*. We'll keep the conditions for FFCPS_CHAR1A2A in and focus on them, as we only have evidence that the crash occurs when FFCPS_CHAR1A2A is undefined.

FFCPS_DIFF1 is defined for ffcps_0 and ffcps_1 but not ffcps_default. We can assume it doesn't have any impact on the issue.

static sljit_u8* SLJIT_FUNC FF_FUN(
	sljit_u8 *str_end,  // 0xffff985fffff
	sljit_u8 *str_ptr,  // 0xffff98600000
	sljit_uw offs1,     // 11
	sljit_uw offs2,     // 10
	sljit_uw chars      // 1164258605 = {'-', '-', 'e', 'E'}, assuming little endian
)
{
quad_word qw;
int_char ic;  // Union type such that ic.x maps to the same memory as {ic.c.c1, ic.c.c2, ic.c.c3, ic.c.c4}

// Why are these marked as unused?  They are used.
SLJIT_UNUSED_ARG(offs1);
SLJIT_UNUSED_ARG(offs2);

ic.x = chars;  // --eE

compare_type compare1_type = compare_match1;
compare_type compare2_type = compare_match1;
vect_t cmp1a, cmp1b, cmp2a, cmp2b;
const sljit_u32 diff = offs1 - offs2;        // diff = 1
PCRE2_UCHAR char1a = ic.c.c1;                // char1a = '-'
PCRE2_UCHAR char2a = ic.c.c3;                // char2a = 'e'

# ifdef FFCPS_CHAR1A2A
cmp1a = VDUPQ(char1a);                       // fill cmp1a with '-'     /!\ Doesn't run in error case
cmp2a = VDUPQ(char2a);                       // fill cmp2a with 'e'     /!\ Doesn't run in error case
cmp1b = VDUPQ(0);                            // fill cmp1b with '\0'    /!\ Doesn't run in error case
cmp2b = VDUPQ(0);                            // fill cmp2b with '\0'    /!\ Doesn't run in error case
# else
PCRE2_UCHAR char1b = ic.c.c2;                // char1b = '-'
PCRE2_UCHAR char2b = ic.c.c4;                // char2b = 'E'
if (char1a == char1b)                        // true in error case, false in ffcps0
  {
  cmp1a = VDUPQ(char1a);                     // fill cmp1a with '-'
  cmp1b = VDUPQ(0);                          // fill cmp1b with '\0'
  }
else
  {
  sljit_u32 bit1 = char1a ^ char1b;
  if (is_powerof2(bit1))
    {
    compare1_type = compare_match1i;
    cmp1a = VDUPQ(char1a | bit1);
    cmp1b = VDUPQ(bit1);
    }
  else
    {
    compare1_type = compare_match2;
    cmp1a = VDUPQ(char1a);
    cmp1b = VDUPQ(char1b);
    }
  }

if (char2a == char2b)
  {
  cmp2a = VDUPQ(char2a);
  cmp2b = VDUPQ(0);
  }
else
  {
  sljit_u32 bit2 = char2a ^ char2b;
  if (is_powerof2(bit2))
    {
    compare2_type = compare_match1i;
    cmp2a = VDUPQ(char2a | bit2);
    cmp2b = VDUPQ(bit2);
    }
  else
    {
    compare2_type = compare_match2;
    cmp2a = VDUPQ(char2a);
    cmp2b = VDUPQ(char2b);
    }
  }
# endif

str_ptr += offs1;  // str_ptr += 11

#if defined(FF_UTF)
restart:;
#endif

sljit_u8 *p1 = str_ptr - diff;                      // p1 = str_ptr - 1
sljit_s32 align_offset = ((uint64_t)str_ptr & 0xf);
str_ptr = (sljit_u8 *) ((uint64_t)str_ptr & ~0xf);  // Substract between 0 and 15 from str_ptr
vect_t data = VLD1Q(str_ptr);                       // /!\ Segfault here

// The rest probably doesn't matter.
}

If we assume that the backtrace contains the final value for str_ptr rather than the value actually passed to FF_FUN, in @svenauhagen's stacktrace, the original value for str_ptr could've been anything between 0xffff985ffff6 and 0xffff98600000. 0xffff985ffff6 is perfectly valid; it doesn't point past the end of the string.

// offs1 = 11
// offs2 = 10

const sljit_u32 diff = offs1 - offs2;               // diff = 1
// ...
str_ptr += offs1;                                   // str_ptr += 11
// ...
sljit_u8 *p1 = str_ptr - diff;                      // p1 = str_ptr - 1
// ...
str_ptr = (sljit_u8 *) ((uint64_t)str_ptr & ~0xf);  // Substract between 0 and 15 from str_ptr
vect_t data = VLD1Q(str_ptr);                       // Segfault here

Assigning any number between 0xffff985ffff6 and 0xffff98600000 to str_ptr and running that logic results in str_ptr being 0xffff98600000 at the time of the segfault.

@zherczeg
Copy link
Collaborator

zherczeg commented Nov 23, 2022

You really put a lot of effort in this, thank you very much. This is a great help.

You can read up to 15 bytes past the end of the string with that logic—a full word past the string’s final word.
I’m not too familiar with x86 internals, but that seems risky. I was under the impression that was unsafe.

I know this is tricky. Virtual memory is based on pages, and all pages are at least 1K aligned (usually 4K). If an address is valid, and that address modulo 16 is 1, you can safely read the next 15 bytes as well. If modulo 16 is 15, you cannot read any more bytes. I think that is the problem here, that (str_end % 16) == 15.

The FF_FUN is a SIMD accelerator, if reads 16 byte blocks, and checks specific characters in them. The 16 byte blocks are 16 byte aligned, so it should be able to do it, as long as any byte of a 16 byte block also part of the subject string. The FF_FUN is called from JIT only, since its purpose is finding specific characters. It is a generated SIMD helper.

I am still unsure if str_ptr == 0xffff98600000 happens on the crash, or when the function is called. You mentioned some asserts somewhere. I would need to check how x86 simd works.

@Zenexer
Copy link
Author

Zenexer commented Nov 23, 2022

str_ptr can point to any point in the string, though--it's not constrained to mod 16 until after 10 has been added to it. If you have a 1-byte string at the end of a page, and str_ptr points to the first byte in that string, FF_FUN is going to read into the next page.

@zherczeg
Copy link
Collaborator

If you check this here x86 simd checks that STR_PTR >= STR_END after offs1 added to STR_PTR
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L545

@Zenexer
Copy link
Author

Zenexer commented Nov 23, 2022

Yeah, it's pretty clear the logic doesn't match between x86 and ARM.

What does offs1 represent here? I assume fast_forward_char_pair_simd is trying to search for a character pair, but it's pretty dense. It'd be helpful to know what it does so I'm not stuck reverse engineering hundreds of lines of vector arithmetic.

@zherczeg
Copy link
Collaborator

Yes it search a character pair, where the characters are different. Consider the following pattern: /...a.....b...../ The dots match to any characters. In this case you search the a,b pair, where a offset is 4 and b offset is 10. When it founds a valid pair, it returns with its starting position. The code is based on my observation that different characters, which are far from each other rarely match to an input. If hello is searched, it is pretty efficient to find h...o pair first, before doing any match.

@Zenexer
Copy link
Author

Zenexer commented Nov 23, 2022

I see. It looks like four bytes are passed, though. Is this intended to handle case insensitivity? I assume the offsets indicate the relative position of each character (in your example pattern h…o, 0 and 4).

@zherczeg
Copy link
Collaborator

This is a code generator template, it can generate both case sensitive and insensitive code depending on what found during the pattern analysis. Yes offsets are relative positions, and I think the first offset is greater than the second, so characters passed in reversed order.

@Zenexer
Copy link
Author

Zenexer commented Nov 23, 2022

This is a code generator template, it can generate both case sensitive and insensitive code depending on what found during the pattern analysis.

Is that the reason four bytes are passed instead of two? It looks like we're dealing with four characters in most error cases, rather than a single pair.

@zherczeg
Copy link
Collaborator

I don't understand the four byte passed part. The generator receives two characters and their corresponding offsets. The generated code receives a byte stream, which length can be any (>=0, integer).

@Zenexer
Copy link
Author

Zenexer commented Nov 24, 2022

The final parameter to FF_FUN, chars, is four bytes wide—it’s an int_char. ffcps_1 and ffcps_default use all four bytes: they get loaded into char1a, char1b, char2a, and char2b. In one of the stack traces, the characters passed are: E, e, -, and -. When a and b are equal, as is the case with char2a and char2b, b is treated as a null byte, so we effectively get E, e, -, and \0 here.

ffcps_0 receives all four bytes, but it only use the first and third bytes (E and -). The other two are treated as null bytes. This is the same as passing E, E, -, - to ffcps_1, or at least that’s probably the intention.

I’m assuming this allows for case-insensitive matching for single-byte characters, but that’s just a guess.

Curiously, this isn’t actually emitted as bytecode, so all this branching has to run on every call even though it’s based on values that will never change (namely, offs1, offs2, and chars). I haven’t done any testing, but I would guess the performance gains for this logic on ARM aren’t as great as the performance gains on x86. I could be completely wrong, but this looks like a buggy shortcut that was never really finished—although I probably would’ve done the same at the time this was written. I know ARM and x86 optimize branching differently; it’s possible emitting bytecode without branching wouldn’t have offered any gains and would’ve instead just increased memory footprint.

@zherczeg
Copy link
Collaborator

I haven't noticed that 4 byte arg so far. Yes, it looks like a compressed byte array, I agree. The x86 template receives this as 4 separate characters, since a character can be 4 byte long (utf32), and may take 16 bytes to pass all four of them. The neon code is probably a simplification (only 4 arguments can be passed by the jit compiler to a function unless a buffer is used).

@Zenexer
Copy link
Author

Zenexer commented Nov 25, 2022

It's effectively passing an int_char. It seems to assume sizeof(PCRE2_UCHAR) == 1:

pcre2/src/pcre2_jit_simd_inc.h

Lines 1077 to 1080 in f1e48eb

ic.c.c1 = char1a;
ic.c.c2 = char1b;
ic.c.c3 = char2a;
ic.c.c4 = char2b;
It makes the same mistake in a bunch of places. If I understand correctly, the ARM JIT compiler is going to break with UTF-16 and UTF-32--and quite possibly with null bytes in a string (as would be common with UTF-16 and UTF-32).

When you say the x86 template receives 4 separate chars, what's the purpose of each char? Am I correct in assuming you use 4 instead of 2 for case insensitive matching?

@zherczeg
Copy link
Collaborator

Yes. The char1a/char1b and char2a/char2b represent two valid characters for the same location.

ARM JIT is correct though, it will just match more cases that it should, and a fail happens later. This has a performance overhead.

This is sometimes happen with x86 as well. Consider the following pattern: /ab|cd/ Then char1a=a, char2a=c, char2a=b, char2b=d This will match to ad, and the match fails later. Usually, when char1a != char2a, they are the case insensitive variants of the same character, but not always.

carenas added a commit to carenas/pcre2 that referenced this issue Dec 6, 2022
FF_FUN would try loading a vector from an invalid address
triggering a crash.

Add the same check that is done in the x86/s390x implementations
and that was missing from the original code.

Fixes: PCRE2Project#86
@svenauhagen
Copy link

@carenas thanks for the commit, have you tested the fix you added on a PHP system?
I am having the error multiple times per day and I would like to test it :)

@carenas
Copy link
Contributor

carenas commented Dec 6, 2022

have you tested the fix you added on a PHP system?

the alignment requiriments make it difficult to reproduce with PHP, and there might be an additional trigger in your system setup (which compiler and flags were used to build pcre?), but the code is at fault as mentioned above, and I was able to reproduce setting an invalid str_ptr and crashing in the same way with a different synthetic test.

your testing and confirmation will be appreciated.

@svenauhagen
Copy link

I am compiling of the debian package source https://salsa.debian.org/debian/pcre2 as I am using a Debian system.
Ok let me try it out, it crashes between 5-10 times a day for me on our busy PHP webserver and I can give you some feedback in 24-48 hours.

@svenauhagen
Copy link

So far so good, I did not have a crash in 24 hours. I will keep on testing.

@Zenexer
Copy link
Author

Zenexer commented Dec 8, 2022

I'm not able to run this in production yet, but thanks for the fix, @carenas! Should it be >= instead of >?

@carenas
Copy link
Contributor

carenas commented Dec 8, 2022

Should it be >= instead of >?

Either one will work and doing '>' is more conservative (hence what I proposed for testing in production as it would only match for the reported crash scenarios).

The final patch is most likely to have >= though for consistency with the other implementations and will hopefully include more fixes (currently still in development in the neon branch)

@svenauhagen
Copy link

I haven't had a crash in the last 48 hours so the problem is definitely gone on my production server, thank you @carenas

carenas added a commit to carenas/pcre2 that referenced this issue Dec 9, 2022
FF_FUN would try loading a vector from an invalid address
triggering a crash.

Add the same check that is done in the x86/s390x implementations
and that was missing from the original code.

Fixes: PCRE2Project#86
PhilipHazel pushed a commit that referenced this issue Dec 12, 2022
FF_FUN would try loading a vector from an invalid address
triggering a crash.

Add the same check that is done in the x86/s390x implementations
and that was missing from the original code.

Fixes: #86
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants