(regex) Segmentation fault: 11 #922

pkoppstein · 2015-08-22T17:54:32Z

$ jq --version
jq-1.5rc2-57-g2c6c521

$ jq 'sub( "(?<x>.)"; "\(.x)!")'
"’"
Segmentation fault: 11

Some other examples:

$ jq -R 'sub( "(.)"; "")'
—
Segmentation fault: 11

$ jq 'sub( "(.)"; "")'
"—"
Segmentation fault: 11

$ uname -a
Darwin mini 13.4.0 Darwin Kernel Version 13.4.0: Wed Mar 18 16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64 x86_64


dtolnay · 2015-08-22T18:44:56Z

This is due to incorrectly decoding the width of 3-byte UTF-8 characters: https://github.com/stedolan/jq/blob/370833d55573a223b60ea51b4cea7b6c0326e030/jv_unicode.c#L62
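To make the diagnosis concrete, here is a minimal C sketch. It is not jq's source: the names `buggy_decode_length`, `strict_decode_length`, and `first_char_width` are made up for this example. The first function repeats the mask checks as they stood before the fix (they are visible in the diff quoted in this thread); the second compares each masked prefix against the exact UTF-8 lead-byte pattern and assumes the input has already been validated.

```c
#include <assert.h>
#include <stddef.h>

/* Pre-fix checks: any byte with its top two bits set passes the 2-byte
 * test first, so 3- and 4-byte lead bytes are reported as width 2. */
int buggy_decode_length(unsigned char startchar) {
  if ((startchar & 0x80) == 0) return 1;
  else if ((startchar & 0xC0) == 0xC0) return 2;
  else if ((startchar & 0xE0) == 0xE0) return 3;
  else return 4;
}

/* Hypothetical alternative: match the whole prefix, not just its top bits.
 * Assumes a valid lead byte, so anything else is taken to be 11110xxx. */
int strict_decode_length(unsigned char startchar) {
  if ((startchar & 0x80) == 0x00) return 1;      /* 0xxxxxxx */
  else if ((startchar & 0xE0) == 0xC0) return 2; /* 110xxxxx */
  else if ((startchar & 0xF0) == 0xE0) return 3; /* 1110xxxx */
  else return 4;                                 /* 11110xxx */
}

/* Width of the first character of a non-empty string, using the given
 * decoder (illustrative helper only). */
int first_char_width(const char *s, int (*decode)(unsigned char)) {
  assert(s != NULL && *s != '\0');
  return decode((unsigned char)s[0]);
}
```

The character from the original report, U+2019, is encoded as the bytes E2 80 99. Its lead byte 0xE2 satisfies `(0xE2 & 0xC0) == 0xC0`, so the pre-fix checks stop at the 2-byte branch and never reach the 3-byte one.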

nicowilliams · 2015-08-22T19:19:54Z

Ouch.

Proposed fix:

diff --git a/jv_unicode.c b/jv_unicode.c
index c3f9f11..767d4a5 100644
--- a/jv_unicode.c
+++ b/jv_unicode.c
@@ -61,8 +61,9 @@ int jvp_utf8_is_valid(const char* in, const char* end) {

 int jvp_utf8_decode_length(char startchar) {
    if ((startchar & 0x80) == 0) return 1;
-   else if ((startchar & 0xC0) == 0xC0) return 2;
+   else if ((startchar & 0xF0) == 0xF0) return 4;
    else if ((startchar & 0xE0) == 0xE0) return 3;
+   else if ((startchar & 0xC0) == 0xC0) return 2;
    else return 4;
 }

nicowilliams · 2015-08-22T19:20:54Z

Actually, this probably needs to deal with invalid sequences (not alias them to 4-bytes).

nicowilliams · 2015-08-22T19:21:57Z

Hmm, actually, we don't need to deal with invalid sequences, since these should be validated strings.
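For readers wondering what "dealing with invalid sequences" would look like if the strings were not already validated, here is a hedged C sketch. `checked_decode_length` and `count_chars` are hypothetical names, and returning 0 or -1 for bad input is an assumption of this example, not something the thread says jq does.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1..4 for a byte whose bit pattern can start a sequence, and 0
 * for continuation bytes (10xxxxxx) and 0xF8..0xFF. This checks bit
 * patterns only; it does not reject overlong lead bytes such as 0xC0. */
int checked_decode_length(unsigned char b) {
  if ((b & 0x80) == 0x00) return 1; /* 0xxxxxxx */
  if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
  if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
  if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx */
  return 0;
}

/* Counts characters in s[0..len). Returns -1 if a byte cannot start a
 * sequence or if a sequence would run past the end of the buffer. */
long count_chars(const char *s, size_t len) {
  assert(s != NULL);
  size_t i = 0;
  long n = 0;
  while (i < len) {
    int w = checked_decode_length((unsigned char)s[i]);
    if (w == 0 || (size_t)w > len - i) return -1;
    i += (size_t)w;
    n++;
  }
  return n;
}
```

With input that is guaranteed to be valid, the extra branch never fires, which matches the conclusion reached in the comment above.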

nicowilliams · 2015-08-22T19:22:11Z

Feel free to push. I gtg.

nicowilliams · 2015-08-22T19:43:42Z

Thanks for the report @pkoppstein.

lackneets · 2015-12-02T02:39:17Z

Input: "小型車計時：每小時20元；月租：每月3000元"

These work:
jq scan("每月(\\d+)")[0] => "3000"
jq scan("每小時\\d+")[0] => "每小時20"

while these do not:
jq scan("每小時(\\d+)")[0] => unknown jq execution error: signal: segmentation fault
jq scan("(\\d+)")[0] => unknown jq execution error: signal: segmentation fault

Am I hitting the same Unicode bug? I am using jqplay.org.

dtolnay · 2015-12-02T02:49:30Z

$ jq <test.json 'scan("每月(\\d+)")[0]'
"3000"
$ jq <test.json 'scan("每小時(\\d+)")[0]'
"20"
$ jq <test.json 'scan("(\\d+)")[0]'
"20"
"3000"

jqplay.org must not be using a version of jq that contains the fix.

nicowilliams · 2015-12-02T04:21:48Z

The fix for this didn't make jq-1.5. The milestone is 1.5.1 and the commit log shows it's not in jq-1.5. We should probably prep a 1.5.1 or a 1.6 release.

Justin-W · 2017-09-20T17:39:32Z

#922 is a really serious bug that makes much of jq's functionality (e.g. nearly all regex-related functionality) effectively unusable, and hence makes jq itself effectively unusable if such functionality is needed.

#922 also makes silent (non-failure) errors leading to data corruption very probable. Any transformation or query of non-ASCII data that uses sub, match, or capture is highly prone to silent data corruption. For example, gsub replacements can produce catastrophically wrong transformations of the input without any error or warning, and only for input containing codepoints of certain byte lengths AND in specific positions relative to the actual locations of the regex matches, which makes it that much harder for a user to notice the bug during testing. In other words, it is very easy to encounter such errors, but also very easy to miss them unless you test with the right combinations and sequences/positions of Unicode chars and regex patterns.

I suffered some serious data corruption (caused by jq #922) until the 'right' combination of data and jq filters led to some corruption that was catastrophic and non-silent. Until then, the silent corruption was very hard to detect. Even afterwards, it was difficult to diagnose and predict the precise cause and effects.

RE: Workarounds:

It also seems impossible to work around the bug using only a jq library/module, since it appears to be impossible to use a jq def to override/hide a builtin that is implemented in 'jq.c'. Hence, even if all jq def builtins affected by #922 were overridden, any jq filters that use the jq.c builtins directly would still be affected. The only way to reliably prevent #922 from causing data loss therefore seems to be to fix the bug in the C code.

STEPS TO REPRODUCE:

Note: In the following series, each step attempts to substitute a single ASCII char with "#" (also ASCII). However, the actual effects of the sub function differ dramatically (due to #922), depending on where in the string the first char is located/matched (i.e. where it is relative to non-ASCII, multi-byte chars). In particular, note the 2 cases where substantial portions of the string (38% & 62% in these examples) are simply dropped completely, representing massive data loss. Also note the data corruption in many cases, where the '#' is incorrectly substituted for multiple chars (in multiple locations) instead of just 1 char.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: =>
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[b]"; "#"; "ig")' |cc
Expect: => "a#c ² def © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: => "a#c ² def © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[e]"; "#"; "ig")' |cc
Expect: => "abc ² d#f © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: => "abc ² d#f © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[h]"; "#"; "ig")' |cc
Expect: => "abc ² def © g#i … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: => "abc ² def © g#i … jkl ® mno “ pqr ¶ stu ³ vxy"
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[k]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … j#l ® mno “ pqr ¶ stu ³ vxy"
Output: => "#j#l ® mno “ pqr ¶ stu ³ vxy"
WRONG! Very, very wrong!!! Substituted 2 chars, plus removed the first 17 chars (~38% data loss; plus corruption of remaining data)!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[n]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® m#o “ pqr ¶ stu ³ vxy"
Output: => "abc ² def © ghi … jkl ®#m#o “ pqr ¶ stu ³ vxy"
WRONG! Very wrong! Substituted 2 chars instead of 1!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[q]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ p#r ¶ stu ³ vxy"
Output: => "##p#r ¶ stu ³ vxy"
WRONG! Very, very wrong!!! Substituted 3 chars, plus removed the first 28 chars (~62% data loss; plus corruption of remaining data)!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[t]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ pqr ¶ s#u ³ vxy"
Output: => "abc ² def © ghi … jkl ® mno “ pqr#¶ s#u ³ vxy"
WRONG! Substituted 2 chars instead of 1!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[x]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ pqr ¶ stu ³ v#y"
Output: => "abc ² def © ghi … jkl ® mno “ pqr ¶ stu#³ v#y"
WRONG! Substituted 2 chars instead of 1!
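The pattern in these results (correct output until the match lies past a 3-byte character such as … or “) is consistent with byte offsets being converted to character offsets by walking the string one lead-byte width at a time. The C sketch below is an assumption about that mechanism, not code taken from jq; `char_index_at`, `width_before_fix`, and `width_by_prefix` are hypothetical names. It walks toward a target byte offset and reports failure if the walk steps over the target, which is what happens once a 3-byte character is treated as 2 bytes wide.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*width_fn)(unsigned char);

/* Width checks as they stood before the fix, per the diff in this thread. */
int width_before_fix(unsigned char c) {
  if ((c & 0x80) == 0) return 1;
  else if ((c & 0xC0) == 0xC0) return 2;
  else if ((c & 0xE0) == 0xE0) return 3;
  else return 4;
}

/* Width by exact lead-byte prefix; assumes validated UTF-8. */
int width_by_prefix(unsigned char c) {
  if ((c & 0x80) == 0x00) return 1;
  else if ((c & 0xE0) == 0xC0) return 2;
  else if ((c & 0xF0) == 0xE0) return 3;
  else return 4;
}

/* Returns the number of characters that precede byte offset `target`
 * (which must not exceed the string length), or -1 if the walk steps
 * past `target` without landing on it. */
long char_index_at(const char *s, size_t target, width_fn width) {
  assert(s != NULL);
  size_t pos = 0;
  long idx = 0;
  while (pos < target) {
    pos += (size_t)width((unsigned char)s[pos]);
    idx++;
  }
  return pos == target ? idx : -1;
}
```

For the fragment `"ghi \xE2\x80\xA6 jkl"`, the `k` sits at byte offset 9. With `width_before_fix`, the walk takes 2 bytes for the lead byte 0xE2, lands on the continuation byte 0xA6, takes 4 bytes for it, and arrives at offset 10 without ever visiting 9. A loop that tested `!=` rather than `<` would not stop there and would read past the end of the buffer, which would fit the segmentation faults reported at the top of this thread; this is offered as a plausible explanation rather than a confirmed one.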

pkoppstein · 2017-09-20T18:20:18Z

This issue (#922) was CLOSED once a fix had been installed in the "master" version of jq.

In any case, using the current version of "master", I have verified that all the test cases in your post pass. Thank you for providing them.

If your point is that the latest official numbered release (currently jq 1.5) does not include this fix, then it might help to make that explicit.

Justin-W · 2017-09-20T18:24:10Z

Can we please get an update and ETA on when an official release containing this fix will be released? I couldn't find any info about plans or dates for any releases past 1.5 (aside from the unexplained cancellation of v1.5.1).

Given that #922 has now been fixed for 2 years, and was already scheduled for a previous release (1.5.1), it seems like it shouldn't be that much work to do at least a 1.5.1 release. And not releasing the fix is a showstopper for many.

FYI: The use of a custom build (e.g. from the master branch) is unfeasible or forbidden in many organizations (e.g. due to policy restrictions related to security, legal, and/or technical reasons). And as mentioned above, it appears impossible to implement any reliable workaround for #922 without either a new release or an (unfeasible) custom build.

(Sorry for not being explicit enough above. You pre-empted my planned follow-up by 3 minutes. :) )

deiwin · 2020-02-18T09:34:05Z

This seems pretty ridiculous:

❯ jq 'sub("-"; "x")' <<< '"-"'
[1]    57117 segmentation fault  jq 'sub("-"; "x")' <<< '"-"'
❯ jq --version
jq-1.6
❯ uname -a
Darwin <host-name> 19.2.0 Darwin Kernel Version 19.2.0: Sat Nov  9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64 x86_64

dtolnay self-assigned this Aug 22, 2015

dtolnay added a commit to dtolnay/jq that referenced this issue Aug 22, 2015

Fix decoding of UTF-8 sequence length (fix jqlang#922)

61f158d

dtolnay added a commit to dtolnay/jq that referenced this issue Aug 22, 2015

Fix decoding of UTF-8 sequence length (fix jqlang#922)

e975c19

dtolnay mentioned this issue Aug 22, 2015

Fix decoding of UTF-8 sequence length (fix #922) #923

Merged

dtolnay added the bug label Aug 22, 2015

dtolnay closed this as completed in 6c3934d Aug 22, 2015

pkoppstein mentioned this issue Aug 23, 2015

Use after free bug #896

Closed

dtolnay added this to the 1.5.1 release milestone Sep 11, 2015

pkoppstein mentioned this issue Apr 18, 2016

Segfault in capture with emoji #1134

Closed

mark-kubacki pushed a commit to mark-kubacki/jq that referenced this issue Aug 19, 2016

Fix decoding of UTF-8 sequence length (fix jqlang#922)

86bc917

atschabu mentioned this issue Jun 19, 2017

index() function returns wrong offset for non-ascii chars #1430

Closed

Justin-W mentioned this issue Sep 20, 2017

Question about release policy for security issues #1406

Closed
