(regex) Segmentation fault: 11 #922

Closed
pkoppstein opened this issue Aug 22, 2015 · 13 comments

@pkoppstein
Contributor

$ jq --version
jq-1.5rc2-57-g2c6c521

$ jq 'sub( "(?<x>.)"; "\(.x)!")'
"’"
Segmentation fault: 11

Some other examples:

$ jq -R 'sub( "(.)"; "")'
—
Segmentation fault: 11

$ jq 'sub( "(.)"; "")'
"—"
Segmentation fault: 11

$ uname -a
Darwin mini 13.4.0 Darwin Kernel Version 13.4.0: Wed Mar 18 16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64 x86_64
@dtolnay dtolnay self-assigned this Aug 22, 2015
@dtolnay
Member

dtolnay commented Aug 22, 2015

This is due to incorrectly decoding the width of 3-byte UTF-8 characters: https://github.com/stedolan/jq/blob/370833d55573a223b60ea51b4cea7b6c0326e030/jv_unicode.c#L62
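
For illustration, the check order at the linked line can be reproduced in isolation. The sketch below copies the pre-fix branch order (the same lines that appear as context in the diff later in this thread) into a standalone function. The name `buggy_decode_length` is hypothetical and is not a jq identifier. Every lead byte of the form 111xxxxx also has its top two bits set, so the 0xC0 test fires first and the 0xE0 branch can never be reached.

```c
/* Standalone copy of the pre-fix branch order (hypothetical name, illustration only). */
int buggy_decode_length(char startchar) {
  if ((startchar & 0x80) == 0) return 1;          /* 0xxxxxxx: ASCII */
  else if ((startchar & 0xC0) == 0xC0) return 2;  /* also true for 1110xxxx and 1111xxxx */
  else if ((startchar & 0xE0) == 0xE0) return 3;  /* unreachable: shadowed by the test above */
  else return 4;                                  /* only continuation bytes (10xxxxxx) land here */
}
```

With this ordering, `buggy_decode_length((char)0xE2)` yields 2 rather than 3. 0xE2 is the lead byte of both characters in the report (U+2019 and U+2014). The lead byte of a 4-byte sequence, such as 0xF0, is reported as 2 as well.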

@nicowilliams
Contributor

Ouch.

Proposed fix:

diff --git a/jv_unicode.c b/jv_unicode.c
index c3f9f11..767d4a5 100644
--- a/jv_unicode.c
+++ b/jv_unicode.c
@@ -61,8 +61,9 @@ int jvp_utf8_is_valid(const char* in, const char* end) {

 int jvp_utf8_decode_length(char startchar) {
    if ((startchar & 0x80) == 0) return 1;
-   else if ((startchar & 0xC0) == 0xC0) return 2;
+   else if ((startchar & 0xF0) == 0xF0) return 4;
    else if ((startchar & 0xE0) == 0xE0) return 3;
+   else if ((startchar & 0xC0) == 0xC0) return 2;
    else return 4;
 }
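
Read as a standalone function, the patched order tests the longest prefix first, so each lead byte reaches its own branch. The following sketch simply applies the diff above by hand; `patched_decode_length` is a hypothetical name used only for this illustration.

```c
/* The branch order after applying the proposed diff (hypothetical name). */
int patched_decode_length(char startchar) {
  if ((startchar & 0x80) == 0) return 1;          /* 0xxxxxxx: ASCII */
  else if ((startchar & 0xF0) == 0xF0) return 4;  /* 1111xxxx */
  else if ((startchar & 0xE0) == 0xE0) return 3;  /* 111xxxxx, with 1111xxxx already ruled out */
  else if ((startchar & 0xC0) == 0xC0) return 2;  /* 11xxxxxx, with 111xxxxx already ruled out */
  else return 4;                                  /* 10xxxxxx continuation bytes fall through */
}
```

Lead bytes 0xC2, 0xE2, and 0xF0 now map to 2, 3, and 4 respectively. The final `else` still maps a continuation byte such as 0x80 to 4.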

@nicowilliams
Contributor

Actually, this probably needs to deal with invalid sequences (not alias them to 4-bytes).

@nicowilliams
Contributor

Hmm, actually, we don't need to deal with invalid sequences, since these should be validated strings.
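
For comparison only, here is a rough sketch of what a variant that does not alias invalid lead bytes to 4 might look like. This is not the proposed fix, and `utf8_lead_length_strict` is a hypothetical name. It matches each bit prefix exactly and returns 0 for anything that cannot start a sequence. It only inspects the bit prefix, so lead bytes that are invalid for other reasons (for example 0xC0 or 0xF5) are not rejected here.

```c
/* Exact-prefix variant: returns 0 for bytes that cannot start a UTF-8 sequence.
   Hypothetical helper for illustration; not part of jq. */
int utf8_lead_length_strict(unsigned char b) {
  if (b < 0x80) return 1;            /* 0xxxxxxx */
  if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
  if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
  if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
  return 0;                          /* 10xxxxxx continuation byte or 11111xxx */
}
```

A caller would have to treat 0 as an error instead of advancing by it. On strings that have already been validated that case never arises, which is the point made above.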

@nicowilliams
Contributor

Feel free to push. I gtg.

@nicowilliams
Contributor

Thanks for the report @pkoppstein.

@dtolnay dtolnay added this to the 1.5.1 release milestone Sep 11, 2015
@lackneets

Input: "小型車計時:每小時20元;月租:每月3000元"

These work:
jq scan("每月(\\d+)")[0] => "3000"
jq scan("每小時\\d+")[0] => "每小時20"

But these do not:
jq scan("每小時(\\d+)")[0] => unknown jq execution error: signal: segmentation fault
jq scan("(\\d+)")[0] => unknown jq execution error: signal: segmentation fault

Am I facing the same Unicode bug? I am using jqplay.org.

@dtolnay
Member

dtolnay commented Dec 2, 2015

$ jq <test.json 'scan("每月(\\d+)")[0]'
"3000"
$ jq <test.json 'scan("每小時(\\d+)")[0]'
"20"
$ jq <test.json 'scan("(\\d+)")[0]'
"20"
"3000"

jqplay.org must not be using a version of jq that contains the fix.

@nicowilliams
Contributor

The fix for this didn't make jq-1.5. The milestone is 1.5.1 and the commit log shows it's not in jq-1.5. We should probably prep a 1.5.1 or a 1.6 release.

@Justin-W

#922 is a really serious bug that makes much of jq's functionality (e.g. nearly all regex-related functionality) effectively unusable, and hence makes jq itself effectively unusable when that functionality is needed.

#922 also makes silent (non-failure) errors leading to data corruption very probable. Any transformations or queries of non-ASCII data that use sub, match, or capture are highly prone to silent data corruption. For example, gsub replacements can cause catastrophically wrong transformations of the input without any error or warning, and only for input containing codepoints of certain byte lengths at specific positions relative to the actual locations of the regex matches, which makes the bug that much harder for a user to notice during testing. In other words, it is very easy to encounter such errors, but also very easy to miss them unless you test with the right combinations and positions of Unicode characters and regex patterns.

I suffered some serious data corruption (caused by jq #922) until the 'right' combination of data and jq filters led to some corruption that was catastrophic and non-silent. Until then, the silent corruption was very hard to detect. Even afterwards, it was difficult to diagnose the precise cause and to predict its effects.

RE: Workarounds:

It also seems impossible to work around the bug using only a jq library/module, since it appears to be impossible to use a jq def to override or hide a builtin that is implemented in 'jq.c'. Hence, even if all jq def builtins affected by #922 were overridden, any jq filters that use the jq.c builtins directly would still be affected. The only way to reliably prevent #922 from causing data loss therefore seems to be to fix the bug in the C code.

STEPS TO REPRODUCE:

Note: In the following series, each step attempts to substitute a single ASCII character with "#" (also ASCII). However, the actual effect of the sub function differs dramatically (due to #922) depending on where in the string the matched character is located, i.e. where it sits relative to non-ASCII, multi-byte characters. In particular, note the 2 cases where substantial portions of the string (38% and 62% in these examples) are simply dropped, representing massive data loss. Also note the data corruption in many cases, where the '#' is incorrectly substituted for multiple chars (in multiple locations) instead of just 1 char.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: =>
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[b]"; "#"; "ig")' |cc
Expect: => "a#c ² def © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: => "a#c ² def © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[e]"; "#"; "ig")' |cc
Expect: => "abc ² d#f © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: => "abc ² d#f © ghi … jkl ® mno “ pqr ¶ stu ³ vxy"
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[h]"; "#"; "ig")' |cc
Expect: => "abc ² def © g#i … jkl ® mno “ pqr ¶ stu ³ vxy"
Output: => "abc ² def © g#i … jkl ® mno “ pqr ¶ stu ³ vxy"
OK.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[k]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … j#l ® mno “ pqr ¶ stu ³ vxy"
Output: => "#j#l ® mno “ pqr ¶ stu ³ vxy"
WRONG! Very, very wrong!!! Substituted 2 chars, plus removed the first 17 chars (~38% data loss; plus corruption of remaining data)!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[n]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® m#o “ pqr ¶ stu ³ vxy"
Output: => "abc ² def © ghi … jkl ®#m#o “ pqr ¶ stu ³ vxy"
WRONG! Very wrong! Substituted 2 chars instead of 1!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[q]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ p#r ¶ stu ³ vxy"
Output: => "##p#r ¶ stu ³ vxy"
WRONG! Very, very wrong!!! Substituted 3 chars, plus removed the first 28 chars (~62% data loss; plus corruption of remaining data)!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[t]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ pqr ¶ s#u ³ vxy"
Output: => "abc ² def © ghi … jkl ® mno “ pqr#¶ s#u ³ vxy"
WRONG! Substituted 2 chars instead of 1!

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[x]"; "#"; "ig")' |cc
Expect: => "abc ² def © ghi … jkl ® mno “ pqr ¶ stu ³ v#y"
Output: => "abc ² def © ghi … jkl ® mno “ pqr ¶ stu#³ v#y"
WRONG! Substituted 2 chars instead of 1!
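
One plausible way a wrong character width turns into misplaced matches is sketched below. It assumes that the regex code converts the byte offset of a match into a character offset by walking the string one decoded width at a time. That is an assumption about jq's internals, not something stated in this thread, and all names in the sketch are hypothetical. It is at least consistent with the steps above, where the failures begin with the first match located after "…", the first 3-byte character in the test string ("²" and "©" are 2 bytes each).

```c
#include <stddef.h>

/* Pre-fix branch order: 3- and 4-byte lead bytes are reported as width 2. */
int buggy_len(unsigned char b) {
  if ((b & 0x80) == 0) return 1;
  else if ((b & 0xC0) == 0xC0) return 2;
  else if ((b & 0xE0) == 0xE0) return 3;
  else return 4;
}

/* Longest-prefix-first order, as in the proposed fix. */
int fixed_len(unsigned char b) {
  if ((b & 0x80) == 0) return 1;
  else if ((b & 0xF0) == 0xF0) return 4;
  else if ((b & 0xE0) == 0xE0) return 3;
  else if ((b & 0xC0) == 0xC0) return 2;
  else return 4;
}

/* Count the characters that precede byte_offset by stepping one decoded
   width at a time. The caller must pass an offset inside the string. */
size_t codepoints_before(const char *s, size_t byte_offset,
                         int (*len_fn)(unsigned char)) {
  size_t pos = 0, count = 0;
  while (pos < byte_offset) {
    pos += (size_t)len_fn((unsigned char)s[pos]);
    count++;
  }
  return count;
}
```

Take the 4-byte string `"\xE2\x80\xA6" "k"` ("…" followed by "k"), where "k" sits at byte offset 3. With `fixed_len` the walk reports 1 preceding character. With `buggy_len` it steps 2 bytes into the middle of "…", reads the continuation byte 0xA6 as width 4, overshoots the target, and reports 2.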

@pkoppstein
Contributor Author

This issue (#922) was CLOSED once a fix had been installed in the "master" version of jq.

In any case, using the current version of "master", I have verified that all the test cases in your post pass. Thank you for providing them.

If your point is that the latest official numbered release (currently jq 1.5) does not include this fix, then it might help to make that explicit.

@Justin-W

Can we please get an update and ETA on when an official release containing this fix will be released? I couldn't find any info about plans or dates for any releases past 1.5 (aside from the unexplained cancellation of v1.5.1).

Given that #922 has now been fixed for 2 years, and was already scheduled for a previous release (1.5.1), it seems like it shouldn't be that much work to do at least a 1.5.1 release. And not releasing the fix is a showstopper for many.

FYI: Using a custom build (e.g. from the master branch) is infeasible or forbidden in many organizations, e.g. due to policy restrictions for security, legal, and/or technical reasons. And as mentioned above, it appears impossible to implement any reliable workaround for #922 without either a new release or an (infeasible) custom build.

(Sorry for not being explicit enough above. You pre-empted my planned follow-up by 3 minutes. :) )

@deiwin

deiwin commented Feb 18, 2020

This seems pretty ridiculous:

❯ jq 'sub("-"; "x")' <<< '"-"'
[1]    57117 segmentation fault  jq 'sub("-"; "x")' <<< '"-"'
❯ jq --version
jq-1.6
❯ uname -a
Darwin <host-name> 19.2.0 Darwin Kernel Version 19.2.0: Sat Nov  9 03:47:04 PST 2019; root:xnu-6153.61.1~20/RELEASE_X86_64 x86_64
