index() function returns wrong offset for non-ascii chars #1430

atschabu · 2017-06-19T02:56:19Z

I'm trying to strip everything after a marker from a string. Something like sub("!.*"; "") doesn't work, because it gives me a segmentation fault when the text is too long. So I tried this route instead:

$ jq '.msg | .[0:index("!")]'

which works fine with input like:
{"msg": "hello world!"}
but fails when the text contains non-ASCII (multi-byte) characters:
{"msg": "здравствуй мир!"}

$ echo '{"msg": "здравствуй мир!"}' | jq '.msg | index("!")'
27

$ echo '{"msg": "hello world!"}' | jq '.msg | index("!")'
11

$ jq --version
jq-1.5
$ uname -a
Darwin atschabu-C02SF0UTG8WM 15.6.0 Darwin Kernel Version 15.6.0: Tue Apr 11 16:00:51 PDT 2017; root:xnu-3248.60.11.5.3~1/RELEASE_X86_64 x86_64
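
The gap between 27 and the expected value matches the difference between counting UTF-8 bytes and counting codepoints. The following is a minimal Python sketch of that arithmetic, not jq code; the variable names are illustrative, and it assumes (as the report suggests) that the slice `.[0:N]` counts codepoints while `index` in jq 1.5 counts bytes.

```python
# Illustration only: a Python stand-in for the offsets reported in this issue.
msg = "здравствуй мир!"

# Offset counted in Unicode codepoints (what the reporter expected).
codepoint_offset = msg.index("!")

# Offset counted in UTF-8 bytes (each Cyrillic letter takes two bytes).
byte_offset = msg.encode("utf-8").index(b"!")

print(codepoint_offset, byte_offset)

# A codepoint-based slice fed the byte offset overshoots the string,
# so nothing is stripped; the codepoint offset gives the intended prefix.
sliced_with_bytes = msg[:byte_offset]
sliced_with_codepoints = msg[:codepoint_offset]
print(sliced_with_bytes)
print(sliced_with_codepoints)
```

For pure ASCII input such as "hello world!" the two offsets coincide, which is why the first example appears to work.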

pkoppstein · 2017-06-19T03:15:30Z

There is some documentation about this on the "Pitfalls" page (https://github.com/stedolan/jq/wiki/How-to:-Avoid-Pitfalls).

In brief, you can use match/1:

echo '{"msg": "здравствуй мир!"}' | jq '.msg | match("!").offset'
14

This works in jq 1.5 and later.

By the way, could you please give more details about the failure of sub/2? Here is an illustration that it does not always fail when given a long string:

 jq1.5 -n '[range(0;100000) | "a"] | join("") + "!xx" | sub("!.*";"") | length'
100000

atschabu · 2017-06-19T17:03:49Z

My bad. I hadn't even realized there was a wiki. I took all the information from the manual, which didn't mention anything about index being bytewise. I'll give match a go.

I still haven't figured out exactly when the segmentation fault happens, as I haven't yet found the input that produces it. For now I'm assuming it is related to issue 922 until I can prove otherwise.

I guess we can close this one; I'll open a new ticket if my segmentation fault turns out not to be related to 922.

nicowilliams · 2017-11-28T19:00:25Z

No, this is a bug. We should fix it.

nicowilliams added the bug label Nov 28, 2017

This was referenced Jun 4, 2019

substring of index/rindex doesn't work for utf8 inputs #1624

Closed

improve index, rindex and indices against string to count the index by utf8 characters #1916

Closed

itchyny mentioned this issue Apr 29, 2021

About the jq's release process (Was: Is jq is still alive/maintained ?) #2305

Closed

D3vil0p3r mentioned this issue Jun 8, 2023

[Request] gojq chaotic-aur/packages#2543

Closed

itchyny mentioned this issue Mar 12, 2024

indices reports byte offsets instead of character offsets #3064

Closed

wader added a commit to wader/jq that referenced this issue Mar 12, 2024

Use codepoint index for indices/1, index/1 and rindex/1

ca38058

Previously byte index was used. Fixes jqlang#1430 jqlang#1624 jqlang#3064

wader mentioned this issue Mar 12, 2024

Use codepoint index for indices/1, index/1 and rindex/1 #3065

Merged
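
The idea named in the commit title, reporting codepoint offsets instead of byte offsets, can be sketched in Python as follows. This is not the code from the merged change; `byte_to_codepoint_offset` and `codepoint_indices` are hypothetical helpers written only to show one way of converting a UTF-8 byte offset into a codepoint offset. Reporting overlapping matches is a choice made in this sketch, not a statement about jq's behavior.

```python
from typing import List


def byte_to_codepoint_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a codepoint offset (hypothetical helper)."""
    prefix = text.encode("utf-8")[:byte_offset]
    # Decoding raises UnicodeDecodeError if byte_offset falls inside
    # a multi-byte sequence, which flags an invalid offset early.
    return len(prefix.decode("utf-8"))


def codepoint_indices(text: str, needle: str) -> List[int]:
    """Return the codepoint offset of every occurrence of needle in text."""
    if not needle:
        return []
    haystack = text.encode("utf-8")
    pattern = needle.encode("utf-8")
    result = []
    start = haystack.find(pattern)
    while start != -1:
        # A valid UTF-8 pattern can only match at a codepoint boundary,
        # so the conversion below never splits a character.
        result.append(byte_to_codepoint_offset(text, start))
        start = haystack.find(pattern, start + 1)
    return result


print(codepoint_indices("здравствуй мир!", "!"))
print(codepoint_indices("a,b, cd, efg", ", "))
```

Searching on the encoded bytes and converting each hit keeps the search itself unchanged; only the reported position is translated.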

wader closed this as completed in #3065 Nov 17, 2024

wader closed this as completed in 8619f8a Nov 17, 2024
