Fix offset value when escape characters exist #48

SebastienGllmt · 2024-09-29T02:38:41Z

Currently, the code assumes that 1 character in the parsed JSON corresponds to 1 character in the input. However, this isn't true, because JSON supports escape sequences, so multiple input characters can map to a single JSON character.

This PR fixes it and adds tests for this case.

juanjoDiaz · 2024-11-12T00:11:12Z

Hi @SebastienGllmt ,

This PR seems to be a duplicate of #24, or at least related to it.

Currently, the offset property counts the bytes in the stream, not the characters in the string.
That's because the parser can take in a string, but also an array of bytes, a Uint8Array, etc.

It's impossible for the library to know if you passed the string "д", the Uint8Array [0xD0, 0xB4], or the Uint16Array [0x0434]. So, currently, it will offset 2 because that's the number of bytes.
Escaped sequences are typically 1 byte and Unicode characters can be 1, 2, 3 or 4 bytes...
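
As a rough illustration of the byte-versus-character distinction described here, the sketch below measures the same character, U+0434 ("д"), in three representations. It is not taken from the library; the variable names are made up for the example, and it assumes a runtime that provides the standard `TextEncoder` global (Node.js, Deno, or a browser).

```typescript
// Illustration only: none of these names come from the library.
const ch = "д"; // U+0434, CYRILLIC SMALL LETTER DE

// UTF-8 encoding of the character: two bytes, 0xD0 0xB4.
const utf8: Uint8Array = new TextEncoder().encode(ch);

// UTF-16 representation: a single code unit, 0x0434.
const utf16: Uint16Array = new Uint16Array([ch.charCodeAt(0)]);

console.log(ch.length);    // length of the string in UTF-16 code units
console.log(utf8.length);  // length of the UTF-8 byte sequence
console.log(utf16.length); // length of the UTF-16 code unit array
```

A byte-based offset and a character-based offset only agree when every character in the input is ASCII.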

What you changed here simply counts all escaped chars as 2 bytes and all the unicode sequences as 6.
So it doesn't seem correct.

What do you think?

SebastienGllmt · 2024-11-12T04:59:29Z

> Escaped sequences are typically 1 byte and Unicode characters can be 1, 2, 3 or 4 bytes...

While this is true in general, you can see from the image I posted above that this doesn't matter for JSON specification purposes. \u is always followed by exactly 4 hexadecimal digits (so 6 characters total, including the \u prefix), and every other escape code is always exactly 2 characters.
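
The fixed lengths described in this comment can be sketched as follows. This is a minimal, hypothetical helper written for illustration; `escapeLength` and `sourceVsDecoded` are assumed names, not functions from the library or from this PR, and the code assumes its input is the already-valid body of a JSON string (without the surrounding quotes).

```typescript
// Hypothetical helper: length, in source characters, of the JSON escape
// sequence that starts at raw[i], where raw[i] must be a backslash.
function escapeLength(raw: string, i: number): number {
  if (raw[i] !== "\\") throw new Error("not an escape sequence");
  // \uXXXX is a backslash, 'u', and exactly four hex digits: 6 characters.
  // Every other JSON escape (\" \\ \/ \b \f \n \r \t) is exactly 2 characters.
  return raw[i + 1] === "u" ? 6 : 2;
}

// Compare how many source characters a JSON string body occupies with how
// many UTF-16 code units it decodes to.
function sourceVsDecoded(rawBody: string): { source: number; decoded: number } {
  let decoded = 0;
  let i = 0;
  while (i < rawBody.length) {
    i += rawBody[i] === "\\" ? escapeLength(rawBody, i) : 1;
    decoded += 1; // each escape or plain character yields one code unit
  }
  return { source: i, decoded };
}

console.log(sourceVsDecoded("a\\nb"));   // { source: 4, decoded: 3 }
console.log(sourceVsDecoded("\\u0434")); // { source: 6, decoded: 1 }
```

The gap between `source` and `decoded` is the amount by which an offset drifts if escapes are counted as single characters.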

SebastienGllmt · 2024-11-12T06:16:37Z

> It's impossible for the library to know if you passed the string "д", the Uint8Array [0xD0, 0xB4], or the Uint16Array [0x0434]. So, currently, it will offset 2 because that's the number of bytes.

I don't think this matters. The two below are totally separate and valid JSON strings. The escape is not done by JavaScript; the escape is done by the JSON parser itself.

console.log(JSON.parse('{"char": "\\u0434"}'));
console.log(JSON.parse('{"char": "д"}'));

That is to say, the inputs to the parser are not both d0b4: one is d0b4 and the other is 5c7530343334.
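
The hex values quoted here can be reproduced with a short sketch like the one below. `toHex`, `escapedBody`, and `literalBody` are names invented for this example, and the standard `TextEncoder` global is assumed to be available.

```typescript
// Hypothetical helper: hex dump of a string's UTF-8 bytes.
const toHex = (s: string): string =>
  Array.from(new TextEncoder().encode(s), (b) =>
    ("0" + b.toString(16)).slice(-2),
  ).join("");

// Six ASCII characters: backslash, 'u', '0', '4', '3', '4'.
const escapedBody = "\\u0434";
// One non-ASCII character, encoded as two bytes in UTF-8.
const literalBody = "д";

console.log(toHex(escapedBody)); // 5c7530343334
console.log(toHex(literalBody)); // d0b4

// Both bodies decode to the same value once the JSON parser handles the escape.
const sameValue =
  JSON.parse('"' + escapedBody + '"') === JSON.parse('"' + literalBody + '"');
console.log(sameValue); // true
```

The parser therefore sees 6 bytes in one case and 2 in the other, even though the decoded strings are identical.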

juanjoDiaz

I think that you are right, but I have a few coding-style comments.

Once they are fixed, I'm happy to merge this.

packages/plainjs/dist/deno/tokenizer.ts

SebastienGllmt added 2 commits September 29, 2024 11:12

Fix offset value when escape characters exist

51ed758

handle different length utf8

f2d1029

juanjoDiaz requested changes Nov 12, 2024

PR feedback on escape offset fix

283cdb6

juanjoDiaz merged commit af7d3d9 into juanjoDiaz:main Nov 13, 2024
8 checks passed
