Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix offset value when escape characters exist #48

Merged
merged 3 commits into from
Nov 13, 2024

Conversation

SebastienGllmt
Copy link
Contributor

Currently, the code assumes that 1 character in JSON corresponds to 1 character in the input. However, this isn't true because JSON supports escape characters (so multiple characters can map to a single JSON character)

SHLOB

This PR fixes it, and adds tests for this case

@juanjoDiaz
Copy link
Owner

Hi @SebastienGllmt ,

This PR seems a duplicate of #24. Or at least related.

Currently, the offset property counts the bytes in the stream. Not the characters in the string.
That's because the parser can take in a string but also an array of bytes, a Uint8Array, etc...

It's impossible for the library to know if you passed the string"д", or the Uint8Array [0xD0, 0x94] or the Uint16Array [0x0414]. So, currently, it will offset 2 because that's the amount of bytes.
Escaped sequences are typically 1 bye and unicode characters can be 1, 2, 3 or 4 bytes...

What you changed here simply counts all escaped chars as 2 bytes and all the unicode sequences as 6.
So it doesn't seem correct.

What do you think?

@SebastienGllmt
Copy link
Contributor Author

SebastienGllmt commented Nov 12, 2024

Escaped sequences are typically 1 bye and unicode characters can be 1, 2, 3 or 4 bytes...

While this is true in general, you can see from the image I posted above that this doesn't matter for JSON specification purposes. \u is always followed by exactly 4 hexadecimal digits (so 6 total since it's prefixed with \u), and other escape codes are always exactly 2

@SebastienGllmt
Copy link
Contributor Author

SebastienGllmt commented Nov 12, 2024

It's impossible for the library to know if you passed the string"д", or the Uint8Array [0xD0, 0x94] or the Uint16Array [0x0414]. So, currently, it will offset 2 because that's the amount of bytes.

I don't think this matters. The two below are two totally separate and valid JSON strings. The escape is not done by Javascript - the escape is done by the JSON parser itself.

console.log(JSON.parse('{"char": "\\u0434"}'));
console.log(JSON.parse('{"char": "д"}'));

That is to say, the input to the parser are not both d0b4. One is d0b4 and the other is 5c7530343334

Copy link
Owner

@juanjoDiaz juanjoDiaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you are right, but I have a few coding styles comments.

Once they are fixed, I'm happy to merge this.

packages/plainjs/dist/deno/tokenizer.ts Outdated Show resolved Hide resolved
packages/plainjs/dist/deno/tokenizer.ts Outdated Show resolved Hide resolved
packages/plainjs/dist/deno/tokenizer.ts Outdated Show resolved Hide resolved
packages/plainjs/dist/deno/tokenizer.ts Outdated Show resolved Hide resolved
packages/plainjs/dist/deno/tokenizer.ts Outdated Show resolved Hide resolved
@juanjoDiaz juanjoDiaz merged commit af7d3d9 into juanjoDiaz:main Nov 13, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants