Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle UTF-8 labels #117

Open
V0ldek opened this issue Mar 30, 2023 · 2 comments
Open

Properly handle UTF-8 labels #117

V0ldek opened this issue Mar 30, 2023 · 2 comments
Assignees
Labels
acceptance: go ahead Reviewed, implementation can start area: app Improvements in overall CLI app usability type: feature New feature or request
Milestone

Comments

@V0ldek
Copy link
Member

V0ldek commented Mar 30, 2023

Is your feature request related to a problem? Please describe.
The engine currently violates the JSON spec by not normalizing Unicode escapes. We do this for performance purposes, since ordinal comparison can be easily SIMDified, but it's not correct.

For a simple example, the UTF-8 codepoint for the letter "a" is 0x0061. These JSONs are equivalent under RFC 8259:

{"a":42}
{"\u0061":42}

Therefore the query $["a"] should in both cases match the value 42.

Quite sensibly, and indeed officially under the current JSONPath RFC Draft, the queries $["a"] and $["\u0061"] must also be equivalent. All four combinations of the two documents above and the two queries must yield the same result -- the value 42.

Describe the solution you'd like
The tradeoff here is important. We expect the difference in performance to be staggering, especially since the head-skip optimisation is by design incompatible with this. We need a flag that will toggle this behaviour. I propose we make this the optional behaviour – we expect the vast majority of labels to be ASCII, if a user wants to match unicode they can use the flag.

@V0ldek V0ldek added the type: feature New feature or request label Mar 30, 2023
@github-actions github-actions bot added the acceptance: triage Waiting for owner's input label Mar 30, 2023
@github-actions
Copy link

Tagging @V0ldek for notifications

@V0ldek V0ldek added mod: engine area: app Improvements in overall CLI app usability labels Apr 3, 2023
@V0ldek V0ldek added this to the v1.0.0 milestone Apr 3, 2023
@github-actions github-actions bot added acceptance: go ahead Reviewed, implementation can start and removed acceptance: triage Waiting for owner's input labels Apr 3, 2023
@V0ldek V0ldek modified the milestones: v1.0.0, v1.1.0 Jun 14, 2023
@V0ldek V0ldek modified the milestones: v1.1.0, v1.0.0 Dec 7, 2023
@V0ldek
Copy link
Member Author

V0ldek commented Dec 11, 2023

I'm rolling all escape handling under this umbrella. It seems there is no easy way to handle strings with escapes like \" or \n without introducing general unicode support. I will start by introducing proper parsing in #116 for unicode in the query string values, and then proper unicode comparison should be easy to do in principle. As to how to do this fast...

@V0ldek V0ldek moved this from Todo to Committed in Active rq development Apr 7, 2024
@V0ldek V0ldek self-assigned this Apr 7, 2024
@V0ldek V0ldek moved this from Committed to In Progress in Active rq development Apr 10, 2024
V0ldek added a commit that referenced this issue Dec 22, 2024
…query

The `Automaton` struct borrowed the source query, which also caused the
Engine to carry the query's lifetime with it. The actual data being borrowed
were the `JsonString` values for member transitions.
In preparation for #117 we remove the borrowed `JsonString` and replace it
with `StringPattern`. For UTF-8 the `StringPattern` will be a more complex
struct that precomputes some stuff for efficient matching later.
For now, it's a thin wrapper over a `JsonString`.

During construction we may create many transitions over the same pattern.
To reduce the size of the automaton we cache the patterns and put them
into an `Rc`. This may get optimised later to instead use some kind of
inline storage, but it's unlikely to actually matter. I ran the benchmarks
and saw no measurable difference between the previous version and this one.

Refs: #117 #613
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acceptance: go ahead Reviewed, implementation can start area: app Improvements in overall CLI app usability type: feature New feature or request
Projects
Status: In Progress
Development

No branches or pull requests

1 participant