-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly handle UTF-8 labels #117
Labels
acceptance: go ahead
Reviewed, implementation can start
area: app
Improvements in overall CLI app usability
type: feature
New feature or request
Milestone
Comments
Tagging @V0ldek for notifications |
github-actions
bot
added
acceptance: go ahead
Reviewed, implementation can start
and removed
acceptance: triage
Waiting for owner's input
labels
Apr 3, 2023
I'm rolling all escape handling under this umbrella. It seems there is no easy way to handle strings with escapes like |
V0ldek
added a commit
that referenced
this issue
Dec 22, 2024
…query The `Automaton` struct borrowed the source query, which also caused the Engine to carry the query's lifetime with it. The actual data being borrowed were the `JsonString` values for member transitions. In preparation for #117 we remove the borrowed `JsonString` and replace it with `StringPattern`. For UTF-8 the `StringPattern` will be a more complex struct that precomputes some stuff for efficient matching later. For now, it's a thin wrapper over a `JsonString`. During construction we may create many transitions over the same pattern. To reduce the size of the automaton we cache the patterns and put them into an `Rc`. This may get optimised later to instead use some kind of inline storage, but it's unlikely to actually matter. I ran the benchmarks and saw no measurable difference between the previous version and this one. Refs: #117 #613
3 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
acceptance: go ahead
Reviewed, implementation can start
area: app
Improvements in overall CLI app usability
type: feature
New feature or request
Is your feature request related to a problem? Please describe.
The engine currently violates the JSON spec by not normalizing Unicode escapes. We do this for performance purposes, since ordinal comparison can be easily SIMDified, but it's not correct.
For a simple example, the UTF-8 codepoint for the letter "a" is 0x0061. These JSONs are equivalent under RFC 8259:
Therefore the query
$["a"]
should in both cases match the value42
.Quite sensibly, and indeed officially under the current JSONPath RFC Draft, the queries
$["a"]
and$["\u0061"]
must also be equivalent. All four combinations of the two documents above and the two queries must yield the same result -- the value42
.Describe the solution you'd like
The tradeoff here is important. We expect the difference in performance to be staggering, especially since the
head-skip
optimisation is by design incompatible with this. We need a flag that will toggle this behaviour. I propose we make this the optional behaviour – we expect the vast majority of labels to be ASCII, if a user wants to match unicode they can use the flag.The text was updated successfully, but these errors were encountered: