
Implement JSON token stream deserializer #454

Merged: 4 commits into smithy-lang:main on Jun 4, 2021

Conversation

@jdisanti (Collaborator) commented Jun 3, 2021:

This adds a JSON token streaming deserializer to smithy-json for #161.

It passes all tests in JSONTestSuite, except for some notable ones:

  • Number parsing is more lenient. We could put in more effort to adhere strictly to the JSON spec, but it may not be worth it, and the leniency won't lead to misinterpretation of valid JSON data.
  • Some test cases around invalid Unicode sequences are expected to pass but always fail, since this implementation coerces all JSON strings into Rust strings, and Rust requires valid UTF-8. I don't think accepting these invalid sequences would be valuable for our SDK implementation since the protocols call for valid UTF-8 in strings. I was wrong here.
  • JSONTestSuite has some tests around multiple values, such as [][], which is perfectly valid for this implementation since it results in start_array, end_array, start_array, end_array (see the sketch after this list).
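To make that last point concrete, here is a minimal, self-contained sketch (illustrative only, not this PR's implementation) of why multiple top-level values fall out naturally from a token-stream design:

// Sketch: a token stream simply yields four structural tokens for `[][]`
// instead of rejecting the input as "multiple values".
#[derive(Debug, PartialEq)]
enum Token {
    StartArray,
    EndArray,
}

fn tokenize(input: &[u8]) -> Vec<Token> {
    input
        .iter()
        .copied()
        .filter_map(|b| match b {
            b'[' => Some(Token::StartArray),
            b']' => Some(Token::EndArray),
            _ => None,
        })
        .collect()
}

fn main() {
    assert_eq!(
        tokenize(b"[][]"),
        vec![Token::StartArray, Token::EndArray, Token::StartArray, Token::EndArray],
    );
}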

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@rcoh (Collaborator) left a comment:

Overall I like the direction, but we need to avoid allocating strings during parsing.

}

/// Reads a JSON string out of the stream.
fn read_string(&mut self) -> Result<String, Error> {
@rcoh (Collaborator) commented:
The XML parsing library I used didn't unescape in the tokenizer. That kept the tokenizer allocation-free, and for input that you know won't contain escapes (like base64), you can even skip the unescape step later.

I don't think we want to allocate for every string we read out of the input stream, especially since many of them (e.g. a key in a map) never need to be owned.
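For illustration, a minimal sketch of the deferred-unescaping approach being suggested (the EscapedStr name and to_unescaped method are assumptions here, not the crate's API):

use std::borrow::Cow;

/// Sketch: the tokenizer hands back the raw escaped slice, and unescaping
/// only happens, and only allocates, when the caller needs the value.
struct EscapedStr<'a>(&'a str);

impl<'a> EscapedStr<'a> {
    fn to_unescaped(&self) -> Cow<'a, str> {
        if self.0.contains('\\') {
            // Only this path allocates. A real implementation would decode
            // all of \n, \t, \", \uXXXX, etc.; this handles just \n to keep
            // the sketch short.
            Cow::Owned(self.0.replace("\\n", "\n"))
        } else {
            // Common case (map keys, base64 payloads): borrow, no allocation.
            Cow::Borrowed(self.0)
        }
    }
}

fn main() {
    assert!(matches!(EscapedStr("abc123").to_unescaped(), Cow::Borrowed("abc123")));
    assert_eq!(EscapedStr("a\\nb").to_unescaped(), "a\nb");
}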

@jdisanti (Collaborator, Author) replied:
This makes a lot of sense. Will refactor.

pub enum Token {
    StartArray,
    EndArray,
    ObjectKey(String),
@rcoh (Collaborator) commented:

These almost certainly want to be &'a str values that refer into the input.
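A sketch of the borrowed shape being suggested (abbreviated, and an assumption about the eventual design rather than the merged code):

// Tokens borrow from the input buffer instead of allocating.
pub enum Token<'a> {
    StartArray,
    EndArray,
    ObjectKey(&'a str),
    // ...the remaining variants would borrow from the input the same way.
}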

@rcoh (Collaborator) left a comment:

Also, if there is a harness to attach this to JSONTestSuite, we should definitely include it.

@jdisanti (Collaborator, Author) commented Jun 3, 2021:

Looks like the unescaping has a bug around UTF-16 surrogate pairs. I'll look into fixing that in addition to decoupling parsing from unescaping.
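For context, a sketch of the surrogate-pair math a JSON unescaper needs to get right (illustrative, not this PR's code): an escape like "\uD834\uDD1E" must decode to the single char U+1D11E rather than being treated as two standalone \u escapes.

/// Combine a UTF-16 surrogate pair into a single Unicode scalar value.
fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    // High surrogates are 0xD800..=0xDBFF, low surrogates 0xDC00..=0xDFFF.
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    let code_point = 0x10000 + (((high as u32 - 0xD800) << 10) | (low as u32 - 0xDC00));
    std::char::from_u32(code_point)
}

fn main() {
    // U+1D11E MUSICAL SYMBOL G CLEF.
    assert_eq!(combine_surrogates(0xD834, 0xDD1E), Some('\u{1D11E}'));
    // A lone or out-of-order surrogate is an error the parser must report.
    assert_eq!(combine_surrogates(0xDD1E, 0xD834), None);
}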

@rcoh (Collaborator) left a comment:

LGTM! We should add a proper fuzzing harness, either here or in a follow-up PR. A bench suite would also be good, but it's not critical right now.
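A sketch of what such a cargo-fuzz target could look like (the json_token_iter entry point is a hypothetical name for the tokenizer under test, not confirmed by this PR): the invariant checked is simply that arbitrary bytes never panic the parser; they produce tokens or an Err.

#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // json_token_iter is a hypothetical entry point for the tokenizer.
    // Drain the token stream, stopping at the first error.
    for token in json_token_iter(data) {
        if token.is_err() {
            break;
        }
    }
});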

@jdisanti (Collaborator, Author) commented Jun 3, 2021:

I will add fuzzing in a follow-up PR.

@jdisanti merged commit 2cef1e6 into smithy-lang:main on Jun 4, 2021.
@jdisanti deleted the json-de branch on June 4, 2021 at 21:41.