Overhaul the text parsers, port from `nom` to `winnow` #892

Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##            main     #892     +/-  ##
==========================================
- Coverage   77.63%   77.51%   -0.12%
==========================================
  Files         136      136
  Lines       35094    34278     -816
  Branches    35094    34278     -816
==========================================
- Hits        27244    26572     -672
+ Misses       5793     5728      -65
+ Partials     2057     1978      -79
```
🗺️ PR Tour 🧭
```diff
@@ -57,7 +57,7 @@ compact_str = "0.8.0"
 chrono = { version = "0.4", default-features = false, features = ["clock", "std", "wasmbind"] }
 delegate = "0.12.0"
 thiserror = "1.0"
-nom = "7.1.1"
+winnow = { version = "0.6", features = ["simd"] }
```
🪧 The `simd` feature enables the `memchr` operation when scanning for an expected token.
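For intuition, here is a scalar sketch of the scan in question (the helper name is hypothetical; the real work is done by the `memchr` crate, which examines many bytes per instruction using SIMD):

```rust
/// Scalar sketch of a delimiter scan (hypothetical helper). The `memchr`
/// crate that winnow's `simd` feature enables performs this same search,
/// but vectorized.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}
```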
```diff
@@ -47,9 +47,8 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 {
     let text_1_1_data = r#"(:event 1670446800245 418 "6" "1" "abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);
     let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM
```
🪧 This benchmark is really showing its age. When it was written, there was no support for reading encoding directives, so the tests/benchmarks manually compiled and registered their own templates. Now that the readers manage their encoding context as expected, reading a leading IVM clears the manually registered templates.
When our managed writer API is fleshed out, we'll have a way to hand a macro to the writer so it gets serialized in the data stream. For now, we simply skip the IVM in binary 1.1.
```diff
-let mut reader = LazyRawBinaryReader_1_1::new(binary_1_1_data);
+let mut reader = LazyRawBinaryReader_1_1::new(context_ref, binary_1_1_data);
 let mut num_top_level_values: usize = 0;
 // Skip past the IVM
-reader.next(context_ref).unwrap().expect_ivm().unwrap();
+reader.next().unwrap().expect_ivm().unwrap();
```
🪧 The raw readers now take a reference to the encoding context at construction time instead of having it passed into each call to `next()`.

Taking it as an argument to `next` was intended to allow a raw reader to exist as long as needed, with the context being provided any time it reads. In practice, however, the raw readers only exist long enough to read a single top-level value from the stream, and requiring a context ref at every turn gets pretty tedious.
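As a rough sketch of the API change (toy types, not the actual ion-rust signatures):

```rust
// Simplified stand-ins for the real types.
struct EncodingContextRef<'a> {
    symbols: &'a [&'a str],
}

// Before: the context had to accompany every call.
struct OldReader<'data> {
    data: &'data [u8],
    pos: usize,
}
impl<'data> OldReader<'data> {
    fn next(&mut self, _context: &EncodingContextRef) -> Option<u8> {
        let byte = self.data.get(self.pos).copied();
        self.pos += 1;
        byte
    }
}

// After: the context is captured once at construction time.
struct NewReader<'data> {
    context: EncodingContextRef<'data>,
    data: &'data [u8],
    pos: usize,
}
impl<'data> NewReader<'data> {
    fn new(context: EncodingContextRef<'data>, data: &'data [u8]) -> Self {
        NewReader { context, data, pos: 0 }
    }
    fn next(&mut self) -> Option<u8> {
        let byte = self.data.get(self.pos).copied();
        self.pos += 1;
        byte
    }
    // The stored context is available whenever the reader needs it.
    fn symbol(&self, sid: usize) -> Option<&str> {
        self.context.symbols.get(sid).copied()
    }
}
```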
```diff
@@ -45,28 +42,24 @@ use crate::lazy::raw_stream_item::LazyRawStreamItem;
 use crate::lazy::raw_value_ref::RawValueRef;
 use crate::lazy::span::Span;
 use crate::lazy::streaming_raw_reader::RawReaderState;
 use crate::lazy::text::raw::r#struct::{
```
🪧 Many of the raw-level, container-related types are now generic over their `TextEncoding`, allowing them to work with both Ion 1.0 and 1.1. The changes in this file reflect that update.
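A minimal sketch of the pattern (illustrative types only, not the real ion-rust definitions): one container parameterized by an encoding marker replaces a pair of near-identical `_1_0`/`_1_1` types.

```rust
use std::marker::PhantomData;

// Illustrative encoding markers.
trait TextEncoding {
    const VERSION: (u8, u8);
}
#[allow(non_camel_case_types)]
struct TextEncoding_1_0;
impl TextEncoding for TextEncoding_1_0 {
    const VERSION: (u8, u8) = (1, 0);
}
#[allow(non_camel_case_types)]
struct TextEncoding_1_1;
impl TextEncoding for TextEncoding_1_1 {
    const VERSION: (u8, u8) = (1, 1);
}

// Instead of separate RawTextSExp_1_0 / RawTextSExp_1_1 types:
struct RawTextSExp<'data, E: TextEncoding> {
    source: &'data str,
    encoding: PhantomData<E>,
}

impl<'data, E: TextEncoding> RawTextSExp<'data, E> {
    fn new(source: &'data str) -> Self {
        RawTextSExp { source, encoding: PhantomData }
    }
    // Version-specific behavior hangs off the type parameter.
    fn ion_version(&self) -> (u8, u8) {
        E::VERSION
    }
}
```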
```diff
 }

-fn resume_at_offset(data: &'data [u8], offset: usize, mut encoding_hint: IonEncoding) -> Self {
+fn resume(context: EncodingContextRef<'data>, mut saved_state: RawReaderState<'data>) -> Self {
```
🪧 The `RawReaderState` type already existed but wasn't used in this method (which predated it). Replacing the individual arguments with one type made it easy to add a field to the state in a centralized place.
```diff
 pub fn match_argument_for(
-    self,
+    &mut self,
```
🪧 The parsing methods now use `&mut self`, allowing them to avoid defining new variables at each step of parsing (unless that's what you want).
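The difference in calling convention can be sketched with a toy buffer (not the real `TextBuffer` API):

```rust
// Toy input buffer; `Copy` makes saving intermediate states cheap.
#[derive(Clone, Copy)]
struct Buf<'a> {
    input: &'a str,
}

impl<'a> Buf<'a> {
    // nom style: each step returns the remaining input, forcing a new binding.
    #[allow(dead_code)]
    fn match_digit_old(self) -> Option<(Buf<'a>, char)> {
        let ch = self.input.chars().next().filter(|c| c.is_ascii_digit())?;
        Some((Buf { input: &self.input[1..] }, ch))
    }

    // winnow style: the buffer advances in place; no rebinding required.
    fn match_digit(&mut self) -> Option<char> {
        let ch = self.input.chars().next().filter(|c| c.is_ascii_digit())?;
        self.input = &self.input[1..];
        Some(ch)
    }
}

fn parse_two_digits(mut buf: Buf) -> Option<(char, char)> {
    // No `let (buf2, a) = ...; let (buf3, b) = ...;` chain needed.
    let a = buf.match_digit()?;
    let b = buf.match_digit()?;
    Some((a, b))
}
```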
```rust
/// Matches a parser that must be followed by input that matches `terminator`.
```
🪧 More incompleteness detection special casing.
```rust
// === nom trait implementations ===
```
🪧 Here we're switching over `nom` trait implementations to `winnow` trait implementations.
```rust
],
expect_incomplete: [
    "0x", // Base 16 prefix w/no number
    "0b", // Base 2 prefix w/no number
]
],
```
🪧 Because these unit tests are all reading from fixed slices, the parser will never return `Incomplete`. Those inputs have been moved to the `expect_mismatch` sections. We have a separate test suite just for incompleteness detection anyway.
```diff
-impl<'data> LazyRawSequence<'data, TextEncoding_1_0> for LazyRawTextSExp_1_0<'data> {
-    type Iterator = RawTextSExpIterator_1_0<'data>;
+impl<'data, E: TextEncoding<'data>> LazyRawSequence<'data, E> for RawTextSExp<'data, E> {
+    type Iterator = RawTextSequenceCacheIterator<'data, E>;
```
🪧 All of the text containers now cache their child expressions and iterate over the cache as needed.
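A simplified sketch of the caching strategy (toy types; the real child expressions are not `i64`s): the container parses its children once, on first request, and every later traversal iterates over the cached results.

```rust
use std::cell::OnceCell;

// Toy container that lazily parses and caches its child expressions.
struct CachedSExp {
    source: String,
    cache: OnceCell<Vec<i64>>, // hypothetical child type
}

impl CachedSExp {
    fn new(source: &str) -> Self {
        CachedSExp { source: source.to_string(), cache: OnceCell::new() }
    }

    /// Parses children on the first call; later calls iterate the cached Vec.
    fn children(&self) -> impl Iterator<Item = &i64> {
        self.cache
            .get_or_init(|| {
                self.source
                    .split_whitespace()
                    .filter_map(|token| token.parse().ok())
                    .collect()
            })
            .iter()
    }
}
```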
```rust
// These tests are all failing because multipart long strings are not handled correctly when the
// "part" boundary happens to also fall on a point where the reader needs to refill the input buffer.
const INCOMPLETE_LONG_STRING_SKIP_LIST: SkipList = &[
    "ion-tests/iontestdata/good/equivs/localSymbolTableAppend.ion",
    "ion-tests/iontestdata/good/equivs/localSymbolTableNullSlots.ion",
    "ion-tests/iontestdata/good/equivs/longStringsWithComments.ion",
    "ion-tests/iontestdata/good/equivs/strings.ion",
    "ion-tests/iontestdata/good/lists.ion",
    "ion-tests/iontestdata/good/strings.ion",
    "ion-tests/iontestdata/good/stringsWithWhitespace.ion",
    "ion-tests/iontestdata/good/strings_cr_nl.ion",
    "ion-tests/iontestdata/good/strings2.ion",
    "ion-tests/iontestdata/good/structs.ion",
    "ion-tests/iontestdata/good/strings_nl.ion",
];
```
🎉
```rust
#[allow(clippy::should_implement_trait)]
pub fn next(&mut self) -> IonResult<LazyRawStreamItem<'data, BinaryEncoding_1_0>> {
```
Just a thought: could we make the readers implement `Iterator`? Would it provide any benefit for users?
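One possible shape for that (a purely hypothetical adapter, not something in this PR): keep the inherent, fallible `next()` and offer an adapter whose `Iterator` items are `Result`s, since a reader's `IonResult<Option<Item>>` maps naturally onto `Option<Result<Item, Error>>`.

```rust
// Toy reader with a fallible, inherent `next()`, kept as-is.
struct Reader {
    values: Vec<i64>,
    pos: usize,
}

impl Reader {
    fn next(&mut self) -> Result<Option<i64>, String> {
        let item = self.values.get(self.pos).copied();
        self.pos += 1;
        Ok(item)
    }

    // Hypothetical adapter entry point.
    fn iter(self) -> ReaderIter {
        ReaderIter { reader: self }
    }
}

struct ReaderIter {
    reader: Reader,
}

impl Iterator for ReaderIter {
    type Item = Result<i64, String>;
    fn next(&mut self) -> Option<Self::Item> {
        // Ok(None) means end of stream; errors are yielded as items.
        match self.reader.next() {
            Ok(Some(value)) => Some(Ok(value)),
            Ok(None) => None,
            Err(e) => Some(Err(e)),
        }
    }
}
```

The upside is `for value in reader.iter()` and access to all of `Iterator`'s combinators; the cost is that every consumer has to unwrap a `Result` per item.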
src/lazy/streaming_raw_reader.rs
```rust
/// If `true`, the current contents of the buffer may not be the complete stream.
fn is_streaming(&self) -> bool;
```
Any particular reason to use a function rather than a constant for this? (Then you probably wouldn't need the `#[inline(always)]` on the implementations, because the compiler should be smart enough to inline a constant value.)
Until you asked, I was certain that associated `const`s weren't stable on traits yet. But it works! 🎉
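For reference, the shape of the change (simplified trait and hypothetical implementor names):

```rust
// The trait method `fn is_streaming(&self) -> bool` becomes an associated const.
trait IonDataSource {
    /// If `true`, the current contents of the buffer may not be the complete stream.
    const IS_STREAMING: bool;
}

// Hypothetical implementors for illustration.
struct FixedSlice;
impl IonDataSource for FixedSlice {
    const IS_STREAMING: bool = false;
}

struct StreamingSource;
impl IonDataSource for StreamingSource {
    const IS_STREAMING: bool = true;
}

fn needs_more_data<S: IonDataSource>(_source: &S, exhausted: bool) -> bool {
    // The constant is known at compile time, so for non-streaming sources
    // this whole branch can be pruned without any inlining hints.
    S::IS_STREAMING && exhausted
}
```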
src/lazy/text/matched.rs
```diff
-let buffer = TextBuffer::new(context, data.as_bytes());
-let (_remaining, matched) = buffer.match_int().unwrap();
+let buffer = TextBuffer::new(context, data.as_bytes(), true);
+let matched = buffer.clone().match_int().unwrap();
```
Seeing `clone()` occurring frequently is a little unexpected. Is that just because `winnow` requires cloning instead of borrowing, or is this a change in strategy that is not specific to `winnow`/`nom`?
It's a weird corner case for the unit tests.
Winnow's parsers use `&mut self` and consume any matched data rather than returning a copy of the remaining input slice. This means that `buffer.match_int()` would return a `MatchedInt` and consume `buffer`. When the unit test went to actually read the int from the matched bytes:

```rust
let actual = matched.read(buffer).unwrap();
```

...`buffer` would be empty, so it would fail.

I decided to match on a copy of `buffer` so the original would remain unmodified. However, `buffer` was a value, not a reference, so I couldn't simply dereference it to create a copy. I had three options:

1. Make `buffer` a `&mut TextBuffer` that I could then dereference, adding some weirdness to the declaration in the process.
2. Call `winnow`'s `Stream::checkpoint()` method, which saves the parser state (in this case, copying the `TextBuffer`), but which people won't intuitively understand.
3. Call `clone()` (bleh) and live with it 'cause it's only used in the unit tests.

I went with the latter. However, in writing this up I realized there was a fourth option that's more idiomatic to `winnow`: using `peek()`. I've updated the unit tests to use that instead.
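The effect of a peek can be sketched with a toy `Copy` buffer (not winnow's actual `Stream`/`peek` API): the parser runs against a copy of the state, so the original input is left unconsumed.

```rust
// Toy buffer; `Copy` is what makes peeking cheap.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Buf<'a> {
    input: &'a str,
}

// A parser that consumes the digits it matches.
fn match_digits(buf: &mut Buf) -> Option<u32> {
    let end = buf
        .input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(buf.input.len());
    let (digits, rest) = buf.input.split_at(end);
    let value = digits.parse().ok()?;
    buf.input = rest; // consumes the matched text
    Some(value)
}

/// Runs `parser` on a copy of the buffer, leaving the original unconsumed.
fn peek<T>(parser: fn(&mut Buf) -> Option<T>, buf: &mut Buf) -> Option<T> {
    let mut copy = *buf;
    parser(&mut copy)
}
```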
src/lazy/text/raw/struct.rs
```rust
let result = whitespace_and_then(alt((
    "}".value(None),
    terminated(
        E::field_expr_matcher().map(Some),
        whitespace_and_then(opt(",")),
    ),
)))
```
This is surprising to me. The `,` is only optional if it's followed by `}`, so I would have expected to see something more like this:

```rust
whitespace_and_then(alt((
    (opt(","), whitespace_and_then("}")).value(None),
    (",", whitespace_and_then(E::field_expr_matcher()).map(Some))
)))
```
You were right to be surprised; it is indeed wrong. 🤦

Similarly, I was surprised to learn that neither `ion-tests` nor `ion-rust`'s test suite includes a test for a struct missing a comma. I opened amazon-ion/ion-tests#150 to track adding a test to `ion-tests` and have added a unit test for this to `ion-rust`.

> ```rust
> whitespace_and_then(alt((
>     (opt(","), whitespace_and_then("}")).value(None),
>     (",", whitespace_and_then(E::field_expr_matcher()).map(Some))
> )))
> ```

This is also not quite right, since the first field cannot be preceded by a comma.
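To make the separator rules under discussion concrete, here's a tiny hand-rolled recognizer (a sketch over pre-split tokens, not the real winnow parser): a comma is required between fields, one trailing comma before `}` is allowed, and a comma before the first field is rejected.

```rust
/// Sketch: recognizes `{` field (`,` field)* `,`? `}` over a token list.
fn is_valid_struct(tokens: &[&str]) -> bool {
    let mut toks = tokens.iter();
    if toks.next() != Some(&"{") {
        return false;
    }
    // A field (or `}`) may appear next, but not a comma.
    let mut expect_field = true;
    loop {
        match toks.next() {
            Some(&"}") => return toks.next().is_none(),
            Some(&",") => {
                // A comma is only legal right after a field; this rejects
                // both a leading comma and a doubled comma.
                if expect_field {
                    return false;
                }
                expect_field = true;
            }
            Some(_field) => {
                // Two fields with no comma between them is an error.
                if !expect_field {
                    return false;
                }
                expect_field = false;
            }
            None => return false, // unterminated struct
        }
    }
}
```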
This PR migrates the text parser from `nom` to an actively developed fork called `winnow`. (You can read about the relationship between the two here.) The migration offers a number of benefits:

- `winnow` includes a `debug` feature that lets you see the reader's path through the branches of the parser, making it MUCH easier to find the root of parser misbehavior.
- Where `nom` offers two flavors of each parser, `complete` and `streaming`, `winnow` allows an input source to report whether it is partial or complete. This means that there is a single flavor of each parser that will do the correct thing when it runs out of data. This allowed me to remove a LOT of special cases.
- The `Parser` trait takes `&mut self` and modifies it in place rather than taking `self` and returning an updated copy along with the expected output value. (Discussion here.) Reducing the return value of every method from `(TextBuffer<'_>, T)` to just `T` increases the odds that the value will be returned in a register. Happily, because `TextBuffer` is `Copy`, we can still make as many intermediate state copies as we'd like when it's called for.
- The `alt((...))` combinator tries each of the provided parsers in turn and takes the first match. This is often quite slow; `winnow` provides a `dispatch!` macro that allows you to prune the tree of options up front by `match`ing on the head of the stream.
- `tag("foo")`/`literal("foo")` can now be expressed as just `"foo"`. Similarly, tuples of parsers are now themselves parsers, so you no longer need to write `tuple(("/*", multiline_body_comment, "*/"))`. You can just write `("/*", multiline_body_comment, "*/")`.

I also made several improvements that did not require `winnow` per se:

- Raw-level types are now generic over their `E: TextEncoding`, eliminating a large amount of mostly duplicated code.

Hopefully this makes it much easier to both read and maintain.
Performance improvements

Testing improvements

Because nearly all of the tests read some amount of Ion text, this patch led to a huge drop in the time needed to run the test harness.

Command

This command runs everything but the doc tests:

Before

After

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.