Overhaul the text parsers, port from nom to winnow #892

Merged
merged 33 commits into from
Jan 9, 2025

zslayton
Contributor

@zslayton zslayton commented Jan 3, 2025

This PR migrates the text parser from nom to an actively developed fork called winnow. You can read about the relationship between the two here.

The migration offers a number of benefits:

  • winnow includes a debug feature that lets you see the reader's path through the branches of the parser, making it MUCH easier to find the root of parser misbehavior.
  • Where nom offers two flavors of each parser, complete and streaming, winnow allows an input source to report whether it is partial or complete. This means that there is a single flavor of each parser that will do the correct thing when it runs out of data. This allowed me to remove a LOT of special cases.
  • The Parser trait takes &mut self and modifies it in-place rather than taking self and returning an updated copy along with the expected output value. (Discussion here.) Reducing the return value of every method from (TextBuffer<'_>, T) to just T increases the odds that the value will be returned in a register. Happily, because TextBuffer is Copy, we can still make as many intermediate state copies as we'd like when it's called for.
  • The alt((...)) combinator tries each of the provided parsers in turn and takes the first match. This is often quite slow--winnow provides a dispatch! macro that allows you to prune the tree of options up-front by matching on the head of the stream.
  • Some simple types are now considered parsers themselves, which makes the parser methods that use them easier to read. For example, tag("foo")/literal("foo") can now be expressed as just "foo". Similarly, tuples of parsers are now themselves parsers, so you no longer need to write tuple(("/*", multiline_body_comment, "*/")). You can just write ("/*", multiline_body_comment, "*/").
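
A minimal sketch of the `&mut self` and head-of-stream dispatch points above, written against the standard library only (it does not use winnow or its `dispatch!` macro). `Buf` and all of its methods are hypothetical stand-ins for `TextBuffer` and the real matchers:

```rust
/// Hypothetical stand-in for `TextBuffer`: a cheap, `Copy` view of the remaining input.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Buf<'a> {
    data: &'a [u8],
}

fn digit_count(data: &[u8]) -> usize {
    data.iter().take_while(|b| b.is_ascii_digit()).count()
}

impl<'a> Buf<'a> {
    /// nom-style: take `self` by value, return the remaining input along with the match.
    fn match_digits_by_value(self) -> Option<(Buf<'a>, &'a [u8])> {
        let data = self.data;
        let len = digit_count(data);
        if len == 0 {
            return None;
        }
        Some((Buf { data: &data[len..] }, &data[..len]))
    }

    /// winnow-style: advance `self` in place and return only the match.
    fn match_digits(&mut self) -> Option<&'a [u8]> {
        let data = self.data;
        let len = digit_count(data);
        if len == 0 {
            return None;
        }
        self.data = &data[len..];
        Some(&data[..len])
    }

    /// Matches a double-quoted string with no escapes, returning its body.
    fn match_string(&mut self) -> Option<&'a [u8]> {
        let data = self.data;
        if data.first() != Some(&b'"') {
            return None;
        }
        let body = &data[1..];
        let end = body.iter().position(|&b| b == b'"')?;
        self.data = &body[end + 1..];
        Some(&body[..end])
    }

    /// Prunes the alternatives by looking at the head byte instead of trying each in turn.
    fn match_scalar(&mut self) -> Option<&'static str> {
        match *self.data.first()? {
            b'0'..=b'9' => self.match_digits().map(|_| "int"),
            b'"' => self.match_string().map(|_| "string"),
            _ => None,
        }
    }
}
```

Because `Buf` is `Copy`, a caller can still write `let saved = buf;` before calling an in-place matcher and go back to `saved` if the match needs to be undone.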

I also made several improvements that did not require winnow per se:

  • Made several encoding-version-specific methods and types generic over E: TextEncoding, eliminating a large amount of mostly duplicated code.
  • Made a few parsers do an easily inlineable up-front check before calling the real (not as inlineable) complex implementation.
  • Modified the 1.0 container parsers to cache the sub-expressions they encounter during lexing using the bump allocator, offering a big speedup. (This optimization had already been applied to the 1.1 container parsers.)
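
The "inlineable up-front check" pattern can be sketched roughly as follows. The function names and the comment grammar handled here are illustrative assumptions, not the PR's actual code:

```rust
/// Hypothetical matcher: a cheap guard the compiler can inline at every call site.
#[inline]
fn match_comment(input: &mut &[u8]) -> Option<usize> {
    // Most values do not start with '/', so bail out before paying for a call.
    if input.first() != Some(&b'/') {
        return None;
    }
    match_comment_slow(input)
}

/// The heavier logic stays out of line so it does not bloat the callers.
#[inline(never)]
fn match_comment_slow(input: &mut &[u8]) -> Option<usize> {
    let data = *input;
    let len = if data.starts_with(b"//") {
        // Rest-of-line comment: runs to the newline or to the end of input.
        data.iter().position(|&b| b == b'\n').unwrap_or(data.len())
    } else if data.starts_with(b"/*") {
        // Multiline comment: runs through the closing "*/".
        2 + data[2..].windows(2).position(|w| w == b"*/")? + 2
    } else {
        return None;
    };
    *input = &data[len..];
    Some(len)
}
```

The guard costs one byte comparison on the common path, and the function call is only made when a comment is actually plausible.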

Hopefully this makes the text parser much easier to both read and maintain.


Performance improvements

image

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


Testing improvements

Because nearly all of the tests read some amount of Ion text, this patch led to a huge drop in the time needed to run the test harness.

Command

This command runs everything but the doc tests:

time cargo test --all-features --lib --tests

Before

image

After

image

codecov bot commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 80.66465% with 128 lines in your changes missing coverage. Please review.

Project coverage is 77.51%. Comparing base (46cc6b2) to head (4085e38).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/lazy/any_encoding.rs 51.04% 22 Missing and 25 partials ⚠️
src/lazy/binary/raw/v1_1/reader.rs 48.83% 10 Missing and 34 partials ⚠️
src/lazy/text/raw/reader.rs 87.71% 3 Missing and 4 partials ⚠️
src/lazy/binary/raw/reader.rs 85.00% 6 Missing ⚠️
src/lazy/encoder/write_as_ion.rs 0.00% 4 Missing ⚠️
src/lazy/text/raw/sequence.rs 92.98% 3 Missing and 1 partial ⚠️
src/lazy/text/raw/v1_1/reader.rs 92.30% 1 Missing and 3 partials ⚠️
src/lazy/text/parse_result.rs 85.00% 3 Missing ⚠️
src/lazy/encoder/text/v1_1/writer.rs 50.00% 0 Missing and 2 partials ⚠️
src/lazy/streaming_raw_reader.rs 94.11% 0 Missing and 2 partials ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   77.63%   77.51%   -0.12%     
==========================================
  Files         136      136              
  Lines       35094    34278     -816     
  Branches    35094    34278     -816     
==========================================
- Hits        27244    26572     -672     
+ Misses       5793     5728      -65     
+ Partials     2057     1978      -79     


Contributor Author

@zslayton zslayton left a comment

🗺️ PR Tour 🧭

@@ -57,7 +57,7 @@ compact_str = "0.8.0"
chrono = { version = "0.4", default-features = false, features = ["clock", "std", "wasmbind"] }
delegate = "0.12.0"
thiserror = "1.0"
-nom = "7.1.1"
+winnow = { version = "0.6", features = ["simd"] }
Contributor Author

🪧 The simd feature enables the memchr operation when scanning for an expected token.

@@ -47,9 +47,8 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 {

let text_1_1_data = r#"(:event 1670446800245 418 "6" "1" "abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);

let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM
Contributor Author

🪧 This benchmark is really showing its age. When it was written, there was no support for reading encoding directives, so the tests/benchmarks manually compiled and registered their own templates. Now that the readers manage their encoding context as expected, reading a leading IVM clears the manually registered templates.

When our managed writer API is fleshed out, we'll have a way to hand a macro to the writer so it gets serialized in the data stream. For now, we simply skip the IVM in binary 1.1.

Comment on lines -447 to +444
-let mut reader = LazyRawBinaryReader_1_1::new(binary_1_1_data);
+let mut reader = LazyRawBinaryReader_1_1::new(context_ref, binary_1_1_data);
 let mut num_top_level_values: usize = 0;
 // Skip past the IVM
-reader.next(context_ref).unwrap().expect_ivm().unwrap();
+reader.next().unwrap().expect_ivm().unwrap();
Contributor Author

🪧 The raw readers now take a reference to the encoding context at construction time instead of having it be passed into each call to next().

Taking the context as an argument to next() was intended to allow a raw reader to exist as long as needed, with the context being provided any time it read. In practice, however, the raw readers only exist long enough to read a single top-level value from the stream, and requiring a context ref at every turn gets pretty tedious.
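
To illustrate the shape of the change, here is a small sketch with invented types. `Context`, `ReaderBefore`, and `ReaderAfter` are hypothetical and much simpler than the real encoding context and raw readers:

```rust
/// Hypothetical stand-in for the encoding context.
struct Context {
    symbols: Vec<&'static str>,
}

/// Before: the context is handed to every call to `next`.
struct ReaderBefore<'a> {
    symbol_ids: &'a [usize],
}

impl<'a> ReaderBefore<'a> {
    fn new(symbol_ids: &'a [usize]) -> Self {
        Self { symbol_ids }
    }

    fn next(&mut self, context: &Context) -> Option<&'static str> {
        let ids = self.symbol_ids;
        let (&id, rest) = ids.split_first()?;
        self.symbol_ids = rest;
        context.symbols.get(id).copied()
    }
}

/// After: the context is captured once, at construction time.
struct ReaderAfter<'a> {
    context: &'a Context,
    symbol_ids: &'a [usize],
}

impl<'a> ReaderAfter<'a> {
    fn new(context: &'a Context, symbol_ids: &'a [usize]) -> Self {
        Self { context, symbol_ids }
    }

    fn next(&mut self) -> Option<&'static str> {
        let ids = self.symbol_ids;
        let (&id, rest) = ids.split_first()?;
        self.symbol_ids = rest;
        self.context.symbols.get(id).copied()
    }
}
```

The trade-off is that the reader now borrows the context for its whole lifetime, which is fine when a reader only lives long enough to read one top-level value.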

@@ -45,28 +42,24 @@ use crate::lazy::raw_stream_item::LazyRawStreamItem;
use crate::lazy::raw_value_ref::RawValueRef;
use crate::lazy::span::Span;
use crate::lazy::streaming_raw_reader::RawReaderState;
use crate::lazy::text::raw::r#struct::{
Contributor Author

🪧 Many of the raw-level, container-related types are now generic over their TextEncoding, allowing them to work with both 1.0 and 1.1. The changes in this file reflect that update.
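
As a rough illustration of how one generic type can replace a pair of version-specific ones, consider the sketch below. The `Encoding` trait, the `Ion10`/`Ion11` markers, and `RawSExp` are simplified, hypothetical stand-ins; the real `TextEncoding` trait has a different and much larger surface:

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for the `TextEncoding` trait.
trait Encoding {
    /// Version-specific behavior lives behind the trait.
    fn is_macro_start(text: &str) -> bool;
}

struct Ion10;
struct Ion11;

impl Encoding for Ion10 {
    fn is_macro_start(_text: &str) -> bool {
        false // assumption for this sketch: no macro invocations in 1.0
    }
}

impl Encoding for Ion11 {
    fn is_macro_start(text: &str) -> bool {
        text.starts_with("(:")
    }
}

/// One container type serves both versions instead of a `_1_0` / `_1_1` pair.
struct RawSExp<'a, E: Encoding> {
    text: &'a str,
    _encoding: PhantomData<E>,
}

impl<'a, E: Encoding> RawSExp<'a, E> {
    fn new(text: &'a str) -> Self {
        Self { text, _encoding: PhantomData }
    }

    fn is_macro_invocation(&self) -> bool {
        E::is_macro_start(self.text)
    }
}
```

The shared logic is written once in the `impl<'a, E: Encoding>` block, and only the genuinely different behavior is implemented per version.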

}

-fn resume_at_offset(data: &'data [u8], offset: usize, mut encoding_hint: IonEncoding) -> Self {
+fn resume(context: EncodingContextRef<'data>, mut saved_state: RawReaderState<'data>) -> Self {
Contributor Author

🪧 The RawReaderState type already existed but wasn't used in this method (which predated it). Replacing the individual arguments with one type made it easy to add a field to the state in a centralized place.
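
A sketch of why a single state type helps, using invented names. `SavedState`, its fields (including `is_final_data`), and `Reader` are assumptions for illustration and do not mirror the real `RawReaderState`:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Text,
    Binary,
}

/// Hypothetical stand-in for `RawReaderState`: everything needed to resume reading.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SavedState<'a> {
    data: &'a [u8],
    offset: usize,
    encoding: Encoding,
    // A field added later only touches this struct, not every `resume` signature.
    is_final_data: bool,
}

struct Reader<'a> {
    state: SavedState<'a>,
}

impl<'a> Reader<'a> {
    /// One argument instead of one parameter per piece of state.
    fn resume(saved_state: SavedState<'a>) -> Self {
        Reader { state: saved_state }
    }

    /// Consumes up to `n` bytes, keeping the stream offset in sync.
    fn skip(&mut self, n: usize) {
        let data = self.state.data;
        let n = n.min(data.len());
        self.state.data = &data[n..];
        self.state.offset += n;
    }

    fn save(&self) -> SavedState<'a> {
        self.state
    }
}
```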

Comment on lines 780 to +781
 pub fn match_argument_for(
-self,
+&mut self,
Contributor Author

🪧 The parsing methods now use &mut self, allowing them to avoid defining new variables at each step of parsing (unless that's what you want).


/// Matches a parser that must be followed by input that matches `terminator`.
Contributor Author

🪧 More incompleteness detection special casing.


// === nom trait implementations ===
Contributor Author

🪧 Here we're switching over nom trait implementations to winnow trait implementations.

Comment on lines -3063 to +2389
],
expect_incomplete: [
"0x", // Base 16 prefix w/no number
"0b", // Base 2 prefix w/no number
]
],
Contributor Author

🪧 Because these unit tests are all reading from fixed slices, the parser will never return Incomplete. Those inputs have been moved to the expect_mismatch sections. We have a separate test suite just for incompleteness detection anyway.
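
The distinction can be sketched with a toy matcher that is told whether its input is final. `Outcome` and `match_hex_int` are hypothetical names, and the logic is a simplification of what a real partial-aware parser does:

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Number of bytes matched.
    Matched(usize),
    Mismatch,
    Incomplete,
}

/// Hypothetical hex-int matcher. `is_final` is true when no more input can arrive,
/// as is the case for a fixed slice in a unit test.
fn match_hex_int(input: &[u8], is_final: bool) -> Outcome {
    if !input.starts_with(b"0x") {
        // On partial input, a strict prefix of "0x" could still grow into a match.
        return if !is_final && b"0x".starts_with(input) {
            Outcome::Incomplete
        } else {
            Outcome::Mismatch
        };
    }
    let digits = &input[2..];
    let count = digits.iter().take_while(|b| b.is_ascii_hexdigit()).count();
    match (count, count == digits.len(), is_final) {
        // Ran off the end of a partial buffer: more digits may still arrive.
        (_, true, false) => Outcome::Incomplete,
        // A prefix with no digits after it is simply not an int.
        (0, _, _) => Outcome::Mismatch,
        (n, _, _) => Outcome::Matched(2 + n),
    }
}
```

With `is_final` set to `true`, the input `0x` can only ever be a mismatch, which is why those cases belong in the `expect_mismatch` lists for fixed-slice tests.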

-impl<'data> LazyRawSequence<'data, TextEncoding_1_0> for LazyRawTextSExp_1_0<'data> {
-type Iterator = RawTextSExpIterator_1_0<'data>;
+impl<'data, E: TextEncoding<'data>> LazyRawSequence<'data, E> for RawTextSExp<'data, E> {
+type Iterator = RawTextSequenceCacheIterator<'data, E>;
Contributor Author

🪧 All of the text containers now cache their child expressions and iterate over the cache as needed.
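
A toy version of the idea: lex the container once, remember where each child sits, and serve every later iteration from that cache. This sketch uses a plain `Vec` where the PR uses the bump allocator, handles only flat lists of simple tokens, and `CachedList` is a hypothetical name:

```rust
use std::ops::Range;

/// Hypothetical container that caches the byte range of each child value.
struct CachedList<'a> {
    text: &'a str,
    children: Vec<Range<usize>>,
}

impl<'a> CachedList<'a> {
    /// Lexes `[a, b, c]` a single time. Simplified: no nesting, strings, or comments,
    /// and empty elements are silently skipped rather than rejected.
    fn parse(text: &'a str) -> Option<Self> {
        let body = text.strip_prefix('[')?.strip_suffix(']')?;
        let mut children = Vec::new();
        let mut start = 1; // offset of `body` within `text`
        for piece in body.split(',') {
            let trimmed = piece.trim();
            if !trimmed.is_empty() {
                let lead = piece.len() - piece.trim_start().len();
                children.push(start + lead..start + lead + trimmed.len());
            }
            start += piece.len() + 1; // +1 for the comma that `split` removed
        }
        Some(Self { text, children })
    }

    /// Iterates over the cached spans; no re-lexing happens here.
    fn iter<'s>(&'s self) -> impl Iterator<Item = &'a str> + 's {
        let text = self.text;
        self.children.iter().map(move |span| &text[span.clone()])
    }
}
```

Iterating twice costs two walks over the cached spans rather than two passes over the text.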

Comment on lines -16 to -30
// These tests are all failing because multipart long strings are not handled correctly when the
// "part" boundary happens to also fall on a point where the reader needs to refill the input buffer.
const INCOMPLETE_LONG_STRING_SKIP_LIST: SkipList = &[
"ion-tests/iontestdata/good/equivs/localSymbolTableAppend.ion",
"ion-tests/iontestdata/good/equivs/localSymbolTableNullSlots.ion",
"ion-tests/iontestdata/good/equivs/longStringsWithComments.ion",
"ion-tests/iontestdata/good/equivs/strings.ion",
"ion-tests/iontestdata/good/lists.ion",
"ion-tests/iontestdata/good/strings.ion",
"ion-tests/iontestdata/good/stringsWithWhitespace.ion",
"ion-tests/iontestdata/good/strings_cr_nl.ion",
"ion-tests/iontestdata/good/strings2.ion",
"ion-tests/iontestdata/good/structs.ion",
"ion-tests/iontestdata/good/strings_nl.ion",
];
Contributor Author

🎉

@zslayton zslayton marked this pull request as ready for review January 7, 2025 21:10
@zslayton zslayton changed the title Winnow experiment Overhaul the text parsers, port from nom to winnow Jan 7, 2025
Comment on lines +81 to +82
#[allow(clippy::should_implement_trait)]
pub fn next(&mut self) -> IonResult<LazyRawStreamItem<'data, BinaryEncoding_1_0>> {
Contributor

Just a thought—could we make the readers implement Iterator? Would it provide any benefit for users?

Comment on lines 281 to 282
/// If `true`, the current contents of the buffer may not be the complete stream.
fn is_streaming(&self) -> bool;
Contributor

Any particular reason to use a function rather than a constant for this? (Then you probably wouldn't need to have the #[inline(always)] on the implementations because the compiler should be smart enough to inline a constant value.)

Contributor Author

Until you asked, I was certain that associated consts weren't stable on traits yet. But it works! 🎉
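
For reference, an associated const on a trait looks roughly like this. The trait, the const name `IS_STREAMING`, and the two input types are hypothetical; the thread does not show what the final code named them:

```rust
/// Hypothetical input-source trait using an associated const instead of a method.
trait InputSource {
    /// If `true`, the current contents of the buffer may not be the complete stream.
    const IS_STREAMING: bool;
}

struct SliceInput;
struct StreamInput;

impl InputSource for SliceInput {
    const IS_STREAMING: bool = false;
}

impl InputSource for StreamInput {
    const IS_STREAMING: bool = true;
}

#[derive(Debug, PartialEq)]
enum EndOfInput {
    Done,
    NeedMoreData,
}

/// The branch is on a compile-time constant, so each monomorphized copy
/// can be reduced to a single outcome without any `#[inline(always)]` hints.
fn on_end_of_input<I: InputSource>() -> EndOfInput {
    if I::IS_STREAMING {
        EndOfInput::NeedMoreData
    } else {
        EndOfInput::Done
    }
}
```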

-let buffer = TextBuffer::new(context, data.as_bytes());
-let (_remaining, matched) = buffer.match_int().unwrap();
+let buffer = TextBuffer::new(context, data.as_bytes(), true);
+let matched = buffer.clone().match_int().unwrap();
Contributor

Seeing clone() occurring frequently is a little unexpected. Is that just because winnow requires cloning instead of borrowing, or is this a change in the strategy that is not specific to winnow/nom?

Contributor Author

It's a weird corner case for the unit tests.

Winnow's parsers use &mut self and consume any matched data rather than returning a copy of the remaining input slice. This means that buffer.match_int() would return a MatchedInt and consume buffer. When the unit test went to actually read the int from the matched bytes:

let actual = matched.read(buffer).unwrap();

...buffer would be empty, so it would fail.

I decided to match on a copy of buffer so the original would remain unmodified. However, buffer was a value, not a reference, so I couldn't simply dereference it to create a copy. I had three options:

  • Make buffer a &mut TextBuffer that I could then dereference, adding some weirdness to the declaration in the process.
  • Call winnow's Stream::checkpoint() method, which saves the parser state (in this case, copying the TextBuffer), but which people won't intuitively understand.
  • Call clone() (bleh) and live with it 'cause it's only used in the unit tests.

I went with the last option. However, in writing this up I realized there was a fourth option that's more idiomatic to winnow: using peek(). I've updated the unit tests to use that instead.
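
The idea behind peeking can be sketched without winnow: run the parser, then put the input back where it was. `Cursor` and this free-standing `peek` function are hypothetical and only imitate the behavior; they are not winnow's API:

```rust
/// Hypothetical stand-in for `TextBuffer`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Cursor<'a> {
    data: &'a [u8],
}

impl<'a> Cursor<'a> {
    /// Consumes leading ASCII digits in place and returns them.
    fn match_int(&mut self) -> Option<&'a [u8]> {
        let data = self.data;
        let len = data.iter().take_while(|b| b.is_ascii_digit()).count();
        if len == 0 {
            return None;
        }
        self.data = &data[len..];
        Some(&data[..len])
    }
}

/// Runs `parser`, then restores the cursor so the caller sees no input consumed.
fn peek<'a, T>(cursor: &mut Cursor<'a>, parser: impl FnOnce(&mut Cursor<'a>) -> T) -> T {
    let checkpoint = *cursor;
    let output = parser(cursor);
    *cursor = checkpoint;
    output
}
```

In a unit test this keeps the original buffer intact for a later read of the matched bytes, without a `clone()` at the call site.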

Comment on lines 40 to 46
let result = whitespace_and_then(alt((
"}".value(None),
terminated(
E::field_expr_matcher().map(Some),
whitespace_and_then(opt(",")),
),
)))
Contributor

This is surprising to me. The , is only optional if it's followed by }, so I would have expected to see something more like this:

whitespace_and_then(alt((
    (opt(","), whitespace_and_then ("}")).value(None),
    (",", whitespace_and_then(E::field_expr_matcher()).map(Some))
)))

Contributor Author

@zslayton zslayton Jan 9, 2025

You were right to be surprised--it is indeed wrong:
image
🤦

Similarly, I was surprised to learn that neither ion-tests nor ion-rust's test suite includes a test for a struct missing a comma. I opened amazon-ion/ion-tests#150 to track adding a test to ion-tests and have added a unit test for this to ion-rust.

whitespace_and_then(alt((
    (opt(","), whitespace_and_then ("}")).value(None),
    (",", whitespace_and_then(E::field_expr_matcher()).map(Some))
)))

This is also not quite right since the first field cannot be preceded by a comma.
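
One way to encode all three rules discussed in this thread (no comma before the first field, a comma required between fields, and the comma optional only directly before the closing brace) is to handle the first field outside the loop. The sketch below is plain Rust over `&str` with hypothetical helper names; it is not the winnow-based fix that was merged, and it only handles flat structs of simple tokens:

```rust
fn skip_ws(input: &mut &str) {
    let text = *input;
    *input = text.trim_start();
}

/// Consumes `expected` (after optional whitespace) and reports whether it was there.
fn eat(input: &mut &str, expected: char) -> bool {
    let text = *input;
    match text.trim_start().strip_prefix(expected) {
        Some(rest) => {
            *input = rest;
            true
        }
        None => false,
    }
}

/// Takes one simple token (no whitespace or punctuation) from the front of `input`.
fn take_token<'a>(input: &mut &'a str) -> Option<&'a str> {
    let text: &'a str = *input;
    let end = text
        .find(|c: char| c.is_whitespace() || ",:{}".contains(c))
        .unwrap_or(text.len());
    if end == 0 {
        return None;
    }
    *input = &text[end..];
    Some(&text[..end])
}

/// Matches `name: value`, where both sides are simple tokens.
fn match_field<'a>(input: &mut &'a str) -> Option<(&'a str, &'a str)> {
    skip_ws(input);
    let name = take_token(input)?;
    if !eat(input, ':') {
        return None;
    }
    skip_ws(input);
    let value = take_token(input)?;
    Some((name, value))
}

fn parse_struct(text: &str) -> Result<Vec<(&str, &str)>, &'static str> {
    let mut input = text;
    if !eat(&mut input, '{') {
        return Err("expected '{'");
    }
    let mut fields = Vec::new();
    if eat(&mut input, '}') {
        return Ok(fields);
    }
    // The first field is matched here, so a leading comma is never accepted.
    fields.push(match_field(&mut input).ok_or("expected a field")?);
    loop {
        let saw_comma = eat(&mut input, ',');
        if eat(&mut input, '}') {
            // The comma is optional only here, directly before the closing brace.
            return Ok(fields);
        }
        if !saw_comma {
            return Err("expected ',' or '}' after a field");
        }
        fields.push(match_field(&mut input).ok_or("expected a field after ','")?);
    }
}
```

With this shape, `{a: 1 b: 2}` and `{, a: 1}` are both rejected, while `{a: 1, b: 2}` is accepted with or without a comma before the `}`.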

@zslayton zslayton merged commit 0943766 into main Jan 9, 2025
35 checks passed
@zslayton zslayton deleted the winnow-experiment branch January 9, 2025 18:02