⚡ Better Faster Cleaner `STATUS` parsing #225

nevans · 2023-11-12T21:38:37Z

Although "number" is still the default `status-att-val`, this uses `ExtensionData` with RFC4466's `tagged-ext-val` for any unknown non-numeric `STATUS` attribute.

Running the benchmarks (on my phone, without YJIT) shows a 40% speedup!

invalid_status_response_trailing_space
      v0.4.4-16-g0be6b65b:     43956.1 i/s
                    0.4.4:     31788.6 i/s - 1.38x  slower

 rfc3501_7.2.4_STATUS_response_example
      v0.4.4-16-g0be6b65b:     45436.2 i/s
                    0.4.4:     32458.5 i/s - 1.40x  slower

   status_response_uidnext_uidvalidity
      v0.4.4-16-g0be6b65b:     45334.2 i/s
                    0.4.4:     32709.1 i/s - 1.39x  slower
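
As a quick sanity check on the "40%" figure, the ratios can be recomputed from the iterations-per-second numbers above. This is only arithmetic on the reported output; it does not rerun the benchmark, and the variable names are illustrative.

```ruby
# Iterations per second copied from the benchmark output above:
# [this branch, 0.4.4]
results = {
  "invalid_status_response_trailing_space" => [43956.1, 31788.6],
  "rfc3501_7.2.4_STATUS_response_example"  => [45436.2, 32458.5],
  "status_response_uidnext_uidvalidity"    => [45334.2, 32709.1],
}

# Ratio of new throughput to old throughput, rounded like the report.
speedups = results.transform_values do |new_ips, old_ips|
  (new_ips / old_ips).round(2)
end

speedups.each { |name, ratio| puts format("%-40s %.2fx", name, ratio) }
```

The three ratios land between 1.38x and 1.40x, which is where the rounded "40%" comes from.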

The SequenceSet class is only a placeholder for now, because the more complete implementation isn't ready yet. But we need `sequence-set` for `tagged-ext-val`, and we need `tagged-ext-val` for the RFC4466 extension grammar for `STATUS`, `ESEARCH`, `LIST`, etc. The more complete SequenceSet implementation is needed for `ESEARCH`.

Various changes:

* Add alias for `mailbox` to `astring`.
* Use char token matchers (faster than `match(T_#{name})`).
* Extract `status-att-list` and `status-att-val` methods, to mimic ABNF.
* Add a case statement to `status-att-val` and explicitly match all RFC3501 and RFC9051 status attributes.
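
To make the shape of the `status-att-list` / `status-att-val` split concrete, here is a minimal standalone sketch. It is not net-imap's `ResponseParser`: the method names, the `RawExtension` struct (a stand-in for `ExtensionData`), and the use of `StringScanner` are all assumptions made for illustration, and the fallback for unknown values only captures a raw token instead of parsing a full `tagged-ext-val`.

```ruby
require "strscan"

# Status attributes named in RFC3501 and RFC9051; all take a number.
NUMERIC_STATUS_ATTS = %w[
  MESSAGES RECENT UIDNEXT UIDVALIDITY UNSEEN DELETED SIZE
].freeze

# Hypothetical stand-in for ExtensionData: just keeps the raw text.
RawExtension = Struct.new(:text)

# status-att-val: known attributes must be numbers; unknown attributes
# default to a number and otherwise fall back to raw extension data.
def parse_status_att_val(scanner, att)
  case att
  when *NUMERIC_STATUS_ATTS
    digits = scanner.scan(/\d+/) or raise ArgumentError, "#{att} needs a number"
    Integer(digits, 10)
  else
    if (digits = scanner.scan(/\d+(?=[ )])/))
      Integer(digits, 10)
    else
      text = scanner.scan(/[^ ()]+/) or raise ArgumentError, "bad value for #{att}"
      RawExtension.new(text)
    end
  end
end

# status-att-list: "(" [att SP val *(SP att SP val)] ")"
def parse_status_att_list(str)
  scanner = StringScanner.new(str)
  scanner.skip(/\(/) or raise ArgumentError, "expected '('"
  attrs = {}
  until scanner.skip(/\)/)
    scanner.skip(/ /) unless attrs.empty?
    att = scanner.scan(/[A-Z0-9-]+/) or raise ArgumentError, "expected attribute name"
    scanner.skip(/ /) or raise ArgumentError, "expected SP after #{att}"
    attrs[att] = parse_status_att_val(scanner, att)
  end
  attrs
end

p parse_status_att_list("(MESSAGES 231 UIDNEXT 44292)")
p parse_status_att_list("(MESSAGES 3 X-SEQ 1:5)")
```

In this sketch a trailing space before the closing parenthesis raises `ArgumentError`; how the real parser treats that input is not stated here.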

The version of SequenceSet in net-imap prior to this commit was merely a placeholder, needed in order to complete `tagged-ext` for ruby#225. This updates it with a full API, inspired by Set, Range, and Array. This allows it to be more broadly useful, e.g. for storing and working with mailbox state.

In addition to Integer, Range, and enumerables, any object with `#to_sequence_set` can now be used to create a sequence set. For compatibility with MessageSet, `ThreadMember#to_sequence_set` collects all child seqno into a SequenceSet.

Because mailbox state can be _very_ large, inputs are stored in an internal sorted array of ranges. These are stored as `[start, stop]` tuples, not Range objects, for simpler manipulation. A future optimization could convert all tuples to a flat one-dimensional Array (to reduce object allocations).

Storing the data in sorted range tuples allows many of the important operations to be `O(lg n)`. Although updates do use `Array#insert` and `Array#slice!` (which are technically `O(n)`), they tend to be fast until the number of elements is very large. Count and index-based methods are also `O(n)`. A future optimization could cache the count and compose larger sets from a sorted tree of smaller sets, to preserve `O(lg n)` for most operations.

SequenceSet can be used to replace MessageSet (which is used internally to validate, format, and send certain command args). Some notable differences between the two:

* Most validation is done up-front, when initializing or adding values.
* A ThreadMember to `sequence-set` bug has been fixed.
* The generated string is sorted and adjacent ranges are combined.

TODO in future PRs:

* `#index_lte` => get the index of a number in the set, or if the number isn't in the set, the number before it.
* Replace or supplement the UID set implementation in UIDPlusData.
* Fully replace MessageSet (probably not before v0.5.0).
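
The storage strategy described above (sorted `[start, stop]` tuples, binary search for lookups, `Array#slice!` and `Array#insert` for updates) can be sketched roughly as follows. `TinyRangeSet` is a hypothetical class written for this illustration; it is not net-imap's `SequenceSet` and implements none of its validation or its wider API.

```ruby
# Hypothetical illustration only; not net-imap's SequenceSet.
class TinyRangeSet
  def initialize
    # Sorted, non-overlapping, non-adjacent [start, stop] pairs.
    @tuples = []
  end

  # O(lg n): binary search for the first tuple whose stop >= num,
  # then check that the tuple actually starts at or before num.
  def include?(num)
    tuple = @tuples.bsearch { |_, stop| stop >= num }
    !tuple.nil? && tuple[0] <= num
  end

  # O(lg n) search, then O(n) Array#slice! and Array#insert to merge
  # every tuple that overlaps or touches the new range.
  def add(start, stop = start)
    lo = @tuples.bsearch_index { |_, t_stop| t_stop >= start - 1 } || @tuples.size
    hi = lo
    while hi < @tuples.size && @tuples[hi][0] <= stop + 1
      start = [start, @tuples[hi][0]].min
      stop  = [stop,  @tuples[hi][1]].max
      hi += 1
    end
    @tuples.slice!(lo, hi - lo)
    @tuples.insert(lo, [start, stop])
    self
  end

  # Sorted output with adjacent ranges combined, e.g. "1:3,5:10".
  def to_s
    @tuples.map { |a, b| a == b ? a.to_s : "#{a}:#{b}" }.join(",")
  end
end

set = TinyRangeSet.new
set.add(5, 7).add(1).add(8, 10).add(3).add(2)
puts set.to_s
```

Because adjacent and overlapping ranges are merged on insert, the tuple array stays sorted and minimal, which is what keeps the binary search valid.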

nevans added 4 commits November 12, 2023 20:10

🚧 Add ExtensionData for unhandled extensions

aacc2ac

Although this is currently unused, it should eventually be used for `StatusData`, `BodyStructure`, `ESEARCH`, `MailboxList`, etc.

🚧 Parse RFC4466 tagged-ext-val

c52853e

Although this is currently unused, we need `tagged-ext-val` for the RFC4466 extension grammar for `STATUS`, `ESEARCH`, `LIST`, etc.

nevans force-pushed the parser/better-faster-cleaner-status branch from 6129112 to 8070925 Compare November 13, 2023 01:11

nevans merged commit dcbdb21 into ruby:master Nov 13, 2023
11 checks passed

nevans deleted the parser/better-faster-cleaner-status branch November 13, 2023 01:22

This was referenced Nov 13, 2023

Add TaggedExtensionData (with #to_hash) #127

Open

Add ExtensionDataList (with #to_ary) #128

Open

Add UnparsedData (with #to_str) #129

Open

RFC4466 (2006): Collected Extensions to IMAP4 ABNF #35

Open

Support for IMAP4rev2 and modern extensions #12

Open

nevans mentioned this pull request Nov 22, 2023

✨ Add support for the CONDSTORE extension (RFC7162) #236

Merged

24 tasks

nevans mentioned this pull request Dec 11, 2023

✨ Improve SequenceSet with Set, Range, Enumerable methods #239

Merged

73 tasks
