⚡ Better Faster Cleaner STATUS parsing #225

Merged: 4 commits, Nov 13, 2023

@nevans nevans commented Nov 12, 2023

Although "number" is still the default `status-att-val`, this uses
ExtensionData with RFC4466's `tagged_ext_val` for any unknown
non-numeric `STATUS` attribute.

Running the benchmarks (on my phone, without YJIT) shows a 40% speedup!

    invalid_status_response_trailing_space
          v0.4.4-16-g0be6b65b:     43956.1 i/s
                        0.4.4:     31788.6 i/s - 1.38x  slower

     rfc3501_7.2.4_STATUS_response_example
          v0.4.4-16-g0be6b65b:     45436.2 i/s
                        0.4.4:     32458.5 i/s - 1.40x  slower

       status_response_uidnext_uidvalidity
          v0.4.4-16-g0be6b65b:     45334.2 i/s
                        0.4.4:     32709.1 i/s - 1.39x  slower

The SequenceSet class is only a placeholder for now, because the more
complete implementation isn't ready yet.  But we need `sequence-set` for
`tagged-ext-val`, and we need `tagged-ext-val` for the RFC4466
extension grammar for `STATUS`, `ESEARCH`, `LIST`, etc.

The more complete SequenceSet implementation is needed for `ESEARCH`.

Although this is currently unused, it should eventually be used for
`StatusData`, `BodyStructure`, `ESEARCH`, `MailboxList`, etc.

Although this is currently unused, we need `tagged-ext-val` for the
RFC4466 extension grammar for `STATUS`, `ESEARCH`, `LIST`, etc.

Various changes:
* Add alias for `mailbox` to `astring`.
* Use char token matchers (faster than `match(T_#{name})`).
* Extract `status-att-list` and `status-att-val` methods, to mimic ABNF.
* Add a case statement to `status-att-val` and explicitly match all
  RFC3501 and RFC9051 status attributes.
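
The last two bullets can be illustrated with a small standalone sketch. It is not net-imap's `ResponseParser`: the method names `parse_status_att_list` and `parse_status_att_val` are hypothetical, the input is only the text between the parentheses of a STATUS response, and unknown non-numeric values are kept as raw strings where the real parser would build an ExtensionData.

```ruby
require "strscan"

# Hypothetical mini-parser for the contents of a STATUS att-list, such as
# "MESSAGES 231 UIDNEXT 44292".  Names here are illustrative only.
def parse_status_att_list(str)
  s = StringScanner.new(str)
  attrs = {}
  until s.eos?
    name = s.scan(/[A-Za-z][A-Za-z0-9-]*/) or raise ArgumentError, "expected status-att"
    s.scan(/ /) or raise ArgumentError, "expected SP"
    attrs[name.upcase] = parse_status_att_val(s, name.upcase)
    s.scan(/ /) # optional separator before the next pair
  end
  attrs
end

def parse_status_att_val(s, name)
  case name
  when "MESSAGES", "RECENT", "UIDNEXT", "UIDVALIDITY", "UNSEEN", "DELETED", "SIZE"
    # Attributes named in RFC3501 and RFC9051 all take a number.
    digits = s.scan(/\d+/) or raise ArgumentError, "expected number for #{name}"
    Integer(digits, 10)
  else
    # Unknown attribute: a bare number is still the default; anything else
    # is kept as raw text here, standing in for an ExtensionData wrapper.
    if (digits = s.scan(/\d+(?= |\z)/))
      Integer(digits, 10)
    else
      s.scan(/\([^)]*\)|[^ ]+/) or raise ArgumentError, "expected value for #{name}"
    end
  end
end

parse_status_att_list("MESSAGES 231 UIDNEXT 44292")
# => {"MESSAGES"=>231, "UIDNEXT"=>44292}
```

Splitting the list loop from the value dispatch mirrors the two ABNF rules, and the explicit `when` branch keeps the common numeric attributes on a short path.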
@nevans nevans force-pushed the parser/better-faster-cleaner-status branch from 6129112 to 8070925 Compare November 13, 2023 01:11
@nevans nevans merged commit dcbdb21 into ruby:master Nov 13, 2023
11 checks passed
@nevans nevans deleted the parser/better-faster-cleaner-status branch November 13, 2023 01:22
nevans added a commit to nevans/net-imap that referenced this pull request Dec 11, 2023
The version of SequenceSet in net-imap prior to this commit was merely a
placeholder, needed in order to complete `tagged-ext` for ruby#225.

This updates it with a full API, inspired by Set, Range, and Array.
This allows it to be more broadly useful, e.g. for storing and working
with mailbox state.

In addition to Integer, Range, and enumerables, any object with
`#to_sequence_set` can now be used to create a sequence set.  For
compatibility with MessageSet, `ThreadMember#to_sequence_set` collects
all child seqno into a SequenceSet.

Because mailbox state can be _very_ large, inputs are stored in an
internal sorted array of ranges.  These are stored as `[start, stop]`
tuples, not Range objects, for simpler manipulation.  A future
optimization could convert all tuples to a flat one-dimensional Array
(to reduce object allocations).  Storing the data in sorted range tuples
allows many of the important operations to be `O(lg n)`.
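
To make the `O(lg n)` claim concrete, here is a minimal sketch of a membership test over sorted `[start, stop]` tuples. The `TupleSet` class is hypothetical and is not the SequenceSet implementation; it assumes the tuples are already sorted, disjoint, and non-adjacent.

```ruby
# Hypothetical illustration of lookup over sorted range tuples.
# This is not net-imap's SequenceSet code.
class TupleSet
  def initialize(tuples)
    @tuples = tuples # assumed sorted, disjoint [start, stop] pairs
  end

  # Binary search for the first tuple whose stop is >= num, then check
  # that num is not below that tuple's start.  O(lg n) in the tuple count.
  def include?(num)
    tuple = @tuples.bsearch { |_start, stop| stop >= num }
    !tuple.nil? && tuple.first <= num
  end
end

set = TupleSet.new([[1, 5], [8, 8], [10, 2_000_000]])
set.include?(1_500_000) # => true
set.include?(9)         # => false
```

Three tuples cover about two million numbers here, which is why the cost depends on the number of ranges rather than the number of messages.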

Although updates do use `Array#insert` and `Array#slice!`—which are
technically `O(n)`—they tend to be fast until the number of elements is
very large.  Count and index-based methods are also `O(n)`.  A future
optimization could cache the count and compose larger sets from a sorted
tree of smaller sets, to preserve `O(lg n)` for most operations.
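
The update path described above might look roughly like the following. `add_range!` is a hypothetical helper written for this illustration, not a method from net-imap; it assumes a sorted, disjoint, non-adjacent array of integer `[start, stop]` tuples and mutates it in place.

```ruby
# Hypothetical helper, not net-imap's implementation.
# tuples: sorted, disjoint, non-adjacent [start, stop] pairs.
def add_range!(tuples, start, stop)
  # Find the first tuple that overlaps or touches the new range: O(lg n).
  lo = tuples.bsearch_index { |_s, e| e >= start - 1 } || tuples.size
  # Absorb every following tuple that the new range overlaps or touches.
  hi = lo
  while hi < tuples.size && tuples[hi][0] <= stop + 1
    start = [start, tuples[hi][0]].min
    stop  = [stop,  tuples[hi][1]].max
    hi += 1
  end
  tuples.slice!(lo, hi - lo)       # O(n): shifts the tail left
  tuples.insert(lo, [start, stop]) # O(n): shifts the tail right
  tuples
end

add_range!([[1, 5], [8, 8], [10, 20]], 6, 7)
# => [[1, 8], [10, 20]]
```

Finding the position is the binary-search part; only the `slice!` and `insert` calls touch the rest of the array, and those are plain memory moves.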

SequenceSet can be used to replace MessageSet (which is used internally
to validate, format, and send certain command args).  Some notable
differences between the two:
* Most validation is done up-front, when initializing or adding values.
* A ThreadMember to `sequence-set` bug has been fixed.
* The generated string is sorted and adjacent ranges are combined.
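
As an illustration of the last difference, here is a sketch of how unsorted input can end up as a sorted string with adjacent ranges combined. `format_sequence_set` is a hypothetical function, not part of the net-imap API, and it handles only Integers and inclusive Integer Ranges (no `*`).

```ruby
# Hypothetical formatter; not the net-imap API.
def format_sequence_set(inputs)
  # Normalize every input to a [start, stop] pair, then sort by start.
  pairs = inputs.map { |x| x.is_a?(Range) ? [x.begin, x.end] : [x, x] }.sort
  merged = pairs.each_with_object([]) do |(start, stop), out|
    if out.any? && start <= out.last[1] + 1
      out.last[1] = [out.last[1], stop].max # overlapping or adjacent
    else
      out << [start, stop]
    end
  end
  merged.map { |s, e| s == e ? s.to_s : "#{s}:#{e}" }.join(",")
end

format_sequence_set([9, 1..3, 4, 7..8, 2]) # => "1:4,7:9"
```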

TODO in future PRs:
* #index_lte => get the index of a number in the set, or if the number
  isn't in the set, the number before it.
* Replace or supplement the UID set implementation in UIDPlusData.
* fully replace MessageSet (probably not before v0.5.0)
nevans added a commit that referenced this pull request Dec 11, 2023