Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

💥⚡ Simplify header-fld-name parser (backward incompatible) #216

Merged
merged 1 commit into from
Sep 13, 2024

Conversation

nevans
Copy link
Collaborator

@nevans nevans commented Nov 3, 2023

This speeds up my (newly added) benchmarks by ~15-20% over v0.4.3, and by 22-40% over earlier versions.

NOTE: In every version up to v0.4.3, Net::IMAP recreated the raw original source string. After #217, it slices the raw original source string. After this PR, it returns the decoded astring value. Although this is technically incompatible, it should almost never make a difference. See the IANA Message Header Field Names list. All standard header field names are valid IMAP atoms. So I think this incompatibility should almost never occur.

Valid RFC-5322 field names will never require string literals. But they technically may include atom-special characters and thus need to be quoted. Note that RFC-6532 (I18N headers) explicitly does not change the RFC-5322 field name syntax.

RFC-5322 syntax:

field-name      =   1*ftext
ftext           =   %d33-57 /          ; Printable US-ASCII
                    %d59-126           ;  characters not including
                                       ;  ":".

Which is matched by the following Regexp:

33.chr  => "!"
57.chr  => "9"
59.chr  => ";"
126.chr => "~"
[*33..57, *59..126].map{_1.chr}.join =>
  "!\"\#$%&'()*+,-./0123456789;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
VALID_RFC5322_FIELD_NAME = /\A[!-9;-~]+\z/

Although it shouldn't, if a server unnecessarily uses a quoted string (or a literal) for any standard message headers, this PR simplifies accessing the result by normalizing the field name back to its atom form.

The real incompatibility occurs when fetching non-standard but syntactically valid RFC-5322 field names, containing atom specials, which need to be quoted. But the workaround is simple.

For example, with the following non-standard (but syntactically valid) field names:

field_names = %w[
  \\Foo%
  ("BAR")
]
field_names.all?(VALID_RFC5322_FIELD_NAME) => true

The current version of Net::IMAP#fetch doesn't quote any attrs, so in order to fetch these we'd need to manually quote them:

quoted_names = field_names
  .map { _1.gsub(/["\\]/) { "\\#$&" } }
  .join(" ")
joined_names = field_names.join(" ")

joined_names => "\\Foo% (\"BAR\")"
quoted_names => "\"\\\\Foo%\" \"(\\\"BAR\\\")\""

imap.fetch(1, "BODY[HEADER.FIELDS (#{quoted_names})]") => [fetch_data]

All of the above is unchanged by this PR. The incompatibility is when retrieving the results from the FetchData:

# In the current version (v0.4.3), access using quoted names:
fetch_data.attr['BODY[HEADER.FIELDS (#{quoted_names})'] => String

# After this PR, access using unquoted names:
fetch_data.attr['BODY[HEADER.FIELDS (#{joined_names})'] => String

However, I also prepared a version that is backward-compatible, with a smaller performance boost:

@nevans nevans changed the title ⚡ Simplify header-fld-name parser ⚡ Simplify header-fld-name parser (backward incompatible) Nov 3, 2023
@nevans nevans force-pushed the parser/header-fld-name branch 3 times, most recently from 6077ead to ca76d75 Compare November 7, 2023 04:43
@nevans nevans changed the title ⚡ Simplify header-fld-name parser (backward incompatible) ⚡💥 Simplify header-fld-name parser (backward incompatible) Nov 7, 2023
@nevans nevans changed the title ⚡💥 Simplify header-fld-name parser (backward incompatible) 💥⚡ Simplify header-fld-name parser (backward incompatible) Nov 7, 2023
@nevans nevans marked this pull request as draft November 10, 2023 14:09
@nevans nevans force-pushed the parser/header-fld-name branch from ca76d75 to a953f7f Compare April 30, 2024 13:12
@nevans nevans added the v0.5 label May 19, 2024
@nevans nevans added this to the v0.5 milestone May 19, 2024
@nevans nevans removed the v0.5 label May 19, 2024
@nevans nevans force-pushed the parser/header-fld-name branch from a953f7f to 755af6c Compare May 19, 2024 14:54
@nevans nevans force-pushed the parser/header-fld-name branch 3 times, most recently from e434500 to c000439 Compare June 22, 2024 16:16
@nevans nevans force-pushed the parser/header-fld-name branch from c000439 to f077440 Compare June 26, 2024 14:57
@nevans nevans modified the milestones: v0.5, v0.6 Jun 29, 2024
@nevans nevans modified the milestones: v0.6, v0.5 Sep 13, 2024
@nevans
Copy link
Collaborator Author

nevans commented Sep 13, 2024

Note that servers may opt to use a quoted or literal string for atoms that do not require it. I'm not sure if any do that when the client uses the atom form, or if they are simply repeating back what the user sent to them.

At any rate, if we always use the parsed version (which is what this PR does) it normalizes the result and can simplify the client code. Then the client can simply use FetchData#header and not worry about different ways to encode the same header.

I wasn't sure if I wanted to merge this for v0.5, but #315 convinced me that we should include it. I'm not planning to include a config option, because this particular code can make an outsized impact on parser performance.

@nevans nevans force-pushed the parser/header-fld-name branch from f077440 to 7825c6e Compare September 13, 2024 22:25
@nevans nevans marked this pull request as ready for review September 13, 2024 22:25
This speeds up my (earlier) benchmarks by ~15-20% over 0.4.3, and by
22-40% over earlier versions.  (FYI: I have not re-run the benchmarks
against v0.4.16 or v0.5.0-dev.)

NOTE: Previously, Net::IMAP recreated the raw original source string.
Now, it returns the decoded astring value.  Although this is technically
incompatible, it should almost never make a difference: all standard
header field names are valid atoms.

Where it _does_ make a difference, it should simplify client code, by
parsing (normalizing) the result and allowing clients to ignore server
inconsistency.

This will also allow clients to simply rely on FetchData#header for
extracting the result.  Without this commit, FetchData#header won't work
when servers return quoted or literal strings and the user expected
an atom.

(https://www.iana.org/assignments/message-headers/message-headers.xhtml)
@nevans nevans force-pushed the parser/header-fld-name branch from 7825c6e to 3fe175f Compare September 13, 2024 22:27
@nevans nevans merged commit 14f848f into ruby:master Sep 13, 2024
9 checks passed
@nevans nevans deleted the parser/header-fld-name branch September 17, 2024 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

1 participant