Accept-Encoding header parsing and interpretation #8104

Open
2 of 4 tasks
steverep opened this issue Jan 29, 2024 · 15 comments

steverep commented Jan 29, 2024

Is your feature request related to a problem?

Forgive me - parts of this might be considered bug fix and parts feature request, but they are highly related and intertwined in the same code. These are things I noticed while working on #8063.

  • Dynamic compression via enable_compression() will compare Accept-Encoding case-insensitively, but retrieving static compressed files does not (it should be case-insensitive per RFC 9110).
  • When the force parameter is used in enable_compression(), it does not even consider Accept-Encoding. I fail to see any real use case for this and feel it should be treated as a preference instead (i.e. change the default preference from deflate to something else). If Brotli were implemented, it may well be more performant than deflate, so preferring it might be desirable, but forcing it would mean the server no longer supports legacy browsers without Brotli support.
  • AFAICT, there is no proper parsing of the Accept-Encoding field per RFC 9110; there is simply an in operator test of the raw string (a parsing sketch follows this list). For example:
    • ❌ If the header is not set, it indicates any encoding is okay, but the server currently will only send uncompressed. Technically that's fine, but if a static compressed version exists but the uncompressed version does not (e.g. to save disk space), then the compressed version should be sent.
    • ❌ "*" in the field selects any encoding not explicitly listed.
    • ❌ Each coding is permitted to have a quality factor (e.g. "br;q=1, deflate;q=0, *;q=0"). Higher quality factors should be preferred over lower ones if supported by the server, and ";q=0" says not to send that encoding.
    • ✔️ If the header is empty ("") then no compression should be sent.
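
As an illustration only, here is a minimal sketch of what such parsing could produce; the parse_accept_encoding helper and its exact semantics are hypothetical, not existing aiohttp code:

    from typing import Dict, Optional

    def parse_accept_encoding(header: Optional[str]) -> Optional[Dict[str, float]]:
        """Parse Accept-Encoding into {coding: qvalue} (hypothetical sketch).

        Returns None when the header is absent (any coding is acceptable) and
        an empty dict when it is present but empty (send no compression).
        """
        if header is None:
            return None
        qualities: Dict[str, float] = {}
        for item in header.split(","):
            item = item.strip()
            if not item:
                continue
            coding, _, params = item.partition(";")
            q = 1.0  # quality defaults to 1 per RFC 9110
            name, _, value = params.strip().partition("=")
            if name.lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    pass  # ignore malformed qvalues in this sketch
            qualities[coding.strip().lower()] = q  # codings compare case-insensitively
        return qualities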

Describe the solution you'd like

  • Accept-Encoding should be treated case-insensitively when returning static compressed files.
  • The force parameter to enable_compression() should really act like a preference, not an override (i.e. it should not be forced if the encoding is not supported per the Accept-Encoding request header).
  • Parse and interpret the header per RFC 9110 with quality factors, "*", etc. Send a 415 response if merited.

Describe alternatives you've considered

None for the first 2, but admittedly the parsing issue is more or less a moot point since most clients always send the header with an explicit list and no quality factors (i.e. "gzip, deflate, br").

Related component

Server

Additional context

n/a

steverep commented Jan 30, 2024

@bdraco I think handling the 1st task here for case-sensitivity is quite easy to do before #8063 so I'll submit that.

However, the 3rd task, to parse it completely per the RFC, is less of a priority to backport IMO. And implementing it raises a larger question for me...

Although it's relatively easy in python, shouldn't that parsing be done by llhttp, and probably the conversion to lowercase too? Meaning, by the time it hits the code to use it for picking or compressing the file, it should be lowered and in a form like {"<encoding>": <quality>, ...} instead of just the raw string. There are other headers that use this format -- Accept, Accept-Language, TE -- but I don't see any of those parsed anywhere in the code.

@Dreamsorcerer is this a question for you?

@steverep

The 3rd task would be partially accomplished by #7679, which is drafted for the 4.0 milestone.

steverep commented Feb 2, 2024

For the 2nd task regarding the behavior of the force parameter in enable_compression(), the change would be pretty easy; however, a decision is needed on intent and backporting. Either:

  1. Bug fix back to 3.9: In other words, the intent of force was never to ignore Accept-Encoding.
  2. Feature for 3.10: Perhaps add a new prefer parameter to make the intent clear, and potentially deprecate force in version 4.
  3. Breaking for v4: I don't recommend this as it would be problematic for users if Brotli were implemented for 3.10 unless it was the default over deflate.

Thoughts?

bdraco commented Feb 2, 2024

I think we need to do some git archeology to figure out why force was added in the first place before a path forward can be determined.

Dreamsorcerer commented Feb 2, 2024

Although it's relatively easy in python, shouldn't that parsing be done by llhttp, and probably the conversion to lowercase too? Meaning, by the time it hits the code to use it for picking or compressing the file, it should be lowered and in a form like {"<encoding>": <quality>, ...} instead of just the raw string. There are other headers that use this format -- Accept, Accept-Language, TE -- but I don't see any of those parsed anywhere in the code.

My impression is that llhttp is just passing us header values. Meaning that it has parsed it as HTTP, but not converted anything based on the particular header names, which is probably out-of-scope. So, such changes would likely go into _http_parser.pyx instead, if you want the compiled performance.

However, is there actually any merit in implementing this quality check? I've seen the same syntax for locales, and I think the general consensus is that this was over-engineered and basically pointless. The quality numbers fail to achieve anything more than defining an order of preference (i.e. the actual numbers are never used for anything, they're just sorted on). Therefore, browsers don't provide any way for users to customise these quality values, they just pick evenly spaced values according to order of user preference. Because of all this, browsers will also always order them with the highest quality number at the start, so in reality, you can just read the values left-to-right while ignoring the quality numbers and you'll get the same result. Parsing the quality numbers would therefore just be a waste of CPU-time.

steverep commented Feb 2, 2024

I think we need to do some git archeology to figure out why force was added in the first place before a path forward can be determined.

It was a boolean before #403 added gzip, and the code before that made it clear that it was actually a "force" to ignore Accept-Encoding. The boolean was originally added in #239 but there's no comments to explain any use case for it.

@Dreamsorcerer

I think the force parameter is pretty clear in that it forces a given compression. I'm not sure why you'd want to do this, though, so I'm not convinced that turning it into a preference makes sense without understanding the use case (can we find any projects using the parameter?).

Clearly it's not correct behaviour to use compression that the client claims not to support, but it also seems weird to try and use a fixed compression method in the first place...

Dreamsorcerer commented Feb 3, 2024

I've seen the same syntax for locales, and I think the general consensus is that this was over-engineered and basically pointless.

OK, there seems to be an important difference between Accept-Encoding and Accept-Language. The former can disallow encodings with q=0, so maybe we need to at least check for that.

Edit: OK, not a difference, but it's explicitly mentioned under Accept-Encoding, whereas it'd be very weird to try and exclude languages.

steverep commented Feb 3, 2024

My impression is that llhttp is just passing us header values. Meaning that it has parsed it as HTTP, but not converted anything based on the particular header names, which is probably out-of-scope. So, such changes would likely go into _http_parser.pyx instead, if you want the compiled performance.

Okay, makes sense 👍🏻

However, is there actually any merit in implementing this quality check? I've seen the same syntax for locales, and I think the general consensus is that this was over-engineered and basically pointless. The quality numbers fail to achieve anything more than defining an order of preference (i.e. the actual numbers are never used for anything, they're just sorted on).

That's certainly correct - the numbers mean nothing but a sort order. I think the reason it's not just a list in preference order is that then it becomes impossible to communicate equal preference for some or all items. That's why the default quality if not specified is 1.

To a lesser extent it's also a way to have negations with a wildcard, e.g. "compress;q=0, *" says I'll take any encoding except compress, although even the spec admits this pattern has little to no practical value.
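
To illustrate both behaviours, here is a rough selection sketch over a parsed {coding: q} mapping; choose_coding and its inputs are hypothetical and assume a parse step like the one sketched in the issue description:

    from typing import Dict, Optional, Sequence

    def choose_coding(
        qualities: Dict[str, float], supported: Sequence[str]
    ) -> Optional[str]:
        """Pick the best server-supported coding, or None for identity (sketch)."""
        wildcard = qualities.get("*")
        best: Optional[str] = None
        best_q = 0.0
        for coding in supported:
            q = qualities.get(coding, wildcard if wildcard is not None else 0.0)
            if q > best_q:
                best, best_q = coding, q
        return best

    # "compress;q=0, *": any supported coding except compress
    assert choose_coding({"compress": 0.0, "*": 1.0}, ["gzip", "br", "compress"]) == "gzip"
    # "gzip, deflate, br": equal preference (q defaults to 1), server order breaks the tie
    assert choose_coding({"gzip": 1.0, "deflate": 1.0, "br": 1.0}, ["br", "gzip"]) == "br"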

FWIW, Apache does use the qualities in its content negotiation.

Therefore, browsers don't provide any way for users to customise these quality values, they just pick evenly spaced values according to order of user preference.

True, they are not user-facing (nor would they have any reason to be?), and the values are just picked to express the browser's order of preference.

Because of all this, browsers will also always order them with the highest quality number at the start, so in reality, you can just read the values left-to-right while ignoring the quality numbers and you'll get the same result. Parsing the quality numbers would therefore just be a waste of CPU-time.

This is definitely not true. The spec is clear that values without a quality default to 1, and browsers take advantage of that. For example, both Chrome and Firefox send Accept-Encoding: gzip, deflate, br, but that certainly does not mean they prefer gzip over Brotli. Decoding time is inconsequential, so the client's best interest is the smallest transfer size. Google even makes this clear in their Lighthouse audit for compression.

What they send for the Accept header also makes this clear:

  • Firefox = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
  • Chrome = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"

@Dreamsorcerer

This is definitely not true.

I was referring to the real-world behaviour of Accept-Language, and wondering if the same applied here. Clearly that's not the case, so you are probably right that we should use the quality values here.

steverep commented Feb 3, 2024

Clearly it's not correct behaviour to use compression that the client claims not to support, but it also seems weird to try and use a fixed compression method in the first place...

I completely agree - neither makes sense to me. I think there is a use case for changing the server's default encoding order though. For example, a user might benchmark and find that sending large files over long distances is likely faster with gzip than deflate (i.e. the larger encoding time is offset by the faster transfer time). So they might want the order to be ["br", "gzip", ..others].

If I were to redesign the current API, it might look something like this to allow a set of encodings to be preferred first but then fall back to the others supported:

    def __init__(self, ...) -> None:
        ...
        self._ordered_codings = deque(ContentCoding, len(ContentCoding))

    def enable_compression(self, *prefer: ContentCoding) -> None:
        """Enables response compression with preferred encodings."""
        self._compression = True
        for encoding in prefer:
            self._ordered_codings.remove(encoding)
        self._ordered_codings.extend(prefer)
        self._ordered_codings.rotate(len(prefer))

    async def _start_compression(self, request: "BaseRequest") -> None:
        # Encoding comparisons should be case-insensitive
        # https://www.rfc-editor.org/rfc/rfc9110#section-8.4.1
        accept_encoding = request.headers.get(hdrs.ACCEPT_ENCODING, "").lower()
        for coding in self._ordered_codings:
            if coding.value in accept_encoding:
                await self._do_start_compression(coding)
                return
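
For example, under that sketch a handler that wants gzip tried before the default deflate might do something like this (illustrative usage of the proposed API, assuming the existing ContentCoding enum and web.StreamResponse):

    resp = web.StreamResponse()
    # gzip is tried first; any other coding the client accepts remains a fallback
    resp.enable_compression(ContentCoding.gzip)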

@Dreamsorcerer

Yeah, I'd be open to something like that. I wonder if it's worth adding this as a new feature (and potentially removing force if we can't find any uses for it).

steverep commented Feb 5, 2024

Yeah, I'd be open to something like that. I wonder if it's worth adding this as a new feature (and potentially removing force if we can't find any uses for it).

Here's my attempt at a search for enable_compression() uses with an argument, after a lot of filtering to get rid of just copies of aiohttp itself.

There are some results using the old boolean, some oddly specifying "identity" 😕, some just specifying "gzip" that clearly would not be harmed by turning force into a preference (e.g. ignore the Home Assistant results), and even one result that is doing the job aiohttp should be doing and checking the Accept-Encoding header 😉.

Passing "identity" is concerning because currently that would just do nothing, but under my proposal, it would be skipped because it's not going to be in the Accept-Encoding header and then the algorithm would go on to deflate. I could either put an assertion in that disallows it, or just adjust the header check because it's implied as included (unless explicitly excluded).

@Dreamsorcerer

Here's my attempt at a search for enable_compression() uses with an argument, after a lot of filtering to get rid of just copies of aiohttp itself.

That gives me no results for some reason. I did create a script previously to search dependents, though it can be slow to run (it searches 1 repo every ~7 seconds due to rate limit): https://gist.github.com/Dreamsorcerer/70285fac0a11c3d9c26b577f7dd989a7

There are some results using the old boolean

Might be interesting to see if any of them mention any rationale in the commit they were added...

Passing "identity" is concerning because currently that would just do nothing, but under my proposal, it would be skipped because it's not going to be in the Accept-Encoding header and then the algorithm would go on to deflate. I could either put an assertion in that disallows it, or just adjust the header check because it's implied as included (unless explicitly excluded).

My counter-proposal is to just add a new prefer parameter which takes a sequence of strings. Then add a deprecation warning to the force parameter and remove it from v4. If we get complaints from the deprecation notice, then we can reconsider removing it.
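
A rough shape of that idea; the parameter names, warning text, and attribute names below are illustrative rather than a committed API:

    import warnings
    from typing import Optional, Sequence

    def enable_compression(
        self,
        force: Optional[ContentCoding] = None,
        *,
        prefer: Sequence[str] = (),
    ) -> None:
        """Enable compression with an optional order of preferred codings (sketch)."""
        if force is not None:
            warnings.warn(
                "The force parameter is deprecated, use prefer instead",
                DeprecationWarning,
                stacklevel=2,
            )
        self._compression = True
        # keep the old forced behaviour until v4 (attribute names illustrative)
        self._compression_force = force
        self._preferred_codings = tuple(prefer)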

steverep commented Feb 5, 2024

That gives me no results for some reason.

The markdown sanitizer keeps removing all the escapes from the regex parts. The search term should look like this with 20 results for me:

aiohttp AND /\.enable_compression\(.+\)/ language:python NOT is:archived NOT is:fork NOT repo:aio-libs/aiohttp NOT is:vendored NOT is:generated NOT path:/(^|\/)aiohttp\/(multipart|web_response|client_reqrep)\.py$/ NOT path:/tests\/test_(web_response|client_functional|http_writer)\.py$/ NOT repo:cndn/intelligent-code-completion

Might be interesting to see if any of them mention any rationale in the commit they were added...

I didn't check, but one of the uses passing False did have a comment mentioning it was a workaround for the double compression issue recently fixed by @bdraco.

My counter-proposal is to just add a new prefer parameter which takes a sequence of strings. Then add a deprecation warning to the force parameter and remove it from v4. If we get complaints from the deprecation notice, then we can reconsider removing it.

Not as elegant, but safer of course. Also, consider that either way this becomes a "breaking" change. For example, the Home Assistant use is definitely not expecting a force, so a change would be required to specify prefer=.
