Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace calls to werkzeug.urls with urllib.parse #2608

Merged
merged 14 commits into from
Mar 3, 2023
Merged

replace calls to werkzeug.urls with urllib.parse #2608

merged 14 commits into from
Mar 3, 2023

Conversation

davidism
Copy link
Member

@davidism davidism commented Mar 2, 2023

Use urllib.parse functions instead of our own implementation. Deprecate all of werkzeug.urls except for uri_to_iri and iri_to_uri. My benchmark shows a 35% speedup in routing and responses, 8% from replacing most calls, the rest from refactoring the implementations of the iri functions. fixes #2600, fixes #2406

The only thing that still needed an (internal) wrapper was urlencode, since the router and test client might pass MultiDict or dict to it, and also expect None values to be discarded.

Since I was replacing all uses of quote, I also took the opportunity to review what characters are being treated as safe from percent encoding. We were not being particularly consistent or correct about it. Now all uses of quote for URLs use safe characters for the specific part of the URL being quoted, based on the WHATWG URL Standard, which fixes #2601. For quoting the filename* option in send_file, use the RFC 5987 attr-char set, which fixes #2598. iri_to_uri avoids quoting any ASCII printables, since it's assumed they're intentional at that stage. uri_to_iri unquotes as much as possible without changing how urllib.parse.urlsplit will split the URL.

As a start to #2602, deprecated passing a tuple or bytes to the iri functions.

Another side effect of inlining some helper functions is that parsing application/x-www-form-urlencoded form data now uses max_form_parts like multipart/form-data. Not as important in this case, but may as well be consistent.

davidism added 13 commits March 2, 2023 10:54
Python doesn't treat characters from RFC 3986 as safe,
so a small wrapper is used
need a wrapper to handle MultiDict and drop None
using urllib.parse results in a ~35% speedup
uri_to_iri unquotes as much as possible without changing urlsplit meaning
iri_to_uri quotes as little as possible without chaning urlsplit meaning
WhatWG URL Standard, and RFC 5987 for send_file
use keyword arg safe= to make searching easy
apply different safe sets to different parts of URL
@davidism davidism added this to the 2.3.0 milestone Mar 2, 2023
@ThiefMaster
Copy link
Member

Another side effect of inlining some helper functions is that parsing application/x-www-form-urlencoded form data now uses max_form_parts like multipart/form-data. Not as important in this case, but may as well be consistent.

As mentioned on Discord, I feel like change this is actually a bug that should possibly be reverted:

  • Unlike parsing multipart data, the overhead of parsing url-encoded form data is tiny (less than 2s for a million fields vs 25s for the same amount of multipart fields)
  • It's not really that uncommon to have POST data with many values, especially compared to equally huge multipart requests
  • Parsing urlencoded form data is more similar to parsing JSON data than to parsing multipart data - and AFAIK the default JSON parser has no such limit either.
  • "parts" is a term that's usually just used when it comes to MIME/multipart data, not random form fields
  • the docstring of max_form_parts also only mentions "multipart parts"

@pallets pallets unlocked this conversation May 5, 2023
@davidism
Copy link
Member Author

davidism commented May 5, 2023

Yep, I thought it would make sense to apply it consistently, but it sounds like only multipart is affected by the issue so I'm fine with reverting it for formdata.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants