replace calls to `werkzeug.urls` with `urllib.parse` #2608

davidism · 2023-03-02T21:54:39Z

Use urllib.parse functions instead of our own implementation. Deprecate all of werkzeug.urls except for uri_to_iri and iri_to_uri. My benchmark shows a 35% speedup in routing and responses, 8% from replacing most calls, the rest from refactoring the implementations of the iri functions. fixes #2600, fixes #2406

The only thing that still needed an (internal) wrapper was urlencode, since the router and test client might pass MultiDict or dict to it, and also expect None values to be discarded.
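A minimal sketch of what such a wrapper could look like. The name `urlencode_sketch` is hypothetical and is not Werkzeug's internal helper; the only assumption about `MultiDict` is its `items(multi=True)` method, which yields one pair per value for a repeated key.

```python
from urllib.parse import urlencode


def urlencode_sketch(query):
    """Encode a dict, MultiDict-like object, or iterable of pairs, skipping None values."""
    if hasattr(query, "items"):
        try:
            # A MultiDict yields one (key, value) pair per value with multi=True.
            items = query.items(multi=True)
        except TypeError:
            # A plain dict does not accept the multi argument.
            items = query.items()
    else:
        # Already an iterable of (key, value) pairs.
        items = query

    # Drop pairs whose value is None instead of encoding the string "None".
    pairs = [(key, value) for key, value in items if value is not None]
    return urlencode(pairs)


print(urlencode_sketch({"a": "1", "b": None, "c": "x y"}))  # a=1&c=x+y
```

Passing a list of pairs to `urllib.parse.urlencode` keeps repeated keys, which is why the sketch flattens everything to pairs first.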

Since I was replacing all uses of quote, I also took the opportunity to review what characters are being treated as safe from percent encoding. We were not being particularly consistent or correct about it. Now all uses of quote for URLs use safe characters for the specific part of the URL being quoted, based on the WHATWG URL Standard, which fixes #2601. For quoting the filename* option in send_file, use the RFC 5987 attr-char set, which fixes #2598. iri_to_uri avoids quoting any ASCII printables, since it's assumed they're intentional at that stage. uri_to_iri unquotes as much as possible without changing how urllib.parse.urlsplit will split the URL.
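To illustrate the idea of per-part safe sets, here is a small sketch using `urllib.parse.quote` with the `safe=` keyword. `PATH_SAFE` is an illustrative set chosen for this example, not a copy of Werkzeug's constants; `ATTR_CHAR_SAFE` lists the non-alphanumeric members of the RFC 5987 `attr-char` production. The function names are hypothetical.

```python
from urllib.parse import quote

# Illustrative safe set for a URL path: "/", ":", "@" and the sub-delims.
# This is an assumption for the sketch, not Werkzeug's actual value.
PATH_SAFE = "/:@!$&'()*+,;="

# RFC 5987 attr-char, minus ALPHA and DIGIT, which quote() never escapes.
ATTR_CHAR_SAFE = "!#$&+-.^_`|~"


def quote_path(path):
    """Percent-encode a path, leaving path-safe punctuation untouched."""
    return quote(path, safe=PATH_SAFE)


def filename_star(filename):
    """Build an RFC 5987 style filename* parameter value."""
    return "filename*=UTF-8''" + quote(filename, safe=ATTR_CHAR_SAFE)


print(quote_path("/a b/c,d"))          # /a%20b/c,d
print(filename_star("report, v2.txt"))  # filename*=UTF-8''report%2C%20v2.txt
```

Note that a comma is left alone in the path but is escaped in `filename*`, because it is not part of `attr-char`.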

As a start to #2602, deprecated passing a tuple or bytes to the iri functions.
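A rough sketch of how such a deprecation can be signalled. `coerce_iri_input` is a hypothetical helper written for this example, and the warning messages are made up; it only shows the general pattern of warning on legacy input types and normalizing them to `str`.

```python
import warnings
from urllib.parse import urlunsplit


def coerce_iri_input(value):
    """Hypothetical helper: return a str, warning on bytes or split-tuple input."""
    if isinstance(value, tuple):
        warnings.warn(
            "Passing a tuple is deprecated; pass a string instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Join (scheme, netloc, path, query, fragment) back into one string.
        value = urlunsplit(value)

    if isinstance(value, bytes):
        warnings.warn(
            "Passing bytes is deprecated; pass a string instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        value = value.decode()

    return value
```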

Another side effect of inlining some helper functions is that parsing application/x-www-form-urlencoded form data now uses max_form_parts like multipart/form-data. Not as important in this case, but may as well be consistent.
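For reference, the standard library can cap the number of urlencoded fields on its own through the `max_num_fields` argument of `urllib.parse.parse_qsl`. The sketch below shows that mechanism only; whether Werkzeug wires `max_form_parts` to it in exactly this way is an assumption, and `parse_form` is a hypothetical name.

```python
from urllib.parse import parse_qsl


def parse_form(body, max_fields=None):
    """Parse urlencoded form data, rejecting bodies with too many fields."""
    # parse_qsl raises ValueError when the field count exceeds max_num_fields.
    return parse_qsl(body, keep_blank_values=True, max_num_fields=max_fields)


print(parse_form("a=1&b=2&c=", max_fields=3))

try:
    parse_form("a=1&b=2&c=3", max_fields=2)
except ValueError as exc:
    print("rejected:", exc)
```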

ThiefMaster · 2023-05-05T10:00:16Z

> Another side effect of inlining some helper functions is that parsing application/x-www-form-urlencoded form data now uses max_form_parts like multipart/form-data. Not as important in this case, but may as well be consistent.

As mentioned on Discord, I feel like this change is actually a bug that should possibly be reverted:

- Unlike parsing multipart data, the overhead of parsing url-encoded form data is tiny (less than 2s for a million fields vs 25s for the same amount of multipart fields)
- It's not really that uncommon to have POST data with many values, especially compared to equally huge multipart requests
- Parsing urlencoded form data is more similar to parsing JSON data than to parsing multipart data - and AFAIK the default JSON parser has no such limit either.
- "parts" is a term that's usually just used when it comes to MIME/multipart data, not random form fields
- the docstring of max_form_parts also only mentions "multipart parts"

davidism · 2023-05-05T13:04:23Z

Yep, I thought it would make sense to apply it consistently, but it sounds like only multipart is affected by the issue so I'm fine with reverting it for formdata.

davidism added 13 commits March 2, 2023 10:54

replace uses of url_parse with urllib.parse.urlsplit

a6bae8f

replace uses of url_quote with urllib.parse.quote

53782a0

Python doesn't treat characters from RFC 3986 as safe, so a small wrapper is used

replace uses of url_unquote with urllib.parse.unquote

163d0f4

replace uses of url_decode with urllib.parse.parse_qsl

47cf613

replace uses of url_encode with urllib.parse.urlencode

6613fb9

need a wrapper to handle MultiDict and drop None

replace uses of url_join with urllib.parse.urljoin

2554f47

refactor uri_to_iri and iri_to_uri

22a83b7

using urllib.parse results in a ~35% speedup
uri_to_iri unquotes as much as possible without changing urlsplit meaning
iri_to_uri quotes as little as possible without changing urlsplit meaning

link to reference for each use of safe chars during quote

aa2a0c5

WhatWG URL Standard, and RFC 5987 for send_file
use keyword arg safe= to make searching easy
apply different safe sets to different parts of URL

deprecate werkzeug.urls

19d55c1

remove unused private function

1f35318

deprecate tuple/bytes to iri functions

5e30073

test max_form_parts for x-www-form-urlencoded

7210a55

drop _urls module

c5f4004

davidism added this to the 2.3.0 milestone Mar 2, 2023

fix mypy findings

1c6ebef

davidism merged commit d4ddff6 into main Mar 3, 2023

davidism deleted the urllib branch March 3, 2023 15:04

davidism mentioned this pull request Mar 10, 2023

send_file wont accept filenames with comma pallets/flask#5023

Closed

github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2023

mdavis-xyz mentioned this pull request Apr 26, 2023

plus (+) in query args converted to space #2657

Closed

The-Compiler mentioned this pull request Apr 26, 2023

werkzeug.urls deprecation warnings with Werkzeug 2.3.0 csernazs/pytest-httpserver#242

Closed

dairiki mentioned this pull request May 3, 2023

Use of warnings.catch_warnings resets the global warnings registry #2690

Closed

pallets unlocked this conversation May 5, 2023

ThiefMaster mentioned this pull request May 5, 2023

Do not apply max_form_parts to non-multipart data #2694

Merged

5 tasks

github-actions bot locked as resolved and limited conversation to collaborators May 20, 2023
