Refactor Upload + Better Metadata Handling #14716

dstufft · 2023-10-08T22:57:14Z

This is still a work in progress: none of the tests are working/updated and there's still pending work to be done (see everything with a FIXME comment). I'm opening this now as a draft, though, to give people a chance to start looking at it and share their thoughts.

The upload endpoint in Warehouse is a gnarly endpoint. It's currently a single ~1600 line file and it is not well factored. Following the logic of what is happening and where it's happening requires careful reading of a large view function, where different parts have various inter-dependencies that are not particularly clear.

To make matters worse, the metadata handling in the upload endpoint uses a bespoke metadata implementation, which sometimes differs from how other systems validate. Due to the historical "shape" this API took, the metadata that we're validating and storing isn't actually the metadata that clients will be using: that metadata lives in the uploaded artifacts, whereas what we validate is sent by the upload client alongside the upload ¹.

So what does this Pull Request do? It refactors/revamps the upload handling to try and fix all of these issues and move us to a better overall standing.

Concretely, it does the following:

- Breaks the upload() view up into smaller functions, making it easier to follow the overall flow of the upload function without getting bogged down in details, and making the requirements/dependencies that these "sub sections" have clearer and more obvious.
  - We attempt to minimize the back and forth between validation, protocol handling, and constructing database objects.
- Extracts the metadata out of the artifact, and uses that to populate the database.
  - Currently this only works for wheel files, but we are set up to add support for sdist too with Metadata 2.2.
  - Falls back to using the metadata from the form data as we do today, but only for artifacts that we know we can't extract metadata from (i.e. failing to extract from a wheel is an error; it does not fall back to form data).
- Uses packaging.metadata as the canonical representation of our metadata and to handle validation of the metadata.
  - We still layer our own validation on top of packaging.metadata, because we have extra constraints that are special to Warehouse.
- Fixes the metadata handling for some fields that we were either accidentally ignoring (we support up to metadata 2.2, but we missed implementing some fields) or handling incorrectly (multi use fields being treated like single use fields, etc).
- Stops supporting md5 digest as a valid digest to verify an uploaded file.
  - We still compute/store it, but we no longer accept uploads that only have an md5. They need to have a sha256 or blake2_256.
- Stops supporting uploads for ancient releases where Release.canonical_version returns multiple objects ².
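
To illustrate the digest rule in the list above, here is a minimal sketch. It is not the code in this PR: `compute_digests`, `check_digests`, and the dict-of-claimed-digests shape are hypothetical names chosen for the example, and `blake2_256` is assumed to mean BLAKE2b with a 32-byte digest.

```python
import hashlib
import hmac

# Digests that are allowed to verify an upload on their own. Assumption: this
# mirrors the rule described above, where md5 alone is no longer sufficient.
STRONG_DIGESTS = ("sha256", "blake2_256")


def compute_digests(data: bytes) -> dict[str, str]:
    """Compute every digest we would store, md5 included."""
    return {
        "md5": hashlib.md5(data, usedforsecurity=False).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2_256": hashlib.blake2b(data, digest_size=32).hexdigest(),
    }


def check_digests(data: bytes, claimed: dict[str, str]) -> dict[str, str]:
    """Return the computed digests, or raise ValueError if the claim is unusable."""
    if not any(claimed.get(name) for name in STRONG_DIGESTS):
        raise ValueError("upload must include a sha256 or blake2_256 digest")

    actual = compute_digests(data)
    for name, value in claimed.items():
        if not value:
            continue  # digest field left empty by the client
        if name not in actual:
            raise ValueError(f"unsupported digest: {name}")
        # Constant-time comparison; every digest the client sent must match.
        if not hmac.compare_digest(actual[name], value.lower()):
            raise ValueError(f"{name} digest does not match the uploaded file")
    return actual
```

An md5 sent alongside a sha256 is still checked in this sketch; it just can't carry the upload by itself.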

This should mean that the metadata that we record in Warehouse is much more likely to match what installers like pip will see when introspecting the artifact itself.

It also means that Warehouse is going to be more strict in what it accepts, because the metadata parsing in packaging.metadata has been carefully written to avoid silently allowing possibly ambiguous data (and as far as I know, it's the only parser that currently does that). That means that cases like:

- Multiple uses of single use keys will be errors (currently all other parsers just pick either the first or last value).
- Unknown fields will be errors (currently all other parsers just skip them).
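
To make those two rules concrete, here is a minimal stdlib-only sketch. It is not how packaging.metadata is implemented: `strict_parse` is a hypothetical helper, and the field sets are deliberately tiny, illustrative subsets of the core metadata fields.

```python
from email.parser import Parser

# Illustrative subsets only; the core metadata spec defines many more fields.
SINGLE_USE = {"metadata-version", "name", "version", "summary"}
MULTI_USE = {"classifier", "requires-dist", "project-url"}


def strict_parse(text: str) -> dict[str, list[str]]:
    """Parse RFC 822 style metadata headers, rejecting ambiguous input."""
    message = Parser().parsestr(text, headersonly=True)
    parsed: dict[str, list[str]] = {}
    errors: list[str] = []

    # message.items() yields every header in order, duplicates included.
    for key, value in message.items():
        name = key.lower()
        if name not in SINGLE_USE and name not in MULTI_USE:
            errors.append(f"unknown field: {key}")
            continue
        if name in SINGLE_USE and name in parsed:
            errors.append(f"single use field given more than once: {key}")
            continue
        parsed.setdefault(name, []).append(value)

    if errors:
        raise ValueError("; ".join(errors))
    return parsed
```

A lenient parser would return a value for `Name` in both failing cases; the point of the strict version is that the uploader finds out instead.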

Because we're extracting metadata out of the artifact rather than using the form data (where possible) we had to change the order of operations, which previously looked something like:

1. Process / Validate the metadata (from the multipart form data) and the "upload data" (file hashes, package type, filename, etc).
2. Get or create the Project declared in the metadata.
3. Check if the request is authorized to upload to Project.
4. Check if the description can be rendered.
5. Get or create the Release for the version declared in the metadata.
6. Do more validations around the filename (project-version.ext? invalid characters, etc).
7. Buffer the uploaded file to a temporary location on disk.
8. Do more validations (valid dist? duplicate file? etc).
9. Check if the file is a wheel, and do more validations that the filename of the wheel is valid.
10. If the file is a Wheel, extract the METADATA file.
11. Create the File object in the database.
12. Upload the artifact + metadata to S3.

However, the new order of operations looks more like this:

1. Process / Validate the "upload data" (file hashes, package type, filename, etc) and only the project name and version (from form data).
   - This includes all filename validation that we can do without access to the database or metadata besides name + version.
2. Get or create the Project from the name we got from the form data.
3. Check if the request is authorized to upload to Project.
4. Do any validations of the filename that require access to the database (duplicate file checks, etc).
5. Buffer the uploaded file to a temporary location on disk.
6. Validate that the file itself is a valid distribution file.
7. Extract the METADATA file (if we can, currently wheel only).
8. Construct a validated packaging.metadata.Metadata object, using the extracted METADATA or the form data if the dist isn't the kind we can extract METADATA from.
   - This includes fully validating the metadata with any additional rules we add onto it that don't require access to the database.
   - This also includes checking if the metadata.description is able to be rendered.
9. Get or create the Release for the version declared in the metadata.
10. Create the File object in the database.
11. Upload the artifact + metadata to S3.
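
The extract-or-fall-back steps in the new flow can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `extract_wheel_metadata` and `metadata_source` are hypothetical names, and the sketch relies on the wheel convention that the filename starts with `{distribution}-{version}-` and the archive contains a matching `{distribution}-{version}.dist-info/METADATA` member.

```python
import zipfile


def extract_wheel_metadata(filename: str, fileobj) -> bytes:
    """Return the raw METADATA bytes from a wheel, or raise ValueError."""
    # The dist-info directory inside the archive is named after the first two
    # dash-separated parts of the wheel filename.
    name, version = filename.split("-")[:2]
    member = f"{name}-{version}.dist-info/METADATA"
    try:
        with zipfile.ZipFile(fileobj) as archive:
            return archive.read(member)
    except (zipfile.BadZipFile, KeyError) as exc:
        raise ValueError(f"could not read {member} from {filename}") from exc


def metadata_source(filename: str, fileobj, form_metadata: bytes) -> bytes:
    """Prefer metadata from the artifact; use form data only when we can't extract."""
    if filename.endswith(".whl"):
        # A wheel that we fail to extract from is an error, not a fallback.
        return extract_wheel_metadata(filename, fileobj)
    return form_metadata
```

The bytes returned here would then be handed to the metadata parser; only dist types we can't extract from ever reach the form-data branch.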

We do end up shifting more of the filename validation to happen prior to ever buffering the uploaded file, which should allow those particular checks to bail out faster and do less work. However, we do shift metadata validation to happen after we've buffered the uploaded file, which will delay those particular checks to later in the request/response cycle ³.

Another subtle change is that by moving the duplicate file check prior to buffering the uploaded file to disk, we have to implicitly trust that the sha256 and/or blake2_256 digest that the client provides us is accurate when deciding to no-op the upload. This should be perfectly safe as we treat the entire upload as a no-op (including dooming the transaction), so the most a malicious client can do is trick Warehouse into either turning an error into a no-op, or a no-op into an error.
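
A rough sketch of that early check, to show why a lying client gains nothing. The names (`Outcome`, `check_existing`) and the in-memory `existing` mapping of filename to stored digests are assumptions for the example; the real check would query the database.

```python
from enum import Enum


class Outcome(Enum):
    NEW = "new"            # filename not seen before; continue with the upload
    NOOP = "noop"          # same filename and digests; treat as already uploaded
    CONFLICT = "conflict"  # same filename, different contents; reject


def check_existing(
    filename: str,
    claimed: dict[str, str],
    existing: dict[str, dict[str, str]],
) -> Outcome:
    """Decide what to do with an upload before its body has been buffered."""
    stored = existing.get(filename)
    if stored is None:
        return Outcome.NEW

    # Only the client-supplied strong digests are available at this point.
    provided = {
        name: value
        for name, value in claimed.items()
        if name in ("sha256", "blake2_256") and value
    }
    if provided and all(stored.get(n) == v for n, v in provided.items()):
        return Outcome.NOOP
    return Outcome.CONFLICT
```

Neither `NOOP` nor `CONFLICT` stores anything, so a false digest can only swap one of those outcomes for the other.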

Things still left to do:

I have more improvements to this I plan to do as well, but I'm going to keep them out of this PR since it's already big enough.

¹ The most popular tool for uploading is twine, which just reads the METADATA or PKG-INFO files and then sends that along. In theory this should be equivalent to us extracting the METADATA files ourselves. This isn't always true in practice, though: any time the upload client reads the metadata, it risks transforming it in incompatible ways, missing fields, etc., and that is hidden from us unless we look at the METADATA file itself.
² The structure of this made things more difficult to refactor out, and it's been 10+ years since we allowed uploading multiple distinct versions that all canonicalize to the same thing, so nobody should be hitting this today unless they're trying to release something to an ancient version.
³ While this PR currently ignores all of the core metadata that is being sent in the form data besides name/version, we could still consume that data and validate it prior to buffering the uploaded file to disk, then extract the METADATA file, parse it, and ensure that it matches the metadata that was sent in the form data.

This PR chooses not to do that because the metadata in the file is the authoritative copy, and if that differs from the form data it likely means that the upload client had a bug. We can choose whether we turn that bug into an error or just silently do the right thing, so we just silently do the right thing.

dstufft · 2023-10-09T00:18:31Z

Note: I opened #14717 to discuss a change I'm planning to make that affects this PR.

dstufft · 2023-10-09T00:27:20Z

I guess I should also mention: once I have this PR in a state that I'm happy with as an outcome, I'll probably try and break it up into smaller PRs to make them easier to review. It was just easier to figure out how this should look by lumping it all together.

dstufft · 2023-10-09T02:39:06Z

Blocked on pypa/packaging#733

dstufft · 2023-10-09T06:59:09Z

Also blocked on pypa/packaging#736.

dstufft added 26 commits October 7, 2023 16:40

refactor the "upload allowed" checks into a helper

bd7f2df

expand comment a bit

01a6ac2

Start refactoring metadata handling

f8420d1

Get/Create the project prior to validating all metadata

ceebb0b

move most filename validation to the UploadForm

07ad047

refactor out validating filename for metadata

40c2e1e

refactor copying uploaded files to a local temporary file

4d6320f

more refactoring

51cb60a

move protocol version check to the front

223ee7a

refactor request sanitization

2a00cc3

Refactor metadata extraction

b8f8962

remove an old validation that should no longer be needed

95db7da

reshuffle

6c86844

Check for filename reuse (duplicate or not)

1533621

small refactor

11be239

treat it as an error to get multiple releases

7adac64

more factoring

a7b06d4

more refactoring

e328d0b

remove no longer used function

c24a360

turn immediate TODOs into FIXME

963f803

correct

53efa21

refactor validation to happen against Metadata objects

2e51dcb

Parse Metadata from a form object

3a8c79b

no longer used

893ae67

Drop support for md5 as a "secure" hash

cbe6cf3

Move filename name/version validation into the upload form

f33c29a

refactor the duplicate / conflicting file check

b07db69

dstufft mentioned this pull request Oct 9, 2023

Use packaging.metadata to parse and validate upload metadata #14718

Merged

miketheman added and removed the blocked label ("Issues we can't or shouldn't get to yet") Nov 17, 2023

dstufft mentioned this pull request May 31, 2024

Move request sanitization and upload disallowed out of file_upload #16032

Open
