-
Notifications
You must be signed in to change notification settings - Fork 950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Upload + Better Metadata Handling #14716
base: main
Are you sure you want to change the base?
Conversation
Note: I opened #14717 to discuss a change I'm planning to make that affects this PR. |
I guess I should also mention, once I have this PR in a state that I'm happy with as an outcome, I'll probably try and break this up into smaller PRs to make them easier to review, it was just easier to figure out how this should look lumping it all together. |
Blocked on pypa/packaging#733 |
Also blocked on pypa/packaging#736. |
This is still a work in progress, none of the tests are working/updated and there's still pending work to be done (see everything with a
FIXME
comment), opening this now as a draft though to give people a chance to start looking at it and giving their thoughts.The upload endpoint in Warehouse is a gnarly endpoint, It's currently a single ~1600 line file and it is not well factored. Following the logic of what is happening and where it's happening requires careful reading of a large view function, where different parts have various inter-dependencies that are not particularly clear.
To make matters worse, the metadata handling in the upload endpoint is using a bespoke metadata implementation, which sometime differs from how other systems validate, and due to the historical "shape" this API took, the metadata that we're validating and storing isn't actually the metadata that clients will be using-- that metadata lives in the uploaded artifacts, but instead is sent by the upload client alongside the upload 1.
So what does this Pull Request do? It refactors/revamps the uploading handling to try and fix all of these issues to move us to a better overall standing.
Concretely, it does the following:
upload()
view up into smaller functions, making it easier to follow the overall flow of the upload function without getting bogged down into details and making the requirements/dependencies that these "sub sections" have clearer and more obvious.packaging.metadata
as the canonical representation of our metadata and to handle validation of the metadata.packaging.metadata
, because we have extra constraints that are special to Warehouse.Release.canonical_version
returns multiple objects 2.This should mean that the metadata that we record in Warehouse is much more likely to match what installers like pip will see when introspecting the artifact itself.
It also means that Warehouse is going to be more strict in what it accepts, because the metadata parsing in
packaging.metadata
has been carefully written to avoid silently allowing possibly ambiguous data (and as far as I know, it's the only parser that currently does that). That means that cases like:Because we're extracting metadata out of the artifact rather than using the form data (where possible) we had to change the order of operations, which previously looked something like:
Project
declared in the metadata.Project
.Release
for the version declared in the metadata.METADATA
file.File
object in the database.However, the new order of operations looks more like this:
Project
from thename
we got from the form data.Project
.METADATA
file (if we can, currently wheel only).packaging.metadata.Metadata
object, using the extractedMETADATA
or the form data if the dist isn't the kind we can extractMETADATA
from.metadata.description
is able to be rendered.Release
for the version declared in the metadata.File
object in the database.We do end up shifting more of the filename validation to happen prior to ever buffering the uploaded file, which should allow those particular checks to bail out faster and do less work. However, we do shift metadata validation to happen after we've buffered the uploaded file, which will delay those particular checks to later in the request/response cycle 3.
Another subtle change is that by moving the duplicate file check prior to buffering the uploaded file to disk, we have to implicitly trust that the sha256 and/or blake2_256 digest that the client provides us is accurate when deciding to no-op the upload. This should be perfectly safe as we treat the entire upload as a no-op (including dooming the transaction) so the most a malicious client can do is trick Warehouse into either either turning an error into a no-op, or a no-op into an error.
Things still left to do:
Metadata
ofUNKNOWN
values.version
out of the form data and moving all of the name/version checks to happen prior to buffering the file.Metadata
, but theMetadata
class doesn't have support for the rendered description).I have more improvements to this I plan to do as well, but I'm going to keep them out of this PR since it's already big enough.
Footnotes
The most popular tool for uploading is twine, which just reads the
METADATA
orPKG-INFO
files and then sends that along. In theory this should be equivalent to us extracting theMETADATA
files ourselves. This isn't actually always true in practice though, any time the upload client reads the metadata they risk transforming it in incompatible ways, missing fields, etc that is hidden from us unless we look at theMETADATA
file itself. ↩The structure of this made things more difficult to refactor out, and it's been like 10+ years since we allowed uploading multiple distinct versions that all canonicalize to the same thing, so nobody should be hitting this today unless they're trying to release something to an ancient version. ↩
While this PR currently ignores all of the core metadata that is being sent in the form data besides name/version, we could still consume that data and validate it prior to buffering the uploaded file to disk, then extract the
METADATA
file, parse it, and ensure that it matches the metadata that was sent in the form data.This PR chooses not to do that because the metadata in the file is the authoritative copy, and if that differs from the form data it likely means that the upload client had a bug... but we can choose whether or not we turn that bug into an error or just silently doing the right thing... so we just silently do the right thing. ↩