Fix how Warehouse stores metadata (per-file, not per-release) #8090
Comments
@cooperlees is now working on pypa/packaging-problems#367, which is partially blocked by this bug in Warehouse.
I've been thinking about this again, so I looked at exactly what we're storing per Currently our metadata looks like:
Since all of this metadata comes from uploading a file, technically every one of these fields besides Name and Version can vary from file to file, but in practice much of it is identical for each file within a release. Additionally, from an implementation point of view, the Web UI, the data model, and the general conceptual overhead are all simpler when most of these values are the same across a release. Looking through this, the main thing I see that we should change is:
You could make an argument that Requires-Python and Yanked / Yanked Reason should move from
I think that for Requires-Python, we display that in the Web UI, so it makes sense to want that to be consistent. More importantly, the intent of that key is that you can make a new release that drops support for some version(s) of Python without that release breaking existing users. For that, I think it makes sense to keep it consistent, since every file should be the same version of the code.
For yanking, I think that we've made a conscious choice to implement yanking in terms of yanking a whole release, not an individual file, which I think is perfectly fine.
So I believe the original issue was roughly accurate, we primarily just need to store dependency information on
WARNING: I've done a fair amount of thinking on this, and in an effort to get that information out of my head I'm just going to brain dump in this comment. There are a few interplaying concerns here:
The actual mechanics of storing metadata per file are straightforward, either add columns to the existing
Most of these problems come down to how we source this metadata (for existing files, for new files in the interim, and long term), but there's also a question of how we handle sdists, which may have dynamic metadata.
Personally, I think the gold standard for what we want to arrive at here is that we have a source of per-file metadata that we guarantee matches the content of the files. If we can't get that information for a file, then we don't attempt to "best guess" it: it either comes from the file or we don't provide it. An important aspect here is that we'll need to distinguish between unset and set to nothing.
In the interim, before we get to that point, we can have some "best guess" information if we know that we can later fill it in, but we shouldn't provide "best guess" information and then delete it completely.
I think the fastest path forward then is we end up with a
Once we evolve the upload API so that we can introspect artifacts and get metadata from them, we can switch to using the from-the-artifact data, including in cases where we have that information inside of an sdist.
We can also backfill this "best guess" data at any time by iterating over the wheels (and actually sdists) that are already uploaded and filling in that data with "validated" data. We could do this after we have the upload introspection, or even before if we record that we've done it already.
Then there's the question of handling the metadata that we want to treat as "release" metadata in the Warehouse UI and existing APIs, but which obviously comes from one of the files (like summary, long description, URLs, whatever). For this we can do a number of things, either just duplicate the data at the
So tl;dr I suspect our best path forward here:
Then at some point in the future, when we can introspect files on upload, we get validated data on upload.
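The ideas in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `MetadataSource`, `FileMetadataRecord`, and `upgrade` are hypothetical names, not anything that exists in Warehouse. The sketch keeps "unset" (`None`) distinct from "set to nothing" (an empty list), and lets validated data replace "best guess" data but never the other way around.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MetadataSource(Enum):
    # Hypothetical labels for where a record came from.
    BEST_GUESS = "best_guess"  # taken from the upload request, not verified
    VALIDATED = "validated"    # read from the artifact itself


@dataclass(frozen=True)
class FileMetadataRecord:
    filename: str
    source: MetadataSource
    # None means "unset" (we do not know); [] means "set to nothing".
    requires_dist: Optional[list[str]] = None


def upgrade(
    current: Optional[FileMetadataRecord], new: FileMetadataRecord
) -> FileMetadataRecord:
    """Pick the record to keep: a guess never replaces validated data."""
    if current is None:
        return new
    if (
        current.source is MetadataSource.VALIDATED
        and new.source is MetadataSource.BEST_GUESS
    ):
        return current
    return new
```

A backfill job could call something like `upgrade` for each already-uploaded file, so that re-running it, or running it before upload introspection exists, never downgrades a validated record.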
It's been a while and a few things have changed since #8090 (comment) (namely that we've started extracting metadata from wheels, and have backfilled metadata files for all wheels). I've updated the first comment (#8090 (comment)) with a task list that I think accurately describes the work we need to do here.
Task list:
- FileMetadata model and an optional 1-1 relationship to the File model. The fields of this model should correspond to the available Core Metadata fields.
- metadata_source field to FileMetadata that lets us determine if the metadata is "provided" (i.e. from the POST) or "extracted" (i.e. from the artifact itself).
- FileMetadata objects on new uploads, discerning how we got the metadata (wheels should always be "extracted", source distributions will be "provided")
- FileMetadata files for wheels that have already been uploaded with data from our metadata backfill files (Remove metadata backfill task #15526 will be helpful here)
- Release model and update the UI to get them from the "metadata source" instead.
Original issue
Describe the bug
Warehouse's API sometimes gives the user inaccurate dependency metadata for a project's release, because:
Expected behavior
As I understand it, we should change how we store and provide dependency information, recording and storing it per file instead of per release. I presume this means that the requires_dist field within the release endpoint would move from the "info" value to the individual "releases" values.
To Reproduce
Sorry, I don't have one to hand.
Additional context
Quoting a conversation in IRC today between @dstufft and @techalchemy (quotes are condensed and edited for grammar):
@dstufft said of the current Warehouse JSON API, "I don't think it's usable in its current form for resolving dependencies". Regarding the metadata available, which clients would otherwise need to download packages to acquire,
"the data is wrong is the main thing ... for dep info .... because warehouse (and originally PyPI's data model) is wrong. We only store the dependency information per Release, but it can vary per file."
@techalchemy asked: "so which file do you pick for parsing dependencies? the first wheel that gets uploaded? Or the last one?"
@dstufft: "first file uploaded creates the Release object. which is also problematic, if you upload a sdist first no dependency information is encoded.... At one point only twine worked to upload dependency information. If you uploaded with setuptools it didn't get sent no matter what."
Donald also noted, on parseability of that info, "We [Warehouse] do not currently parse anything inside of a wheel, in part because we never did, in part because upload already takes forever and the more stuff we do the longer it takes. I think our timeout on upload is multiple minutes, because that's how long it takes sometimes." (That's a reason for #7730 but we should not block on that.)
"We might want to tweak the JSON API a bit just to make it suitable for the primary use case I think people want it for, and when I say tweak, I basically mean add a field or two to a dict inside of a list"
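To illustrate the kind of JSON API tweak discussed in the quotes above, here is a rough sketch in which each file entry carries its own requires_dist instead of relying on the release-level "info" value. The response shape, the file names, and the helper `requires_dist_for` are illustrative assumptions, not the actual Warehouse JSON API.

```python
from typing import Optional

# Illustrative response shape only: dependency data sits on each file
# entry, so an sdist and a wheel of the same release can differ.
response = {
    "info": {"name": "example", "version": "1.0"},
    "releases": {
        "1.0": [
            # None: nothing is known for this file (unset).
            {"filename": "example-1.0.tar.gz", "requires_dist": None},
            {
                "filename": "example-1.0-py3-none-any.whl",
                "requires_dist": ["requests>=2.0"],
            },
        ]
    },
}


def requires_dist_for(data: dict, version: str, filename: str) -> Optional[list]:
    """Look up dependency metadata for one specific file (hypothetical helper)."""
    for entry in data["releases"].get(version, []):
        if entry["filename"] == filename:
            return entry["requires_dist"]
    raise KeyError(filename)
```

With a shape like this, a resolver would ask about the exact file it intends to install, so which file happened to be uploaded first no longer matters.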