Clarification of where the release metadata (notably requires_dist) in the JSON API comes from #9274

Closed
pfmoore opened this issue Mar 22, 2021 · 4 comments · Fixed by #9322

@pfmoore
Contributor

pfmoore commented Mar 22, 2021

(I chose the "feature request" template, as there isn't a "request for information" option, so treating this as a feature request for the documentation to be improved seemed best. But just getting an answer here would be sufficient for me).

What's the problem this feature will solve?
The JSON API documented here includes requires_dist metadata. But it's not clear where that data comes from or when it's filled in, so it's essentially useless - without knowing how it's derived, applications can't reliably use it for anything.

Describe the solution you'd like
A clarification on how Warehouse determines that data for projects. Specifically:

  1. Which release file is it extracted from? (In theory, a project release could contain different metadata in its Linux and Windows wheels, for example.) My assumption is "one of the wheels, but which one is essentially arbitrary", which is probably good enough for my purposes (see below).
  2. Was the data backfilled to older releases when this was added, or should I assume that for entries older than some date, missing data just means "we weren't collecting the data at this time"? (I'm pretty sure that it wouldn't have been backfilled).
  3. For releases recent enough that the data should be there, how should I interpret a value of null? I can imagine multiple possibilities:
    3a. The project explicitly declares that it has no dependencies.
    3b. The project didn't upload wheels, and you don't extract metadata from sdists, so the project might have dependencies.
    3c. The project initially uploaded a sdist but later added a wheel, and you don't update the data in this case.

Additional context
I am looking for this information for the purposes of research into projects and their dependencies on PyPI. As a consequence, I don't need 100% accurate data, but understanding the limitations of what is available would be extremely useful for me, as it would save me from having to download potentially thousands of wheels from PyPI and process them myself. I'm also mostly interested in the latest releases of projects, so historical data isn't critical to me, but being able to look at whether metadata changes over time might be of interest if history is available.

The main thing I'd like to have is some heuristics on how to interpret a value of null. One of the key questions I want to answer is "what proportion of projects have dependencies at all" and it's hard to know that without being able to distinguish between "not known" and "definitely not there".

I know there is work going on to standardise and formalise the JSON API, but I don't know how far along that is, and I would still find it useful to know the current situation.

If someone can give me a pointer to the relevant parts of the Warehouse code, and a rough summary, I'm happy to go and read the code and work out the details for myself, but at the moment I'm unfamiliar with the Warehouse codebase, so I don't know where to start.

@ewdurbin
Member

ewdurbin commented Mar 22, 2021

  1. The contents are whatever is provided in the Requires-Dist field of the metadata at upload time. It would also be the value provided by the first uploaded file for a given Release.

  2. No, no backfill would have been performed. I assume PyPI would have been one of the first tools to support the field, so it likely would have been moot to backfill. (Note that this field was added in Metadata 1.2 way back in 2005 via PEP 345)

  3. I believe the answer is none of the above: Requires-Dist is an optional metadata field, and as far as I know nothing enforces that it matches what is declared in setup.py or in any wheel.

@pfmoore
Contributor Author

pfmoore commented Mar 22, 2021

Thanks @ewdurbin.

I just realised, I have completely misunderstood what's going on here 🙁 Looking at the upload API docs I see that Warehouse doesn't introspect anything, it simply records what the uploader sends as the relevant metadata. The reason I'm now seeing dependency data more often appears to be just because twine extracts it and includes it in the upload call, and more people are using twine these days.

Which means that the metadata in the JSON API is only as reliable as the tool used to upload the data, and missing data can't be assumed to mean anything specific.
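A minimal sketch of what that means for a consumer of the JSON API, assuming the response keeps its current shape with the field at `info.requires_dist`. The function names `fetch_project_json` and `dependency_status` are hypothetical, chosen only for illustration. The point is that a missing or null value is reported as "unknown", never as "no dependencies":

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_project_json(project: str) -> Dict[str, Any]:
    """Fetch the JSON API document for the latest release of a project."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urlopen(url) as resp:
        return json.load(resp)


def dependency_status(payload: Dict[str, Any]) -> str:
    """Classify a JSON API payload by what requires_dist can actually tell us.

    Assumption: the value is whatever the upload tool sent, so a null or
    empty value only means "nothing was recorded", not "no dependencies".
    """
    requires = payload.get("info", {}).get("requires_dist")
    if requires:
        return "declared"
    return "unknown"
```

With this split, a question like "what proportion of projects have dependencies" can only be answered with a lower bound from the "declared" bucket; the "unknown" bucket still needs the distribution files themselves.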

Looks like I'm going to have to download a bunch of wheels, no reliable way to avoid it...
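For reference, a rough sketch of the per-wheel step, using only the standard library. It relies on the wheel format storing core metadata as email-style headers in a top-level `*.dist-info/METADATA` file; the function name `requires_dist_from_wheel` is hypothetical, and downloading the wheel bytes is left out:

```python
import io
import zipfile
from email.parser import BytesParser
from typing import List


def requires_dist_from_wheel(wheel_bytes: bytes) -> List[str]:
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        # Only consider the top-level dist-info directory, not vendored copies.
        candidates = [
            name
            for name in zf.namelist()
            if name.endswith(".dist-info/METADATA") and name.count("/") == 1
        ]
        if not candidates:
            raise ValueError("no top-level .dist-info/METADATA found in wheel")
        message = BytesParser().parsebytes(zf.read(candidates[0]))
    # get_all returns None when the header is absent; here that really does
    # mean this wheel declares no dependencies.
    return message.get_all("Requires-Dist") or []
```

Unlike the JSON API value, an empty result here is meaningful for that specific wheel, although other wheels in the same release could in theory declare something different.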

Sorry for the confusion, and thanks for helping.

@di
Member

di commented Mar 31, 2021

@pfmoore Shall we consider this issue resolved in that case? Or are there some documentation updates we need to make here?

@pfmoore
Contributor Author

pfmoore commented Apr 1, 2021

It would be nice if "where the data came from" could be recorded somewhere. Maybe something like #9322 would be suitable?
