Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[META] Draft support in the current ("legacy") upload API #17230

Open
5 tasks
woodruffw opened this issue Dec 5, 2024 · 17 comments
Open
5 tasks

[META] Draft support in the current ("legacy") upload API #17230

woodruffw opened this issue Dec 5, 2024 · 17 comments
Assignees
Labels
APIs/feeds feature request meta Meta issues (rollouts, etc)

Comments

@woodruffw
Copy link
Member

woodruffw commented Dec 5, 2024

Support for drafted uploads is a long-standing feature request, dating back (at least) to 2015 with #726.

Since then there's been been a proposed PEP, PEP 694, which defines a new upload API. This API includes support for drafted uploads, among other larger features (like upload session management, resumption, and batched uploading).

In the interest of expedience, @DarkaMau, @facutuesca, and I (@woodruffw) are looking at adding drafting support to just the legacy API, in such a way that a future implementation of PEP 694 (such as one built on #8941 or #16277) can reuse the building blocks within a new endpoint.

To keep things tractable (and keep PRs small + reviewable), here's the units of work we're proposing:

  • Add published: datetime | None = datetime.now to the Release model, with a schema migration. This can be picked/adapted from PEP 694 (resolved merge conflicts from PR #8941) #16277

  • Add a PyPI-specific metadata field (or HTTP request header) to the upload endpoint, with the following semantics:

    • If the project or release is being created for the first time, sets published = None
    • If the release already exists and was not already marked as a draft, fails the upload (since this suggests a mixed/confused state on the user's side)

    The current work in Draft Release support in PyPI #8941/PEP 694 (resolved merge conflicts from PR #8941) #16277 uses the Is-Draft: true | false header, but I propose we prefix this with X-PyPI- to avoid standard header conflicts, so we'd check X-PyPI-Is-Draft or similar.

  • When Release.published is None, we should not present it in the standard indices/APIs. Instead, we'll serve it at a temporary draft index. Draft Release support in PyPI #8941 has this as /simple/draft/<draft_hash>, but I propose we drop the hash and do this with a URL that mirrors the existing hierarchy: /simple/draft/<name>/<version>/ or even just /simple/draft/<name> with all drafts for the project being aggregated.

  • When Release.published is None, the given release should not appear as the latest release on the web view. Instead, it should have presentation behavior similar to yanked releases: we can show it on the /project/<name>/#history timeline, but with a badge indicating that it's a draft. When navigating to /project/<name>/<version>, draft versions should be rendered like normal versions but with two differences:

    • A warning flash/banner/element on the top of the page, similar to what appears when a release has been yanked
    • The pip install ... instructions should be specialized to show the draft index, per the step above

All told, I think this gives us a set of independent tasks to accomplish drafting. Our proposal is that we try and use as much of #8941 and #16277 here as possible, but do so via independent PRs to keep things as small as possible and reduce reviewer burden.

Separately, as client-side/tooling changes that'll need to occur:

  • twine and similar uploading tools will ideally learn a --draft or similar flag that sets the appropriate header

CC @di for thoughts, and thanks @warsaw and @alanbato for the work you've done on this already!

@woodruffw woodruffw added APIs/feeds feature request meta Meta issues (rollouts, etc) labels Dec 5, 2024
@woodruffw woodruffw self-assigned this Dec 5, 2024
@DarkaMaul
Copy link
Contributor

When Release.published is None, the given release should not appear as the latest release on the web view. Instead, it should have presentation behavior similar to yanked releases: we can show it on the /project//#history timeline, but with a badge indicating that it's a draft. When navigating to /project//, draft versions should be rendered like normal versions but with two differences:

I worry that displaying draft releases in the project release timeline would confuse the end-users (i.e., not the project developers).
Having them accessible with the correct (and guessable?) URLs while keeping them unlisted would, I think, create a clearer separation between drafts and pre-releases.

@warsaw
Copy link
Contributor

warsaw commented Dec 5, 2024

Thanks, I'll review this in more detail when I get a chance. But just to let you know, I'm almost done with my PEP 694 rewrite. Keep in mind that PR will still go through one more pass on my side, but then I plan on rebooting the DPO thread.

@woodruffw
Copy link
Member Author

I worry that displaying draft releases in the project release timeline would confuse the end-users (i.e., not the project developers).
Having them accessible with the correct (and guessable?) URLs while keeping them unlisted would, I think, create a clearer separation between drafts and pre-releases.

That's a good point -- putting it in the timeline might be unnecessarily confusing. At the same time, I think we should render the draft release itself as a page, for a few reasons:

  • It'll be hard to teach users how to use the temporary "draft indices" without having the information presented somewhere (besides the docs, which people tend to only read as a last resort). Having some kind of HTML view of drafted releases with a specialized pip install command/similar will help get users unstuck when they do need to use drafts.
  • Without some kind of render, the project creation flow with a drafted first release would result in a mixed state: the project would exist, but no /project/<name>/ view would exist for it. I think we want to avoid that, since that would cause additional user confusion + require tools like twine to specialize their printing behavior to avoid showing a friendly URL in that case.

As a middle ground here, we could leave drafted projects out of the timeline but add a flash/element to the top of /project/<name>/ when a newer draft is present.

@warsaw
Copy link
Contributor

warsaw commented Dec 5, 2024

that PR will still go through one more pass on my side

This pass includes updating the PR to align with the current state of the resumable upload RFC draft which formed the original basis for PEP 694.

@konstin
Copy link
Contributor

konstin commented Dec 5, 2024

Asking as a uv publish implementer, how is re-uploading files is handled, both when identical file is re-uploaded and when a different file is re-uploaded?

@woodruffw
Copy link
Member Author

Asking as a uv publish implementer, how is re-uploading files is handled, both when identical file is re-uploaded and when a different file is re-uploaded?

I think the draft feature here would not affect the current behavior: files are unique at the index level, so draft releases would have immutable file -> content mappings like all other releases.

(In other words, drafts wouldn't be for iterating on a release's files; they're only for staging a release so that all files become public in one atomic step + allowing projects to share a new release before it becomes resolvable by default.)

@facutuesca
Copy link
Contributor

facutuesca commented Dec 5, 2024

they're only for staging a release so that all files become public in one atomic step (a) + allowing projects to share a new release before it becomes resolvable by default (b).

How important is (b) ?. In my mind, draft releases are a tool for maintainers to:

  • atomically publish all files in a single release (same as (a))
  • test that the upload worked as expected before making the release public (same as using TestPyPI)

Those could be achieved without making draft releases public to users, and without the need to expose a temporary draft index. Then draft releases would be only visible to maintainers in the web UI, as a tool to double-check and more neatly publish a release.

So my question is, how common is the problem that (b) is trying to solve, and is it worth the added complexity? With private draft releases, only publishing tools need to implement support for uploading draft releases, whereas if we make them public, they're suddenly a concern for any tool that downloads packages from the index.

@woodruffw
Copy link
Member Author

So my question is, how common is the problem that (b) is trying to solve, and is it worth the added complexity? With private pre-releases, only publishing tools need to implement support for uploading draft releases

I don't have good numbers here, but my understanding is that (b) is a recurring request from projects with long or complicated release procedures (e.g. due to complex native builds or their own build farms): having to do a pre-release means having to run their release processes twice, and it means more downstream test churn (since users looking to evaluate the pre-release need to upgrade twice, rather than opting into the draft and then rolling forwards).

whereas if we make them public, they're suddenly a concern for any tool that downloads packages from the index.

I think this is ameliorated by the split-index design: drafts won't appear in /simple/ or /simple/<name>/, so tools like pip and uv won't need to consider or filter them at all by default. They'll only be considered for resolution when a user explicitly adds their indices, e.g. via --extra-index-url.

Given that IMO the complexity here is pretty constrained: nothing will change in ordinary installation flows, and anything that uses the draft indices should interoperate fine, given that the indices will be the standard PEP 503/691 ones.

@warsaw
Copy link
Contributor

warsaw commented Dec 5, 2024

Some thoughts:

  • I prefer the term "stage" instead of "draft" and my PEP 694 rewrite uses that consistently.
  • I strongly believe stages need to be "obfuscatable". The use case is to be able to stage a release pre-announcement and then publish in one go all the wheels for that release at the moment of the announcement. I do not believe we need to lock stages down behind authenticated access. My PEP 694 rewrite accomplishes this by including an optional "nonce" in the stage creation request, which defaults to the empty string. If not given, then only the package name and version contribute to the stage token, but this nonce gives the client the option to further contribute to the stage token hash, with only information it knows. Thus, it's up to the client as to whether stage access is easily guessable or well-obscured. I think this addresses the (b) question.
  • I don't think staged files should be visible on a project's release timeline, or otherwise visible to the general public. It's confusing and potentially leaking of information (plus, remember another use case for stages is so that clients don't see incomplete releases). I'm inclined to omit them even for logged in owners of the stage, just for simplicity, but I'm not totally against it. However, through the use of the stage URL and/or stage token, programmatic access to the files in the stage (e.g. via pip's --extra-index-url) would be accessible, for testing purposes.
  • @konstin In my rewrite, stages are mutable and files in those stages can be overwritten or deleted at whim. The protocol easily supports this too. Only at the point that a stage is published do they acquire the normal constraints on mutability. I think this is a core use case for stages (i.e. since you're testing your staged release, you must be able to re-upload faulty wheels).
  • Thus, @woodruffw "In other words, drafts wouldn't be for iterating on a release's files; they're only for staging a release so that all files become public in one atomic step + allowing projects to share a new release before it becomes resolvable by default." Disagree with the first part!

@woodruffw
Copy link
Member Author

  • I prefer the term "stage" instead of "draft" and my PEP 694 rewrite uses that consistently.

No objection 🙂 -- I agree that's a better term.

I strongly believe stages need to be "obfuscatable". The use case is to be able to stage a release pre-announcement and then publish in one go all the wheels for that release at the moment of the announcement. I do not believe we need to lock stages down behind authenticated access. [...] Thus, it's up to the client as to whether stage access is easily guessable or well-obscured. I think this addresses the (b) question.

I'm a little bit murky on the value of this: is the idea to allow uploaders to hide their staged release until they want it made fully public? Given that features like attestations (and longer term, things like index-wide transparency logging) will publicly disclosure releases as they're made, I think the obfuscatory effect here is pretty limited.

(I think this is also true especially for the CI/CD context: projects that do public staged releases via GitHub or GitLab are going to have their staged release URLs in their public run logs.)

I'm also a little murky on optional client-controlled randomness: I think, if the staged release needs to be obfuscated, then it makes sense to always do it, and have the server handle the thing we're calling the staging token. My reasoning there is that clients will struggle to make informed decisions around how to turn this knob, so if we want to give them obfuscated staging URLs then we should do it for them. But as part of that, I don't want to accidentally produce an unenforceable security boundary here, per above 🙂

I don't think staged files should be visible on a project's release timeline, or otherwise visible to the general public. It's confusing and potentially leaking of information (plus, remember another use case for stages is so that clients don't see incomplete releases).

No objection, although I think we should state explicitly whether this is for user confusion reasons (a sufficient reason IMO!) or obfuscation reasons (which IMO are harder to formalize, per the transparency and CI/CD leakiness points above).

More generally, I think PyPI should retain the (currently implicit, but maybe it should be explicit) assumption that once you put something on the index, it should be considered disclosed. Obfuscation can be an OK feature for misuse-resistance purposes in that context, but I think it can't be a security boundary here.

(I'll update the top-level comment to adjust these parts, since I think you're 100% right about drafts being confusing in the web UI.)

I think this is a core use case for stages (i.e. since you're testing your staged release, you must be able to re-upload faulty wheels).

I'm not inherently opposed to mutable staging but I think it encourages a pretty different model than the one that Python packaging (IMO) is moving towards as a whole, where versions are "cheap" and the mapping between a package's identity (= distribution filename) and its contents (= strong digest) is immutable. That model is the one that PEP 740 (and other cryptographic packaging PEPs, like 458) assumes. It's also (separately) the one that PyPI is architecturally rigged for, since PyPI already has permanent, unique (filename, digest) pairs baked into the codebase.

Or perhaps as an alternate framing: I think the distinction between pre-releases and staged releases collapses too much if pre-releases become a mutable feature 😅 -- I think right now a lot of projects use pre-releases as a low-impact place to experiment with their uploads since they don't get resolved by default, and I'm not sure it makes sense for PyPI to have multiple ways to accomplish similar packaging tasks, versus having pre-releases be the "sandbox" feature and staging be the "make a release atomic" feature.

@woodruffw
Copy link
Member Author

That model is the one that PEP 740 (and other cryptographic packaging PEPs, like 458) assumes. It's also (separately) the one that PyPI is architecturally rigged for, since PyPI already has permanent, unique (filename, digest) pairs baked into the codebase.

To tack onto this: one weird (not necessarily insecure, but not ideal) consequence to mutable staged releases is that a given foo-1.2.3.tar.gz would potentially have multiple attestations for different content digests. This can happen currently anyways (since users and their CI/CD setups are in full control of signing), but it's less likely in current setups because of the larger shift towards release immutability.

The consequence of this would be some potentially useful (although admittedly narrow!) flexibility for an attacker: if foo-1.2.3.tar.gz @ sha123 is uploaded with a valid attestation, then replaced with foo-1.2.3.tar.gz @ sha456 with another valid attestation, an attacker could take advantage of a vulnerability in the first "revision" and serve it as if it was the second revision.

@warsaw
Copy link
Contributor

warsaw commented Dec 6, 2024

I'm a little bit murky on the value of this: is the idea to allow uploaders to hide their staged release until they want it made fully public? Given that features like attestations (and longer term, things like index-wide transparency logging) will publicly disclosure releases as they're made, I think the obfuscatory effect here is pretty limited.

Maybe the key difference in how we're thinking about stages is whether you would call the stage "a release" (and/or "pre-release" - a term we should probably avoid so as not to confuse with pre-release version numbers). The way I'm thinking about them, they really aren't a release, hence the term "stage"! This is not a release until the stage is actually published, at which point it becomes a release, along with all the constraints and guarantees you mention.

It seems clear that test.pypi.org is insufficient for testing purposes, so a staged release with overwrite capability and optional obfuscation provides everything needed for complex testing purposes. This is especially true for coordinated releases across many cross-dependent packages. E.g. I intend to release foo but also foo-plugin-A, foo-plugin-B, etc. I can create a stage for each of these, add all the --extra-index-url options I want, do all the final mile testing I need to do, then on Flag Day, I push the button and publish all the stages for all the packages at the same time.

I think PyPI should retain the (currently implicit, but maybe it should be explicit) assumption that once you put something on the index, it should be considered disclosed. Obfuscation can be an OK feature for misuse-resistance purposes in that context, but I think it can't be a security boundary here.

Agreed that if you really care about keeping artifacts completely secret, you can't use the staging feature. That said, I still think the optional obfuscation is useful as a middle ground between completely public and completely secret. And because the nonce in my proposal is optional, it's totally up to the client to decide. For example:

  • I want to stage a new release and let my downstreams test it out. I decide not to use a nonce and therefore the stage token is easily calculated by any interested consumer. They test things out with --extra-index-url and confirm that the packages to be published in the stage are working correctly.
  • My own CI system is building wheels and uploading them to the stage, but I don't want anybody to be able to guess (based just on project name and version) that there might be a release. I certainly don't want random users to be able to probe PyPI for upcoming releases. I share the nonce between my pipeline jobs so they can upload the wheels when they've been built, and then once all those jobs are complete, I run a final CI job to ensure that everything looks good before I publish it.

I think the distinction between pre-releases and staged releases collapses too much if pre-releases become a mutable feature 😅 -- I think right now a lot of projects use pre-releases as a low-impact place to experiment with their uploads since they don't get resolved by default, and I'm not sure it makes sense for PyPI to have multiple ways to accomplish similar packaging tasks, versus having pre-releases be the "sandbox" feature and staging be the "make a release atomic" feature.

By "pre-releases" here, you mean actual published releases with a pre-release version number. And all of what you say is fine, but it's not a substitute for a testable, pre-published, final release. Failures can and do happen between say the last rc and the final release, and it's that use case that I want to support.

The consequence of this would be some potentially useful (although admittedly narrow!) flexibility for an attacker: if foo-1.2.3.tar.gz @ sha123 is uploaded with a valid attestation, then replaced with foo-1.2.3.tar.gz @ sha456 with another valid attestation, an attacker could take advantage of a vulnerability in the first "revision" and serve it as if it was the second revision.

If we're treating stages as not-yet-published, then I don't think this is a realistic scenario. Yes, the @ sha123 attestation gets uploaded, but if we DELETE or overwrite the foo-1.2.3.tar.gz, regardless of the sha, then that should also delete any attestations connected to the original tarball. It's as if that original tarball was never uploaded.

The original PEP 694 was based on https://tus.io which then got turned into an Internet Draft. The draft has been updated since then, and I am in the process of aligning my PEP 694 update to the latest draft version. Although stages aren't part of that protocol, the ability to delete or overwrite artifacts are, and I think that has valid use cases that should be supported, at least up until the actual publish step.

@di
Copy link
Member

di commented Feb 7, 2025

I've been thinking through this issue and the related PRs (#17331) and how they relate to #726, #918 as well as #17520.

Assuming that we a) want to stop running TestPyPI as a separate service (due to many reasons why it is less than ideal) but b) still provide the benefits it provides (namely, uploading a release, testing if it is publishable, and testing installing it), what if we combined these efforts:

  • TestPyPI as a separate Warehouse instance is shut down completely.
  • Instead, test.pypi.org becomes a "view" over everything on PyPI that is both "live" and "staged'
    • This means that we don't have to worry about giving people special URLs which let them access their staged files, it's just a different subdomain
    • It means that installers don't have to learn the difference between a "staged" or "live" release, just different indexes, which they already understand
    • It also means that issues like duplicate accounts on PyPI/TestPyPI go away, as well as disparities between which packages are available on which which hinder TestPyPI actually being useful for installing from.
  • All uploads through test.pypi.org are automatically marked as "staged"
    • This means that tools like twine don't need to gain support for X-PyPI-Is-Staged or similar flags, and they will just work out of the box
    • But it's still an intentional step for the user to publish a "test" release
  • Uploading additional files to test.pypi.org can overwrite other staged files
    • Avoids a current shortcoming of TestPyPI in that it acts exactly like PyPI and prohibits re-uploads.
  • Publishing an identical file to test.pypi.org and then through upload.pypi.org does nothing but convert the file from "staged" to "live"
    • this continues supporting the "publish to TestPyPI then publish to PyPI" workflow that is fairly popular

Additionally: a lot of this functionality seems similar to our project lifecycle features, but at the file/release level, not the project level. We should probably think about how we can create file/release lifecycles and how they may affect the project lifecycle. E.g., publishing several staged files for a project that didn't previously exist should probably create a "staged" project until one or more of those files becomes "live".

@warsaw
Copy link
Contributor

warsaw commented Feb 8, 2025

It's a really interesting idea and I want to think about it more, but let me ask some questions!

  • When I want to create a stage (before I upload any files), would I do that on test.pypi.org, pypi.org, or either?
  • Would I still be able to stage multiple files together?
  • Would I still be able to publish all staged files atomically?
  • Would I still be able to obfuscate a release's files while they're being staged and tested? (i.e. to handle the "make this release live at 9am tomorrow" use case.)

I like the idea of using multiple index URLs to access stages. I'm seeing a lot of momentum behind making multiple indexes work better, and if we can accomplish that, then this is a really nice extension of that idea that unifies a lot of things.

I like that test.pypi.org as a separate service goes away and becomes a view on pypi.org. No more having to handle multiple accounts, project registrations, etc. (We may end up deciding that test.pypi.org has too much baggage and a different hostname is better, but we can defer that.)

I'm not sure I like the idea of one file in a stage becoming "live" without an explicit publishing step. That seems like it may be a recipe for accidental exposure that could pollute the whole release.

Additionally: a lot of this functionality seems similar to our project lifecycle features, but at the file/release level, not the project level. We should probably think about how we can create file/release lifecycles and how they may affect the project lifecycle. E.g., publishing several staged files for a project that didn't previously exist should probably create a "staged" project until one or more of those files becomes "live".

Could you provide links or details on these?

@konstin
Copy link
Contributor

konstin commented Feb 10, 2025

From my (potentially very unrepresentative) view as a tool author, test.pypi.org is a great resource, but i can understand if it's too much diversion from pypi.org.

Similarly as tool author, i'm looking for way to make the CLI interface and CI configuration as similar as possible between pypi and other indexes, so that we don't get a split between features, documentation and workflows for pypi and for elsewhere, e.g. i expect that staging will become widely supported, but generally under a single domain.

@di
Copy link
Member

di commented Feb 11, 2025

When I want to create a stage (before I upload any files), would I do that on test.pypi.org, pypi.org, or either?

I'm not sure the question makes sense, as PyPI currently has no real notion of something that exists before a file is uploaded. What are you imagining this would be? Are you thinking about similar functionality to project name reservations?

Would I still be able to stage multiple files together?

Yes, all files uploaded through test.pypi.org would remain staged until they were uploaded to pypi.org.

Would I still be able to publish all staged files atomically?

To be honest I'm a little confused by the way "atomically" is being used in this issue. I would assume that "atomic" publishing would mean "each file is published individually irrespective of the status of any other file" but it seems like here it's being used to describe "publish some set of files in a single step". I think the latter (which I would call "release publishing", perhaps) is an interesting idea but not strictly necessary to support staged publishing and probably would require quite a bit of updates to how the upload API works.

Would I still be able to obfuscate a release's files while they're being staged and tested? (i.e. to handle the "make this release live at 9am tomorrow" use case.)

Only in the sense that the files would not be installable from pypi.org until they are published, but they would be installable from test.pypi.org. I think this satisfies the use case of "make staged files installable for testing but not widely available until they are published". I don't think this satisfies the use case of "make staged files secret/private" but I'm not convinced that should be a goal here.

@warsaw
Copy link
Contributor

warsaw commented Feb 12, 2025

I'm not sure the question makes sense, as PyPI currently has no real notion of something that exists before a file is uploaded. What are you imagining this would be? Are you thinking about similar functionality to project name reservations?

I was thinking about this in the context of PEP 694. In that PEP, something would exist before the first file was uploaded. But now I see that this ticket is really proposing draft support in the existing legacy API, not a PEP 694 world.

Yes, all files uploaded through test.pypi.org would remain staged until they were uploaded to pypi.org.

So, uploads would have to happen twice to be published, once to test.pypi.org then to pypi.org. Under 694, you'd only upload once.

it seems like here it's being used to describe "publish some set of files in a single step". I think the latter (which I would call "release publishing", perhaps) is an interesting idea but not strictly necessary to support staged publishing and probably would require quite a bit of updates to how the upload API works.

Indeed 😄 - again I'm thinking about a PEP 694 world.

I don't think this satisfies the use case of "make staged files secret/private" but I'm not convinced that should be a goal here.

Perhaps not. It's a goal of 694 though.

Thanks. I think I was just confused about the intent of the discussion here. Totally fine to consider these as separate initiatives of course. I'm very interested in pushing 694 forward because I think it provides a lot of value to the package producers and consumers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APIs/feeds feature request meta Meta issues (rollouts, etc)
Projects
None yet
Development

No branches or pull requests

6 participants