Staging area for crate uploads #1503

Closed
d-e-s-o opened this issue Sep 28, 2018 · 2 comments
Labels
C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments


d-e-s-o commented Sep 28, 2018

I believe it would be great to have a sort of staging area for crates.io. Currently it appears that once a crate has been uploaded there is no going back and no way to change it, other than by publishing a new version, of course.
While that is fine conceptually, for initial releases or for changes that affect the README or other things in major ways, it would be great if there were a way to try things out first and take an initial look, with the possibility of adjusting things.
PyPI has a staging area for this purpose, which is seemingly just a different instance (different URL) with the once-uploaded-no-more-changes restriction removed.

I am thinking that such a feature would benefit a great many people. Have I just not found it?


d-e-s-o commented Sep 28, 2018

Case in point: I just uploaded a new version of my crate to https://crates.io/crates/gui. It turns out that the relative link I embedded in README.md gets expanded, for some reason, into a wrong, non-existent file on github.com. That is absolutely not what I wanted, and the possibility to inspect the result beforehand (I could see the rendered README on github.com, but the link works correctly there!) would have given me the chance to fix the problem. Now I have to live with it until a new version is ready.


dwijnand commented Jan 2, 2019

Another use case this would solve is making multi-crate releases atomic. See rust-lang/cargo#6510 (comment), where CI observed a world in which failure_derive v0.1.5 and failure v0.1.4 were the latest versions available, which broke CI because failure_derive v0.1.5 required a change released in failure v0.1.5.

sgrif added a commit to sgrif/crates.io that referenced this issue Jan 3, 2019
This fundamentally changes the workflow for all operations we perform
involving git, so that they are not performed on the web server and do
not block the response. This will improve the response times of `cargo
publish`, and make the publish process more resilient, reducing the
likelihood of an inconsistency occurring, such as the index getting
updated but not our database.

Previously, our workflow looked something like this (a rough sketch follows the list):

- When the server boots, do a full clone of the index into a known
  location
- Some request comes in that needs to update the index
- Database transaction is opened
- Local checkout is modified, we attempt to commit & push (note: This
  involves a mutex to avoid contention with another request to update
  the index on the same server)
- If push fails, we fetch, `reset --hard`, and try again up to 20 times
- Database transaction is committed
- We send a successful response
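
To make the old flow concrete, here is a rough sketch of the commit/push/retry loop described above. It is not the actual crates.io code: git is driven through the CLI via `std::process::Command` for brevity, and `modify_checkout` stands in for whatever change the request needs to make to the local checkout.

```rust
// Rough sketch of the old synchronous flow. Paths, the commit message, and the
// `modify_checkout` callback are hypothetical placeholders.
use std::path::Path;
use std::process::Command;

fn run_git(checkout: &Path, args: &[&str]) -> bool {
    Command::new("git")
        .current_dir(checkout)
        .args(args)
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

fn update_index_synchronously(checkout: &Path, modify_checkout: impl Fn(&Path)) -> Result<(), String> {
    // In the real setup a mutex guarded this section so that two requests on
    // the same server could not touch the checkout at the same time.
    for _attempt in 0..20 {
        modify_checkout(checkout);
        let committed = run_git(checkout, &["add", "--all"])
            && run_git(checkout, &["commit", "-m", "Update crate"]);
        if committed && run_git(checkout, &["push", "origin", "master"]) {
            return Ok(()); // the database transaction is committed after this
        }
        // Push (or commit) failed, e.g. another server pushed first:
        // resync the checkout and try again.
        run_git(checkout, &["fetch", "origin"]);
        run_git(checkout, &["reset", "--hard", "origin/master"]);
    }
    Err("failed to push index update after 20 attempts".to_string())
}
```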

The reason for the retry logic is that we have more than one web server,
meaning no server can be sure that its local checkout is actually up to
date. There is also a major opportunity for an inconsistent state to be
reached here: if the power goes out, the server is restarted, or something
crashes between the index being updated and the database transaction being
committed, we will never retry the index update.

The new workflow looks like this (the enqueue step is sketched after the list):

- Some request comes in that needs to update the index
- A job is queued in the database to update the index at some point in
  the future.
- We send a successful response
- A separate process pulls the job out of the database
- A full clone of the index is performed into a temporary directory
- The new checkout is modified, committed, and pushed
- If push succeeds, job is removed from database
- If push fails, job is marked as failed and will be retried at some
  point in the future
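
The enqueue step boils down to inserting a durable row in the same database that holds the rest of our data. The actual implementation uses Diesel; the sketch below uses the `postgres` crate instead, and the `background_jobs` table layout shown here is illustrative rather than the exact schema.

```rust
// Minimal sketch of enqueueing an index update as a database row.
use postgres::{Client, NoTls};

fn enqueue_index_update(db_url: &str, job_data: &str) -> Result<(), postgres::Error> {
    let mut client = Client::connect(db_url, NoTls)?;
    // The job is stored in the same database as the rest of our data, so it
    // survives crashes and restarts of the web server.
    client.execute(
        "INSERT INTO background_jobs (job_type, data, created_at) \
         VALUES ($1, $2, NOW())",
        &[&"update_index", &job_data],
    )?;
    Ok(())
}
```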

While a background worker can be spread across multiple machines and/or
threads, we will be able to avoid the race conditions that were
previously possible by ensuring that we only have one worker with one
thread that handles index updates. Right now that's easy since index
updates are the only background job we have, but as we add more we will
need to add support for multiple queues to account for this.

I've opted to do a fresh checkout in every job, rather than relying on
some state that was set up when the machine booted. This is mostly for
simplicity's sake. It also means that if we need to scale to multiple
threads/processes for other jobs, we can punt the multi-queue
enhancement for a while if we wish. However, it does mean the job will
take a bit longer to run. If this turns out to be a problem, it's easy
to address.
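
For illustration, a fresh checkout per job might look something like the following. The clone URL points at the public index repository; the use of the `tempfile` crate and the git CLI here is just one way to express the idea, not necessarily how crates.io does it.

```rust
// Sketch of the per-job fresh checkout: clone the index into a temporary
// directory that is removed when the job finishes.
use std::process::Command;
use tempfile::TempDir;

fn fresh_index_checkout() -> std::io::Result<TempDir> {
    let dir = TempDir::new()?; // cleaned up automatically when dropped
    let status = Command::new("git")
        .args(&["clone", "https://github.com/rust-lang/crates.io-index.git"])
        .arg(dir.path())
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "git clone of the index failed",
        ));
    }
    Ok(dir)
}
```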

This should eliminate the opportunity for the index to end up in a state
that is inconsistent with our database -- or at least the two should become
eventually consistent. If the power goes out before the job is committed
as done, it is assumed the job failed and it will be retried. The job
itself is idempotent, so even if the power goes out after the index is
updated, the retry should succeed.
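
The idempotency argument can be seen in a small sketch: registering a version in the index amounts to appending one line to the crate's file, and appending only when the line is not already present makes a retry after a partial failure harmless. The file path handling and the JSON line are placeholders here.

```rust
// Why the job is safe to retry: "write it only if it is not already there".
use std::fs;
use std::io::Write;
use std::path::Path;

fn add_version_line(index_file: &Path, version_json: &str) -> std::io::Result<()> {
    let existing = fs::read_to_string(index_file).unwrap_or_default();
    // If a previous attempt already appended this exact line (e.g. the push
    // succeeded but the job was not marked done), do nothing.
    if existing.lines().any(|line| line == version_json) {
        return Ok(());
    }
    let mut file = fs::OpenOptions::new().create(true).append(true).open(index_file)?;
    writeln!(file, "{}", version_json)?;
    Ok(())
}
```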

One other side effect of this change is that when `cargo publish`
returns with an exit status of 0, that does not mean that your crate/new
version is immediately available for use -- if you try to point to it in
Cargo.toml seconds after publishing, you may get an error that it could
not find that version. This was technically already true, since neither
S3 nor GitHub guarantee that uploads/pushes are immediately visible.
However, this does increase the timescale beyond the delay we would have
seen there. In most cases it should be under 10 seconds, and at most a
minute.
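
For scripts that publish a crate and immediately want to depend on the new version, a simple hypothetical workaround is to poll until the version becomes resolvable, for example by retrying `cargo update` with `--precise`. The crate name, version, and retry bounds below are placeholders.

```rust
// Hypothetical polling workaround: retry `cargo update` until the freshly
// published version is visible to the resolver.
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn wait_until_published(krate: &str, version: &str) -> bool {
    for _ in 0..30 {
        let ok = Command::new("cargo")
            .args(&["update", "-p", krate, "--precise", version])
            .status()
            .map(|s| s.success())
            .unwrap_or(false);
        if ok {
            return true;
        }
        sleep(Duration::from_secs(5)); // index updates usually land within a minute
    }
    false
}
```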

One enhancement that will come as a followup, but is not included in
this PR is a UI to see the status of your upload. This is definitely
nice to have, but is not something I think is necessary for this feature
to land. The time it would take to navigate to that UI is going to be
longer than the time it takes the background job to run in most cases.
That enhancement is something I think can go hand in hand with rust-lang#1503
(which incidentally becomes much easier to implement with this PR, since
a "staging" publish just skips queuing the background job, and the only
thing the button to full publish needs to do is queue the job).

This setup does assume that all background jobs *must* eventually
succeed. If any job fails, the index is in an inconsistent state with
our database, and we are having an outage of some kind. Due to the
nature of our background jobs, this likely means that GitHub is down, or
there is a bug in our code. Either way, we page whoever is on-call,
since it means publishing is broken. Since publishing crates is such an
infrequent event, I've set the thresholds to be extremely low.
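
A sketch of what that monitoring might look like: count jobs that have been sitting in the queue longer than a deliberately low threshold and page someone if there are any. The table name, the `MAX_JOB_AGE_MINUTES` environment variable, and the `page_on_call` stub are illustrative, not the real configuration.

```rust
// Sketch of the monitoring side: if any queued job is older than the
// threshold, something is wrong and a human gets paged.
use postgres::{Client, NoTls};

fn check_queue(db_url: &str) -> Result<(), postgres::Error> {
    let threshold_minutes: f64 = std::env::var("MAX_JOB_AGE_MINUTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(15.0);
    let mut client = Client::connect(db_url, NoTls)?;
    let row = client.query_one(
        "SELECT COUNT(*) FROM background_jobs \
         WHERE created_at < NOW() - ($1 * INTERVAL '1 minute')",
        &[&threshold_minutes],
    )?;
    let stuck_jobs: i64 = row.get(0);
    if stuck_jobs > 0 {
        page_on_call(stuck_jobs); // stand-in for the real alerting integration
    }
    Ok(())
}

fn page_on_call(count: i64) {
    eprintln!("{count} background job(s) have been queued for too long");
}
```
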
sgrif added a commit to sgrif/crates.io that referenced this issue Feb 13, 2019

sgrif added a commit to sgrif/crates.io that referenced this issue Feb 22, 2019

sgrif added a commit to sgrif/crates.io that referenced this issue Mar 8, 2019
bors added a commit that referenced this issue Mar 8, 2019
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times since we have multiple web servers that can create a race condition), we instead queue the update to be run on another machine at some point in the future.

This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production that is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us).

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case. One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There are also some low hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.
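
As a minimal sketch of the row-locking idea (the real implementation lives in the `background` module and uses Diesel), a worker can grab one job inside a transaction with `FOR UPDATE SKIP LOCKED` and only delete the row once the work succeeds, so a job held by a crashed worker simply becomes visible again when its transaction rolls back. The table layout and the `run_index_update` stub below are illustrative.

```rust
// Minimal sketch of a PostgreSQL row-locking job queue using SKIP LOCKED.
// Several workers can poll the same table without two of them receiving the
// same job, and an unfinished job is retried after its transaction rolls back.
use postgres::{Client, NoTls};

fn run_next_job(db_url: &str) -> Result<bool, postgres::Error> {
    let mut client = Client::connect(db_url, NoTls)?;
    let mut tx = client.transaction()?;
    let rows = tx.query(
        "SELECT id, data FROM background_jobs \
         ORDER BY created_at \
         LIMIT 1 \
         FOR UPDATE SKIP LOCKED",
        &[],
    )?;
    let Some(row) = rows.first() else {
        return Ok(false); // nothing to do
    };
    let id: i64 = row.get(0);
    let data: String = row.get(1);

    run_index_update(&data); // fresh checkout, apply the change, commit, push

    // Only delete the job once the work succeeded; if we crash before this
    // point, the transaction rolls back and the job is picked up again later.
    tx.execute("DELETE FROM background_jobs WHERE id = $1", &[&id])?;
    tx.commit()?;
    Ok(true)
}

fn run_index_update(_data: &str) {
    // Stand-in for the real work performed by the background worker.
}
```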

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The `background_jobs` module is our code for interacting with this "library".
bors added a commit that referenced this issue Mar 9, 2019
Move index updates off the web server
Turbo87 added the C-enhancement ✨ label (Category: Adding new behavior or a change to the way an existing feature works) and removed the C-feature-request label Feb 11, 2021
rust-lang locked and limited conversation to collaborators Dec 23, 2021
Turbo87 converted this issue into discussion #4330 Dec 23, 2021

This issue was moved to a discussion. You can continue the conversation there.