Staging area for crate uploads #1503
Comments
Case in point: I just uploaded a new version of my crate to https://crates.io/crates/gui. It turns out that the relative link I embedded in README.md gets expanded, for some reason, to a wrong and non-existent file on github.com. That is absolutely not what I wanted, and the possibility to inspect the result beforehand (I could see the result on github.com, but the link works correctly there!) would have given me the chance to fix the problem. Now I have to live with it until a new version is ready.
Another use case this solves is making multi-crate releases atomic. See rust-lang/cargo#6510 (comment) where CI observed a world where failure_derive v0.1.5 and failure v0.1.4 were the latest versions available, which broke CI because failure_derive v0.1.5 required a change released in failure v0.1.5.
This fundamentally changes the workflow for all operations we perform involving git, so that they are not performed on the web server and do not block the response. This will improve the response times of `cargo publish`, and make the publish process more resilient, reducing the likelihood of an inconsistency occurring such as the index getting updated, but not our database.

Previously, our workflow looked something like this:

- When the server boots, do a full clone of the index into a known location
- Some request comes in that needs to update the index
- Database transaction is opened
- Local checkout is modified, we attempt to commit & push (note: this involves a mutex to avoid contention with another request to update the index on the same server)
- If push fails, we fetch, `reset --hard`, and try again up to 20 times
- Database transaction is committed
- We send a successful response

The reason for the retry logic is that we have more than one web server, meaning no server can be sure that its local checkout is actually up to date. There's also a major opportunity for an inconsistent state to be reached here. If the power goes out, the server is restarted, something crashes, etc., in between the index being updated and the database transaction being committed, we will never retry it.

The new workflow looks like this:

- Some request comes in that needs to update the index
- A job is queued in the database to update the index at some point in the future
- We send a successful response
- A separate process pulls the job out of the database
- A full clone of the index is performed into a temporary directory
- The new checkout is modified, committed, and pushed
- If the push succeeds, the job is removed from the database
- If the push fails, the job is marked as failed and will be retried at some point in the future

While a background worker can be spread across multiple machines and/or threads, we will be able to avoid the race conditions that were previously possible by ensuring that we only have one worker with one thread that handles index updates. Right now that's easy since index updates are the only background job we have, but as we add more we will need to add support for multiple queues to account for this.

I've opted to do a fresh checkout in every job, rather than relying on some state that was set up when the machine booted. This is mostly for simplicity's sake. It also means that if we need to scale to multiple threads/processes for other jobs, we can punt the multi-queue enhancement for a while if we wish. However, it does mean the job will take a bit longer to run. If this turns out to be a problem, it's easy to address.

This should eliminate the opportunity for the index to enter a state inconsistent with our database -- or at least they should become eventually consistent. If the power goes out before the job is committed as done, it is assumed the job failed and it will be retried. The job itself is idempotent, so even if the power goes out after the index is updated, the retry should succeed.

One other side effect of this change is that when `cargo publish` returns with an exit status of 0, that does not mean that your crate/new version is immediately available for use -- if you try to point to it in Cargo.toml seconds after publishing, you may get an error that it could not find that version. This was technically already true, since neither S3 nor GitHub guarantee that uploads/pushes are immediately visible. However, this does increase the timescale beyond the delay we would have seen there. In most cases it should be under 10 seconds, and at most a minute.

One enhancement that will come as a followup, but is not included in this PR, is a UI to see the status of your upload. This is definitely nice to have, but is not something I think is necessary for this feature to land. The time it would take to navigate to that UI is going to be longer than the time it takes the background job to run in most cases. That enhancement is something I think can go hand in hand with rust-lang#1503 (which incidentally becomes much easier to implement with this PR, since a "staging" publish just skips queuing the background job, and the only thing the button to fully publish needs to do is queue the job).

This setup does assume that all background jobs *must* eventually succeed. If any job fails, the index is in a state inconsistent with our database, and we are having an outage of some kind. Due to the nature of our background jobs, this likely means that GitHub is down, or there is a bug in our code. Either way, we page whoever is on call, since it means publishing is broken. Since publishing crates is such an infrequent event, I've set the thresholds to be extremely low.
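To make the new flow concrete, here is a minimal sketch of the enqueue half of it: the publish handler writes a row describing the index update and returns immediately, leaving the git work for the worker. The table layout, the job-type name, and the `enqueue_add_crate` helper are assumptions for illustration only, not the actual crates.io schema or API; the sketch uses the `postgres` and `serde_json` crates.

```rust
use postgres::{Client, NoTls};

// Assumed queue table (illustrative only):
//   CREATE TABLE background_jobs (
//       id         BIGSERIAL PRIMARY KEY,
//       job_type   TEXT NOT NULL,
//       data       TEXT NOT NULL,      -- serialized job arguments
//       retries    INTEGER NOT NULL DEFAULT 0,
//       created_at TIMESTAMP NOT NULL DEFAULT now()
//   );

/// Queue an index update instead of touching git during the request.
fn enqueue_add_crate(conn: &mut Client, krate: &str, vers: &str) -> Result<(), postgres::Error> {
    // Serialize the job's arguments; the worker deserializes them later.
    let data = serde_json::json!({ "crate": krate, "vers": vers }).to_string();
    conn.execute(
        "INSERT INTO background_jobs (job_type, data) VALUES ($1, $2)",
        &[&"add_crate", &data],
    )?;
    Ok(())
}

fn main() -> Result<(), postgres::Error> {
    let mut conn = Client::connect("postgres://localhost/cargo_registry", NoTls)?;
    // The publish endpoint would do this as part of its existing database
    // work and then immediately send the successful response.
    enqueue_add_crate(&mut conn, "gui", "0.2.0")?;
    Ok(())
}
```

Because the queue lives in the same database as the rest of the registry data, the job row can be inserted in the same transaction as the publish itself, which is what closes the window where the index and the database disagree.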
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times, since we have multiple web servers that can create a race condition), we instead queue the update to be run on another machine at some point in the future. This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (it defaults to paging if a job has been in the queue for 15 minutes, configurable by environment variable). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us).

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub can both have delays before they update as well, but it's going to be consistently a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case.

One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There is also some low-hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs. something like RabbitMQ or Faktory. The first is operational: we still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and in making it easier to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low-level bindings to other queues, so the majority of the "job" infrastructure would still be needed.

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The `background_jobs` module is our code for interacting with this "library".
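As a rough illustration of what a PostgreSQL row-locking queue looks like (not the actual `background` module; the table, column names, job types, and retry handling below are assumptions carried over from the earlier sketch), a worker can claim one job per transaction with `FOR UPDATE SKIP LOCKED`, so concurrent workers never grab the same row and a crash before commit simply leaves the job queued:

```rust
use postgres::{Client, NoTls, Transaction};
use std::{thread, time::Duration};

/// Claim and run at most one queued job, skipping rows other workers hold.
/// Returns Ok(true) if a job was found. Table/column names are illustrative.
fn run_next_job(tx: &mut Transaction<'_>) -> Result<bool, postgres::Error> {
    let row = tx.query_opt(
        "SELECT id, job_type, data FROM background_jobs \
         ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED",
        &[],
    )?;
    let Some(row) = row else { return Ok(false) };
    let id: i64 = row.get("id");
    let job_type: String = row.get("job_type");
    let data: String = row.get("data");

    match perform(&job_type, &data) {
        // Success: delete the row in the same transaction that locked it, so
        // a crash before commit leaves the job queued and it runs again.
        Ok(()) => {
            tx.execute("DELETE FROM background_jobs WHERE id = $1", &[&id])?;
        }
        // Failure: record the attempt and release the row; the monitor pages
        // if anything sits in the queue longer than its threshold.
        Err(err) => {
            eprintln!("job {id} ({job_type}) failed: {err}");
            tx.execute(
                "UPDATE background_jobs SET retries = retries + 1 WHERE id = $1",
                &[&id],
            )?;
        }
    }
    Ok(true)
}

/// Stand-in for the real work: a fresh index clone, commit, and push.
fn perform(job_type: &str, _data: &str) -> Result<(), String> {
    match job_type {
        "add_crate" | "yank" | "unyank" => Ok(()),
        other => Err(format!("unknown job type: {other}")),
    }
}

fn main() -> Result<(), postgres::Error> {
    let mut conn = Client::connect("postgres://localhost/cargo_registry", NoTls)?;
    loop {
        let mut tx = conn.transaction()?;
        let found = run_next_job(&mut tx)?;
        tx.commit()?;
        if !found {
            // Nothing queued; poll again shortly.
            thread::sleep(Duration::from_secs(1));
        }
    }
}
```

The `SKIP LOCKED` clause is what lets multiple workers coexist later without double-running a job, while keeping the whole queue in the same database the web servers already use.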
This issue was moved to a discussion. You can continue the conversation there.
I believe it would be great to have sort of a staging area for crates.io. Currently it appears that once a crate is uploaded there is no going back/changing of it in any way, modulo publishing a new version of course.
While that is fine conceptually, for initial releases or changes impacting the README or other things in major ways, it would be great if there were a way to try things out first and have an initial look, with the possibility of adjusting things.
PyPI has a staging area for this purpose, which seemingly is just a different instance (with a different URL) that has the once-uploaded-no-more-changes restriction removed.
I am thinking that such a feature would benefit a great deal of people. Have I just not found it?