Auto merge of #1588 - sgrif:sg-background-jobs, r=sgrif
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times, since our multiple web servers can race against each other), we instead queue the update to be run on another machine at some point in the future.

This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call, with extremely low thresholds (it defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us).
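
To make that concrete, here is a minimal sketch of the check such a monitor can perform against the `background_jobs` table introduced below. The env var name and the `page_on_call` stub are assumptions for illustration, not the monitor binary's real interface; the 15-minute default matches the description above.

use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::BigInt;
use std::env;

#[derive(QueryableByName)]
struct Count {
    #[sql_type = "BigInt"]
    count: i64,
}

/// Page if any job has been sitting in the queue longer than the threshold.
fn check_stalled_jobs(conn: &PgConnection) -> QueryResult<()> {
    // `MAX_JOB_AGE_MINUTES` is an illustrative name, not the real env var.
    let threshold: i64 = env::var("MAX_JOB_AGE_MINUTES")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(15);

    let stalled = sql_query(
        "SELECT COUNT(*) AS count FROM background_jobs \
         WHERE created_at < NOW() - $1 * INTERVAL '1 minute'",
    )
    .bind::<BigInt, _>(threshold)
    .get_result::<Count>(conn)?;

    if stalled.count > 0 {
        page_on_call(); // stub; the real monitor triggers a PagerDuty alert
    }
    Ok(())
}

fn page_on_call() {
    eprintln!("background jobs are stalled; paging on-call");
}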

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case. One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There is also some low-hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.
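
To sketch how row locking serves as a queue (a simplified illustration, not the PR's exact queries): each worker claims a row inside a transaction with PostgreSQL's `FOR UPDATE SKIP LOCKED`, so concurrent workers skip already-claimed rows instead of blocking on them or running the same job twice, and a failed job's rollback leaves the row visible for a retry.

use diesel::prelude::*;
use diesel::result::Error;
use diesel::sql_query;
use diesel::sql_types::BigInt;

#[derive(QueryableByName)]
struct ClaimedJob {
    #[sql_type = "BigInt"]
    id: i64,
}

/// Claim one job, run it, and delete it, all in a single transaction.
/// If the worker dies or the job fails, the transaction rolls back and
/// the row becomes visible to other workers again for a retry.
fn run_next_job(conn: &PgConnection) -> Result<(), Error> {
    conn.transaction::<_, Error, _>(|| {
        // SKIP LOCKED means other workers skip rows we have locked,
        // rather than blocking on them or double-running the job.
        let job = sql_query(
            "SELECT id FROM background_jobs \
             ORDER BY id LIMIT 1 \
             FOR UPDATE SKIP LOCKED",
        )
        .get_result::<ClaimedJob>(conn)
        .optional()?;

        if let Some(job) = job {
            // ...deserialize `data` and dispatch it to the job's perform
            // function here (see the Registry below)...
            sql_query("DELETE FROM background_jobs WHERE id = $1")
                .bind::<BigInt, _>(job.id)
                .execute(conn)?;
        }
        Ok(())
    })
}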

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The `background_jobs` module is our code for interacting with this "library".
bors committed Mar 9, 2019
2 parents 3283448 + 1ab2c08 commit 7ca518d
Showing 28 changed files with 978 additions and 213 deletions.
10 changes: 10 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -63,6 +63,7 @@ tempdir = "0.3.7"
 parking_lot = "0.7.1"
 jemallocator = { version = "0.1.8", features = ['unprefixed_malloc_on_supported_platforms', 'profiling'] }
 jemalloc-ctl = "0.2.0"
+threadpool = "1.7"

 lettre = {git = "https://github.com/lettre/lettre", version = "0.9"}
 lettre_email = {git = "https://github.com/lettre/lettre", version = "0.9"}
1 change: 1 addition & 0 deletions Procfile
@@ -1,2 +1,3 @@
 web: bin/diesel migration run && bin/start-nginx ./target/release/server
 worker: ./target/release/update-downloads daemon 300
+background_worker: ./target/release/background-worker
1 change: 1 addition & 0 deletions migrations/2018-05-03-150523_create_jobs/down.sql
@@ -0,0 +1 @@
DROP TABLE background_jobs;
8 changes: 8 additions & 0 deletions migrations/2018-05-03-150523_create_jobs/up.sql
@@ -0,0 +1,8 @@
CREATE TABLE background_jobs (
    id BIGSERIAL PRIMARY KEY,
    job_type TEXT NOT NULL,
    data JSONB NOT NULL,
    retries INTEGER NOT NULL DEFAULT 0,
    last_retry TIMESTAMP NOT NULL DEFAULT '1970-01-01',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
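
Given this schema, enqueueing amounts to a single insert, with `retries`, `last_retry`, and `created_at` supplied by the column defaults. A rough sketch of the storage side (assuming Diesel's `serde_json` feature for the `JSONB` bind; the function name is illustrative):

use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::{Jsonb, Text};

/// Serialize a job's data and store it for a worker to pick up later.
fn enqueue_job(
    conn: &PgConnection,
    job_type: &str,
    data: serde_json::Value,
) -> QueryResult<usize> {
    sql_query("INSERT INTO background_jobs (job_type, data) VALUES ($1, $2)")
        .bind::<Text, _>(job_type)
        .bind::<Jsonb, _>(data)
        .execute(conn)
}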
2 changes: 1 addition & 1 deletion script/ci/prune-cache.sh
@@ -7,7 +7,7 @@ du -hs target/debug

 crate_name="cargo-registry"
 test_name="all"
-bin_names="delete-crate delete-version populate render-readmes server test-pagerduty transfer-crates update-downloads"
+bin_names="delete-crate delete-version populate render-readmes server test-pagerduty transfer-crates update-downloads background-worker monitor"

 normalized_crate_name=${crate_name//-/_}
 rm -v target/debug/$normalized_crate_name-*
14 changes: 2 additions & 12 deletions src/app.rs
@@ -1,12 +1,7 @@
 //! Application-wide components in a struct accessible from each request

 use crate::{db, util::CargoResult, Config, Env};
-use std::{
-    env,
-    path::PathBuf,
-    sync::{Arc, Mutex},
-    time::Duration,
-};
+use std::{env, path::PathBuf, sync::Arc, time::Duration};

 use diesel::r2d2;
 use scheduled_thread_pool::ScheduledThreadPool;
@@ -25,10 +20,8 @@ pub struct App {
     /// A unique key used with conduit_cookie to generate cookies
     pub session_key: String,

-    /// The crate index git repository
-    pub git_repo: Mutex<git2::Repository>,
-
     /// The location on disk of the checkout of the crate index git repository
+    /// Only used in the development environment.
     pub git_repo_checkout: PathBuf,

     /// The server configuration
@@ -86,13 +79,10 @@ impl App {
             .connection_customizer(Box::new(db::SetStatementTimeout(db_connection_timeout)))
             .thread_pool(thread_pool);

-        let repo = git2::Repository::open(&config.git_repo_checkout).unwrap();
-
         App {
             diesel_database: db::diesel_pool(&config.db_url, config.env, diesel_db_config),
             github,
             session_key: config.session_key.clone(),
-            git_repo: Mutex::new(repo),
             git_repo_checkout: config.git_repo_checkout.clone(),
             config: config.clone(),
         }
26 changes: 26 additions & 0 deletions src/background/job.rs
@@ -0,0 +1,26 @@
use diesel::PgConnection;
use serde::{de::DeserializeOwned, Serialize};

use super::storage;
use crate::util::CargoResult;

/// A background job, meant to be run asynchronously.
pub trait Job: Serialize + DeserializeOwned {
    /// The environment this job is run with. This is a struct you define,
    /// which should encapsulate things like database connection pools, any
    /// configuration, and any other static data or shared resources.
    type Environment;

    /// The key to use for storing this job, and looking it up later.
    ///
    /// Typically this is the name of your struct in `snake_case`.
    const JOB_TYPE: &'static str;

    /// Enqueue this job to be run at some point in the future.
    fn enqueue(self, conn: &PgConnection) -> CargoResult<()> {
        storage::enqueue_job(conn, self)
    }

    /// The logic involved in actually performing this job.
    fn perform(self, env: &Self::Environment) -> CargoResult<()>;
}
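
As a hypothetical example of implementing this trait (the job name, fields, and environment are illustrative, not the PR's actual `background_jobs` code; `Job` and `CargoResult` are assumed to be in scope):

use serde::{Deserialize, Serialize};

/// Illustrative environment; the real one would hold things like the
/// index checkout path, a database pool, and GitHub credentials.
pub struct Environment {
    pub index_checkout: std::path::PathBuf,
}

#[derive(Serialize, Deserialize)]
struct UpdateIndex {
    crate_name: String,
}

impl Job for UpdateIndex {
    type Environment = Environment;
    const JOB_TYPE: &'static str = "update_index";

    fn perform(self, env: &Self::Environment) -> CargoResult<()> {
        // Commit the new metadata for `self.crate_name` to the checkout
        // at `env.index_checkout` and push it upstream.
        let _checkout = &env.index_checkout;
        Ok(())
    }
}

// Enqueueing from a request handler is then one line:
// UpdateIndex { crate_name: "serde".into() }.enqueue(&conn)?;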
8 changes: 8 additions & 0 deletions src/background/mod.rs
@@ -0,0 +1,8 @@
mod job;
mod registry;
mod runner;
mod storage;

pub use self::job::*;
pub use self::registry::Registry;
pub use self::runner::*;
46 changes: 46 additions & 0 deletions src/background/registry.rs
@@ -0,0 +1,46 @@
#![allow(clippy::new_without_default)] // https://github.com/rust-lang/rust-clippy/issues/3632

use serde_json;
use std::collections::HashMap;
use std::panic::RefUnwindSafe;

use super::Job;
use crate::util::CargoResult;

#[doc(hidden)]
pub type PerformFn<Env> =
    Box<dyn Fn(serde_json::Value, &Env) -> CargoResult<()> + RefUnwindSafe + Send + Sync>;

#[derive(Default)]
#[allow(missing_debug_implementations)] // Can't derive Debug for the boxed functions
/// A registry of background jobs, used to map job types to concrete perform
/// functions at runtime.
pub struct Registry<Env> {
    job_types: HashMap<&'static str, PerformFn<Env>>,
}

impl<Env> Registry<Env> {
    /// Create a new, empty registry
    pub fn new() -> Self {
        Registry {
            job_types: Default::default(),
        }
    }

    /// Get the perform function for a given job type
    pub fn get(&self, job_type: &str) -> Option<&PerformFn<Env>> {
        self.job_types.get(job_type)
    }

    /// Register a new background job. This will override any existing
    /// job with the same `JOB_TYPE`, if one exists.
    pub fn register<T: Job<Environment = Env>>(&mut self) {
        self.job_types.insert(
            T::JOB_TYPE,
            Box::new(|data, env| {
                let data = serde_json::from_value(data)?;
                T::perform(data, env)
            }),
        );
    }
}
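
A sketch of how the worker can tie the registry together, reusing the illustrative `UpdateIndex` job and `Environment` from above: build the registry once at startup, then dispatch each dequeued row's `job_type` and JSONB payload to its typed perform function.

fn build_registry() -> Registry<Environment> {
    let mut registry = Registry::new();
    registry.register::<UpdateIndex>();
    registry
}

fn dispatch(
    registry: &Registry<Environment>,
    env: &Environment,
    job_type: &str,
    data: serde_json::Value,
) -> CargoResult<()> {
    // The boxed closure stored by `register` deserializes the JSON back
    // into the concrete job type before calling its `perform`.
    let perform = registry
        .get(job_type)
        .expect("unknown job type"); // a sketch; real code should surface an error
    perform(data, env)
}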