Auto merge of #1588 - sgrif:sg-background-jobs, r=sgrif
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times, since our multiple web servers can race against each other), we instead queue the update to be run on another machine at some point in the future.

This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call, with extremely low thresholds (it defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us).
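
To make that concrete, here is a minimal sketch of the check such a monitor can perform against the `background_jobs` table introduced below. The env var name and the `page_on_call` stub are assumptions for illustration, not the monitor binary's real interface; the 15-minute default matches the description above.

use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::BigInt;
use std::env;

#[derive(QueryableByName)]
struct Count {
    #[sql_type = "BigInt"]
    count: i64,
}

/// Page if any job has been sitting in the queue longer than the threshold.
fn check_stalled_jobs(conn: &PgConnection) -> QueryResult<()> {
    // `MAX_JOB_AGE_MINUTES` is an illustrative name, not the real env var.
    let threshold: i64 = env::var("MAX_JOB_AGE_MINUTES")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(15);

    let stalled = sql_query(
        "SELECT COUNT(*) AS count FROM background_jobs \
         WHERE created_at < NOW() - $1 * INTERVAL '1 minute'",
    )
    .bind::<BigInt, _>(threshold)
    .get_result::<Count>(conn)?;

    if stalled.count > 0 {
        page_on_call(); // stub; the real monitor triggers a PagerDuty alert
    }
    Ok(())
}

fn page_on_call() {
    eprintln!("background jobs are stalled; paging on-call");
}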

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case. One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There is also some low-hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.
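
To sketch how row locking serves as a queue (a simplified illustration, not the PR's exact queries): each worker claims a row inside a transaction with PostgreSQL's `FOR UPDATE SKIP LOCKED`, so concurrent workers skip already-claimed rows instead of blocking on them or running the same job twice, and a failed job's rollback leaves the row visible for a retry.

use diesel::prelude::*;
use diesel::result::Error;
use diesel::sql_query;
use diesel::sql_types::BigInt;

#[derive(QueryableByName)]
struct ClaimedJob {
    #[sql_type = "BigInt"]
    id: i64,
}

/// Claim one job, run it, and delete it, all in a single transaction.
/// If the worker dies or the job fails, the transaction rolls back and
/// the row becomes visible to other workers again for a retry.
fn run_next_job(conn: &PgConnection) -> Result<(), Error> {
    conn.transaction::<_, Error, _>(|| {
        // SKIP LOCKED means other workers skip rows we have locked,
        // rather than blocking on them or double-running the job.
        let job = sql_query(
            "SELECT id FROM background_jobs \
             ORDER BY id LIMIT 1 \
             FOR UPDATE SKIP LOCKED",
        )
        .get_result::<ClaimedJob>(conn)
        .optional()?;

        if let Some(job) = job {
            // ...deserialize `data` and dispatch it to the job's perform
            // function here (see the Registry below)...
            sql_query("DELETE FROM background_jobs WHERE id = $1")
                .bind::<BigInt, _>(job.id)
                .execute(conn)?;
        }
        Ok(())
    })
}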

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The `background_jobs` module is our code for interacting with this "library".
bors committed Mar 9, 2019
2 parents 3283448 + 1ab2c08 commit 7ca518d
Showing 28 changed files with 978 additions and 213 deletions.
10 changes: 10 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
@@ -63,6 +63,7 @@ tempdir = "0.3.7"
 parking_lot = "0.7.1"
 jemallocator = { version = "0.1.8", features = ['unprefixed_malloc_on_supported_platforms', 'profiling'] }
 jemalloc-ctl = "0.2.0"
+threadpool = "1.7"

 lettre = {git = "https://github.com/lettre/lettre", version = "0.9"}
 lettre_email = {git = "https://github.com/lettre/lettre", version = "0.9"}
1 change: 1 addition & 0 deletions Procfile
@@ -1,2 +1,3 @@
 web: bin/diesel migration run && bin/start-nginx ./target/release/server
 worker: ./target/release/update-downloads daemon 300
+background_worker: ./target/release/background-worker
1 change: 1 addition & 0 deletions migrations/2018-05-03-150523_create_jobs/down.sql
@@ -0,0 +1 @@
DROP TABLE background_jobs;
8 changes: 8 additions & 0 deletions migrations/2018-05-03-150523_create_jobs/up.sql
@@ -0,0 +1,8 @@
CREATE TABLE background_jobs (
    id BIGSERIAL PRIMARY KEY,
    job_type TEXT NOT NULL,
    data JSONB NOT NULL,
    retries INTEGER NOT NULL DEFAULT 0,
    last_retry TIMESTAMP NOT NULL DEFAULT '1970-01-01',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
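
Given this schema, enqueueing amounts to a single insert, with `retries`, `last_retry`, and `created_at` supplied by the column defaults. A rough sketch of the storage side (assuming Diesel's `serde_json` feature for the `JSONB` bind; the function name is illustrative):

use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::{Jsonb, Text};

/// Serialize a job's data and store it for a worker to pick up later.
fn enqueue_job(
    conn: &PgConnection,
    job_type: &str,
    data: serde_json::Value,
) -> QueryResult<usize> {
    sql_query("INSERT INTO background_jobs (job_type, data) VALUES ($1, $2)")
        .bind::<Text, _>(job_type)
        .bind::<Jsonb, _>(data)
        .execute(conn)
}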
2 changes: 1 addition & 1 deletion script/ci/prune-cache.sh
@@ -7,7 +7,7 @@ du -hs target/debug

 crate_name="cargo-registry"
 test_name="all"
-bin_names="delete-crate delete-version populate render-readmes server test-pagerduty transfer-crates update-downloads"
+bin_names="delete-crate delete-version populate render-readmes server test-pagerduty transfer-crates update-downloads background-worker monitor"

 normalized_crate_name=${crate_name//-/_}
 rm -v target/debug/$normalized_crate_name-*
14 changes: 2 additions & 12 deletions src/app.rs
@@ -1,12 +1,7 @@
 //! Application-wide components in a struct accessible from each request

 use crate::{db, util::CargoResult, Config, Env};
-use std::{
-    env,
-    path::PathBuf,
-    sync::{Arc, Mutex},
-    time::Duration,
-};
+use std::{env, path::PathBuf, sync::Arc, time::Duration};

 use diesel::r2d2;
 use scheduled_thread_pool::ScheduledThreadPool;
@@ -25,10 +20,8 @@ pub struct App {
     /// A unique key used with conduit_cookie to generate cookies
     pub session_key: String,

-    /// The crate index git repository
-    pub git_repo: Mutex<git2::Repository>,
-
     /// The location on disk of the checkout of the crate index git repository
+    /// Only used in the development environment.
     pub git_repo_checkout: PathBuf,

     /// The server configuration
@@ -86,13 +79,10 @@ impl App {
             .connection_customizer(Box::new(db::SetStatementTimeout(db_connection_timeout)))
             .thread_pool(thread_pool);

-        let repo = git2::Repository::open(&config.git_repo_checkout).unwrap();
-
         App {
             diesel_database: db::diesel_pool(&config.db_url, config.env, diesel_db_config),
             github,
             session_key: config.session_key.clone(),
-            git_repo: Mutex::new(repo),
             git_repo_checkout: config.git_repo_checkout.clone(),
             config: config.clone(),
         }
26 changes: 26 additions & 0 deletions src/background/job.rs
@@ -0,0 +1,26 @@
use diesel::PgConnection;
use serde::{de::DeserializeOwned, Serialize};

use super::storage;
use crate::util::CargoResult;

/// A background job, meant to be run asynchronously.
pub trait Job: Serialize + DeserializeOwned {
    /// The environment this job is run with. This is a struct you define,
    /// which should encapsulate things like database connection pools, any
    /// configuration, and any other static data or shared resources.
    type Environment;

    /// The key to use for storing this job, and looking it up later.
    ///
    /// Typically this is the name of your struct in `snake_case`.
    const JOB_TYPE: &'static str;

    /// Enqueue this job to be run at some point in the future.
    fn enqueue(self, conn: &PgConnection) -> CargoResult<()> {
        storage::enqueue_job(conn, self)
    }

    /// The logic involved in actually performing this job.
    fn perform(self, env: &Self::Environment) -> CargoResult<()>;
}
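
As a hypothetical example of implementing this trait (the job name, fields, and environment are illustrative, not the PR's actual `background_jobs` code; `Job` and `CargoResult` are assumed to be in scope):

use serde::{Deserialize, Serialize};

/// Illustrative environment; the real one would hold things like the
/// index checkout path, a database pool, and GitHub credentials.
pub struct Environment {
    pub index_checkout: std::path::PathBuf,
}

#[derive(Serialize, Deserialize)]
struct UpdateIndex {
    crate_name: String,
}

impl Job for UpdateIndex {
    type Environment = Environment;
    const JOB_TYPE: &'static str = "update_index";

    fn perform(self, env: &Self::Environment) -> CargoResult<()> {
        // Commit the new metadata for `self.crate_name` to the checkout
        // at `env.index_checkout` and push it upstream.
        let _checkout = &env.index_checkout;
        Ok(())
    }
}

// Enqueueing from a request handler is then one line:
// UpdateIndex { crate_name: "serde".into() }.enqueue(&conn)?;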
8 changes: 8 additions & 0 deletions src/background/mod.rs
@@ -0,0 +1,8 @@
mod job;
mod registry;
mod runner;
mod storage;

pub use self::job::*;
pub use self::registry::Registry;
pub use self::runner::*;
46 changes: 46 additions & 0 deletions src/background/registry.rs
@@ -0,0 +1,46 @@
#![allow(clippy::new_without_default)] // https://github.com/rust-lang/rust-clippy/issues/3632

use serde_json;
use std::collections::HashMap;
use std::panic::RefUnwindSafe;

use super::Job;
use crate::util::CargoResult;

#[doc(hidden)]
pub type PerformFn<Env> =
    Box<dyn Fn(serde_json::Value, &Env) -> CargoResult<()> + RefUnwindSafe + Send + Sync>;

#[derive(Default)]
#[allow(missing_debug_implementations)] // Can't derive Debug for the boxed functions
/// A registry of background jobs, used to map job types to concrete perform
/// functions at runtime.
pub struct Registry<Env> {
    job_types: HashMap<&'static str, PerformFn<Env>>,
}

impl<Env> Registry<Env> {
    /// Create a new, empty registry
    pub fn new() -> Self {
        Registry {
            job_types: Default::default(),
        }
    }

    /// Get the perform function for a given job type
    pub fn get(&self, job_type: &str) -> Option<&PerformFn<Env>> {
        self.job_types.get(job_type)
    }

    /// Register a new background job. This will override any existing
    /// job with the same `JOB_TYPE`, if one exists.
    pub fn register<T: Job<Environment = Env>>(&mut self) {
        self.job_types.insert(
            T::JOB_TYPE,
            Box::new(|data, env| {
                let data = serde_json::from_value(data)?;
                T::perform(data, env)
            }),
        );
    }
}
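
A sketch of how the worker can tie the registry together, reusing the illustrative `UpdateIndex` job and `Environment` from above: build the registry once at startup, then dispatch each dequeued row's `job_type` and JSONB payload to its typed perform function.

fn build_registry() -> Registry<Environment> {
    let mut registry = Registry::new();
    registry.register::<UpdateIndex>();
    registry
}

fn dispatch(
    registry: &Registry<Environment>,
    env: &Environment,
    job_type: &str,
    data: serde_json::Value,
) -> CargoResult<()> {
    // The boxed closure stored by `register` deserializes the JSON back
    // into the concrete job type before calling its `perform`.
    let perform = registry
        .get(job_type)
        .expect("unknown job type"); // a sketch; real code should surface an error
    perform(data, env)
}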