Run import job as Kubernetes CronJob (#44)
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes with everything else. This also cleans up the docs around jobs.

Why do this now? See:
- edgi-govdata-archiving/web-monitoring#168
- edgi-govdata-archiving/web-monitoring-processing#757

Work not visible here:
- Created a new IAM account for jobs that can write to relevant S3 buckets.
- Added ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849) since we have no persistent storage in Kubernetes.
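
For context, everything the new CronJob runs is visible in the manifest below; as a sketch, the equivalent one-off invocation from inside the processing container would look something like this (all flags are taken from `ia-import-job.yaml` in this commit; `--from` is measured in hours):

```sh
# One-off equivalent of the scheduled import, per ia-import-job.yaml below.
# The S3 URL replaces an on-disk cache file, since pods have no persistent storage.
scripts/wm import ia-known-pages \
  --parallel 10 \
  --unplaybackable s3://edgi-wm-db-internal/importer-unplaybackable-cache.json \
  --precheck \
  --from 168   # 168 hours = the last 7 days
```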
Mr0grog authored Feb 17, 2023
1 parent b09db60 commit cd7051d
Showing 5 changed files with 71 additions and 8 deletions.
README.md: 3 changes (1 addition & 2 deletions)

@@ -6,8 +6,7 @@ This repository contains instructions and configuration files for EDGI’s deplo
 
 We currently run all our services in AWS:
 
-- *Services* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
-- *Scheduled jobs* are currently run on manually configured EC2 instances. See the [`manually-managed`](./manually-managed) directory for details.
+- *Services* and *Scheduled Jobs* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
 - We use a handful of AWS services like S3 and RDS. See the [`manually-managed`](./manually-managed) directory for details.
 
 **Incident Reports:** When major problems happen in production, we try and write up incident reports that describe what happened and how the problem was addressed. You can find these in the [`incidents` directory](./incidents).
@@ -27,4 +27,4 @@ spec:
   valueFrom:
     secretKeyRef:
       name: job-secrets
-      key: healthcheck_sentry_dsn
+      key: healthcheck_sentry_dsn
kubernetes/production/ia-import-job.yaml: 60 changes (60 additions & 0 deletions)

@@ -0,0 +1,60 @@
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: ia-import-job
+  namespace: production
+spec:
+  schedule: "30 3 * * *"
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: Never
+          # Kill after 12 hours. Sometimes it gets stuck!
+          activeDeadlineSeconds: 43200
+          # Graceful shutdown can take a while here.
+          terminationGracePeriodSeconds: 600
+          containers:
+            - name: ia-import-job
+              image: envirodgi/processing:latest
+              command: [
+                "scripts/wm",
+                "import",
+                "ia-known-pages",
+                "--parallel", "10",
+                "--unplaybackable", "s3://edgi-wm-db-internal/importer-unplaybackable-cache.json",
+                "--precheck",
+                # The last 7 days
+                "--from", "168"
+              ]
+              imagePullPolicy: Always
+              resources:
+                requests:
+                  memory: "256Mi"
+                  cpu: "100m"
+                limits:
+                  memory: "1024Mi"
+                  cpu: "1500m"
+              env:
+                - name: WEB_MONITORING_DB_EMAIL
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: import_db_email
+                - name: WEB_MONITORING_DB_PASSWORD
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: import_db_password
+                - name: WAYBACK_RATE_LIMIT
+                  value: "10"
+                - name: AWS_ACCESS_KEY_ID
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: aws_access_key_id
+                - name: AWS_SECRET_ACCESS_KEY
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: aws_secret_access_key
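
The `30 3 * * *` schedule fires once a day at 03:30 (in the controller's time zone, typically UTC). As a minimal sketch of how to roll this out and smoke-test it, assuming `kubectl` is pointed at the right cluster (the job name `ia-import-manual` is arbitrary):

```sh
# Apply the manifest and confirm the CronJob is registered.
kubectl apply -f kubernetes/production/ia-import-job.yaml
kubectl get cronjob ia-import-job --namespace production

# Trigger a one-off run from the CronJob's template for testing.
kubectl create job ia-import-manual --from=cronjob/ia-import-job --namespace production
kubectl logs --namespace production job/ia-import-manual --follow
```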
manually-managed/README.md: 11 changes (6 additions & 5 deletions)

@@ -50,16 +50,17 @@ To protect the production instance of the API from abuse, we point the DNS
 - It uses a `per-ip-rate-limit` rule to block IP addresses requesting over a certain rate.
 
 
-## ETL
+## 📦 Deprecated Services
 
-We currently run scheduled scripts for extracting data from external services (Versionista, the Wayback Machine) and sending it to [web-monitoring-db][-db] to be imported. These are managed via `cron` on a single EC2 VM.
+⚠️ These services used to be managed manually, but have either been shut down or moved to a different, automated approach. The documentation here is for historical reference.
 
-For details, see [`etl-server/README.md`](./etl-server/README.md).
+### ETL
 
-## 📦 Deprecated Services
+**These are now all Kubernetes `CronJob` resources.** We used to run scheduled scripts for extracting data from external services (Versionista, the Wayback Machine) and sending it to [web-monitoring-db][-db] to be imported via `cron` on a single EC2 VM.
+
+For details, see [`etl-server/README.md`](./etl-server/README.md).
 
-⚠️ These services used to be managed manually, but have either been shut down or moved to a different, automated approach. The documentation here is for historical reference.
 
 ### IA Archiver
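
Since the ETL scripts are now `CronJob` resources instead of crontab entries, checking on them happens through `kubectl` as well; a brief sketch, assuming access to the `production` namespace:

```sh
# List the scheduled jobs and the runs they have spawned.
kubectl get cronjobs --namespace production
kubectl get jobs --namespace production

# Read the logs of a specific run (substitute a real job name).
kubectl logs --namespace production job/<job-name>
```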
manually-managed/etl-server/README.md: 3 changes (3 additions & 0 deletions)

@@ -1,3 +1,6 @@
+**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.
+
+
 # ETL Server
 
 This machine runs ETL (Extract, Transform, and Load) scripts to pull page & version data out of other services (like the Internet Archive or Versionista) and import it into a [web-monitoring-db][] instance. The core code for most of that lives in other web-monitoring-* repositories; this server just uses cron and some very simple bash scripts to execute them and save logs.
