Run import job as a Kubernetes CronJob
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes alongside everything else. This also cleans up the docs around jobs.

Work not visible here: created a new IAM account for jobs that can write to the relevant S3 buckets, and added the ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849), since we have no persistent storage in Kubernetes.
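A quick way to sanity-check that setup (a sketch: it assumes the `job-secrets` Secret and `production` namespace used by the manifest below, and that `jq` is installed):

```sh
# List which keys the job-secrets Secret carries (e.g. the new AWS credentials)
# without decoding or printing any of the secret values.
kubectl get secret job-secrets -n production -o json | jq '.data | keys'
```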

Why do this now? See:
- edgi-govdata-archiving/web-monitoring#168
- edgi-govdata-archiving/web-monitoring-processing#757
Mr0grog committed Feb 16, 2023
1 parent 624716a commit 6aef1d5
Showing 7 changed files with 83 additions and 13 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -6,8 +6,7 @@ This repository contains instructions and configuration files for EDGI’s deplo

We currently run all our services in AWS:

-- *Services* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
-- *Scheduled jobs* are currently run on manually configured EC2 instances. See the [`manually-managed`](./manually-managed) directory for details.
+- *Services* and *Scheduled Jobs* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
- We use a handful of AWS services like S3 and RDS. See the [`manually-managed`](./manually-managed) directory for details.

**Incident Reports:** When major problems happen in production, we try and write up incident reports that describe what happened and how the problem was addressed. You can find these in the [`incidents` directory](./incidents).
@@ -62,7 +61,7 @@ This is an open-source project, and works because of contributors like you! See

## License & Copyright

-Copyright (C) 2017-2019 Environmental Data and Governance Initiative (EDGI)
+Copyright (C) 2017-2023 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

@@ -27,4 +27,4 @@ spec:
        valueFrom:
          secretKeyRef:
            name: job-secrets
-            key: healthcheck_sentry_dsn
\ No newline at end of file
+            key: healthcheck_sentry_dsn
60 changes: 60 additions & 0 deletions kubernetes/production/ia-import-job.yaml
@@ -0,0 +1,60 @@
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ia-import-job
  namespace: production
spec:
  # FIXME: schedule *should* be "30 3 * * *". Current schedule is a test.
  # schedule: "30 3 * * *"
  schedule: "20 18 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # Kill after 12 hours. Sometimes it gets stuck!
          activeDeadlineSeconds: 43200
          containers:
            - name: ia-import-job
              image: envirodgi/processing:latest
              command: [
                "scripts/wm",
                "import",
                "ia-known-pages",
                "--parallel", "10",
                "--unplaybackable", "s3://edgi-wm-db-internal/importer-unplaybackable-cache.json",
                "--precheck",
                # The last 7 days
                "--from", "168"
              ]
              imagePullPolicy: Always
              resources:
                requests:
                  memory: "256Mi"
                  cpu: "100m"
                limits:
                  memory: "1024Mi"
                  cpu: "1500m"
              env:
                - name: WEB_MONITORING_DB_EMAIL
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: import_db_email
                - name: WEB_MONITORING_DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: import_db_password
                - name: WAYBACK_RATE_LIMIT
                  value: "10"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: aws_access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: aws_secret_access_key
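Since the manifest above still carries the test schedule, here is a sketch of how one might exercise the job on demand instead of waiting for the timer (the Job name `ia-import-manual-test` is made up for illustration; assumes a `kubectl` context pointed at this cluster):

```sh
# Apply the manifest, then run the CronJob once, immediately, as a regular Job.
kubectl apply -f kubernetes/production/ia-import-job.yaml
kubectl create job --from=cronjob/ia-import-job ia-import-manual-test -n production

# Follow the logs, then clean up the test Job when done.
kubectl logs -f job/ia-import-manual-test -n production
kubectl delete job ia-import-manual-test -n production
```

Once test runs look good, the intended schedule `30 3 * * *` fires daily at 03:30 in the cluster's time zone (typically UTC).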
20 changes: 11 additions & 9 deletions manually-managed/README.md
@@ -26,23 +26,25 @@ TBD
For details, see [`rds/README.md`](./rds/README.md).


-## ETL
+## 📦 Deprecated Services

-We currently run scheduled scripts for extracting data from external services (Versionista, the Wayback Machine) and sending it to [web-monitoring-db][-db] to be imported. These are managed via `cron` on a single EC2 VM.
+⚠️ These services used to be managed manually, but have either been shut down or moved to a different, automated approach. The documentation here is for historical reference.

-For details, see [`etl-server/README.md`](./etl-server/README.md).

+### ETL

-## IA Archiver
+**These are now all Kubernetes `CronJob` resources.** We used to run scheduled scripts, via `cron` on a single EC2 VM, that extracted data from external services (Versionista, the Wayback Machine) and sent it to [web-monitoring-db][-db] to be imported. For details, see [`etl-server/README.md`](./etl-server/README.md).

-We have an EC2 VM named `ia-archiver` that pushes lists of URLs to the Internet Archive’s “Save Page Now” feature on a regular basis. More information about this is in [`ia-archiver`](./ia-archiver). It’s mainly just an implementation of [wayback-spn-client].

+### IA Archiver

+**We no longer do this.** We had an EC2 VM named `ia-archiver` that pushed lists of URLs to the Internet Archive’s “Save Page Now” feature on a regular basis. More information about this is in [`ia-archiver`](./ia-archiver). It was mainly just an implementation of [wayback-spn-client].


[-db]: https://github.com/edgi-govdata-archiving/web-monitoring-db
[wayback-spn-client]: https://github.com/Mr0grog/wayback-spn-client

-## Metrics Server
-
-We use Elasticserach and its Kibana front-end for metrics collection and
-visualization. See the metrics-server directory in this repository for
-provisioning and configuration details.
+### Metrics Server
+
+**We no longer maintain a metrics service.** We used Elasticsearch and its Kibana front-end for metrics collection and visualization. See the metrics-server directory in this repository for provisioning and configuration details.
3 changes: 3 additions & 0 deletions manually-managed/etl-server/README.md
@@ -1,3 +1,6 @@
**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.


# ETL Server

This machine runs ETL (Extract, Transform, and Load) scripts to pull page & version data out of other services (like the Internet Archive or Versionista) and import it into a [web-monitoring-db][] instance. The core code for most of that lives in other web-monitoring-* repositories; this server just uses cron and some very simple bash scripts to execute them and save logs.
3 changes: 3 additions & 0 deletions manually-managed/ia-archiver/README.md
@@ -1,3 +1,6 @@
**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.


# IA Archiver

This machine is dedicated to pushing URLs we wish to monitor into the Internet Archive on a regular basis. It works by automating the archive’s “Save Page Now” (SPN) feature via [wayback-spn-client](https://github.com/Mr0grog/wayback-spn-client).
3 changes: 3 additions & 0 deletions manually-managed/metrics-server/README.md
@@ -1,3 +1,6 @@
**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.


# Provisioning and Configuring the Metrics Server

## Overview
