Run import job as Kubernetes CronJob (#44)
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes with everything else. This also cleans up the docs around jobs.

Why do this now? See:
- edgi-govdata-archiving/web-monitoring#168
- edgi-govdata-archiving/web-monitoring-processing#757

Work not visible here:
- Created a new IAM account for jobs that can write to relevant S3 buckets.
- Added ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849) since we have no persistent storage in Kubernetes.
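
For context, everything the new CronJob runs is visible in the manifest below; as a sketch, the equivalent one-off invocation from inside the processing container would look something like this (all flags are taken from `ia-import-job.yaml` in this commit; `--from` is measured in hours):

```sh
# One-off equivalent of the scheduled import, per ia-import-job.yaml below.
# The S3 URL replaces an on-disk cache file, since pods have no persistent storage.
scripts/wm import ia-known-pages \
  --parallel 10 \
  --unplaybackable s3://edgi-wm-db-internal/importer-unplaybackable-cache.json \
  --precheck \
  --from 168   # 168 hours = the last 7 days
```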
Mr0grog authored Feb 17, 2023
1 parent b09db60 commit cd7051d
Showing 5 changed files with 71 additions and 8 deletions.
README.md: 3 changes (1 addition & 2 deletions)

@@ -6,8 +6,7 @@ This repository contains instructions and configuration files for EDGI’s deplo
 
 We currently run all our services in AWS:
 
-- *Services* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
-- *Scheduled jobs* are currently run on manually configured EC2 instances. See the [`manually-managed`](./manually-managed) directory for details.
+- *Services* and *Scheduled Jobs* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
 - We use a handful of AWS services like S3 and RDS. See the [`manually-managed`](./manually-managed) directory for details.
 
 **Incident Reports:** When major problems happen in production, we try and write up incident reports that describe what happened and how the problem was addressed. You can find these in the [`incidents` directory](./incidents).
@@ -27,4 +27,4 @@ spec:
   valueFrom:
     secretKeyRef:
       name: job-secrets
-      key: healthcheck_sentry_dsn
+      key: healthcheck_sentry_dsn
kubernetes/production/ia-import-job.yaml: 60 changes (60 additions & 0 deletions)

@@ -0,0 +1,60 @@
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: ia-import-job
+  namespace: production
+spec:
+  schedule: "30 3 * * *"
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: Never
+          # Kill after 12 hours. Sometimes it gets stuck!
+          activeDeadlineSeconds: 43200
+          # Graceful shutdown can take a while here.
+          terminationGracePeriodSeconds: 600
+          containers:
+            - name: ia-import-job
+              image: envirodgi/processing:latest
+              command: [
+                "scripts/wm",
+                "import",
+                "ia-known-pages",
+                "--parallel", "10",
+                "--unplaybackable", "s3://edgi-wm-db-internal/importer-unplaybackable-cache.json",
+                "--precheck",
+                # The last 7 days
+                "--from", "168"
+              ]
+              imagePullPolicy: Always
+              resources:
+                requests:
+                  memory: "256Mi"
+                  cpu: "100m"
+                limits:
+                  memory: "1024Mi"
+                  cpu: "1500m"
+              env:
+                - name: WEB_MONITORING_DB_EMAIL
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: import_db_email
+                - name: WEB_MONITORING_DB_PASSWORD
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: import_db_password
+                - name: WAYBACK_RATE_LIMIT
+                  value: "10"
+                - name: AWS_ACCESS_KEY_ID
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: aws_access_key_id
+                - name: AWS_SECRET_ACCESS_KEY
+                  valueFrom:
+                    secretKeyRef:
+                      name: job-secrets
+                      key: aws_secret_access_key
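
The `30 3 * * *` schedule fires once a day at 03:30 (in the controller's time zone, typically UTC). As a minimal sketch of how to roll this out and smoke-test it, assuming `kubectl` is pointed at the right cluster (the job name `ia-import-manual` is arbitrary):

```sh
# Apply the manifest and confirm the CronJob is registered.
kubectl apply -f kubernetes/production/ia-import-job.yaml
kubectl get cronjob ia-import-job --namespace production

# Trigger a one-off run from the CronJob's template for testing.
kubectl create job ia-import-manual --from=cronjob/ia-import-job --namespace production
kubectl logs --namespace production job/ia-import-manual --follow
```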
manually-managed/README.md: 11 changes (6 additions & 5 deletions)

@@ -50,16 +50,17 @@ To protect the production instance of the API from abuse, we point the DNS
 - It uses a `per-ip-rate-limit` rule to block IP addresses requesting over a certain rate.
 
 
-## ETL
+## 📦 Deprecated Services
 
-We currently run scheduled scripts for extracting data from external services (Versionista, the Wayback Machine) and sending it to [web-monitoring-db][-db] to be imported. These are managed via `cron` on a single EC2 VM.
+⚠️ These services used to be managed manually, but have either been shut down or moved to a different, automated approach. The documentation here is for historical reference.
 
-For details, see [`etl-server/README.md`](./etl-server/README.md).
+### ETL
 
-## 📦 Deprecated Services
+**These are now all Kubernetes `CronJob` resources.** We used to run scheduled scripts for extracting data from external services (Versionista, the Wayback Machine) and sending it to [web-monitoring-db][-db] to be imported via `cron` on a single EC2 VM.
+
+For details, see [`etl-server/README.md`](./etl-server/README.md).
 
-⚠️ These services used to be managed manually, but have either been shut down or moved to a different, automated approach. The documentation here is for historical reference.
 
 ### IA Archiver
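
Since the ETL scripts are now `CronJob` resources instead of crontab entries, checking on them happens through `kubectl` as well; a brief sketch, assuming access to the `production` namespace:

```sh
# List the scheduled jobs and the runs they have spawned.
kubectl get cronjobs --namespace production
kubectl get jobs --namespace production

# Read the logs of a specific run (substitute a real job name).
kubectl logs --namespace production job/<job-name>
```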
manually-managed/etl-server/README.md: 3 changes (3 additions & 0 deletions)

@@ -1,3 +1,6 @@
+**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.
+
+
 # ETL Server
 
 This machine runs ETL (Extract, Transform, and Load) scripts to pull page & version data out of other services (like the Internet Archive or Versionista) and import it into a [web-monitoring-db][] instance. The core code for most of that lives in other web-monitoring-* repositories; this server just uses cron and some very simple bash scripts to execute them and save logs.
