Run import job as a Kubernetes CronJob
Instead of running the import job as a cron script on a random EC2 VM, run it as an actual CronJob in Kubernetes alongside everything else. This also cleans up the docs around jobs.

Work not visible here: created a new IAM account for jobs that can write to the relevant S3 buckets, and added the ability to store cache files in S3 (edgi-govdata-archiving/web-monitoring-processing#849), since we have no persistent storage in Kubernetes.
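A quick way to sanity-check that setup (a sketch: it assumes the `job-secrets` Secret and `production` namespace used by the manifest below, and that `jq` is installed):

```sh
# List which keys the job-secrets Secret carries (e.g. the new AWS credentials)
# without decoding or printing any of the secret values.
kubectl get secret job-secrets -n production -o json | jq '.data | keys'
```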

Why do this now? See:
- edgi-govdata-archiving/web-monitoring#168
- edgi-govdata-archiving/web-monitoring-processing#757
Mr0grog committed Feb 16, 2023
1 parent 624716a commit 6aef1d5
Showing 7 changed files with 83 additions and 13 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -6,8 +6,7 @@ This repository contains instructions and configuration files for EDGI’s deplo

We currently run all our services in AWS:

-- *Services* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
-- *Scheduled jobs* are currently run on manually configured EC2 instances. See the [`manually-managed`](./manually-managed) directory for details.
+- *Services* and *Scheduled Jobs* are managed by [Kubernetes](https://kubernetes.io/). See the [`kubernetes`](./kubernetes) directory for details.
- We use a handful of AWS services like S3 and RDS. See the [`manually-managed`](./manually-managed) directory for details.

**Incident Reports:** When major problems happen in production, we try and write up incident reports that describe what happened and how the problem was addressed. You can find these in the [`incidents` directory](./incidents).
@@ -62,7 +61,7 @@ This is an open-source project, and works because of contributors like you! See

## License & Copyright

-Copyright (C) 2017-2019 Environmental Data and Governance Initiative (EDGI)
+Copyright (C) 2017-2023 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

@@ -27,4 +27,4 @@ spec:
        valueFrom:
          secretKeyRef:
            name: job-secrets
-            key: healthcheck_sentry_dsn
\ No newline at end of file
+            key: healthcheck_sentry_dsn
60 changes: 60 additions & 0 deletions kubernetes/production/ia-import-job.yaml
@@ -0,0 +1,60 @@
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ia-import-job
  namespace: production
spec:
  # FIXME: schedule *should* be "30 3 * * *". Current schedule is a test.
  # schedule: "30 3 * * *"
  schedule: "20 18 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # Kill after 12 hours. Sometimes it gets stuck!
          activeDeadlineSeconds: 43200
          containers:
            - name: ia-import-job
              image: envirodgi/processing:latest
              command: [
                "scripts/wm",
                "import",
                "ia-known-pages",
                "--parallel", "10",
                "--unplaybackable", "s3://edgi-wm-db-internal/importer-unplaybackable-cache.json",
                "--precheck",
                # The last 7 days
                "--from", "168"
              ]
              imagePullPolicy: Always
              resources:
                requests:
                  memory: "256Mi"
                  cpu: "100m"
                limits:
                  memory: "1024Mi"
                  cpu: "1500m"
              env:
                - name: WEB_MONITORING_DB_EMAIL
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: import_db_email
                - name: WEB_MONITORING_DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: import_db_password
                - name: WAYBACK_RATE_LIMIT
                  value: "10"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: aws_access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: job-secrets
                      key: aws_secret_access_key
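Since the manifest above still carries the test schedule, here is a sketch of how one might exercise the job on demand instead of waiting for the timer (the Job name `ia-import-manual-test` is made up for illustration; assumes a `kubectl` context pointed at this cluster):

```sh
# Apply the manifest, then run the CronJob once, immediately, as a regular Job.
kubectl apply -f kubernetes/production/ia-import-job.yaml
kubectl create job --from=cronjob/ia-import-job ia-import-manual-test -n production

# Follow the logs, then clean up the test Job when done.
kubectl logs -f job/ia-import-manual-test -n production
kubectl delete job ia-import-manual-test -n production
```

Once test runs look good, the intended schedule `30 3 * * *` fires daily at 03:30 in the cluster's time zone (typically UTC).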
20 changes: 11 additions & 9 deletions manually-managed/README.md
@@ -26,23 +26,25 @@ TBD
For details, see [`rds/README.md`](./rds/README.md).


-## ETL
+## 📦 Deprecated Services

-We currently run scheduled scripts for extracting data from external services (Versionista, the Wayback Machine) and sending it to [web-monitoring-db][-db] to be imported. These are managed via `cron` on a single EC2 VM.
+⚠️ These services used to be managed manually, but have either been shut down or moved to a different, automated approach. The documentation here is for historical reference.

-For details, see [`etl-server/README.md`](./etl-server/README.md).

+### ETL

-## IA Archiver
+**These are now all Kubernetes `CronJob` resources.** We used to run scheduled scripts, via `cron` on a single EC2 VM, that extracted data from external services (Versionista, the Wayback Machine) and sent it to [web-monitoring-db][-db] to be imported. For details, see [`etl-server/README.md`](./etl-server/README.md).

-We have an EC2 VM named `ia-archiver` that pushes lists of URLs to the Internet Archive’s “Save Page Now” feature on a regular basis. More information about this is in [`ia-archiver`](./ia-archiver). It’s mainly just an implementation of [wayback-spn-client].

+### IA Archiver

+**We no longer do this.** We had an EC2 VM named `ia-archiver` that pushed lists of URLs to the Internet Archive’s “Save Page Now” feature on a regular basis. More information about this is in [`ia-archiver`](./ia-archiver). It was mainly just an implementation of [wayback-spn-client].


[-db]: https://github.com/edgi-govdata-archiving/web-monitoring-db
[wayback-spn-client]: https://github.com/Mr0grog/wayback-spn-client

-## Metrics Server
-
-We use Elasticserach and its Kibana front-end for metrics collection and
-visualization. See the metrics-server directory in this repository for
-provisioning and configuration details.
+### Metrics Server
+
+**We no longer maintain a metrics service.** We used Elasticsearch and its Kibana front-end for metrics collection and visualization. See the metrics-server directory in this repository for provisioning and configuration details.
3 changes: 3 additions & 0 deletions manually-managed/etl-server/README.md
@@ -1,3 +1,6 @@
**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.


# ETL Server

This machine runs ETL (Extract, Transform, and Load) scripts to pull page & version data out of other services (like the Internet Archive or Versionista) and import it into a [web-monitoring-db][] instance. The core code for most of that lives in other web-monitoring-* repositories; this server just uses cron and some very simple bash scripts to execute them and save logs.
3 changes: 3 additions & 0 deletions manually-managed/ia-archiver/README.md
@@ -1,3 +1,6 @@
**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.


# IA Archiver

This machine is dedicated to pushing URLs we wish to monitor into the Internet Archive on a regular basis. It works by automating the archive’s “Save Page Now” (SPN) feature via [wayback-spn-client](https://github.com/Mr0grog/wayback-spn-client).
3 changes: 3 additions & 0 deletions manually-managed/metrics-server/README.md
@@ -1,3 +1,6 @@
**⚠️ This server is no longer used! ⚠️** This documentation is for historical reference.


# Provisioning and Configuring the Metrics Server

## Overview
