Migrate all mongodb-awesome-backup repo code/actions to this repo (#3030)

* Initial DB backup GH actions setup

* Change Dockerfile path

* Move Github actions to the parent folder

* Change Docker bin path

* Change Docker base path

* Add / to the Docker path

* Refactor and reuse AIID env variables

* Test empty MongoDB URI

* Remove extra whitespaces

* Rename `CLOUDFLARE_R2_BUCKET` to `CLOUDFLARE_R2_BUCKET_NAME`

* Remove MONGODB_HOST environment variable from backup scripts

* Remove MONGODB_USERNAME environment variable from backup scripts

* Remove MONGODB_PASSWORD environment variable from backup scripts

* Remove unnecessary MongoDB environment variables from backup scripts

* Test removing "Configure AWS credentials" step

* Remove AWS env variables

* Remove all AWS related code

* Apply changes to private backup

* Rename `docker-db-backup` folder to `db-backup`

* Remove mongodb-clients package dependency

* Add `mongodb-database-tools` dependency

* Change backup.sh path

* Install `boto3`

* Remove Docker completely and unnecessary env variables

* Improve logs

* Remove unused files

* Delete all S3 related functions

* Remove GitHub action "on push"

* Change the script file name and remove S3 references

* Test removing boto3 dependency

* rollback boto3 deletion

* Export all classifications (without any query filter)

* Remove Private backups

* Update README file

* Move all code to `tools` folder

* Remove classification query filter from mongodump command

* Remove boto.sh script

* Export all CSET classifications to CSV

* Change CSET CSV file name

* Update classifications_cset.csv file name

* Remove unnecessary `workflow_dispatch` inputs

* Move `db-backup` folder to `site` folder

* Update db-backup workflow to include environment input
pdcp1 authored Sep 5, 2024
1 parent 64e1275 commit 4046eb3
Showing 7 changed files with 442 additions and 0 deletions.
57 changes: 57 additions & 0 deletions .github/workflows/db-backup.yml
@@ -0,0 +1,57 @@
name: Public backup to the cloud

on:
  schedule:
    - cron: "0 10 * * 1" # At 10:00 on Monday.
  workflow_dispatch:
    inputs:
      environment:
        description: The GitHub environment to load secrets from
        type: string
        required: true

defaults:
  run:
    shell: bash

jobs:
  build-and-run-backups:
    # If the execution is triggered by a schedule, the environment is production
    environment: ${{ inputs.environment || 'production' }}
    name: Backup
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            coreutils \
            bash \
            tzdata \
            python3-pip \
            curl \
            npm
          wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | gpg --dearmor | sudo tee /usr/share/keyrings/mongodb.gpg > /dev/null
          echo "deb [ arch=amd64 signed-by=/usr/share/keyrings/mongodb.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
          sudo apt update
          sudo apt install -y mongodb-database-tools

      - name: Install boto3
        run: pip install boto3

      - name: Generate public backup
        run: |
          ./bin/backup.sh
          ./bin/prune.sh
          ./bin/list.sh
        working-directory: site/db-backup
        env:
          CLOUDFLARE_R2_ACCOUNT_ID: ${{ vars.CLOUDFLARE_R2_ACCOUNT_ID }}
          CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID }}
          CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY }}
          CLOUDFLARE_R2_BUCKET_NAME: ${{ vars.CLOUDFLARE_R2_BUCKET_NAME }}
          MONGODB_URI: ${{ secrets.MONGODB_CONNECTION_STRING }}
33 changes: 33 additions & 0 deletions site/db-backup/README.md
@@ -0,0 +1,33 @@
This is a quick port of the forked project to support JSON and CSV backups of the [AIID](https://incidentdatabase.ai/).

The complete state of the database will be backed up on a weekly basis in both JSON and CSV form. The backups can be downloaded from [here](https://incidentdatabase.ai/research/snapshots/).

Requirements
------------

- Cloudflare R2 Access Key ID/Secret Access Key with write access to the target Cloudflare R2 bucket.
- MongoDB credentials with read access to the target database.

Usage
-----

The GitHub Action "Public backup to the cloud" [/.github/workflows/db-backup.yml](/.github/workflows/db-backup.yml) runs the backup script at 10:00 UTC every Monday.

After a run completes, `backup-YYYYMMDDhhmmss.tar.bz2` is placed in the Cloudflare R2 bucket.
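
The workflow also accepts manual runs. A hedged example using the GitHub CLI (the `environment` value names whichever GitHub environment holds your secrets; the timestamp below is illustrative):

```bash
# Trigger the backup on demand against the production environment
gh workflow run db-backup.yml -f environment=production

# Unpack a downloaded snapshot
tar xjf backup-20240905100000.tar.bz2
```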


Required environment variables
---------

| Variable | Description |
| --------------------- | ------------------------------------------------------------------------------ |
| CLOUDFLARE_R2_ACCOUNT_ID | Cloudflare R2 account ID |
| CLOUDFLARE_R2_BUCKET_NAME | Cloudflare R2 public bucket name (e.g. "aiid-public") |

Required environment secrets
---------

| Secret | Description |
| --------------------- | ------------------------------------------------------------------------------ |
| CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID | Cloudflare R2 Access Key ID with write permission |
| CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY | Cloudflare R2 Secret Access Key with write permission |
| MONGODB_CONNECTION_STRING | mongodb+srv://[username]:[password]@aiiddev.[CLUSTER].mongodb.net |
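
For a local dry run, a minimal sketch (same variable names the workflow exports; all values below are placeholders):

```bash
export CLOUDFLARE_R2_ACCOUNT_ID="<account-id>"
export CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID="<access-key-id>"
export CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY="<secret-access-key>"
export CLOUDFLARE_R2_BUCKET_NAME="aiid-public"
export MONGODB_URI="mongodb+srv://<username>:<password>@aiiddev.<cluster>.mongodb.net"

cd site/db-backup
./bin/backup.sh
```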
100 changes: 100 additions & 0 deletions site/db-backup/bin/backup.sh
@@ -0,0 +1,100 @@
#!/bin/bash -e

echo "--------------------------------------"
echo "Starting backup.sh script execution..."
echo "--------------------------------------"

# settings
BACKUPFILE_PREFIX="backup"
CLOUDFLARE_R2_ACCOUNT_ID=${CLOUDFLARE_R2_ACCOUNT_ID}
MONGODB_DBNAME="aiidprod"
MONGODB_DBNAME_TRANSLATIONS="translations"

# start script
CWD=$(/usr/bin/dirname $0)
cd $CWD

. ./functions.sh
NOW=$(create_current_yyyymmddhhmmss)

echo "=== $0 started at $(/bin/date "+%Y/%m/%d %H:%M:%S") ==="

TMPDIR="/tmp"
TARGET_DIRNAME="mongodump_full_snapshot"
TARGET="${TMPDIR}/${TARGET_DIRNAME}"
TAR_CMD="/bin/tar"
TAR_OPTS="jcvf"

DIRNAME=$(/usr/bin/dirname ${TARGET})
BASENAME=$(/usr/bin/basename ${TARGET})
TARBALL="${BACKUPFILE_PREFIX}-${NOW}.tar.bz2"
TARBALL_FULLPATH="${TMPDIR}/${TARBALL}"

# check parameters
if [ -z "${CLOUDFLARE_R2_ACCOUNT_ID}" ]; then
  echo "ERROR: CLOUDFLARE_R2_ACCOUNT_ID must be specified." 1>&2
  exit 1
fi
if [ -z "${CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID}" ]; then
  echo "ERROR: If the CLOUDFLARE_R2_ACCOUNT_ID environment variable is defined, you must also define CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID." 1>&2
  exit 1
fi
if [ -z "${CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY}" ]; then
  echo "ERROR: If the CLOUDFLARE_R2_ACCOUNT_ID environment variable is defined, you must also define CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY." 1>&2
  exit 1
fi
if [ -z "${CLOUDFLARE_R2_BUCKET_NAME}" ]; then
  echo "ERROR: If the CLOUDFLARE_R2_ACCOUNT_ID environment variable is defined, you must also define CLOUDFLARE_R2_BUCKET_NAME." 1>&2
  exit 1
fi

echo "Dump MongoDB 'aiidprod' database..."
mongodump -o ${TARGET} --uri=${MONGODB_URI}/${MONGODB_DBNAME}

echo "Dump MongoDB 'translations' database..."
mongodump -o ${TARGET} --uri=${MONGODB_URI}/${MONGODB_DBNAME_TRANSLATIONS}

echo "Export collections as CSV files..."
mongoexport -o ${TARGET}/incidents.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=incidents --fields=_id,incident_id,date,reports,Alleged\ deployer\ of\ AI\ system,Alleged\ developer\ of\ AI\ system,Alleged\ harmed\ or\ nearly\ harmed\ parties,description,title
mongoexport -o ${TARGET}/duplicates.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=duplicates --fields=duplicate_incident_number,true_incident_number
mongoexport -o ${TARGET}/quickadd.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=quickadd --fields=incident_id,url,date_submitted,source_domain
mongoexport -o ${TARGET}/submissions.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=submissions --fields=authors,date_downloaded,date_modified,date_published,date_submitted,image_url,incident_date,incident_id,language,mongodb_id,source_domain,submitters,text,title,url
mongoexport -o ${TARGET}/reports.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=reports --fields=_id,incident_id,authors,date_downloaded,date_modified,date_published,date_submitted,description,epoch_date_downloaded,epoch_date_modified,epoch_date_published,epoch_date_submitted,image_url,language,ref_number,report_number,source_domain,submitters,text,title,url,tags

# Taxa CSV Export

# Get the field names
mongoexport -o classifications_cset_headers.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --query='{ "namespace": {"$regex": "^CSET" }}' --collection=classifications --noHeaderLine --fields='attributes.0.short_name,attributes.1.short_name,attributes.2.short_name,attributes.3.short_name,attributes.4.short_name,attributes.5.short_name,attributes.6.short_name,attributes.7.short_name,attributes.8.short_name,attributes.9.short_name,attributes.10.short_name,attributes.11.short_name,attributes.12.short_name,attributes.13.short_name,attributes.14.short_name,attributes.15.short_name,attributes.16.short_name,attributes.17.short_name,attributes.18.short_name,attributes.19.short_name,attributes.20.short_name,attributes.21.short_name,attributes.22.short_name,attributes.23.short_name,attributes.24.short_name,attributes.25.short_name,attributes.26.short_name,attributes.27.short_name,attributes.28.short_name,attributes.29.short_name,attributes.30.short_name,attributes.31.short_name'

# Get the values
mongoexport -o classifications_cset_values.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --query='{ "namespace": {"$regex": "^CSET" }}' --collection=classifications --noHeaderLine --fields='_id,incident_id,namespace,publish,attributes.0.value_json,attributes.1.value_json,attributes.2.value_json,attributes.3.value_json,attributes.4.value_json,attributes.5.value_json,attributes.6.value_json,attributes.7.value_json,attributes.8.value_json,attributes.9.value_json,attributes.10.value_json,attributes.11.value_json,attributes.12.value_json,attributes.13.value_json,attributes.14.value_json,attributes.15.value_json,attributes.16.value_json,attributes.17.value_json,attributes.18.value_json,attributes.19.value_json,attributes.20.value_json,attributes.21.value_json,attributes.22.value_json,attributes.23.value_json,attributes.24.value_json,attributes.25.value_json,attributes.26.value_json,attributes.27.value_json,attributes.28.value_json,attributes.29.value_json,attributes.30.value_json,attributes.31.value_json'

# Construct the header
echo -n "_id,incident_id,namespace,publish," >tmp.csv
head -n 1 classifications_cset_headers.csv >tmp_header.csv
cat tmp.csv tmp_header.csv >header.csv

# Concat the header and the values to the output
cat header.csv classifications_cset_values.csv >${TARGET}/classifications_cset.csv

# Cleanup
rm tmp.csv
rm tmp_header.csv
rm header.csv
rm classifications_cset_headers.csv
rm classifications_cset_values.csv

echo "Report contents are subject to their own intellectual property rights. Unless otherwise noted, the database is shared under (CC BY-SA 4.0). See: https://creativecommons.org/licenses/by-sa/4.0/" >${TARGET}/license.txt

# run tar command
echo "Start backup ${TARGET} into ${CLOUDFLARE_R2_BUCKET_NAME} ..."
time ${TAR_CMD} ${TAR_OPTS} ${TARBALL_FULLPATH} -C ${DIRNAME} ${BASENAME}

# upload tarball to Cloudflare R2
r2_copy_file ${CLOUDFLARE_R2_ACCOUNT_ID} ${CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID} ${CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY} ${CLOUDFLARE_R2_BUCKET_NAME} ${TARBALL_FULLPATH} ${TARBALL}

# call healthchecks url for successful backup
if [ "x${HEALTHCHECKS_URL}" != "x" ]; then
curl -fsS --retry 3 ${HEALTHCHECKS_URL} >/dev/null
fi
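
backup.sh sources `./functions.sh`, which is not shown in this diff excerpt. A hypothetical sketch of the two helpers it relies on, inferred from how they are called (not the committed code):

# functions.sh (hypothetical sketch)
create_current_yyyymmddhhmmss() {
  # Timestamp embedded in the tarball name, e.g. 20240905100000
  /bin/date "+%Y%m%d%H%M%S"
}

r2_copy_file() {
  # Args: account_id access_key secret_key bucket_name file_path object_key
  # Presumably delegates the upload to cloudflare_operations.py (shown below);
  # backup.sh has already cd'ed into its own bin/ directory.
  python3 ./cloudflare_operations.py \
    --operation upload \
    --account_id "$1" \
    --access_key "$2" \
    --secret_key "$3" \
    --bucket_name "$4" \
    --file_path "$5" \
    --object_key "$6"
}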
117 changes: 117 additions & 0 deletions site/db-backup/bin/cloudflare_operations.py
@@ -0,0 +1,117 @@
#!/usr/bin/env python3

import sys

import argparse
import boto3


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Simple client for uploading, deleting, listing, and checking objects in Cloudflare R2 buckets."
    )

    parser.add_argument(
        "--operation",
        choices=["list", "upload", "delete", "check_exists"],
        required=True,
        help="Operation to perform on the bucket.",
    )

    # Arguments that are always required.
    parser.add_argument("--account_id", required=True, help="Cloudflare account ID")
    parser.add_argument(
        "--access_key", required=True, help="Cloudflare R2 bucket access key"
    )
    parser.add_argument(
        "--secret_key", required=True, help="Cloudflare R2 bucket secret key"
    )
    parser.add_argument(
        "--bucket_name", required=True, help="Cloudflare R2 bucket name"
    )

    parser.add_argument(
        "--file_path",
        required=False,
        help="Path to the file to be uploaded or deleted.",
    )
    parser.add_argument(
        "--object_key",
        required=False,
        help="Key under which the object should be stored in the bucket.",
    )

    args = parser.parse_args()

    # Arguments required for only some operations.
    if args.operation == "upload":
        if args.file_path is None:
            parser.error("--operation=upload requires --file_path.")

    if args.operation in ["upload", "delete", "check_exists"]:
        if args.object_key is None:
            parser.error(
                "--operation={delete,upload,check_exists} requires --object_key."
            )

    return args


def create_cloudflare_client(account_id, access_key, secret_key, region="auto"):
    endpoint_url = f"https://{account_id}.r2.cloudflarestorage.com"
    cloudflare_client = boto3.client(
        service_name="s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
    return cloudflare_client


def main(args):
    cloudflare_client = create_cloudflare_client(
        args.account_id, args.access_key, args.secret_key
    )

    if args.operation == "list":
        response = cloudflare_client.list_objects_v2(Bucket=args.bucket_name)

        if "Contents" in response:
            for obj in response["Contents"]:
                print(obj["Key"], "size:", obj["Size"])

    elif args.operation == "upload":
        cloudflare_client.upload_file(
            args.file_path,
            args.bucket_name,
            args.object_key,
            ExtraArgs={"ContentType": "application/x-bzip2"},
        )
        print("-----------------------------")
        print(
            f"Successfully uploaded file {args.file_path} (key: {args.object_key}) to bucket {args.bucket_name}"
        )
        print("-----------------------------")

    elif args.operation == "delete":
        cloudflare_client.delete_object(Bucket=args.bucket_name, Key=args.object_key)
        print("-----------------------------")
        print(
            f"Successfully deleted file {args.object_key} from bucket {args.bucket_name}"
        )
        print("-----------------------------")

    elif args.operation == "check_exists":
        # Raises an error (non-zero exit) if the object doesn't exist; succeeds silently otherwise.
        cloudflare_client.get_object(Bucket=args.bucket_name, Key=args.object_key)

    else:
        raise NotImplementedError

    sys.exit()


if __name__ == "__main__":
    args = parse_arguments()
    main(args)
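
Invoked directly, the same script covers ad-hoc bucket maintenance. For example, listing the bucket's contents, with credentials supplied through the environment variables used elsewhere in this commit:

python3 site/db-backup/bin/cloudflare_operations.py \
  --operation list \
  --account_id "$CLOUDFLARE_R2_ACCOUNT_ID" \
  --access_key "$CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID" \
  --secret_key "$CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY" \
  --bucket_name "$CLOUDFLARE_R2_BUCKET_NAME"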