Migrate all mongodb-awesome-backup repo code/actions to this repo (#3030)

* Initial DB backup GH actions setup
* Change Dockerfile path
* Move Github actions to the parent folder
* Change Docker bin path
* Change Docker base path
* Add / to the Docker path
* Refactor and reuse AIID env variables
* Test empty MongoDB URI
* Remove extra whitespaces
* Rename `CLOUDFLARE_R2_BUCKET` to `CLOUDFLARE_R2_BUCKET_NAME`
* Remove MONGODB_HOST environment variable from backup scripts
* Remove MONGODB_USERNAME environment variable from backup scripts
* Remove MONGODB_PASSWORD environment variable from backup scripts
* Remove unnecessary MongoDB environment variables from backup scripts
* Test removing "Configure AWS credentials" step
* Remove AWS env variables
* Remove all AWS related code
* Apply changes to private backup
* Rename `docker-db-backup` folder to `db-backup`
* Remove mongodb-clients package dependency
* Add `mongodb-database-tools` dependency
* Change backup.sh path
* Install `boto3`
* Remove Docker completely and unnecessary env variables
* Improve logs
* Remove unused files
* Delete all S3 related functions
* Remove GitHub action "on push"
* Change the script file name and remove S3 references
* Test removing boto3 dependency
* Rollback boto3 deletion
* Export all classifications (without any query filter)
* Remove Private backups
* Update README file
* Move all code to `tools` folder
* Remove classification query filter from mongodump command
* Remove boto.sh script
* Export all CSET classifications to CSV
* Change CSET CSV file name
* Update classifications_cset.csv file name
* Remove unnecessary `workflow_dispatch` inputs
* Move `db-backup` folder to `site` folder
* Update db-backup workflow to include environment input
Showing 7 changed files with 442 additions and 0 deletions.
@@ -0,0 +1,57 @@
name: Public backup to the cloud

on:
  schedule:
    - cron: "0 10 * * 1" # At 10:00 on Monday.
  workflow_dispatch:
    inputs:
      environment:
        description: The GitHub environment to load secrets from
        type: string
        required: true

defaults:
  run:
    shell: bash

jobs:
  build-and-run-backups:
    # If the execution is triggered by a schedule, the environment is production
    environment: ${{ inputs.environment || 'production' }}
    name: Backup
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            coreutils \
            bash \
            tzdata \
            python3-pip \
            curl \
            npm
          wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | gpg --dearmor | sudo tee /usr/share/keyrings/mongodb.gpg > /dev/null
          echo "deb [ arch=amd64 signed-by=/usr/share/keyrings/mongodb.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
          sudo apt update
          sudo apt install -y mongodb-database-tools

      - name: Install boto3
        run: pip install boto3

      - name: Generate public backup
        run: |
          ./bin/backup.sh
          ./bin/prune.sh
          ./bin/list.sh
        working-directory: site/db-backup
        env:
          CLOUDFLARE_R2_ACCOUNT_ID: ${{ vars.CLOUDFLARE_R2_ACCOUNT_ID }}
          CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID: ${{ secrets.CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID }}
          CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY: ${{ secrets.CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY }}
          CLOUDFLARE_R2_BUCKET_NAME: ${{ vars.CLOUDFLARE_R2_BUCKET_NAME }}
          MONGODB_URI: ${{ secrets.MONGODB_CONNECTION_STRING }}
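Besides the weekly cron schedule, the workflow exposes a `workflow_dispatch` trigger whose `environment` input selects the GitHub environment to load secrets and variables from; scheduled runs send no input, so the `inputs.environment || 'production'` expression falls back to production. A minimal sketch of a manual trigger with the GitHub CLI follows (not part of this commit; it assumes the workflow file is saved as `.github/workflows/db-backup.yml`, as the README below states, and that an environment named "staging" exists in the repository):

# Manually trigger a backup against a hypothetical "staging" environment.
gh workflow run db-backup.yml -f environment=staging

# Scheduled Monday runs omit the input, so the job uses the 'production' environment.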
@@ -0,0 +1,33 @@
This is a quick port of the forked project to support JSON and CSV backups of the [AIID](https://incidentdatabase.ai/).

The complete state of the database is backed up weekly in both JSON and CSV form. The backups can be downloaded from [here](https://incidentdatabase.ai/research/snapshots/).

Requirements
------------

- Cloudflare R2 Access Key ID/Secret Access Key pair with access rights to the target Cloudflare R2 bucket.
- MongoDB credentials with read access to the target database.

Usage
-----

The GitHub Action "Public backup to the cloud" [/.github/workflows/db-backup.yml](/.github/workflows/db-backup.yml) runs the backup script at 10:00 (UTC) every Monday.

After each run, `backup-YYYYMMdd.tar.bz2` is placed in the Cloudflare R2 bucket.

Required environment variables
------------------------------

| Variable                  | Description                                            |
| ------------------------- | ------------------------------------------------------ |
| CLOUDFLARE_R2_ACCOUNT_ID  | Cloudflare R2 account ID                               |
| CLOUDFLARE_R2_BUCKET_NAME | Cloudflare R2 public bucket name (e.g. "aiid-public")  |

Required environment secrets
----------------------------

| Secret                                | Description                                                        |
| ------------------------------------- | ------------------------------------------------------------------ |
| CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID     | Cloudflare R2 Access Key ID with write permission                  |
| CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY | Cloudflare R2 Secret Access Key with write permission              |
| MONGODB_CONNECTION_STRING             | mongodb+srv://[username]:[password]@aiiddev.[CLUSTER].mongodb.net  |
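The same scripts can be exercised outside GitHub Actions by exporting the variables above and invoking them from `site/db-backup`, as the workflow does; note that the `MONGODB_CONNECTION_STRING` secret is exposed to the scripts as `MONGODB_URI`. A local sketch follows (all values are placeholders; `prune.sh` and `list.sh` are the companion scripts the workflow calls but are not shown in this diff):

export CLOUDFLARE_R2_ACCOUNT_ID="<account-id>"
export CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID="<access-key-id>"
export CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY="<secret-access-key>"
export CLOUDFLARE_R2_BUCKET_NAME="aiid-public"
export MONGODB_URI="mongodb+srv://<username>:<password>@aiiddev.<cluster>.mongodb.net"

cd site/db-backup
./bin/backup.sh && ./bin/prune.sh && ./bin/list.sh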
@@ -0,0 +1,100 @@
#!/bin/bash -e

echo "--------------------------------------"
echo "Starting backup.sh script execution..."
echo "--------------------------------------"

# settings
BACKUPFILE_PREFIX="backup"
CLOUDFLARE_R2_ACCOUNT_ID=${CLOUDFLARE_R2_ACCOUNT_ID}
MONGODB_DBNAME="aiidprod"
MONGODB_DBNAME_TRANSLATIONS="translations"

# start script
CWD=$(/usr/bin/dirname $0)
cd $CWD

. ./functions.sh
NOW=$(create_current_yyyymmddhhmmss)

echo "=== $0 started at $(/bin/date "+%Y/%m/%d %H:%M:%S") ==="

TMPDIR="/tmp"
TARGET_DIRNAME="mongodump_full_snapshot"
TARGET="${TMPDIR}/${TARGET_DIRNAME}"
TAR_CMD="/bin/tar"
TAR_OPTS="jcvf"

DIRNAME=$(/usr/bin/dirname ${TARGET})
BASENAME=$(/usr/bin/basename ${TARGET})
TARBALL="${BACKUPFILE_PREFIX}-${NOW}.tar.bz2"
TARBALL_FULLPATH="${TMPDIR}/${TARBALL}"

# check parameters
if [ "x${CLOUDFLARE_R2_ACCOUNT_ID}" == "x" ]; then
  echo "ERROR: CLOUDFLARE_R2_ACCOUNT_ID must be specified." 1>&2
  exit 1
fi
if [ -z "${CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID}" ]; then
  echo "ERROR: If CLOUDFLARE_R2_ACCOUNT_ID environment variable is defined, you have to define CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID as well" 1>&2
  exit 1
fi
if [ -z "${CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY}" ]; then
  echo "ERROR: If CLOUDFLARE_R2_ACCOUNT_ID environment variable is defined, you have to define CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY as well" 1>&2
  exit 1
fi
if [ -z "${CLOUDFLARE_R2_BUCKET_NAME}" ]; then
  echo "ERROR: If CLOUDFLARE_R2_ACCOUNT_ID environment variable is defined, you have to define CLOUDFLARE_R2_BUCKET_NAME as well" 1>&2
  exit 1
fi

echo "Dump MongoDB 'aiidprod' database..."
mongodump -o ${TARGET} --uri=${MONGODB_URI}/${MONGODB_DBNAME}

echo "Dump MongoDB 'translations' database..."
mongodump -o ${TARGET} --uri=${MONGODB_URI}/${MONGODB_DBNAME_TRANSLATIONS}

echo "Export collections as CSV files..."
mongoexport -o ${TARGET}/incidents.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=incidents --fields=_id,incident_id,date,reports,Alleged\ deployer\ of\ AI\ system,Alleged\ developer\ of\ AI\ system,Alleged\ harmed\ or\ nearly\ harmed\ parties,description,title
mongoexport -o ${TARGET}/duplicates.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=duplicates --fields=duplicate_incident_number,true_incident_number
mongoexport -o ${TARGET}/quickadd.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=quickadd --fields=incident_id,url,date_submitted,source_domain
mongoexport -o ${TARGET}/submissions.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=submissions --fields=authors,date_downloaded,date_modified,date_published,date_submitted,image_url,incident_date,incident_id,language,mongodb_id,source_domain,submitters,text,title,url
mongoexport -o ${TARGET}/reports.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --collection=reports --fields=_id,incident_id,authors,date_downloaded,date_modified,date_published,date_submitted,description,epoch_date_downloaded,epoch_date_modified,epoch_date_published,epoch_date_submitted,image_url,language,ref_number,report_number,source_domain,submitters,text,title,url,tags

# Taxa CSV Export

# Get the field names
mongoexport -o classifications_cset_headers.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --query='{ "namespace": {"$regex": "^CSET" }}' --collection=classifications --noHeaderLine --fields='attributes.0.short_name,attributes.1.short_name,attributes.2.short_name,attributes.3.short_name,attributes.4.short_name,attributes.5.short_name,attributes.6.short_name,attributes.7.short_name,attributes.8.short_name,attributes.9.short_name,attributes.10.short_name,attributes.11.short_name,attributes.12.short_name,attributes.13.short_name,attributes.14.short_name,attributes.15.short_name,attributes.16.short_name,attributes.17.short_name,attributes.18.short_name,attributes.19.short_name,attributes.20.short_name,attributes.21.short_name,attributes.22.short_name,attributes.23.short_name,attributes.24.short_name,attributes.25.short_name,attributes.26.short_name,attributes.27.short_name,attributes.28.short_name,attributes.29.short_name,attributes.30.short_name,attributes.31.short_name'

# Get the values
mongoexport -o classifications_cset_values.csv --uri=${MONGODB_URI}/${MONGODB_DBNAME} -v --type=csv --query='{ "namespace": {"$regex": "^CSET" }}' --collection=classifications --noHeaderLine --fields='_id,incident_id,namespace,publish,attributes.0.value_json,attributes.1.value_json,attributes.2.value_json,attributes.3.value_json,attributes.4.value_json,attributes.5.value_json,attributes.6.value_json,attributes.7.value_json,attributes.8.value_json,attributes.9.value_json,attributes.10.value_json,attributes.11.value_json,attributes.12.value_json,attributes.13.value_json,attributes.14.value_json,attributes.15.value_json,attributes.16.value_json,attributes.17.value_json,attributes.18.value_json,attributes.19.value_json,attributes.20.value_json,attributes.21.value_json,attributes.22.value_json,attributes.23.value_json,attributes.24.value_json,attributes.25.value_json,attributes.26.value_json,attributes.27.value_json,attributes.28.value_json,attributes.29.value_json,attributes.30.value_json,attributes.31.value_json'

# Construct the header
echo -n "_id,incident_id,namespace,publish," >tmp.csv
head -n 1 classifications_cset_headers.csv >tmp_header.csv
cat tmp.csv tmp_header.csv >header.csv

# Concat the header and the values to the output
cat header.csv classifications_cset_values.csv >${TARGET}/classifications_cset.csv

# Cleanup
rm tmp.csv
rm tmp_header.csv
rm header.csv
rm classifications_cset_headers.csv
rm classifications_cset_values.csv

echo "Report contents are subject to their own intellectual property rights. Unless otherwise noted, the database is shared under (CC BY-SA 4.0). See: https://creativecommons.org/licenses/by-sa/4.0/" >${TARGET}/license.txt

# run tar command
echo "Start backup ${TARGET} into ${CLOUDFLARE_R2_BUCKET_NAME} ..."
time ${TAR_CMD} ${TAR_OPTS} ${TARBALL_FULLPATH} -C ${DIRNAME} ${BASENAME}

# upload tarball to Cloudflare R2
r2_copy_file ${CLOUDFLARE_R2_ACCOUNT_ID} ${CLOUDFLARE_R2_WRITE_ACCESS_KEY_ID} ${CLOUDFLARE_R2_WRITE_SECRET_ACCESS_KEY} ${CLOUDFLARE_R2_BUCKET_NAME} ${TARBALL_FULLPATH} ${TARBALL}

# call healthchecks url for successful backup
if [ "x${HEALTHCHECKS_URL}" != "x" ]; then
  curl -fsS --retry 3 ${HEALTHCHECKS_URL} >/dev/null
fi
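backup.sh sources `./functions.sh` for `create_current_yyyymmddhhmmss` and `r2_copy_file`, a file that is not rendered on this page. A hedged sketch of what those helpers might look like follows, assuming the Python R2 client added in this commit sits next to the scripts under a hypothetical name `r2_client.py` (the actual file name is not visible here):

# Sketch of functions.sh (assumed; not shown in this diff).
create_current_yyyymmddhhmmss() {
  /bin/date "+%Y%m%d%H%M%S"
}

r2_copy_file() {
  # Args: account_id access_key secret_key bucket_name local_file object_key
  local account_id=$1 access_key=$2 secret_key=$3 bucket=$4 file_path=$5 object_key=$6
  python3 ./r2_client.py \
    --operation upload \
    --account_id "${account_id}" \
    --access_key "${access_key}" \
    --secret_key "${secret_key}" \
    --bucket_name "${bucket}" \
    --file_path "${file_path}" \
    --object_key "${object_key}"
}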
@@ -0,0 +1,117 @@
#!/usr/bin/env python3

import argparse
import sys

import boto3


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Simple client for uploading, deleting, listing, and checking objects in Cloudflare R2 buckets."
    )

    parser.add_argument(
        "--operation",
        choices=["list", "upload", "delete", "check_exists"],
        required=True,
        help="Operation to perform on the bucket.",
    )

    # Arguments that are always required.
    parser.add_argument("--account_id", required=True, help="Cloudflare account ID")
    parser.add_argument(
        "--access_key", required=True, help="Cloudflare R2 bucket access key"
    )
    parser.add_argument(
        "--secret_key", required=True, help="Cloudflare R2 bucket secret key"
    )
    parser.add_argument(
        "--bucket_name", required=True, help="Cloudflare R2 bucket name"
    )

    parser.add_argument(
        "--file_path",
        required=False,
        help="Path to the file to be uploaded or deleted.",
    )
    parser.add_argument(
        "--object_key",
        required=False,
        help="Key under which the object should be stored in the bucket.",
    )

    args = parser.parse_args()

    # Arguments required for only some operations.
    if args.operation == "upload":
        if args.file_path is None:
            parser.error("--operation={upload} requires --file_path.")

    if args.operation in ["upload", "delete", "check_exists"]:
        if args.object_key is None:
            parser.error(
                "--operation={delete,upload,check_exists} requires --object_key."
            )

    return args


def create_cloudflare_client(account_id, access_key, secret_key, region="auto"):
    endpoint_url = f"https://{account_id}.r2.cloudflarestorage.com"
    cloudflare_client = boto3.client(
        service_name="s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
    return cloudflare_client


def main(args):
    cloudflare_client = create_cloudflare_client(
        args.account_id, args.access_key, args.secret_key
    )

    if args.operation == "list":
        response = cloudflare_client.list_objects_v2(Bucket=args.bucket_name)

        if "Contents" in response:
            for obj in response["Contents"]:
                print(obj["Key"], "size:", obj["Size"])

    elif args.operation == "upload":
        cloudflare_client.upload_file(
            args.file_path,
            args.bucket_name,
            args.object_key,
            ExtraArgs={"ContentType": "application/x-bzip2"},
        )
        print("-----------------------------")
        print(
            f"Successfully uploaded file {args.file_path} (key: {args.object_key}) to bucket {args.bucket_name}"
        )
        print("-----------------------------")

    elif args.operation == "delete":
        cloudflare_client.delete_object(Bucket=args.bucket_name, Key=args.object_key)
        print("-----------------------------")
        print(
            f"Successfully deleted file {args.object_key} from bucket {args.bucket_name}"
        )
        print("-----------------------------")

    elif args.operation == "check_exists":
        # Raises error/non-zero exit if object doesn't exist. Otherwise success, raises nothing.
        cloudflare_client.get_object(Bucket=args.bucket_name, Key=args.object_key)

    else:
        raise NotImplementedError

    sys.exit()


if __name__ == "__main__":
    args = parse_arguments()
    main(args)
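Invoking the client directly looks roughly like the following (the file name `r2_client.py` is assumed, since the diff viewer does not show it, and all credential values and the tarball name are placeholders):

# Upload a tarball, then confirm the key exists (check_exists exits non-zero if it is missing).
python3 r2_client.py --operation upload \
  --account_id "<account-id>" --access_key "<access-key-id>" --secret_key "<secret-access-key>" \
  --bucket_name "aiid-public" \
  --file_path /tmp/backup-20240101000000.tar.bz2 --object_key backup-20240101000000.tar.bz2

python3 r2_client.py --operation check_exists \
  --account_id "<account-id>" --access_key "<access-key-id>" --secret_key "<secret-access-key>" \
  --bucket_name "aiid-public" --object_key backup-20240101000000.tar.bz2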