Merge pull request #72 from bird-house/sync-raven-testdata-to-thredds
Sync Raven testdata to Thredds for Raven tutorial notebooks.

Leveraging the cron daemon of the scheduler component, sync Raven testdata to Thredds for Raven tutorial notebooks.

Activation of the pre-configured cronjob is via `env.local` as usual for infra-as-code.
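
A minimal sketch of that activation, assuming a stand-in path (the real file lives under `components/scheduler/` in the repo checkout):

```shell
# Sketch of activating the pre-configured cronjob from env.local.
# The path below is an illustrative stand-in for
# <repo>/birdhouse/components/scheduler/deploy_raven_testdata_to_thredds.env.
JOB_ENV="/tmp/deploy_raven_testdata_to_thredds.env"
echo 'DEPLOY_RAVEN_TESTDATA_SCHEDULE="*/30 * * * *"' > "$JOB_ENV"

# Guarded sourcing: older checkouts of the repo without this .env file
# still work, the job is simply not scheduled.
if [ -f "$JOB_ENV" ]; then
    . "$JOB_ENV"
fi
```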

New generic `deploy-data` script can clone any number of git repos and sync any number of folders from each repo to any number of local folders, with the ability to cherry-pick just the few files needed (Raven testdata has many types of files; we only need to sync `.nc` files to Thredds, to avoid polluting Thredds storage `/data/datasets/testdata/raven`).

Limitation of the first version of this `deploy-data` script:
* Does not handle re-organizing the file layout; this is a pure sync with only very limited rsync filtering for now (the tutorial notebooks deploy from multiple repos and need the file layout re-organized)

So the script has room to grow.  I see it as a generic solution to the recurring problem of "take files from various git repos and deploy them somewhere automatically".  If we need to deploy another repo, just write a new config file instead of writing boilerplate code again.
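
For instance, deploying from another repo could be as small as a new config file; the repo URL and paths below are hypothetical, not part of this PR:

```shell
# Hypothetical deploy-data config for a new repo; the URL and paths are
# illustrative only.
cat > /tmp/deploy-my-data.yml <<'EOF'
deploy:
  - repo_url: https://github.com/example-org/example-data
    checkout_name: example-data
    dir_maps:
      # keep only NetCDF files, as for the Raven testdata sync
      - source_dir: data
        dest_dir: /tmp/deploy-data-test/example
        rsync_extra_opts: --include=*/ --include=*.nc --exclude=*
EOF
# The script would then be invoked as: ./deploy-data /tmp/deploy-my-data.yml
```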

Minor unrelated changes in this PR:
* README update to reference the new birdhouse-deploy-ouranos.
* Make sourcing the various pre-configured cronjobs backward-compatible with older versions of the repo where those cronjobs did not exist yet.
tlvu authored Oct 15, 2020
2 parents 0789a13 + 60b817f commit 5ba68a0
Showing 7 changed files with 332 additions and 2 deletions.
5 changes: 4 additions & 1 deletion birdhouse/README.md
@@ -25,7 +25,10 @@ for your organization. For an example of possible override, see how the [emu
service](optional-components/emu/docker-compose-extra.yml)
([README](optional-components/README.md)) can be optionally added to the
deployment via the [override
mechanism](https://docs.docker.com/compose/extends/).
mechanism](https://docs.docker.com/compose/extends/). Ouranos specific
override can be found in this
[birdhouse-deploy-ouranos](https://github.com/bird-house/birdhouse-deploy-ouranos)
repo.

The automatic deployment is able to handle multiple repos, so will trigger if
this repo or your private-personalized-config repo changes, giving you
@@ -0,0 +1,73 @@
##############################################################################
# Configuration vars, set in env.local before sourcing this file.
# This job assumes the "scheduler" component is enabled.
##############################################################################

# Cronjob schedule to trigger deployment attempt.
if [ -z "$DEPLOY_RAVEN_TESTDATA_SCHEDULE" ]; then
DEPLOY_RAVEN_TESTDATA_SCHEDULE="*/30 * * * *" # UTC
fi

# Location for local cache of git clone to save bandwidth and time from always
# re-cloning from scratch.
if [ -z "$DEPLOY_RAVEN_TESTDATA_CHECKOUT_CACHE" ]; then
DEPLOY_RAVEN_TESTDATA_CHECKOUT_CACHE="/data/deploy_data_cache/deploy_raven_testdata_to_thredds"
fi

# Location of deploy-data config file.
# Provide a different config file to sync to a different location or include
# more files in the sync.
if [ -z "$DEPLOY_RAVEN_TESTDATA_CONFIG" ]; then
DEPLOY_RAVEN_TESTDATA_CONFIG="${COMPOSE_DIR}/deployment/deploy-data-raven-testdata-to-thredds.yml"
fi

# Log file location. Default location under /var/log/PAVICS/ has built-in logrotate.
if [ -z "$DEPLOY_RAVEN_TESTDATA_LOGFILE" ]; then
DEPLOY_RAVEN_TESTDATA_LOGFILE="/var/log/PAVICS/deploy_raven_testdata_to_thredds.log"
fi

# Location of ssh private key for git clone over ssh, useful for private repos.
# Raven does not need this since the Raven repo is public, so cloning is over https.
# This is here in case a custom config file is supplied with additional repos.
#DEPLOY_RAVEN_TESTDATA_GIT_SSH_IDENTITY_FILE="/path/to/id_rsa"
#DEPLOY_RAVEN_TESTDATA_GIT_SSH_IDENTITY_FILE=/home/vagrant/.ssh/id_rsa_git_ssh_read_only

##############################################################################
# End configuration vars
##############################################################################


if [ -z "`echo "$AUTODEPLOY_EXTRA_SCHEDULER_JOBS" | grep deploy_raven_testdata_to_thredds`" ]; then

# Add the job only if not already added (the config is read twice during
# the autodeploy process).

LOGFILE_DIRNAME="`dirname "$DEPLOY_RAVEN_TESTDATA_LOGFILE"`"

EXTRA_DOCKER_ARGS=""
if [ -n "$DEPLOY_RAVEN_TESTDATA_GIT_SSH_IDENTITY_FILE" ]; then
EXTRA_DOCKER_ARGS="
--volume ${DEPLOY_RAVEN_TESTDATA_GIT_SSH_IDENTITY_FILE}:${DEPLOY_RAVEN_TESTDATA_GIT_SSH_IDENTITY_FILE}:ro
--env DEPLOY_DATA_GIT_SSH_IDENTITY_FILE=${DEPLOY_RAVEN_TESTDATA_GIT_SSH_IDENTITY_FILE}"
fi

export AUTODEPLOY_EXTRA_SCHEDULER_JOBS="
$AUTODEPLOY_EXTRA_SCHEDULER_JOBS

- name: deploy_raven_testdata_to_thredds
comment: Auto-deploy Raven testdata to Thredds for Raven tutorial notebooks.
schedule: '$DEPLOY_RAVEN_TESTDATA_SCHEDULE'
command: '/deploy-data ${DEPLOY_RAVEN_TESTDATA_CONFIG}'
dockerargs: >-
--rm --name deploy_raven_testdata_to_thredds
--volume /var/run/docker.sock:/var/run/docker.sock:ro
--volume ${COMPOSE_DIR}/deployment/deploy-data:/deploy-data:ro
--volume ${DEPLOY_RAVEN_TESTDATA_CONFIG}:${DEPLOY_RAVEN_TESTDATA_CONFIG}:ro
--volume ${DEPLOY_RAVEN_TESTDATA_CHECKOUT_CACHE}:${DEPLOY_RAVEN_TESTDATA_CHECKOUT_CACHE}:rw
--volume ${LOGFILE_DIRNAME}:${LOGFILE_DIRNAME}:rw
--env DEPLOY_DATA_CHECKOUT_CACHE=${DEPLOY_RAVEN_TESTDATA_CHECKOUT_CACHE}
--env DEPLOY_DATA_LOGFILE=${DEPLOY_RAVEN_TESTDATA_LOGFILE} ${EXTRA_DOCKER_ARGS}
image: 'docker:19.03.6-git'
"

fi
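
The guard at the top of the job file above can be sketched in isolation: the job is appended to `AUTODEPLOY_EXTRA_SCHEDULER_JOBS` only if its name is not already present, because the config is sourced twice during the autodeploy process. Names here are illustrative.

```shell
# Idempotency guard pattern: appending the same job twice is a no-op.
JOBS="some_existing_job"
add_job() {
    if [ -z "`echo "$JOBS" | grep my_new_job`" ]; then
        JOBS="$JOBS my_new_job"
    fi
}
add_job
add_job  # second sourcing: no duplicate is added
```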
164 changes: 164 additions & 0 deletions birdhouse/deployment/deploy-data
@@ -0,0 +1,164 @@
#!/bin/sh
# Deploy data from git repo(s) to local folder(s).
#
# See sample input config in deploy-data.config.sample.yml for how to specify
# which git repo(s), which git branch for each repo, which sub-folder(s) to
# sync to which local folder(s) and rsync extra options for each sub-folder.
#
# The git repo clones are cached for faster subsequent runs and rsync is used
# to only modify files that actually changed, to keep the file tree in sync and
# to have include/exclude filter rules. All these options are not available if
# using regular 'cp'.
#
# A Docker image is used for yq (a yaml file parser) and rsync, so this script has
# very few install dependencies (only docker and git need to be installed locally)
# and can run inside a very minimalistic image (the 'docker' Docker image).
#
# Setting environment variable DEPLOY_DATA_LOGFILE='/path/to/logfile.log'
# will redirect all STDOUT and STDERR to that logfile so this script will be
# completely silent.
#
# Setting environment variable DEPLOY_DATA_GIT_SSH_IDENTITY_FILE='/path/to/id_rsa'
# will allow git clone over ssh, useful for private repos.
#
# Other self-explanatory environment variables: DEPLOY_DATA_CHECKOUT_CACHE,
# DEPLOY_DATA_YQ_IMAGE, DEPLOY_DATA_RSYNC_IMAGE.
#

if [ ! -z "$DEPLOY_DATA_LOGFILE" ]; then
exec >>$DEPLOY_DATA_LOGFILE 2>&1
fi


cleanup_on_exit() {
set +x
echo "
datadeploy finished START_TIME=$START_TIME
datadeploy finished END_TIME=`date -Isecond`"
}

trap cleanup_on_exit EXIT


if [ -z "$DEPLOY_DATA_CHECKOUT_CACHE" ]; then
DEPLOY_DATA_CHECKOUT_CACHE="/tmp/deploy-data-clone-cache"
fi

if [ -z "$DEPLOY_DATA_YQ_IMAGE" ]; then
DEPLOY_DATA_YQ_IMAGE="mikefarah/yq:3.3.4"
fi

if [ -z "$DEPLOY_DATA_RSYNC_IMAGE" ]; then
DEPLOY_DATA_RSYNC_IMAGE="eeacms/rsync:2.3"
fi

CONFIG_YML="$1"
if [ -z "$CONFIG_YML" ]; then
echo "ERROR: missing config.yml file" 1>&2
exit 2
else
shift
# Docker volume mount requires absolute path.
CONFIG_YML="`realpath "$CONFIG_YML"`"
fi


yq() {
docker run --rm --name deploy_data_yq -v $CONFIG_YML:$CONFIG_YML:ro $DEPLOY_DATA_YQ_IMAGE yq "$@"
}

# An empty value could mean a typo in the keys of the config file.
ensure_not_empty() {
if [ -z "$*" ]; then
echo "ERROR: value empty" 1>&2
exit 1
fi
}


START_TIME="`date -Isecond`"
echo "==========
datadeploy START_TIME=$START_TIME"

set -x

CHECKOUT_CACHE="`yq r -p v $CONFIG_YML config.checkout_cache`"
if [ -z "$CHECKOUT_CACHE" ]; then
CHECKOUT_CACHE="$DEPLOY_DATA_CHECKOUT_CACHE"
fi

GIT_SSH_IDENTITY_FILE="`yq r -p v $CONFIG_YML config.git_ssh_identity_file`"
if [ -z "$GIT_SSH_IDENTITY_FILE" ]; then
GIT_SSH_IDENTITY_FILE="$DEPLOY_DATA_GIT_SSH_IDENTITY_FILE"
fi

if [ ! -z "$GIT_SSH_IDENTITY_FILE" ]; then
export GIT_SSH_COMMAND="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o IdentityFile=$GIT_SSH_IDENTITY_FILE"
fi

GIT_REPO_URLS="`yq r -p v $CONFIG_YML deploy\[*\].repo_url`"
ensure_not_empty "$GIT_REPO_URLS"
REPO_NUM=0

for GIT_REPO_URL in $GIT_REPO_URLS; do

GIT_BRANCH="`yq r -p v $CONFIG_YML --defaultValue origin/master deploy\[$REPO_NUM\].branch`"
ensure_not_empty "$GIT_BRANCH"
GIT_CHECKOUT_NAME="`yq r -p v $CONFIG_YML deploy\[$REPO_NUM\].checkout_name`"
ensure_not_empty "$GIT_CHECKOUT_NAME"

CLONE_DEST="$CHECKOUT_CACHE/$GIT_CHECKOUT_NAME"
if [ ! -d "$CLONE_DEST" ]; then
echo "checkout repo '$GIT_REPO_URL' on branch '$GIT_BRANCH' to '$CLONE_DEST'"
git clone $GIT_REPO_URL $CLONE_DEST || exit 1
cd $CLONE_DEST
git checkout $GIT_BRANCH
else
echo "refresh repo '$CLONE_DEST' on branch '$GIT_BRANCH'"
cd $CLONE_DEST
git remote -v # log the remote, should match GIT_REPO_URL
git clean -fdx # force, recurse into dirs, also remove .gitignored and untracked files
git fetch --prune --all || exit 1
git checkout --force $GIT_BRANCH # force checkout to throw away local changes
fi

SRC_DIRS="`yq r -p v $CONFIG_YML deploy\[$REPO_NUM\].dir_maps\[*\].source_dir`"
ensure_not_empty "$SRC_DIRS"
DIR_NUM=0

for SRC_DIR in $SRC_DIRS; do
DEST_DIR="`yq r -p v $CONFIG_YML deploy\[$REPO_NUM\].dir_maps\[$DIR_NUM\].dest_dir`"
ensure_not_empty "$DEST_DIR"
RSYNC_EXTRA_OPTS="`yq r -p v $CONFIG_YML deploy\[$REPO_NUM\].dir_maps\[$DIR_NUM\].rsync_extra_opts`"

echo "sync '$SRC_DIR' to '$DEST_DIR'"
DEST_DIR_PARENT="`dirname "$DEST_DIR"`"
SRC_DIR_ABS_PATH="`pwd`/$SRC_DIR"
USER_ID="`id -u`"
GROUP_ID="`id -g`"

# Ensure DEST_DIR_PARENT is created using the current USER_ID/GROUP_ID so the
# next rsync has proper write access.
mkdir -p "$DEST_DIR_PARENT"

# Rsync with --checksum to only update files that changed.
docker run --rm --name deploy_data_rsync \
--volume $SRC_DIR_ABS_PATH:$SRC_DIR_ABS_PATH:ro \
--volume $DEST_DIR_PARENT:$DEST_DIR_PARENT:rw \
--user $USER_ID:$GROUP_ID \
--entrypoint /usr/bin/rsync \
$DEPLOY_DATA_RSYNC_IMAGE \
--recursive --links --checksum --delete \
--itemize-changes --human-readable --verbose \
--prune-empty-dirs $RSYNC_EXTRA_OPTS \
$SRC_DIR_ABS_PATH/ $DEST_DIR

DIR_NUM=`expr $DIR_NUM + 1`
done

REPO_NUM=`expr $REPO_NUM + 1`

done


# vi: tabstop=8 expandtab shiftwidth=4 softtabstop=4
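
The `REPO_NUM`/`DIR_NUM` pattern used throughout the script can be shown in isolation: POSIX sh has no arrays, so the script pairs a whitespace-separated list (as returned by yq) with a manual counter incremented via `expr`. The list contents here are illustrative.

```shell
# Indexed iteration without arrays: walk a word list and keep a counter
# so each item can be addressed by position in later yq queries.
ITEMS="repo-a repo-b repo-c"   # illustrative stand-in for GIT_REPO_URLS
NUM=0
for ITEM in $ITEMS; do
    # ...process item $NUM here...
    NUM=`expr $NUM + 1`
done
```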
12 changes: 12 additions & 0 deletions birdhouse/deployment/deploy-data-raven-testdata-to-thredds.yml
@@ -0,0 +1,12 @@
deploy:
#- repo_url: git@github.com:Ouranosinc/raven.git
- repo_url: https://github.com/Ouranosinc/raven
# optional, default "origin/master"
# branch:
checkout_name: raven
dir_maps:
# rsync content below source_dir into dest_dir
- source_dir: tests/testdata
dest_dir: /data/datasets/testdata/raven
# only sync .nc files
rsync_extra_opts: --include=*/ --include=*.nc --exclude=*
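
The filter chain in `rsync_extra_opts` relies on rsync's first-match-wins rule ordering. A rough simulation of that ordering, as a sketch only (not a reimplementation of rsync):

```shell
# First-match-wins evaluation of --include=*/ --include=*.nc --exclude=*:
# directories are kept so rsync can recurse, .nc files are kept,
# everything else falls through to the final exclude.
keep_path() {
    case "$1" in
        */)   echo keep ;;  # --include=*/  : descend into every directory
        *.nc) echo keep ;;  # --include=*.nc: keep NetCDF files
        *)    echo skip ;;  # --exclude=*   : drop everything else
    esac
}
```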
62 changes: 62 additions & 0 deletions birdhouse/deployment/deploy-data.config.sample.yml
@@ -0,0 +1,62 @@
# Sample config file for deploy-data script.
#
# Many git repos are supported. For each repo, many mappings between source dir
# and destination dir are supported. For each mapping, extra rsync options can
# be provided to include/exclude a subset of files to keep in sync.

config:
# optional, default "/tmp/deploy-data-clone-cache"
# can also be set by env var DEPLOY_DATA_CHECKOUT_CACHE
# a setting in this config file has precedence over the env var
#checkout_cache:

# optional, default unset
# for git clone over ssh, useful for private repos
# can also be set by env var DEPLOY_DATA_GIT_SSH_IDENTITY_FILE
# a setting in this config file has precedence over the env var
#git_ssh_identity_file: /path/to/id_rsa

deploy:
# this form if clone over ssh: git@github.com:Ouranosinc/jenkins-master.git
- repo_url: https://github.com/Ouranosinc/jenkins-master
# optional, default "origin/master"
# branch:
checkout_name: jenkins-master
dir_maps:
# rsync content below source_dir into dest_dir
- source_dir: initial-jenkins-plugins-suggestion
dest_dir: /tmp/deploy-data-test-deploy/jenkins-plugins
# optional, useful for include/exclude filter rules
# rsync_extra_opts:

- repo_url: https://github.com/Ouranosinc/jenkins-config
branch: origin/master
checkout_name: jenkins-config
dir_maps:
- source_dir: canarie-presentation/
dest_dir: /tmp/deploy-data-test-deploy/canarie
# sync only .txt, .html and .gif files; if other files already exist at the
# destination, ignore them, unless they have the same extensions.
rsync_extra_opts: --include=*/ --include=*.txt --include=*.html --include=*.gif --exclude=*
- source_dir: jcasc
# remap dir jcasc inside the previous dir canarie, without conflicting with the
# previous canarie sync. This works because there are no .txt, .html or .gif files in jcasc.
dest_dir: /tmp/deploy-data-test-deploy/canarie/jcasc
rsync_extra_opts:

- repo_url: https://github.com/Ouranosinc/pavics-sdi
# branch:
checkout_name: pavics-sdi
dir_maps:
# sync only 2 sub-dirs and .rst files under source/
- source_dir: docs/
dest_dir: /tmp/deploy-data-test-deploy/pavics-sdi
rsync_extra_opts: --include=*/ --include=source/tutorials/** --include=source/processes/** --include=source/*.rst --exclude=*
# sync only .yml files at the root of checkout
- source_dir: .
dest_dir: /tmp/deploy-data-test-deploy/pavics-sdi
rsync_extra_opts: --include=/ --include=*.yml --exclude=*
# move dir 'notebooks' one level higher in hierarchy
- source_dir: docs/source
dest_dir: /tmp/deploy-data-test-deploy/pavics-sdi
rsync_extra_opts: --include=*/ --include=notebooks/** --exclude=*
12 changes: 12 additions & 0 deletions birdhouse/env.local.example
@@ -147,7 +147,19 @@ export POSTGRES_MAGPIE_PASSWORD=postgres-qwerty
# See the job for additional possible configurations. The "scheduler"
# component needs to be enabled for this pre-configured job to work.
#
#if [ -f "/<absolute path>/components/scheduler/renew_letsencrypt_ssl_cert_extra_job.env" ]; then
#. /<absolute path>/components/scheduler/renew_letsencrypt_ssl_cert_extra_job.env
#fi
#
# Load pre-configured cronjob to automatically deploy Raven testdata to Thredds
# for Raven tutorial notebooks.
#
# See the job for additional possible configurations. The "scheduler"
# component needs to be enabled for this pre-configured job to work.
#
#if [ -f "/<absolute path>/components/scheduler/deploy_raven_testdata_to_thredds.env" ]; then
#. /<absolute path>/components/scheduler/deploy_raven_testdata_to_thredds.env
#fi

# Public (on the internet) fully qualified domain name of this Pavics
# installation. This is optional so default to the same internal PAVICS_FQDN if
6 changes: 5 additions & 1 deletion birdhouse/vagrant-utils/configure-pavics.sh
@@ -30,7 +30,11 @@ RENEW_LETSENCRYPT_SSL_SCHEDULE="22 9 * * *" # UTC
# This repo will be volume-mount at /vagrant so can not go higher.
RENEW_LETSENCRYPT_SSL_NUM_PARENTS_MOUNT="/"
. $PWD/components/scheduler/renew_letsencrypt_ssl_cert_extra_job.env
# Only source the file if it exists. This keeps the config file backward-compatible
# with older versions of the repo where the .env file does not exist yet.
if [ -f "$PWD/components/scheduler/renew_letsencrypt_ssl_cert_extra_job.env" ]; then
. $PWD/components/scheduler/renew_letsencrypt_ssl_cert_extra_job.env
fi
EOF
elif [ -n "$KITENAME" -a -n "$KITESUBDOMAIN" ]; then
cat <<EOF >> env.local
