This project is the samplesheet check backend and infrastructure for UMCCR.
The project contains app.py,
which is the entry point of the app, and the following directories:
- src - where the lambdas' source code lives
- stacks - the CDK stacks defining how the code is structured in AWS
It is recommended to create a virtual environment for the app.
To do so, please follow the instructions below.
Change directory to the root of the repository (where this README is located).
Create a virtual environment for the app.
virtualenv .venv --python=python3.11
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
source .venv/bin/activate
Install all dependencies
make install
Prerequisite
- A valid SSL certificate in the us-east-1 region at ACM for all the domain names needed. See here (the alias_domain_name field on the props variable) for which domain names need to be included; this is determined by the account being deployed to.
- An SSM Parameter, with the name /sscheck/api/ssl_certificate_arn, containing the ARN of the certificate created above.
Deploying the stack without the prerequisites above may result in a stack rollback.
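If these do not exist yet, the following is a minimal sketch of creating them with the AWS CLI. The domain name, the account/certificate placeholders, and the SSM parameter type are assumptions; check the props variable and the stack code for the exact values expected.
# Request an ACM certificate in us-east-1 (domain name here is only an example)
aws acm request-certificate \
  --region us-east-1 \
  --domain-name "api.sscheck.dev.umccr.org" \
  --validation-method DNS \
  --profile "${AWS_PROFILE}"
# Store the certificate ARN in SSM under the expected parameter name
# (substitute the real ARN returned by the command above)
aws ssm put-parameter \
  --name "/sscheck/api/ssl_certificate_arn" \
  --type String \
  --value "arn:aws:acm:us-east-1:<account-id>:certificate/<certificate-id>" \
  --profile "${AWS_PROFILE}"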
There are 2 stacks in this application:
- SSCheckBackEndCdkPipeline/SampleSheetCheckBackEndStage/SampleSheetCheckBackEnd - Contains the application stack
- SSCheckBackEndCdkPipeline - Contains the pipeline for the stack to run and self-update
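To confirm these stack names against your checkout (using the same AWS_PROFILE convention as the deploy commands below), you can list them with the CDK CLI:
cdk ls --profile=${AWS_PROFILE}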
To deploy the application stack, you will need to deploy the pipeline
stack. The pipeline stack will take care of the application stack.
Deploy pipeline stack
cdk deploy SSCheckBackEndCdkPipeline --profile=${AWS_PROFILE}
Alternatively, you can deploy the application stack directly with the following command:
cdk deploy SSCheckBackEndCdkPipeline/SampleSheetCheckBackEndStage/SampleSheetCheckBackEnd --profile=${AWS_PROFILE}
The lab metadata is synced every 24 hours (overnight); however, if you need to update it on demand, the following command may be of assistance.
Ensure you are logged in to the right AWS account and then run:
aws lambda invoke \
--function-name data-portal-api-prod-labmetadata_scheduled_update_processor \
--output json \
output.json
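The invocation writes the lambda's response to output.json; inspecting it is a quick way to confirm the call went through (the exact payload depends on the lambda).
cat output.json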
Prerequisite:
sam --version
SAM CLI, version 1.100.
cdk --version
2.114.1 (build 02bbb1d)
The local start can configure the domain name used for the metadata lookup. Currently it points to localhost:8000,
where the data-portal-api runs locally, but you can change it to point to the remote domain name
(e.g. api.portal.dev.umccr.org or api.portal.prod.umccr.org). This is configured in the local-start-env-var.json
file located at the root of the directory. Pass the appropriate bearer token when calling this local endpoint to make use of
the remote metadata endpoint.
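As a rough sketch, the override might look like the following. Both the top-level Parameters key (a sam local --env-vars convention) and the data_portal_domain_name variable name are assumptions based on the testing section further below, so check the existing file and the Makefile for the exact structure before replacing it.
cat > local-start-env-var.json <<'EOF'
{
  "Parameters": {
    "data_portal_domain_name": "api.portal.dev.umccr.org"
  }
}
EOF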
To start, simply use the Makefile to run a local API at localhost:8001. Run:
make start
You could call this endpoint with the following command.
curl --location 'http://127.0.0.1:8001/' \
--header 'Authorization: Bearer ${TOKEN}' \
--form 'file=@"/the/samplesheet/path"' \
--form 'logLevel="ERROR"'
You could import this into Postman and take advantage of the UI to select the appropriate SampleSheet file.
This section goes through running the samplesheet check against the API.
This tests the deployed version of the samplesheet check, rather than your local code.
You can use the curl binary to make a POST request to the API.
This script expects the user to have set the following environment variables:
PORTAL_TOKEN (can be obtained from the data.umccr.org home page)
API_URL="https://api.sscheck.prod.umccr.org" # Dev URL: https://api.sscheck.dev.umccr.org
SAMPLESHEET_FILE="SampleSheet.csv"
curl \
--location \
--request POST \
--header "Authorization: Bearer ${PORTAL_TOKEN}" \
--form "logLevel=ERROR" \
--form "file=@${SAMPLESHEET_FILE}" \
"${API_URL}"
Some unit tests are placed in the respective test folders. Alternatively, this section gives a tutorial for writing your own testing script.
The tutorial goes through running the samplesheet check functions locally.
This allows a user to debug the code against a failing or passing samplesheet.
Create a conda env / virtual env that you can deploy your requirements to
conda create --yes \
--name samplesheet-check-backend \
--channel conda-forge \
pip \
python==3.8
Install requirements into conda env / virtual env.
conda activate samplesheet-check-backend
pip install -r src/layers/requirements.txt
Set the PYTHONPATH env var to the layers directory so that umccr_utils is found.
mkdir -p "${CONDA_PREFIX}/etc/conda/activate.d"
echo '#!/usr/bin/env bash' > "${CONDA_PREFIX}/etc/conda/activate.d/umccr_utils.sh"
echo "export PYTHONPATH=\"${PWD}/lambdas/layers/:\$PYTHONPATH\"" >> "${CONDA_PREFIX}/etc/conda/activate.d/umccr_utils.sh"
Re-activate the conda env.
conda deactivate
conda activate samplesheet-check-backend
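To verify the activation hook took effect, print the variable; it should start with the layers path exported above.
echo "${PYTHONPATH}"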
In order to test our samplesheet, we need to run two separate functions in the samplesheet_check.py script:
run_sample_sheet_content_check, which ensures that:
- the samplesheet header has the right settings entered
- none of the indexes clash within each lane set in the samplesheet.
run_sample_sheet_check_with_metadata, which ensures that:
- if the library id has a topup suffix, the original sample already exists.
- the assay and type in the labmetadata are set as expected. 🚧 # Not yet implemented
- the override cycles in the metadata are consistent with the number of non-N bases in the indexes.
- for each sample, the override cycles all suggest the same number of cycles for each read.
An example shell script testing the samplesheet SampleSheet.csv is shown below.
This script expects the user to have set the following environment variables:
- PORTAL_TOKEN (can be obtained from the data.umccr.org home page)
- data_portal_domain_name set to api.data.prod.umccr.org or api.data.dev.umccr.org
#!/usr/bin/env bash
: '
This script has three sections
1. Setup
- check the portal token env var
- check the samplesheet exists
- check samplesheet_check.py exists
2. Call the run_sample_sheet_content_check function
3. Call the run_sample_sheet_check_with_metadata function
'
# Set to fail
set -euo pipefail
### USER ####
SAMPLESHEET_FILE="SampleSheet.csv"
#############
## GLOBALS ##
SAMPLESHEET_CHECK_SCRIPT="lambdas/functions/samplesheet_check.py"
CONDA_ENV_NAME="samplesheet-check-backend"
#############
### SETUP ###
if [[ -z "${PORTAL_TOKEN-}" ]]; then
echo "Error: Could not get the env var 'PORTAL_TOKEN'. Exiting" 1>&2
exit 1
fi
if [[ -z "${data_portal_domain_name-}" ]]; then
echo "Error: Could not get the env var 'data_portal_domain_name'. Exiting" 1>&2
exit 1
fi
if [[ ! -f "${SAMPLESHEET_FILE}" ]]; then
echo "Error: Could not find the file '${SAMPLESHEET_FILE}'" 1>&2
exit 1
fi
if [[ ! -f "${SAMPLESHEET_CHECK_SCRIPT}" ]]; then
echo "Error: Could not find the file '${SAMPLESHEET_CHECK_SCRIPT}'" 1>&2
exit 1
fi
#############
### TESTS ###
python_file="$(mktemp)"
cat << EOF > "${python_file}"
#!/usr/bin/env python3
# Imports
import asyncio

from samplesheet.samplesheet_check import run_sample_sheet_content_check
from samplesheet.samplesheet_check import run_sample_sheet_check_with_metadata
from utils.samplesheet import SampleSheet
# Get auth header for portal
auth_header = "Bearer ${PORTAL_TOKEN}"
# Get samplesheet
sample_sheet_path = "${SAMPLESHEET_FILE}"
sample_sheet = SampleSheet(sample_sheet_path)
# Check 1
run_sample_sheet_content_check(sample_sheet)
# Check 2
async def set_and_check_metadata(sample_sheet, auth_header):
    # Set metadata by querying the portal API
    loop = asyncio.get_running_loop()
    error = await asyncio.gather(
        sample_sheet.set_metadata_df_from_api(auth_header, loop),
    )
    # Run metadata check
    run_sample_sheet_check_with_metadata(sample_sheet)

loop = asyncio.new_event_loop()
loop.run_until_complete(set_and_check_metadata(sample_sheet, auth_header))
loop.close()
EOF
echo "Running samplesheet '${SAMPLESHEET_FILE}' through check script '${SAMPLESHEET_CHECK_SCRIPT}'" 1>&2
conda run \
--name "${CONDA_ENV_NAME}" \
python3 "${python_file}"
echo "Test complete!" 1>&2
rm "${python_file}"
#############