This repo is forked from the Broad Institute and contains code for monitoring resource utilization in Cromwell tasks running on Google Cloud Life Sciences API v2beta.
After cloning this repo, run the following from within the root directory:
$ mamba env create --file environment.yml
$ conda activate cromwell_monitor
$ pre-commit install # install git hook scripts
Run unit tests with:
$ make test
The monitoring script is intended to be used through a Docker image (as part of an associated "monitoring action").
It uses psutil to continuously measure CPU, memory, and disk utilization as well as disk IOPS, and periodically reports them as custom metrics to the Cloud Monitoring API.
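As a rough illustration of the kind of measurement described above, the sketch below takes one sample with psutil and derives IOPS from two readings of the cumulative disk counters. The function names, the dictionary keys, and the choice of `/` as the monitored path are assumptions made for this example and are not taken from the repo's code.

```python
def rate_per_second(prev_count, curr_count, elapsed_s):
    """Convert two cumulative counter readings into a per-second rate."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (curr_count - prev_count) / elapsed_s


def sample(interval_s=1.0, path="/"):
    """Take one utilization sample over interval_s seconds (hypothetical helper)."""
    # Third-party dependency, imported here so rate_per_second stays usable without it.
    import psutil

    # disk_io_counters() may return None on systems without disks;
    # this sketch assumes at least one disk is present.
    io_before = psutil.disk_io_counters()
    cpu_pct = psutil.cpu_percent(interval=interval_s)  # blocks for interval_s
    io_after = psutil.disk_io_counters()

    return {
        "cpu_pct": cpu_pct,
        "mem_pct": psutil.virtual_memory().percent,
        "disk_pct": psutil.disk_usage(path).percent,
        "read_iops": rate_per_second(
            io_before.read_count, io_after.read_count, interval_s
        ),
        "write_iops": rate_per_second(
            io_before.write_count, io_after.write_count, interval_s
        ),
    }
```

Calling `sample()` in a loop and averaging several samples before each report would approximate the "measure continuously, report periodically" behavior; the actual reporting cadence used by the monitoring image is not shown here.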
The labels for each time point contain the following metadata:
- Cromwell-specific values, such as workflow ID, task call name, index and attempt.
- GCP instance values, such as instance name, zone, number of CPU cores, total memory and disk size.
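The metadata listed above could be modeled roughly as follows. This is a minimal sketch: the class name, field names, and units are hypothetical and may differ from the label keys the monitoring script actually emits. Values are converted to strings because Cloud Monitoring metric labels are string-to-string maps.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskLabels:
    """Hypothetical container for the per-time-point metadata."""

    # Cromwell-specific values
    workflow_id: str
    task_call_name: str
    task_call_index: int
    task_call_attempt: int
    # GCP instance values
    instance_name: str
    zone: str
    cpu_count: int
    mem_size_gb: float
    disk_size_gb: float


def to_metric_labels(labels):
    """Stringify every field so the result can be used as metric labels."""
    return {key: str(value) for key, value in asdict(labels).items()}
```

Keeping the labels in one immutable object makes it harder to report a time point with only part of its metadata attached.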
This approach enables:
- Users to plot real-time resource usage statistics across all tasks in a workflow, or for a single task call across many workflow runs.
  This makes it quick to identify outlier tasks that could use optimization, without any additional configuration or code.
- Scripts to compute aggregate statistics on resource utilization and to produce suggestions based on them.
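To make the idea of aggregate statistics and suggestions concrete, here is a small sketch that assumes the utilization samples for a task have already been fetched into a plain list. The function names and the 25% headroom factor are illustrative assumptions, not part of this repo.

```python
from statistics import mean


def summarize(samples):
    """Return the mean and peak of a non-empty list of utilization samples."""
    return {"mean": mean(samples), "peak": max(samples)}


def suggest_memory_gb(peak_used_gb, provisioned_gb, headroom=1.25):
    """Suggest a memory request: observed peak plus headroom, capped at what was provisioned."""
    target = round(peak_used_gb * headroom, 1)
    return min(provisioned_gb, target)
```

For example, a task that peaked at 2 GB on a 16 GB instance would get a suggestion of 2.5 GB under these assumptions, while a task already using nearly all of its memory would keep its current allocation.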
TestMonitoring.wdl can be used to verify that the monitoring action/container is working as intended.