This repo contains code for monitoring resource utilization in Cromwell tasks running on Google Genomics Pipelines API v2alpha1.
The monitoring script is intended to be used through a Docker image (as part of an associated "monitoring action"), currently built as quay.io/broadinstitute/cromwell-monitor.
It uses psutil to continuously measure CPU, memory and disk utilization and disk IOPS, and periodically reports them as distinct metrics to the Stackdriver Monitoring API.
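The measure-then-report cycle described above could look roughly like the sketch below. It is not the script shipped in the image: the function names, metric keys, interval defaults and the choice of averaging over each reporting window are all assumptions made for illustration, and disk IOPS is left out for brevity. The `report` callback stands in for whatever writes to the monitoring API.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Optional

Sample = Dict[str, float]


def psutil_sample() -> Sample:
    """Take one utilization sample with psutil.

    psutil is imported lazily so the rest of this sketch can be
    exercised with a fake sampler on machines without psutil.
    """
    import psutil

    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }


def summarize(samples: List[Sample]) -> Sample:
    """Average each metric over the samples of one reporting window."""
    return {key: mean(s[key] for s in samples) for key in samples[0]}


def monitor(
    sample: Callable[[], Sample],
    report: Callable[[Sample], None],
    measure_interval: float = 1.0,   # seconds between samples (assumed)
    report_every: int = 60,          # samples per report (assumed)
    windows: Optional[int] = None,   # None means run until killed
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Sample continuously and hand one averaged report per window to `report`."""
    collected: List[Sample] = []
    done = 0
    while windows is None or done < windows:
        collected.append(sample())
        if len(collected) == report_every:
            report(summarize(collected))
            collected = []
            done += 1
        sleep(measure_interval)
```

For a quick local check, `monitor(psutil_sample, print, report_every=5, windows=1)` prints one averaged sample after about five seconds.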
The labels for each time point contain:
- Cromwell-specific values, such as workflow ID, task call name, index and attempt.
- GCP instance values, such as instance name, zone, number of CPU cores, total memory and disk size.
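As an illustration of how the two groups of labels might be combined for a single time point, here is a minimal sketch. The function name and the label keys in the usage example are hypothetical, not the names the monitoring script actually uses, and values are stringified on the assumption that labels are sent as strings.

```python
from typing import Dict, Mapping


def build_labels(
    cromwell: Mapping[str, object],
    instance: Mapping[str, object],
) -> Dict[str, str]:
    """Merge Cromwell and GCP instance values into one flat label map."""
    overlap = set(cromwell) & set(instance)
    if overlap:
        # Refuse to silently overwrite one group's value with the other's.
        raise ValueError(f"duplicate label keys: {sorted(overlap)}")
    merged = {**cromwell, **instance}
    return {key: str(value) for key, value in merged.items()}
```

For example, `build_labels({"workflow_id": "abc", "attempt": 1}, {"cpu_count": 4})` returns `{"workflow_id": "abc", "attempt": "1", "cpu_count": "4"}`.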
This approach enables:
- Users to easily plot real-time resource usage statistics across all tasks in a workflow, or for a single task call across many workflow runs, etc.
  This makes it quick to identify outlier tasks that could use optimization, without the need for any configuration or code.
- Scripts to easily get aggregate statistics on resource utilization and to produce suggestions based on those.
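A script of the second kind could be as small as the sketch below. It assumes the time points have already been fetched and flattened into dictionaries; the keys `task_call_name` and `mem_percent`, the thresholds and the wording of the suggestions are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping


def aggregate_by_task(points: Iterable[Mapping]) -> Dict[str, Dict[str, float]]:
    """Group memory utilization by task call name and compute mean and peak."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for point in points:
        grouped[point["task_call_name"]].append(point["mem_percent"])
    return {
        task: {"mean": mean(values), "peak": max(values)}
        for task, values in grouped.items()
    }


def suggest(
    stats: Mapping[str, Mapping[str, float]],
    low: float = 50.0,    # peak below this: likely over-provisioned (assumed)
    high: float = 90.0,   # peak above this: likely under-provisioned (assumed)
) -> Dict[str, str]:
    """Turn per-task statistics into simple resizing suggestions."""
    suggestions: Dict[str, str] = {}
    for task, s in stats.items():
        if s["peak"] < low:
            suggestions[task] = "consider requesting less memory"
        elif s["peak"] > high:
            suggestions[task] = "consider requesting more memory"
    return suggestions
```

Tasks whose peak falls between the two thresholds get no suggestion, so the output lists only the outliers.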
TestMonitoring.wdl can be used to verify that the monitoring action/container is working as intended.