This repository demonstrates using Lithops to run the METASPACE metabolite annotation pipeline on cloud resources.
METASPACE is a cloud engine for spatial metabolomics that performs molecular annotation of imaging mass spectrometry data. It takes an imaging mass spectrometry dataset and outputs the molecules (e.g. metabolites and lipids) that are represented in the dataset, with assigned scores and false discovery rates. METASPACE is free and open source, developed by the Alexandrov team at EMBL Heidelberg with generous European and US funding, and is used by a growing community of over 500 users across the world. For more information, visit the METASPACE website.
Annotating high-resolution imaging mass spectrometry data often requires multiple CPU-days and more than 100 GB of temporary storage, making it impractical to run on typical desktop computers. Lithops allows the processing to be offloaded almost seamlessly to cloud compute resources, rapidly scaling up to use as much compute power as is available in your cloud of choice (e.g. 1000 parallel invocations in IBM Cloud) during the intensive stages of the pipeline, and scaling down during less parallelizable stages to minimize cost.
This repository includes two variant implementations of the annotation pipeline, selectable through runtime configuration:
- A purely Serverless Functions implementation, which runs on any cloud Lithops supports (including IBM Cloud, Google Cloud, AWS, Azure and on-premise Knative/OpenWhisk installations).
- A hybrid Serverless + VM implementation, which enables several pipeline stages to use more efficient but more memory-intensive algorithms on large cloud VMs. This configuration is currently only supported with IBM Cloud and on-premise VMs.
- Python 3.8.5
- An account with a supported cloud provider (if running on a cloud platform)
- Jupyter Notebook or Jupyter Lab (if running the benchmark notebooks)
Clone and install this repository with the following commands:
git clone https://github.com/metaspace2020/Lithops-METASPACE.git
cd Lithops-METASPACE
pip install -e .
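If you prefer to keep the installation isolated, you can create a virtual environment before running the pip install step above. A minimal sketch, assuming python3.8 is available on your PATH:

python3.8 -m venv venv
source venv/bin/activate
pip install -e .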
The purely Serverless and Hybrid implementations have different platform requirements when running on cloud platforms. In "localhost" mode (i.e. not using cloud resources), both implementations are supported.
This is the default mode. If you don't have any existing Lithops configuration, no configuration is needed. If you have an existing Lithops config file, change the following values:
lithops:
    mode: "localhost"
    storage: "localhost"
    workers: # Leave this blank to auto-detect the CPU count
Follow the Lithops instructions to configure a Serverless compute backend and a storage backend. Additionally, set the following values in the Lithops config:
lithops:
    mode: "serverless"
    include_modules: ["annotation_pipeline"]
    data_limit: false
serverless:
    runtime: "macarronesc0lithops/metaspace2020:01"
Hybrid mode requires both a Standalone and a Serverless executor to be configured, sharing the same storage backend. Currently this combination is only possible with IBM Virtual Private Cloud, IBM Cloud Functions and IBM Cloud Object Storage.
Follow the Lithops instructions to configure the 3 backends. Additionally, set the following values in the Lithops config:
lithops:
    mode: "serverless"
    include_modules: ["annotation_pipeline"]
    data_limit: false
serverless:
    runtime: "macarronesc0lithops/metaspace2020:01"
standalone:
    runtime: "macarronesc0lithops/metaspace2020:01"
Launch Jupyter Notebook and open this directory. The main notebook is annotation-pipeline-demo.ipynb, which allows you to run through the whole pipeline and see the results at each step.
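For example, assuming Jupyter is installed, it can be launched from the repository root with:

jupyter notebook

(or jupyter lab if you prefer Jupyter Lab).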
There are also 3 notebooks prepared for benchmarking:
- experiment-1-typical.ipynb - Demonstrates running through the whole Serverless metabolite annotation pipeline with a typical dataset, downloading the results and comparing them against the Serverful implementation of METASPACE.
- experiment-2-interactive.ipynb - An example of running the pipeline against a smaller set of molecules, to demonstrate the potential of Serverless to provide low-latency access to computing resources.
- experiment-3-large.ipynb - A stress test that runs the Serverless metabolite annotation pipeline with a large dataset and many molecular databases.
usage: python3 -m annotation_pipeline annotate [ds_config.json] [db_config.json] [output path]
positional arguments:
ds_config.json ds_config.json path
db_config.json db_config.json path
output directory to write output files
optional arguments:
-h, --help show this help message and exit
--no-output prevents outputs from being written to file
--no-cache prevents loading cached data from previous runs
--impl {serverless,hybrid,auto}
Selects whether to use the Serverless or Hybrid
implementation. "auto" will select the Hybrid
implementation if the selected platform is supported
and correctly configured (running in localhost mode,
or in serverless mode with ibm_vpc configured)
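For example, a run against the bundled example configs might look like the following (the exact config filenames in the metabolomics directory may differ in your checkout; output here is just a directory name for the results):

python3 -m annotation_pipeline annotate metabolomics/ds_config1.json metabolomics/db_config1.json output --impl auto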
The main inputs to the pipeline are specified in two JSON files: the dataset and database configs.
There are example config files in the [metabolomics][metabolomics] directory.
Dataset configs should follow this format:
{
"name": "****", // A unique name for this dataset (used for caching)
"imzml_path": "https://****.imzML or C:\****.imzML", // URL or filesystem path to the .imzML file
"ibd_path": "https://****.ibd or C:\****.ibd", // URL or filesystem path to the .ibd file
"num_decoys": 20, // Number of decoys to use for FDR calculation (can be any integer between 1 and 80)
"polarity": "+", // Ionization mode of the dataset ("+" or "-")
"isocalc_sigma": 0.001238, // The "sigma" parameter representing the expected peak width at 200 m/z based on the instrument's resolving power
// Common values are:
// RP 70,000 @ 200 m/z: 0.002476
// RP 140,000 @ 200 m/z: 0.001238
// RP 200,000 @ 200 m/z: 0.000867
// RP 280,000 @ 200 m/z: 0.000619
"metaspace_id": "**** (Optional)" // Optional ID of a dataset at https://metaspace2020.eu to validate the results against
}
The imzML and ibd files may also be specified as URL-like paths to cloud storage, e.g. cos://datasets/ds.imzML for IBM COS or s3://datasets/ds.imzML for AWS S3.
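Because the // comments above are only explanatory, an actual config file is expected to be plain JSON. An illustrative example (all values below are placeholders):

{
    "name": "example_dataset",
    "imzml_path": "cos://datasets/example.imzML",
    "ibd_path": "cos://datasets/example.ibd",
    "num_decoys": 20,
    "polarity": "+",
    "isocalc_sigma": 0.001238
}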
Database configs should follow this format:
{
"name": "db_configN", // A unique name for this database (used for caching)
"databases": ["metabolomics/db/mol_db1.csv"], // Filesystem path to CSV file containing formulas
"adducts": ["","+H","+Na","+K"], // Adducts to search for
"modifiers": ["", "-H2O", "-CO2", "-NH3"] // Neutral losses or chemical modifications to search for
}
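As with the dataset config, an actual file would omit the comments. An illustrative example, reusing one of the bundled databases (the name is a placeholder):

{
    "name": "db_config_example",
    "databases": ["metabolomics/db/mol_db1.csv"],
    "adducts": ["", "+H", "+Na", "+K"],
    "modifiers": ["", "-H2O", "-CO2", "-NH3"]
}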
Dataset | Author | Config file |
---|---|---|
Brain02_Bregma1-42_02 | Régis Lavigne, University of Rennes 1 | ds_config1.json |
AZ_Rat_Brains | Nicole Strittmatter, AstraZeneca | ds_config2.json |
CT26_xenograft | Nicole Strittmatter, AstraZeneca | ds_config3.json |
Mouse brain test434x902 (captured with AP-SMALDI5 and Q Exactive HF Orbitrap) | Dhaka Bhandari, Justus-Liebig-University Giessen | ds_config4.json |
X089-Mousebrain_842x603 (captured with AP-SMALDI5 and Q Exactive HF Orbitrap) | Dhaka Bhandari, Justus-Liebig-University Giessen | ds_config5.json |
Microbial interaction slide | Don Nguyen, European Molecular Biology Laboratory | ds_config6.json |
These molecular databases can be selected in the db_config.json files. They are automatically converted to pickle format and uploaded to IBM Cloud in the notebooks.
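As a rough illustration of what that conversion involves (this is not the repository's actual code; the bucket and key names are placeholders), a database CSV could be pickled and uploaded through the Lithops storage API along these lines:

import pickle
import pandas as pd
from lithops import Storage

# Load the molecular database CSV into a DataFrame of formulas
mol_db = pd.read_csv("metabolomics/db/mol_db1.csv")

# Serialize to pickle bytes and upload via the configured storage backend
storage = Storage()
payload = pickle.dumps(mol_db)
storage.put_object(bucket="example-bucket", key="metabolomics/db/mol_db1.pickle", body=payload)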
Database | Filename | Description |
---|---|---|
HMDB | mol_db1.csv | Human Metabolome Database |
ChEBI | mol_db2.csv | Chemical Entities of Biological Interest |
LIPID MAPS | mol_db3.csv | |
SwissLipids | mol_db4.csv | |
Small database | mol_db5.csv | This database is used in Experiment 2 as an example of a small set of user-supplied molecules for running small, interactive annotation jobs. |
Peptide databases | mol_db7.csv ... mol_db12.csv | A collection of databases of predicted peptides. These databases were contributed by Benjamin Baluff (M4I, Maastricht University) exclusively for use with METASPACE. |
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825184.