This repository demonstrates using Lithops to run the METASPACE metabolite annotation pipeline on cloud resources.
METASPACE is a cloud engine for spatial metabolomics that performs molecular annotation of imaging mass spectrometry data. It takes an imaging mass spectrometry dataset and outputs the molecules (e.g. metabolites and lipids) that are represented in the dataset, with assigned scores and false discovery rates. METASPACE is free and open source, developed by the Alexandrov team at EMBL Heidelberg with generous European and US funding, and is used by a growing community of over 500 users across the world. For more information, visit the METASPACE website.
Annotating high-resolution imaging mass spectrometry data often requires multiple CPU-days and more than 100 GB of temporary storage, making it impractical to run on typical desktop computers. Lithops allows the processing to be offloaded almost seamlessly to cloud compute resources, rapidly scaling up to use as much compute power as is available in your cloud of choice (e.g. 1000 parallel invocations in IBM Cloud) during the intensive stages of the pipeline, and scaling down during less parallelizable stages to minimize cost.
This repository includes two variant implementations of the annotation pipeline, selectable through runtime configuration:
- A purely Serverless Functions implementation, which runs on any cloud Lithops supports (including IBM Cloud, Google Cloud, AWS, Azure and on-premise Knative/OpenWhisk installations).
- A hybrid Serverless + VM implementation, which enables several pipeline stages to use more efficient but more memory-intensive algorithms on large cloud VMs. This configuration is currently only supported with IBM Cloud and on-premise VMs.
- Python 3.8.5
- An account with a supported cloud provider (if running on a cloud platform)
- Jupyter Notebook or Jupyter Lab (if running the benchmark notebooks)
Clone and install this repository with the following commands:
git clone https://github.com/metaspace2020/Lithops-METASPACE.git
cd Lithops-METASPACE
pip install -e .
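If you prefer to keep the installation isolated, you can create a virtual environment before running the pip install step above. A minimal sketch, assuming python3.8 is available on your PATH:

python3.8 -m venv venv
source venv/bin/activate
pip install -e .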
The purely Serverless and Hybrid implementations have different platform requirements when running on cloud platforms. In "localhost" mode (i.e. not using cloud resources), both implementations are supported.
This is the default mode. If you don't have any existing Lithops configuration, no configuration is needed. If you have an existing Lithops config file, change the following values:
lithops:
    mode: "localhost"
    storage: "localhost"
    workers: # Leave this blank to auto-detect the CPU count
Follow the Lithops instructions to configure a Serverless compute backend and a storage backend. Additionally, set the following values in the Lithops config:
lithops:
    mode: "serverless"
    include_modules: ["annotation_pipeline"]
    data_limit: false
serverless:
    runtime: "macarronesc0lithops/metaspace2020:01"
Hybrid mode requires both a Standalone and a Serverless executor to be configured, sharing the same storage backend. Currently this combination is only possible with IBM Virtual Private Cloud, IBM Cloud Functions and IBM Cloud Object Storage.
Follow the Lithops instructions to configure the 3 backends. Additionally, set the following values in the Lithops config:
lithops:
    mode: "serverless"
    include_modules: ["annotation_pipeline"]
    data_limit: false
serverless:
    runtime: "macarronesc0lithops/metaspace2020:01"
standalone:
    runtime: "macarronesc0lithops/metaspace2020:01"
Launch Jupyter Notebook and open this directory. The main notebook is annotation-pipeline-demo.ipynb, which allows you to run through the whole pipeline and see the results at each step.
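For example, assuming Jupyter is installed, it can be launched from the repository root with:

jupyter notebook

(or jupyter lab if you prefer Jupyter Lab).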
There are also 3 notebooks prepared for benchmarking:
- experiment-1-typical.ipynb - Demonstrates running through the whole Serverless metabolite annotation pipeline with a typical dataset, downloading the results and comparing them against the Serverful implementation of METASPACE.
- experiment-2-interactive.ipynb - An example of running the pipeline against a smaller set of molecules, to demonstrate the potential of Serverless to provide low-latency access to computing resources.
- experiment-3-large.ipynb - A stress test that runs the Serverless metabolite annotation pipeline with a large dataset and many molecular databases.
usage: python3 -m annotation_pipeline annotate [ds_config.json] [db_config.json] [output path]
positional arguments:
ds_config.json ds_config.json path
db_config.json db_config.json path
output directory to write output files
optional arguments:
-h, --help show this help message and exit
--no-output prevents outputs from being written to file
--no-cache prevents loading cached data from previous runs
--impl {serverless,hybrid,auto}
Selects whether to use the Serverless or Hybrid
implementation. "auto" will select the Hybrid
implementation if the selected platform is supported
and correctly configured (running in localhost mode,
or in serverless mode with ibm_vpc configured)
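For example, a run against the bundled example configs might look like the following (the exact config filenames in the metabolomics directory may differ in your checkout; output here is just a directory name for the results):

python3 -m annotation_pipeline annotate metabolomics/ds_config1.json metabolomics/db_config1.json output --impl auto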
The main inputs to the pipeline are specified in two JSON files: the dataset and database configs.
There are example config files in the [metabolomics][metabolomics] directory.
Dataset configs should follow this format:
{
"name": "****", // A unique name for this dataset (used for caching)
"imzml_path": "https://****.imzML or C:\****.imzML", // URL or filesystem path to the .imzML file
"ibd_path": "https://****.ibd or C:\****.ibd", // URL or filesystem path to the .ibd file
"num_decoys": 20, // Number of decoys to use for FDR calculation (can be any integer between 1 and 80)
"polarity": "+", // Ionization mode of the dataset ("+" or "-")
"isocalc_sigma": 0.001238, // The "sigma" parameter representing the expected peak width at 200 m/z based on the instrument's resolving power
// Common values are:
// RP 70,000 @ 200 m/z: 0.002476
// RP 140,000 @ 200 m/z: 0.001238
// RP 200,000 @ 200 m/z: 0.000867
// RP 280,000 @ 200 m/z: 0.000619
"metaspace_id": "**** (Optional)" // Optional ID of a dataset at https://metaspace2020.eu to validate the results against
}
The imzML and ibd files may also be specified as URL-like paths to cloud storage, e.g. cos://datasets/ds.imzML for IBM COS or s3://datasets/ds.imzML for AWS S3.
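Because the // comments above are only explanatory, an actual config file is expected to be plain JSON. An illustrative example (all values below are placeholders):

{
    "name": "example_dataset",
    "imzml_path": "cos://datasets/example.imzML",
    "ibd_path": "cos://datasets/example.ibd",
    "num_decoys": 20,
    "polarity": "+",
    "isocalc_sigma": 0.001238
}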
Database configs should follow this format:
{
"name": "db_configN", // A unique name for this database (used for caching)
"databases": ["metabolomics/db/mol_db1.csv"], // Filesystem path to CSV file containing formulas
"adducts": ["","+H","+Na","+K"], // Adducts to search for
"modifiers": ["", "-H2O", "-CO2", "-NH3"] // Neutral losses or chemical modifications to search for
}
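As with the dataset config, an actual file would omit the comments. An illustrative example, reusing one of the bundled databases (the name is a placeholder):

{
    "name": "db_config_example",
    "databases": ["metabolomics/db/mol_db1.csv"],
    "adducts": ["", "+H", "+Na", "+K"],
    "modifiers": ["", "-H2O", "-CO2", "-NH3"]
}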
Dataset | Author | Config file |
---|---|---|
Brain02_Bregma1-42_02 | Régis Lavigne, University of Rennes 1 | ds_config1.json |
AZ_Rat_Brains | Nicole Strittmatter, AstraZeneca | ds_config2.json |
CT26_xenograft | Nicole Strittmatter, AstraZeneca | ds_config3.json |
Mouse brain test434x902 (captured with AP-SMALDI5 and Q Exactive HF Orbitrap) | Dhaka Bhandari, Justus-Liebig-University Giessen | ds_config4.json |
X089-Mousebrain_842x603 (captured with AP-SMALDI5 and Q Exactive HF Orbitrap) | Dhaka Bhandari, Justus-Liebig-University Giessen | ds_config5.json |
Microbial interaction slide | Don Nguyen, European Molecular Biology Laboratory | ds_config6.json |
These molecular databases can be selected in the db_config.json files. They are automatically converted to pickle format and uploaded to IBM Cloud in the notebooks.
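As a rough illustration of what that conversion involves (this is not the repository's actual code; the bucket and key names are placeholders), a database CSV could be pickled and uploaded through the Lithops storage API along these lines:

import pickle
import pandas as pd
from lithops import Storage

# Load the molecular database CSV into a DataFrame of formulas
mol_db = pd.read_csv("metabolomics/db/mol_db1.csv")

# Serialize to pickle bytes and upload via the configured storage backend
storage = Storage()
payload = pickle.dumps(mol_db)
storage.put_object(bucket="example-bucket", key="metabolomics/db/mol_db1.pickle", body=payload)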
Database | Filename | Description |
---|---|---|
HMDB | mol_db1.csv | Human Metabolome Database |
ChEBI | mol_db2.csv | Chemical Entities of Biological Interest |
LIPID MAPS | mol_db3.csv | |
SwissLipids | mol_db4.csv | |
Small database | mol_db5.csv | This database is used in Experiment 2 as an example of a small set of user-supplied molecules for running small, interactive annotation jobs. |
Peptide databases | mol_db7.csv ... mol_db12.csv | A collection of databases of predicted peptides. These databases were contributed by Benjamin Baluff (M4I, Maastricht University) exclusively for use with METASPACE. |
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825184.