PominovaMS/denovo_benchmarks

Benchmarking de novo peptide sequencing algorithms

Adding a new algorithm

Make a pull request to add your algorithm to the benchmarking system.

Add your algorithm in the denovo_benchmarks/algorithms/algorithm_name folder by providing
container.def, make_predictions.sh, input_mapper.py, output_mapper.py files.
Detailed file descriptions are given below.

Templates for each file implementation can be found in the algorithms/base/ folder.
It also includes the InputMapperBase and OutputMapperBase base classes for implementing input and output mappers.
For examples, see the Casanovo and DeepNovo implementations.

  • container.def — definition file of the Apptainer container image that creates the environment and installs the dependencies required to run the algorithm.

  • make_predictions.sh — bash script to run the de novo algorithm on the input dataset (folder with MS spectra in .mgf files) and generate an output file with per-spectrum peptide predictions.
    Input: path to a dataset folder containing .mgf files with spectra data
    Output: output file (in a common output format) containing predictions for all spectra in the dataset

    To configure the model for specific data properties (e.g. non-tryptic data or data from a particular instrument), please use dataset tags. The current set of tags is defined in DatasetTag in dataset_config.py and includes nontryptic, timstof, waters, and sciex. Example usage can be found in algorithms/base/make_predictions_template.sh.

  • input_mapper.py — python script to convert input data from its original representation (input format) to the format expected by the algorithm.

    Input format

    • Input: a dataset folder with separate .mgf files containing MS spectra.
    • Key order for a spectrum in an .mgf file:
      [TITLE, RTINSECONDS, PEPMASS, CHARGE]
  • output_mapper.py — python script to convert the algorithm output to the common output format.

    Output format

    • .csv file (with sep=",")

    • must contain columns:

      • "sequence" — predicted peptide sequence, written in the predefined output sequence format
      • "score" — de novo algorithm "confidence" score for a predicted sequence
      • "aa_scores" — per-amino acid scores, if available. If not available, the whole peptide score will be used as a score for each amino acid.
      • "spectrum_id" — information to match each prediction with its ground truth sequence.
        {filename}:{index} string, where
        filename — name of the .mgf file in a dataset,
        index — index (0-based) of each spectrum in an .mgf file.
    • Output sequence format

      • 20 amino acid tokens:
        G, A, S, P, V, T, C, L, I, N, D, Q, K, E, M, H, F, R, Y, W
      • Amino acids with post-translational modifications (PTMs) are written in ProForma format with Unimod accession codes for PTMs:
        C[UNIMOD:4] for Cysteine Carbamidomethylation, M[UNIMOD:35] for Methionine Oxidation, etc.
      • N-terminus and C-terminus modifications, if supported by the algorithm, are also written in ProForma notation with Unimod accession codes:
        [UNIMOD:xx]-PEPTIDE-[UNIMOD:yy]
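
The output format above can be illustrated with a short Python sketch. It is not the repository's OutputMapperBase implementation. The helper names (`make_spectrum_id`, `to_proforma`, `write_predictions`), the algorithm-specific modification spellings (`C+57.021`, `M+15.995`), and the choice to store `aa_scores` as one comma-joined string are assumptions made for illustration. Whether `filename` keeps the `.mgf` extension is not stated here, so the sketch passes the name through unchanged.

```python
import csv
import io
import re

# Hypothetical algorithm-specific modification spellings mapped to the
# ProForma tokens named in the output format (an assumption for illustration).
PTM_ALIASES = {
    "C+57.021": "C[UNIMOD:4]",   # Cysteine Carbamidomethylation
    "M+15.995": "M[UNIMOD:35]",  # Methionine Oxidation
}

# One token is an amino acid letter optionally followed by a signed mass shift.
TOKEN_RE = re.compile(r"[A-Z](?:[+-]\d+(?:\.\d+)?)?")


def make_spectrum_id(filename, index):
    """Build the {filename}:{index} id; index is 0-based within the .mgf file."""
    return f"{filename}:{index}"


def to_proforma(sequence):
    """Rewrite known modification aliases as ProForma tokens; other tokens pass through."""
    tokens = TOKEN_RE.findall(sequence)
    return "".join(PTM_ALIASES.get(token, token) for token in tokens)


def write_predictions(rows, handle):
    """Write predictions as a comma-separated file with the required columns."""
    writer = csv.DictWriter(
        handle, fieldnames=["sequence", "score", "aa_scores", "spectrum_id"]
    )
    writer.writeheader()
    for row in rows:
        aa_scores = row.get("aa_scores") or []
        writer.writerow({
            "sequence": to_proforma(row["sequence"]),
            "score": row["score"],
            # Assumption: per-residue scores are stored as one comma-joined string.
            "aa_scores": ",".join(str(s) for s in aa_scores),
            "spectrum_id": make_spectrum_id(row["filename"], row["index"]),
        })


buffer = io.StringIO()
write_predictions(
    [{"sequence": "PEPC+57.021M+15.995K", "score": 0.93,
      "aa_scores": [0.9, 0.95, 0.9, 0.97, 0.92, 0.94],
      "filename": "151009_exo3_1.mgf", "index": 0}],
    buffer,
)
print(buffer.getvalue())
```

Because `aa_scores` contains commas in this sketch, the `csv` module quotes that field automatically, so the file still parses with sep=",".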

System requirements

Building containers and running the benchmark locally requires the following:

  • Operating System: Linux (required for Apptainer).

  • Dependencies:

    Make sure the Apptainer dependencies are installed.

    You may also need to install the following packages:

    sudo apt install squashfuse gocryptfs fuse-overlayfs  

    The benchmark was tested with Python 3.11 and Streamlit 1.33.
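
The requirements above can be checked before building containers. The following is a minimal sketch, not part of the repository: the function name `preflight` and its injectable parameters are hypothetical, and it only looks for the `apptainer` executable on PATH, the operating system, and the Python version the benchmark was tested with.

```python
import platform
import shutil
import sys


def preflight(system=None, which=shutil.which, version=None):
    """Return warnings about the local setup; an empty list means no issue found."""
    system = system or platform.system()
    version = version or sys.version_info[:2]
    warnings = []
    if system != "Linux":
        warnings.append(f"Apptainer requires Linux, found {system}")
    if which("apptainer") is None:
        warnings.append("apptainer not found on PATH")
    if tuple(version) != (3, 11):
        # A different version may still work; 3.11 is only the tested one.
        warnings.append(f"tested with Python 3.11, running {version[0]}.{version[1]}")
    return warnings


for message in preflight():
    print("warning:", message)
```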

We run the tools on a high-performance computing (HPC) system using the SUSE Linux Enterprise Server operating system, equipped with two Intel Xeon Gold 6526Y processors, 512 GB of RAM, and four NVIDIA L40S GPUs.

The current source code does not include containerized implementations for PEAKS and GraphNovo due to their integration within the PEAKS Studio 12 software, which is only available as a graphical user interface (GUI) tool compatible with the Windows operating system. These tools will be executed manually on a desktop computer running Windows 11, equipped with an Intel Core i9 processor and 128 GB of RAM.

Input data structure

The benchmark expects input data to follow a specific folder structure.

  • Each dataset is stored in a separate folder with a unique name.
  • Spectra are stored as .mgf files inside the mgf/ subfolder.
  • Ground truth labels (PSMs found via database search) are stored in a labels.csv file within each dataset folder.

Below is an example layout for our evaluation datasets stored on the HPC:

datasets/
    9_species_human/
        labels.csv
        mgf/
            151009_exo3_1.mgf
            151009_exo3_2.mgf
            151009_exo3_3.mgf
            ...
    9_species_solanum_lycopersicum/
        labels.csv
        mgf/...
    9_species_mus_musculus/
        labels.csv
        mgf/...
    9_species_methanosarcina_mazei/
        labels.csv
        mgf/...
    ...

Note that algorithm containers receive only the mgf/ subfolder with spectra files as input and do not have access to the labels.csv file. Only the evaluation container reads labels.csv to evaluate algorithm predictions.
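
As a rough illustration of the layout described above, the sketch below checks a dataset folder for labels.csv and mgf/, then lists the {filename}:{index} identifiers for one spectra file. The function names `check_dataset_dir` and `iter_spectrum_ids` are hypothetical and not taken from the repository. Spectra are counted by their `BEGIN IONS` lines, which is how MGF files delimit spectra, and the file name is used with its extension, which is an assumption.

```python
import tempfile
from pathlib import Path


def check_dataset_dir(dataset_dir):
    """Return a list of problems found in a dataset folder (empty if none)."""
    dataset_dir = Path(dataset_dir)
    problems = []
    if not (dataset_dir / "labels.csv").is_file():
        problems.append("missing labels.csv")
    mgf_dir = dataset_dir / "mgf"
    if not mgf_dir.is_dir():
        problems.append("missing mgf/ subfolder")
    elif not any(mgf_dir.glob("*.mgf")):
        problems.append("no .mgf files in mgf/")
    return problems


def iter_spectrum_ids(mgf_path):
    """Yield one {filename}:{index} id per spectrum, counting from 0."""
    mgf_path = Path(mgf_path)
    index = 0
    with mgf_path.open() as handle:
        for line in handle:
            # Each spectrum in an MGF file starts with a BEGIN IONS line.
            if line.strip() == "BEGIN IONS":
                yield f"{mgf_path.name}:{index}"
                index += 1


# Build a throwaway dataset folder to demonstrate both helpers.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "9_species_human"
    (dataset / "mgf").mkdir(parents=True)
    (dataset / "labels.csv").write_text("")  # placeholder; real columns not shown here
    spectrum = (
        "BEGIN IONS\nTITLE=demo\nRTINSECONDS=1.0\nPEPMASS=500.0\nCHARGE=2+\n"
        "100.0 1.0\nEND IONS\n"
    )
    mgf_file = dataset / "mgf" / "151009_exo3_1.mgf"
    mgf_file.write_text(spectrum * 2)
    problems = check_dataset_dir(dataset)
    spectrum_ids = list(iter_spectrum_ids(mgf_file))

print(problems)
print(spectrum_ids)
```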

We provide a simplified demo dataset in the sample_data/ directory for testing the benchmarking pipeline locally.

However, running the full benchmark, especially on larger spectra files, is not recommended on a local computer, as de novo prediction can be computationally intensive and time-consuming. Additionally, while some containerized tool versions support flexible switching between CPU and GPU devices, others strictly require GPU access and will fail to run if a compatible GPU is unavailable.

Running the benchmark

To run the benchmark locally:

  1. Clone the repository:

    git clone https://github.com/PominovaMS/denovo_benchmarks.git
    cd denovo_benchmarks
  2. Build containers for algorithms and evaluation: Make sure Apptainer is installed, then build all Apptainer images:

    chmod +x build_apptainer_images.sh
    ./build_apptainer_images.sh

    This builds the Apptainer images for all algorithms, as well as the evaluation image.

    If an image already exists, the script asks whether you want to rebuild it:

    A .sif image for casanovo already exists. Force rebuild? (y/N) 

    If a container is missing, that algorithm will be skipped during benchmarking. We don't share or store containers publicly yet due to ongoing development and their large size.

  3. Configure paths: Copy the .env.template file to .env. This file holds the environment variables the benchmark needs to run locally.

    Then update the file paths in .env to match the locations on your system.

  4. Run the benchmark on a dataset:

    ./run.sh /path/to/dataset/dir

    Example:

    ./run.sh sample_data/9_species_human
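
Since run.sh takes one dataset folder per call, running several datasets means calling it once per folder. The sketch below shows one way to do that from Python. It is an illustration only: `build_run_commands` and `run_all` are hypothetical helpers, and the script is assumed to be invoked from the repository root as `./run.sh`.

```python
import subprocess
import tempfile
from pathlib import Path


def build_run_commands(datasets_root, script="./run.sh"):
    """Return one [script, dataset_dir] command per dataset folder, sorted by name."""
    root = Path(datasets_root)
    return [[script, str(path)] for path in sorted(root.iterdir()) if path.is_dir()]


def run_all(datasets_root, dry_run=True):
    """Print each command; execute it from the current directory unless dry_run."""
    for command in build_run_commands(datasets_root):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first failing dataset.
            subprocess.run(command, check=True)


# Demonstrate on a throwaway folder with two empty dataset directories.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("9_species_human", "9_species_mus_musculus"):
        (Path(tmp) / name).mkdir()
    commands = build_run_commands(tmp)
    run_all(tmp)  # dry run: only prints the commands
```

Calling `run_all("/path/to/datasets", dry_run=False)` from the repository root would execute the runs for real.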

Running the Streamlit dashboard locally

To view the Streamlit dashboard for the benchmark locally, run:

# If Streamlit is not installed
pip install streamlit

streamlit run dashboard.py

The dashboard reads the benchmark results stored in the results/ folder.
