
Scripts for clustering protein sequences, generating a consensus sequence for each cluster, and identifying the medoid sequence that best represents each cluster.


Aaryesh-AD/Sequence-cluster-consensus


Sequence Clustering and Consensus Sequence Generation Tool

Overview

This repository provides a pipeline of scripts for clustering protein sequences, generating a consensus sequence that summarizes each cluster, and identifying the medoid sequence that best represents each cluster. The goal is to analyze sequences, group them effectively, and retrieve representative sequences for further simulations.

Features

  1. FASTA Sequence Processing: Reads and merges sequences from input FASTA files.

  2. Feature Extraction: Feature vectors are created using a combined approach:

    1. Sequence K-Mers: Extract subsequences of length k from protein or DNA sequences to represent sequence patterns.
    2. Pseudo Amino Acid Composition (PAAC): Encodes sequences based on physicochemical properties:
      • Hydrophobicity
      • Hydrophilicity
      • Mass
  3. Clustering: Supports multiple clustering algorithms, including:

    • K-means
    • DBSCAN
    • Hierarchical
    • Birch
    • Agglomerative
    • OPTICS
    • HDBSCAN
  4. Visualization: Generates PCA-based plots for clustering results.

  5. Representative Sequence Retrieval:

    • Most Representative Sequence: The sequence closest to the cluster medoid is identified as the most representative sequence of the cluster.
    • Consensus Sequences: For each cluster, all member sequences are aligned with CLUSTAL, and Biopython builds a consensus from the resulting multiple sequence alignment (MSA) by selecting the most frequent residue at each position.

  6. Consensus Sequence Generation: Creates consensus sequences from the MSA of each cluster.

  7. Medoid Identification: Identifies and saves medoid sequences as representatives for each cluster.
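
The k-mer part of the feature extraction can be illustrated with a minimal sketch. It is not the repository's implementation: the function names `kmer_vocabulary` and `kmer_vector`, the 20-letter amino acid alphabet, and the choice to normalize counts into frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vocabulary(k, alphabet=AMINO_ACIDS):
    """All possible k-mers over the alphabet, in a fixed order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return k-mer frequencies of a sequence as a fixed-length list.

    K-mers containing characters outside the alphabet are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = kmer_vocabulary(k, alphabet)
    total = sum(counts[kmer] for kmer in vocabulary)
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


vector = kmer_vector("MKVLAAGMKV", k=2)
print(len(vector))  # 400, one entry per possible 2-mer
```

Because every sequence maps to a vector of the same length (20^k entries), the vectors can be stacked into a matrix and passed to any of the clustering algorithms listed above.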

Why Use This Tool?

  • To identify patterns and groupings in biological sequences.
  • To generate representative sequences for downstream analyses.
  • To visualize and understand sequence diversity within clusters.
  • To create high-quality consensus sequences for further studies.

Installation

Prerequisites

  • Python 3.8+
  • Required Python libraries:
    • Biopython
    • numpy
    • pandas
    • matplotlib
    • scikit-learn
    • hdbscan

Install dependencies using conda:

  • To install all dependencies, create a conda environment from the environment_cluster.yml file and activate it:
conda env create -f environment_cluster.yml
conda activate cluster_env

Install dependencies using pip:

pip install biopython numpy pandas matplotlib scikit-learn hdbscan

Building the Docker Image (Recommended)

To run the tool in an isolated environment, build and run the Docker image:

  1. Build the Docker image:

    docker build -t cluster_container .
  2. Run the Docker container:

    docker run --rm \
    -v $(pwd)/data_file:/data \
    -v $(pwd)/outdir:/test \
    cluster_container \
    -i /data/input_sequences.fa \
    -o /test

Usage

1. Jupyter Notebook (Interactive Visualization)

  • Navigate to jupyter_notebook/.
  • Open a notebook to visualize and analyze smaller datasets interactively.

2. Python script Command-Line Interface

The main script, main_cluster_consensus.py, orchestrates the entire workflow. The -i (--input_fasta) and -o (--output_dir) arguments are required; all other parameters are optional. Example usage:

python main_cluster_consensus.py \
    -i /path/to/input.fasta \
    -o /path/to/output_directory \
    --n_kmers 3 \
    --n_clusters 10 \
    --eps 0.05 \
    --consensus_threshold 0.7 \
    --ambiguous_char N

Input Parameters

| Argument | Description | Default Value |
|---|---|---|
| -i, --input_fasta | Path to the input FASTA file. | None (Required) |
| -o, --output_dir | Directory to save results. | None (Required) |
| --n_kmers | Length of k-mers for feature extraction. | 3 |
| --n_clusters | Number of clusters for clustering algorithms. | 10 |
| --eps | Epsilon value for DBSCAN clustering. | 0.05 |
| --consensus_threshold | Threshold for consensus sequence generation. | 0.7 |
| --ambiguous_char | Ambiguous character for consensus sequence generation. | N |
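
For readers who want to see how these arguments and defaults might map onto `argparse`, here is a hypothetical reconstruction based only on the parameter list above. The function name `build_parser`, the argument types, and the help texts are assumptions; the actual parser in main_cluster_consensus.py may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Cluster sequences and derive consensus and medoid sequences."
    )
    parser.add_argument("-i", "--input_fasta", required=True,
                        help="Path to the input FASTA file.")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="Directory to save results.")
    parser.add_argument("--n_kmers", type=int, default=3,
                        help="Length of k-mers for feature extraction.")
    parser.add_argument("--n_clusters", type=int, default=10,
                        help="Number of clusters for clustering algorithms.")
    parser.add_argument("--eps", type=float, default=0.05,
                        help="Epsilon value for DBSCAN clustering.")
    parser.add_argument("--consensus_threshold", type=float, default=0.7,
                        help="Threshold for consensus sequence generation.")
    parser.add_argument("--ambiguous_char", default="N",
                        help="Ambiguous character for consensus generation.")
    return parser


# Only the two required arguments are given; everything else uses its default.
args = build_parser().parse_args(["-i", "input.fasta", "-o", "results"])
print(args.n_kmers, args.n_clusters, args.eps)
```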

Workflow Steps

  1. Merge FASTA Sequences: Merges input sequences into a single FASTA file.
  2. Feature Extraction:
    • K-mer counting
    • Pseudo-amino acid composition
  3. Sequence Clustering:
    • Clusters sequences using the selected algorithms.
    • Evaluates clustering performance using silhouette scores.
  4. Visualization:
    • Generates PCA-based scatter plots for each clustering method.
  5. Consensus Sequence Generation:
    • Performs multiple sequence alignments for each cluster.
    • Generates consensus sequences based on a specified threshold.
  6. Medoid Identification:
    • Identifies the most representative sequence (medoid) for each cluster.
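
The medoid step can be sketched as follows, assuming each sequence has already been converted to a numeric feature vector and assigned a cluster label. The function names and the use of Euclidean distance are illustrative assumptions, not necessarily what the pipeline does internally.

```python
import math


def medoid_index(vectors):
    """Index of the vector with the smallest total distance to all others."""
    totals = [sum(math.dist(v, w) for w in vectors) for v in vectors]
    return min(range(len(vectors)), key=totals.__getitem__)


def cluster_medoids(vectors, labels):
    """Map each cluster label to the index of its medoid in `vectors`.

    Note: labels are not special-cased, so a noise label such as -1
    would be treated like any other cluster in this sketch.
    """
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)
    return {
        label: indices[medoid_index([vectors[i] for i in indices])]
        for label, indices in members.items()
    }


features = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
labels = [0, 0, 0, 1, 1]
medoids = cluster_medoids(features, labels)
print(medoids)  # {0: 1, 1: 3}
```

The returned indices point back into the original sequence list, so the corresponding records can be written out as the representative sequences. Ties, as in the two-member cluster here, resolve to the first candidate.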

Outputs

  1. Plots: PCA scatter plots for each clustering method.
  2. Clustered Sequences: FASTA files for each cluster.
  3. Consensus Sequences: Consensus sequences for each cluster saved in FASTA format.
  4. Medoid Sequences: Representative medoid sequences for each cluster saved in FASTA format.
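
To make the roles of --consensus_threshold and --ambiguous_char concrete, the sketch below derives a thresholded consensus from sequences that are already aligned. It is a simplified pure-Python illustration, not the Biopython routine the pipeline uses; the function name and the handling of gap-only columns (skipped here) are assumptions.

```python
from collections import Counter


def threshold_consensus(aligned, threshold=0.7, ambiguous="N"):
    """Build a consensus from equal-length aligned sequences.

    For each column, gaps ('-') are ignored. The most frequent residue is
    used if its share of the non-gap residues reaches the threshold;
    otherwise the ambiguous character is emitted.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    consensus = []
    for column in zip(*aligned):
        residues = [char for char in column if char != "-"]
        if not residues:
            continue  # assumption: columns containing only gaps are dropped
        residue, count = Counter(residues).most_common(1)[0]
        consensus.append(residue if count / len(residues) >= threshold else ambiguous)
    return "".join(consensus)


alignment = ["MKV-LA", "MKVQLA", "MRV-LG"]
print(threshold_consensus(alignment, threshold=0.7))  # MNVQLN
print(threshold_consensus(alignment, threshold=0.6))  # MKVQLA
```

Lowering the threshold resolves more positions to a concrete residue, while raising it yields more ambiguous characters in variable columns.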

Example Output Structure

output_directory/
├── clusters/
│   ├── K-means/
│   │   ├── cluster_0.fasta
│   │   ├── cluster_1.fasta
│   │   └── ...
│   └── DBSCAN/
├── plots/
│   ├── K-means_Clustering.png
│   ├── DBSCAN_Clustering.png
│   └── ...
├── consensus_sequences/
│   ├── K-means_consensus.fasta
│   ├── DBSCAN_consensus.fasta
│   └── ...
└── representative_sequences/
    ├── K-means_medoid_sequences.fasta
    ├── DBSCAN_medoid_sequences.fasta
    └── ...

Contributing

Contributions are welcome! If you encounter any issues or have feature requests, feel free to open an issue or submit a pull request.


Acknowledgments

The Pseudo Amino Acid Composition (PAAC) feature extraction step builds feature vectors from the PAAC of each sequence. Its implementation is adapted from the work of Rakesh Busi, whose original repository provides the PAAC implementation used for clustering.

Reference: If you use this implementation, please cite the following article:

Busi, Rakesh, Machingal, Pranav, Hemachandra, Nandyala, & Balaji, Petety V.
How suitable are clustering methods for functional annotation of proteins?
bioRxiv, 2024.
Publisher: Cold Spring Harbor Laboratory.
DOI: 10.1101/2024.12.26.630370

This script was brought to life with invaluable support from Bruno Di Geronimo (@BruDiGe).


Contact

For questions or further information, please contact:

Aaryesh Deshpande
Bioinformatics and Computational Chemistry Researcher
Email: adeshpande334@gatech.edu
Georgia Institute of Technology
