This repository provides a pipeline of scripts for clustering protein sequences, generating a consensus sequence for each cluster, and identifying the medoid sequence that best represents each cluster. The goal is to analyze sequences, group them effectively, and retrieve representative sequences for further simulations.
- FASTA Sequence Processing: Reads and merges sequences from input FASTA files.
- Feature Extraction: Feature vectors are created using a combined approach:
  - Sequence K-Mers: Extract subsequences of length k from protein or DNA sequences to represent sequence patterns.
  - Pseudo Amino Acid Composition (PAAC): Encodes sequences based on physicochemical properties:
    - Hydrophobicity
    - Hydrophilicity
    - Mass
- Clustering: Supports multiple clustering algorithms, including:
  - K-means
  - DBSCAN
  - Hierarchical
  - Birch
  - Agglomerative
  - OPTICS
  - HDBSCAN
- Visualization: Generates PCA-based plots for clustering results.
- Representative Sequence Retrieval:
  - Most Representative Sequence: The sequence closest to the cluster medoid is identified as the most representative sequence.
  - Consensus Sequences: For each cluster, all sequences are aligned with CLUSTAL, and Biopython builds the consensus by selecting the most frequent residue at each position of the multiple sequence alignment.
- Consensus Sequence Generation: Creates consensus sequences from multiple sequence alignments (MSA) for each cluster.
- Medoid Identification: Identifies and saves medoid sequences as representatives for each cluster.
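To make the k-mer part of the feature extraction concrete, here is a minimal sketch of overlapping k-mer counting and a fixed-length frequency vector. It is not the repository's implementation: the names `kmer_counts`, `kmer_vector`, and `AMINO_ACIDS` are hypothetical, and restricting the vocabulary to the 20 standard amino acids is an assumption.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids form the k-mer vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count every overlapping k-mer in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return normalized k-mer frequencies in a fixed, alphabet-ordered layout."""
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # k-mers containing characters outside the alphabet are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]
```

Because every sequence maps to a vector of the same length (20^k entries for proteins), vectors from different sequences can be stacked into one feature matrix, optionally alongside PAAC features, before clustering.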
- To identify patterns and groupings in biological sequences.
- To generate representative sequences for downstream analyses.
- To visualize and understand sequence diversity within clusters.
- To create high-quality consensus sequences for further studies.
- Python 3.8+
- Required Python libraries:
  - Biopython
  - numpy
  - pandas
  - matplotlib
  - scikit-learn
  - hdbscan
Install dependencies using conda:
- Use the environment_cluster.yml file to create and activate a conda environment with all the dependencies needed to run the script:
conda env create -f environment_cluster.yml
conda activate cluster_env
Install dependencies using pip:
pip install biopython numpy pandas matplotlib scikit-learn hdbscan
To run the tool in an isolated environment, build and run the Docker image:
- Build the Docker image:
docker build -t cluster_container .
- Run the Docker container:
docker run --rm \
-v $(pwd)/data_file:/data \
-v $(pwd)/outdir:/test \
cluster_container \
-i /data/input_sequences.fa \
-o /test
- Navigate to jupyter_notebook/.
- Open a notebook to visualize and analyze smaller datasets interactively.
The main script, main_cluster_consensus.py, orchestrates the entire workflow. The -i/--input_fasta and -o/--output_dir arguments are required; all other parameters are optional. Below is an example of its usage:
python main_cluster_consensus.py \
-i /path/to/input.fasta \
-o /path/to/output_directory \
--n_kmers 3 \
--n_clusters 10 \
--eps 0.05 \
--consensus_threshold 0.7 \
--ambiguous_char N
| Argument | Description | Default Value |
|---|---|---|
| -i, --input_fasta | Path to the input FASTA file. | None (Required) |
| -o, --output_dir | Directory to save results. | None (Required) |
| --n_kmers | Length of k-mers for feature extraction. | 3 |
| --n_clusters | Number of clusters for clustering algorithms. | 10 |
| --eps | Epsilon value for DBSCAN clustering. | 0.05 |
| --consensus_threshold | Threshold for consensus sequence generation. | 0.7 |
| --ambiguous_char | Ambiguous character for consensus sequence generation. | N |
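For readers who want to see how the arguments in this table map onto a parser, the following is a hedged sketch using `argparse`. The flag names and defaults come from the table; the function name `build_parser`, the description string, and the argument types are assumptions and may differ from what main_cluster_consensus.py actually does.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Cluster sequences and derive consensus and medoid sequences."
    )
    parser.add_argument("-i", "--input_fasta", required=True,
                        help="Path to the input FASTA file.")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="Directory to save results.")
    parser.add_argument("--n_kmers", type=int, default=3,
                        help="Length of k-mers for feature extraction.")
    parser.add_argument("--n_clusters", type=int, default=10,
                        help="Number of clusters for clustering algorithms.")
    parser.add_argument("--eps", type=float, default=0.05,
                        help="Epsilon value for DBSCAN clustering.")
    parser.add_argument("--consensus_threshold", type=float, default=0.7,
                        help="Threshold for consensus sequence generation.")
    parser.add_argument("--ambiguous_char", default="N",
                        help="Ambiguous character for consensus sequence generation.")
    return parser
```

Calling `build_parser().parse_args(["-i", "in.fa", "-o", "out"])` yields a namespace in which every optional argument holds its documented default.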
- Merge FASTA Sequences: Merges input sequences into a single FASTA file.
- Feature Extraction:
  - K-mer counting
  - Pseudo-amino acid composition
- Sequence Clustering:
  - Clusters sequences using the selected algorithms.
  - Evaluates clustering performance using silhouette scores.
- Visualization:
  - Generates PCA-based scatter plots for each clustering method.
- Consensus Sequence Generation:
  - Performs multiple sequence alignments for each cluster.
  - Generates consensus sequences based on a specified threshold.
- Medoid Identification:
  - Identifies the most representative sequence (medoid) for each cluster.
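The consensus step in the workflow above depends on two parameters, --consensus_threshold and --ambiguous_char. The sketch below illustrates how a threshold-based consensus can be computed from sequences that are already aligned. It is a plain-Python illustration, not the pipeline's CLUSTAL and Biopython code: the function name `threshold_consensus` is hypothetical, and ignoring gap characters (`-`) and treating ties as ambiguous are assumptions.

```python
from collections import Counter


def threshold_consensus(aligned, threshold=0.7, ambiguous="N"):
    """Column-wise consensus of equal-length, pre-aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("All aligned sequences must have the same length.")

    consensus = []
    for column in zip(*aligned):
        # Assumption: gaps do not count toward the residue frequencies.
        residues = [r for r in column if r != "-"]
        if not residues:
            consensus.append(ambiguous)
            continue
        ranked = Counter(residues).most_common(2)
        top_residue, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        # Keep the residue only if it is the unique winner and frequent enough.
        if not tied and top_count / len(residues) >= threshold:
            consensus.append(top_residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)
```

A higher threshold produces more ambiguous positions but a consensus that reflects only well-conserved columns; a lower threshold does the opposite.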
- Plots: PCA scatter plots for each clustering method.
- Clustered Sequences: FASTA files for each cluster.
- Consensus Sequences: Consensus sequences for each cluster saved in FASTA format.
- Medoid Sequences: Representative medoid sequences for each cluster saved in FASTA format.
output_directory/
├── clusters/
│   ├── K-means/
│   │   ├── cluster_0.fasta
│   │   ├── cluster_1.fasta
│   │   └── ...
│   └── DBSCAN/
├── plots/
│   ├── K-means_Clustering.png
│   ├── DBSCAN_Clustering.png
│   └── ...
├── consensus_sequences/
│   ├── K-means_consensus.fasta
│   ├── DBSCAN_consensus.fasta
│   └── ...
└── representative_sequences/
    ├── K-means_medoid_sequences.fasta
    ├── DBSCAN_medoid_sequences.fasta
    └── ...
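The files under representative_sequences/ hold one medoid per cluster. As a rough illustration of what a medoid is, the sketch below picks the cluster member whose total distance to all other members is smallest. The function name `medoid_index` is hypothetical, and the use of Euclidean distance on feature vectors is an assumption; the pipeline may use a different metric or selection rule.

```python
import math


def medoid_index(vectors):
    """Index of the vector with the smallest total distance to all others."""
    if not vectors:
        raise ValueError("Cannot compute the medoid of an empty cluster.")
    # math.dist (Python 3.8+) returns the Euclidean distance between two points.
    totals = [sum(math.dist(v, other) for other in vectors) for v in vectors]
    return min(range(len(vectors)), key=totals.__getitem__)
```

Unlike a centroid, the medoid is always an actual member of the cluster, so the returned index can be used directly to look up the corresponding sequence record.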
Contributions are welcome! If you encounter any issues or have feature requests, feel free to open an issue or submit a pull request.
The Pseudo Amino Acid Composition (PAAC) feature extraction step builds feature vectors based on PAAC. The implementation is adapted from the work of Rakesh Busi. The original repository for PAAC implementation in clustering can be found here.
Reference: If you use this implementation, please cite the following article:
Busi, Rakesh, Machingal, Pranav, Hemachandra, Nandyala, & Balaji, Petety V.
How suitable are clustering methods for functional annotation of proteins?
bioRxiv, 2024.
Publisher: Cold Spring Harbor Laboratory.
DOI: 10.1101/2024.12.26.630370
This script was brought to life with invaluable support from Bruno Di Geronimo (@BruDiGe).
For questions or further information, please contact:
Aaryesh Deshpande
Bioinformatics and Computational Chemistry Researcher
Email: adeshpande334@gatech.edu
Georgia Institute of Technology