Sequence Clustering and Consensus Sequence Generation Tool

Overview

This repository provides pipeline scripts for clustering protein sequences, generating a consensus sequence for each cluster, and identifying the medoid sequence that best represents each cluster. The goal is to group sequences effectively and retrieve representative sequences for downstream simulations.

Features

FASTA Sequence Processing: Reads and merges sequences from input FASTA files.
Feature Extraction: Feature vectors are created using a combined approach:
1. Sequence K-Mers: Extract subsequences of length k from protein or DNA sequences to represent sequence patterns.
2. Pseudo Amino Acid Composition (PAAC): Encodes sequences based on physicochemical properties:
  - Hydrophobicity
  - Hydrophilicity
  - Mass
Clustering: Supports multiple clustering algorithms, including:
- K-means
- DBSCAN
- Hierarchical
- Birch
- Agglomerative
- OPTICS
- HDBSCAN
Visualization: Generates PCA-based plots for clustering results.
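
The k-mer part of the feature extraction described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `kmer_vector`, the 20-letter amino acid alphabet, and the choice to normalize counts to frequencies are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return k-mer frequencies for `sequence` in a fixed vocabulary order.

    Hypothetical helper: k-mers containing characters outside `alphabet`
    are ignored, and counts are normalized so the vector sums to 1.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# A 10-residue sequence yields nine overlapping 2-mers.
vector = kmer_vector("MKVLAAGMKV", k=2)
print(len(vector))
```

Because every sequence is mapped onto the same vocabulary, the resulting vectors have equal length and can be passed directly to a clustering algorithm.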

Representative Sequence Retrieval:
Most Representative Sequence: The sequence closest to the cluster medoid is identified as the most representative sequence.

Consensus Sequences: For each cluster, a multiple sequence alignment (MSA) of all member sequences is generated with CLUSTAL, and Biopython derives the consensus by selecting the most frequent residue at each aligned position.
Medoid Identification: Identifies and saves medoid sequences as representatives for each cluster.
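
To make the consensus step concrete, here is a small pure-Python sketch of threshold-based consensus calling on an alignment that already exists. It does not call CLUSTAL or Biopython, and it is not the tool's own code: the function name `threshold_consensus` and the decision to ignore gap characters when computing residue frequencies are assumptions. The defaults mirror the `--consensus_threshold` and `--ambiguous_char` values used elsewhere in this README.

```python
from collections import Counter


def threshold_consensus(aligned, threshold=0.7, ambiguous="N"):
    """Build a consensus string from equal-length aligned sequences.

    Hypothetical helper: at each column the most frequent non-gap residue
    is kept if its share of non-gap residues reaches `threshold`;
    otherwise `ambiguous` is emitted.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            # Column contains only gaps.
            consensus.append(ambiguous)
            continue
        residue, count = Counter(residues).most_common(1)[0]
        consensus.append(residue if count / len(residues) >= threshold else ambiguous)
    return "".join(consensus)


alignment = ["MKV-LA", "MKVGLA", "MRVGLS", "MKV-LA"]
print(threshold_consensus(alignment))                 # default threshold 0.7
print(threshold_consensus(alignment, threshold=0.8))  # stricter threshold
```

Raising the threshold turns columns with weaker agreement into the ambiguous character, which is the trade-off the `--consensus_threshold` parameter controls.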

Why Use This Tool?

To identify patterns and groupings in biological sequences.
To generate representative sequences for downstream analyses.
To visualize and understand sequence diversity within clusters.
To create high-quality consensus sequences for further studies.

Installation

Prerequisites

Python 3.8+
Required Python libraries:
- Biopython
- numpy
- pandas
- matplotlib
- scikit-learn
- hdbscan

Install dependencies using conda:

The environment_cluster.yml file lists all dependencies needed to run the script. Use it to create and activate a conda environment:

```
conda env create -f environment_cluster.yml
conda activate cluster_env
```

Install dependencies using pip:

```
pip install biopython numpy pandas matplotlib scikit-learn hdbscan
```

Building the Docker Image (Recommended)

To run the tool in an isolated environment, build and run the Docker image:

Build the Docker image:
```
docker build -t cluster_container .
```

Run the Docker container:

```
docker run --rm \
-v $(pwd)/data_file:/data \
-v $(pwd)/outdir:/test \
cluster_container \
-i /data/input_sequences.fa \
-o /test
```

Usage

1. Jupyter Notebook (Interactive Visualization)

Navigate to jupyter_notebook/.
Open a notebook to visualize and analyze smaller datasets interactively.

2. Python script Command-Line Interface

The main script, main_cluster_consensus.py, orchestrates the entire workflow. The -i and -o arguments are required; all other parameters are optional. Example usage:

```
python main_cluster_consensus.py \
    -i /path/to/input.fasta \
    -o /path/to/output_directory \
    --n_kmers 3 \
    --n_clusters 10 \
    --eps 0.05 \
    --consensus_threshold 0.7 \
    --ambiguous_char N
```

Input Parameters

| Argument | Description | Default Value |
| --- | --- | --- |
| `-i, --input_fasta` | Path to the input FASTA file. | None (Required) |
| `-o, --output_dir` | Directory to save results. | None (Required) |
| `--n_kmers` | Length of k-mers for feature extraction. | 3 |
| `--n_clusters` | Number of clusters for clustering algorithms. | 10 |
| `--eps` | Epsilon value for DBSCAN clustering. | 0.05 |
| `--consensus_threshold` | Threshold for consensus sequence generation. | 0.7 |
| `--ambiguous_char` | Ambiguous character for consensus sequence generation. | N |
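
For readers who want to see how the parameters above map onto a parser, the following is a sketch using Python's standard `argparse` module. The flag names and defaults are taken from the table; the function name `build_parser`, the help strings, and the argument types are assumptions, and the actual parser in main_cluster_consensus.py may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Cluster sequences and generate consensus sequences."
    )
    parser.add_argument("-i", "--input_fasta", required=True,
                        help="Path to the input FASTA file.")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="Directory to save results.")
    parser.add_argument("--n_kmers", type=int, default=3,
                        help="Length of k-mers for feature extraction.")
    parser.add_argument("--n_clusters", type=int, default=10,
                        help="Number of clusters for clustering algorithms.")
    parser.add_argument("--eps", type=float, default=0.05,
                        help="Epsilon value for DBSCAN clustering.")
    parser.add_argument("--consensus_threshold", type=float, default=0.7,
                        help="Threshold for consensus sequence generation.")
    parser.add_argument("--ambiguous_char", default="N",
                        help="Ambiguous character for consensus sequences.")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["-i", "input.fasta", "-o", "results", "--n_clusters", "5"]
)
print(args.n_clusters, args.eps, args.ambiguous_char)
```

Only the two required arguments must be supplied; every omitted option falls back to the default listed in the table.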

Workflow Steps

Merge FASTA Sequences: Merges input sequences into a single FASTA file.
Feature Extraction:
- K-mer counting
- Pseudo-amino acid composition
Sequence Clustering:
- Clusters sequences using the selected algorithms.
- Evaluates clustering performance using silhouette scores.
Visualization:
- Generates PCA-based scatter plots for each clustering method.
Consensus Sequence Generation:
- Performs multiple sequence alignments for each cluster.
- Generates consensus sequences based on a specified threshold.
Medoid Identification:
- Identifies the most representative sequence (medoid) for each cluster.
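
The medoid step at the end of the workflow can be illustrated with a short sketch. It assumes that each sequence has already been converted to a numeric feature vector and that Euclidean distance is used; both the distance metric and the helper name `medoid_index` are assumptions for the example, not a description of the repository's code. Here the medoid is taken to be the cluster member with the smallest total distance to all other members.

```python
import math


def medoid_index(vectors):
    """Return the index of the vector with the smallest summed distance to the rest.

    Hypothetical helper using Euclidean distance (math.dist, Python 3.8+).
    """
    if not vectors:
        raise ValueError("cluster is empty")
    totals = [sum(math.dist(v, other) for other in vectors) for v in vectors]
    return min(range(len(vectors)), key=totals.__getitem__)


# Toy cluster: three nearby points and one outlier.
cluster = {
    "seq_a": [0.0, 0.0],
    "seq_b": [1.0, 0.0],
    "seq_c": [0.9, 0.2],
    "seq_d": [5.0, 5.0],
}
names = list(cluster)
medoid_name = names[medoid_index(list(cluster.values()))]
print(medoid_name)
```

Unlike a centroid, the medoid is always an actual member of the cluster, so the representative written to the output is a real input sequence rather than an averaged one.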

Outputs

Plots: PCA scatter plots for each clustering method.
Clustered Sequences: FASTA files for each cluster.
Consensus Sequences: Consensus sequences for each cluster saved in FASTA format.
Medoid Sequences: Representative medoid sequences for each cluster saved in FASTA format.

Example Output Structure

output_directory/
├── clusters/
│   ├── K-means/
│   │   ├── cluster_0.fasta
│   │   ├── cluster_1.fasta
│   │   └── ...
│   └── DBSCAN/
├── plots/
│   ├── K-means_Clustering.png
│   ├── DBSCAN_Clustering.png
│   └── ...
├── consensus_sequences/
│   ├── K-means_consensus.fasta
│   ├── DBSCAN_consensus.fasta
│   └── ...
└── representative_sequences/
    ├── K-means_medoid_sequences.fasta
    ├── DBSCAN_medoid_sequences.fasta
    └── ...
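
As an illustration of how the `clusters/` part of this layout could be produced, the sketch below writes one FASTA file per cluster under `clusters/<method>/`. The function name `write_cluster_fastas` and the in-memory data format (cluster id mapped to a list of name and sequence pairs) are assumptions; only the directory and file naming pattern comes from the structure shown above.

```python
import tempfile
from pathlib import Path


def write_cluster_fastas(output_dir, method, clusters):
    """Write clusters/<method>/cluster_<id>.fasta files and return their paths.

    Hypothetical helper: `clusters` maps a cluster id to a list of
    (name, sequence) tuples.
    """
    method_dir = Path(output_dir) / "clusters" / method
    method_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for cluster_id, records in sorted(clusters.items()):
        path = method_dir / f"cluster_{cluster_id}.fasta"
        with path.open("w") as handle:
            for name, sequence in records:
                handle.write(f">{name}\n{sequence}\n")
        written.append(path)
    return written


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        0: [("seq_a", "MKVLA"), ("seq_b", "MKVLS")],
        1: [("seq_c", "GGHWY")],
    }
    paths = write_cluster_fastas(tmp, "K-means", demo)
    relative = [p.relative_to(tmp).as_posix() for p in paths]
    first_file_text = paths[0].read_text()

print(relative)
```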

Contributing

Contributions are welcome! If you encounter any issues or have feature requests, feel free to open an issue or submit a pull request.

Acknowledgments

The Pseudo Amino Acid Composition (PAAC) feature extraction step builds feature vectors from PAAC descriptors. The implementation is adapted from the work of Rakesh Busi. The original repository for PAAC implementation in clustering can be found here.

Reference

If you use this implementation, please cite the following article:

Busi, Rakesh, Machingal, Pranav, Hemachandra, Nandyala, & Balaji, Petety V.
How suitable are clustering methods for functional annotation of proteins?
bioRxiv, 2024.
Publisher: Cold Spring Harbor Laboratory.
DOI: 10.1101/2024.12.26.630370

This script was brought to life with invaluable support from Bruno Di Geronimo (@BruDiGe).

Contact

For questions or further information, please contact:

Aaryesh Deshpande
Bioinformatics and Computational Chemistry Researcher
Email: adeshpande334@gatech.edu
Georgia Institute of Technology
