A Python tool for cleaning and analyzing FASTA sequence alignments (e.g. from https://github.com/cmayer/MitoGeneExtractor) using multiple filtering approaches. This tool is designed to help researchers identify and remove problematic sequences from their alignments while maintaining data quality and integrity.
The FASTA Sequence Cleaner implements a filtering pipeline with the following stages:
- Human COX1 Contamination Detection
  - Identifies sequences with high similarity to human COX1
  - Configurable similarity threshold
  - Uses efficient local alignment for comparison
  - Helps prevent contamination from human DNA
- AT Content Analysis
  - Compares AT content between sequences and consensus
  - Identifies sequences with divergent nucleotide composition
  - Supports multiple filtering modes (absolute, higher, lower)
  - Customizable difference threshold
  - Only considers overlapping regions for comparison
- Statistical Outlier Detection
  - Uses both weighted and unweighted deviation scores
  - Position-specific residue frequency analysis
  - Conservation-weighted sequence comparison
  - Adjustable percentile threshold for outlier detection
  - Robust handling of gaps and missing data
- Reference Sequence Comparison
  - Optional comparison against known reference sequences
  - Supports multiple reference sequence files
  - Additional metrics for reference-based filtering
  - Weighted deviation scoring based on conservation
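The human COX1 check listed above is described as using local alignment. The following is a minimal sketch of that idea, not the tool's own code: a plain Smith-Waterman implementation in which the function name `local_identity`, the scoring values, and the definition of similarity (identical columns divided by all columns of the best local alignment) are assumptions made for illustration.

```python
from typing import List, Tuple


def local_identity(query: str, target: str,
                   match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, float]:
    """Return (best local score, identity over the best local alignment).

    Plain Smith-Waterman with linear gap penalties (illustrative only;
    the scoring values are assumptions, not the tool's settings).
    """
    rows, cols = len(query) + 1, len(target) + 1
    score: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; cells never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            cell = max(0,
                       score[i - 1][j - 1] + sub,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    identical = columns = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            identical += query[i - 1] == target[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1

    return best, (identical / columns if columns else 0.0)
```

Under these assumptions, a sequence could be flagged when its identity against a human COX1 reference reaches the value given to `--human_threshold` (0.95 by default). Alignment gap characters would need to be stripped from the query before comparison.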
- Python 3.6 or higher
- BioPython
- NumPy
- typing (for type hints; part of the standard library since Python 3.5, so no separate install is needed)
pip install biopython numpy
- Clone this repository:
git clone https://github.com/bge-barcoding/fasta-cleaner.git
cd fasta-cleaner
- Install dependencies:
pip install biopython numpy
python fasta_cleaner_combined.py -i input_dir -o output_dir
python fasta_cleaner_combined.py \
-i input_dir \
-o output_dir \
-r reference_dir \
--human_threshold 0.95 \
--at_difference 0.1 \
--at_mode absolute \
--percentile_threshold 90.0 \
--consensus_threshold 0.5
Argument | Description | Default |
---|---|---|
`-i`, `--input_dir` | Directory containing input FASTA files | Required |
`-o`, `--output_dir` | Output directory for processed files | Required |
`-r`, `--reference_dir` | Directory containing reference sequences | Optional |
`-u`, `--human_threshold` | Human COX1 similarity threshold (0-1) | 0.95 |
`-d`, `--at_difference` | Maximum allowed AT content difference | 0.1 |
`-m`, `--at_mode` | AT content filtering mode (absolute/higher/lower) | absolute |
`-p`, `--percentile_threshold` | Percentile for outlier detection (0-100) | 90.0 |
`-c`, `--consensus_threshold` | Consensus sequence generation threshold | 0.5 |
The tool supports three modes for AT content filtering:
- `absolute`: Removes sequences whose AT content differs from the consensus by more than the threshold in either direction
- `higher`: Removes only sequences with AT content above consensus + threshold (i.e. AT is too high)
- `lower`: Removes only sequences with AT content below consensus - threshold (i.e. AT is too low)
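The three modes can be expressed as a small decision function. This is a sketch under stated assumptions rather than the tool's implementation: the helper names `at_over_overlap` and `fails_at_filter` are hypothetical, and "overlapping regions" is taken here to mean alignment columns where both the sequence and the consensus carry an unambiguous base.

```python
from typing import Tuple


def at_over_overlap(seq: str, consensus: str) -> Tuple[float, float]:
    """AT fraction of sequence and consensus over their shared columns.

    Columns where either side is a gap or an ambiguous base are skipped
    (an assumption about what counts as overlap).
    """
    pairs = [(s, c) for s, c in zip(seq.upper(), consensus.upper())
             if s in "ACGT" and c in "ACGT"]
    if not pairs:
        return 0.0, 0.0
    seq_at = sum(s in "AT" for s, _ in pairs) / len(pairs)
    cons_at = sum(c in "AT" for _, c in pairs) / len(pairs)
    return seq_at, cons_at


def fails_at_filter(seq_at: float, consensus_at: float,
                    threshold: float = 0.1, mode: str = "absolute") -> bool:
    """Return True if the sequence would be removed under the given mode."""
    diff = seq_at - consensus_at
    if mode == "absolute":
        return abs(diff) > threshold   # too far in either direction
    if mode == "higher":
        return diff > threshold        # AT content too high
    if mode == "lower":
        return diff < -threshold       # AT content too low
    raise ValueError(f"unknown AT mode: {mode!r}")
```

For example, a sequence covering only an AT-rich stretch of the alignment is compared against the consensus over that same stretch, so it is not penalized for the composition of regions it does not span.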
Flag | Description |
---|---|
`--disable_human` | Disable human COX1 similarity filtering |
`--disable_at` | Disable AT content difference filtering |
`--disable_outliers` | Disable statistical outlier detection |
For each input FASTA file, the tool generates:
- `*_cleaned.fasta`: Sequences that passed all filters, ordered by start position
- `*_removed_all.fasta`: All removed sequences combined into one file
- `*_removed_human.fasta`: Sequences removed due to human similarity
- `*_removed_at.fasta`: Sequences removed due to AT content
- `*_removed_outlier.fasta`: Sequences removed as statistical outliers
- `*_removed_reference.fasta`: Sequences removed as reference-based outliers
- `*_consensus.fasta`: Final consensus sequence
- `*_metrics.csv`: Detailed metrics for all sequences
- `*_log.txt`: Processing log with parameters and statistics
- `*_ordered_annotated.fasta`: All original sequences with fate annotations, ordered by start position
The tool calculates comprehensive metrics for each sequence:
- Sequence length and composition
- AT content and deviation from consensus
- Human COX1 similarity scores using local alignment
- Position-specific conservation scores
- Weighted and unweighted deviation measures
- Conservation-based statistical scores
- Reference-based metrics (if enabled)
- Gap handling and position-specific frequencies
All metrics are saved in the CSV report for further analysis.
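To make the deviation measures above more concrete, here is one way such scores could be computed. It is a hedged sketch, not the tool's actual formulas: the function names, the definition of deviation (one minus the column frequency of the sequence's residue), the use of the top column frequency as the conservation weight, and the treatment of `-` and `N` as missing data are all assumptions. Sequences are assumed to be upper-case and of equal length.

```python
from collections import Counter
from typing import Dict, List, Tuple

MISSING = "-N"  # characters treated as gaps / missing data (assumption)


def column_frequencies(seqs: List[str]) -> List[Dict[str, float]]:
    """Residue frequencies per alignment column, ignoring missing data."""
    freqs = []
    for column in zip(*seqs):
        counts = Counter(c for c in column if c not in MISSING)
        total = sum(counts.values())
        freqs.append({r: n / total for r, n in counts.items()} if total else {})
    return freqs


def deviation_scores(seq: str, freqs: List[Dict[str, float]]) -> Tuple[float, float]:
    """Return (unweighted, conservation-weighted) mean deviation."""
    unweighted = weighted = weight_sum = 0.0
    covered = 0
    for res, col in zip(seq, freqs):
        if res in MISSING or not col:
            continue  # skip gaps and columns with no data
        conservation = max(col.values())
        deviation = 1.0 - col.get(res, 0.0)
        unweighted += deviation
        weighted += deviation * conservation
        weight_sum += conservation
        covered += 1
    if covered == 0:
        return 0.0, 0.0
    return unweighted / covered, weighted / weight_sum


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
```

With these helpers, a percentile-based filter could flag every sequence whose score lies above `percentile(all_scores, 90.0)`, mirroring the default of `--percentile_threshold`. Whether the tool applies the cutoff to the weighted score, the unweighted score, or both is not specified here.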
The filtering pipeline processes sequences in this specific order:
- Remove sequences with high human COX1 similarity
- Filter sequences with divergent AT content
- Remove statistical outliers
- Compare against reference sequences (if provided)
After each filtering step:
- A new consensus sequence is generated from remaining sequences
- New position-specific frequencies are calculated
- New metrics are computed for all remaining sequences
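The consensus regeneration step can be illustrated with a simple majority-rule consensus. This is a sketch under assumptions, not the tool's code: the function name `consensus`, the interpretation of `--consensus_threshold` as the minimum fraction of non-gap residues that must agree, and the use of `N` for unresolved columns are all hypothetical.

```python
from collections import Counter
from typing import List


def consensus(seqs: List[str], threshold: float = 0.5,
              ambiguous: str = "N") -> str:
    """Majority-rule consensus over equal-length aligned sequences.

    A column yields its most common residue if that residue makes up at
    least `threshold` of the non-gap characters, otherwise `ambiguous`.
    Columns containing only gaps stay as gaps.
    """
    out = []
    for column in zip(*seqs):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        if total == 0:
            out.append("-")
            continue
        residue, n = counts.most_common(1)[0]
        out.append(residue if n / total >= threshold else ambiguous)
    return "".join(out)
```

After a filtering step removes sequences, calling `consensus(remaining)` on the survivors gives the updated consensus against which the next step's metrics would be computed.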
# Process a directory of FASTA files with custom thresholds and AT mode lower
python fasta_cleaner_combined.py \
-i /path/to/fasta/files \
-o /path/to/output \
-r /path/to/references \
--human_threshold 0.90 \
--at_difference 0.15 \
--at_mode lower \
--percentile_threshold 95.0
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research, please cite:
@software{fasta_cleaner,
  author = {Ben Price AND Daniel Parsons AND Jordan Beasley AND Claude Sonnet},
  title = {FASTA Sequence Cleaner},
  version = {1.0.0},
  year = {2024},
  url = {https://github.com/bge-barcoding/fasta-cleaner},
  note = {Implements multiple sequence filtering approaches with position-specific analysis}
}
- Uses BioPython for sequence analysis
- Implements methods inspired by various sequence quality control approaches
- Developed to address common contamination and quality issues in sequence data
For bugs, feature requests, or questions, please open an issue on GitHub.