AltaiR

AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data.

AltaiR provides alignment-free and temporal analysis of multi-FASTA data. It is a flexible C toolkit designed for large-scale data, namely extensive collections of genomes or proteomes, and is well suited to scenarios involving many sequences from epidemic and pandemic events. AltaiR is implemented in C with multi-threading to increase computational speed, is flexible enough for multiple applications, and has no external dependencies. The tool accepts any sequence(s) in (multi-)FASTA format.

The AltaiR toolkit has one main menu (command: AltaiR) with six sub-menus, one for each feature it provides:

  • average: applies a moving average filter to one column of floating-point values in a CSV file (the column to use is a parameter);
  • filter: filters FASTA reads by characteristics: alphabet, completeness, length, CG quantity, multiple string patterns and pattern absence;
  • frequency: computes the alphabet frequencies for each FASTA read (it enables alphabet filtering);
  • nc: computes the Normalized Compression (NC) for all FASTA reads according to a compression level;
  • ncd: computes the Normalized Compression Distance (NCD) for all FASTA reads according to a reference;
  • raw: computes Relative Absent Words (RAWs) with CG quantity estimation for all RAWs.
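
To make the NC and NCD measures above concrete, here is a minimal Python sketch of the usual textbook definitions. It is not AltaiR's implementation: it assumes zlib as a stand-in compressor (AltaiR uses its own compression models), and the function names and toy sequences are made up for illustration.

```python
import math
import random
import zlib


def compressed_bytes(seq: str) -> int:
    """Size in bytes of seq after zlib compression (stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def nc(seq: str, alphabet_size: int = 4) -> float:
    """Compressed size in bits divided by len(seq) * log2(alphabet_size)."""
    return 8 * compressed_bytes(seq) / (len(seq) * math.log2(alphabet_size))


def ncd(x: str, y: str) -> float:
    """(C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with C = compressed size."""
    cx, cy, cxy = compressed_bytes(x), compressed_bytes(y), compressed_bytes(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data (made up): a random sequence, a copy with at most one
# substitution, and an unrelated random sequence.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000))
b = a[:1000] + "T" + a[1001:]
c = "".join(rng.choice("ACGT") for _ in range(2000))

print(f"NC(a)     = {nc(a):.3f}")
print(f"NCD(a, b) = {ncd(a, b):.3f}")  # near copy: expected to be small
print(f"NCD(a, c) = {ncd(a, c):.3f}")  # unrelated: expected to be larger
```

A general-purpose compressor such as zlib is a poor model of DNA, so the absolute values differ from what a specialized tool reports; only the relative ordering is meaningful in this sketch.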

INSTALLATION

Conda

First, install Miniconda if you haven't already. Then, to create a new Conda environment named altair and install altair-mf using the Conda Forge and Bioconda channels, run the following command (it uses mamba; replace mamba with conda if mamba is not installed):

mamba create -n altair -c conda-forge -c bioconda altair-mf

To simply install altair-mf in an existing environment:

conda install -y -c bioconda altair-mf

Otherwise, AltaiR can be built manually, which requires CMake. You can download CMake directly from http://www.cmake.org/cmake/resources/software.html or use an appropriate package manager. The commands below install the build tools, download the source, and compile AltaiR:

sudo apt-get install cmake git
git clone https://github.com/cobilab/altair.git
cd altair/src/
cmake .
make

Additional Tools

For certain scripts, the Gto toolkit is required, installable via Conda:

conda install -c cobilab gto --yes

Or manually:

git clone https://github.com/cobilab/gto.git
cd gto/src/
make
export PATH="$HOME/gto/bin:$PATH"

PARAMETERS

To see the possible options type

AltaiR

or

AltaiR -h

To view the options of a specific sub-program, type one of the following:

AltaiR average -h
AltaiR filter -h
AltaiR frequency -h
AltaiR nc -h
AltaiR ncd -h
AltaiR raw -h

Reproducing Experiments

The commands below assume that AltaiR has been compiled under the src/ folder and that you are in the pipeline/ folder. First, copy the binary into the current folder:

cp ../src/AltaiR .

Filtering Sequences

To filter sequences, use the following commands:

python3 Histogram.py
bash Filter.sh 29885 29921
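
The internals of Filter.sh are not shown here. As a rough illustration of length-based FASTA filtering, the Python sketch below assumes the two numeric arguments are a minimum and maximum sequence length; that reading, the function names, and the toy records are assumptions, not taken from the script.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a (multi-)FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_len: int, max_len: int
) -> List[Tuple[str, str]]:
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy input (made up): three records of length 6, 3, and 10.
toy = ">s1\nACGT\nAC\n>s2\nACG\n>s3\nACGTACGTAC\n"
kept = filter_by_length(read_fasta(toy.splitlines()), 4, 8)
for header, seq in kept:
    print(f">{header}\n{seq}")
```

For a real file, the same functions can be fed an open file handle in place of `toy.splitlines()`.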

Similarity Profiles (NCD)

To simulate and measure similarity profiles:

bash Simulation.sh
bash Similarity.sh ORIGINAL.fa
bash SimProfile.sh sim-data.csv 2 0 1.2
mv NCDProfilesim-data.csv.pdf NCD_P1.pdf

Phylogenetic Tree Construction

Use the tree.py script to construct a phylogenetic tree from NCD values:

python3 tree.py sim-data.csv -N 50

Complexity Profiles (NC)

Run the following commands to generate complexity profiles:

bash ComplexitySars.sh
python3 CompProfileSars.py comp-data.csv sorted_output.fa 0.961 0.9617
mv NCProfilecomp-data.csv.pdf NC.pdf
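
A profile with one value per sequence can be noisy when plotted in order. The average sub-program listed among AltaiR's features applies a moving average to a CSV column; whether the scripts above use it is not stated here. The general idea can be sketched in Python as follows, assuming a trailing window (the window type and the function name are assumptions):

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; the first window-1 outputs use a shorter window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


# Toy column of values (made up).
smoothed = moving_average([1, 2, 3, 4, 5], 2)
print(smoothed)
```

A centered window is an equally common choice; it shifts the smoothed curve so that peaks stay aligned with the original positions.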

Frequency Profiles

Generate frequency profiles using the following commands:

bash FrequencySars.sh
python3 combine_freq_and_date.py
mv base_frequencies_plot.pdf Freq.pdf
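
The quantity plotted in a frequency profile is the relative frequency of each base in each sequence. A minimal Python sketch of that computation follows. It assumes symbols outside the chosen alphabet are simply ignored, and it is independent of how FrequencySars.sh or AltaiR compute it; the function name and the toy sequence are hypothetical.

```python
from collections import Counter
from typing import Dict


def base_frequencies(seq: str, alphabet: str = "ACGT") -> Dict[str, float]:
    """Relative frequency of each alphabet symbol; other symbols are ignored."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in alphabet)
    return {b: (counts[b] / total if total else 0.0) for b in alphabet}


# Toy sequence (made up); the N is outside the alphabet and is not counted.
freqs = base_frequencies("AACGTN")
for base, value in freqs.items():
    print(base, round(value, 3))
```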

Relative Singularity (RAWs) Profiles

To calculate RAWs profiles:

bash RawSars.sh
python3 RawSarsProfile.py sorted_output.fa
mv relativeSingularityProfile.pdf RAWProfiles.pdf
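
As a simplified picture of what a relative absent word is, the Python sketch below lists fixed-length words (k-mers) that occur in one sequence but never in another, together with the CG fraction of each word. This is an assumption-laden toy: it uses a single fixed k, does not look for minimal words, and the roles of "target" and "reference" and all names are hypothetical rather than AltaiR's definitions.

```python
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relative_absent_words(target: str, reference: str, k: int) -> List[str]:
    """k-mers found in target that never occur in reference, sorted."""
    return sorted(kmers(target, k) - kmers(reference, k))


def cg_fraction(word: str) -> float:
    """Fraction of symbols in word that are C or G."""
    return sum(ch in "CG" for ch in word) / len(word)


# Toy sequences (made up).
words = relative_absent_words("ACGTGG", "ACGTAC", 3)
for w in words:
    print(w, round(cg_fraction(w), 3))
```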

Citation

If you use AltaiR in your research, please cite: Silva, Jorge M., Armando J. Pinho, and Diogo Pratas. "AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data." GigaScience 13 (2024): giae086.

Issues

To report an issue, please use the AltaiR issue tracker: https://github.com/cobilab/altair/issues

License

AltaiR is licensed under GPL v3. For more information, see the GNU General Public License, version 3: https://www.gnu.org/licenses/gpl-3.0.html