============================================================================================================
When using GNFish or part of its code please cite the following paper:
Lorente-MartĂnez, H., Agorreta, A., & San Mauro, D. (2022). Genomic Fishing and Data Processing for Molecular Evolution Research. Methods and Protocols, 5(2), 26. https://doi.org/10.3390/mps5020026
============================================================================================================
Suite of small Python programs following the pipeline detailed in (DOI of article yet to be published) for:
(1) “fishing” specific gene sequences of interest from genomic data available in NCBI databases
(2) processing and depuration of retrieved sequences
(3) production of a multiple sequence alignment
(4) selection of best-fit model of evolution
(5) solid reconstruction of a phylogenetic tree.
There are 12 scripts:
align_sequences.py -> Align sequences using MAFFT software
blast.py -> Perfoms BLAST searches against the download genomes
class_list_files.py -> Class for accessing directories. Not need to be run, but the other scripts need it.
decompress_genomes.py -> Decompress genome files.
get_combined_seqs.py -> Merge sequences for creating a whole alingment.
get_genomes.py -> Download genomes from NCBI databases
get_query_sequences.py -> Download a dataset of protein sequences that act as query for BLAST searches
get_RAW_sequences.py -> Extracts the sequences mapped after BLAST searches
get_unique_hits.py -> Gets unique hits from Blast output files based on genomes IDs
phylogenetic_inference.py -> Runs IQ-Tree program for phylogentic inference
translate_sequences.py -> Translate nucleotide sequences to protein.
alignment_trimming.py -> Runs trimAl program for trimming alignments.
This software is released under the license GNU GPLv3.
This software is provided as is without warranty of any kind.
People wishing to contribute to the software, report issues or seek support can contact Hector Lorente Martinez at hlorente@ucm.es
Download scripts from https://github.com/hectorloma/GNFish
In linux terminal
git clone https://github.com/hectorloma/GNFish
If git is not installed visit https://github.com/git-guides/install-git which is available for any OS
Make scripts executable if you want to run it from the very same terminal
cd GNFish
chmod +x *.py
This software uses Biopython module in python 3 (tested version 1.78-2). It can be installed in your system using pip:
pip install biopython
Program for dowloading genomes from NCBI Databases through Entrez. Requires internet conection.
It needs your e-mail and a file with your queries.
Queries are usallly taxa names (see query_genomes examples files at ./Example)
By default it will download genomic data, but you can also add protein or RNA, using --protein or --rna respectively.
Retmax set a maximum number of downloaded genomes for query.
It creates a directory named Data and three subdirectories named Genomic, Rna and Protein where it downloads the genomes.
Genomes file are compressed you must descompress for working with them (see decompress genome file).
Genome file name will be "Genus_species_assemblyID_datatype.f[n,a]a".
Type on terminal get_genomes.py -h for further information.
**get_genomes.py 'e-mail' 'query.txt' **
--email -> mandatory e-mail for NCBI searches
--query -> file with the queries. Usually simple taxa names (species, group). Field tags or filters can be added to each query. See examples below or look at Example directory for examples of query files
Optional parameters:
--genomic -> downloads whole genomic data
--rna -> downloads protein annotation data
--protein -> downloads protein annotation data
--exclusive -> downloads just protein or rna annotation data if available. No genomic backup
--retmax -> number of NCBI records reported for every query. Default value equal to 200
--refine -> adds filter or field information to all queries. Constant value AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])
Examples files are availabe at Example directory.
Query txt with simple searches, just taxa nor filters or field tags. No refine argument.
See ./Example/query_genome_1.txt for examples of queries.
Creates ./Data directory and ./Data/Genomic, ./Data/Rna and ./Data/Protein subdirectories.
Nor filters or field tags applied. Not curated and redundant genomes (one genom for more than one species is used).
Use this when you do not care very much about filtering.
Genomic
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt
Protein. Look for genomic data as backup (For transcrits use --rna)
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein
Exclusive protein. No backup. (For transcrits use --rna)
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --exlusive
Protein and Rna. Genomic search as backup
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --rna
Rna donwload changing number of records downloaded. Will donwload until 1000 genomes if available. By default 200 are donwloaded.
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --rna --retmax 1000
Default. Applies Representative (just one genome for species), Latest, Not Anomalous.
Just protein example. For rna, genomic, exclusive or retmax argument see 1 above and them to this command line.
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --refine
Own filters or field tags. Just genomes annotated at chromosome level (Example)
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_1.txt --protein --refine 'AND "chromosome level"[filter]'
More info about filters and field tags at https://www.ncbi.nlm.nih.gov/assembly/help/ and reading https://www.ncbi.nlm.nih.gov/books/NBK3837/ - Entre-zHelp.Entrez_Searching_Options.
You must add a the field tags or filters after your query.
See ./Example/query_genome_2.txt for examples of queries.
As each contains is own filters or field tags this will be applied exclusively.
For gobioidei it will download just genomes annotated at chromosome level.
For Homo sapiens all the available genomes will be downloaded.
Just protein example. For rna, genomic, exclusive or retmax argument see 1 above and them to this command line.
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_2.txt --protein
Default. Applies Representative (just one genome for species), Latest, Not Anomalous.
You must add a the field tags or filters after your query.
See ./Example/query_genome_2.txt for examples of queries.
As each contains is own filters or field tags this will be applied exclusively.
For gobioidei it will download just genomes annotated at chromosome level.
For Homo sapiens just the representative one.
Just protein example. For rna, genomic, exclusive or retmax argument see 1 above and them to this command line.
./Code/get_genomes.py hlorente@ucm.es Example/query_genome_2.txt --protein --refine
Program for dowloading protein or nucleotide from NCBI Databases through Entrez. Requires internet conection.
It needs your e-mail and a file with your queries.
Queries must follow "Gene name (filters). If no filters type () after the gene name.
By default it will download protein data, but you can change using --nucleotide parameter.
Retmax set a maximum number of downloaded genomes for query.
Curated argument refine the search for protein search, just those entries which include the name of the protein in Protein Feature.
Stores sequences at ./Data/Query_seqs.
Output file will be named "genename_query_seqs.fas.
Type on terminal get_query_seqs.py -h for further information.
**get_query_seqs.py 'e-mail' 'query.txt' **
--email -> mandatory e-mail for NCBI searches
--query -> file with the queries. Usually simple taxa names (species, group). Field tags or filters can be added to each query. See examples below or look at Examples directory for examples of query files
Optional parameters:
--nucleotide -> for nucleotide downloading
--curated -> downloads just protein or rna annotation data if available. No genomic backup
--retmax -> number of NCBI records reported for every query. Default value equal to 200
--refine -> adds filter or field tags to the query. Constant value refseq[filter]. Follow constant value structure for your custom filters, begin with "AND"
Examples files are availabe at Example directory.
Query txt with simple searches, just gene name and filters equal to ().
Creates ./Data/Query_seqs directory.
Peforms well for protein, for nucleotide it is better to see 2.
Protein
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt
Protein. Downloading restricting maximum number to 50
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50
Protein. Downloading restricting maximum number to 50 and curating
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --curated
Nucleotide. Downloading restricting maximum number to 50. Better see 2.
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --nucleotide
Default. Applies refseq[filter].
Protein download, restricting maximum number to 50, curating, refine default just sequences from RefSeq.
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --curated --refine
Nucleotide recommended use. Dowloading restricting maximum number to 50 and just transcripts (refine argument).
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_1.txt --retmax 50 --curated --refine 'biomol_mrna[PROP]'
You must add a the field tags or filters after your query.
As each query contains is own filters or field tags these will be applied exclusively to every query. See ./Example/query_seqs_2.txt and quer_seqs_3.txt for a scheme.
Protein. For aquaporin just vertebrates sequences. For hemoglobin just sequences between 100 and 200 positions.
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_2.txt
Nucleotide. For aquaporin sequences just download sequences from RefSeq. For hemoglobin just from Plasmids.
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_3.txt
Protein. Same filters as in 3 plus sequences just from 1999.
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_2.txt --refine 'AND ("1999/01/01"[PDAT] : "3000/12/31"[PDAT])'
Nucleotide. Same filters as in 3 plus sequences just from 1999.
./Code/get_query_seqs.py hlorente@ucm.es Example/query_seqs_3.txt --refine 'AND ("1999/01/01"[PDAT] : "3000/12/31"[PDAT])'
More info about filters and field tags at x and reading y.
Program for decompress downloaded genomes.
It searchs for '.gz' files. Can be applied to any other '.gz' file.
With data type (genomic, rna, protein) argument the program will look at ./Data/Datat_type.
With directory argument the program will iterativily within the detail path.
Type on terminal decompress_genomes.py -h for further information.
**decompress_genomes.py --data_type or --directory 'path' **
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
BLAST against genome .fna and .faa files.
It requires BLAST download at https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.
Type on terminal blast.py -h for further information.
**blast.py --data_type or --directory 'path' **
blast_path -> path to bin ncbi-blast directory
query_file -> path to your query file with your reference sequences (Fasta file with prot or nucl seqs)
query_type -> data type of query sequences. prot or nucl
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--evalue -> E-value threshold for search. Default value equal to 1e-10
--outfmt -> Output alignment options. Default value 6
--out_exten -> Extension of the BLAST output file. Default "_out.tsv"
Examples files are availabe at Example directory.
Protein directory
./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --protein
Custom directory
./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --directory 'path_to_custom_directory' --protein
Rna directory
./Code/blast.py 'path_to_bin_directory' ./Example/nucleotide_query.fas 'nucl' --rna
Custom directory
./Code/blast.py 'path_to_bin_directory' ./Example/nucleotide_query.fas 'nucl' --directory 'path_to_custom_directory' --rna
Rna directory
./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --rna
Custom directory
./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'prot' --directory 'path_to_custom_directory' --rna
Protein directory
./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'nucl' --protein
Custom directory
./Code/blast.py 'path_to_bin_directory' ./Example/protein_query.fas 'nucl' --directory 'path_to_custom_directory' --protein
Gets unique hits from BLAST output files based on genomes IDs.
The program search for '.tsv' files. Change this with --pattern argument.
Generates output with 'unique.tsv' extension.
Outputs will be stored at Genomic, Rna, Protein or custom directory.
Type on terminal get_unique_hits.py -h for further information.
**get_unique_hits.py --data_type or --directory 'path' **
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--pattern -> custom pattern to find Blast output files. Default ".tsv"
Examples files are availabe at Example directory.
Protein directory. Output will be stored at ./Data/Protein/Species_directory
./Code/get_unique_hits.py --protein
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/get_unique_hits.py --directory 'path_to_custom_directory'
Using '.txt'. Program will serach for '.txt' files.
./Code/get_unique_hits.py --protein --pattern '.txt'
Gets RAW sequences from genomes based on 'unique.tsv' files.
The program search for 'unique.tsv' and '.f[a,n]a files. Change this with --blast_pattern an--genome_pattern arguments.
Genome and Blast unique output file must have a "Genus_species_assemblyID" structure or the name provided by NCBI when downloaded.
Type on terminal get_unique_hits.py -h for further information.
**get_RAW_sequences.py --data_type or --directory 'path' **
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--blast_pattern -> custom pattern to find BLAST output files. Default ".tsv"
--genome_pattern -> pattern to find genome files. Default ".f[a,n]a"
--in_len -> Number of sites extracted upstream and downstream from the blast hit. Default 10000. A whole sequence of at least 20000 sites if exists
--query_seqs -> Use it when you want to attach some query sequences according to BLAST results for future alignments
--query_seqs_num -> Maximum number of query sequences extracted. Use it when you want to attach some sequences to genomic sequences for future alignments. 5 by default
--email -> Mandatory when you want to download some sequences to complete the RAW files for future alignments. As in get_query_seqs
Examples files are availabe at Example directory.
Protein directory. Output will be stored at ./Data/Protein/Species_directory
./Code/get_RAW_sequences.py --protein
Genomic directory, sequences of 20,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory
./Code/get_RAW_sequences.py --genomic
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/get_RAW_sequences.py --directory 'path_to_custom_directory'
Genomic directory, sequences of 20,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory
20 sequences from the dataset of nucleotide_query_seqs.fas.
./Code/get_RAW_sequences.py --genomic --in_len --query_seqs ./Example/nucleotide_query_seqs.fas --query_seqs_num 20 --email hlorente@ucm.es
Genomic directory, sequences of 20,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory
20 sequences from the dataset of nucleotide_query_seqs.fas.
Program will look for 'unique.txt' and for '.fas'.
./Code/get_RAW_sequences.py --genomic --in_len --blast_pattern 'unique.txt' --genome_pattern '.fas' --query_seqs ./Example/nucleotide_query_seqs.fas --query_seqs_num 20 --email hlorente@ucm.es
Genomic directory, sequences of 30,000 positions if possible. Output will be stored at ./Data/Genomic/Species_directory
./Code/get_RAW_sequences.py --genomic --in_len 15000
Aligns files enclosed at Data folder or your custom file.
You must have used --query_seqs argument in get_RAW_sequences.py or have manually added some sequences to raw files.
See https://mafft.cbrc.jp/ for information about algorithms.
Type on terminal get_unique_hits.py -h for further information.
**align_sequences.py --data_type or --directory 'path' **
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--pattern -> sets custom pattern to find files for alignment. Default 'RAW.fas'. Backup search RAW.fas
--algorithm -> sets MAFFT algorithm. Default mafft (automatic)
Examples files are availabe at Example directory.
Genomic directory. Output will be stored at ./Data/Genomic/Species_directory
./Code/align_sequences.py --genomic
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/align_sequences.py --directory 'path_to_custom_directory'
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/align_sequences.py --directory 'path_to_custom_directory' --algorithm einsi --pattern 'RAW_mod.fas'
Translates files.
Open code file of the program for checking for different genetic codes.
**translate_sequences.py --data_type or --directory 'path' **
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--pattern -> custom pattern to find files for translation. Defaul '_final.fas' and 'RAW.fas' as backup
--genetic_code -> Default 1 -Standard. Look at code file for more genetics codes
--out_exten -> extension of the output file. Default '_transtaled.fas'.
Rna directory. Output will be stored at ./Data/Rna/Species_directory
./Code/translate_sequences.py --rna
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/translate_sequences.py --directory 'path_to_custom_directory'
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/get_RAW_sequences.py --directory 'path_to_custom_directory' --genetic_code 4
Combines FASTA files in just one matrix.
You can combine genomic, RNA and protein data. But be careful with duplications. Besides genomic and RNA should have been translated (see above) in the correct reading framework before translation (see above).
Note that the program concatenate sequences to same file if you do not change file_name argument.
After the pipeline (see article) you may have:
1.-Untouched sequences. Typically protein or RNA sequence (valid for future nucleotide matrix if you do not care about UTRs; not valid for future protein matrix, because of lost of reading framework). Extension 'RAW.fas'.
2.- Aligned and trimmed (mismatched positions) sequences. Mandatory for genomic sequences and in some cases for RNA sequences (if you are going to uses for protein matrix after translation, but in this case they will follow point 3). Extension '_final.fas'. Add the extension to your file after visualizing the alingment even when you do not trimmed any position of the alingment.
3.- Translated sequences. Mandatory for genomic sequences and RNA sequences
4.- Raw translated sequences. If the RNA dataset you are working with is just the coding sequence and you do not need to trimm it. Extension '_RAW_translated.fas
**get_combined_seqs.py --data_type or --directory 'path' **
output_name -> name for the file of the combined matrix
output_path -> path were to store the combined matrix
Optional parameters:
--directory -> sets path to custom folder
--genomic -> looks at ./Data/Genomic
--rna -> looks at ./Data/Rna
--protein -> looks at ./Data/Protein
--pattern -> custom pattern to find files for translation. Defaul '_translated.fas' and 'RAW.fas' as backup'
--out_exten -> extension of the output file. Default "_all_combined.fas"
Protein and Genomic directory. Output will be stored at ./Data/Rna/Species_directory
./Code/get_combined_seqs.py --protein --genomic
Custom directory. Output will be stored at custom directory (where the file is located)
./Code/get_combined_seqs.py --directory 'path_to_custom_directory'
Protein and Genomic directory. Output will be stored at ./Data/Rna/Species_directory
./Code/get_combined_seqs.py --protein --genomic --pattern 'modified.fas'
Trims alignment.
See http://trimal.cgenomics.org/use_of_the_command_line_trimal_v1.2 for more information about trimAL and running algorithms.
**alignment_trimming.py trimal_path 'path' input_file 'path' output_file 'path' **
trimal_path -> path to trimAl directory
input_file -> alignment (FASTA file with prot or nucl aligned seqs)
output_file -> output file name
Optional parameters:
--trimal_command -> command to run trimal options. Default -gt 0.1
Examples files are availabe at Example directory.
Remove all positions in the alignment with gaps in 90%
./Code/alignment_trimming.py trimal_path 'path_to_trimAl_source_directory' --input_file 'path_to_alignment' --output_file 'path_to_ouput_file'
Remove all positions in the alignment with gaps in 50%
./Code/alignment_trimming.py trimal_path 'path_to_source_directory' --input_file 'path_to_alignment' --output_file 'path_to_ouput_file' --trimal_command -gt 0.5
Phylogenetic analysis using IQ-TREE.
See http://www.iqtree.org/doc/ for more information about IQ-TREE and running algorithms.
**phylogenetic_inference.py --data_type or --directory 'path' **
iqtree_path -> path to IQ-TREE bin folder
input_file -> alignment (FASTA file with prot or nucl aligned seqs)
Optional parameters:
--trimal_command -> command to run IQ-Tree, without output declaration, see next parameter. Default TEST, ULTRAFAST BOOSTRAP 1000, ALRT 1000
--output_suffix -> suffixes added to input file. Recommended, model and type of support. Default _TEST_UFBS_alrt
Phylogenetic inference running TEST for calculating model of evolutoon, 1000 replicates of Ultrafast Boostrap and 1000 replicates of alrt as support.
./Code/alignment_trimming.py iqtree_path 'path_to_IQTREE_bin_directory' --input_file 'path_to_alignment'
Phylogenetic inference running LG as model of evolution, 1000 replicates of Boostrap and 1000 replicates of alrt as support.
./Code/alignment_trimming.py iqtree_path 'path_to_IQTREE_bin_directory' --input_file 'path_to_alignment' --trimal_command '-m LG -b 1000 -alrt 1000