Bioinformatic scripts

Here I will upload some python3 scripts for dealing with nucleotide sequences that I wrote initially for myself but then they appeared to be useful for other people.

1. annotate_blast_hits

Motivation

When you download the results of blast search from NCBI BLAST webpage, the table of blast hits does not contain any information about the matching sequences except their accession number. It is not convenient to check all the matches manually, especially if you queried thousands of sequences in a batch.

Description

This script is intended for taxonomic annotation of blast results (blastn, tblastn, blastp or blastx) saved in Hit Table CSV format. It uses efetch function from Bio.Entrez package to get information about accessions from GenBank Nucleotide (Nuccore) or Protein databases.

Input: table in CSV or TSV format (separators: comma, tabulation or semicolon), where GenBank accession numbers are by default in the 2nd column (1 if counted from zero). The column can be changed via optional argument --column, -c (column number counting from zero). There should be no headers or commented lines - they will cause errors.

The output is a file "*_annotated.csv" containing the original table with the following columns added:

Record name
Species name
Date of update
Reference
Full taxonomy

Running the script

This script "annotate_blast_hits.py" requires python3 and package Biopython and runs from terminal on Linux.

I did not try to run it on Windows, but probably it should also work.

You can also download a standalone executable for Linux that does not require any dependencies (Upd: Standalone version requires GNU libc 2.31 or newer). The script with all dependencies was bundled into one file with PyInstaller, so you can just save "annotate_blast_hits" and run it.

After downloading the script you should first make it executable:

chmod +x annotate_blast_hits.py

Usage: annotate_blast_hits.py [-h] [-c] csv db email

Example usage: ./annotate_blast_hits.py example_input.csv n yourname@mail.com -c 1

arguments	description
csv	Pathway to csv-formatted Hit Table file with blastn results. Positional argument
db	For the output of nucleotide blast or tblastn, use `n`. For the output of protein blast or blastx, use `p`. Positional argument
email	Your e-mail address. It is important when you make many requests to NCBI databases because it helps NCBI to contact you if something goes wrong. Otherwise they can just silently block you. Positional argument
-c, --column	Specify the position of the column with GenBank accession numbers, counting from zero. Default value is 1. Optional argument
-h, --help	Show help message and exit. Optional argument

Possible problems

If any of GenBank accessions correspond to extremely large sequences, such as whole-genome bacterial sequences or complete chromosome sequences with a length of millions bp, it will run EXTREMELY slowly.
It was tested with hit tables with up to several thousand accessions. It usually works, but if it's a daytime in USA and NCBI servers are overloaded, the script can stop with a runtime error. In this case, it will try to fetch the data from GenBank once more automatically.
The script writes some temporary files in the folder containing the hit table and they are normally removed at the end. If the script is terminated manually, some files can stay unremoved.

Contacts

For any questions and suggestions: ledum_laconicum(at)mail(dot)ru

If anybody will require additional annotation features (e.g., genes), don't hesitate to contact me.

2. Additional scripts for VSEARCH pipeline

Several scripts are stored here as a supplementary material for my publication This is not the best organizational decision from me, but it's too late to change it.

Scripts:

add_silva_taxonomy.py filter_fasta.py filter_otutab_and_fasta.py filter_tsv_by_other_tsv.py

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
README.md		README.md
add_silva_taxonomy.py		add_silva_taxonomy.py
annotate_blast_hits		annotate_blast_hits
annotate_blast_hits.py		annotate_blast_hits.py
blastn_example_input.csv		blastn_example_input.csv
blastn_example_input_annotated.csv		blastn_example_input_annotated.csv
blastx_example_input.csv		blastx_example_input.csv
blastx_example_input_annotated.csv		blastx_example_input_annotated.csv
example_input.tsv		example_input.tsv
filter_fasta.py		filter_fasta.py
filter_otutab_and_fasta.py		filter_otutab_and_fasta.py
filter_tsv_by_other_tsv.py		filter_tsv_by_other_tsv.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bioinformatic scripts

1. annotate_blast_hits

Motivation

Description

Running the script

Possible problems

Contacts

2. Additional scripts for VSEARCH pipeline

Scripts:

About

Releases

Packages

Languages

Gurdhhu/bioinf_scripts

Folders and files

Latest commit

History

Repository files navigation

Bioinformatic scripts

1. annotate_blast_hits

Motivation

Description

Running the script

Possible problems

Contacts

2. Additional scripts for VSEARCH pipeline

Scripts:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages