Robitaille, A., Brancaccio, R.N., Dutta, S. et al. PVAmpliconFinder: a workflow for the identification of human papillomaviruses from high-throughput amplicon sequencing. BMC Bioinformatics 21, 233 (2020). https://doi.org/10.1186/s12859-020-03573-8
PVAmpliconFinder is a data analysis workflow designed to rapidly identify and classify known and potentially new Papillomaviridae sequences from amplicon deep sequencing with degenerate papillomavirus (PV) primers.
PVAmpliconFinder is based on alignment similarity metrics, but also considers molecular evolution time for improved identification and taxonomic classification of novel PVs. The final output of the tool includes a list of fully characterized, putatively new Papillomaviridae sequences, as well as graphical representations of the relative abundance of the virome sequence diversity in the tested samples.
The PVAmpliconFinder workflow is designed for the analysis of reads generated by paired-end sequencing of DNA amplified with degenerate primers that specifically target the L1 sequence of papillomaviruses (Chouhy et al., 2010; Forslund et al., 1999; Forslund et al., 2003).
Python 2.7 or higher and Perl v5.22.1 or higher are required.
The tool was developed in a UNIX environment, but installing clang_osx-64, clangxx_osx-64 and gfortran_osx-64 with conda should provide a functional environment on macOS.
PVAmpliconFinder relies on Bioconda to install the software and its dependencies.
Please install the version of Miniconda corresponding to your Python version.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -y fastqc multiqc trim-galore vsearch blast raxml cap3 krona libxml2 gcc_linux-64 gxx_linux-64 gfortran_linux-64 perl-padwalker perl-xml-libxml perl-libxml-perl perl-bioperl perl-getopt-long perl-math-round perl-statistics-basic perl-list-moreutils perl-module-build perl-bioperl-run perl-text-csv
export PATH="PATH_TO_PVAMPLICONFINDER/program:$PATH"
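Since the workflow expects its tools to be reachable through PATH, a quick check after installation can save a failed run. The following is a minimal sketch, not part of PVAmpliconFinder: `missing_tools` is a hypothetical helper, and the executable names in the example call are assumptions that may differ from what your installation provides.

```shell
#!/bin/sh
# Print every command name given as an argument that is NOT found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example call (executable names are assumptions, adjust to your install):
# missing_tools fastqc multiqc trim_galore vsearch blastn cap3 papara
```

An empty output means every listed executable was found; any name that is printed still needs to be installed or added to PATH.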
On 32-bit systems, the prebuilt PaPaRa binary is not functional, as specified on the tool's webpage. You need to install PaPaRa manually following its instructions and put the binary file in PVAmpliconFinder/program. Note that the binary file must be named "papara".
The tools used by PVAmpliconFinder can also be downloaded and installed manually; the corresponding executables must then be reachable through the PATH environment variable.
Please note that the PaPaRa binary file must be named "papara".
PVAmpliconFinder needs the nt and taxdb NCBI databases to work properly. You can find these databases at the following FTP address: ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Note that the taxonomy (taxdb) files must be placed where BLAST can find them, typically in the same directory as the nt database.
It is advised to use the NCBI script update_blastdb.pl to facilitate the installation of the databases. More information here.
Once the databases are downloaded and installed, please check that the ~/.ncbirc file is present and points to the correct NCBI nt database location. More information here.
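As an illustration of the database setup, the sketch below writes a minimal `.ncbirc` that points BLAST to a database directory. It assumes the usual BLAST+ configuration layout (a `[BLAST]` section with a `BLASTDB` key); `write_ncbirc` is a hypothetical helper, and the directory you pass is whatever location you chose for nt and taxdb.

```shell
#!/bin/sh
# The databases themselves can be fetched with the NCBI script, for example:
#   update_blastdb.pl --decompress nt
#   update_blastdb.pl --decompress taxdb

# write_ncbirc DB_DIR [RC_FILE]
# Write a minimal BLAST configuration pointing to DB_DIR.
# RC_FILE defaults to ~/.ncbirc; an existing file is overwritten.
write_ncbirc() {
    db_dir=$1
    rc_file=${2:-"$HOME/.ncbirc"}
    if [ ! -d "$db_dir" ]; then
        echo "error: database directory not found: $db_dir" >&2
        return 1
    fi
    printf '[BLAST]\nBLASTDB=%s\n' "$db_dir" > "$rc_file"
}
```

Calling `write_ncbirc /path/to/blast/db` (a placeholder path) would produce a two-line file containing `[BLAST]` and `BLASTDB=/path/to/blast/db`. Back up any existing `~/.ncbirc` first, since the helper overwrites it.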
Type | Description |
---|---|
-d | PATH to input fastq directory |
Test files can be found here.
Name | Example value | Description |
---|---|---|
-s | pool | suffix of fastq filename |
-o | PV_Amplicon_output | PATH to output directory |
Name | Default value | Description |
---|---|---|
-f | NA | Tabular file containing information about the samples - The first line of this file must be "ID primer tissue" |
-b | nt | Name of the local "nt" blast database |
-i | 98 | Threshold of percentage of identity used for the de-novo centroid-based clustering |
-t | 2 | Number of threads |
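Because the `-f` file must start with the line "ID primer tissue", it can be worth checking the header before launching a long run. The following is a minimal sketch with a hypothetical helper, `check_info_header`; it splits the first line on any whitespace, so it makes no assumption about whether the columns are separated by tabs or spaces.

```shell
#!/bin/sh
# check_info_header FILE
# Succeed (exit status 0) only if the first line of FILE consists of exactly
# the three fields "ID", "primer" and "tissue", in that order.
check_info_header() {
    [ -r "$1" ] || return 1
    head -n 1 "$1" | awk '
        { ok = (NF == 3 && $1 == "ID" && $2 == "primer" && $3 == "tissue") }
        END { exit !ok }'
}
```

For example, `check_info_header samples_info.txt || echo "unexpected header"` (with `samples_info.txt` standing in for your own file) reports a problem before the workflow starts. Files with Windows line endings fail this check, because the trailing carriage return becomes part of the last field.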
Flags are special parameters without value.
Name | Description |
---|---|
-h | Display help |
sh PVAmpliconFinder.sh [-h] [-t threads] [-b "nt" database] [-f info_file] [-i identity threshold] -s fastq_files_suffix -d input_dir -o output_dir
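To make the synopsis concrete, the sketch below assembles a command line from the options documented above and prints it, so that it can be inspected before being run. `pvaf_command` is a hypothetical helper; the suffix and thread defaults mirror the example and default values in the tables, while the directory names are placeholders.

```shell
#!/bin/sh
# pvaf_command INPUT_DIR OUTPUT_DIR [SUFFIX] [THREADS]
# Print (do not run) a PVAmpliconFinder command line.
# SUFFIX defaults to "pool" and THREADS to 2; -i 98 and -b nt are the
# documented default values, written out explicitly for clarity.
pvaf_command() {
    input_dir=$1
    output_dir=$2
    suffix=${3:-pool}
    threads=${4:-2}
    printf 'sh PVAmpliconFinder.sh -t %s -i 98 -b nt -s %s -d %s -o %s\n' \
        "$threads" "$suffix" "$input_dir" "$output_dir"
}
```

For instance, `pvaf_command fastq PV_Amplicon_output pool 4` prints a command that uses four threads. The helper does not quote its arguments, so paths containing spaces would need extra care.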
Type | Description |
---|---|
QC report | Report on FastQ file quality, before and after trimming |
Diversity by tissue | Excel table of taxonomically classified PV species identified in the samples |
Table summary | Excel table of read metrics |
Table putative Known viruses | Excel table of putative known viruses identified in the samples |
Table putative New viruses | Excel table of putative new viruses identified in the samples |
Putative Known viruses | FASTA files of putative known virus sequences identified in the samples |
Putative New viruses | FASTA files of putative new virus sequences identified in the samples |
KRONA Megablast | Directory of KRONA graphical representations of the unnormalized abundance of viruses identified by Megablast in the samples |
KRONA BlastN | Directory of KRONA graphical representations of the unnormalized abundance of viruses identified by BlastN in the samples |
KRONA RaxML | Directory of KRONA graphical representations of the unnormalized abundance of viruses identified by RAxML-EPA in the samples |
Log file | File of the logs |
Detailed description of the output
Name | Email | Role |
---|---|---|
Alexis Robitaille | alexis.robitaille@orange.fr | Developer to contact for support |
Magali Olivier | olivierm@iarc.fr | |
Massimo Tommasino | tommasinom@iarc.fr | |
Version 1.0
- Alexis Robitaille - IARC bioinformatic platform
This project is licensed under GPL-3.0.