Skip to content

Adrian-Cantu/PRINSEQ-plus-plus

Repository files navigation

Prinseq++

PRINSEQ++ is a C++ implementation of the prinseq-lite.pl program. It can be used to filter, reformat or trim genomic and metagenomic sequence data. It is 5X faster than prinseq-lite.pl and uses less RAM thanks to the use of multi-threading and the cboost libraries. It can read and write compressed (gzip) files, drastically reducing the use of hard drive.

Install from Conda

You can install from Conda either in its own channel:

conda create -n prinseq-plus-plus -c conda-forge -c bioconda prinseq-plus-plus
conda activate prinseq-plus-plus
prinseq++ -h

or adding it to another channel:

conda install -c conda-forge -c bioconda prinseq-plus-plus

See below to install prinseq++ from source if you don't want to use conda.

Usage

-h | -help
    Print the help page; ignore other arguments.

-v | -version
    Print version; ignore other arguments.

-threads
    Nuber of threads to use. Note that if more than one thread is used, output
    sequences might not be in the same order as input sequences. (Default=1)

-VERBOSE <int>
    Format of the report of filtered reads, VERBOSE=1 prints information only
    on the filters that removed sequences. VERBOSE=2 prints numbers for filters
    in order (min_len, max_len, min_gc, max_gc, min_qual_score, min_qual_mean,
    ns_max_n, noiupac, derep, lc_entropy, lc_dust, trim_tail_left, trim_tail_right,
    trim_qual_left, trim_qual_right, trim_left, trim_right) to compare stats of diferent files.
    VERBOSE=0 prints nothing.
    (Default=1)

***** INPUT OPTIONS *****

-fastq <file>
    Input file in FASTQ format. Can also read a compressed (.gz) file.

-fastq2 <file>
    Input file in FASTQ format for pair-end reads. Can also read a
    compressed (.gz) file.

-FASTA
    Input is in fasta format (no quality). Note that the output format is
    still fastq by default. Quality will be treated as 31 (A) for all bases.

-phred64
    Input quality is in phred64 format. This is for older Illumina/Solexa reads.

***** OUTPUT OPTION *****

-out_format <int>
    Set output format. 0 FASTQ, 1 FASTA. (Default=0)

-out_name <string>
    For pair-end sequences, the output files are <string>_good_out_R1 and
    <string>_good_out_R2 for pairs where both reads pass quality control,
    <string>_single_out_R1 and <string>_single_out_R2 for read that passed
    quality control but mate did not. <string>_bad_out_R1 and <string>_bad_out_R2
    for reads that fail quality controls. [Default = random size 6 string]

-rm_header
    Remove the header in the 3rd line of the fastq (+header -> +). This does
    not change the header in the 1st line (@header).

-out_gz
    Write the output to a compressed file (WARNING this can be really SLOW,
    will be fixed in a future release)

-out_good  , -out_single , -out_bad,
-out_good2 , -out_single2, -out_bad2
    Rename the output files idividually, this overwrites the names given by
    -out_name only for the selected files. File extension won't be added
    automatically. (TIP: if you don't need a file, set its name to /dev/null)

***** FILTER OPTIONS ******

-min_len <int>
    Filter sequence shorter than min_len.

-max_len <int>
    Filter sequence longer than max_len.

-min_gc <float>
    Filter sequence with GC percent content below min_gc.

-max_gc <float>
    Filter sequence with GC percent content above min_gc.

-min_qual_score <int>
    Filter sequence with at least one base with quality score below
    min_qual_score.

-min_qual_mean <int>
    Filter sequence with quality score mean below min_qual_mean.

-ns_max_n <int>
    Filter sequence with more than ns_max_n Ns.

-noiupac
    Filter sequence with characters other than A, C, G, T or N.

-derep
    Filter duplicated sequences. This only remove exact duplicates.

-lc_entropy=[float]
    Filter sequences with entropy lower than [float]. [float] should be in
    the 0-1 interval. (Default=0.5)

-lc_dust=[float]
    Filter sequences with dust_score lower than [float]. [float] should be in
    the 0-1 interval. (Default=0.5)

***** TRIM OPTIONS *****

-trim_left <integer>
    Trim <integer> bases from the left (5'->3').

-trim_right <integer>
    Trim <integer> bases from the right (3'->5').

-trim_tail_left <integer>
    Trim poly-A/T tail with a minimum length of <integer> at the
    5'-end.

-trim_tail_right <integer>
    Trim poly-A/T tail with a minimum length of <integer> at the
    3'-end.

-trim_qual_rule <string>
    Rule to use to compare quality score to calculated value. Allowed
    options are lt (less than), gt (greater than) and et (equal to).
    [default: lt]

-trim_qual_left=[float]
    Trim recursively from the 3'-end chunks of length -trim_qual_step if the
    mean quality of the first -trim_qual_window bases is less than [float].
    (Default=20)

-trim_qual_right=[float]
    Trim recursively from the 5'-end chunks of length -trim_qual_step if the
    mean quality of the last -trim_qual_window bases is less than [float].
    (Default=20)

-trim_qual_window [int]
    Size of the window used by trim_qual_left and trim_qual_right (Default=5)

-trim_qual_step [int]
    Step size used by trim_qual_left and trim_qual_right (Default=2)

-trim_qual_type <string>
    Type of quality score calculation to use. Allowed options are min,
    mean, max and sum. [default= min]

Alternate installations

If you don't want to use conda, there are several other ways that you can install prinseq++. Try one of these and post an issue if you get stuck.

General requirements to install from source

  1. g++
  2. make
  3. boost-devel ( "sudo apt-get install libboost-all-dev" in Ubuntu)
  4. pthread

Download

If you are just interested in compiling and using PRINSEQ++, download the latest version. You can also download the binary. If you want to edit the source code, clone this repository.

To install

tar -xvf prinseq-plus-plus-1.2.4.tar.gz
cd prinseq-plus-plus-1.2.4
./configure
make
make test
sudo make install

To install prinseq++ from this Git repository

git clone https://github.com/Adrian-Cantu/PRINSEQ-plus-plus.git
cd PRINSEQ-plus-plus
./autogen.sh
./configure
make
make test
sudo make install

To check instalation

prinseq++ -h

Contributors

Prinseq++ was written by Adrian Cantu, with contributions from Jeff Sadural and Katelyn McNair. Rob Edwards helped here and there.