Program SeqyClean
Version: 1.10.09 (2017-10-16)
The main purpose of this software is to pre-process NGS data in preparation for downstream analysis.
SeqyClean offers:
- Adapter/key/primer filtering.
- Vector and contaminants filtering.
- Quality trimming.
- Poly A/T trimming.
- Overlapping paired reads.
It handles SFF and FASTQ file formats.
SeqyClean requires the developer version of zlib:
sudo apt-get install zlib1g-dev
Clone or download the repository. Then `cd` to the seqyclean home folder and type `make`.
Note: by default, the build produces a binary for OS X. It should build on Linux as well. If it does not, try this command:
make PLATFORM=-DLINUX
or simply contact me: ilya.zhbannikov@duke.edu
usage: ./seqyclean libflag input_file_name_1 [libflag input_file_name_2] -o output_prefix [options]
The parameter libflag is the library type: -454 for Roche 454 reads, -1 and -2 for paired-end Illumina reads, -U for single-end reads. See examples below.
-h, --help - Show this help and exit.
-v <filename> - Turns on vector trimming, default=off. <filename> is the path to a FASTA file containing vector genomes.
-c <filename> - Turns on contaminant screening, default=off. <filename> is the path to a FASTA file containing contaminant genomes.
-k <value> - Common size of k-mer, default=15.
-d - Distance between consecutive k-mers, default=1.
-kc <value> - Size of k-mer used in sampling the contaminant genome, default=15.
-qual ```max_average_error max_error_at_ends``` - Turns on quality trimming, default=off. Error boundaries: max_average_error (default=20 Phred), max_error_at_ends (default=20 Phred).
-braket ```window_size max_average_error``` - Parameter for quality trimming. By default window_size=10 and max_average_error=0.794.
-window ```window_size max_average_error``` [```window_size maximum_average_error``` [...]] - Parameters for quality trimming. By default there are two windows with size of 50 and 10 bp with the same max_average_error=0.794.
-ow - Overwrite existing results, default=off.
-minlen <value> - Minimum length of read to accept, default=100 bp.
-polyat [cdna] [cerr] [crng] - Turns on poly A/T trimming, default=off. Parameters: cdna (default=10) - maximum size of a poly tail, cerr (default=3) - maximum number of G/C nucleotides within a tail, crng (default=50) - range to look for a tail within a read.
-verbose - Verbose output, default=off.
-detrep - Generate detailed report for each read, default=off.
-dup [-startdw][-sizedw][-maxdup] - Turns on screening for duplicated sequences, default=off. Here -startdw (default=10) and -sizedw (default=35) are the starting position and size of the window within a read, -maxdup (default=3) is the maximum number of duplicated sequences allowed.
-no_adapter_trim - Turns off adapter trimming, default=off.
-t <value> - Number of threads (not yet applicable to Illumina mode), default=4.
-fastq - Output in FASTQ format, default=off.
-fasta - Output in FASTA format, default=off.
-m <filename> - Using custom barcodes, default=off. <filename> - a path to a FASTA-file with custom barcodes.
-1 <filename1> -2 <filename2> - Paired-end mode (see examples below)
-U <filename> - Single-end mode
-shuffle - Store non-paired Illumina reads in shuffled file, default=off.
-i64 - Turns on 64-based quality scores (Phred+64 encoding), default=off.
-adp <filename> - Turns on using custom adapters, default=off. <filename> - FASTA file with adapters
-at <value> - This option sets the similarity threshold for adapter trimming by overlap (only in paired-end mode). By default its value is set to 0.75.
-overlap <value> - This option turns on merging overlapping paired-end reads and <value> is the minimum overlap length. By default the minimum overlap length is 16 base pairs.
-new2old - A switch to fix read IDs, default=off ( As is detailed in: http://contig.wordpress.com/2011/09/01/newbler-input-iii-a-quick-fix-for-the-new-illumina-fastq-header/#more-342 ).
-gz - A flag that indicates compressed (.gz) output, default=off.
-alen - Maximum adapter length, default=30 bp (only for paired-end mode).
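The quality options above mix two units: -qual defaults are given as Phred scores (20), while the -braket and -window defaults (0.794) look like error probabilities. The standard Phred relation is p = 10^(-Q/10), so Phred 20 corresponds to p = 0.01, and 0.794 is numerically close to 10^(-1/10), i.e. Phred 1. The sketch below only illustrates this conversion and the two common FASTQ quality offsets (33, and 64 as presumably selected by -i64). The function names are hypothetical, and how SeqyClean applies these thresholds internally is not described here.

```python
import math
from typing import List


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert an error probability back to a Phred score: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_qualities(qual: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores.

    offset=33 is the common encoding; offset=64 is the older Illumina
    encoding (assumed here to be what -i64 refers to).
    """
    return [ord(ch) - offset for ch in qual]


def mean_error(qual: str, offset: int = 33) -> float:
    """Average error probability over a quality string (illustrative only)."""
    scores = decode_qualities(qual, offset)
    return sum(phred_to_error(q) for q in scores) / len(scores)
```

For example, `phred_to_error(20)` gives 0.01, and `decode_qualities("I5")` gives `[40, 20]` under the default offset of 33.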
### Please note
For Illumina reads, 'Adapter' refers to the sequence that contains [Adapter P5/P7 + Index I5/I7 + Linker (primer hybridization)]. In other words, 'Adapter' is the total foreign sequence attached to the 5' or 3' end of a piece of DNA.
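To make the idea of overlapping paired-end reads more concrete, here is a minimal sketch of one common approach: reverse-complement the second read and look for the longest suffix of the first read that matches its prefix. This is not SeqyClean's actual algorithm. The function names are hypothetical, and the defaults are simply borrowed from the -overlap (16 bp) and -at (0.75) options described above as an assumption.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_overlap(read1: str, read2: str,
                 min_overlap: int = 16,
                 min_similarity: float = 0.75) -> int:
    """Length of the longest suffix of read1 that matches a prefix of the
    reverse complement of read2 with at least min_similarity identity.
    Returns 0 if no overlap of at least min_overlap bases qualifies."""
    rc2 = reverse_complement(read2)
    max_len = min(len(read1), len(rc2))
    # Try the longest candidate overlap first, then shrink.
    for length in range(max_len, min_overlap - 1, -1):
        tail = read1[-length:]
        head = rc2[:length]
        matches = sum(a == b for a, b in zip(tail, head))
        if matches / length >= min_similarity:
            return length
    return 0


def merge_pair(read1: str, read2: str,
               min_overlap: int = 16,
               min_similarity: float = 0.75) -> Optional[str]:
    """Merge a read pair into one sequence, or return None if no overlap."""
    length = find_overlap(read1, read2, min_overlap, min_similarity)
    if length == 0:
        return None
    # Keep read1 whole and append the non-overlapping part of rc(read2).
    return read1 + reverse_complement(read2)[length:]
```

A real implementation would also reconcile base qualities in the overlapping region; this sketch ignores qualities entirely.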
Output in SFF, no quality trimming, vector trimming is performed:
./seqyclean -454 test_data/in.sff -o test/Test454 -v test_data/vectors.fasta
Output in SFF, quality trimming with default parameters, vector trimming and contaminants screening are performed:
./seqyclean -454 test_data/in.sff -o test/Test454 -qual -v test_data/vectors.fasta -c test_data/contaminants.fasta
Trimming of adapters is performed, quality trimming with default parameters:
./seqyclean -1 test_data/R1.fastq.gz -2 test_data/R2.fastq.gz -qual -o test/Test_Illumina
Trimmings of adapters and vectors are performed, quality trimming with default parameters:
./seqyclean -1 test_data/R1.fastq.gz -2 test_data/R2.fastq.gz -qual -v test_data/vectors.fasta -o test/Test_Illumina
Single-end mode: trimming of adapters and vectors and contaminant screening are performed, no quality trimming:
./seqyclean -U test_data/R1.fastq.gz -o test/Test_Illumina -v test_data/vectors.fasta -c test_data/contaminants.fasta
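When many samples need the same treatment, the command lines above can be assembled in a script. The following is a small sketch, assuming the binary is at `./seqyclean`. The helper name `build_paired_end_command` is hypothetical and it uses only the -1, -2, -qual, -v and -o flags documented above.

```python
import subprocess
from typing import List, Optional


def build_paired_end_command(r1: str, r2: str, out_prefix: str,
                             vectors: Optional[str] = None,
                             quality_trim: bool = False,
                             binary: str = "./seqyclean") -> List[str]:
    """Assemble an argument list for a paired-end SeqyClean run."""
    cmd = [binary, "-1", r1, "-2", r2]
    if quality_trim:
        cmd.append("-qual")          # quality trimming with default parameters
    if vectors is not None:
        cmd += ["-v", vectors]       # FASTA file with vector genomes
    cmd += ["-o", out_prefix]
    return cmd


def run_command(cmd: List[str]) -> None:
    """Run the assembled command; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)


# Example (not executed here):
# run_command(build_paired_end_command(
#     "test_data/R1.fastq.gz", "test_data/R2.fastq.gz",
#     "test/Test_Illumina", quality_trim=True))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.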
@inproceedings{Zhbannikov:2017:SPH:3107411.3107446,
author = {Zhbannikov, Ilya Y. and Hunter, Samuel S. and Foster, James A. and Settles, Matthew L.},
title = {SeqyClean: A Pipeline for High-throughput Sequence Data Preprocessing},
booktitle = {Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics},
series = {ACM-BCB '17},
year = {2017},
isbn = {978-1-4503-4722-8},
location = {Boston, Massachusetts, USA},
pages = {407--416},
numpages = {10},
url = {http://doi.acm.org/10.1145/3107411.3107446},
doi = {10.1145/3107411.3107446},
acmid = {3107446},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {data preprocessing, high-throughput dna sequencing, sequence analysis},
}
Ilya Y. Zhbannikov, Samuel S. Hunter, James A. Foster, and Matthew L. Settles. 2017. SeqyClean: A Pipeline for High-throughput Sequence Data Preprocessing. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB '17). ACM, New York, NY, USA, 407-416. DOI: https://doi.org/10.1145/3107411.3107446
%0 Conference Paper
%1 3107446
%A Ilya Y. Zhbannikov
%A Samuel S. Hunter
%A James A. Foster
%A Matthew L. Settles
%T SeqyClean: A Pipeline for High-throughput Sequence Data Preprocessing
%B Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
%@ 978-1-4503-4722-8
%C Boston, Massachusetts, USA
%P 407-416
%D 2017
%R 10.1145/3107411.3107446
%I ACM
Please contact Ilya (ilya.zhbannikov@duke.edu) with any questions.