Knead your sequencing reads before baking
ReadKnead clips, trims, demultiplexes, filters (e.g. by length), selects (e.g. randomly) and renames reads from FASTQ files.
- Choice of algorithm for trimming: fast and accurate bit-masked k-difference matching, Needleman–Wunsch, search or match.
- Demultiplexing using internal barcodes (user-defined positions in the reads)
For testing read preparation pipelines and quality control, ReadKnead:
- Plots read-length barplot
- Plots quality scores boxplot
- Provide detailed stats for each operation (clip, trim etc)
- Integrated pipeline for read preparation When individual tools are used for read preparation, demultiplexing can't depend on successful trimming for example. With ReadKnead, users define read preparation pipeline with multiple dependent steps.
- Speed ReadKnead is written in Go and parallelized.
See refs page for tarball and executable.
go test -v ./cmd/readknead/...
Input files for following examples can be found in the cmd/readknead/testdata
directory. Please note that FASTQ files found in that directory are uncompressed, while following examples demonstrate how to use compressed FASTQ files using multiple compressor (Gzip, Zstandard and LZ4). We suggest to run the tests (see above) to see the output of these examples.
Highly recommended to use the -verbose_level 20
argument to test pipelines.
First, define a pipeline in clip_trim.json
file for single-end reads that will:
- Clip 10 nucleotides from the 5' end of reads
- Trim the 3' end of reads using the bit-masked k-difference matching keeping all reads including untrimmed reads (
no_trim
). - Remove reads shorter than 20 nucleotides
[{"name": "clip",
"end": 5,
"length": 10},
{"name": "trim",
"end": 3,
"algo": "bktrim",
"epsilon": 0.15,
"epsilon_indel": 0.1,
"keep": ["no_trim",
"trim_exact",
"trim_align"],
"sequence":"AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"},
{"name": "length",
"min_length": 20}]
Then run the pipeline defined above in the JSON file on input file sample1_R1.fastq.gz
:
readknead -fq_fnames_r1 "sample1_R1.fastq.gz" \
-fq_command_in "zcat" \
-fq_path_out "output" \
-fq_fname_out_r1 "sample1_R1.fastq.lz4" \
-fq_command_out "lz4,-" \
-ops_r1_path "clip_trim.json" \
-report_path "output/preparing_report.json"
First, define a pipeline in paired_end_trim.json
file for paired-end reads that will:
- Rename reads to shorten their names using the pattern
sample2.##
where##
will be replaced by the read number. In case read names contain a barcode (prefixed with#
), it will be kept in the renamed reads. - Trim the reads on both ends using the bit-masked k-difference matching keeping all reads including untrimmed reads (
no_trim
). - Remove reads shorter than 20 nucleotides
[{"name": "rename",
"base36": true,
"keep_barcode": true,
"new_name": "sample2."},
{"name": "trim",
"algo": "bktrim_paired",
"epsilon": 0.15,
"epsilon_indel": 0.1,
"keep": ["no_trim",
"trim_exact",
"trim_align"],
"sequence":"AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
"sequence_paired":"AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA"},
{"name": "length",
"min_length": 20}]
To generate quality-control stats, quality score encoding (Illumina 1.8+ Phred+33) argument is added. Four workers are used. Run the pipeline defined above in the JSON file on input files sample2_R1.fastq.gz
and sample2_R2.fastq.gz
:
readknead -fq_fnames_r1 "sample2_R1.fastq.zst" \
-fq_fnames_r2 "sample2_R2.fastq.zst" \
-fq_command_in "zstdcat" \
-fq_path_out "output" \
-fq_fname_out_r1 "sample2_R1.fastq.zst" \
-fq_fname_out_r2 "sample2_R2.fastq.zst" \
-fq_command_out "zstd,-,-fo" \
-ops_r1_path "paired_end_trim.json" \
-report_path "output/preparing_report.json" \
-label "WT replicate 2" \
-ascii_min "33" \
-stats_in_path "output/stats_in" \
-stats_out_path "output/stats_out" \
-num_worker 4
The input FASTQ files are compressed using Zstandard.
First, define a pipeline in demultiplex.json
file for paired-end reads that will:
- Trim/Filter the reads on 5' end with a match algorithm with one of the sequences listed in
sequences
. Only perfect match and >70% match reads are kept. Trimmed sequences are not added to the read names ("add_trimmed": false
) and matched sequences are not actually trimmed from the reads ("apply_trim_seq": false
): this step is only a filter. - Clip 10 nucleotides from the 5' end of reads and add the clipped sequence to the read names
- Reads are separated (i.e. demultiplexed) matching one of the barcodes listed in
barcodes
(at position 10 from the 5' end with maximum of 1 mismatch). One FASTQ file will be created per barcode (replacing the sequence of the barcode in the output FASTQ file in[DPX]
in the-fq_fname_out_r1
and-fq_fname_out_r2
arguments). - Clip 19 nucleotides from the 5' end of reads and don't add the clipped sequence to the read names
- Remove reads shorter than 20 nucleotides
[{"name": "trim",
"end": 5,
"algo": "match",
"position": 6,
"min_score": 0.7,
"add_trimmed": false,
"keep": ["trim_exact",
"trim_align"],
"sequences": ["CATTGCTTATGG",
"GTACGGGACTTA"],
"apply_trim_seq": false},
{"name": "clip",
"end": 5,
"length": 10,
"add_clipped": true},
{"name": "demultiplex",
"end": 5,
"max_mismatch": 1,
"barcodes": ["GAGTA",
"CTGAG"]},
{"name": "clip",
"end": 5,
"length": 19,
"add_clipped": false},
{"name": "length",
"min_length": 20}]
Then run the pipeline defined above in the JSON file on input files sample2_R1.fastq.zst
and sample2_R2.fastq.zst
:
readknead -fq_fnames_r1 "sample2_R1.fastq.zst" \
-fq_fnames_r2 "sample2_R2.fastq.zst" \
-fq_command_in "zstdcat" \
-fq_path_out "output" \
-fq_fname_out_r1 "sample2_[DPX]_R1.fastq.zst" \
-fq_fname_out_r2 "sample2_[DPX]_R2.fastq.zst" \
-fq_command_out "zstdcat" \
-ops_r1_path "demultiplex.json" \
-report_path "output/preparing_report.json" \
-label "WT replicate 2" \
-ascii_min "33" \
-stats_in_path "output/stats_in" \
-stats_out_path "output/stats_out" \
-num_worker 4
- Input
-fq_fnames_r1
Path to read 1 FASTQ files (comma separated)-fq_fnames_r2
Path to read 2 FASTQ files (comma separated)-fq_command_in
Command line to execute for opening each input file (comma separated)-buf_size
Buffer IO size (default 41943040)
- Output
-fq_path_out
Path to output FASTQ files-fq_fname_out_r1
Output read 1 FASTQ file-fq_fname_out_r2
Output read 2 FASTQ file-fq_command_out
Command line to execute for opening each output file (comma separated)
- Pipeline
-ops_r1
Operation(s) for read1-ops_r1_path
Path to operation(s) for read1-ops_r2
Operation(s) for read2-ops_r2_path
Path to operation(s) for read2
- Pipeline report
-report_path
Write report to path (stdout with -)
- Plotting
- Activate plotting for input and/or output:
-stats_in_path
Path to statistics of input FASTQ files-stats_out_path
Path to statistics of output FASTQ files
- Configure plotting and read characteristics:
-label
Label-ascii_min
ASCII coding of the minimum quality score (Ex: 33 for Phred+33) (default: 33)-max_quality
Highest nucleotide quality (default: 43)-max_read_length
Maximum read length. No precision required (default: 1000)
- Activate plotting for input and/or output:
- Other options
-num_worker
Number of worker(s) (default 1)-verbose
Verbose-verbose_level
Verbose level. This option is useful for testing pipelines.-version
Print version and quit
Operation | Parameter | Type | Default | |
---|---|---|---|---|
clip | length | integer | Number of nucleotide to clip | |
end | integer | End of read to clip: 5 or 3 | ||
add_clipped | boolean | false | Copy clipped nucleotide to read name (#-prefixed) | |
add_separator | boolean | true | Add prefix (#) before clipped sequence in read name | |
demultiplex | barcodes | []strings | List of barcode sequences | |
end | integer | End of read to clip: 5 or 3 | ||
barcode_idx | integer | Index (first: 0) of #-prefixed sequence (barcode or UMI) in read name | ||
max_mismatch | integer | 0 | Maximum number of mismatch between read and barcode | |
length_ligand | integer | 0 | Clip if barcode found | |
length | min_length | integer | -1 | Minimum read length |
max_length | integer | -1 | Maximum read length | |
random | probability | float | 1. | Probability to keep read (between 0 and 1) |
rename | new_name | string | New read name | |
base36 | boolean | false | Convert read number to shorter base36 | |
keep_barcode | boolean | false | Keep #-prefixed sequences | |
merge_barcode | boolean | false | Merge #-prefixed sequences to one | |
all_reads | boolean | true | Rename all reads | |
trim | sequence | string | Sequence to trim (for pair-end reads: downstream sequence) | |
sequences | []strings | Use for multiple-sequence trimming | ||
sequence_paired | string | Upstream sequence to trim for paired-end reads | ||
sequences_paired | []strings | Use for multiple-sequence paired-end reads trimming | ||
add_trimmed | boolean | false | Copy trimmed nucleotide to read name (#-prefixed) | |
add_trimmed_ref | boolean | false | Copy reference trimming sequence to read name (#-prefixed) | |
add_separator | boolean | true | Add prefix (#) before trimmed sequence in read name | |
algo | string | bktrim or bktrim_paired | Algorithms: align, bktrim, bktrim_paired, search or match | |
end | integer | End of read to trim : 5 or 3 (only for align, bktrim and search algo ) |
||
min_sequence | integer | 0 | Length of perfect match (starting at trimming position) in trimming alignment | |
min_score | float | 0.8 | Minimum alignment score (only for align, search and match algo ) |
|
position | integer | 0 | Position in reads to match trimming sequence (only for match algo ) |
|
epsilon | float | 0.1 | Maximum mismatch ratio (only for bktrim and bktrim_paired algo ) |
|
epsilon_indel | float | 0.03 | Maximum indel ratio (only for bktrim and bktrim_paired algo ) |
|
min_overlap | integer | 3 | Minimum overlap length (only for bktrim and bktrim_paired algo ) |
ReadKnead is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).
Copyright © 2017-2023 Charles E. Vejnar