seqator
Seqator is a binning script. It bins sequences in fasta format that have SPAdes-like headers, and sequence-containing files that have SPAdes-like file names.
“SPAdes-like” means the following format:
NODE_1_length_61704_cov_114.517
Seqator can bin by sequence length and by coverage (the length and cov fields of a SPAdes-like name).
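As an illustration of the format, the length and cov fields can be pulled out of a SPAdes-like name with a regular expression. This is a minimal sketch, not seqator's actual code: the pattern and the parse_spades_name helper are assumptions made for this example.

```python
import re

# Assumed pattern for names such as NODE_1_length_61704_cov_114.517.
# It is written for this example and is not taken from seqator's source.
SPADES_PATTERN = re.compile(r"NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_spades_name(name):
    """Return (length, coverage) for a SPAdes-like name, or None if it does not match."""
    match = SPADES_PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(2)), float(match.group(3))


print(parse_spades_name(">NODE_1_length_61704_cov_114.517"))  # (61704, 114.517)
```

Because the pattern is searched for rather than anchored, the same helper would accept both a fasta header (with the leading '>') and a file name with an extension.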
Seqator works in two modes, dir and fasta_file:
- If the seqator mode is dir, the script moves files that pass the filter from the input directory to the output directory.
- If the seqator mode is fasta_file, the script copies sequences that pass the filter from the input file to the output file.
The seqator filter is customizable. You can choose:
- Filter parameter: seqator can filter sequences by one of two parameters, coverage or length.
- Filter mode: defines how the sequence parameter is compared to the threshold. E.g. if the filter mode is lt (Less Than), the script will match and copy sequences whose parameter (say, coverage) is less than the threshold (say, 25.0). See the details in the Options section (-f option).
- Threshold: the threshold to filter by.
The script is written in Python, so you need a Python interpreter (version 3.X) to use it. You can download Python from the official Python website.
# Input
-i / --input
Input directory or fasta file, depending on seqator_mode (-m).
An input fasta file may be gzipped.
Mandatory.
-x / --target-file-extention
This option is applicable only if seqator_mode (-m) is 'dir'.
'-x' is the extension of files to be checked, without the preceding dot.
E.g. if you want to bin .fasta files, then specify '-x fasta'.
Optional. Default: 'dna'.
# Output
-o / --output
Output directory or fasta file, depending on '-m'.
If the file name ends with '.gz', the output file will be gzipped.
Optional. By default, the script will create an output directory in the working directory.
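The gzip behaviour described for '-i' and '-o' could be handled with a small wrapper like the one below. It is only a sketch: open_fasta is a hypothetical helper, and treating an input file as gzipped when its name ends with '.gz' is an assumption (the text above states the suffix rule only for the output file).

```python
import gzip


def open_fasta(path, mode="rt"):
    """Open a fasta file in text mode, transparently using gzip for '.gz' names."""
    if path.endswith(".gz"):
        # gzip.open accepts text modes such as "rt" and "wt"
        return gzip.open(path, mode)
    return open(path, mode)
```

With such a helper, reading input.fasta.gz and writing output.fasta (or the reverse) would go through the same code path.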
# Seqator mode
-m / --seqator-mode
There are two modes: 'dir' and 'fasta_file'.
If the mode is 'dir', the script will move files which pass the filter
from the input directory to the output directory.
If the mode is 'fasta_file', the script will copy sequences which pass the filter
from the input file to the output file.
Also, '-m' may be 'auto'. In 'auto' mode, the mode will be
'dir' if '-i' is a directory and 'fasta_file' if '-i' is a regular file.
Optional. Default: 'auto'.
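The 'auto' rule above can be sketched as follows. resolve_mode is a hypothetical helper, and raising an error for a path that is neither a directory nor a regular file is an assumption; seqator's own handling may differ.

```python
import os


def resolve_mode(input_path, mode="auto"):
    """Return 'dir' or 'fasta_file' for the given '-i' path and '-m' value."""
    if mode != "auto":
        # An explicit mode is used as given
        return mode
    if os.path.isdir(input_path):
        return "dir"
    if os.path.isfile(input_path):
        return "fasta_file"
    # Assumed behaviour for this sketch only
    raise ValueError(f"input path does not exist: {input_path}")
```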
# Filter
-p / --filter-parameter
There are two sequence parameters to filter by: 'len' and 'cov':
length and coverage, respectively.
Optional. Default: 'cov'.
-f / --filter-mode
There are basically six ways to compare numbers:
'lt' (Less Than), 'le' (Less or Equal),
'gt' (Greater Than), 'ge' (Greater or Equal),
'eq' (EQual to), 'ne' (Not Equal to).
E.g. if you specify '-p cov -f lt -t 12.5 -m fasta_file', then
the script will copy all sequences having coverage less than 12.5 to the output file.
Optional. Default: 'lt'.
-t / --threshold
Threshold to use for filtering by '-p' parameter.
See the example for the '-f' option, which shows how '-t' is used.
Mandatory.
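One way to picture how '-p', '-f' and '-t' combine is a lookup from filter mode to comparison function. The sketch below uses Python's standard operator module; FILTER_MODES and passes_filter are hypothetical names chosen for this example and do not come from seqator's source.

```python
import operator

# Map each '-f' value to the comparison it stands for
FILTER_MODES = {
    "lt": operator.lt,  # Less Than
    "le": operator.le,  # Less or Equal
    "gt": operator.gt,  # Greater Than
    "ge": operator.ge,  # Greater or Equal
    "eq": operator.eq,  # EQual to
    "ne": operator.ne,  # Not Equal to
}


def passes_filter(value, filter_mode, threshold):
    """Compare a sequence parameter (length or coverage) to the threshold."""
    return FILTER_MODES[filter_mode](value, threshold)


# '-p cov -f lt -t 12.5': a sequence with coverage 9.8 passes, one with 12.5 does not
print(passes_filter(9.8, "lt", 12.5))   # True
print(passes_filter(12.5, "lt", 12.5))  # False
```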
# Help and version
-h / --help
Print help message and exit.
-v / --version
Print version and exit.
dir seqator mode.
Move .dna files having contig coverage greater than 10 from directory indir/ to outdir/.
python3 seqator.py \
-i indir \
-x dna \
-p cov \
-f gt \
-t 10 \
-o outdir
dir seqator mode.
Move .fna files having sequence length equal to 1000 bp from directory indir/ to outdir/.
python3 seqator.py \
-i indir \
-x fna \
-p len \
-f eq \
-t 1000 \
-o outdir
fasta_file seqator mode.
Copy sequences from file input.fasta which have sequence length less than 1000 bp to file output.fasta.gz.
python3 seqator.py \
-i input.fasta \
-p len \
-f lt \
-t 1000 \
-o output.fasta.gz
fasta_file seqator mode.
Copy sequences from file input.fasta.gz which have contig coverage greater than 10 to file output.fasta.
python3 seqator.py \
-i input.fasta.gz \
-p cov \
-f gt \
-t 10 \
-o output.fasta