A tool to estimate deltas for sequence sets and answer questions about relative contribution. DandD can be used to address how much new sequence is gained when a sequence collection grows, and to describe how much structural variation is discovered in each new human genome assembly, allowing prediction of when discoveries will level off in the future. DandD uses a measure called
Jessica K. Bonnie, Omar Y. Ahmed, Ben Langmead. [DandD: efficient measurement of sequence growth and similarity](https://www.sciencedirect.com/science/article/pii/S258900422400275X). iScience. doi:10.1016/j.isci.2024.109054
git clone https://github.com/jessicabonnie/dandd
cd dandd
chmod +x lib/dandd # make dandd executable
# NOTE: The following aliases dandd for the current session only. To make it persist, add the alias line to your .bashrc
alias dandd="$(pwd)/lib/dandd"
Note: The version of Dashing currently available on bioconda is not recent enough. Please obtain the latest binary release for your system from https://github.com/dnbaker/dashing-binaries and move it to a location in your $PATH. Alternatively, you can install the latest version from source by cloning it from GitHub (https://github.com/dnbaker/dashing):
git clone --recursive https://github.com/dnbaker/dashing
cd dashing && make dashing
If you wish to use the --exact flag to retrieve the actual k-mer counts rather than using Dashing to estimate them, you will need to install kmc (with kmc_tools) and make sure that it is in your $PATH. You can install it through bioconda with the following command:
conda install -c bioconda kmc
This version of DandD uses GNU parallel to create sketches for different values of k simultaneously. Make sure it is available on your system, either by loading the module on your server or by installing it on your computer:
conda install -c bioconda parallel
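Before a first run it can help to confirm that the external tools named above are discoverable. The following is a minimal sketch, assuming the executables are called `dashing`, `parallel`, `kmc`, and `kmc_tools` on your system; the helper name `missing_tools` is hypothetical and not part of DandD.

```python
import shutil

# Executable names taken from the installation notes; adjust if yours differ.
REQUIRED_TOOLS = ("dashing", "parallel")
OPTIONAL_TOOLS = ("kmc", "kmc_tools")  # only needed when using --exact


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for label, tools in (("required", REQUIRED_TOOLS), ("optional", OPTIONAL_TOOLS)):
        missing = missing_tools(tools)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

This only checks that the executables exist; it does not check that the Dashing version is recent enough.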
The tool will write summary files to the current working directory or to the output directory provided. If a sketch directory is not provided, it will be sought or created within the output directory. The sketch directory will contain (1) files mapping the sketch/database names to the component inputs and (2) subdirectories of sketches/dbs arranged by number of genomes and k-mer length.
outputdir /
- |_ deltatree : pickle to be used internally during later calls to DandD
- |_ deltas and summary tables : user readable tables named using tag/labels provided
- |_ sketchdb.txt tracking file : user readable table tracking locations and inputs for sketches and database files relevant to particular experiment (prefixed accordingly)
- sketchdb /
  - |_ dandd_fastahex, dandd_sketchinfo, dandd_cardinalities : internal pickles tracking name assignments and cardinalities.
  - ngen[#] / : one directory for each number of genomes/assemblies/fastas unioned (starting at 1 for single fasta sketches/databases)
    - k[#] / : one directory for each k explored
      - |_ dashing-sketch
      - |_ kmc-database files
    - k[#] / : one directory for each k explored
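To see which genome counts and k values already have sketches or databases, the layout above can be walked with a few lines of Python. This is a sketch under the assumption that the directories are literally named `ngen<N>` and `k<K>` (as in `sketchdb/ngen1`); the function name `list_sketch_dirs` is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming: ngen1, ngen2, ... and k12, k14, ...
NGEN_DIR = re.compile(r"^ngen(\d+)$")
K_DIR = re.compile(r"^k(\d+)$")


def list_sketch_dirs(sketchdb):
    """Return sorted (ngen, k, path) tuples for each sketchdb/ngen<N>/k<K> directory."""
    found = []
    for ngen_dir in Path(sketchdb).iterdir():
        ngen_match = NGEN_DIR.match(ngen_dir.name)
        if not (ngen_dir.is_dir() and ngen_match):
            continue  # skip tracking pickles and anything else at the top level
        for k_dir in ngen_dir.iterdir():
            k_match = K_DIR.match(k_dir.name)
            if k_dir.is_dir() and k_match:
                found.append((int(ngen_match.group(1)), int(k_match.group(1)), k_dir))
    return sorted(found)
```

Listing the `ngen1` entries this way is a quick check of whether an existing sketch directory is worth reusing for a new experiment.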
A few command options pertain to the program in general. These flags must be provided before the tree, kij, etc. subcommands.
- --debug: print the command for each call to Dashing or KMC so that you can run it directly and see what error it returns.
- --fast: avoid writing out intermediate copies of the reference files. This means that cardinalities and path information will not be saved if DandD is terminated prematurely.
- --safe: double-check all sketch hashes to make sure they match the sums of the fasta hashes. This is time-consuming, but is recommended if you suspect that the fasta files have changed.
- --verbose: un-suppress progress messages from DandD. Currently, there is no way to suppress output from Dashing and KMC, regardless of this flag.
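When DandD is driven from a script, the ordering rule (program-wide flags first, then the subcommand) is easy to get wrong. The sketch below assembles an argument list in the right order; `build_dandd_argv` is a hypothetical helper, and the paths and tag are placeholders.

```python
# Program-wide flags described in this section.
GLOBAL_FLAGS = {"--debug", "--fast", "--safe", "--verbose"}


def build_dandd_argv(subcommand, sub_args, global_flags=()):
    """Assemble an argv list with program-wide flags ahead of the subcommand."""
    unknown = set(global_flags) - GLOBAL_FLAGS
    if unknown:
        raise ValueError(f"not a program-wide flag: {sorted(unknown)}")
    return ["dandd", *global_flags, subcommand, *sub_args]


argv = build_dandd_argv(
    "tree",
    ["--datadir", "genomes/", "--tag", "demo"],  # placeholder path and tag
    global_flags=["--debug"],
)
print(" ".join(argv))  # dandd --debug tree --datadir genomes/ --tag demo
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once `dandd` is on your `$PATH`.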
The tree command is the first step in performing any analysis. It creates a pickle with the structure that the other DandD commands expect as input for answering downstream questions. It requires either a file containing full filepaths to fastas (--fastas <path>, -f <path>) or a data directory (--datadir <path>) containing all fastas of interest.
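If the fastas of interest are only a subset of a directory, a list file for --fastas can be generated rather than typed by hand. This is a minimal sketch that assumes the list file holds one absolute path per line and that the listed suffixes match your files; both are assumptions, and `write_fasta_list` is a hypothetical helper.

```python
from pathlib import Path

# Assumed suffixes; extend this to match how your FASTA files are named.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna")


def write_fasta_list(datadir, list_path):
    """Write the absolute path of each FASTA in `datadir` to `list_path`, one per line."""
    fastas = sorted(
        p.resolve()
        for p in Path(datadir).iterdir()
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in fastas))
    return fastas
```

Filter the returned list before writing if only some files should enter the tree.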
- --outdir <path> or -o <path>: output directory (default: current working directory).
- --sketchdir <sketchdb path> or -c <sketchdb path>: path to the sketch directory. If the sketches for tree already exist in a sketchdb directory other than the one inside outdir, the sketchdir can be provided to reuse that previous work and save time and space. This is especially important for the sketches in sketchdb/ngen1, which take the most time and space when using Dashing.
Additionally:
- -s <string> or --tag <string>: string used in file name prefixes to name the non-sketch output files for a particular experiment. If no tag is provided, "dandd" will be used by default.
- -k <int>: set the starting value in the search for the optimal k for delta. As demonstrated in the DandD paper, the optimal k-mer length for delta changes based on characteristics of the data. It is therefore beneficial to adjust the starting value of k based on prior information, to avoid searching excessively high or low values.
- --exact: use KMC for k-mer counting (instead of estimation through Dashing sketching) to calculate delta.
- --non-canon: disable canonicalization. By default, DandD canonicalizes k-mers (i.e. treats a k-mer and its reverse complement as equivalent).
- --nchildren <int> or -n <int>: maximum number of children for each union. By default, DandD builds trees with a single layer of child nodes and one single union of all inputs (nicknamed 'spiders'). With --nchildren, the tree is instead built by calculating delta for a series of smaller unions, each with at most the indicated number of children.
- --registers <int>: adjust the number of registers used during sketching. For information about the implementation and trade-offs of adjusting this value, see Dashing's documentation.
- --lowmem: delete union sketches/databases after storing their cardinalities, and use the stored values even when the sketch is missing. The path information etc. for the sketch will remain stored in the internal files such as dandd_fastahex and dandd_sketchinfo, as well as in the sketchdb.txt for the experiment. Note: It is not recommended to use this flag the first time the databases are initialized.
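To make the roles of k and canonicalization concrete, here is a toy, exact computation of delta on short strings. It is not how DandD works internally (DandD delegates counting to Dashing or KMC and searches outward from a starting k); this sketch scans every k in a range exhaustively, assumes uppercase A/C/G/T input, and picks the lexicographically smaller strand as the canonical form, which is one common convention.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def distinct_kmers(seqs, k, canon=True):
    """Count distinct k-mers of length k across all sequences."""
    seen = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            seen.add(canonical(kmer) if canon else kmer)
    return len(seen)


def delta(seqs, min_k=1, max_k=32, canon=True):
    """Return (delta, argmax_k), maximizing cardinality / k over min_k..max_k."""
    candidates = {k: distinct_kmers(seqs, k, canon) / k for k in range(min_k, max_k + 1)}
    best_k = max(candidates, key=candidates.get)  # smallest k wins ties
    return candidates[best_k], best_k


print(delta(["ACGT"], max_k=4, canon=False))  # (4.0, 1)
print(delta(["ACGT"], max_k=4, canon=True))   # (2.0, 1)
```

With canonicalization, A/T and C/G collapse into one k-mer each at k=1, which is why the second value is half the first.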
Result files are named using a prefix composed of: the provided tag string, the total number of fasta files used to produce them, and the tool used (kmc or dashing).
_deltas.csv
The tree command outputs the deltas for the component fastas and their full union, as well as any deltas of intermediate unions if the --step <int> argument is used, in addition to producing the pickle file used by the other subcommands. The field order is subject to change, but the columns contain the following:
- ngen - the number of fasta files in the union
- title - the basenames of each fasta separated by underscores
- sketchloc - the path to the sketch whose cardinality was used to find delta
- fastas - the full paths to the component fastas separated by "|"
- card - the cardinality (count or estimated count) of unique k-mers of length k
- delta - the value of delta for the union
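Because the field order may change, it is safer to read this file by column name than by position. The following sketch assumes the header names match the field names listed above exactly; `read_deltas` is a hypothetical helper and the sample row is made up.

```python
import csv
import io


def read_deltas(handle):
    """Parse a *_deltas.csv stream into dicts with typed ngen, card, and delta."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "ngen": int(row["ngen"]),
            "title": row["title"],
            "sketchloc": row["sketchloc"],
            "fastas": row["fastas"].split("|"),  # component paths are "|"-separated
            "card": float(row["card"]),          # may be an estimate, so keep as float
            "delta": float(row["delta"]),
        })
    return rows


# Made-up sample in an arbitrary column order.
sample = io.StringIO(
    "delta,ngen,title,card,sketchloc,fastas\n"
    "125.5,2,a_b,2510,/db/ngen2/k20/a_b,/data/a.fa|/data/b.fa\n"
)
for row in read_deltas(sample):
    print(row["title"], row["delta"])  # a_b 125.5
```

For a real file, pass `open(path, newline="")` in place of the `StringIO` sample.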
The progressive command performs a series of cumulative (or progressive) unions over a number (--norder) of orderings of the input fastas. It requires --dtree / -d <tree.pickle>, where <tree.pickle> is an output of the tree command. It will automatically use the same values for --lowmem, --step, --sketchdir, --non-canon, and --registers that are embedded in that delta-tree output. If only a subset of the original fastas in the tree are of interest, a subset list can be provided using --fastas / -f <filepath>. If orderings have previously been produced for this tree, those orderings will be reused, with additional orderings generated if --norder is greater than the number of orderings already generated. If an ordering file is to be shared across experiments/tags, a preexisting ordering file can be used via --orderings / -r <ordering pickle>. If only one deterministic ordering is desired, provide it via the fasta file and use --norder 1.
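The idea of cumulative unions over several orderings can be illustrated in a few lines. This is only a conceptual sketch: how DandD actually generates, stores, and de-duplicates orderings is not described here, and the function names and the use of a seeded random shuffle are assumptions.

```python
import random


def make_orderings(fastas, norder, seed=0):
    """Return `norder` random permutations of `fastas` (reproducible via `seed`)."""
    rng = random.Random(seed)
    return [rng.sample(fastas, len(fastas)) for _ in range(norder)]


def cumulative_unions(ordering):
    """Return the growing prefixes of an ordering: first 1 input, then 2, and so on."""
    return [ordering[:n] for n in range(1, len(ordering) + 1)]


for prefix in cumulative_unions(["a.fa", "b.fa", "c.fa"]):
    print(len(prefix), prefix)
```

Each prefix corresponds to one union whose delta is measured, so an ordering of n fastas yields n points on a growth curve.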
--ksweep: populate all of the possible delta values across a range of k values for each sketch/database combination in the progressive union. --mink <int> and --maxk <int> bound that range, with default [2,32]. Note: for reasons internal to Dashing, maxk must be <= 32 for estimation. If you want to sweep higher values of k, you must use KMC via the --exact flag. If --ksweep is set, the progressive output will not seek or produce the delta value for each union, but will instead provide the possible delta values (output field: delta_pos) for each union at each specified k in the range.
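The range constraint can be checked before launching a long sweep. The sketch below assumes the [mink, maxk] range is inclusive at both ends; the helper names are hypothetical and the check is done outside DandD, which may report the problem differently.

```python
DASHING_MAX_K = 32  # limit stated for estimation with Dashing


def sweep_ks(mink=2, maxk=32, exact=False):
    """Return the k values a sweep would cover, enforcing the Dashing limit."""
    if mink < 1 or mink > maxk:
        raise ValueError(f"need 1 <= mink <= maxk, got [{mink}, {maxk}]")
    if not exact and maxk > DASHING_MAX_K:
        raise ValueError(f"maxk={maxk} exceeds {DASHING_MAX_K}; use --exact (KMC) instead")
    return list(range(mink, maxk + 1))


def delta_pos(card, k):
    """Candidate delta for one union at one k: cardinality divided by k."""
    return card / k


print(len(sweep_ks()))     # 31
print(delta_pos(100, 4))   # 25.0
```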
Result files are named using a prefix composed of: the provided tag string, the total number of fasta files used to produce them, progu<the number of orderings (from --norder, or zero if the entirety of the default ordering file is used)>, and the tool used (kmc or dashing).
.csv
If --ksweep is not included, this file will contain the progressive outputs of the deltas for the component fastas for each union in each ordering permutation. If --ksweep is provided but the argmax-k is not in [mink, maxk], the values in this file will not be meaningful. The field order is subject to change, but the columns contain the following:
- ordering - the index of the ordering (one-based numbering)
- ngen - the number of fasta files in the union; also the index within the ordering
- title - the basenames of each fasta separated by underscores
- sketchloc - the path to the sketch whose cardinality was used to find delta
- fastas - the full paths to the component fastas separated by "|"
- card - the cardinality (count or estimated count) of unique k-mers of length k
- delta - the value of delta for the union
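A common next step is to average delta over orderings at each union size, giving one growth curve for the collection. The sketch below works on rows as produced by `csv.DictReader` and assumes the `ngen` and `delta` header names listed above; `mean_delta_by_ngen` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_delta_by_ngen(rows):
    """Average delta over orderings for each union size ngen."""
    by_ngen = defaultdict(list)
    for row in rows:
        by_ngen[int(row["ngen"])].append(float(row["delta"]))
    return {ngen: mean(values) for ngen, values in sorted(by_ngen.items())}


# Made-up rows standing in for two orderings of two fastas.
rows = [
    {"ordering": "1", "ngen": "1", "delta": "10"},
    {"ordering": "2", "ngen": "1", "delta": "14"},
    {"ordering": "1", "ngen": "2", "delta": "20"},
    {"ordering": "2", "ngen": "2", "delta": "22"},
]
print(mean_delta_by_ngen(rows))  # {1: 12.0, 2: 21.0}
```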
_summary.csv
This file contains all of the possible delta values across all values of k scanned (whether during the search for delta or via --ksweep) during every step of every ordering. The field order is subject to change, but the columns contain the following:
- ordering - the index of the ordering (one-based numbering)
- ngen - the number of fasta files in the union; also the index within the ordering
- kval - the k-value (k-mer length) that was used to find the cardinality and the delta-pos
- title - the basenames of each fasta in the union separated by underscores
- command - the bash command used to produce the sketch in case you wish to run the individual command (n.b the component ngen1 sketches will need to still be at the indicated location)
- card - the cardinality (count or estimated count) of unique k-mers of length kval
- delta_pos - the possible value of delta for this union at this kval (i.e. card/k)
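Since delta is the largest delta_pos over k, the summary table is enough to recover delta and its argmax-k for each union, and to flag unions whose best k lies on the edge of the scanned range (a hint that the true maximum may sit outside it). This is a sketch over `csv.DictReader`-style rows with the header names listed above; the function names are hypothetical.

```python
def best_k_per_union(rows):
    """Map (ordering, ngen) to (delta, kval) by maximizing delta_pos over kval."""
    best = {}
    for row in rows:
        key = (int(row["ordering"]), int(row["ngen"]))
        candidate = (float(row["delta_pos"]), int(row["kval"]))
        if key not in best or candidate[0] > best[key][0]:
            best[key] = candidate
    return best


def argmax_on_boundary(best, mink, maxk):
    """Return unions whose best k sits on the edge of the scanned range."""
    return [key for key, (_, k) in best.items() if k in (mink, maxk)]
```

Unions reported by `argmax_on_boundary` are candidates for a rerun with a wider --mink/--maxk range.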
The kij command produces pairwise deltas across all of the input fastas in the tree.pickle in order to compute K-Independent Jaccards. It also calculates values for pairwise Jaccard within the range provided using --mink and --maxk. To write out those pairwise values, use --jaccard. It requires --dtree / -d <tree.pickle>, where <tree.pickle> is an output of the tree command.
Result files are named using a prefix composed of: the provided tag string, the total number of fasta files used to produce them, progu<the number of orderings (from --norder, or zero if the entirety of the default ordering file is used)>, and the tool used (kmc or dashing).
.kij.csv
ABk,Btitle,ABdelta,Atitle,Bdelta,A,Bk,Adelta,Ak,KIJ,B
This file contains the values for the k-Independent Jaccard similarity metric.
- <A,B> - file path of FASTA <A,B>
- <A,B>title - stripped sample id from FASTA <A,B>
- <A,B>delta - value of delta for <A,B> alone
- <A,B>k - argmax-k used to calculate <A,B>delta
- ABdelta - value of delta for union of A and B
- ABk - argmax-k used to calculate ABdelta
- KIJ - value of k-Independent Jaccard similarity metric
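The relationship between the delta columns and the KIJ column can be sketched as follows. This assumes KIJ has the usual Jaccard inclusion-exclusion form with delta substituted for set size, i.e. (Adelta + Bdelta - ABdelta) / ABdelta; treat that formula as an assumption to be checked against the paper and the KIJ column of your own output. The function name is hypothetical.

```python
def k_independent_jaccard(a_delta, b_delta, ab_delta):
    """Jaccard-style ratio with delta in place of set size (assumed form)."""
    if ab_delta <= 0:
        raise ValueError("delta of the union must be positive")
    # Inclusion-exclusion estimate of the shared part, divided by the union.
    return (a_delta + b_delta - ab_delta) / ab_delta


print(k_independent_jaccard(10, 10, 10))  # 1.0 (identical inputs)
print(k_independent_jaccard(10, 10, 20))  # 0.0 (nothing shared)
```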
.j.csv
This file contains the values for the standard Jaccard similarity metric if the --jaccard flag and mink/maxk values are provided.
_AFtuples.pickle
This file contains a pickle that can be passed to unsupported functions in helpers/afproject.py.