Merge branch 'release/v2.2.10'

gmarcais · Apr 16, 2018 · da91420 · da91420
2 parents 8bd8e23 + d7fa74a
commit da91420
Show file tree

Hide file tree

Showing 10 changed files with 458 additions and 81 deletions.
diff --git a/Makefile.simple b/Makefile.simple
diff --git a/configure.ac b/configure.ac
@@ -1,4 +1,4 @@
-AC_INIT([jellyfish], [2.2.9], [gmarcais@umd.edu])
+AC_INIT([jellyfish], [2.2.10], [gmarcais@umd.edu])
 AC_CANONICAL_HOST
 AC_CONFIG_MACRO_DIR([m4])
 AM_INIT_AUTOMAKE([subdir-objects foreign parallel-tests color-tests])
@@ -156,6 +156,9 @@ AM_COND_IF([PERL_BINDING],
            [AS_IF([test x$enable_perl_binding != xyes], [PERL_EXT_LIB=$enable_perl_binding])]
            [AX_PERL_EXT([$prefix])])
 
+# Check for aligned memory allocation functions
+AC_CHECK_FUNCS([posix_memalign aligned_alloc])
+
 AC_OUTPUT
 
 
diff --git a/doc/Readme.md b/doc/Readme.md
@@ -0,0 +1,334 @@
+# Getting started
+
+
+## Counting all k-mers
+
+
+The basic command to count all k-mers is as follows:
+
+```
+jellyfish count -m 21 -s 100M -t 10 -C reads.fasta
+```
+
+This will count canonical (`-C`) 21-mers (`-m 21`), using a hash with
+100 million elements (`-s 100M`) and 10 threads (`-t 10`) in the
+sequences in the file `reads.fasta`. The output is written in the file
+`mer counts.jf` by default (change with `-o` switch).
+
+To compute the histogram of the k-mer occurrences, use the
+`histo subcommand (see section [histo](#histo)):
+
+```
+jellyfish histo mer_counts.jf
+```
+
+To query the counts of a particular k-mer, use the
+`query` subcommand (see section [query](#query)):
+
+```
+jellyfish query mer_counts.jf AACGTTG
+```
+
+To output all the counts for all the k-mers in the file, use the
+`dump` subcommand (see section [dump](#dump)):
+
+```
+jellyfish dump mer_counts.jf > mer_counts_dumps.fa
+```
+
+To get some information on how, when and where this jellyfish file was
+generated, use the `info` subcommand (see section [info](#info)):
+
+```
+jellyfish info mer_counts.jf
+```
+
+For more detail information, see the relevant sections in this
+document.  All commands understand `--help` and will produce some
+information about the switches available.
+
+### Counting k-mers in sequencing reads
+
+In sequencing reads, it is unknown which strands of the DNA is
+sequenced. As a consequence, a k-mer or its reverse complement are
+essentially equivalent. The canonical representative of a k-mer `m` is
+by definition `m` or the reverse complement of `m`, whichever comes
+first lexicographically. The `-C` switch instructs to save in the hash
+only canonical k-mers, while the count is the number of occurrences
+of both a k-mer and it reverse complement.
+
+The size parameter (given with `-s`) is an indication of the number
+k-mers that will be stored in the hash. For sequencing reads, this
+size should be the size of the genome plus the k-mers generated by
+sequencing errors. For example, if the error rate is e (e.g.Illumina
+reads, usually e~1%), with an estimated genome size of G and
+a coverage of c, the number of expected k-mers is G+Gcek.
+
+> NOTE: unlike in Jellyfish 1, this `-s` parameter is only
+> an estimation. If the size given is too small to fit all the k-mers,
+> the hash size will be increased automatically or partial results will
+> be written to disk and finally merged automatically. Running
+> `jellyfish merge` should never be necessary, as now
+> jellyfish now takes care of this task on its own.
+
+If the low frequency k-mers (k-mers occurring only once), which are
+mostly due to sequencing errors, are not of interest, one might
+consider counting only high-frequency k-mers (see
+section [Counting high frequency k-mers](#Counting-high-frequency-k-mers)),
+which uses less memory and is potentially faster.
+
+### Counting k-mers in a genome
+
+In an actual genome or finished sequence, a k-mer and its reverse
+complement are not equivalent, hence using the `-C` switch does not
+make sense. In addition, the size for the hash can be set directly to
+the size of the genome.
+
+## Counting high-frequency k-mers
+
+Jellyfish offers two way to count only high-frequency k-mers (meaning
+only k-mers with count >1), which reduces significantly the memory
+usage. Both methods are based on using Bloom filters. The first method
+is a one pass method, which provides approximate count for some
+percentage of the k-mers. The second method is a two pass method which
+provides exact count. In both methods, most of the low-frequency
+k-mers are not reported.
+
+### One pass method
+
+Adding the `--bf-size` switch make jellyfish first insert all k-mers
+first into a Bloom filter and only insert into the hash the k-mers
+which have already been seen at least once. The argument to
+`--bf-size` should the total number of k-mer expected in the data set
+while the `--size` argument should be the number of $k$-mers occurring
+more than once. For example:
+
+```
+jellyfish count -m 25 -s 3G --bf-size 100G -t 16 homo_sapiens.fa
+```
+
+would be appropriate for counting 25-mers in human reads at 30x
+coverage. The approximate memory usage is 9 bits per k-mer in the
+Bloom filter.
+
+The count reported for each k-mer (by `jellyfish dump` or `jellyfish
+query`) is one less than the actual count.  Meaning, the count 1 k-mer
+are not reported, count 2 k-mer are reported to have count 1, etc.
+
+The drawback of this method is some percentage of the $k$-mer that
+should not be reported (because they occur only once) are
+reported. This is due to the random nature of the Bloom filter data
+structure. The percentage is <1% by default and can be changed with
+the `--bf-fp switch.
+
+### Two passes method
+
+In the two pass method, first a Bloom counter is created from the
+reads with `jellyfish bc`. Then this Bloom counter is given to the
+`jelllyfish count` command and only the k-mers which have been seen
+twice in the first pass will be inserted in the hash. For example,
+with a human data set similar that in
+section [One pass method](#One-pass-method):
+
+```
+jellyfish bc -m 25 -s 100G -t 16 -o homo_sapiens.bc homo_sapiens.fa
+jellyfish count -m 25 -s 3G -t 16 --bc homo_sapiens.bc homo_sapiens.fa
+```
+
+The advantage of this method is that the counts reported for the
+k-mers are all correct. Most count 1 k-mer are not reported, except
+for a small percentage (set by the false positive rate, `-f` switch of
+the `bc` subcommand) of them which are reported (correctly with count
+1). All other k-mers are reported with the correct count.
+
+The drawback of this method is that it requires to parse the entire
+reads data set twice and the memory usage of the Bloom counter is
+greater than that of the Bloom filter (slightly less than twice as
+much).
+
+# FAQ
+
+## How to read compressed files (or other format)?
+
+Jellyfish only reads FASTA or FASTQ formatted input files. By reading
+from pipes, jellyfish can read compressed files, like this:
+
+```
+zcat *.fastq.gz | jellyfish count /dev/fd/0 ...
+```
+
+or by using the `<()` redirection provided by the shell (e.g. bash, zsh):
+
+```
+jellyfish count <(zcat file1.fastq.gz) <(zcat file2.fasta.gz) ...
+```
+
+## How to read multiple files at once?
+
+Often, jellyfish can parse an input sequence file faster than `gzip`
+or `fastq-dump` (to parse SRA files) can output the sequence. This
+leads to many threads in jellyfish going partially unused. Jellyfish
+can be instructed to open multiple file at once. For example, to read
+two short read archive files simultaneously:
+
+```
+jellyfish count -F 2 <(fastq-dump -Z file1.sra) <(fastq-dump -Z file2.sra) ...
+```
+
+Another way is to use "generators". First, create a file containing,
+one per line, commands to generate sequence. Then pass this file to
+jellyfish and the number of generators to run
+simultaneously. Jellyfish will spawn subprocesses running the commands
+passed and read their standard output for sequence. By default, the
+commands are run using the shell in the SHELL environment variable,
+and this can be changed by the `-S` switch. Multiple generators will
+be run simultaneously as specified by the `-G` switch. For example:
+
+```
+ls *.fasta.gz | xargs -n 1 echo gunzip -c > generators
+jellyfish count -g generators -G 4 ...
+```
+
+The first command created the command list into the 'generators' file,
+each command unzipping one FASTA file in the current directory. The
+second command runs jellyfish with 4 concurrent generators.
+
+## How to reduce the output size?
+
+The output file was design to be easy to read, but the file generated
+can be rather large. By default, a 4 bytes counter value is saved for
+every k-mer (i.e.a maximum count of over 4 billion).  Instead, a
+counter size of 2 bytes or 1 byte can be used with the switch
+`--out-counter-len`, which reduces significantly the output size.
+
+The count of k-mers which cannot be represented with the given number
+of bytes will have a value equal to the maximum value that can be
+represented. Meaning, if the counter field uses 1 byte, any
+k-mers with count greater or equal to 255 will be reported of having
+a count of 255.
+
+Also, low frequency and high frequency k-mers can be skipped using the
+`--L and `--U` switches respectively. Although it might be more
+appropriate to filter out the low frequency $k$-mers using Bloom
+filters, as shown in
+section [Counting-high-frequency-k-mers](#Counting-high-frequency-k-mers).
+
+## How much memory is needed?
+
+The memory needed to count k-mers, given the various parameters of the
+`count` subcommand is obtained by using the
+`mem` subcommand. It understand all the same switches, but
+only the `--mer-len` (`-m`), `--size` (`-s`),
+`--counter-len` (`-c`) and `--reprobes` (`-p`)
+switches are taken into account.
+
+For example, for 24-mers, with an initial hash size of 1 billion, and
+default parameters otherwise, the memory usage is:
+
+```
+jellyfish mem -m 24 -s 1G
+4521043056 (4G)
+```
+
+Conversely, if the `--size` switch is not given but the
+`--mem` switch is, then the maximum initial hash size that would
+fit in the given memory is returned. For example, this is the maximum
+hash size for 31-mers in 8 Gb of RAM:
+
+```
+jellyfish mem -m 31 --mem 8g
+1073741824 (1G)
+```
+
+# Subcommands
+
+
+## histo
+
+The `histo` subcommand outputs the histogram of k-mers
+frequencies. The last bin, with value one above the high setting set
+by the `h` switch (10000 by default), is a catch all: all k-mers with
+a count greater than the high setting are tallied in that one bin. If
+the low setting is set (`-l` switch), then the first bin, with value
+one below the low setting, is also similarly a catch all.
+
+By default, the bins with a zero count are skipped. This can be changed
+with the `f` switch.
+
+## dump
+
+The `dump` subcommand outputs a list of all the k-mers in
+the file associated with their count. By default, the output is in FASTA
+format, where the header line contains the count of the k-mer and the
+sequence part is the sequence of the k-mer. This format has the
+advantage that the output contains the sequence of $k$-mers and can be
+directly fed into another program expecting the very common FASTA
+format. A more convenient column format (for human beings) is selected
+with the `-c` switch.
+
+Low frequency and high frequency k-mers can be skipped with the `-L`
+and `-U` switches respectively.
+
+In the output of the `dump` subcommand, the k-mers are sorted
+according to the hash function used by Jellyfish. The output can be
+considered to be "fairly pseudo-random". By "fairly" we mean that NO
+guarantee is made about the actual randomness of this order, it is
+just good enough for the hash table to work properly. And by
+"pseudo-random" we mean that the order is actually deterministic:
+given the same hash function, the output will be always the same and
+two different files generated with the same hash function can be
+merged easily.
+
+## query
+
+The `query` subcommand outputs the k-mers and their counts for some
+subset of k-mers. It will outputs the counts of all the k-mers
+passed on the command line or of all the k-mers in the sequence read
+from the FASTA or FASTQ formatted file passed to the switch
+`-s` (this switch can be given multiple times).
+
+## info
+
+The `info` subcommand outputs some information about the jellyfish
+file and the command used to generated it, in which directory and at
+what time the command was run. Hopefully, the information given should
+be enough to rerun jellyfish under the same conditions and reproduce
+the output file. In particular, the `-c` switch outputs the command,
+properly escaped and ready to run in a shell.
+
+The header is saved in JSON format and contains more information than is
+written by the default. The full header in JSON format can be written
+out using the `-j` switch.
+
+## merge
+
+The `merge` subcommand is of little direct use with version version 2
+of Jellyfish. When intermediary files were written to disk, because
+not all k-mers would fit in memory, they can be merged into one file
+containing the final result with the `merge` subcommand. The `count`
+subcommand will merge intermediary files automatically as needed.
+
+## stats
+
+The `stats` subcommand computes some statistics about the
+mers. Although these statistics could be computed from the histogram, it
+provides quick summary information. The fields are:
+
+* _Unique_: The number of k-mer occuring exactly once
+* _Distinct_: The number of k-mers, ignoring their multiplicity
+    (i.e.the cardinality of the set of $k$-mers)
+* _Total_: The number of k-mers with multiplicity (i.e.the sum of the
+    number of occurence of all the mers)
+* _Max\_count_: The maximum of the number of occurrences
+
+## mem
+
+The `mem` subcommand shows how much memory a count subcommand will
+need or conversely how large of a hash size will fit in a given amount
+of memory.
+
+## cite
+
+The `cite` subcommand prints the citation for the jellyfish
+paper. With the `-b`, it is formatted in Bibtex format. How
+convenient!