Skip to content

Commit

Permalink
Merge branch 'release/v2.2.10'
Browse files Browse the repository at this point in the history
  • Loading branch information
gmarcais committed Apr 16, 2018
2 parents 8bd8e23 + d7fa74a commit da91420
Show file tree
Hide file tree
Showing 10 changed files with 458 additions and 81 deletions.
40 changes: 0 additions & 40 deletions Makefile.simple

This file was deleted.

5 changes: 4 additions & 1 deletion configure.ac
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
AC_INIT([jellyfish], [2.2.9], [gmarcais@umd.edu])
AC_INIT([jellyfish], [2.2.10], [gmarcais@umd.edu])
AC_CANONICAL_HOST
AC_CONFIG_MACRO_DIR([m4])
AM_INIT_AUTOMAKE([subdir-objects foreign parallel-tests color-tests])
Expand Down Expand Up @@ -156,6 +156,9 @@ AM_COND_IF([PERL_BINDING],
[AS_IF([test x$enable_perl_binding != xyes], [PERL_EXT_LIB=$enable_perl_binding])]
[AX_PERL_EXT([$prefix])])

# Check for aligned memory allocation functions
AC_CHECK_FUNCS([posix_memalign aligned_alloc])

AC_OUTPUT


334 changes: 334 additions & 0 deletions doc/Readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,334 @@
# Getting started


## Counting all k-mers


The basic command to count all k-mers is as follows:

```
jellyfish count -m 21 -s 100M -t 10 -C reads.fasta
```

This will count canonical (`-C`) 21-mers (`-m 21`), using a hash with
100 million elements (`-s 100M`) and 10 threads (`-t 10`) in the
sequences in the file `reads.fasta`. The output is written in the file
`mer counts.jf` by default (change with `-o` switch).

To compute the histogram of the k-mer occurrences, use the
`histo subcommand (see section [histo](#histo)):

```
jellyfish histo mer_counts.jf
```

To query the counts of a particular k-mer, use the
`query` subcommand (see section [query](#query)):

```
jellyfish query mer_counts.jf AACGTTG
```

To output all the counts for all the k-mers in the file, use the
`dump` subcommand (see section [dump](#dump)):

```
jellyfish dump mer_counts.jf > mer_counts_dumps.fa
```

To get some information on how, when and where this jellyfish file was
generated, use the `info` subcommand (see section [info](#info)):

```
jellyfish info mer_counts.jf
```

For more detail information, see the relevant sections in this
document. All commands understand `--help` and will produce some
information about the switches available.

### Counting k-mers in sequencing reads

In sequencing reads, it is unknown which strands of the DNA is
sequenced. As a consequence, a k-mer or its reverse complement are
essentially equivalent. The canonical representative of a k-mer `m` is
by definition `m` or the reverse complement of `m`, whichever comes
first lexicographically. The `-C` switch instructs to save in the hash
only canonical k-mers, while the count is the number of occurrences
of both a k-mer and it reverse complement.

The size parameter (given with `-s`) is an indication of the number
k-mers that will be stored in the hash. For sequencing reads, this
size should be the size of the genome plus the k-mers generated by
sequencing errors. For example, if the error rate is e (e.g.Illumina
reads, usually e~1%), with an estimated genome size of G and
a coverage of c, the number of expected k-mers is G+Gcek.

> NOTE: unlike in Jellyfish 1, this `-s` parameter is only
> an estimation. If the size given is too small to fit all the k-mers,
> the hash size will be increased automatically or partial results will
> be written to disk and finally merged automatically. Running
> `jellyfish merge` should never be necessary, as now
> jellyfish now takes care of this task on its own.
If the low frequency k-mers (k-mers occurring only once), which are
mostly due to sequencing errors, are not of interest, one might
consider counting only high-frequency k-mers (see
section [Counting high frequency k-mers](#Counting-high-frequency-k-mers)),
which uses less memory and is potentially faster.

### Counting k-mers in a genome

In an actual genome or finished sequence, a k-mer and its reverse
complement are not equivalent, hence using the `-C` switch does not
make sense. In addition, the size for the hash can be set directly to
the size of the genome.

## Counting high-frequency k-mers

Jellyfish offers two way to count only high-frequency k-mers (meaning
only k-mers with count >1), which reduces significantly the memory
usage. Both methods are based on using Bloom filters. The first method
is a one pass method, which provides approximate count for some
percentage of the k-mers. The second method is a two pass method which
provides exact count. In both methods, most of the low-frequency
k-mers are not reported.

### One pass method

Adding the `--bf-size` switch make jellyfish first insert all k-mers
first into a Bloom filter and only insert into the hash the k-mers
which have already been seen at least once. The argument to
`--bf-size` should the total number of k-mer expected in the data set
while the `--size` argument should be the number of $k$-mers occurring
more than once. For example:

```
jellyfish count -m 25 -s 3G --bf-size 100G -t 16 homo_sapiens.fa
```

would be appropriate for counting 25-mers in human reads at 30x
coverage. The approximate memory usage is 9 bits per k-mer in the
Bloom filter.

The count reported for each k-mer (by `jellyfish dump` or `jellyfish
query`) is one less than the actual count. Meaning, the count 1 k-mer
are not reported, count 2 k-mer are reported to have count 1, etc.

The drawback of this method is some percentage of the $k$-mer that
should not be reported (because they occur only once) are
reported. This is due to the random nature of the Bloom filter data
structure. The percentage is <1% by default and can be changed with
the `--bf-fp switch.

### Two passes method

In the two pass method, first a Bloom counter is created from the
reads with `jellyfish bc`. Then this Bloom counter is given to the
`jelllyfish count` command and only the k-mers which have been seen
twice in the first pass will be inserted in the hash. For example,
with a human data set similar that in
section [One pass method](#One-pass-method):

```
jellyfish bc -m 25 -s 100G -t 16 -o homo_sapiens.bc homo_sapiens.fa
jellyfish count -m 25 -s 3G -t 16 --bc homo_sapiens.bc homo_sapiens.fa
```

The advantage of this method is that the counts reported for the
k-mers are all correct. Most count 1 k-mer are not reported, except
for a small percentage (set by the false positive rate, `-f` switch of
the `bc` subcommand) of them which are reported (correctly with count
1). All other k-mers are reported with the correct count.

The drawback of this method is that it requires to parse the entire
reads data set twice and the memory usage of the Bloom counter is
greater than that of the Bloom filter (slightly less than twice as
much).

# FAQ

## How to read compressed files (or other format)?

Jellyfish only reads FASTA or FASTQ formatted input files. By reading
from pipes, jellyfish can read compressed files, like this:

```
zcat *.fastq.gz | jellyfish count /dev/fd/0 ...
```

or by using the `<()` redirection provided by the shell (e.g. bash, zsh):

```
jellyfish count <(zcat file1.fastq.gz) <(zcat file2.fasta.gz) ...
```

## How to read multiple files at once?

Often, jellyfish can parse an input sequence file faster than `gzip`
or `fastq-dump` (to parse SRA files) can output the sequence. This
leads to many threads in jellyfish going partially unused. Jellyfish
can be instructed to open multiple file at once. For example, to read
two short read archive files simultaneously:

```
jellyfish count -F 2 <(fastq-dump -Z file1.sra) <(fastq-dump -Z file2.sra) ...
```

Another way is to use "generators". First, create a file containing,
one per line, commands to generate sequence. Then pass this file to
jellyfish and the number of generators to run
simultaneously. Jellyfish will spawn subprocesses running the commands
passed and read their standard output for sequence. By default, the
commands are run using the shell in the SHELL environment variable,
and this can be changed by the `-S` switch. Multiple generators will
be run simultaneously as specified by the `-G` switch. For example:

```
ls *.fasta.gz | xargs -n 1 echo gunzip -c > generators
jellyfish count -g generators -G 4 ...
```

The first command created the command list into the 'generators' file,
each command unzipping one FASTA file in the current directory. The
second command runs jellyfish with 4 concurrent generators.

## How to reduce the output size?

The output file was design to be easy to read, but the file generated
can be rather large. By default, a 4 bytes counter value is saved for
every k-mer (i.e.a maximum count of over 4 billion). Instead, a
counter size of 2 bytes or 1 byte can be used with the switch
`--out-counter-len`, which reduces significantly the output size.

The count of k-mers which cannot be represented with the given number
of bytes will have a value equal to the maximum value that can be
represented. Meaning, if the counter field uses 1 byte, any
k-mers with count greater or equal to 255 will be reported of having
a count of 255.

Also, low frequency and high frequency k-mers can be skipped using the
`--L and `--U` switches respectively. Although it might be more
appropriate to filter out the low frequency $k$-mers using Bloom
filters, as shown in
section [Counting-high-frequency-k-mers](#Counting-high-frequency-k-mers).

## How much memory is needed?

The memory needed to count k-mers, given the various parameters of the
`count` subcommand is obtained by using the
`mem` subcommand. It understand all the same switches, but
only the `--mer-len` (`-m`), `--size` (`-s`),
`--counter-len` (`-c`) and `--reprobes` (`-p`)
switches are taken into account.

For example, for 24-mers, with an initial hash size of 1 billion, and
default parameters otherwise, the memory usage is:

```
jellyfish mem -m 24 -s 1G
4521043056 (4G)
```

Conversely, if the `--size` switch is not given but the
`--mem` switch is, then the maximum initial hash size that would
fit in the given memory is returned. For example, this is the maximum
hash size for 31-mers in 8 Gb of RAM:

```
jellyfish mem -m 31 --mem 8g
1073741824 (1G)
```

# Subcommands


## histo

The `histo` subcommand outputs the histogram of k-mers
frequencies. The last bin, with value one above the high setting set
by the `h` switch (10000 by default), is a catch all: all k-mers with
a count greater than the high setting are tallied in that one bin. If
the low setting is set (`-l` switch), then the first bin, with value
one below the low setting, is also similarly a catch all.

By default, the bins with a zero count are skipped. This can be changed
with the `f` switch.

## dump

The `dump` subcommand outputs a list of all the k-mers in
the file associated with their count. By default, the output is in FASTA
format, where the header line contains the count of the k-mer and the
sequence part is the sequence of the k-mer. This format has the
advantage that the output contains the sequence of $k$-mers and can be
directly fed into another program expecting the very common FASTA
format. A more convenient column format (for human beings) is selected
with the `-c` switch.

Low frequency and high frequency k-mers can be skipped with the `-L`
and `-U` switches respectively.

In the output of the `dump` subcommand, the k-mers are sorted
according to the hash function used by Jellyfish. The output can be
considered to be "fairly pseudo-random". By "fairly" we mean that NO
guarantee is made about the actual randomness of this order, it is
just good enough for the hash table to work properly. And by
"pseudo-random" we mean that the order is actually deterministic:
given the same hash function, the output will be always the same and
two different files generated with the same hash function can be
merged easily.

## query

The `query` subcommand outputs the k-mers and their counts for some
subset of k-mers. It will outputs the counts of all the k-mers
passed on the command line or of all the k-mers in the sequence read
from the FASTA or FASTQ formatted file passed to the switch
`-s` (this switch can be given multiple times).

## info

The `info` subcommand outputs some information about the jellyfish
file and the command used to generated it, in which directory and at
what time the command was run. Hopefully, the information given should
be enough to rerun jellyfish under the same conditions and reproduce
the output file. In particular, the `-c` switch outputs the command,
properly escaped and ready to run in a shell.

The header is saved in JSON format and contains more information than is
written by the default. The full header in JSON format can be written
out using the `-j` switch.

## merge

The `merge` subcommand is of little direct use with version version 2
of Jellyfish. When intermediary files were written to disk, because
not all k-mers would fit in memory, they can be merged into one file
containing the final result with the `merge` subcommand. The `count`
subcommand will merge intermediary files automatically as needed.

## stats

The `stats` subcommand computes some statistics about the
mers. Although these statistics could be computed from the histogram, it
provides quick summary information. The fields are:

* _Unique_: The number of k-mer occuring exactly once
* _Distinct_: The number of k-mers, ignoring their multiplicity
(i.e.the cardinality of the set of $k$-mers)
* _Total_: The number of k-mers with multiplicity (i.e.the sum of the
number of occurence of all the mers)
* _Max\_count_: The maximum of the number of occurrences

## mem

The `mem` subcommand shows how much memory a count subcommand will
need or conversely how large of a hash size will fit in a given amount
of memory.

## cite

The `cite` subcommand prints the citation for the jellyfish
paper. With the `-b`, it is formatted in Bibtex format. How
convenient!
Loading

0 comments on commit da91420

Please sign in to comment.