Sergey Kazakov edited this page Jun 21, 2016 · 26 revisions

Welcome to the MetaFast wiki!

Fast metagenome analysis toolkit, version 0.1.0.

Authors:

  • Software: Sergey Kazakov and Vladimir Ulyantsev, ITMO University, Saint-Petersburg.
  • Testing: Veronika Dubinkina and Alexandr Tyakht, SRI of Physical-Chemical Medicine, Moscow.
  • Idea, supervisor: Dmitry Alexeev, SRI of Physical-Chemical Medicine, Moscow.

Description

MetaFast (METAgenome FAST analysis toolkit) is software for calculating several statistics of metagenome sequences, building a distance matrix between them, and constructing a heatmap and a dendrogram from that distance matrix.

Although a number of approaches exist for analysing and comparing metagenomic data, most of them have inherent disadvantages that limit their applicability. For instance, reference-based methods require a representative database of known genomes; assembly-based methods are computationally intensive and can hardly be applied to metagenomes with a highly complex community structure; composition-based methods do not produce clearly interpretable features.

MetaFast takes a novel approach based on unfinished assembly of every metagenome library included in the study. Its main steps are:

  1. Assembling short genomic sequences from reads for every metagenome separately (based on a de Bruijn graph).
  2. Constructing one combined de Bruijn graph for all assembled sequences, then searching for connected components in it.
  3. Calculating a characteristic vector for every metagenome with a length equal to the number of connected components.
  4. Cross-comparing metagenomes by calculating the Bray-Curtis dissimilarity matrix based on characteristic vectors.
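
Steps 2 to 4 can be illustrated with a toy Python sketch. It is not MetaFast's implementation: it skips read assembly and error filtering, ignores reverse complements and component size limits, and assumes that each entry of a characteristic vector is the number of a sample's k-mer occurrences that fall into the corresponding component, which the description above does not specify. All function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def component_ids(sequences, k):
    """Map every k-mer to the id of its connected component.

    Nodes are k-mers; consecutive k-mers of a sequence are joined by an edge,
    so sequences that share a k-mer end up in the same component.
    """
    parent = {}
    for seq in sequences:
        prev = None
        for km in kmers(seq, k):
            parent.setdefault(km, km)
            if prev is not None:
                parent[find(parent, prev)] = find(parent, km)
            prev = km
    roots = sorted({find(parent, km) for km in parent})
    index = {root: i for i, root in enumerate(roots)}
    return {km: index[find(parent, km)] for km in parent}


def characteristic_vectors(samples, k):
    """Build one vector per sample; its length is the number of components.

    ASSUMPTION: an entry counts the sample's k-mer occurrences in a component.
    """
    all_seqs = [s for seqs in samples.values() for s in seqs]
    comp_of = component_ids(all_seqs, k)
    n_components = len(set(comp_of.values()))
    vectors = {}
    for name, seqs in samples.items():
        counts = Counter(comp_of[km] for s in seqs for km in kmers(s, k))
        vectors[name] = [counts[c] for c in range(n_components)]
    return vectors


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Two toy "metagenomes", each given as already assembled sequences.
samples = {
    "A": ["ACGTAC", "TTGGA"],
    "B": ["CGTACG", "GGGCC"],
}
vectors = characteristic_vectors(samples, k=3)
print(vectors)
print(bray_curtis(vectors["A"], vectors["B"]))
```

In this toy input the two samples share one component and each has one private component, so the dissimilarity lies strictly between 0 (identical vectors) and 1 (no shared components).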

MetaFast does not require a priori knowledge of the taxa that may be present in the microbiota. Its other advantages over the methods mentioned above are modest system requirements and interpretable extracted features.

The software was implemented in Java and can be run on any operating system (tested on Linux 2.6.32 x86_64).

Installation

To run metafast you need JRE 1.6 or higher installed. Only one file is required to run it (metafast.sh, metafast.bat or metafast.jar).

You can download the metafast run script for the latest stable release from https://github.com/ctlab/metafast/releases.

  • For Linux and Mac OS: download metafast.sh, run the command chmod a+x metafast.sh, then run ./metafast.sh from the command line.
  • For Windows: download metafast.bat and run it from the command line.
  • For other OS: download metafast.jar and run it with the command java -jar metafast.jar.

Alternatively, you can build the newest version of metafast from the repository:

git clone https://github.com/ctlab/metafast.git
cd metafast 
ant
./out/metafast.sh --version

System requirements

  • Software: Java Runtime Environment 1.6 or higher.

There are no other strict requirements for running metafast; however, based on our runs we estimate the memory and disk requirements as follows:

  • RAM: metafast requires 2-2.5 times as much memory as the size of the largest uncompressed FASTQ file to be processed.
  • Hard disk space: metafast requires 25-30% of the total size of the uncompressed FASTQ files processed.

NB. These estimates are very rough and apply to runs of 10-100 libraries (FASTQ files) of 1-50 Gb each.
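
The arithmetic behind these rules of thumb can be written down as a short Python sketch. The function name is hypothetical and the numbers are only the rough factors quoted above, so treat the result as an order-of-magnitude guide. For two files of 4 GB and 10 GB it gives about 20 to 25 GB of RAM and 3.5 to 4.2 GB of disk space.

```python
def estimate_resources(fastq_sizes):
    """Return (ram_low, ram_high, disk_low, disk_high) in bytes.

    fastq_sizes: sizes in bytes of the UNCOMPRESSED FASTQ files of one run.
    RAM scales with the largest file, disk space with the total size.
    """
    largest = max(fastq_sizes)
    total = sum(fastq_sizes)
    return (2.0 * largest, 2.5 * largest, 0.25 * total, 0.30 * total)


GB = 10 ** 9
ram_low, ram_high, disk_low, disk_high = estimate_resources([4 * GB, 10 * GB])
print(f"RAM:  {ram_low / GB:.1f}-{ram_high / GB:.1f} GB")
print(f"Disk: {disk_low / GB:.1f}-{disk_high / GB:.1f} GB")
```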

Running metafast

To run metafast use the following syntax:

  • metafast.sh [<Launch options>] [<Input parameters>]
  • metafast.bat [<Launch options>] [<Input parameters>]
  • java -jar metafast.jar [<Launch options>] [<Input parameters>]

Full description of launch options and input parameters can be found below in section Options and Parameters.
Also, you can run metafast.sh --help or metafast.sh --help-all to view help for them.

Running metafast creates a working directory (./workDir/ by default). All intermediate files, the log file and the final results are saved in it.

The file output_description.txt is created after every run in both the current and the working directories. It describes every output file produced by metafast.

The metafast run script also lets you run individual subtools of the whole pipeline, or other tools included in the package. To see the list of available additional tools, run metafast.sh --tools.

Input files

MetaFast accepts input sequence files in FASTQ and FASTA formats. Input files can also be compressed with gzip or bzip2.
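
Because the memory and disk estimates in System requirements refer to uncompressed sizes, it can help to measure the uncompressed size of a compressed input before a run. The Python sketch below is one way to do that; it is not part of metafast, the function names are hypothetical, and choosing the decompressor from the file extension (.gz, .bz2) is an assumption of this sketch.

```python
import bz2
import gzip


def open_reads(path, mode="rb"):
    """Open a plain, gzip- or bzip2-compressed file (chosen by extension)."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the uncompressed size of a reads file in bytes.

    The file is streamed in chunks, so it never has to fit in memory.
    """
    total = 0
    with open_reads(path) as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```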

Output files

When metafast finishes, the working directory will contain the following results:

  • output_description.txt,
    <work-dir>/output_description.txt: identical text files describing the output files. The copy in the current directory (where the run was started) is created only if it is possible to write it there.
  • <work-dir>/log,
    <work-dir>/logs/log_<date>_<time>: identical text files with the run log.
  • <work-dir>/kmer-counter-many/stats: directory with k-mer frequency statistics (one statistics file in text format for every input reads file).
  • <work-dir>/seq-builder-many/sequences: directory with FASTA files containing the paths built from reads, for every library.
  • <work-dir>/component-cutter/components-stat-<b1>-<b2>.txt: file with component statistics (in text format).
  • <work-dir>/component-cutter/components.bin: file with the extracted components (in binary format).
  • <work-dir>/features-calculator/vectors: directory with feature value files for every library (in text format).
  • <work-dir>/matrices/dist_matrix_<date>_<time>_original_order.txt: file with the resulting distance matrix between samples, in the original sample order. It is based on Bray-Curtis dissimilarity; element matrix[i][j] is the distance between sample i and sample j.
  • <work-dir>/matrices/dist_matrix_<date>_<time>.txt: file with the resulting distance matrix between samples, reordered based on the adjacency of the samples.
  • <work-dir>/matrices/dist_matrix_<date>_<time>_heatmap.png: image file with the heatmap and dendrogram of the samples.

WARNING! When another run is started in the same working directory <work-dir>, only the files in <work-dir>/matrices and the logs are kept.

Options and Parameters

Input parameters for metafast:

  • -i, --reads <files>
    List of reads files from a single environment. FASTQ, BINQ and FASTA files are accepted, as are gzip- and bzip2-compressed files. Files can be specified with shell glob patterns or command substitution, for example -i dir/*.fastq or -i `cat filelist.txt`.
  • -k, --k <arg>
    K-mer size (in nucleotides, maximum 31 due to implementation details). The default value is 31 nucleotides.
  • -b, --maximal-bad-frequency <arg>
    Maximal frequency for a k-mer to be assumed erroneous. The default value is 1 k-mer.
  • -l, --min-seq-len <arg>
    Minimum sequence length to be added to a component (in nucleotides). The default value is 100 nucleotides.
  • --matrix-file <arg>
    Resulting distance matrix file. The default value is <work-dir>/matrices/dist_matrix_<date>_<time>.txt.
  • --stats-dir <arg>
    Directory with statistics for kmers. The default value is <work-dir>/kmer-counter-many/stats.
  • -bp, --bottom-cut-percent <arg>
    Percentage of k-mers to be assumed erroneous while building sequences in seq-builder. If specified, --maximal-bad-frequency is not used in the sequence builder.
  • -b1, --min-component-size <arg>
    Minimum component size in component-cutter (in k-mers). The default value is 1000 k-mers.
  • -b2, --max-component-size <arg>
    Maximum component size in component-cutter (in k-mers). The default value is 10000 k-mers.
  • -wn, --without-names
    Do not label the matrix rows and columns with the given file names.
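
How -k and -b interact can be shown with a small Python sketch: reads are split into k-mers, the k-mers are counted, and those whose frequency does not exceed the threshold are treated as erroneous. This only illustrates the rule stated above (it is consistent with -b 0 keeping every k-mer); the function names are hypothetical, and details such as reverse-complement handling in metafast itself are not covered.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def good_kmers(counts, max_bad_frequency):
    """Keep k-mers seen more often than the 'maximal bad frequency'."""
    return {km for km, c in counts.items() if c > max_bad_frequency}


reads = ["ACGTACG", "ACGTACG", "ACGTTCG"]
counts = count_kmers(reads, k=4)
print(sorted(good_kmers(counts, max_bad_frequency=1)))  # k-mers seen once are dropped
print(len(good_kmers(counts, max_bad_frequency=0)))     # every observed k-mer is kept
```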

Launch options:

  • -ts, --tools
    Print available tools.
  • -t, --tool <tool-name>
    Select the tool to run. The default is the matrix-builder tool.
    NOTE. All input parameters defined above are parameters for the matrix-builder tool. Input parameters can be different for other tools. To see help for them run the command metafast.sh -t <tool-name>.
  • -m, --memory <MEM>
    Memory to use (values with suffix, for example: 1500M, 4G, etc.). By default metafast uses 90% of free memory.
    WARNING! The parameter works only with metafast.sh and metafast.bat scripts. Use java -Xss24M -Xmx<MEM> -Xms<MEM> -jar metafast.jar when you are using metafast.jar.
  • -p, --available-processors <arg>
    Available processors. By default metafast uses all available processors.
  • -w, --work-dir <arg>
    Working directory. The default working directory is workDir/ in the current directory.
  • -c, --continue
    Continue the previous run in the working directory from the last successful stage.
    NOTE. There is no need to set input parameters: they are loaded from <work-dir>. For example, the command metafast.sh -w <work-dir> -c makes metafast continue the previous run saved in <work-dir>.
  • -s, --start <stage-name>
    Try to continue the previous run, and rerun the program from the <stage-name> stage (overwriting old results).
  • -f, --finish <stage-name>
    Stop after running the <stage-name> stage.
  • -ea, --enable-assertions
    Enable assertions. If metafast behaves strangely, you can use this flag to enable additional checks during the run. By default assertions are disabled.
    WARNING! The parameter works only with metafast.sh and metafast.bat scripts. Use java -ea -jar metafast.jar when you are using metafast.jar.
  • -v, --verbose
    Enable debug output.
  • -h, --help
    Print short help message.
  • -ha, --help-all
    Print full help message.
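
Since -m and -ea only take effect through the wrapper scripts, anyone launching metafast.jar from their own code has to pass the JVM flags directly. The Python sketch below assembles such a command line from the flags quoted in the two warnings above; the helper name and its parameters are hypothetical, and combining the memory and assertion flags in one call is an assumption based on standard JVM behaviour.

```python
def jar_command(memory=None, enable_assertions=False, metafast_args=()):
    """Build an argument list for running metafast.jar directly.

    memory: a value such as "1500M" or "4G", or None to use JVM defaults.
    metafast_args: options passed through to metafast unchanged.
    """
    cmd = ["java"]
    if enable_assertions:
        cmd.append("-ea")
    if memory is not None:
        cmd += ["-Xss24M", f"-Xmx{memory}", f"-Xms{memory}"]
    cmd += ["-jar", "metafast.jar", *metafast_args]
    return cmd


print(" ".join(jar_command("4G", True, ["-i", "sample_1.fastq", "sample_2.fastq"])))
```

The resulting list can be handed to subprocess.run(cmd, check=True) when java and metafast.jar are available in the current directory.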

Examples

To run metafast with default parameters on Linux on two samples with reads in sample_1.fastq and sample_2.fastq:

metafast.sh -i sample_1.fastq sample_2.fastq

To run metafast with default parameters on Linux on all reads files with the .fastq extension in the data directory:

metafast.sh -i data/*.fastq

To run metafast on a hand-made test with reads in tinytest_A.fastq and tinytest_B.fastq:

metafast.sh -m 1G -k 7 -b 0 -l 8 -b1 3 -i tinytest_A.fastq tinytest_B.fastq

Explanation of the parameters:

  • -m 1G: use 1 GB of memory;
  • -k 7: use a k-mer size of 7 nucleotides;
  • -b 0: the maximal frequency for a k-mer to be assumed erroneous is 0 (all k-mers are good);
  • -l 8: only sequences of at least 8 nucleotides will be added to a component;
  • -b1 3: the minimum component size is 3 different k-mers.

The distance matrix for this example will be saved in workDir/matrices/dist_matrix_<date>_<time>.txt:

#       tinytest_A  tinytest_B
tinytest_A  0.0     0.09090909090909091
tinytest_B  0.09090909090909091     0.0

The matrix cell in row i and column j is the distance between sample i and sample j.
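
For downstream analysis, a matrix in the layout shown above can be loaded with a few lines of Python. This sketch assumes whitespace-separated fields, a header row starting with #, sample names without spaces, and that -wn was not used; the function name is hypothetical.

```python
def parse_distance_matrix(text):
    """Parse a named distance matrix into (names, nested dict of distances)."""
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split()[1:]  # drop the leading '#'
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        row_name = fields[0]
        values = [float(x) for x in fields[1:]]
        matrix[row_name] = dict(zip(names, values))
    return names, matrix


example = """\
#       tinytest_A  tinytest_B
tinytest_A  0.0     0.09090909090909091
tinytest_B  0.09090909090909091     0.0
"""
names, matrix = parse_distance_matrix(example)
print(names)
print(matrix["tinytest_A"]["tinytest_B"])
```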