Home
Welcome to the MetaFast wiki!
Fast metagenome analysis toolkit, version 0.1.0.
Authors:
- Software: Sergey Kazakov and Vladimir Ulyantsev, ITMO University, Saint-Petersburg.
- Testing: Veronika Dubinkina and Alexandr Tyakht, SRI of Physical-Chemical Medicine, Moscow.
- Idea, supervisor: Dmitry Alexeev, SRI of Physical-Chemical Medicine, Moscow.
MetaFast (METAgenome FAST analysis toolkit) is software for calculating a number of statistics of metagenome sequences, building the distance matrix between them, and constructing a heatmap and a dendrogram based on that distance matrix.
Although a number of approaches exist for analysing and comparing metagenomic data, most of them have inherent disadvantages that limit their applicability. For instance, reference-based methods require a representative database of known genomes; assembly-based methods are computationally intensive and can hardly be applied to metagenomes with highly complex community structure; composition-based methods do not produce clearly interpretable features.
MetaFast takes a novel approach based on unfinished assembly of every metagenome library included in the study. Its main steps are:
- Assembling short genomic sequences from reads for every metagenome separately (based on a de Bruijn graph).
- Constructing one combined de Bruijn graph for all assembled sequences, then searching for connected components in it.
- Calculating a characteristic vector for every metagenome with a length equal to the number of connected components.
- Cross-comparing metagenomes by calculating the Bray-Curtis dissimilarity matrix based on characteristic vectors.
It does not require a priori knowledge about the taxa possibly included in the microbiota. Other advantages over the above-mentioned methods are rather small system requirements and interpretability of the extracted features.
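The last step above can be sketched in a few lines. This is a minimal illustration of the standard Bray-Curtis formula applied to characteristic vectors, not MetaFast's own Java implementation; the sample names, the vector values and the function names `bray_curtis` and `distance_matrix` are hypothetical.

```python
from typing import Dict, List, Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        # Convention chosen for this sketch: two all-zero vectors are identical.
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(vectors: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Pairwise dissimilarities, rows and columns in dictionary order."""
    names = list(vectors)
    return [[bray_curtis(vectors[a], vectors[b]) for b in names] for a in names]


# Hypothetical characteristic vectors: one value per connected component.
vectors = {"sample_1": [10, 0, 5], "sample_2": [8, 2, 5]}
matrix = distance_matrix(vectors)
```

The resulting matrix is symmetric with zeros on the diagonal, which matches the shape of the distance matrix files described later on this page.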
The software was implemented in Java and can be run on any operating system (tested on Linux 2.6.32 x86_64).
To run metafast you need to have JRE 1.6 or higher installed.
To run it, only one script is required (`metafast.sh`, `metafast.bat` or `metafast.jar`).
You can download the metafast run script from the latest stable release at https://github.com/ctlab/metafast/releases.
- For Linux and Mac OS: download `metafast.sh`, run the command `chmod a+x metafast.sh`, then run `./metafast.sh` from the command line.
- For Windows: download `metafast.bat` and run it from the command line.
- For other OS: download `metafast.jar` and run it via the command `java -jar metafast.jar`.
Alternatively, you can build the newest version of metafast from the repository:
git clone https://github.com/ctlab/metafast.git
cd metafast
ant
./out/metafast.sh --version
- Software: Java Runtime Environment 1.6 or higher (you can download it here).
There are no other strict requirements for running metafast; however, we have estimated memory and disk requirements based on our own runs:
- RAM: metafast requires 2-2.5 times more memory than the size of the largest uncompressed FASTQ file to be processed.
- Hard disk space: metafast requires 25-30% of the total size of the processed uncompressed FASTQ files.
NB. These estimates are very rough and hold for 10-100 libraries (FASTQ files) of 1-50 Gb each.
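As a rough planning aid, the rules of thumb above can be turned into a small calculation. This is only a sketch of that arithmetic; the function name `estimate_resources` and the example file sizes are hypothetical, and the same caveat applies: the figures are coarse estimates.

```python
from typing import Dict, Sequence, Tuple


def estimate_resources(fastq_sizes: Sequence[float]) -> Dict[str, Tuple[float, float]]:
    """Return (low, high) estimates in the same unit as the input sizes.

    fastq_sizes: sizes of the uncompressed FASTQ files.
    RAM:  2-2.5 times the largest file.
    Disk: 25-30% of the total size of all files.
    """
    if not fastq_sizes:
        raise ValueError("at least one file size is required")
    largest = max(fastq_sizes)
    total = sum(fastq_sizes)
    return {
        "ram": (2.0 * largest, 2.5 * largest),
        "disk": (0.25 * total, 0.30 * total),
    }


# Hypothetical example: three libraries of 10, 20 and 40 size units.
estimate = estimate_resources([10, 20, 40])
print(estimate)
```

The upper RAM bound can then be compared with the value passed to the `-m` launch option.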
To run metafast use the following syntax:
metafast.sh [<Launch options>] [<Input parameters>]
metafast.bat [<Launch options>] [<Input parameters>]
java -jar metafast.jar [<Launch options>] [<Input parameters>]
A full description of launch options and input parameters can be found below in the section Options and Parameters.
You can also run `metafast.sh --help` or `metafast.sh --help-all` to view help for them.
Running metafast creates a working directory (by default `./workDir/`).
All intermediate files, the log file and the final results are saved in it.
The file `output_description.txt` is created after every run in both the current and the working directory.
It contains a description of every output file produced by metafast.
The metafast run script also lets you run individual subtools of the whole process, as well as other tools that are included in the package.
To see the list of available additional tools, run `metafast.sh --tools`.
MetaFast accepts input sequence files in FASTQ and FASTA formats. Input files can also be compressed with gzip or bzip2.
When metafast finishes, the working directory will contain the following results:
- `output_description.txt`, `<work-dir>/output_description.txt`: identical text files describing the output files. The file `output_description.txt` in the current directory (where the run was started) is created only if it is possible to do so.
- `<work-dir>/log`, `<work-dir>/logs/log_<date>_<time>`: identical text files with the run log.
- `<work-dir>/kmer-counter-many/stats`: directory with k-mer frequency statistics (one text file per input reads file).
- `<work-dir>/seq-builder-many/sequences`: directory with FASTA files containing the paths built from reads for every library.
- `<work-dir>/component-cutter/components-stat-<b1>-<b2>.txt`: file with component statistics (in text format).
- `<work-dir>/component-cutter/components.bin`: file with the extracted components (in binary format).
- `<work-dir>/features-calculator/vectors`: directory with feature value files for every library (in text format).
- `<work-dir>/matrices/dist_matrix_<date>_<time>_original_order.txt`: file with the resulting distance matrix between samples, keeping the original sample order. It is based on Bray-Curtis dissimilarity; element `matrix[i][j]` is the distance between sample i and sample j.
- `<work-dir>/matrices/dist_matrix_<date>_<time>.txt`: file with the resulting distance matrix between samples, reordered by adjacency of the samples.
- `<work-dir>/matrices/dist_matrix_<date>_<time>_heatmap.png`: image file with the heatmap and dendrogram of the samples.
WARNING! When metafast is run again in the same working directory `<work-dir>`, only the files in `<work-dir>/matrices` and the logs are preserved.
Input parameters for metafast:
- `-i, --reads <files>`: List of reads files from a single environment. FASTQ, BINQ and FASTA files are accepted; gzip- and bzip2-compressed files are allowed too. Files can be specified using shell expansion, for example `-i dir/*.fastq` or ``-i `cat filelist.txt` ``.
- `-k, --k <arg>`: K-mer size (in nucleotides, maximum 31 due to implementation details). The default value is 31 nucleotides.
- `-b, --maximal-bad-frequency <arg>`: Maximal frequency for a k-mer to be assumed erroneous. The default value is 1.
- `-l, --min-seq-len <arg>`: Minimum sequence length to be added to a component (in nucleotides). The default value is 100 nucleotides.
- `--matrix-file <arg>`: Resulting distance matrix file. The default value is `<work-dir>/matrices/dist_matrix_<date>_<time>.txt`.
- `--stats-dir <arg>`: Directory with statistics for k-mers. The default value is `<work-dir>/kmer-counter-many/stats`.
- `-bp, --bottom-cut-percent <arg>`: Percentage of k-mers to be assumed erroneous while building sequences in seq-builder. If specified, `--maximal-bad-frequency` is not used in the sequence builder.
- `-b1, --min-component-size <arg>`: Minimum component size in component-cutter (in k-mers). The default value is 1000 k-mers.
- `-b2, --max-component-size <arg>`: Maximum component size in component-cutter (in k-mers). The default value is 10000 k-mers.
- `-wn, --without-names`: Do not print the file names as matrix row and column names.
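To make the meaning of `-k` and `-b` more concrete, here is a minimal sketch of k-mer counting with a frequency threshold. It is an illustration only and is not taken from the MetaFast source: the function names and the example reads are hypothetical, and details such as the handling of reverse complements are ignored here.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every substring of length k across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def good_kmers(counts: Counter, max_bad_frequency: int) -> Set[str]:
    """Keep k-mers seen more often than the 'bad' threshold.

    K-mers with frequency <= max_bad_frequency are treated as erroneous,
    so a threshold of 0 keeps every k-mer.
    """
    return {kmer for kmer, n in counts.items() if n > max_bad_frequency}


# Hypothetical toy reads.
reads = ["ACGTACG", "ACGTTCG"]
counts = count_kmers(reads, k=3)
kept_default = good_kmers(counts, max_bad_frequency=1)
kept_all = good_kmers(counts, max_bad_frequency=0)
```

With the threshold of 1, only k-mers that occur at least twice survive; with 0, all distinct k-mers are kept, which is why a threshold of 0 is convenient for very small test inputs.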
Launch options:
- `-ts, --tools`: Print available tools.
- `-t, --tool <tool-name>`: Set a certain tool to run. The default tool is the matrix-builder tool.
  NOTE. All input parameters defined above are parameters for the matrix-builder tool. Input parameters can be different for other tools. To see help for them, run the command `metafast.sh -t <tool-name>`.
- `-m, --memory <MEM>`: Memory to use (values with a suffix, for example 1500M, 4G, etc.). By default metafast uses 90% of free memory.
  WARNING! The parameter works only with the `metafast.sh` and `metafast.bat` scripts. Use `java -Xss24M -Xmx<MEM> -Xms<MEM> -jar metafast.jar` when you are using `metafast.jar`.
- `-p, --available-processors <arg>`: Available processors. By default metafast uses all available processors.
- `-w, --work-dir <arg>`: Working directory. The default working directory is `workDir/` in the current directory.
- `-c, --continue`: Continue the previous run in the working directory from the last successfully completed stage.
  NOTE. There is no need to set input parameters: they are loaded from `<work-dir>`, i.e. the command `metafast.sh -w <work-dir> -c` makes metafast continue the previous run saved in `<work-dir>`.
- `-s, --start <stage-name>`: Try to continue the previous run, rerunning the program from the `<stage-name>` stage (overwriting old results).
- `-f, --finish <stage-name>`: Stop after running the `<stage-name>` stage.
- `-ea, --enable-assertions`: Enable assertions. If metafast behaves strangely, you can use this flag for additional checks during the run. By default assertions are disabled.
  WARNING! The parameter works only with the `metafast.sh` and `metafast.bat` scripts. Use `java -ea -jar metafast.jar` when you are using `metafast.jar`.
- `-v, --verbose`: Enable debug output.
- `-h, --help`: Print a short help message.
- `-ha, --help-all`: Print the full help message.
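Because `-m` and `-ea` are handled by the wrapper scripts, a call to `metafast.jar` has to pass the equivalent JVM flags itself. The helper below is a hypothetical convenience sketch that only assembles such a command line from the flags quoted above; the function name `jar_command` is an assumption, and the command is built but not executed.

```python
from typing import List, Sequence


def jar_command(memory: str,
                input_files: Sequence[str],
                enable_assertions: bool = False,
                jar: str = "metafast.jar") -> List[str]:
    """Build a `java ... -jar metafast.jar -i ...` argument list.

    memory: value with a suffix, for example "1500M" or "4G".
    """
    cmd = ["java"]
    if enable_assertions:
        cmd.append("-ea")  # JVM equivalent of the script's -ea option
    # JVM equivalents of the script's -m option, as given in the warning above.
    cmd += ["-Xss24M", f"-Xmx{memory}", f"-Xms{memory}", "-jar", jar]
    cmd += ["-i", *input_files]
    return cmd


command = jar_command("4G", ["sample_1.fastq", "sample_2.fastq"])
print(" ".join(command))
```

The returned list could be passed to `subprocess.run` on a machine where Java and `metafast.jar` are available.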
To run metafast with default parameters on Linux on two samples with reads in sample_1.fastq and sample_2.fastq:
metafast.sh -i sample_1.fastq sample_2.fastq
To run metafast with default parameters on Linux on all read files with the .fastq extension in the data directory:
metafast.sh -i data/*.fastq
To run metafast on a hand-made test with reads in tinytest_A.fastq and tinytest_B.fastq:
metafast.sh -m 1G -k 7 -b 0 -l 8 -b1 3 -i tinytest_A.fastq tinytest_B.fastq
Explanation of the parameters:
- `-m 1G`: use 1 GB of memory;
- `-k 7`: use a k-mer size of 7 nucleotides;
- `-b 0`: the maximal frequency for a k-mer to be assumed erroneous is 0 (all k-mers are good);
- `-l 8`: only sequences of at least 8 nucleotides will be added to a component;
- `-b1 3`: the minimum component size is 3 different k-mers.
The distance matrix for this example will be saved in `workDir/matrices/dist_matrix_<date>_<time>.txt`:
# tinytest_A tinytest_B
tinytest_A 0.0 0.09090909090909091
tinytest_B 0.09090909090909091 0.0
The matrix cell in row i and column j is the distance between sample i and sample j.
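For downstream analysis, a matrix file of this shape can be loaded with a few lines of code. The sketch below assumes the layout shown above (a header line starting with `#`, then one whitespace-separated row per sample, with sample names that contain no whitespace); the function name `parse_distance_matrix` is hypothetical, and files written with `--without-names` would need different handling.

```python
from typing import Dict


def parse_distance_matrix(text: str) -> Dict[str, Dict[str, float]]:
    """Parse a named distance matrix into nested dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Header: "# name_1 name_2 ..."; drop the leading '#'.
    columns = lines[0].lstrip("#").split()
    matrix: Dict[str, Dict[str, float]] = {}
    for line in lines[1:]:
        fields = line.split()
        row_name, values = fields[0], fields[1:]
        if len(values) != len(columns):
            raise ValueError(f"row {row_name!r} has {len(values)} values, "
                             f"expected {len(columns)}")
        matrix[row_name] = {col: float(v) for col, v in zip(columns, values)}
    return matrix


example = """\
# tinytest_A tinytest_B
tinytest_A 0.0 0.09090909090909091
tinytest_B 0.09090909090909091 0.0
"""
distances = parse_distance_matrix(example)
print(distances["tinytest_A"]["tinytest_B"])
```

To read a real result, pass the contents of the matrix file, for example `parse_distance_matrix(open(path).read())` with `path` pointing to the file in `workDir/matrices`.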