FlashWeave predicts ecological interactions between microbes from large-scale compositional abundance data (i.e. OTU tables constructed from sequencing data) through statistical co-occurrence or co-abundance. It reports direct associations, with adjustment for bystander effects and other confounders, and can furthermore integrate environmental or technical factors into the analysis of microbial systems.
To install Julia, please follow instructions on https://github.com/JuliaLang/julia. The preferred way is to obtain a binary from https://julialang.org/downloads/. Make sure you install Julia 1.6 or above, the versions currently supported by FlashWeave.
In an interactive Julia session, you can then install FlashWeave by typing ] (to enter the package manager prompt) and running
(@v1.6) pkg> add FlashWeave
# to run tests: (@v1.6) pkg> test FlashWeave
Important note: as of version 0.19, FlashWeave dropped support for Julia versions < 1.6 (the new LTS version). If you want to use FlashWeave with older Julia installations, make sure to install FlashWeave 0.18.1 (or lower) by typing ] followed by add FlashWeave@0.18.
See NEWS.md for the latest features and bugfixes.
To learn an interaction network from an OTU table and (optionally) a meta data table, you can do
julia> using FlashWeave # this has some pre-compilation delay the first time it's called, subsequent imports are fast
julia> data_path = "/my/example/data.tsv" # or .csv, .biom
julia> meta_data_path = "/my/example/meta_data.tsv"
julia> netw_results = learn_network(data_path, meta_data_path, sensitive=true, heterogeneous=false)
<< summary statistics of the learned network >>
julia> G = graph(netw_results) # weighted graph object representing interactions + weights, to be used with the JuliaGraphs ecosystem (https://github.com/JuliaGraphs)
Results can currently be saved in JLD2 format (fast for large networks, but soon to be discontinued, see below) or in the traditional Graph Modelling Language (".gml") or edgelist (".edgelist") formats:
julia> save_network("/my/example/network_output.edgelist", netw_results)
julia> ## or: save_network("/my/example/network_output.gml", netw_results)
For output of additional information (such as discarding sets, if available) in separate files you can specify the "detailed" flag:
julia> save_network("/my/example/network_output.edgelist", netw_results, detailed=true)
julia> # for .jld2, additional information is always saved if available
A convenient loading function is available:
julia> netw_results = load_network("/my/example/network_output.edgelist")
To get more information on a function, you may type ? into the prompt, followed by a function name:
julia> ?
help> learn_network
learn_network(data::AbstractArray{<:Real}) -> FWResult{Int}
Learn an interaction network from a data table (including OTUs and optionally meta variables).
Algorithmic parameters:
• heterogeneous - enable heterogeneous mode for multi-habitat or -protocol data with at least thousands of samples (FlashWeaveHE)
• sensitive - enable fine-grained associations (FlashWeave-S, FlashWeaveHE-S), sensitive=false results in the fast modes FlashWeave-F or FlashWeaveHE-F
• max_k - maximum size of conditioning sets, high values can strongly increase runtime. max_k=0 results in no conditioning (univariate mode)
• alpha - threshold used to determine statistical significance
• conv - convergence threshold, i.e. if conv=0.01 assume convergence if the number of edges increased by only 1% after 100% more runtime (checked in intervals)
• feed_forward - enable feed-forward heuristic
• max_tests - maximum number of conditional tests that should be performed on a variable pair before association is assumed
• hps - reliability criterion for statistical tests when sensitive=false
• FDR - perform False Discovery Rate correction (Benjamini-Hochberg method) on pairwise associations
• n_obs_min - don't compute associations between variables having less reliable samples (i.e. non-zero if heterogeneous=true) than this number. -1: automatically choose a threshold.
• time_limit - if feed-forward heuristic is active, determines the interval (seconds) at which neighborhood information is updated
General parameters:
• normalize - automatically choose and perform data normalization (based on sensitive and heterogeneous)
• track_rejections - store for each discarded edge, which variable set lead to its exclusion (can be memory intense for large networks)
• verbose - print progress information
• transposed - if true, rows of data are variables and columns are samples
• prec - precision in bits to use for calculations (16, 32, 64 or 128)
• make_sparse - use a sparse data representation (should be left at true in almost all cases)
• update_interval - if verbose=true, determines the interval (seconds) at which network stat updates are printed
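To make the FDR option more concrete, the following is a minimal sketch of the generic Benjamini-Hochberg procedure on a handful of made-up p-values. It is a textbook version written for illustration, not FlashWeave's internal code; the function name bh_adjust and the example values are assumptions.

```julia
# Benjamini-Hochberg adjusted p-values (generic textbook procedure;
# illustrative only, not FlashWeave's internal implementation).
function bh_adjust(pvals::AbstractVector{<:Real})
    m = length(pvals)
    order = sortperm(pvals)               # indices from smallest to largest p-value
    adjusted = similar(pvals, Float64)
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in m:-1:1
        idx = order[rank]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    end
    return adjusted
end

pvals = [0.01, 0.04, 0.03, 0.005]   # made-up pairwise p-values
adj = bh_adjust(pvals)
alpha = 0.05                        # significance threshold
significant = adj .<= alpha         # associations kept after correction
println(adj)
println(significant)
```

Associations whose adjusted p-value exceeds alpha would be dropped; in this toy example all four survive.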
For further analysis of the computed network, one can take advantage of Julia's graph analysis ecosystem (https://github.com/JuliaGraphs), in particular the LightGraphs.jl package (using the object returned by FlashWeave's graph function). Otherwise, the network can be used directly with external graph analysis packages (e.g. igraph or networkx) by exporting it via FlashWeave's save_network into a supported format. For visualization, we recommend exporting & loading the network into specialized tools such as Cytoscape or Gephi.
OTU tables can be provided in several formats:
- delimited formats: ".tsv" (example) or ".csv" (example)
  - if the first column contains row ids, these must be unique string identifiers
- BIOM: BIOM 1.0 (description, example) or the more performant BIOM 2.0 (description, example)
- JLD2: a Julia-specific, high-performance file format (description, example)
  - soon discontinued due to stability issues, please use a delimited format or BIOM
Meta data should generally be provided in a delimited format (see for instance example1 or example2), separately from the OTU table. Notably, this implies that FlashWeave does not yet support reading meta data directly from BIOM files, but requires a separate delimited meta data file (support will be added in an upcoming version). NOTE: the OTU table and the meta data table must be aligned such that each row corresponds to the same sample in both files (i.e. the data for sample 1 is found in row 1 of both files, etc.).
For JLD2, however, you can already provide HDF5 keys linked to meta data tables (and optionally headers):
julia> data_path = "/my/example/otu_and_meta_data.jld2"
julia> netw_results = learn_network(data_path, otu_data_key="otu_data", otu_header_key="otu_header", meta_data_key="meta_data", meta_header_key="meta_header", sensitive=true, heterogeneous=false)
See also the test/data/HMP_SRA_gut directory for further examples of OTU and meta data tables.
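Since the OTU table and the meta data table must list the same samples in the same order, it can help to verify this before learning a network. The sketch below assumes that both files are tab-delimited, have one header line and carry unique sample IDs in their first column; the helper sample_ids and the toy files are hypothetical and not part of FlashWeave.

```julia
# Hypothetical helper (not part of FlashWeave): read the first column of a
# delimited file as sample IDs, assuming one header line and an ID column.
function sample_ids(path::AbstractString; delim::Char='\t')
    rows = filter(!isempty, readlines(path)[2:end])   # skip header, drop blank lines
    return [String(first(split(row, delim))) for row in rows]
end

# Tiny example tables written to a temporary directory.
dir = mktempdir()
otu_path = joinpath(dir, "data.tsv")
meta_path = joinpath(dir, "meta_data.tsv")
write(otu_path, "sample\tOTU1\tOTU2\ns1\t10\t0\ns2\t3\t7\n")
write(meta_path, "sample\tHABITAT\ns1\tsoil\ns2\tmarine\n")

otu_ids = sample_ids(otu_path)
meta_ids = sample_ids(meta_path)

ids_unique = allunique(otu_ids) && allunique(meta_ids)   # row IDs must be unique
rows_aligned = otu_ids == meta_ids                        # same samples, same order
println("unique IDs: ", ids_unique, ", aligned: ", rows_aligned)
```

If rows_aligned is false, reorder or filter one of the tables before passing both paths to learn_network.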
For delimited and JLD2 formats, FlashWeave treats rows of the table as observations (i.e. samples) and columns as variables (i.e. OTUs or meta variables), consistent with the majority of statistical and machine-learning applications, but in contrast to several other microbiome analysis frameworks. Behavior can be switched with the transposed=true flag.
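The two layouts can be illustrated with a small in-memory matrix. This is only a sketch with made-up counts: it shows how an OTUs-by-samples table relates to the samples-by-OTUs layout that FlashWeave assumes by default.

```julia
# OTU table laid out as in many microbiome tools: rows = OTUs, columns = samples.
otus_by_samples = [10 3 0;
                   0  7 2]

# Layout FlashWeave expects by default: rows = samples, columns = OTUs.
samples_by_otus = permutedims(otus_by_samples)

println(size(otus_by_samples))   # (2, 3): 2 OTUs, 3 samples
println(size(samples_by_otus))   # (3, 2): 3 samples, 2 OTUs
```

Instead of transposing the data yourself, the original layout can be kept and transposed=true passed to learn_network.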
Meta variables containing string factors with more than two categories are automatically one-hot encoded by FlashWeave prior to network inference to increase the reliability and interpretability of statistical tests (the user will be notified if this happens). For instance, the meta variable
HABITAT |
---|
soil |
soil |
marine |
river |
marine |
will be split into three dummy variables in the following fashion
HABITAT_soil | HABITAT_marine | HABITAT_river |
---|---|---|
1 | 0 | 0 |
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
Each dummy variable will be a separate node in the result network.
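The encoding shown in the tables above can be reproduced with a few lines of plain Julia. This is an illustrative sketch only: the helper one_hot is hypothetical and does not claim to mirror FlashWeave's internal implementation.

```julia
# Illustrative one-hot encoding (hypothetical helper, not FlashWeave's internal code).
function one_hot(name::AbstractString, labels::AbstractVector{<:AbstractString})
    categories = unique(labels)                        # order of first appearance
    headers = [string(name, "_", c) for c in categories]
    # One row per sample, one column per category; 1 marks the sample's category.
    encoded = [Int(label == c) for label in labels, c in categories]
    return headers, encoded
end

habitat = ["soil", "soil", "marine", "river", "marine"]
headers, encoded = one_hot("HABITAT", habitat)

println(headers)
println(encoded)
```

Each column of encoded corresponds to one dummy variable, so a factor with three categories contributes three nodes.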
FlashWeave currently does not support missing data; please remove all samples with missing entries (in both the OTU and meta data tables) prior to running FlashWeave.
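A minimal sketch of such a filtering step is given below, assuming the tables are already loaded as Julia matrices in which gaps are encoded as missing (the toy values and variable names are made up). The same rows are dropped from both tables so that they stay aligned.

```julia
# Toy tables with samples as rows; `missing` marks absent entries.
otu_table = Union{Int,Missing}[10 0 3; 5 missing 1; 0 2 2; 4 4 0]
meta_table = Union{String,Missing}["soil" "a"; "marine" "b"; missing "a"; "river" "b"]

# A sample is complete only if it has no missing entry in either table.
complete = [!any(ismissing, otu_table[i, :]) && !any(ismissing, meta_table[i, :])
            for i in 1:size(otu_table, 1)]

# Drop the same rows from both tables to keep them aligned,
# then narrow the element types now that no `missing` remains.
otu_clean = Int.(otu_table[complete, :])
meta_clean = String.(meta_table[complete, :])

println(complete)
println(size(otu_clean), " ", size(meta_clean))
```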
Depending on your data, make sure to choose the appropriate flags (heterogeneous=true for multi-habitat or -protocol data sets with ideally at least thousands of samples; sensitive=false for faster, but more coarse-grained associations) to achieve optimal runtime. If FlashWeave gets stuck on a small fraction of nodes with large neighborhoods, try increasing the convergence criterion (conv). To learn a network in parallel, see the section below.
Note that this package is optimized for large-scale data sets. On small data (hundreds of samples and OTUs), its speed advantages can be negated by JIT-compilation overhead.
FlashWeave leverages Julia's built-in parallel infrastructure. In the simplest case, you can start Julia with several workers
$ julia -p 4 # for 4 workers
or manually add workers at the beginning of an interactive session
julia> using Distributed; addprocs(4) # can be skipped if julia was started with -p
julia> @everywhere using FlashWeave
julia> learn_network(...
and network learning will be parallelized in a shared-memory, multi-process fashion.
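In scripts that may be started either with or without -p, it can be convenient to add workers only when none exist yet. The following sketch uses only Julia's Distributed standard library; the worker count of 2 is an arbitrary illustrative value.

```julia
using Distributed

# Add workers only if the session has none yet
# (nprocs() == 1 means only the main process is running).
n_requested = 2   # illustrative value, adjust to your machine
if nprocs() == 1
    addprocs(n_requested)
end

println("workers available: ", nworkers())

# Next steps, as in the session above (require FlashWeave to be installed):
# @everywhere using FlashWeave
# learn_network(...)
```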
If you want to run FlashWeave remotely on a computing cluster, a ClusterManager can be used (for example from the ClusterManagers.jl package, installable by typing ] and then add ClusterManagers). Details differ depending on the setup (queueing system, resource requirements etc.), but a simple example for a Sun Grid Engine (SGE) system would be:
julia> using ClusterManagers
julia> addprocs_qrsh(20) # 20 remote workers
julia> ## for more fine-grained control: addprocs(QRSHManager(20, "<your queue>"), qsub_env="<your environment>", params=Dict(:res_list=>"<requested resources>"))
julia> # or
julia> addprocs_sge(20)
julia> ## addprocs_sge(5, queue="<your queue>", qsub_env="<your environment>", res_list="<requested resources>")
Please refer to the ClusterManagers.jl documentation for further details.
To cite FlashWeave, please refer to our paper in Cell Systems:
Tackmann, Janko, Joao Frederico Matias Rodrigues, and Christian von Mering. "Rapid inference
of direct interactions in large-scale ecological networks from heterogeneous microbial
sequencing data." Cell Systems (2019).
Example BibTeX entry:
@article{tackmann2019rapid,
title={Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data},
author={Tackmann, Janko and Rodrigues, Joao Frederico Matias and von Mering, Christian},
journal={Cell Systems},
year={2019},
publisher={Elsevier},
doi={10.1016/j.cels.2019.08.002},
url={https://doi.org/10.1016/j.cels.2019.08.002}
}
FlashWeave follows semantic versioning. Stability guarantees are only provided for exported functions (the official API); anything else should be considered untested and subject to change. Note that FlashWeave is currently in its experimental phase (version < v1.0), which means that breaking interface changes may occur in every minor version.