Home | Installation Guide | User Guide | Command Line Reference | Developer Guide
GRNBoost is an Apache Spark Application library. A Spark application entails a program, bundled as a .jar file, that can be launched by submitting it to a Spark instance using the command line. Let's have a look at an example:
NOTE: the backslashes are used for multiline bash shell commands
$SPARK_HOME/bin/spark-submit \
--class org.aertslab.grnboost.GRNBoost \
--master local[*] \
--deploy-mode client \
--jars /home/xxx/.m2/repository/ml/dmlc/xgboost4j/0.7/xgboost4j-0.7.jar \
/path/to/GRNBoost.jar \
infer \
-i /path/to/dream5/training\ data/Network\ 1\ -\ in\ silico/net1_expression_data.transposed.tsv \
-tf /path/to/dream5/training\ data/Network\ 1\ -\ in\ silico/net1_transcription_factors.tsv \
-o /path/to/grnboost/output/net1_grnboost_depth3.tsv \
-p eta=0.01 \
-p max_depth=3 \
-p colsample_bytree=0.1 \
--truncate 100000
A GRNBoost job command consists of 4 different parts:
-
Call the
spark-submit
executable.$SPARK_HOME/bin/spark-submit \
-
Specify the Spark command line arguments. Consult to the Spark Submitting Applications page for detailed information.
--class org.aertslab.grnboost.GRNBoost \ --master local[*] \ --deploy-mode client \ --jars /path/to/xgboost4j-0.7.jar \
Notice that in the last line, we specify the location of the xgboost .jar file we already built or downloaded. We chose not to include xgboost by default in GRNBoost because it is built differently on different platforms (MaxOS, Unbuntu, ...). Instead we refer to it as an additional
.jar
file in the Spark job. -
Specify the path to the Spark Application
.jar
file, in this case the GRNBoost.jar artifact./path/to/GRNBoost.jar \
-
Specify the GRNBoost command line arguments. Consult the Command Line Reference for detailed information.
infer \ -i /media/tmo/data/work/datasets/dream5/training\ data/Network\ 1\ -\ in\ silico/net1_expression_data.transposed.tsv \ -tf /media/tmo/data/work/datasets/dream5/training\ data/Network\ 1\ -\ in\ silico/net1_transcription_factors.tsv \ -o /media/tmo/data/work/datasets/dream5/grnboost/net1/net1_grnboost_depth3.tsv \ -p eta=0.01 \ -p max_depth=3 \ -p colsample_bytree=0.1 \ --truncate 100000
GRNBoost parameters are typically:
- the input file containing the expression matrix
- the file containing the list of transcription factors
- the output file name
- some optional xgboost-specific parameters for controlling regression behaviour
- parameters for post-processing the collection of inferred regulatory links between candidate regulators and target genes
GRNBoost accepts text files with following layout. Each non-header line starts with a gene name and its expression profile across the observations. The CLI provides an option to skip header lines.
header etc. # <-- unused header line
GENE obs1 obs2 obs3 obs4 obs5 # <-- unused header line
Tspan12 0 0.666 0 0 0.089 # gene + expression profile
Gad1 1.800 0 0 0.061 0 # gene + expression profile
Neurod1 0 0 1.301 0.232 0 # gene + expression profile
...
...
# ^ ^ ^ ^ ^
# expression profile matrix from second to last column
#
# ^
# first column contains gene name
GRNBoost writes the inferred gene regulatory network to file as lines of regulator, target and importance. For example:
TF1 target1 0.234
TF2 target7 0.225
TF10 target2 0.201
...
...
Instructions on setting up a Spark cluster are out of scope for this manual, please consult the Apache Spark cluster overview documention.
We will treat two usages scenarios: running on a local machine and running on Amazon Elastic MapReduce.
Although Spark was designed to run on a multi-node compute cluster, it is also capable to make good use of the resources of a (preferably powerful) computer with one or more physical CPUs, like a single cluster node. In this case we can simply install Spark in a local folder and submit GRNBoost as a Spark job to that instance.
GRNBoost was developed against Spark 2.1.0
, so any version equal or higher than that will do nicely. We recommend downloading the default suggested release from the Spark downloads page.
-
Download a Spark release and unpack it somewhere.
-
Add the
SPARK_HOME
environment variable, specifying the location where you unpacked the Spark release, to your~/.bash_profile
or~/.bashrc
file. For example:export SPARK_HOME=~/<..folders../..here..>/spark-2.0.2-bin-hadoop2.7
-
Now we can submit a GRNBoost Spark job via the command line. We recommend putting the GRNBoost command in a file and making that file executable:
chmod +x <grnboost_command.sh>
.Make sure you have obtained or built the GRNBoost and xgboost artifacts, we refer to those in the job submit command. The job described above is a local mode job.
Observe the third line in the command:
$SPARK_HOME/bin/spark-submit \ --class org.aertslab.grnboost.GRNBoost \ --master local[*] \ ...
The master URL is in this case the local machine, with as many worker threads as logical cores on your machine.
Following steps walk through launching GRNBoost on Amazon Elastic MapReduce. Be aware that running on AWS can incur a monetary cost. We will focus on using the Amazon web interface.
TODO