Renv is used for this project.
To quickly initialize the R enviorment, run ./cli/configure_enviorment.R
.
There is also a docker container availible at schaunr/amlbiomarkers.
See more under Running the Pipeline
To install all the R packages just use renv::restore()
.
The renv autoloader should automatically boostrap renv and renv::restore()
will take care of installing all the correct package versions, although you will have to install system dependencies yourself.
Singlularity was not trivial to use, therfore I directly loaded modules. The following should be added to your .bashrc
file or prepend to any submission script (note, versions may differ across systems):
# Set up R enviorment
module load hdf5-serial/1.12.0
module load magick
module load fftw3/3.3.8
module load R/4.1.0
module load gdal/3.2.2
module load proj/7.2.1
module load geos/3.8.2
module load python/3.7-2019.10
# CIBERSORT options
export TOKEN={Token from website}
export EMAIL={email address used on website}
Then follow the directions under "Getting Setup"
The pipeline is stored in the cli directory. The functions each take an id which will automatically handle inputs and outputs. An invocation file is created if not present and will contain the arguments used at runtime. CLI functions cache some of their outputs to make repeated runs quicker. All outputs are saved, however, when cached, the expressions are only rerun if the data changes.
Some external datasets are required. TARGET AML, BeatAML, and TCGA LAML are all used during deconvolution.
Many different clinical information tables are also required.
These are all found in /clinical_info
A docker image can be found at schaunr/amlbiomarkers
. To quickly get setup and explore the container run:
docker run -it schaunr/amlbiomarkers bash
.
This will download the image, and open up to a terminal.
The image schaunr/amlbiomarkers:${TAG}
is multiarchitecture (x64 and ARM/Apple Silicon).
However, there will be no data inside the image.
To setup the entire pipeline, you'll need to bind mount several directories.
TAG="latest"
# Run using a script name as arguments
docker run \
-v "$(pwd)"/clinical_info:/clininfo \
-v "$(pwd)"/outs:/outs \
-v "$(pwd)"/plots:/plots \
schaunr/amlbiomarkers:{$TAG} \
survival_analysis.R \
-i run1 \
-I EFS \
-V 'c(matches("^Year|^Age|^Overall"), where(is_character))' \
-a 'Event Free Survival Time in Days' -E
# Run interactively in the terminal
docker run \
-it \
-v "$(pwd)"/clinical_info:/clinical_info \
-v "$(pwd)"/outs:/outs \
-v "$(pwd)"/plots:/plots \
schaunr/amlbiomarkers:{$TAG} \
bash
You can also use schaunr/amlbiomarkers:${TAG}
in a non-interactive mode by passing the name and options associated with a CLI script and the container will exit after completion. See first example above.
See Pipeline Flow Chart for details on connecting the various functions and a list of options for each script.
dimension_reduction.R: Subsets the data and runs dimension reductions (UMAP, PCA) and does batch correction using Harmony. Then finds clusters that are different between groups. The results are cached.
make_CIBERSORT_ref.R: Prepares data to upload to CIBERSORTx for reference creation. The results are cached.
deconvolute.R: Prepares references for deconvolution and runs several deconvolution methods. The results are cached.
CIBERSORTx.py: Runs the CIBERSORTx algorithm on the data.
prepare_deconvoluted_samples.R: Annotates deconvoluted samples with clinical information. Selects desired deconvolution method.
run_gsva.R: Runs a GSVA on the clusters. The results are cached.
run_cellphone.R: Runs the CellPhoneDB algorithm on the clusters. The results are cached.
survival_analysis.R: Does a survival analysis using the chosen parameters.
survival_analysis_multi_model.R: Does a survival analysis using LASSO, Random Forests, KNN Regression, Support Vector Regression, and Neural Network Regression
de.R: Does a differential expression per cluster
merge_targeted_with_whole_transcriptome.R: Merge unfiltered raw counts with existing Seurat object and normalize
--id
: An arbitrary ID to give the run.
Should be consistant across all functions.
Will be a subfolder of --dir
.
--dir
: What directory to use as the run directory.
Defualts to the current directory.
--verbose
: How should messages be printed.
--cores
: How many cores should be used for multicore processing.
This uses the future framework.
--help
: will print help for each CLI function.
--info
: information for column names and values for survival analysis.
Has a specific format of "Dataset:option=value:...".
Potential options are below. You can also provide the path to a json
file with the following format:
"Dataset": {
"option": "value",
"option2": "value",
...
}
Option | Decription |
---|---|
efs | Column containing the survivial information |
wbc | Column containing % White Blood Cells, not used currently |
status | Column containing the events |
event | Value in status to determine an event has occured |
not_event | Value in status to determine an event has not occured. All other values in status will be considered an event. |
Below is a flow chart that shows the prerequesites for each script. Color is provided to organize the scripts into several different categories.
- Simplify Clinical Annotation sheets
- Clear Downloading of External Datasets
- Master Pipeline Runner, i.e. Martian, NextFlow...