IARC course on analyzing TCGA data in the SevenBridges Genomics Cancer Genomics Cloud (SBG-CGC). Slides are in the slide folder.
Learning objectives
After completing this workshop, participants will be able to run their own computational tools on TCGA data in the cloud, using:
- the SevenBridges web interface to select and retrieve TCGA data,
- Docker and DockerHub to build and store containers to deploy their own computational tools,
- the Common Workflow Language (CWL) to describe the pipelines to run,
- the SevenBridges R API to run reproducible analyses automatically.
Main topics
- Introduction to Cloud computing
- Introduction to Docker and DockerHub
- SevenBridges R API and web interface
- TCGA data analysis
Wednesday 28 February
09:00-10:00 Introduction to cloud computing and the SevenBridges architecture
10:00-10:30 Introduction to TCGA data
10:30-11:00 Break
11:00-11:30 Introduction to the SevenBridges web interface to run analyses
11:30-12:30 Practical application: run your first basic analysis in the cloud
Thursday 1 March
09:00-09:30 Introduction to Docker and DockerHub
09:30-11:00 Practical application: building your own Docker container and running it in the cloud
11:00-11:30 Break
11:30-12:30 Introduction to the R API and CWL
Friday 2 March
09:00-12:30 Practical application: running your own practical project in the cloud using the R API, CWL and Docker.
12:30-14:00 Lunch Break
14:00-17:00 Practical application: running your own practical project in the cloud using the R API, CWL and Docker.
A Gitter channel is open for the course. Participants can use it to discuss their projects and to ask any questions about the course.
We presented the scientific projects conducted during the last day of the course at the IARC omics discussion (April 6th, 2018). Slides are hosted here.
Laptops use Ubuntu 16.04.
Docker is already installed. If you are curious, installation instructions are available on the Docker website.
If you need a good text editor, Atom is also installed.
Participants need to install R and RStudio. One possibility is to use the steps proposed in this gist.
Caution:
- Please change the RStudio version to install to the latest one: 1.1.423.
- You will probably need to add a key with:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys THE_KEY
- Two more packages are needed. Execute:
sudo apt install libcurl4-openssl-dev
and
sudo apt-get install libssl-dev
The R package sevenbridges-r is also needed:
source("https://bioconductor.org/biocLite.R")
biocLite("sevenbridges")
- Seven Bridges Cancer Genomics Cloud
- CGC documentation
- Cancer Genomics Cloud publication in Cancer Research
- Awesome TCGA: a curated list of TCGA resources maintained by the IARCbioinfo organization. The most useful ones for this course are:
- Genomic Data Commons (GDC) data portal: the official entry point to download TCGA data.
- GDC data release notes
- List of cohorts with sample sizes
- For each cohort, you can download clinical and biospecimen data (here, for example, for LUAD).
- TCGA barcode
- TCGA code tables
- TCGA data dictionary
- MAF file format description
- Docker and DockerHub
Important: your CGC token gives full access to your CGC account, including the protected TCGA data if you have access to it. This is like your username and password. This means that you should never share it with anyone, and only keep it in a secure location (not a USB key, a non-secure computer or a laptop leaving IARC).
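One way to follow this advice on your own machine is to keep the token in a file that only your user can read, rather than pasting it into scripts. The sketch below is an illustration, not an official procedure: CONFIG_DIR is a variable introduced here for convenience, and the ~/.sevenbridges/credentials path, the [cgc] profile name and the INI layout are assumptions based on the conventions of the sevenbridges R package.

```shell
#!/bin/sh
# Sketch only: store the CGC token in a file readable by your user alone.
# CONFIG_DIR is an illustrative variable; the default path and the INI layout
# are assumptions based on the sevenbridges R package conventions.
CONFIG_DIR="${CONFIG_DIR:-$HOME/.sevenbridges}"
CRED_FILE="$CONFIG_DIR/credentials"

umask 077                 # anything created below is private to the owner
mkdir -p "$CONFIG_DIR"

if [ -e "$CRED_FILE" ]; then
    echo "Not overwriting existing $CRED_FILE" >&2
else
    # Replace PASTE_YOUR_TOKEN_HERE by hand; never commit this file to git.
    cat > "$CRED_FILE" <<'EOF'
[cgc]
api_endpoint = https://cgc-api.sbgenomics.com/v2
auth_token = PASTE_YOUR_TOKEN_HERE
EOF
fi

chmod 600 "$CRED_FILE"    # owner-only read/write, even if the file pre-existed
ls -l "$CRED_FILE"
```

With this layout, the sevenbridges package should be able to load the profile with Auth(from = "file", profile_name = "cgc"), so the token never has to appear in an R script or in the repository.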
Main steps to think about:
- Find which software you want to run.
- Find which TCGA data you want to run it on.
- Try to run it locally if possible.
- Build a Docker container and try to run the analysis in the container.
- Create a Dockerfile and host it on this GitHub repository in your project folder.
- Create an associated automated build on Docker Hub in the iarcbioinfo organization. See this example to specify the folder of your Dockerfile. For this course, you should also uncheck the box "When active, builds will happen automatically on pushes". Otherwise your Docker container will be automatically rebuilt each time someone pushes something to GitHub. This is usually a useful feature, but it is not suitable for this course repository, which contains many different things and is shared by multiple users.
- Note that if you prefer to keep it private, you can also host your Docker image on the CGC (https://docs.cancergenomicscloud.org/docs/upload-your-docker-image).
- Create a project on the CGC.
- Add the TCGA data files you will need in your project.
- Create an App on the CGC that uses your Docker container hosted on Docker Hub (use the web interface or write your own CWL code).
- Create a Task to run your App on your files and run it (use the web interface or the R API).
For each project, we have opened an issue for discussion and added a folder to host the code.
Project 1: needlestack variant calling. Issue. Code.
Project 2: neutral tumor evolution. Issue. Code.
Project 3: cell populations from RNA-seq. Issue. Code.
Through the web interface, choose the file and copy it to your project.
You can also do this easily with the R client for the API (in the calls below, a is assumed to be your Auth object and p your project):
a$copyFile(id = a$public_file(name = "Homo_sapiens_assembly38.fasta", exact = TRUE)$id, project = p$id)
a$copyFile(id = a$public_file(name = "Homo_sapiens_assembly38.fasta.fai", exact = TRUE)$id, project = p$id)
You can use the interface to get the precise name of the file you need.
This R script gives an example of how to use the sevenbridges-r R package to query data on the CGC platform and copy the resulting files to your project.
A good starting point is to run the base container on your machine (docker run) and then interactively install the software you need in the container. Keep note of the commands you use, then create a Dockerfile with them. Once done, try to build from your Dockerfile using docker build. See the Docker tutorial for more details.
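The workflow above can be sketched as a short script. This is a minimal illustration under stated assumptions: bedtools is only a stand-in for the software you actually want to package, my-tool is a hypothetical image name, BUILD_DIR is a scratch directory introduced for the example, and ubuntu:16.04 is chosen as base image simply because it matches the course laptops.

```shell
#!/bin/sh
# Step 1 (interactive, run by hand): explore the base image and note the
# install commands that work, e.g.
#   docker run -it ubuntu:16.04 /bin/bash
#
# Step 2: record those commands in a Dockerfile. bedtools is only a stand-in
# for your own tool; my-tool is a hypothetical image name.
BUILD_DIR="${BUILD_DIR:-$(mktemp -d)}"

cat > "$BUILD_DIR/Dockerfile" <<'EOF'
FROM ubuntu:16.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends bedtools \
    && rm -rf /var/lib/apt/lists/*
EOF

# Step 3: build the image and check that the tool runs inside it.
if command -v docker >/dev/null 2>&1; then
    docker build -t my-tool "$BUILD_DIR" \
        && docker run --rm my-tool bedtools --version \
        || echo "docker build or run failed; fix the Dockerfile and retry" >&2
else
    echo "docker not found; Dockerfile written to $BUILD_DIR/Dockerfile"
fi
```

Keeping every install command in a single RUN instruction and cleaning the apt cache in the same step is a common way to keep the image small. Once the build works locally, the same Dockerfile can be committed to your project folder.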