Find your assigned server name on this website. Instructions to connect to the server can be found on this page.
Use the 'copy' button in the top right corner of each code block to copy these commands.
On Windows, use shift-\ (hold down shift, then press the backslash key) to paste into the Linux terminal. On a Mac, use command-v.
mkdir /workdir/$USER
cd /workdir/$USER
cp -r /workdir/sc_workshop_2024/cellranger .
Linux details:
- mkdir: make directory
- $USER: replace with your netid (i.e. BioHPC username)
- cd: change directory
- cp -r: copy directory and all of its contents to
- . (dot = here, your current location)
- Where am I?
  pwd = path of the present working directory
- What files/directories are here?
  ls -l = list (long format)
  You should now see the 'cellranger' directory copied to your working directory.
- /workdir/$USER: is the path to the working directory you set up for this workshop on your assigned server
- For many commands, you can use the full path (starting from the root /) or a relative path, but either way it needs to be correct
- forward-slashes denote nested directory and file names
- back-slashes have a very different meaning in Linux, they define 'escape' sequences to override literal character meanings. A common use of backslashes in this workshop will be to 'escape' the end-of-line character when code breaks across several lines. If the code is on a single line, the backslashes are not needed.
- try program-name --help on the command line.
- go to the BioHPC software page
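The backslash line continuation described above can be illustrated with a harmless sketch. It uses `echo` so that nothing is actually run, and the variable names `one_line` and `multi_line` are chosen here only for illustration.

```shell
# One command on a single line: no backslashes needed.
one_line=$(echo cellranger count --id=run_IgG1d --sample=IgG1d)

# The same command broken across lines: each backslash must be the very last
# character on its line (no trailing spaces), so the shell joins the lines.
multi_line=$(echo cellranger count \
    --id=run_IgG1d \
    --sample=IgG1d)

# Both variables now hold the same text.
echo "$one_line"
echo "$multi_line"
```

If a space follows a backslash, the shell escapes the space instead of the end-of-line and the command breaks at that point.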
As it could take several hours to run cellranger on the original data files, special fastq files were prepared with 1/2000 downsampling. Within each sub-directory, you will find a README file with a description of the sample and a link to the original data.
Under the directory you just copied, there are 6 sub-directories: 5 containing fastq files and 1 containing the reference genome:
- IgG1d: Singleplex single cell gene expression library from GEO series GSE201999.
- IgG4: Singleplex single cell gene expression library from GEO series GSE201999.
- UT: Singleplex single cell gene expression library from GEO series GSE201999.
- cellplex: Multiplex single cell gene expression library.
- cellplex_fb: Multiplex single cell gene expression library with antibody capture
- refdata-gex-GRCh38-2020-A: human reference genome required for cellranger
File details:
The first 3 directories contain data from an experiment described in this paper and will comprise the main dataset used in this workshop.
Two additional datasets (cellplex and cellplex_fb) are provided as examples to run 'cellranger multi' on datasets with hashed (multiplexed) samples and feature barcoding (CITE-seq, antibody capture).
The refdata-gex-GRCh38-2020-A directory was downloaded from 10x Genomics as a pre-indexed reference genome (+transcriptome) in the format required by cellranger. 10x Genomics provides a limited set of pre-indexed genomes; additional indexed genomes can be created with 'cellranger mkref' for gene expression (GEX) analysis of other species.
- Try listing the contents of each directory
- Add a new parameter to list all of the contents of all subdirectories in one command:
ls -lR
R = recursive list
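As a complement to the recursive listing, the sketch below counts the fastq files in each sample directory. It assumes the files end in `.fastq.gz` (an assumption; check the actual names with `ls`), and `count_fastqs` is a helper name introduced here for illustration.

```shell
# count_fastqs DIR: print how many *.fastq.gz files sit under DIR (recursively).
# Assumes the fastq files are gzip-compressed and named *.fastq.gz.
count_fastqs() {
    find "$1" -type f -name '*.fastq.gz' 2>/dev/null | grep -c ''
}

# Run from /workdir/$USER/cellranger; directories that are missing report 0.
for d in IgG1d IgG4 UT cellplex cellplex_fb; do
    echo "$d: $(count_fastqs "$d") fastq.gz files"
done
```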
cd /workdir/$USER/cellranger
export PATH=/programs/cellranger-7.2.0:$PATH
cellranger count --id=run_IgG1d --sample=IgG1d --transcriptome=refdata-gex-GRCh38-2020-A --fastqs=IgG1d --localcores=8 --localmem=24
Linux details:
- cd: change directory
  - You were at the top level of /workdir/$USER/, and moved into the subdirectory named 'cellranger'
  - You need to be in the 'cellranger' subdirectory for the code to work.
  - If you are not in the right place, figure out where you are ('pwd') and move ('cd') to /workdir/$USER/cellranger
- export PATH: this makes sure your code can find the cellranger program easily
- cellranger count: starts the 'cellranger count' program, with the following parameters:
  - --id= name of the cellranger run (required)
  - --sample= name of the sample to analyze (required; must match the beginning of the fastq filename, before the S#)
  - --transcriptome= name of the directory containing the reference genome (index formatted for cellranger) (required)
  - --fastqs= must match the directory name containing the fastq files (which happens to be the same as the sample name) (required)
  - --localcores= number of CPUs (cores) that your run of cellranger can use (recommended for shared servers)
  - --localmem= amount of memory that your run of cellranger can use (recommended for shared servers)
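To see what "the beginning of the fastq filename before the S#" means for --sample=, the sketch below strips the suffix from a fastq filename. It assumes bcl2fastq-style names of the form `<sample>_S#_L###_R#_###.fastq.gz`; the filename shown is hypothetical and `sample_from_fastq` is a helper name introduced here for illustration.

```shell
# sample_from_fastq FILE: strip the suffix (_S#_L###_R#_###.fastq.gz)
# to recover the prefix that --sample= has to match.
# Assumes bcl2fastq-style file naming.
sample_from_fastq() {
    basename "$1" | sed -E 's/_S[0-9]+_L[0-9]{3}_[RI][0-9]_[0-9]{3}\.fastq\.gz$//'
}

# Hypothetical filename, used only for illustration.
sample_from_fastq IgG1d/IgG1d_S1_L001_R1_001.fastq.gz   # prints IgG1d
```

Running the function on the real files in a sample directory is a quick way to confirm the value to pass to --sample=.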
Notes:
- This step takes about 5-10 minutes. While you are waiting, you can do Step 4: Run Loupe browser.
- The output files are in the directory run_IgG1d/outs. You can browse the output directory. Files of interest include:
  - web_summary.html: a summary web page;
  - possorted_genome_bam.bam: position-sorted read alignments;
  - filtered_feature_bc_matrix.h5: HDF5 formatted single cell gene expression count matrix;
  - filtered_feature_bc_matrix: directory containing MEX formatted matrix files.
  Either filtered_feature_bc_matrix.h5 or filtered_feature_bc_matrix can be used for downstream data analysis with Seurat or Scanpy.
- The parameters "--localcores=8 --localmem=24" restrict the CPU core and memory usage of the job.
- Cellranger results from the original data of the same library are provided in the directory "/workdir/sc_workshop_2024/GSE201999_output". In Step 4, you will examine the full dataset (without downsampling).
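Once the run has finished, a quick check like the sketch below can confirm that the files of interest are present. It assumes the results were written to the `outs` directory inside the run directory, and `check_outs` is a helper name introduced here for illustration.

```shell
# check_outs RUN_DIR: report which of the expected files exist under RUN_DIR/outs.
check_outs() {
    for f in web_summary.html possorted_genome_bam.bam \
             filtered_feature_bc_matrix.h5 filtered_feature_bc_matrix; do
        if [ -e "$1/outs/$f" ]; then
            echo "found    $f"
        else
            echo "missing  $f"
        fi
    done
}

# Run from /workdir/$USER/cellranger after 'cellranger count' completes.
check_outs run_IgG1d
```

If every entry is reported as missing, the run may still be in progress or may have stopped with an error.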
Replace all instances of "xxxxx" in the config.ori.csv file with your BioHPC userID, and write the modified content to a new file, config.csv. You can use the Linux sed command to do this.
cd /workdir/$USER/cellranger
sed "s/xxxxx/$USER/g" cellplex/config.ori.csv > cellplex/config.csv
Confirm that the file is updated with your BioHPC userID with the cat command, which will display the file contents in your terminal.
cat cellplex/config.csv
You can also use a text editor such as nano if you prefer.
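The sed substitution can be tried safely on a throwaway file first. The sketch below is an illustration only: the two sample lines and the `demo_user` name are made up, and the real config file may look different. It also shows that without the trailing `g` flag, sed replaces only the first "xxxxx" on each line.

```shell
# Work in a throwaway directory so no workshop file is touched.
tmp=$(mktemp -d)

# Made-up sample lines; the second one contains the placeholder twice.
printf '%s\n' 'reference,/workdir/xxxxx/ref' \
              'a,/workdir/xxxxx/x,/workdir/xxxxx/y' > "$tmp/config.ori.csv"

user=demo_user   # stand-in for $USER

# Without "g": only the first "xxxxx" on each line is replaced.
sed "s/xxxxx/$user/" "$tmp/config.ori.csv" > "$tmp/first_only.csv"

# With "g": every "xxxxx" on each line is replaced.
sed "s/xxxxx/$user/g" "$tmp/config.ori.csv" > "$tmp/all.csv"

# Count the lines that still contain the placeholder.
left_first=$(grep -c 'xxxxx' "$tmp/first_only.csv" || true)
left_all=$(grep -c 'xxxxx' "$tmp/all.csv" || true)
echo "lines still containing xxxxx: without g = $left_first, with g = $left_all"
```

The same `grep -c 'xxxxx'` check on the real config.csv should report 0 before starting cellranger.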
cellranger multi --id=run_plex --csv=cellplex/config.csv --localcores=8 --localmem=40
Linux details:
- cellranger multi: starts the 'cellranger multi' program, with the following parameters:
  - --id= name of the cellranger run (required)
  - --csv= name of the config file for the run (required)
  - --localcores= number of CPUs (cores) that your run of cellranger can use (recommended for shared servers)
  - --localmem= amount of memory that your run of cellranger can use (recommended for shared servers)
- What information is in the config file vs on the command line?
- Use ls to view the output directories and contents. Try adding parameters (-l, -R) to investigate the output directory structure and files.
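One way to approach the config-file question is to list the section headers of the config file. The sketch below builds a small hypothetical config in a temporary directory and prints its sections. The contents are an assumption based on the general 'cellranger multi' config layout, not a copy of the workshop file, and all paths, library names and sample names in it are placeholders.

```shell
tmp=$(mktemp -d)

# Hypothetical config, for illustration only; the workshop's config.csv may differ.
cat > "$tmp/config.csv" <<'EOF'
[gene-expression]
reference,/path/to/refdata-gex-GRCh38-2020-A

[libraries]
fastq_id,fastqs,feature_types
example_gex,/path/to/fastqs,Gene Expression
example_cmo,/path/to/fastqs,Multiplexing Capture

[samples]
sample_id,cmo_ids
sample1,CMO301
sample2,CMO302
EOF

# Section headers are the lines that start with "[".
sections=$(grep '^\[' "$tmp/config.csv")
echo "$sections"
```

Running `grep '^\[' cellplex/config.csv` on the real file shows which settings live in the config, as opposed to the run name and resource limits given on the command line.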
Notes:
- This step takes about 10-15 minutes.
- The output directory is run_plex/outs. There you will find a "per_sample_outs" directory, with results from each demultiplexed sample.
Replace all instances of "xxxxx" in the config.ori.csv file with your BioHPC userID, and write the modified content to a new file, config.csv. You can use the Linux sed command to do this.
cd /workdir/$USER/cellranger
sed "s/xxxxx/$USER/g" cellplex_fb/config.ori.csv > cellplex_fb/config.csv
cat cellplex_fb/config.csv
cellranger multi --id=run_plex_fb --csv=cellplex_fb/config.csv --localcores=8 --localmem=40
Notes:
- This step takes about 10-15 minutes.
- The output directory is run_plex_fb/outs. As in the previous run, you will find a "per_sample_outs" directory, with results from each demultiplexed sample.
- The protein expression levels are included in the gene expression count matrix, as additional features.
Use Filezilla (or another sftp client) to download cellranger results from the original data.
Host: your assigned server (cbsuxxxx.biohpc.cornell.edu)
Port: 22
Remote site: /workdir/sc_workshop_2024/GSE201999_output
Transfer the full directory with Filezilla to your laptop or computer
- TL-IgG1
- TL-IgG4
- Untreated
Within each sample directory, there should be a "web_summary.html" file and a "cloupe.cloupe" file.
Documentation of the web summary file and QC metrics:
https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/cr-outputs-web-summary-count https://www.10xgenomics.com/analysis-guides/quality-assessment-using-the-cell-ranger-web-summary
What do the QC metrics indicate about these samples?
Tutorials for Loupe:
- It could take a few hours to process each set of real data files, so you will need to run cellranger in a persistent "screen" session.
- Most likely you will need to run cellranger on multiple samples. Some of the BioHPC servers (large memory gen2) have >100 CPU cores, and there is no performance benefit to running cellranger on a single sample with >32 CPU cores. You will want to run the jobs in parallel to use all the available CPU cores.
You can do this step later during the week.
Note: nano is a text editor that is available within the terminal.
We can use nano to create new files or edit existing ones.
A brief tutorial on using nano:
In your terminal write nano followed by the name of the file that you want to write in. In this case we will type:
nano run.sh
This will open a nano editor within your terminal.
Copy and paste the following code in your nano editor.
On Windows, use shift-\ (hold down shift, then press the backslash key) to paste into the Linux terminal. On a Mac, use command-v.
cellranger count --id=run_IgG1d --sample=IgG1d --transcriptome=/workdir/$USER/cellranger/refdata-gex-GRCh38-2020-A --fastqs=/workdir/$USER/cellranger/IgG1d --localcores=8 --localmem=24
cellranger count --id=run_IgG4 --sample=IgG4 --transcriptome=/workdir/$USER/cellranger/refdata-gex-GRCh38-2020-A --fastqs=/workdir/$USER/cellranger/IgG4 --localcores=8 --localmem=24
cellranger count --id=run_UT --sample=UT --transcriptome=/workdir/$USER/cellranger/refdata-gex-GRCh38-2020-A --fastqs=/workdir/$USER/cellranger/UT --localcores=8 --localmem=24
Once all three lines are added in your nano editor, press Ctrl+X.
A prompt will appear towards the bottom saying Save modified buffer.
Type Y and the prompt will change to File Name to write : run.sh. Hit return/enter on the keyboard and your script is now created. If you type ls in your terminal, you will see a new file called run.sh in your working directory.
- This code uses the full path for the reference index (--transcriptome=) and fastq location (--fastqs=), allowing the script to be run from any directory.
- The output directories (named as --id=) will be created in the directory where you run the script.
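As an alternative to pasting the three lines by hand, run.sh could be generated with a loop. This is only a sketch: the `base` variable is introduced here for illustration, and the sample names and parameters are copied from the commands above.

```shell
# Directory holding the fastq sub-directories and the reference.
base=/workdir/$USER/cellranger

# Start with an empty run.sh, then append one command per sample.
: > run.sh
for s in IgG1d IgG4 UT; do
    echo "cellranger count --id=run_$s --sample=$s --transcriptome=$base/refdata-gex-GRCh38-2020-A --fastqs=$base/$s --localcores=8 --localmem=24" >> run.sh
done

# Show the generated script.
cat run.sh
```

Unlike the hand-written version, $USER is expanded when the file is generated, so run.sh contains your actual username in each path.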
Tutorial for "screen" can be found here.
There are two ways to run this script:
Run in serial (run cellranger on one sample at a time):
export PATH=/programs/cellranger-7.2.0:$PATH
sh run.sh
Run in parallel ("-j 2" : process two samples at a time):
export PATH=/programs/cellranger-7.2.0:$PATH
parallel -j 2 < run.sh
Notes:
- When running in parallel, make sure that the "--localmem" and "--localcores" values, multiplied by the number of jobs, do not exceed the total RAM or CPU cores on the server you are using. Your assigned server for this workshop has 128GB RAM and 24 CPU cores.
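The limit on parallel jobs can be worked out with shell arithmetic. The sketch below uses the server size and per-job settings quoted in this workshop; the variable names are introduced here for illustration.

```shell
total_cores=24   # CPU cores on the assigned workshop server
total_mem=128    # RAM in GB on the assigned workshop server
localcores=8     # --localcores per job
localmem=24      # --localmem per job (GB)

by_cores=$(( total_cores / localcores ))   # jobs that fit by CPU: 24 / 8 = 3
by_mem=$(( total_mem / localmem ))         # jobs that fit by memory: 128 / 24 = 5

# The smaller of the two limits is the one that applies.
if [ "$by_cores" -lt "$by_mem" ]; then
    max_jobs=$by_cores
else
    max_jobs=$by_mem
fi
echo "at most $max_jobs jobs in parallel"
```

With these numbers the CPU cores are the limiting resource, so the "-j 2" setting used above stays within the limit.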