This repository contains all the code to reproduce the results and figures in the paper Mitigation of chromosome loss in clinical CRISPR-Cas9-engineered T cells, bioRxiv 2023 (doi.org/10.1101/2023.03.22.533709).
The repo contains:
- Scripts for producing the results and figures in the paper:
crop_seq_and_cell_state_analysis.ipynb
chr14_11gRNA_heatmap.ipynb
Figure*.ipynb
(some of the notebooks depend on additional small data files in data/)
- Scripts and files for processing the raw fastq and intermediate files to produce the final dataset (gene-expression matrix, sgRNA assignments, copy-number estimates from inferCNV and estimated aneuploidy events):
data_processing/guide_assign_binomial.ipynb
data_processing/infercnv_prep.ipynb
data_processing/breakpoint_calc.ipynb
data_processing/inferCNV/Batch{1-9}.R
data_processing/feature_reference_v2.csv
Most of the code we provide refers to the main (CROP-seq) screen presented in our paper. The same code applies to the two other screens (Chr14 and CART) with minor changes.
The repo doesn’t contain data files. All the raw, intermediate and processed data files are available on the GEO series associated with the project (linked from our publication).
If you only need access to the fully processed data, you will likely want to download the following files from our GEO series:
qced.h5ad
concat_InferCNV.pkl
aneuploidy_events.csv
inferCNVgeneName.txt
Below are the notebooks containing all the code for producing all the results and figures presented in the paper. For each notebook, we provide a high-level description of what the code does, specify which figures are generated by the code, and list all the input data files needed to run the code.
Description: Most of the high-level analysis of the CROP-seq library, including the definition of aneuploidy events (based on the breakpoint estimates), comparison between targeted and lost chromosomes, cell cycle analysis, differential gene expression, and cell state analysis (UMAPs and clusters). Also includes cell state analysis of the TRAC library.
Produced figures: Fig. 2C, Fig. 2B, Fig. S3H, data for Fig. 3C (n_cells_per_chrom_loss_and_cell_cycle_status.csv), data for Fig. 3A (dge_results_v3.csv, dge_total_gene_scores_v3.csv), Fig. 3B, Fig. S2C-D, Fig. S4B, Fig. S4A, data for Fig. S2E (chrom_loss_status_and_cluster_cell_counts.csv), data for Fig. S4C (cell_cycle_and_cluster_cell_counts.csv), Fig. S1C-D, data for Fig. S1E (chrom_loss_status_and_cluster_cell_counts.csv; note that this is a different file from the one used for Fig. S2E)
Required data files: for the CROP-seq analysis: qced.h5ad, concat_InferCNV.pkl, inferCNVgeneName.txt, aneuploidy_events.csv; for the cell state analysis of the TRAC library: TRAC_Aneuploidy_events.csv, cas9ProcessedAneuploidyStatus.h5ad
Description: Performs the inferCNV analysis that is used as the basis for the aneuploidy events for cells treated with the 11 gRNAs targeting the TRAC locus of chromosome 14.
Produced figures: Fig. 1B
Required data files: The processed h5ad file, rawSingletForR15Kcells.h5ad, and the input file inferCNVgeneName.txt.
Description: Breakpoints and aneuploidy were quantified using this notebook. The workflow for estimating breakpoints is similar to the process used in the CROP-seq dataset (refer to crop_seq_and_cell_state_analysis.ipynb or the "Re-processing the data files" section for more details). It was used to create plots for Figures 1 C and D and serves as the underlying data for panel E.
Produced figures: Fig. 1 C-E
Required data files: cas9ProcessedAneuploidyStatus.h5ad, GUIDEvsNT_CHR14_RESULTS.txt, inferCNVgeneName.txt
Output file: TRAC_aneuploidy_breakpoints_ALLChromo.csv
Description: Creates a scatterplot of the genomic coordinates of imputed breakpoints for cells that we predict have lost part of a chromosome vs. the genomic coordinates targeted by the gRNA in these cells.
Produced figures: Fig. S2F
Required data files: aneuploidy_events.csv, qced.h5ad, inferCNVgeneName.txt, centromeres.txt, chr_lengths.csv (the last two files are available in the data/ folder in this repo, the others are on GEO)
Description: Datasets corresponding to activated T cells from a male donor were sourced from the ENCODE Portal (ATAC-seq: ENCFF233TXT, H3K36me3: ENCFF055FYI, and H3K9me3: ENCFF129GAM). The data was analyzed using this notebook. A two-sided Fisher’s exact test was used to determine whether the presence of an epigenetic mark affected chromosome loss.
Produced figures: Fig. 3D
Required data files: qced.h5ad, Concat_InferCNV.pkl, Guide_Genomic_Coordinates.xlsx, inferCNVgeneName.txt, All_Aneuploidy_Events_ByChromoAndDominantGuide.xlsx
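The two-sided Fisher's exact test mentioned in the description above can be illustrated with a small standard-library sketch. The 2x2 layout (rows: epigenetic mark present or absent at the target site; columns: chromosome lost or retained) and the counts are hypothetical and only show the mechanics. The notebook's actual implementation may differ, for instance by calling a statistics library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one
    # (the small tolerance guards against floating-point ties).
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: rows = mark present / absent, columns = loss / no loss.
p = fisher_exact_two_sided(8, 2, 1, 5)
print(f"two-sided p-value: {p:.4f}")
```

A small p-value would suggest that chromosome loss is not independent of the mark, under the assumptions of the test.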
Description: Breakpoints and aneuploidy were quantified using this notebook. Breakpoint and aneuploidy calling mirrored the methods in the CROP-seq experiment (see crop_seq_and_cell_state_analysis.ipynb or the "Re-processing the data files" section for more details).
Produced figures: Fig. 5C
Required data files: CAR-T_ALL-CELLS.h5ad, inferCNV_results_CAR-T_Donor6_Day4.txt, inferCNV_results_CAR-T_Donor6_Day7.txt, inferCNV_results_CAR-T_Donor7_Day4.txt, inferCNV_results_CAR-T_Donor7_Day7.txt, inferCNVgeneName.txt
Output file: CART_UCSF_aneuploidy_breakpoints.csv
Description: Breakpoints and aneuploidy were tabulated using this notebook, similarly to the CROP-seq experiment (refer to crop_seq_and_cell_state_analysis.ipynb or the "Re-processing the data files" section for more details).
Produced figures: Fig. 6E
Required data files: trbc_calls.csv, trac_calls.csv, pd1_calls.csv
Output file: Aneuploidy_Enrichment_in_vivo_samples.xlsx
Description: Creates alternate versions of the chromosomal loss enrichment matrix (original: Fig 2C) when different numbers of genes are included in the inferCNV outputs.
Produced figures: Fig. S6D
Required data files: Concat_InferCNV.pkl, qced.h5ad, inferCNVgeneName.txt
The raw data files for the CROP-seq dataset originate from three libraries, each with 24 fastq files:
- The gene-expression (GEX) library contains the files CTJD02{A-F}_S{1-6}_L00{3/4}_R{1/2}_001.fastq.gz, where A-F and 1-6 (respectively) correspond to the 6 lanes of the 10x (all on one chip), L003/L004 corresponds to the 2 lanes of the Illumina sequencer, and R1/R2 are the two ends of the paired-end sequencing.
- The Multi-Seq library contains the files CTJD02{G-L}_S{7-12}_L00{3/4}_R{1/2}_001.fastq.gz, where G-L and 7-12 (respectively) correspond to the 6 lanes of the 10x (all on one chip), L003/L004 corresponds to the 2 lanes of the Illumina sequencer, and R1/R2 are the two ends of the paired-end sequencing.
- The guide/sgRNA library contains the files JDCT003{A-F}_S{6-1}_L001_R{1/2}_001.fastq.gz and JDCT004{A-F}_S{1-6}_L001_R{1/2}_001.fastq.gz, where A-F and 6-1 or 1-6 (respectively) correspond to the 6 lanes of the 10x (all on one chip), and R1/R2 are the two ends of the paired-end sequencing. The JDCT003 and JDCT004 files from the same sample are merged later in the analysis to maximize the number of cells with called guides.
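The brace patterns above can be expanded programmatically to check that all expected files are present. The following is a minimal sketch; the helper name expand_fastq_names is hypothetical, and it assumes that the sample letters and S-numbers are paired in the order given (A with S1, B with S2, and so on; reversed for JDCT003).

```python
from itertools import product


def expand_fastq_names(prefix, samples, s_numbers, lanes, reads=("R1", "R2")):
    """Expand one filename pattern; samples and S-numbers are paired, not crossed."""
    names = []
    for (sample, s_num), lane, read in product(zip(samples, s_numbers), lanes, reads):
        names.append(f"{prefix}{sample}_S{s_num}_{lane}_{read}_001.fastq.gz")
    return names


gex = expand_fastq_names("CTJD02", "ABCDEF", range(1, 7), ("L003", "L004"))
multiseq = expand_fastq_names("CTJD02", "GHIJKL", range(7, 13), ("L003", "L004"))
guides = (
    expand_fastq_names("JDCT003", "ABCDEF", range(6, 0, -1), ("L001",))
    + expand_fastq_names("JDCT004", "ABCDEF", range(1, 7), ("L001",))
)

# Each library should expand to 24 file names.
print(len(gex), len(multiseq), len(guides))
```

Comparing these lists against a directory listing is a quick way to spot a missing or misnamed download.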
We use the cellranger multi command to process the GEX and Multi-Seq fastq files:
cellranger-7.0.0/bin/cellranger multi --id=sample{1-6} --csv=CTJD02{A-F}{G-L}.csv --localcores=29
The command has to be run 6 times, each with a different sample (--id=sample1, --id=sample2, etc.) and a corresponding CSV file (--csv=CTJD02AG.csv, --csv=CTJD02BH.csv, etc.).
Each GEX library lane A-F corresponds to Multi-Seq lane G-L (i.e. A goes with G, B goes with H, etc.).
The content of the CSV files CTJD02{A-F}{G-L}.csv is:
[gene-expression]
reference,/home/ssm-user/references/refdata-gex-GRCh38-2020-A/
cmo-set,featureRefMulti.csv
[libraries]
fastq_id,fastqs,feature_types
CTJD02{A-F},/GEX-fastq-dir/,Gene Expression
CTJD02{G-L},/multiseq-fastq-dir/,Multiplexing Capture
[samples]
sample_id,cmo_ids
sample1,CTJD02A
sample2,CTJD02B
sample3,CTJD02C
sample4,CTJD02D
sample5,CTJD02E
sample6,CTJD02F
Here too, make sure that each of the 6 CSV files has the correct fastq_id values under [libraries] (CTJD02A and CTJD02G, CTJD02B and CTJD02H, etc.). The [samples] section can remain identical across all 6 files.
All 6 samples use the same multiplexing oligos, provided in featureRefMulti.csv. The content of this CSV file is:
id,name,read,pattern,sequence,feature_type
CTJD02A,CTJD02A,R2,5P(BC),GGAGAAGA,Multiplexing Capture
CTJD02B,CTJD02B,R2,5P(BC),CCACAATG,Multiplexing Capture
CTJD02C,CTJD02C,R2,5P(BC),TGAGACCT,Multiplexing Capture
CTJD02D,CTJD02D,R2,5P(BC),GCACACGC,Multiplexing Capture
CTJD02E,CTJD02E,R2,5P(BC),AGAGAGAG,Multiplexing Capture
CTJD02F,CTJD02F,R2,5P(BC),TCACAGCA,Multiplexing Capture
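Since the six config files differ only in their fastq_id values, they can be generated rather than edited by hand. This is a minimal sketch under that assumption; the function names are hypothetical, and the paths are the placeholder paths shown above, which need to be adapted to the local setup.

```python
GEX_LANES = "ABCDEF"
MULTISEQ_LANES = "GHIJKL"


def multi_config(gex_lane, multiseq_lane):
    """Return the text of CTJD02{gex_lane}{multiseq_lane}.csv."""
    lines = [
        "[gene-expression]",
        "reference,/home/ssm-user/references/refdata-gex-GRCh38-2020-A/",
        "cmo-set,featureRefMulti.csv",
        "[libraries]",
        "fastq_id,fastqs,feature_types",
        f"CTJD02{gex_lane},/GEX-fastq-dir/,Gene Expression",
        f"CTJD02{multiseq_lane},/multiseq-fastq-dir/,Multiplexing Capture",
        "[samples]",
        "sample_id,cmo_ids",
    ]
    # The [samples] section is identical across all six files.
    lines += [f"sample{i},CTJD02{lane}" for i, lane in enumerate(GEX_LANES, start=1)]
    return "\n".join(lines) + "\n"


def multi_commands():
    """Return the six cellranger multi command lines, one per lane pair."""
    return [
        f"cellranger-7.0.0/bin/cellranger multi --id=sample{i} "
        f"--csv=CTJD02{gex}{ms}.csv --localcores=29"
        for i, (gex, ms) in enumerate(zip(GEX_LANES, MULTISEQ_LANES), start=1)
    ]


for command in multi_commands():
    print(command)
```

Writing multi_config(gex, ms) to CTJD02{gex}{ms}.csv for each lane pair before running the printed commands would reproduce the setup described above.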
Output files:
For each of the 6 samples, a CSV file named assignment_confidence_table.csv is created. In the GEX directory, they are named sample{1-6}-assignment_confidence_table.csv.
Here we:
- Concatenate the fastq files from respective lanes from the two sequencing runs of the sgRNA libraries, specifically JDCT003 and JDCT004. By running the full pipeline on the concatenated guide libraries, we maximize the number of cells with a single, confidently assigned guide.
- Run the cellranger count command, which counts the number of reads per guide per cell and generates the gene-expression matrices.
The following command concatenates the raw fastq files from the JDCT003 and JDCT004 sequencing runs for each of the 6 samples (A-F) and both paired-ends (R1 and R2):
cat JDCT003A_S6_L001_R1_001.fastq.gz JDCT004A_S1_L001_R1_001.fastq.gz > JDCT005A_S1_L001_R1_001.fastq.gz
(the example is for read R1 of sample A)
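The same concatenation can be scripted for all 6 samples and both reads. The sketch below relies on the fact that concatenated gzip files form a valid multi-member gzip file, which is also why the cat command works. The helper names are hypothetical, and the S-number of the merged JDCT005 files for samples other than A is an assumption extrapolated from the single example above.

```python
import shutil


def merge_plan(samples="ABCDEF", reads=("R1", "R2")):
    """List (input files, output file) pairs for the JDCT003 + JDCT004 merge."""
    plan = []
    for index, sample in enumerate(samples, start=1):
        for read in reads:
            inputs = [
                # JDCT003 S-numbers run in reverse order (S6 for A ... S1 for F).
                f"JDCT003{sample}_S{len(samples) + 1 - index}_L001_{read}_001.fastq.gz",
                f"JDCT004{sample}_S{index}_L001_{read}_001.fastq.gz",
            ]
            # Assumption: the merged file keeps the JDCT004 S-number.
            output = f"JDCT005{sample}_S{index}_L001_{read}_001.fastq.gz"
            plan.append((inputs, output))
    return plan


def concat_files(inputs, output):
    """Byte-wise concatenation, equivalent to `cat inputs... > output`."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


for inputs, output in merge_plan():
    print(" + ".join(inputs), "->", output)
```

Calling concat_files(inputs, output) for each entry of the plan performs the actual merge.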
The following command is run for each of the 6 samples (you know the drill on how to interpret {A-F}):
cellranger-7.0.1/bin/cellranger count --id=sample{A-F} --transcriptome=/home/ssm-user/references/refdata-gex-GRCh38-2020-A/ --libraries=/home/ssm-user/csvs/sample{A-F}.csv --feature-ref=/data/feature_reference_v2.csv --localcores=62
Input files:
- feature_reference_v2.csv (provided in data_processing/ in the repo): Specifies all the ~400 guides used.
- sample{A-F}.csv (example provided for sampleA.csv):
fastqs,sample,library_type,
/gex_dir/,CTJD02A,Gene Expression,
/sgrna_dir/,JDCT005A,CRISPR Guide Capture,
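The per-sample libraries CSV and the matching command line can be generated in a loop. This is only a sketch: the function names are hypothetical, the directories are the placeholders from the example above, and the trailing commas shown in the example are not reproduced here.

```python
import csv
import io


def libraries_csv(sample, gex_dir="/gex_dir/", sgrna_dir="/sgrna_dir/"):
    """Return the text of sample{sample}.csv for one of the samples A-F."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    writer.writerow([gex_dir, f"CTJD02{sample}", "Gene Expression"])
    # JDCT005 is the merged JDCT003 + JDCT004 guide library.
    writer.writerow([sgrna_dir, f"JDCT005{sample}", "CRISPR Guide Capture"])
    return buffer.getvalue()


def count_command(sample):
    """Return the cellranger count command line for one sample."""
    return (
        f"cellranger-7.0.1/bin/cellranger count --id=sample{sample} "
        "--transcriptome=/home/ssm-user/references/refdata-gex-GRCh38-2020-A/ "
        f"--libraries=/home/ssm-user/csvs/sample{sample}.csv "
        "--feature-ref=/data/feature_reference_v2.csv --localcores=62"
    )


for sample in "ABCDEF":
    print(count_command(sample))
```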
Output files:
- filtered_feature_bc_matrix.h5: Gene expression (GEX) matrices for each lane.
  - GEO file names: sample{A-F}-filtered_feature_bc_matrix.h5
- protospacer_calls_per_cell.csv: CSV file mapping between cell barcodes and guides, enumerating how many guide counts are found in each cell.
  - Contains the following columns:
    - cell_barcode: cell barcode
    - num_features: how many different guides were identified
    - feature_call: the names of the identified guides
    - num_umis: UMIs counted for each guide
  - GEO file names: sample{A-F}-protospacer_calls_per_cell.csv
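To make the column layout concrete, here is a minimal sketch that parses rows of protospacer_calls_per_cell.csv and ranks the guides in each cell by UMI count. It assumes that cells with several guides store them as "|"-separated values in feature_call and num_umis; the barcodes and guide names in the example are made up.

```python
import csv
import io


def ranked_guides(row):
    """Return (guide, UMI count) pairs for one cell, highest count first."""
    names = row["feature_call"].split("|")
    umis = [int(value) for value in row["num_umis"].split("|")]
    return sorted(zip(names, umis), key=lambda pair: pair[1], reverse=True)


# Hypothetical file content with one single-guide and one two-guide cell.
example = io.StringIO(
    "cell_barcode,num_features,feature_call,num_umis\n"
    "AAACCCAAGAAACCAT-1,1,guide_X,12\n"
    "AAACCCAAGAAACTGC-1,2,guide_Y|guide_Z,3|25\n"
)

for row in csv.DictReader(example):
    print(row["cell_barcode"], ranked_guides(row))
```

The first and second entries of each ranked list are the inputs to the guide assignment step that follows.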
The notebook data_processing/guide_assign_binomial.ipynb integrates GEX, Multi-Seq, and guide calls into one consolidated AnnData object.
Input files:
For each of the six samples ({A-F}):
- filtered_feature_bc_matrix.h5: Gene expression matrices for each lane from the previous step.
- protospacer_calls_per_cell.csv: A mapping CSV file between cell barcodes and guides from the previous step.
- assignment_confidence_table.csv: Derived from Step 1, this table contains data essential for the guide assignment process.
What the code does:
- Integrate the raw h5 gene-expression data with the guide counts from the six samples.
- Determine the first and second most common guide per cell.
- Use a binomial test to determine whether the count of the most common guide significantly exceeds that of the second most common guide.
- Incorporate sample calls into the AnnData object.
- Calculate quality control (QC) metrics.
Output file: fully_processed.h5ad: a consolidated AnnData file with the processed data (available on GEO).
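The binomial comparison between the two most common guides can be sketched as follows. This is an illustration, not the notebook's exact code: it assumes a one-sided test with a null probability of 0.5 (both guides equally likely to receive each UMI), and the significance threshold alpha and the guide names are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_dominant_guide(umi_counts, alpha=0.05):
    """Return (guide or None, p-value) for a dict of guide -> UMI count."""
    ranked = sorted(umi_counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None, 1.0
    top_guide, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    # Under the null, the top guide's share of (top + second) UMIs is Binomial(n, 0.5).
    p_value = binomial_upper_tail(top, top + second)
    return (top_guide if p_value < alpha else None), p_value


print(call_dominant_guide({"guide_A": 20, "guide_B": 2}))
print(call_dominant_guide({"guide_A": 5, "guide_B": 5}))
```

Note that under this formulation a cell with a single guide and very few UMIs is not called either, because the evidence is too weak to reach significance.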
The gtf_to_position_file.py script (provided by inferCNV) constructs a metadata file with the chromosomal positions of genes presented in the copy-number matrix of inferCNV.
Command line:
python gtf_to_position_file.py genes.gtf inferCNVgeneName.txt
Following the script execution, the ENS gene IDs in inferCNVgeneName.txt are translated into gene names.
Input file: genes.gtf from cellranger's reference (refdata-gex-GRCh38-2020-A/genes/genes.gtf)
Output file: inferCNVgeneName.txt (available on GEO)
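Downstream steps need the genes of each chromosome in genomic order. A minimal sketch of reading the position file is shown below; it assumes the usual inferCNV gene-ordering layout (tab-separated, no header: gene, chromosome, start, end), and the gene names and coordinates in the example are made up.

```python
from collections import defaultdict


def genes_by_chromosome(lines):
    """Map each chromosome to its gene names, sorted by start position."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        gene, chrom, start, end = line.rstrip("\n").split("\t")
        by_chrom[chrom].append((int(start), int(end), gene))
    return {
        chrom: [gene for _, _, gene in sorted(entries)]
        for chrom, entries in by_chrom.items()
    }


# Hypothetical entries in the assumed four-column format.
example_lines = [
    "GENE_B\tchr14\t5000\t6000\n",
    "GENE_A\tchr14\t1000\t2000\n",
    "GENE_C\tchr2\t300\t900\n",
]
print(genes_by_chromosome(example_lines))
```

With a real file, passing an open file handle (for example open("inferCNVgeneName.txt")) in place of example_lines would work the same way.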
The notebook data_processing/infercnv_prep.ipynb applies quality control measures and prepares the files for inferCNV analysis.
Input file: fully_processed.h5ad, the consolidated AnnData file from Step 3.
What the code does:
- Implement quality control by applying the singlet filter, a filter for less than 10K total counts, and a filter for less than 10% mitochondrial content.
- Segment the data into 9 batches, each representing a different cell subset, to improve inferCNV's computational efficiency.
Output files:
- qced.h5ad: the final processed h5ad file utilized for most analyses (available on GEO).
- Files for inferCNV (3 files for each of the 9 batches):
annotations_Batch{1-9}.csv
genes_Batch{1-9}.csv
counts_Batch{1-9}.h5ad
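The filtering and batching logic can be illustrated on plain Python records. This sketch makes several assumptions that should be checked against the notebook: that cells are kept when they are singlets with fewer than 10K total counts and less than 10% mitochondrial content, that batches are contiguous and near-equal in size, and that the field names (is_singlet, total_counts, mito_frac) are hypothetical.

```python
def passes_qc(cell, max_counts=10_000, max_mito_frac=0.10):
    """Assumed reading of the filters: both thresholds are upper bounds."""
    return (
        cell["is_singlet"]
        and cell["total_counts"] < max_counts
        and cell["mito_frac"] < max_mito_frac
    )


def split_into_batches(items, n_batches=9):
    """Split a list into n_batches contiguous, near-equal chunks."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


# Hypothetical cells: only the first one passes all three filters.
cells = [
    {"barcode": "cell_1", "is_singlet": True, "total_counts": 4200, "mito_frac": 0.03},
    {"barcode": "cell_2", "is_singlet": False, "total_counts": 3900, "mito_frac": 0.02},
    {"barcode": "cell_3", "is_singlet": True, "total_counts": 15000, "mito_frac": 0.04},
    {"barcode": "cell_4", "is_singlet": True, "total_counts": 5100, "mito_frac": 0.25},
]
kept = [cell["barcode"] for cell in cells if passes_qc(cell)]
print(kept)
print([len(batch) for batch in split_into_batches(list(range(20)))])
```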
The R scripts data_processing/inferCNV/Batch{1-9}.R conduct the inferCNV analysis for each batch of data.
Input files:
- annotations_Batch{1-9}.csv (from Step 5).
- genes_Batch{1-9}.csv (from Step 5).
- counts_Batch{1-9}.h5ad (from Step 5).
- inferCNVgeneName.txt (from Step 4).
What the scripts do: Process and analyze the segmented data batches using inferCNV to quantify copy number variations along the genome in each cell.
Output file: infercnv.observations.txt for each of the 9 batches (available on GEO as Batch{1-9}-infercnv.observations.txt)
The notebook data_processing/breakpoint_calc.ipynb combines the outputs from the inferCNV analysis. The code then scans each chromosome for potential "breakpoints" where the difference between the average inferCNV values to the left and right is greatest.
Input files:
- qced.h5ad (from Step 5).
- infercnv.observations.txt for each of the 9 batches (from Step 6).
- inferCNVgeneName.txt (from Step 4).
What the code does:
- Read and consolidate the infercnv.observations.txt files from the 9 batches.
- Save the combined data as a pickled object for faster future loading.
- Identify and quantify potential chromosomal breakpoints based on the inferCNV values.
Output files:
- concat_InferCNV.pkl: the concatenated inferCNV data stored as a pickle object for swift access (available on GEO).
- aneuploidy_events.csv: for each cell and chromosome, the number of inferCNV genes and average inferCNV values to the left and right of the most likely breakpoint (available on GEO).
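The breakpoint scan described above can be sketched for a single cell and chromosome. This is an illustration of the idea rather than the notebook's code: it assumes the inferCNV values are already ordered by genomic position, uses the absolute difference between the left and right means as the score, and the function name and example values are hypothetical.

```python
def best_breakpoint(values):
    """Find the split that maximizes |mean(left) - mean(right)|.

    `values` holds the inferCNV values of one cell along one chromosome,
    ordered by genomic position. The breakpoint index i means that genes
    [0, i) are on the left and genes [i, n) are on the right.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two genes to place a breakpoint")
    total = sum(values)
    best = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]  # running sum avoids recomputing each mean
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        diff = abs(left_mean - right_mean)
        if best is None or diff > best["diff"]:
            best = {
                "index": i,
                "n_left": i,
                "n_right": n - i,
                "left_mean": left_mean,
                "right_mean": right_mean,
                "diff": diff,
            }
    return best


# Hypothetical profile: normal signal for 5 genes, then a drop for 3 genes.
print(best_breakpoint([1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5]))
```

The returned gene counts and means on each side correspond to the kind of per-cell, per-chromosome summary stored in aneuploidy_events.csv.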