input: guide_quant_format.txt (guideline for ENCODE standard) and ENCODE-formatted guide_quant files
NOTE: "guide_quant_format.txt" in this repository might not be up-to-date (last modified: 5/10/2021).
Purpose: Check whether the input file follows the format described in the guide_quant_format guideline.
sample command line: python {format description file} {test file}
python guide_quant_format.txt GATA_rep1_HS_ENCODE_guideQuant.bed
If successful:
Test 1 passed
Test 2 passed
Test 3 passed
If unsuccessful:
Test 1 passed
Test 2 failed. In line 4695, 15'th element must be either targeting or negative_control
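As a rough illustration of the kind of check reported above, here is a minimal sketch that flags rows whose guide-type field is neither targeting nor negative_control. It is not the repository's script: the function name `check_guide_type` is hypothetical, fields are assumed to be whitespace-separated, and the field is read at 0-based index 15, which is where the value sits in the example rows shown in this document.

```python
# Hypothetical sketch, not the repository's format checker.
ALLOWED_GUIDE_TYPES = {"targeting", "negative_control"}


def check_guide_type(lines, index=15):
    """Return (line_number, value) for every row whose field at the given
    0-based index is missing or not an allowed guide type."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) <= index:
            problems.append((line_number, None))  # row is too short
        elif fields[index] not in ALLOWED_GUIDE_TYPES:
            problems.append((line_number, fields[index]))
    return problems


good_row = (
    "chr12 54300767 54300770 GATA1|chr12:54300748-54300767:+ 366 + "
    "chr12:54300748-54300767:+ chrX 48786590 48786591 + GATA1 "
    "ENSG00000102145 GGATTCCAGTGAGATCCGAG GGATTCCAGTGAGATCCGAG targeting NA"
)
# Made-up bad row for illustration: the guide type is replaced by "enhancer".
bad_row = good_row.replace("targeting", "enhancer")

for line_number, value in check_guide_type([good_row, bad_row]):
    print(f"In line {line_number}, unexpected guide type: {value}")
```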
input: ENCODE-formatted guide_quant files and a reference genome fasta file. The hg38 reference can be downloaded from:
Purpose: If your file passes the format check, use this script to verify that your PAM coordinates were extracted correctly by checking whether the sequences at those coordinates are NGG.
sample command line: python {ifile: guide_quant} {reference fasta}
python MYC_rep1_LS_ENCODE_guideQuant.bed GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
NGG 50732.0
other 398.0
More than 80% of the PAMs are NGG. The coordinates are likely to be correct
NGG 5433.0
other 4689.0
Less than 80% of the PAMs are NGG. The coordinates are likely to be incorrect
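The idea behind this check can be sketched as follows. This is a simplified illustration, not the repository's script: the genome is a small in-memory dictionary rather than a fasta file, PAM coordinates are assumed to be 0-based half-open (BED style), and the helper names `is_ngg` and `summarize_pams` are hypothetical. For (-)-strand PAMs the forward-strand sequence is reverse-complemented before testing for NGG.

```python
# Hypothetical sketch of an NGG sanity check on PAM coordinates.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_ngg(pam_seq, strand):
    """True if the 3-nt PAM reads NGG on its own strand."""
    seq = pam_seq.upper()
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return len(seq) == 3 and seq[1:] == "GG"


def summarize_pams(records, genome, threshold=0.8):
    """Count NGG vs. other PAMs; report whether the NGG fraction exceeds threshold.

    records: iterable of (chrom, start, end, strand), 0-based half-open (assumed).
    genome: dict mapping chromosome name to its forward-strand sequence.
    """
    counts = {"NGG": 0, "other": 0}
    for chrom, start, end, strand in records:
        pam = genome[chrom][start:end]
        counts["NGG" if is_ngg(pam, strand) else "other"] += 1
    total = counts["NGG"] + counts["other"]
    fraction = counts["NGG"] / total if total else 0.0
    return counts, fraction > threshold


# Toy genome and coordinates, made up for illustration.
toy_genome = {"chrT": "AAATGGCCCACCTTT"}
toy_records = [
    ("chrT", 3, 6, "+"),   # TGG -> NGG
    ("chrT", 6, 9, "-"),   # CCC on forward strand -> GGG on (-) strand -> NGG
    ("chrT", 9, 12, "-"),  # ACC on forward strand -> GGT on (-) strand -> other
    ("chrT", 0, 3, "+"),   # AAA -> other
]
toy_counts, toy_ok = summarize_pams(toy_records, toy_genome)
print(toy_counts, "likely correct" if toy_ok else "likely incorrect")
```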
input: ENCODE-formatted guide_quant files
Purpose: compute log2FC in three different ways: raw log2FC, log2FC z-transformed using negative control guides, and log2FC z-transformed using all guides
output: log2FC_summary files
sample command line (python {gRNA quant 1 (e.g. T0)} {gRNA quant 2 (e.g. T14)} {ofile prefix})
python MYC_rep1_LS_ENCODE_guideQuant.bed screens/MYC_rep1_HS_ENCODE_guideQuant.bed Sabeti_HCRFlowFISH_MYC_R1
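The three quantities named above can be sketched as follows. The details are assumptions, not taken from the repository's script: counts are normalized by library size, a pseudocount of 1 is added, the fold change is taken as quant 2 over quant 1, and the z-transform uses the population standard deviation. The function names are hypothetical.

```python
# Hypothetical sketch of raw and z-transformed log2 fold changes.
import math
import statistics


def raw_log2fc(counts_1, counts_2, pseudocount=1.0):
    """log2 of library-size-normalized counts, quant 2 over quant 1 (assumed direction)."""
    total_1, total_2 = sum(counts_1), sum(counts_2)
    return [
        math.log2(((c2 + pseudocount) / total_2) / ((c1 + pseudocount) / total_1))
        for c1, c2 in zip(counts_1, counts_2)
    ]


def z_transform(values, reference):
    """Center and scale values by the mean and standard deviation of reference."""
    mean = statistics.mean(reference)
    sd = statistics.pstdev(reference)
    if sd == 0:
        raise ValueError("reference values have zero variance")
    return [(v - mean) / sd for v in values]


# Toy counts, made up for illustration.
counts_t0 = [100, 100, 100, 100]
counts_t14 = [100, 200, 50, 50]
is_negative_control = [True, False, True, True]

raw = raw_log2fc(counts_t0, counts_t14)
controls = [v for v, is_ctrl in zip(raw, is_negative_control) if is_ctrl]
z_by_neg_control = z_transform(raw, controls)
z_by_all_guides = z_transform(raw, raw)

for row in zip(raw, z_by_neg_control, z_by_all_guides):
    print("\t".join(f"{x:.4f}" for x in row))
```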
input: log2FC_summary file
Purpose: select one type of log2FC from log2FC_summary and use it to generate a .bedgraph file. To ensure that genome coordinates do not overlap (for compatibility with the UCSC Genome Browser), the output PAM coordinates have a width of 1 (the coordinate of 'N'GG for the (+)-strand and of CC'N' for the (-)-strand).
output: .bedgraph
sample command line (format {log2FC summary} {col index for log2fc} {ofile name})
python Sabeti_HCRFlowFISH_MYC_R1.log2FC_summary 2 Sabeti_HCRFlowFISH_MYC_R1_log2FC_0.bedgraph
python Sabeti_HCRFlowFISH_MYC_R1.log2FC_summary 3 Sabeti_HCRFlowFISH_MYC_R1_log2FC_1.bedgraph
python Sabeti_HCRFlowFISH_MYC_R1.log2FC_summary 4 Sabeti_HCRFlowFISH_MYC_R1_log2FC_2.bedgraph
python Sabeti_HCRFlowFISH_MYC_R1.log2FC_summary 5 Sabeti_HCRFlowFISH_MYC_R1_log2FC_3.bedgraph
python Sabeti_HCRFlowFISH_MYC_R1.log2FC_summary 6 Sabeti_HCRFlowFISH_MYC_R1_log2FC_4.bedgraph
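The reduction of a 3-nt PAM to the single 'N' position can be sketched as follows. This is an illustration under assumptions, not the repository's script: PAM_ID coordinates are treated as 0-based half-open, the output is written as a 1-bp half-open interval, and the function names are hypothetical. The example .bedgraph lines in this document list equal start and end values, so the actual script may use a different coordinate convention.

```python
# Hypothetical sketch: collapse a PAM interval to the position of its 'N' base.
def pam_n_interval(start, end, strand):
    """Return a 1-bp interval (0-based, half-open; assumed) covering the N of the PAM.

    On the (+) strand the PAM reads NGG, so N is the first base.
    On the (-) strand the forward sequence reads CCN, so N is the last base.
    """
    if strand == "+":
        return start, start + 1
    if strand == "-":
        return end - 1, end
    raise ValueError(f"unexpected strand: {strand!r}")


def summary_row_to_bedgraph(row, value_index=2):
    """Convert one log2FC_summary row to a bedgraph line.

    value_index is the 0-based column holding the chosen log2FC value.
    """
    fields = row.split()
    chrom, span, strand = fields[1].split(":")  # PAM_ID, e.g. chr12:54300767-54300770:+
    start, end = (int(x) for x in span.split("-"))
    n_start, n_end = pam_n_interval(start, end, strand)
    return f"{chrom}\t{n_start}\t{n_end}\t{fields[value_index]}"


example_row = (
    "GATA1|chr12:54300748-54300767:+ chr12:54300767-54300770:+ "
    "-0.6898171127857369 -1.3909202490331116 -1.1558450982161879"
)
print(summary_row_to_bedgraph(example_row, value_index=2))
```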
For more detail, check "guide_quant_format.txt"
chr12 54300767 54300770 GATA1|chr12:54300748-54300767:+ 366 + chr12:54300748-54300767:+ chrX 48786590 48786591 + GATA1 ENSG00000102145 GGATTCCAGTGAGATCCGAG GGATTCCAGTGAGATCCGAG targeting NA
chr12 54300811 54300814 GATA1|chr12:54300792-54300811:+ 551 + chr12:54300792-54300811:+ chrX 48786590 48786591 + GATA1 ENSG00000102145 CTCCACCACAGGTGCCTGAA GCTCCACCACAGGTGCCTGAA targeting NA
Only the first three, fifth, and sixth columns are relevant for these scripts.
First three: PAM coordinate
Fifth: total number of gRNA sequences sequenced in a cell population
Sixth: strand location of the PAM
name PAM_ID raw_log2FC ztransformed_by_neg_control ztransformed_by_all_guides
GATA1|chr12:54300748-54300767:+ chr12:54300767-54300770:+ -0.6898171127857369 -1.3909202490331116 -1.1558450982161879
GATA1|chr12:54300792-54300811:+ chr12:54300811-54300814:+ 0.14274017211608214 -0.3609625301812802 -0.3592771390999452
chrX 48476136 48476136 0.16272950003810832
chrX 48476137 48476137 0.14542752477380644
Some of the CRISPR screens used in the CRISPR WG had coordinates in hg19 (Engreitz/Bassik/Shen labs), and we converted their hg19 coordinates to hg38 using bowtie1. The Python and shell scripts used for this are uploaded here and can be executed in this order: -> -> -> Note that some of these scripts need to be reconfigured before they can be used (e.g. the directory names specified in the .sh files).