Basic cfDNA-seq analyses

Still under active revision and correction !

Basic cfDNA-seq analyses

1 Background

1.1 A fast introduction

Circulating free DNA (cfDNA), also known as cell-free DNA or plasma DNA, is a crucial component in the field of liquid biopsy, particularly in the context of cancer research and diagnostics. cfDNA refers to the small fragments of DNA that are released into the biofluids (e.g.: blood) by various cells in the body, including cancerous cells. These fragments can carry genetic information that reflects the genomic alterations present in the originating cells. In cancer patients' biofluid, cancer tissue-derived cfDNA is also known as circulating tumor DNA, ctDNA.

In the realm of liquid biopsy, cfDNA has emerged as a non-invasive and promising tool for detecting cancer onset, monitoring disease progression, assessing treatment responses, and predicting prognosis. Its accessibility through a simple blood draw makes it a valuable asset for clinicians and researchers alike.

Illustration	Features (Nongenetic)

1.2 Publication of cfDNA

A glimpse of representative publications:

Review
- 2019, Nat Rev | Genet, Current and future perspectives of liquid biopsies in genomics-driven oncology
- 2021, Science, Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies
- 2023, Cell Genom, Circulating DNA fragmentomics and cancer screening
Research article
- 2016, Cell, Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin
- 2018, Nature, Sensitive tumour detection and classification using plasma cell-free DNA methylomes
- 2018, STM, Enhanced detection of circulating tumor DNA by fragment size analysis
- 2019, Nature, Genome-wide cell-free DNA fragmentation in patients with cancer
- 2021, STM, Analysis of recurrently protected genomic regions in cell-free DNA found in urine
- 2021, NC, Tissue-specific cell-free DNA degradation quantifies circulating tumor DNA burden

more details and publications can be found at gitbook, cfDNA tencent table, or within reviews.

1.3 Variation/Feature/Alteration types

We only consider NGS data in this repo.

DNA-seq (e.g. WGS)

SNV*
CNV*
SV*
WPS**
NucleosomeFootprint**
FragSize**
OCF**
EndMotif**
MbCount***
MtCount***

DIP-seq (e.g. cfMeDIP, 5hmC-SEAL)

5mC/5hmC MethylationCount**

BS-seq (e.g.: WGBS, EM-seq, TAPS)

5mC/5hmC MethylationRatio**
other features in DNA-seq (theoritically)

DIP: DNA ImmunoPrecipation sequencing BS: Bisulfite-based sequencing DNA-seq: direct DNA sequencing EM-seq: Enzyme methylation sequencing TAPS: ...

SNV: Single Nucleotide Variation CNV: Copy Number Variation SV: Structural Variation WPS: Windown Protection Score OCF: Orientation-aware cfDNA fragmentation

*genomic level **epi-genomic level ***non-human

1.4 Resource

Here are some useful links/tools of cfDNA analysis:

tool	publication	data	feature
WPS	2016, Cell	WGS	WPS
ichorCNA	2017, Nature Communications	WGS,WES	CNV,TF
cfDNApipe	2021, Bioinformatics	WGS,WGBS	QC,SNV,FragSize,OCF,CNV,DMR,Deconv,Mb
CRAG	2022, Genome Medicine	WGS	IFS
Griffin	2022, Nature Communications	WGS	GC-corrected coverage
TritonNP	2023, Cancer Discovery	WGS	NucleosomePhasingScore,FragLengthEntropy
GEMINI	2023, Nature Genetics	WGS	SNV
cfDNAPro	R pkg: Bioconductor	WGS	QC

TF: Tumor Fraction (ctDNA/cfDNA) QC: cfDNA quality control cfDNA fragments size distribution with 10 bp periocity and 167 bp peak et al. DMR: Diff Methy Region Deconv: Methy-based deconvolution IFS: integrated fragmentation scores, calculated by weighted counts each sliding bin

2 Config global environment

2.1 Clone the repo

git clone https://github.com/hunnngry/cfDNA-NGS-analysis.git

2.2 Create conda global environment

cd cfDNA-NGS-analysis
# The default conda solver is a bit slow and sometimes has issues with selecting the special version packages. We recommend to install mamba as a drop-in replacement
## install mambaforge (mamba included)
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
export PATH=~/mambaforge/bin:$PATH # you can add this line to ~/.bashrc

# create the cfDNA base environment  
mamba env create --name cfDNA_base --file ./envs/cfDNA_base.yml 
source activate cfDNA_base

3 Run upstream pipeline

Notes

path of inhouse annotation dir (gencode v27): /BioII/lulab_b/baopengfei/projects/exOmics/DIP-seq/ref
path of inhouse bowtie2 index: /BioII/lulab_b/baopengfei/projects/exOmics/DIP-seq/genome/bowtie2-index
path of inhouse bwa-mem2 index: /BioII/lulab_b/baopengfei/projects/exOmics/DNA-seq/genome/bwa-mem2-index
whole pipeline for DIP-seq: ./DIP-seq/snakemake/DIP-seq-pe.snakemake
whole pipeline for BS-seq: ./BS-seq/snakemake/BS-seq-pe.snakemake
whole pipeline for DNA-seq: ./DNA-seq/snakemake/DNA-seq-pe.snakemake
other meta tables: https://docs.qq.com/sheet/DT0NYU1Zxa2dBSnlq?tab=f8m6qf

3.1 DNA-seq

For Paried-End WGS et al., the graph of pipeline (two samples) can be seen at ./DNA-seq/DAG

cd  ./DNA-seq
dst="test" # we show a dataset example named "test"

# prepare input fastq
ls ./data/${dst}/fastq/*.fastq.gz

# prepare config file 
vi ./config/${dst}.yaml

# prepare meta dir
vi ./data/${dst}/meta_data/sample_table.txt

# opt1: submit to slurm cluster
sbatch ./run/${dst}.sh

# opt2: run in local
snakemake --rerun-incomplete --keep-going --printshellcmds --reason \
  --use-conda --nolock --latency-wait 200 --restart-times 1 --jobs 16 \
	--snakefile snakemake/DNA-seq-pe.snakemake \
	--configfile config/${dst}.yaml \
	> log/${dst}/run-${dst}.log 2>&1

3.2 DIP-seq (need test)

For paired-End cfMeDIP et al., the graph of pipeline (two samples) can be seen at ./DIP-seq/DAG

cd  ./DIP-seq
dst="test" # we show a dataset example named "test"

# prepare config file
vi ./config/${dst}.yaml

# prepare meta dir
vi ./data/${dst}/meta_data/sample_table.txt

# opt1: submit to slurm cluster
sbatch ./run/${dst}.sh

# opt2: run in local
snakemake --rerun-incomplete --keep-going --printshellcmds --reason \
  --use-conda --nolock --latency-wait 200 --restart-times 1 --jobs 16 \
	--snakefile snakemake/DIP-seq-pe.snakemake \
	--configfile config/${dst}.yaml \
	> log/${dst}/run-${dst}.log 2>&1

3.3 BS-seq (need test)

For paired-End WGBS et al., the graph of pipeline (two samples) can be seen at ./BS-seq/DAG

cd  ./BS-seq
dst="test" # we show a dataset example named "test"

# prepare config file
vi ./config/${dst}.yaml

# prepare meta dir
vi ./data/${dst}/meta_data/sample_table.txt

# opt1: submit to slurm cluster
sbatch ./run/${dst}.sh

# opt2: run in local
snakemake --rerun-incomplete --keep-going --printshellcmds --reason \
  --use-conda --nolock --latency-wait 200 --restart-times 1 --jobs 16 \
	--snakefile snakemake/BS-seq-pe.snakemake \
	--configfile config/${dst}.yaml \
	> log/${dst}/run-${dst}.log 2>&1

4 Downstream analysis

Above upstream pipeline by snakemake will finish most basic analysis by in one command, in this part we will provide a series of personalized downstream analysis tutorial step-by-step.

4.0 Quality control

Option1: Ngs.plot: https://github.com/shenlab-sinai/ngsplot
Option2: deeptools: https://deeptools.readthedocs.io/en/develop

Check reads distribution around specific regions (gene body). We will test Ngs.plot (v2.61) as example

prepare

# config env described in Ngs.plot github
mamba env create --name ngsplot r-base python=2.7
source activate ngsplot

# download Ngs.plot (v2.61)
wget -c https://github.com/shenlab-sinai/ngsplot/archive/refs/tags/2.61.tar.gz -O ngsplot-2.61.tar.gz
mkdir -p ~/biosoft; tar xzvf ngsplot-2.61.tar.gz -C ~/biosoft

# config global env accordingly
export NGSPLOT=~/biosoft/ngsplot-2.61
export PATH=~/biosoft/ngsplot-2.61/bin:$PATH

#also install other R pkgs that quired 
R
> install.packages("doMC", dep=T)
> install.packages("caTools", dep=T)
> #install.packages("utils", dep=T)
> install.packages("BiocManager")
> BiocManager::install( "BSgenome" )
> #BiocManager::install( "Rsamtools" )
> BiocManager::install( "ShortRead" )

# install hg38 annotation
ngsplotdb.py list  # List installed genomes.
ngsplotdb.py install ngsplotdb_hg38_76_3.00.tar.gz  # Install reference genome from a package file.

# (optional: use existing database if fail to download due to db issue)
#cp /BioII/lulab_b/baopengfei/biosoft/ngsplot/ngsplotdb_hg38_76_3.00.tar.gz ~/biosoft/ngsplot-2.61/
#cp -r /BioII/lulab_b/baopengfei/biosoft/ngsplot/database/hg38 ~/biosoft/ngsplot-2.61/database/hg38

# (optional: you may want to merge&index input bam before next step)
#samtools merge -f merged.bam input_*.bam 
#samtools index merged.bam

# modify config file
vi config/test_ngsplot.txt
#DNA-seq/output/test/bam-sorted-deduped/CRC.bam  -1      "CRC-pool"
#DNA-seq/output/test/bam-sorted-deduped/NC.bam  -1      "NC-pool"

# plot single bam (without region filter, use default ensembl all anno)
mkdir -p test/ngsplot
ngs.plot.r -G hg38 -R genebody \
  -L 4000 \
  -LWD 5 \
  -YAS 0,0.2 \
  -C config/test_ngsplot.txt -O test/ngsplot

# check other available options
ngs.plot.r --help # https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101

e.g. output figure of cfMeDIP (with paired WGS) within output dir:

4.1 SNV

SNV includes single-base nucleotide variation, small insertion/deletion (INDEL<50nt) compared to reference genome. Tools

Somatic variant caller
- GATK Mutect2
- VarScan2
- Strelka
- LoFreq
- VariantDx
- SiNVICT (cfDNA, 2017, Bioinfomatics)
germline variant caller
- GATK HaplotypeCaller
variant annotation
- annovar: used with mafplot et al.
- vep (variantEffectPredictor): used with pyclone et al.

Notes:

priority: Mutect2 >= HaplotypeCaller (both have been used in cfDNA mutation variant calling).
in required reference files, "somatic-hg38_" is actually prefix of downloaded directory，such as somatic-hg38_1000g_pon.hg38.vcf.gz, which actually means germline variant, that used for filtering before/during final somatic variant determination.
GATK Mutect2 --germline-resource: optional germline VCF file that contains an AF (population allele frequency) INFO field. Used to get the frequency of a AF, thereby providing the prior probability that the sample carries the allele in the germline. The Broad Institute provides a version of gnomAD stripped of all fields except AF (af-only-gnomad.hg38.vcf.gz).
GATK Mutect2 --panel-of-normals: optional VCF file of normal samples that used as a reference to filter out common sequencing artifacts and germline variants. A panel of normals is simply a vcf of blacklisted sites flagged as recurrent artifacts. The Broad Institute also provides several panels of normals (e.g.: 1000g_pon.hg38.vcf.gz), but users with a large number (n>=50) may benefit from generating their own panel with CreateSomaticPanelOfNornals.

cd ./DNA-seq
#mamba env create --name DNA --file snakemake/envs/DNA.yml # run for the first test
source activate DNA

# 1.Mutect2
dst="test"
sample_id="NC" 
gn_fa_path="/BioII/lulab_b/baopengfei/shared_reference/hg38/genome.fa" # path of genome fasta
#java_opt="--java-options -Xmx10G" # optional java options that set up mem: gatk HaplotypeCaller --java-options -Xmx10G
tmp_dir="tmp"
input_path="output/${dst}/bam-sorted-deduped-RG/${sample_id}.bam" # input path containing bam file (from streamlined smk)
output_dir="output/${dst}_stepByStep" # output directory containing vcf files
mkdir -p ${output_dir}/log/${sample_id}

gnomad_path="/BioII/lulab_b/baopengfei/projects/exOmics/DNA-seq/ref/SNV_ref/somatic-hg38_af-only-gnomad.hg38.vcf.gz" # path of genomic gnomad vcf file: https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38
genome1k_path="/BioII/lulab_b/baopengfei/projects/exOmics/DNA-seq/ref/SNV_ref/somatic-hg38_1000g_pon.hg38.vcf.gz" # path of genomic genome1k vcf file: https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38

## 1.1.somatic mutation calling
gatk Mutect2 \
  -R ${gn_fa_path} \
  -I ${input_path} \
  -O ${output_dir}/mutect2-vcf/${sample_id}.vcf.gz \
  --germline-resource ${gnomad_path} \
  --panel-of-normals ${genome1k_path} \
  --native-pair-hmm-threads 10 \
  --tmp-dir ${tmp_dir} > ${output_dir}/log/${sample_id}/mutect2.log 2>&1

#~25min with 16-core for test sample

## 1.2.somatic mutation filtering
gatk FilterMutectCalls \
  -R ${gn_fa_path}  \
  -V ${output_dir}/mutect2-vcf/${sample_id}.vcf.gz \
  -O ${output_dir}/mutect2-vcf-filtered/${sample_id}.vcf.gz \
  -f-score-beta 1 \
  --tmp-dir ${tmp_dir} > ${output_dir}/log/${sample_id}/mutect2-filtering.log 2>&1

## 1.3.check output
zcat ${output_dir}/mutect2-vcf-filtered/${sample_id}.vcf.gz | less -S


# 2.HaplotypeCaller 
mkdir ${output_dir}/vcf{,-filtered}
## 2.1.germline mutation calling
gatk HaplotypeCaller \
  -R ${gn_fa_path} \
  -I ${input_path} \
  -O ${output_dir}/vcf/${sample_id}.vcf.gz \
  --tmp-dir ${tmp_dir} > ${output_dir}/log/${sample_id}/haplotypeCaller.log 2>&1

#~20min with 16-core for test sample

## 2.2.germline mutation filtering
gatk VariantFiltration \
  -R ${gn_fa_path} \
  -V ${output_dir}/vcf/${sample_id}.vcf.gz \
  -O ${output_dir}/vcf-filtered/${sample_id}.vcf.gz \
  -window 35 -cluster 3 \
  --filter-name FS20 -filter "FS > 20.0" \
  --filter-name QD2 -filter "QD < 2.0" \
  --filter-name DP10 -filter "DP < 10.0" \
  --filter-name QUAL20 -filter "QUAL < 20.0" \
  --tmp-dir ${tmp_dir} > ${output_dir}/log/${sample_id}/haplotypeCaller-filtering.log 2>&1

## 2.3.check output
zcat ${output_dir}/vcf-filtered/${sample_id}.vcf.gz | less -S


#need finish processing variant-calling above for all samples before the following annotation step!


# 3.mutation annotation 
#https://github.com/mskcc/vcf2maf
#cnode
source activate maf
#need install annovar first
dst="test"
# wd="/BioII/lulab_b/baopengfei/cooperation/yinjianhua"
software="/BioII/lulab_b/baopengfei/biosoft/annovar"
ref="/BioII/lulab_b/baopengfei/biosoft/annovar/humandb/hg38"
# input="${wd}/output/wes-mutect2-FilterMutectCalls"
# output="${wd}/output/wes-mutect2-FilterMutectCalls/annotation"
input_dir="output/${dst}_stepByStep/mutect2-vcf-filtered"
output_dir="output/${dst}_stepByStep/mutect2-vcf-filtered/annotation"
mkdir -p ${output_dir}

#only select SNP using gatk SelectVariants -select-type SNP -V wes.raw.vcf -O wes.snp.vcf

## 3.1.convert
cat data/${dst}/meta_data/sample_ids.txt | while read sample_id
do
  #sample_id="NC"
  echo ${sample_id}
  perl ${software}/convert2annovar.pl \
    -format vcf4 \
    -withfreq ${input_dir}/${sample_id}.vcf.gz \
    > ${output_dir}/${sample_id}.avinput
done

## 3.2.annotation
#Gene-based Annotation
#Region-based Annotation 
#Filter-based Annotation
cat data/${dst}/meta_data/sample_ids.txt | while read sample_id
do
  #sample_id="NC"
  echo ${sample_id}
  perl ${software}/table_annovar.pl \
    ${output_dir}/${sample_id}.avinput \
    -buildver hg38 -remove -operation g,r,r -nastring . \
    -protocol ensGene,cytoBand,genomicSuperDups \
    ${ref} \
    -out ${output_dir}/${sample_id}
done
#refGene 有注释无法下载好像
#-buildver hg19 表示使用hg19版本
#-out snpanno 表示输出文件的前缀为snpanno
#-remove 表示删除注释过程中的临时文件
#-protocol 表示注释使用的数据库，用逗号隔开，且要注意顺序
#-operation 表示对应顺序的数据库的类型（g代表gene-based、r代表region-based、f代表filter-based），用逗号隔开，注意顺序
#-nastring . 表示用点号替代缺省的值
#-csvout 表示最后输出.csv文件
#-thread 表示线程数，仅集群可以用
 
## 3.3.merge files of all samplesall samples
#在上一步每个_multianno.txt最后加上sampleID列并合并所有文件
cat data/${dst}/meta_data/sample_ids.txt | while read sample_id
do
  #sample_id="NC"
  echo ${sample_id}
  grep -v '^Chr' ${output_dir}/${sample_id}.hg38_multianno.txt | cut -f 1-12 | awk -v T=${sample_id} '{print $0"\t"T}' > ${output_dir}/${sample_id}.annovar.vcf
done
 
cat ${output_dir}/*.annovar.vcf > ${output_dir}/annovar_merge.vcf # concate all samples to one file
sed -i '1s/^/Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tGeneDetail.refGene\tExonicFunc.refGene\tAAChange.refGene\tcytoBand\tgenomicSuperDups\tTumor_Sample_Barcode\n/' ${output_dir}/annovar_merge.vcf # add header (13 cols)

# 3.visualization
Rscript ./scripts/maftools.R ${output_dir}/annovar_merge.vcf \
  /BioII/lulab_b/baopengfei/shared_reference/geneset/IntOGen-Drivers-20200213/Compendium_Cancer_Genes.tsv \
  COREAD

4.2 CNV

option1:

cd  ./DNA-seq

## bam to count matrix
featureCounts -T 4 -O -t gene -g gene_id -M -p  \
	-a ref/gtf/gene.gtf \
	-o raw_count_matrix \
	output/test/bam-sorted-deduped/*.bam \
	> raw_count_matrix.log 2>&1

/usr/bin/Rscript scripts/multiFeatureCounts2countMat.R   \
        -i raw_count_matrix  \
        -o output/test/matrix/count_matrix_gene.txt

## count matrix to CPM matrix
/usr/bin/Rscript  scripts/run-NormCountMat.R  \
	-i  output/test/matrix/count_matrix_gene.txt \
	-s  meta/test/sample_ids.txt \
	-m  TMM  \
	-o  output/test/matrix/CPM-TMM_matrix_gene.txt  \
  >  output/test/matrix/log/NormCountMat_gene.log 2>&1

## correct GC (manually)
vi scripts/correctGC.R

option2: CNVkit

REF="genome/bwa-mem2-index/genome.fa"
workdir="./DNA-seq"
cd $workdir

## no matched normal (CRC as eg.)
cnvkit.py batch \
    output/test/bam-sorted-deduped-RG/CRC*.bam \
    --normal \
    --method wgs \
    -p 10 \
    --fasta ${REF} \
    --output-reference output/test/cnvkit/my_reference.cnn \
    --output-dir output/test/cnvkit \
    --scatter

## has matched normal (CRC as eg.)
cnvkit.py batch \
    output/test/bam-sorted-deduped-RG/CRC*.bam \
    --normal  output/test/bam-sorted-deduped-RG/NC*.bam \
    --method wgs \
    -p 10 \
    --fasta ${REF} \
    --output-reference output/test/cnvkit/my_reference.cnn \
    --output-dir output/test/cnvkit \
    --scatter

Notes:

optionally choose to remove duplicated sequences (wgs automatically calculate from genome)
optionally choose to remove blacklist regions (--access, wgs automatically calculate from genome)
automatically correct GC content
If your sequencing protocol is WGS, then you don’t need a “target” BED file, but you probably do still want refFlat.txt.
If you have no normal samples to use for the reference, you can create a “flat” reference which assumes equal coverage in all bins by using the --normal/-n flag without specifying any additional BAM files
The GC content and repeat-masked fraction of each bin are calculated during generation of the reference from the user-supplied genome. The bias corrections are then performed in the reference and fix commands.
BAM file must be sorted.
hybrid capture protocols in which both on- and off-target reads can be used for copy number detection.
if don't have normal samples. and antitargets are not really helpfull in this situation
With no control samples (--normal), you can create a “flat” reference of neutral copy number
Increase the “target” average bin size (--target-avg-size), e.g. to at least 1000 bases for 30x coverage, or proportionally more for lower-coverage sequencing.
Specify a smaller p-value threshold (segment -t). For the CBS method, 1e-6 may work well. Or, try the hmm segmentation method.
you need determine binsize when generating reference
If your sequencing protocol is WGS, then you don’t need a “target” BED file, you still can use the “target” BED file to limit the analysis to just the genic regions. You can get such a BED file from the UCSC Genome Browser, for example, and you probably do still want refFlat.txt.
*target.bed file in wgs mode is identical to *access.bed "For germline samples, you can also consider bintest to find smaller alterations" seems we better use .call.cns或.cns (seems to lack ci-filtering, more left, which is more suitable for gene-level)
use log2 (Log2 mean coverage depth) not depth (Absolute-scale mean coverage depth) when mapping to gene

option3: GISTIC2

# GISTIC2
## test inhouse
workdir="./DNA-seq"
cd $workdir

## To convert CNVkit’s .cns files to SEG
#NUMEXPR_MAX_THREADS=4 
for i in `cat meta/test/sample_ids.txt |grep -v CRC`
do
cnvkit.py export seg \
 --enumerate-chroms \
 -o  output/test/cnvkit/${i}.cns.seg \
 output/test/cnvkit/${i}.cns
done

## run gistic2
cd ./DNA-seq
REF="genome/bwa-mem2-index/genome.fa"
mkdir -p  output/test/gistic2 

## merge all samples in a project
cat  output/test/cnvkit/*.cns.seg | grep -v chr > output/test/cnvkit/allsamples.cns.seg

## run gistic2
gistic2 \
 -b output/test/gistic2 \
 -seg   output/test/cnvkit/allsamples.cns.seg \
 -refgene /BioII/lulab_b/baopengfei/shared_reference/gistic2/hg38.UCSC.add_miR.160920.refgene.mat \
 -armpeel 1 \
 -brlen 0.7 \
 -cap 1.5 \
 -conf 0.99 \
 -ta 0.1 \
 -td 0.1 \
 -genegistic 1 \
 -gcm extreme \
 -js 4 \
 -maxseg 2000 \
 -qvt 0.25 \
 -rx 0 \
 -savegene 1

Notes:

need config your env according to tutorial when the first installment (wechat search gistic2)
focal_data_by_genes.txt output data is similar to the all_data_by_genes.txt output, but using only focal events with lengths greater than the focal length cutoff. This data is a gene-level table of copy number values for all samples.
The returned copy number values are in units (copy number - 2) so that no amplification or deletion is 0, genes with amplifications have positive values

4.3 SV

mavis，STAR-fusion and Manta within snakemake file should both be OK

# run mavis 
/BioII/lulab_b/baopengfei/projects/multi-omics-explore/scripts/mavis-fusion.sh
# run STAR-fusion
/BioII/lulab_b/baopengfei/projects/multi-omics-explore/scripts/star-fusion.sh
# run Manta
#see DNA snakemake file fule: prepareMantaConfig and runMantan

# get matrix & plot (eg)
#see/BioII/lulab_b/baopengfei/projects/multi-omics-explore/scripts/SV.r

4.4 WPS

(organized by Liyu) Notes

the number of fragments spanning the window minus those with an endpoint within the window
footprints of protein-DNA interactions(such as nucleosome positioning or TF binding)
the windowed protection score (WPS) of a window of size k as the number of molecules spanning the window minus those with an endpoint within the window. We assign the determined WPS to the center of the window. For 35–80 bp fragments (short fraction, S WPS), k = 16; for 120–180 bp fragments (long fraction, L-WPS), k = 120.

Steps

cd DNA-seq/scripts/WPS
dst="test" 
snakemake --rerun-incomplete --keep-going --printshellcmds --reason --use-conda --nolock --latency-wait 80 --restart-times 1 --jobs 100 \
--snakefile snakefile_WPS.smk \
--configfile config/${dst}/config_small.yml \
--cluster-config config/cluster-lsf.json \
--cluster "bsub -n {cluster.threads} -J {cluster.jobname} -q {cluster.queue} -o {cluster.output} -e {cluster.error} " > log/${dst}/run-${dst}.log 2>&1

Input

configured by the user (samples.tsv):
- analysis ID
- samples
- path to sample .bam files
- reference samples fro plotting
- genome build per sample
configured by the user (regions.tsv):
- bed file containing regions of interest (e.g. TFBS), all having the same length

e.g.:

Output

table containing bp specific WPS for regions listed in bed
line plot showing normalized WPS of multiple samples

e.g.:

4.5 NucleosomeFootprint

(organized by Liyu) Calculate (relative) coverage depth at NDR (nucleosome-depleted region) in genes

coverage depth of 3 region types can be used as gene-centric measures
- genebody
- -150TSS+50,
- -300-100exon1end
relative read depth in a specific region as measurement (TPM/RPKM might also work), reflect this gene’s nuleosome structure (more read depth, more compact nucleosome structure, less gene RNA expression)
clinical usage:
- early detection
- monitoring (0.5x)
- TOO

4.6 FragSize

mainly adapted from 2019,Nat,Genome-wide cell-free DNA fragmentation in patients with cancer

# get bam frag
./DNA-seq/scripts/frag/get-FragmentSizeBam.sh 

# get ratio
./DNA-seq/scripts/frag/get-FragmentSizeDepth.sh 

# join to matrix (eg)
./DNA-seq/scripts/frag/frag_size.R

4.7 EndMotif

# count motif freq.
cd  ./DNA-seq
mkdir output/test/motif5
for i in `cat meta/test/sample_ids.txt`
do
echo "start $i at `date`"
python /BioII/lulab_b/baopengfei/gitsoft/cfdna-main/cfdna_practical.py motif-analysis -b output/test/04bam_dedup/${i}.bam -s  False -m 4 | head -n1 | cut -d "{" -f 4 | cut -d "}" -f 1 | tr "," "\n" | tr -d " " > output/test/motif5/${i}_motif5.txt
done

# join outfile to matrix
#see ./DNA-seq/scripts/motif/sum-motif.Rmd

4.8 Microbe

# count motif freq.
cd  ./DNA-seq
mkdir output/test/motif5
for i in `cat meta/test/sample_ids.txt`
do
echo "start $i at `date`"
python /BioII/lulab_b/baopengfei/gitsoft/cfdna-main/cfdna_practical.py motif-analysis -b output/test/04bam_dedup/${i}.bam -s  False -m 4 | head -n1 | cut -d "{" -f 4 | cut -d "}" -f 1 | tr "," "\n" | tr -d " " > output/test/motif5/${i}_motif5.txt
done

# join outfile to matrix
#see ./DNA-seq/scripts/motif/sum-motif.Rmd

5 Appendix

5.1 V-plot

#TODO fragment/insertion size length distribution: basic quality control metrics: cfDNA WGS nucleosome footprint cfDNA WGS SV cfDNA WGS CNV cfDNA WGS vplot

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
BS-seq		BS-seq
DIP-seq		DIP-seq
DNA-seq		DNA-seq
_plots		_plots
config		config
envs		envs
README.md		README.md
multiomics-analysis.md		multiomics-analysis.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Basic cfDNA-seq analyses

1 Background

1.1 A fast introduction

1.2 Publication of cfDNA

1.3 Variation/Feature/Alteration types

DNA-seq (e.g. WGS)

DIP-seq (e.g. cfMeDIP, 5hmC-SEAL)

BS-seq (e.g.: WGBS, EM-seq, TAPS)

1.4 Resource

2 Config global environment

2.1 Clone the repo

2.2 Create conda global environment

3 Run upstream pipeline

3.1 DNA-seq

3.2 DIP-seq (need test)

3.3 BS-seq (need test)

4 Downstream analysis

4.0 Quality control

prepare

4.1 SNV

4.2 CNV

4.3 SV

4.4 WPS

4.5 NucleosomeFootprint

4.6 FragSize

4.7 EndMotif

4.8 Microbe

5 Appendix

5.1 V-plot

About

Releases

Packages

Languages

HUNNNGRY/cfDNA-NGS-analysis

Folders and files

Latest commit

History

Repository files navigation

Basic cfDNA-seq analyses

1 Background

1.1 A fast introduction

1.2 Publication of cfDNA

1.3 Variation/Feature/Alteration types

DNA-seq (e.g. WGS)

DIP-seq (e.g. cfMeDIP, 5hmC-SEAL)

BS-seq (e.g.: WGBS, EM-seq, TAPS)

1.4 Resource

2 Config global environment

2.1 Clone the repo

2.2 Create conda global environment

3 Run upstream pipeline

3.1 DNA-seq

3.2 DIP-seq (need test)

3.3 BS-seq (need test)

4 Downstream analysis

4.0 Quality control

prepare

4.1 SNV

4.2 CNV

4.3 SV

4.4 WPS

4.5 NucleosomeFootprint

4.6 FragSize

4.7 EndMotif

4.8 Microbe

5 Appendix

5.1 V-plot

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages