Skip to content

TAPS data processing

Buys de Barbanson edited this page Aug 17, 2021 · 12 revisions

TAPS

Add paper and protocol information

Data processing

Install package

As described on the main page, install singlecellmultiomics package.

git clone https://github.com/BuysDB/SingleCellMultiOmics
pip install ./SingleCellMultiOmics

Demultiplexing

Demultiplexing adds UMI, cell, sample, and sequencing index information to the header of the FastQ file.

The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied.

Assume you are located in a directory with fastq files:

example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz

To then autodetect the right barcodes and demultiplex run:

demux.py *.gz --y -merge _

It can be useful to additionally supply -hd 1, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1 but not supplying the --y, this will perform a dry run on a small subset of reads.

This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:

demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads

Adapter trimming

removes residual primers, this improves mapping performance:

cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG"  -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a  "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA"  -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"

Mapping

Use your mapper of preference to map the reads. Make sure to use a reference which includes spike-ins.

Example using BWA mem to map the reads:

bwa mem -t 4 -M {your.reference} trimmed.R1.fastq.gz trimmed.R2.fastq.gz | samtools view -b - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam; 
mv ./sorted.unfinished.bam ./sorted.bam; 
samtools index ./sorted.bam; rm ./unsorted.bam;```

Obtain conversion rate

Some of the methylated bases will not be converted by TAPs. The efficiency of the TAPs conversion can be estimated using a fully methylated spike-in. The resulting conversion efficiency can then be estimated:

estimateTapsConversionEfficiency.py ./sorted.bam -o conv_efficiency -ref [reference_fasta_path] -method nla

conv_efficiency is prefix of the output files.

conv_efficiency_conversions_lambda.csv First column is the library, second is cell index, third is the amount of mismatches to the reference excluding C's, the other columns contains the estimated conversion efficiency for each CpG motif.

Extract methylation calls:

tapsTabulator.py sorted.bam -min_phred_score 20 -ref [reference_fasta_path] | gzip > calls.tsv.gz

Create bigwig files of CpG methylation:

Bigwig files allow for visualisation of the methylation levels in a genome browser and allow for quick access of various statistics.

Create bulk track, 400bp bins: (20 threads)

bam_to_methylation_bw.py sorted.bam -o BULK_bw -method nla -t 20 -bin_size 400

Psuedobulk

create a CSV file with two columns, the first is the barcode index (BI tag) and the second the pseudobulk you want to assign to, for example:

3,cluster_A
4,cluster_A
5,cluster_A
6,cluster_A
7,cluster_B
8,cluster_B
9,cluster_B
10,cluster_B
11,cluster_B

(Stored in my.csv)

bam_to_methylation_bw.py sorted.bam -o PSEUDO_bw -pseudobulk_BI_csv my.csv -method nla -t 20

This generates beta big wigs for each pseudobulk (cluster_A, and cluster_B).

Single cell bigwig

bam_to_methylation_bw.py sorted.bam --single_sample -o SINGLE_bw -method nla -t 20