-
Notifications
You must be signed in to change notification settings - Fork 9
NLA III data processing
- Go to the folder containing the raw reads. Then run:
scmo_workflow.py nlaIII
This creates the Snakefile containing all the required steps and a config file.
-
Edit the config file (config.json), make sure to set the right path to a reference Fasta file and optionally pick a different mapper or bin sizes.
-
Execute the workflow locally by running
snakemake
or run on a cluster by using of the commands shown byscmo_workflow.py
.
The workflow code can be found here
To obtain a count table the following steps have to performed:
- Demultiplexing
- Adapter trimming
- Mapping
- Molecule assignment
- Count table generation
Demultiplexing adds UMI, cell, sample and sequencing index information to the header of the FastQ file. This information is later used to figure out which reads belong to the same molecule and to what cell every molecule belongs. The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied. The demultiplexer also trims the random primer from read 2 and stores it as meta-information, this information is later used for IVT deduplication.
Assume you are located in a directory with fastq files:
example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz
To then autodetect the right barcodes and demultiplex run:
demux.py *.gz --y -merge _
It can be useful to additionally supply -hd 1
, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1
but not supplying the --y
, this will perform a dry run on a small subset of reads.
This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:
demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads
removes residual primers, this improves mapping performance:
cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG" -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA" -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"
The trimmed fastq files can now be mapped using a mapper of your choice. In this example we will use BWA.If you are mapping to a SNP masked genome use Bowtie or STAR.
bwa mem -t 4 -M {your.reference} trimmed.R1.fastq.gz trimmed.R2.fastq.gz | samtools view -b - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam;
mv ./sorted.unfinished.bam ./sorted.bam;
samtools index ./sorted.bam; rm ./unsorted.bam;
The aim of this step is to, for every fragment in sorted.bam find NLA fragments and assign them to a molecule. This can later be used for deduplication. Fragments with the same NLA cut site, cell barcode, umi, library and strand are assumed to belong to the same molecule. The NLAIII cut site (CATG) is expected to be present at the start of R1. The cut site location is recorded into the DS tag. Sometimes the read 1 will be shifted a little, for example it will start with ATG, then the cut-site assigned will be the mapping start coordinate +1. When alleles are specified using -alleles, the molecule assignment is split up by allele, this means that if two fragments map to the same location and share the same umi, but contain SNPs which indicate differing alleles, the reads are not assigned to the same molecule. For every fragment the recorded NLAIII sequence is recorded into the RZ tag, (CATG, ATG, TG).
In order to get a good estimate of coverage we need to correct for the mappability of the molecules. Bamtagmultiome annotates how well a fragment can be mapped to the mp
tag. In order to estimate mappability an index is required which can be generated using createMappabilityIndex.py
bamtagmultiome.py sorted.bam -method nla -o ./tagged.bam -mapfile /references/ref.mappability.stats.bgzf
If you have phased variants available these can be supplied to add allele information to every molecule:
bamtagmultiome.py sorted.bam -method nla -o ./tagged.bam -alleles ./allelic_information.vcf.gz -mapfile /references/ref.mappability.stats.bgzf
The starting location of read 2 of every fragment is the result of random priming, this means that for a single molecule there can be multiple mapping locations for R2 which are indicative of multiple RT copies from the same RNA template. The random hexamer sequence is stored in the rS
tag and the random primer starting coordinate is stored in the rP
tag. Read Molecule-iteration for more information about accessing RT reactions and molecules in Python.
Read the Library-statistics-plots page for information about library quality statistics.
bamCopyNumber.py tagged.bam -bin_size 500_000 -rawmat raw_normalised_counts.csv -rawmatplot raw_counts.png -gcmat gc_corrected_counts.csv -gcmatplot gc_corrected_counts.png -ref /path/to/reference/file.fasta -threads 20
The locations of the NLA cut sites can be binned into bins of a given size. In this example 250kb The minMQ parameter defines what the lowest mapping quality for a molecule should be to be counted. The --filterXA makes filters reads which map to multiple locations (excludes alternative loci) The -bin parameter defines the size of every bin The -sampleTags parameter defines what row identifiers the resulting table should have. If you would like to know the associated barcode you would use "BC" instead of "SM". The --dedup flag makes sure every molecule is counted only once, if this flag is not specified every fragment in the BAM file with a DS tag (Which is a valid NLA site) will be counted.
bamToCountTable.py -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -binTag DS --dedup -joinedFeatureTags reference_name --filterMP
If you supplied a allele vcf file the count table can be generated using allele information by adding DA
to the joinedFeatureTags
parameter.
bamToCountTable.py -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -binTag DS --dedup -joinedFeatureTags reference_name,DA --filterMP
[The original code for determining the (allele specific) copy numbers used in Kester et al. can be found here](https://github.com/BuysDB/TumorEvolutionReconstruction), `bamToCountTable.py` is a more streamlined command line version of that code.
A sliding window based table can be generated by adding the -sliding flag, for example for windows offset by 50k use:
bamToCountTable.py -sliding 50_000 -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -binTag DS --dedup -joinedFeatureTags reference_name --filterMP
The generated BAM file(s) contain a read group for each cell and are directly compatible with at least BCFtools, FreeBayes and GATK.
Once variant calling is completed with one of those callers. Absence and presence of variants be determined by phasing with heterozygous germline variants. For this the script bamExtractVariants.py
can be used. This process requires a VCF file with germline variants.