This project was carried out in the Division of Psychological Medicine and Clinical Neurosciences (DPMCN). The workflow follows the Snakemake distribution and reproducibility recommendations.
A Snakemake pipeline for a genome-wide screen of allele-specific expression (ASE) using genotype and gene expression (GeX) data from 120 human brain bulk tissue samples.
Data
Papers for public data and software
- O'Brien et al. (2018)
- Miao et al. (2018) - ASElux
Ensembl links
Scripts
- move_and_rename_fastqs
  - Move 526 fastq files (sequenced across 3 sites: Cardiff, Edinburgh, Exeter) from neurocluster
  - Standardise fastq names
  - Due to differing naming conventions, a JSON mapping was used to populate sample names (see the sketch below)
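A minimal sketch of how the rename can be wired up in Snakemake, assuming a hypothetical JSON file that maps standardised sample IDs back to the original fastq names (the real mapping and paths live in the repo):

```python
import json

# Hypothetical mapping: standardised sample ID -> original fastq name.
with open("config/sample_name_map.json") as fh:
    NAME_MAP = json.load(fh)

rule move_and_rename_fastq:
    input:
        lambda wc: f"neurocluster/raw/{NAME_MAP[wc.sample]}"
    output:
        "fastq/raw/{sample}.fastq.gz"
    shell:
        "cp {input} {output}"
```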
- zip_fastqs
  - Some fastqs were gzipped and some were not; gzipped them all
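A one-rule sketch of the compression step (paths illustrative):

```python
# Gzip any fastq that arrived uncompressed.
rule zip_fastq:
    input:
        "fastq/raw/{sample}.fastq"
    output:
        "fastq/raw/{sample}.fastq.gz"
    shell:
        "gzip -c {input} > {output}"
```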
- merge_fastqs
  - Merge fastqs that were sequenced over multiple lanes into one file (output: an R1 and an R2 file for each sample)
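Gzipped fastqs can be concatenated directly, so the merge reduces to `cat`. A sketch with an assumed per-lane naming scheme:

```python
import glob

rule merge_fastqs:
    input:
        # Assumed lane naming: <sample>_L001_R1.fastq.gz etc.
        lambda wc: sorted(glob.glob(f"fastq/raw/{wc.sample}_L*_{wc.read}.fastq.gz"))
    output:
        "fastq/merged/{sample}_{read}.fastq.gz"
    shell:
        "cat {input} > {output}"
```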
- fastqc_pretrim
  - Initial fastq QC - check for adapters
- multiQC_pretrim
  - Collate the fastqc reports
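A sketch of the two QC rules, assuming a `SAMPLES` list is defined in the Snakefile:

```python
rule fastqc_pretrim:
    input:
        "fastq/merged/{sample}_{read}.fastq.gz"
    output:
        # FastQC strips the .fastq.gz suffix when naming its report
        "qc/fastqc_pretrim/{sample}_{read}_fastqc.html"
    shell:
        "fastqc {input} --outdir qc/fastqc_pretrim"

rule multiqc_pretrim:
    input:
        expand("qc/fastqc_pretrim/{sample}_{read}_fastqc.html",
               sample=SAMPLES, read=["R1", "R2"])  # SAMPLES assumed defined
    output:
        "qc/multiqc_pretrim/multiqc_report.html"
    shell:
        "multiqc qc/fastqc_pretrim --outdir qc/multiqc_pretrim"
```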
- trim_fastq
  - Remove sequencing adapters from reads and run fastqc on the trimmed reads
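A sketch of the Trim Galore call; `--fastqc` re-runs FastQC on the trimmed reads, and the `_val_1`/`_val_2` output names follow Trim Galore's paired-end convention:

```python
rule trim_fastq:
    input:
        r1="fastq/merged/{sample}_R1.fastq.gz",
        r2="fastq/merged/{sample}_R2.fastq.gz"
    output:
        "fastq/trimmed/{sample}_R1_val_1.fq.gz",
        "fastq/trimmed/{sample}_R2_val_2.fq.gz"
    shell:
        "trim_galore --paired --fastqc "
        "--output_dir fastq/trimmed {input.r1} {input.r2}"
```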
- hard_trim_fastq
  - ASElux requires reads of uniform length
  - Trim the RHS of reads to the following lengths: Cardiff, 55; Edinburgh, 105; Exeter, 55
- remove_short_reads
  - Remove reads that are shorter than the lengths specified above (see the sketch below, which covers both steps)
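One way to implement both steps with cutadapt (a stand-in for whatever the repo actually uses): `--length N` keeps only the first N bases of each read, and `--minimum-length N` then drops anything shorter. The sample-to-site lookup is assumed:

```python
READ_LEN = {"Cardiff": 55, "Edinburgh": 105, "Exeter": 55}

rule hard_trim_and_filter:
    input:
        r1="fastq/trimmed/{sample}_R1_val_1.fq.gz",
        r2="fastq/trimmed/{sample}_R2_val_2.fq.gz"
    output:
        r1="fastq/hard_trimmed/{sample}_R1.fq.gz",
        r2="fastq/hard_trimmed/{sample}_R2.fq.gz"
    params:
        # SITE_OF: dict mapping sample -> sequencing site, assumed defined elsewhere
        n=lambda wc: READ_LEN[SITE_OF[wc.sample]]
    shell:
        "cutadapt --length {params.n} --minimum-length {params.n} "
        "-o {output.r1} -p {output.r2} {input.r1} {input.r2}"
```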
- read_length_dist_post_QC_and_trimGalore
  - QC to confirm the read length distribution pre-hard trim
- get_read_length_dist_post_hard_and_SrtRead_trim
  - QC to confirm the read length distribution post-hard trim
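Both distribution checks reduce to counting sequence-line lengths; a sketch of one such rule:

```python
rule read_length_dist:
    input:
        "fastq/hard_trimmed/{sample}_{read}.fq.gz"
    output:
        "qc/read_lengths/{sample}_{read}.txt"
    shell:
        # Sequence lines are every 4th line starting at line 2;
        # print each length, then tally with sort | uniq -c.
        "zcat {input} | awk 'NR % 4 == 2 {{print length($0)}}' "
        "| sort -n | uniq -c > {output}"
```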
- build_static_index
  - Build a static index reference file for ASElux
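A sketch of the index build (flag names as in the ASElux documentation, so verify them against the installed version; file paths are placeholders):

```python
rule build_static_index:
    input:
        gtf="ref/annotation.gtf",
        fa="ref/genome.fa"
    output:
        "ref/aselux_index.annotation"  # output file name illustrative
    shell:
        "ASElux build --gtf {input.gtf} --ref {input.fa} --out ref/aselux_index"
```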
- extract_sample_genotypes
  - Extract genotype information for each individual donor at all SNPs
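A sketch using bcftools to pull one donor's genotypes out of a multi-sample VCF (paths illustrative):

```python
rule extract_sample_genotypes:
    input:
        "genotypes/all_samples.vcf.gz"
    output:
        "genotypes/per_sample/{sample}.vcf.gz"
    shell:
        # -s selects the single sample; -Oz writes compressed VCF
        "bcftools view -s {wildcards.sample} -Oz -o {output} {input}"
```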
- ase_align
  - Run ASElux to measure allele-specific expression in the 120 samples
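A sketch of the per-sample ASElux call (again, flag names as in the ASElux documentation, so verify them; the read length must match the site-specific hard-trim length, with the site lookup assumed as above):

```python
rule ase_align:
    input:
        fq1="fastq/hard_trimmed/{sample}_R1.fq.gz",
        fq2="fastq/hard_trimmed/{sample}_R2.fq.gz",
        vcf="genotypes/per_sample/{sample}.vcf.gz",
        index="ref/aselux_index.annotation"
    output:
        "ase/{sample}.ase.txt"  # output naming illustrative
    params:
        seq_len=lambda wc: READ_LEN[SITE_OF[wc.sample]]  # 55 or 105
    shell:
        "ASElux align --fq1 {input.fq1} --fq2 {input.fq2} "
        "--seqLength {params.seq_len} --index ref/aselux_index "
        "--vcf {input.vcf} --out ase/{wildcards.sample}"
```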
imprinting_annotation_local.smk
- ase_get_SNP_ref
  - Download the Ensembl SNP database file
- ase_get_GENE_ref
  - Download the Ensembl gene database file
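A download sketch; the Ensembl URLs are placeholders for the release/species files actually used:

```python
rule ase_get_SNP_ref:
    output:
        "ref/snps.vcf.gz"
    shell:
        "wget -O {output} <ensembl_snp_url>"

rule ase_get_GENE_ref:
    output:
        "ref/genes.gff3.gz"
    shell:
        "wget -O {output} <ensembl_gff3_url>"
```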
- ase_map_snps_to_genes_genomewide
  - Map SNPs with MAF >= 0.05 to genes genome-wide
  - Note: the gene labels used in column 3 of the gff file were `gene` and `ncRNA_gene`
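A sketch of pulling gene coordinates out of the gff, keeping only the two feature types noted above:

```python
rule gene_coords:
    input:
        "ref/genes.gff3.gz"
    output:
        "ref/gene_coords.tsv"
    shell:
        # Column 3 holds the feature type; keep chrom, start, end and the
        # attributes field for 'gene' and 'ncRNA_gene' records.
        "zcat {input} | "
        "awk -F '\\t' -v OFS='\\t' "
        "'$3 == \"gene\" || $3 == \"ncRNA_gene\" {{print $1, $4, $5, $9}}' "
        "> {output}"
```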
- ase_get_imprinted_gene_list
  - Generate the gene list for the ase_is_gene_imprinted step
- ase_cross_ref
  - Bottleneck: cross-reference ASE variants for the 120 samples with the 30K MAF >= 0.05 variants for ~30K genes
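A hypothetical sketch of one cross-referencing job (file layout and column names invented for illustration); this is the unit of work that gets fanned out per sample x gene:

```python
import pandas as pd

def cross_ref(ase_file, gene_snps_file, out_file):
    """Intersect one sample's ASE calls with one gene's MAF >= 0.05 SNPs."""
    ase = pd.read_csv(ase_file, sep="\t")         # per-SNP allele counts
    snps = pd.read_csv(gene_snps_file, sep="\t")  # rsIDs for one gene
    merged = ase.merge(snps, on="rsid", how="inner")
    merged.to_csv(out_file, sep="\t", index=False)
```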
- ase_is_gene_imprinted
  - Check whether a gene is consistent with genomic imprinting, defined as:
    - for each SNP, at least 90% of reads mapping to one of the two alleles in 80% of our heterozygotes
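A pure-Python sketch of that criterion, assuming per-SNP allele counts for each heterozygous sample have already been collected:

```python
def snp_consistent_with_imprinting(het_counts, allele_frac=0.90, sample_frac=0.80):
    """het_counts: list of (ref_reads, alt_reads) tuples, one per heterozygote.

    A heterozygote is 'monoallelic' if >= 90% of its reads come from one
    allele; the SNP supports imprinting if >= 80% of heterozygotes with
    coverage are monoallelic.
    """
    covered = [(r, a) for r, a in het_counts if r + a > 0]
    if not covered:
        return False
    mono = sum(1 for r, a in covered if max(r, a) / (r + a) >= allele_frac)
    return mono / len(covered) >= sample_frac
```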
Note: To improve performance, the last two steps were run on GPU. There is code to run them on either a standard node or a GPU node.
biomaRt method
The original idea (repo V1) was to measure ASE in our 120 samples across a list of 228 imprinted genes. The initial attempt used biomaRt, but there were issues with timeouts, and the method would not have been suitable for a genome-wide screen anyway.
- Get the chromosome and start/stop coordinates for all genes (189 genes left - note: some genes had multiple entries after this stage)
- Get all rsIDs with MAF >= 0.05 for each gene (169 genes left)
  - Only 180 genes were processed, as some genes were too large for the software - the SNPs for these can be retrieved manually
  - A further 11 genes had no SNPs within that range
- For SNPs with MAF >= 0.05, keep those with ASE reads in >= 1 sample (164 genes left - this step was needed to make the programming simpler)
- For SNPs with MAF >= 0.05, keep those with >= 20 ASE reads in >= 10 samples (103 genes left - note: that's 20 reads in total across both alleles)
Running the pipeline to deal with scratch quota limits
The pipeline has to be run in separate blocks to balance two competing constraints: the scratch file/memory quota limits, which are breached early on, and the huge number of small jobs later on.
During `fastq_prep.smk`, multiple versions of the fastq files can be generated at each stage. This results in the quota limit being breached and the pipeline choking at random points. To resolve this, the following measures were used:
- The Snakemake `temp()` function to delete old versions of fastqs as the pipeline moves through its stages
- Lowered the maximum number of jobs that the pipeline could run on Hawk at any one time from 500 to 50
- Gave later jobs in the process higher priority than earlier jobs
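A sketch of how `temp()` and `priority` combine in a rule (paths illustrative; `priority` defaults to 0, so later-stage rules get higher values). The job cap itself is set on the command line, e.g. `snakemake --jobs 50`:

```python
rule intermediate_fastq_step:
    input:
        "fastq/stage1/{sample}.fq.gz"
    output:
        # temp() deletes this file once every downstream consumer has run,
        # keeping the scratch quota under control
        temp("fastq/stage2/{sample}.fq.gz")
    priority: 10  # scheduled ahead of default-priority (0) jobs
    shell:
        "cp {input} {output}"
```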
At the later `annotation.smk` step for ASE and SNP cross-referencing, the genome-wide analysis needs to run 19K x 120 jobs; these take ~30-60 s each and require negligible resources. It is important to reinstate the 500-job limit to churn through them quickly.
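In practice that means invoking the two blocks with different caps, along the lines of `snakemake -s fastq_prep.smk --jobs 50` early on and `snakemake -s annotation.smk --jobs 500` for the cross-referencing block (an illustrative invocation; the repo's actual profiles and flags may differ).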
See the LICENCE file.