Genomes_project

Statistical analysis of annotated genomes

Goals and objectives

Find correlations between genomic features (such as SNPs, methylation, and TFBSs) and functional genomic regions in different genomes

  1. Plot sequence features such as TFBSs, SNPs, methylation, and RNA-seq coverage
  2. Map them onto functional genomic regions
  3. Find correlations and check their reproducibility across different genomes
  4. Consider annotation quality and its consequences for predicting functional features (like promoters) in unannotated genomes

Data:

Graphs for Oryza sativa [1]

Arabidopsis thaliana

  1. reference genome TAIR10_toplevel (ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/dna/)
  2. annotation TAIR10_GFF3_genes.gff3
  3. variation: .vcf file from 1001 Genomes (TAIR)
  4. methylation data

Medicago truncatula

  1. annotation (.gff) and assembly (.fasta) from http://www.medicagogenome.org/downloads
  2. SNP files also from http://www.medicagogenome.org/downloads

Homo sapiens

  1. annotation Release 28 (GRCh38.p12) (CHR) in .gff3 format
  2. .fasta of primary assembly (PRI)

Mus musculus

  1. annotation Release M17 (GRCm38.p6) (CHR) in .gff3 format
  2. .fasta of primary assembly (PRI)

Felis catus

  1. annotation assembly Felis_catus_9.0 in .gff format (ID 78)
  2. .fasta of assembly 9.0 (ID 78)

Drosophila melanogaster

  1. reference assembly dmel_r5.57_FB2014_03 from FlyBase, dmel-all-chromosome-r5.57.fasta.gz
  2. annotation dmel_r5.57_FB2014_03 dmel-all-filtered-r5.57.gff.gz
  3. variation data for all populations, downloaded per chromosome as a single .vcf file from the PopFly browser: Hervas S, Sanz E, Casillas S, Pool JE, and Barbadilla A (2017) PopFly: the Drosophila population genomics browser. Bioinformatics, 33, 2779-2780

Danio rerio

Scripts for data preprocessing:

  1. get_ATGs.py
  2. get_4tss.py
  3. get_4tts.py
  4. get_promoters.py
  5. get_fin_anno.py

Data preprocessing:

  1. to create a file with ATGs: python3 get_ATGs.py annotation.gff
  2. to create a file with TSSs: python3 get_4tss.py annotation.gff
  3. to create files with promoter regions (.bed + .txt): python3 get_promoters.py 4tss.txt
  4. to obtain promoter region sequences, first rename the chromosomes in the reference FASTA so that they match the names in the .bed file: sed 's/^>1.*$/>Chr1/' Arabidopsis_thaliana.TAIR10.dna.toplevel.fa | sed 's/^>2.*$/>Chr2/' | sed 's/^>3.*$/>Chr3/' | sed 's/^>4.*$/>Chr4/' | sed 's/^>5.*$/>Chr5/' | sed 's/^>Mt.*$/>ChrM/' | sed 's/^>Pt.*$/>ChrC/' > new_ref.fa and then extract the sequences from the renamed reference: bedtools getfasta -fi new_ref.fa -bed promoters.bed -name -s -fo promoters_sequences.fasta
  5. to create fin_anno: python3 get_fin_anno.py annotation.gff
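
The chromosome renaming in step 4 can also be done in Python, which may be easier to adapt to other genomes than a chain of sed calls. The following is only a minimal sketch and not part of this repository: the names CHROM_NAMES, rename_header and rename_fasta are hypothetical, and the mapping simply copies the one used in the sed command. Unlike the sed patterns, which match any header that starts with the given prefix, this sketch matches the first word of the header exactly.

```python
# Hypothetical helper, not one of the repository scripts.
# Mapping copied from the sed chain: Ensembl TAIR10 names -> names used in the .bed file.
CHROM_NAMES = {
    "1": "Chr1", "2": "Chr2", "3": "Chr3", "4": "Chr4", "5": "Chr5",
    "Mt": "ChrM", "Pt": "ChrC",
}


def rename_header(line, mapping=CHROM_NAMES):
    """Rewrite a FASTA header line according to mapping; return other lines unchanged."""
    if not line.startswith(">"):
        return line
    fields = line[1:].split()
    name = fields[0] if fields else ""
    if name not in mapping:
        return line
    newline = "\n" if line.endswith("\n") else ""
    # Like the sed commands, drop the rest of the header and keep only the new name.
    return ">" + mapping[name] + newline


def rename_fasta(src, dst, mapping=CHROM_NAMES):
    """Stream src to dst line by line, renaming headers on the way."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(rename_header(line, mapping))
```

Under these assumptions, rename_fasta("Arabidopsis_thaliana.TAIR10.dna.toplevel.fa", "new_ref.fa") would produce the file that bedtools getfasta reads in step 4.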

Plots visualization:

  1. the first (and most important) file is snp_custom_annotation.r, which contains a function that creates a custom annotation of SNPs; all other scripts use this function
  2. ATG_plot.r visualizes the SNP distribution around the start codon (required packages: dplyr, scales)
  3. intron_exon_junctions.r visualizes the SNP distribution around exon-intron boundaries
  4. promoter-terminator.r visualizes the SNP distribution around the terminator
  5. transcr_stop_plot.r visualizes the SNP distribution around the transcription stop codon
  6. transfac.r visualizes the distribution of TFBSs in the promoter region (+-500 nucleotides around the TSS)
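
The idea shared by these plots, counting features by their position relative to an anchor such as the ATG or the TSS, can be illustrated with a short sketch. It is written in Python for illustration only and is not a translation of the R scripts: the function names, the input tuples and the default window of 500 nucleotides (borrowed from the TFBS plot) are all assumptions. Offsets are flipped for genes on the minus strand so that negative values always mean upstream.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def relative_position(snp_pos, anchor_pos, strand):
    """Offset of a SNP from an anchor, oriented so that negative means upstream."""
    offset = snp_pos - anchor_pos
    return offset if strand == "+" else -offset


def snp_profile(snps, anchors, window=500):
    """Count SNPs at each relative offset within +-window of every anchor.

    snps:    iterable of (chrom, pos) tuples       (hypothetical input format)
    anchors: iterable of (chrom, pos, strand) tuples, e.g. ATG or TSS positions
    """
    # Group SNP positions by chromosome and sort them for binary search.
    by_chrom = {}
    for chrom, pos in snps:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    counts = Counter()
    for chrom, anchor, strand in anchors:
        positions = by_chrom.get(chrom, [])
        lo = bisect_left(positions, anchor - window)
        hi = bisect_right(positions, anchor + window)
        for pos in positions[lo:hi]:
            counts[relative_position(pos, anchor, strand)] += 1
    return counts
```

The resulting counter maps each relative offset to a SNP count and could be plotted as a histogram. Sorting once and using binary search avoids scanning every SNP for every anchor.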

Several results:

  1. Arabidopsis thaliana

  2. Medicago truncatula
