Tandem repeat catalog from public long-read sequence assemblies

bcgsc/tr_catalog

A tandem repeat (TR) catalog generated from high-quality long-read human genome assemblies

This repository contains the analysis scripts used to generate the TR catalog from public diploid long-read human genome assemblies from the following data sources:

  1. Human Pangenome Reference Consortium (HPRC)
  2. Human Genome Structural Variation Consortium (HGSVC2)
  3. 1000G ONT Sequencing Consortium

Workflow

[Figure: workflow diagram]

Mapping of TRs from assemblies to the reference genome

Catalog

v1

  • haplotype names, separated by semicolons, are listed in the first header line, which is preceded by '#'
  • column descriptions:
Column           Description
chrom            chromosome
start            start coordinate
end              end coordinate
motif            consensus repeat motif
copy_numbers     copy numbers in haplotypes, separated by semicolons ('-' for missing genotypes)
sizes            sizes (bp) in haplotypes, separated by semicolons ('-' for missing genotypes)
motifs           motifs in haplotypes, separated by semicolons ('-' for missing genotypes)
max_change       maximum difference in size (bp) from the reference genome size, across all haplotypes
num_samples      number of samples with a genotype
num_calls        number of haplotypes with a genotype
motif_frequency  number of haplotypes associated with each observed motif, e.g. CAG(10);CAA(2)
feature          gene element overlapped. Format: gene|transcript|element, where element = exon#|intron#|utr5|utr3|cds|promoter|exon_bound (exon boundary)
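
As a rough illustration of how the columns described above might be read programmatically, here is a minimal Python sketch. It is not part of the repository's scripts. It assumes the catalog is tab-delimited and that the columns appear in the order listed above; neither is stated here, so check both against the actual file. The function names and the sample row are hypothetical and made up for the example.

```python
import re

# Assumed column order (taken from the column descriptions; the real file may differ).
COLUMNS = [
    "chrom", "start", "end", "motif", "copy_numbers", "sizes", "motifs",
    "max_change", "num_samples", "num_calls", "motif_frequency", "feature",
]
# Columns that hold one semicolon-separated value per haplotype.
PER_HAPLOTYPE = ("copy_numbers", "sizes", "motifs")


def parse_haplotype_header(line):
    """Return haplotype names from the first header line (prefixed with '#')."""
    return line.lstrip("#").strip().split(";")


def parse_motif_frequency(text):
    """Turn a string such as 'CAG(10);CAA(2)' into {'CAG': 10, 'CAA': 2}."""
    return {m: int(n) for m, n in re.findall(r"([^;()]+)\((\d+)\)", text)}


def parse_record(line, haplotypes):
    """Parse one catalog row (assumed tab-delimited) into a dict."""
    row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    row["start"], row["end"] = int(row["start"]), int(row["end"])
    for col in PER_HAPLOTYPE:
        values = row[col].split(";")
        # '-' marks a missing genotype; map it to None. Values stay as strings.
        row[col] = {h: (None if v == "-" else v) for h, v in zip(haplotypes, values)}
    row["motif_frequency"] = parse_motif_frequency(row["motif_frequency"])
    return row


# Made-up header and row, for illustration only.
haps = parse_haplotype_header("#sampleA.1;sampleA.2;sampleB.1")
rec = parse_record(
    "chr1\t100\t130\tCAG\t10;12;-\t30;36;-\tCAG;CAG;-\t6\t2\t2\tCAG(2)\tGENE1|TX1|exon1",
    haps,
)
print(rec["copy_numbers"])
```

Per-haplotype values are keyed by haplotype name, so a missing genotype ('-') shows up as None for that haplotype instead of shifting the positions of the others.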
