This repository contains the analysis scripts used to generate the tandem repeat (TR) catalog from public diploid long-read human genome assemblies from the following data sources:
- Human Pangenome Reference Consortium (HPRC)
- Human Genome Structural Variation Consortium (HGSVC2)
- 1000G ONT Sequencing Consortium
Mapping of TRs from assemblies to the reference genome
- haplotype names, separated by semi-colons, are shown in the first header line, which is preceded by '#'
- column descriptions:
Column | Description |
---|---|
chrom | chromosome |
start | start coordinate |
end | end coordinate |
motif | consensus repeat motif |
copy_numbers | copy numbers in haplotypes separated by semi-colons ('-' for missing genotypes) |
sizes | sizes (bp) in haplotypes separated by semi-colons ('-' for missing genotypes) |
motifs | motifs in haplotypes separated by semi-colons ('-' for missing genotypes) |
max_change | maximum change in size (bp) among all haplotypes, relative to the reference genome size |
num_samples | number of samples with genotype |
num_calls | number of haplotypes with genotype |
motif_frequency | number of haplotypes associated with each motif observed e.g. CAG(10);CAA(2) |
feature | gene element overlapped. Format: gene\|transcript\|element, where element = exon#\|intron#\|utr5\|utr3\|cds\|promoter\|exon_bound (exon boundary) |
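
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the repository's scripts. It assumes that the file is tab-delimited, that columns appear in the order listed in the table, that copy numbers may be fractional, and that sizes are integers; none of these points is stated above. The function names, haplotype names, and the example record are hypothetical and for illustration only.

```python
import re

# Assumed column order: the same order as the description table.
COLUMNS = [
    "chrom", "start", "end", "motif", "copy_numbers", "sizes", "motifs",
    "max_change", "num_samples", "num_calls", "motif_frequency", "feature",
]

def parse_haplotype_header(line):
    """Return haplotype names from the first header line ('#name1;name2;...')."""
    return line.lstrip("#").strip().split(";")

def split_per_haplotype(field, cast=str):
    """Split a semi-colon separated field; '-' (missing genotype) becomes None."""
    return [None if value == "-" else cast(value) for value in field.split(";")]

def parse_motif_frequency(field):
    """Turn 'CAG(10);CAA(2)' into {'CAG': 10, 'CAA': 2}."""
    return {motif: int(count)
            for motif, count in re.findall(r"([^;()]+)\((\d+)\)", field)}

def parse_record(line, haplotypes):
    """Parse one data line into a dict keyed by column name.

    Per-haplotype columns become dicts keyed by haplotype name.
    Assumes tab-delimited input (an assumption, see lead-in).
    """
    row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    row["start"], row["end"] = int(row["start"]), int(row["end"])
    row["copy_numbers"] = dict(
        zip(haplotypes, split_per_haplotype(row["copy_numbers"], float)))
    row["sizes"] = dict(zip(haplotypes, split_per_haplotype(row["sizes"], int)))
    row["motifs"] = dict(zip(haplotypes, split_per_haplotype(row["motifs"])))
    row["motif_frequency"] = parse_motif_frequency(row["motif_frequency"])
    return row

# Hypothetical header and record, made up to illustrate the format.
example_header = "#sampleA.h1;sampleA.h2;sampleB.h1"
example_record = (
    "chr1\t1000\t1030\tCAG\t10.0;-;12.0\t30;-;36\tCAG;-;CAA"
    "\t6\t2\t2\tCAG(1);CAA(1)\tGENE1|TX1|exon1"
)

haplotypes = parse_haplotype_header(example_header)
record = parse_record(example_record, haplotypes)
print(record["sizes"])
print(record["motif_frequency"])
```

Keying the per-haplotype columns by haplotype name keeps missing genotypes (`None`) aligned with the haplotype they belong to, which makes it easier to compare `copy_numbers`, `sizes`, and `motifs` for the same haplotype.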