You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
$ curl -s https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz | gzip -dc | grep --regexp="\t[NU]\t" | grep "chr22\t"
chr22 1 10000 1 N 10000 telomere no na
chr22 10001 10510000 2 N 10500000 short_arm no na
chr22 10784644 10834643 5 N 50000 contig no na
chr22 10874573 10924572 7 N 50000 contig no na
chr22 10966725 11016724 9 N 50000 contig no na
chr22 11068988 11118987 12 N 50000 contig no na
chr22 11160922 11210921 14 N 50000 contig no na
chr22 11378057 11428056 16 N 50000 contig no na
chr22 11497338 11547337 19 N 50000 contig no na
chr22 11631289 11681288 23 N 50000 contig no na
chr22 11724630 11774629 25 N 50000 contig no na
chr22 11977556 12027555 27 N 50000 contig no na
chr22 12225589 12275588 29 N 50000 contig no na
chr22 12438691 12488690 31 N 50000 contig no na
chr22 12641731 12691730 33 N 50000 contig no na
chr22 12726205 12776204 35 N 50000 contig no na
chr22 12818138 12868137 37 N 50000 contig no na
chr22 12904789 12954788 39 N 50000 contig no na
chr22 12977326 12977425 41 N 100 contig no na
chr22 12986172 12994027 43 N 7856 scaffold yes paired-ends
chr22 13011654 13014130 45 N 2477 scaffold yes paired-ends
chr22 13021323 13021422 47 N 100 contig no na
chr22 13109445 13109544 49 N 100 contig no na
chr22 13163678 13163777 51 N 100 contig no na
chr22 13227313 13227412 53 N 100 contig no na
chr22 13248083 13248182 55 N 100 contig no na
chr22 13254853 13254952 57 N 100 contig no na
chr22 13258198 13258297 59 N 100 contig no na
chr22 13280859 13280958 61 N 100 contig no na
chr22 13285144 13285243 63 N 100 contig no na
chr22 14419455 14419554 65 N 100 contig no na
chr22 14419895 14419994 67 N 100 contig no na
chr22 14420335 14420434 69 N 100 contig no na
chr22 14421633 14421732 71 N 100 contig no na
chr22 15054319 15154318 73 N 100000 contig no na
chr22 16279673 16302843 99 N 23171 scaffold yes unspecified
chr22 16304297 16305427 101 N 1131 scaffold yes paired-ends
chr22 16307049 16307605 103 N 557 scaffold yes paired-ends
chr22 16310303 16310402 105 N 100 scaffold yes paired-ends
chr22 16313517 16314010 107 N 494 scaffold yes paired-ends
chr22 18239130 18339129 142 N 100000 contig no na
chr22 18433514 18483513 144 N 50000 contig no na
chr22 18659565 18709564 146 N 50000 contig no na
chr22 49973866 49975365 1119 N 1500 contig no na
chr22 50808469 50818468 1153 N 10000 telomere no na
Both representations of chromosome 22 produced a gap fingerprint identical to the AGP file above.
The biggest issue with this method is if there are no gaps in the underlying sequence, then this cannot be used to define a meaningful fingerprint. As said on the call the rise of full length assemblies and improvements in sequencing methods will mean there is limited value here but there is an argument that says this is useful.
I would suggest that if we take on this idea that it's an optional extension
The text was updated successfully, but these errors were encountered:
Thanks, @andrewyatz, for your investigations! I see you base the algorithm on a cutoff (10 bases) to determine whether a gap is to be considered consecutive. Although this produces the same result as the AGP file in your test case, at least theoretically it might fail in other cases. For such an extension, why couldn't one just make directly use of the AGP file instead of basing the calculations on the FASTA sequence?
I agree that this should be considered an optional extension. I raised the idea earlier without getting much feedback, if I recall correctly, so I assumed this would be out of scope of the standard. But I do think it really fits well together with other arrays we have been considering, such as alphabet and topology.
The reason I was trying to investigate this method was because if we can base another metric on the sequence content rather than another out of band file or data source there would be an algorithm capable of doing this for any sequence.
As an update I just redid this. Worryingly yes the gaps did start to differ. This is a shame. It seems though either my code is broken badly (very possible) or the edits in each of these sequences is too much to recreate the gaps found in the AGP file
Following on from the previous discussions today I spent some time writing up a bit of code to find gaps in a FASTA file. It's in Python so you don't need to worry about Perl. I ran this over chromosome 22 from UCSC and Ensembl. I also verified this against the [AGP file from UCSC](wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz).
I then ran the gap finder code on two files. One from UCSC which had to be pulled out using their
twoBitToFa
tool and the second from Ensembl's FTP site.Both representations of chromosome 22 produced a gap fingerprint identical to the AGP file above.
The biggest issue with this method is if there are no gaps in the underlying sequence, then this cannot be used to define a meaningful fingerprint. As said on the call the rise of full length assemblies and improvements in sequencing methods will mean there is limited value here but there is an argument that says this is useful.
I would suggest that if we take on this idea that it's an optional extension
The text was updated successfully, but these errors were encountered: