Fingerprinting sequences using gaps #29

andrewyatz · 2022-03-09T17:06:49Z

Following on from the previous discussions today I spent some time writing up a bit of code to find gaps in a FASTA file. It's in Python so you don't need to worry about Perl. I ran this over chromosome 22 from UCSC and Ensembl. I also verified this against the [AGP file from UCSC](wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz).

$ curl -s https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz  | gzip -dc | grep --regexp="\t[NU]\t" | grep "chr22\t"
chr22	1	10000	1	N	10000	telomere	no	na
chr22	10001	10510000	2	N	10500000	short_arm	no	na
chr22	10784644	10834643	5	N	50000	contig	no	na
chr22	10874573	10924572	7	N	50000	contig	no	na
chr22	10966725	11016724	9	N	50000	contig	no	na
chr22	11068988	11118987	12	N	50000	contig	no	na
chr22	11160922	11210921	14	N	50000	contig	no	na
chr22	11378057	11428056	16	N	50000	contig	no	na
chr22	11497338	11547337	19	N	50000	contig	no	na
chr22	11631289	11681288	23	N	50000	contig	no	na
chr22	11724630	11774629	25	N	50000	contig	no	na
chr22	11977556	12027555	27	N	50000	contig	no	na
chr22	12225589	12275588	29	N	50000	contig	no	na
chr22	12438691	12488690	31	N	50000	contig	no	na
chr22	12641731	12691730	33	N	50000	contig	no	na
chr22	12726205	12776204	35	N	50000	contig	no	na
chr22	12818138	12868137	37	N	50000	contig	no	na
chr22	12904789	12954788	39	N	50000	contig	no	na
chr22	12977326	12977425	41	N	100	contig	no	na
chr22	12986172	12994027	43	N	7856	scaffold	yes	paired-ends
chr22	13011654	13014130	45	N	2477	scaffold	yes	paired-ends
chr22	13021323	13021422	47	N	100	contig	no	na
chr22	13109445	13109544	49	N	100	contig	no	na
chr22	13163678	13163777	51	N	100	contig	no	na
chr22	13227313	13227412	53	N	100	contig	no	na
chr22	13248083	13248182	55	N	100	contig	no	na
chr22	13254853	13254952	57	N	100	contig	no	na
chr22	13258198	13258297	59	N	100	contig	no	na
chr22	13280859	13280958	61	N	100	contig	no	na
chr22	13285144	13285243	63	N	100	contig	no	na
chr22	14419455	14419554	65	N	100	contig	no	na
chr22	14419895	14419994	67	N	100	contig	no	na
chr22	14420335	14420434	69	N	100	contig	no	na
chr22	14421633	14421732	71	N	100	contig	no	na
chr22	15054319	15154318	73	N	100000	contig	no	na
chr22	16279673	16302843	99	N	23171	scaffold	yes	unspecified
chr22	16304297	16305427	101	N	1131	scaffold	yes	paired-ends
chr22	16307049	16307605	103	N	557	scaffold	yes	paired-ends
chr22	16310303	16310402	105	N	100	scaffold	yes	paired-ends
chr22	16313517	16314010	107	N	494	scaffold	yes	paired-ends
chr22	18239130	18339129	142	N	100000	contig	no	na
chr22	18433514	18483513	144	N	50000	contig	no	na
chr22	18659565	18709564	146	N	50000	contig	no	na
chr22	49973866	49975365	1119	N	1500	contig	no	na
chr22	50808469	50818468	1153	N	10000	telomere	no	na

I then ran the gap finder code on two files. One from UCSC which had to be pulled out using their twoBitToFa tool and the second from Ensembl's FTP site.

Both representations of chromosome 22 produced a gap fingerprint identical to the AGP file above.

The biggest issue with this method is if there are no gaps in the underlying sequence, then this cannot be used to define a meaningful fingerprint. As said on the call the rise of full length assemblies and improvements in sequencing methods will mean there is limited value here but there is an argument that says this is useful.

I would suggest that if we take on this idea that it's an optional extension

The text was updated successfully, but these errors were encountered:

sveinugu · 2022-03-10T15:15:13Z

Thanks, @andrewyatz, for your investigations! I see you base the algorithm on a cutoff (10 bases) to determine whether a gap is to be considered consecutive. Although this produces the same result as the AGP file in your test case, at least theoretically it might fail in other cases. For such an extension, why couldn't one just make directly use of the AGP file instead of basing the calculations on the FASTA sequence?

I agree that this should be considered an optional extension. I raised the idea earlier without getting much feedback, if I recall correctly, so I assumed this would be out of scope of the standard. But I do think it really fits well together with other arrays we have been considering, such as alphabet and topology.

andrewyatz · 2022-03-10T15:59:55Z

The reason I was trying to investigate this method was because if we can base another metric on the sequence content rather than another out of band file or data source there would be an algorithm capable of doing this for any sequence.

As an update I just redid this. Worryingly yes the gaps did start to differ. This is a shame. It seems though either my code is broken badly (very possible) or the edits in each of these sequences is too much to recreate the gaps found in the AGP file

nsheff added the enhancement New feature or request label Feb 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fingerprinting sequences using gaps #29

Fingerprinting sequences using gaps #29

andrewyatz commented Mar 9, 2022

sveinugu commented Mar 10, 2022

andrewyatz commented Mar 10, 2022

Fingerprinting sequences using gaps #29

Fingerprinting sequences using gaps #29

Comments

andrewyatz commented Mar 9, 2022

sveinugu commented Mar 10, 2022

andrewyatz commented Mar 10, 2022