Sequence collection, ordered? or unordered? #5
Reading the algorithm, it's clear that the intention is that the set is in fact sorted, according to step 2 in the algorithm:
So it should not be controversial to add the word "ordered" to the definition of "Sequence collection":
Should perhaps become:
|
Yes, you are right. Thanks, I've made the first two changes. However, I'm not sure about your last point -- in fact can a sequence collection have a repeated sequence with a different name? So perhaps we should be using the word 'list' instead of 'set' altogether. Also, I am not sure what you mean by sequence identity is defined by their digest and length -- I would have said that sequence identity is defined by the digest alone. |
hmmm. I guess there are two ways to look at this: I have an ordered set of named sequences, and you have a second ordered set of named sequences, the sequences are the same, but we are using different names (think hg19 versus b37)
in case 1) the refcol digest should only include the ordered digests of the sequences themselves, and in case 2) it should also include the names of the sequences. I agree that the length of sequence seems to be irrelevant, but I kept it since the original algorithm has seq-digest, length & name, and I just wanted to drop the name part. |
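The two cases above can be sketched in a few lines. This is only an illustration: the truncated SHA-512 helper, the `:`/`,` serialisation, and the function name `collection_digest` are assumptions made for the example, not the algorithm under discussion.

```python
import base64
import hashlib


def digest(text: str) -> str:
    """Truncated SHA-512, base64url-encoded (an assumed stand-in for the real digest)."""
    raw = hashlib.sha512(text.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def collection_digest(rows, include_names: bool) -> str:
    """rows: ordered list of (name, sequence_digest) pairs (hypothetical layout)."""
    if include_names:
        parts = [f"{name}:{seq}" for name, seq in rows]  # case 2: names count
    else:
        parts = [seq for _, seq in rows]  # case 1: sequence digests only
    # The list is joined as given and never sorted, so order affects the result.
    return digest(",".join(parts))


# Same sequences, different naming conventions (placeholder sequence digests).
hg19_style = [("chr1", "seqdigestA"), ("chr2", "seqdigestB")]
b37_style = [("1", "seqdigestA"), ("2", "seqdigestB")]

same_without_names = collection_digest(hg19_style, False) == collection_digest(b37_style, False)
same_with_names = collection_digest(hg19_style, True) == collection_digest(b37_style, True)
```

With names excluded the two collections get the same digest; with names included they do not.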
I see -- this is going in the direction of compatibility tests. but case 1) can also be solved by a digest that includes the name of the sequences, and then just ignoring the names. I just posted another issue for discussion about how to actually encode these compatibility tests, and you can see how I was envisioning it: #7 There would only be 1 digest, which would be looked up, and then there's a function that operates on that digest to yield the answers to the questions you pose (and more). So, I think there's nothing in the system preventing you from creating a collection that included 2 of the same sequence, with different names. It would all work. |
I think I see what you're saying -- you're suggesting that there actually be stored multiple digests for a sequence collection in order to answer different compatibility questions. I'd like to hear your feedback on #7, where I proposed a way to be able to answer all the same questions, but without having to rely on multiple digests. |
Just realized that my posts about ordering in issue #7 (comment) should have been posted in the current issue. |
Here is an attempt to write up the different solutions to the ordering issue that have been discussed in the latest meetings. First, I would like to try to define exactly what the core issues are.
1a. Should a sequence collection digest refer to an ordered list of sequences or not? As changing the order of the sequences matters in some scenarios (which? read mapping?), it seems we agree that we need to support digests for ordered lists of sequences. Also, it seems that such a digest would form the canonical digest for e.g. a particular reference genome. However, if we have not already done so, I believe we should document the arguments for this clearly.
1b. The question could then be reformulated: Should we support digests for unordered lists of sequences in addition to the ordered ones? And if so, at which layer should we support such digests? (referring here to the layers suggested in #10) Should digests of unordered arrays be supported on layer 0 only? A use case of this is to easily compare two seqcols that have equal content except for the order of the sequences, as exemplified in e.g. #7 (comment). Furthermore, should we in addition support a top-level digest for unordered arrays? This would e.g. allow the calculation of a digest that represents the concept of a coordinate system (
i. A UCSC-ordered "chromlength" seqcol digest will only match ordered "official" reference genome digests if the latter are lexicographically sorted.
ii. As a BED file can follow any ordering of chromosomes and since the UCSC Genome Browser accepts BED files, an identifier (seqcol digest) to be used instead of the current "genome assembly" names should really conceptually represent an unordered digest (as mentioned above).
iii. Even if one has a chromsizes file containing the lengths and names of the sequences, there is in practice no guarantee that this file is ordered in a canonical way, even if the "canon" is the lexicographic sorting of UCSC.
All the above are arguments for including unordered digests at both layer 0 and the top level. |
See #17 |
How to encode sequence order: ordered inputs vs order arrays
We previously established that the sequence collection digest will reflect sequence order (see ADR in #17). There are 3 proposals for how to accommodate this:
Option 1 amounts to no support for unordered collections, as mentioned by @sveinugu above. Options 2 and 3 would provide support at layer 1 (if I've understood the layers correctly). So, advantages of order array are:
But a disadvantage of the option 2 canonical order array is that it adds some complexity, specifically in determining canonical order. This requires setting new restrictions on attributes (like names must be unique), or a complicated series of tests for tiebreakers, like if name is the same, then go to topology... and so on, which becomes problematic with the possibility of custom arrays. Well-summarized by RD: "each column we can order easily, but combinations of columns make it more difficult." This leads to option 3: we simply order each included array lexicographically, and then include an order_ATTRIBUTE array for each array in the collection, which indexes into that attribute. This provides the benefits of the order array, without the drawbacks of having to define a canonical order. A disadvantage is that you have to include an order index array for every other array, which seems to add complexity and space. |
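A minimal sketch of option 3, assuming that "indexes into that attribute" means `order[i]` is the position of the i-th original item in the sorted array. The helper names `split_order` and `restore` are hypothetical.

```python
def split_order(values):
    """Return (sorted_values, order) such that values[i] == sorted_values[order[i]]."""
    # Stable argsort: original positions listed in sorted order.
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    sorted_values = [values[i] for i in ranked]
    order = [0] * len(values)
    for sorted_pos, original_pos in enumerate(ranked):
        order[original_pos] = sorted_pos
    return sorted_values, order


def restore(sorted_values, order):
    """Rebuild the original, ordered array from the sorted array and its order array."""
    return [sorted_values[i] for i in order]


names = ["chr2", "chr1", "chrX"]
sorted_names, order_names = split_order(names)  # would be stored as names / order_names
```

Each array is sorted independently, so no cross-column tiebreaker is needed; the cost is one extra index array per attribute.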
2. How should we represent the order in the digest algorithm? Three answers have been suggested:
A. We only support ordered digests at all layers, i.e. all arrays are ordered. Queries on unordered arrays can instead be implemented in the comparison API.
B. All arrays are unordered in themselves, and in addition we add the In order for the digest algorithm to be able to uniquely determine the digests of the unordered arrays, they need to be sorted according to a canonical sorting scheme. In addition, an Arguments for this solution were previously provided here: #7 (comment) and #7 (comment). This suggestion has been debated at length, with the main difficulty related to the details of how to specify the canonical sorting. It has been agreed that sorting on descending length seems to be the natural choice, but the problems arise in the edge cases where several sequences are of the same length. One would then need to use either The definition of canonical order would introduce a dependency between the unordered array digests, even if this is limited to only the four main ones:
C. All arrays are maintained as both ordered and unordered lists/digests independently of each other. This could be done by maintaining two versions of all arrays, e.g.:
Or equally by adding an order array for each array:
In either case, the unordered arrays could then be canonically sorted lexicographically (or numerically in case of `lengths`) independently of the other arrays. Also, one could calculate digests for each of them in order to more easily find whether two seqcols differ in order and/or content, now also on a per-array basis, e.g.: Given the current solution, we define two seqcols, X and Y.
Here, the contents of X and Y are the same, but the order of the sequences differs. Currently, the 0-layer digests will not match (except for topology):
With solution C, the seqcols could be implemented as follows:
Which will provide a much more informative layer 0:
A main argument against C is that doubling the number of arrays makes the solution more complex. Also, it does not directly solve the problem of providing unordered top-level digests (in 1b), as it only provides array digests (layer 0). However, one can easily calculate such a top-level digest by digesting only the unordered arrays. |
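As a rough illustration of solution C, the sketch below computes per-array digests for both the ordered and the independently sorted form of each array, then derives an unordered top-level digest from the sorted forms only. The `unordered_` prefix, the serialisation, and the digest helper are assumptions for the example.

```python
import base64
import hashlib


def digest(text: str) -> str:
    """Truncated SHA-512, base64url-encoded (assumed stand-in digest)."""
    raw = hashlib.sha512(text.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def layer0(seqcol):
    """Per-array digests for the ordered and the sorted ("unordered") form."""
    out = {}
    for attr, values in seqcol.items():
        out[attr] = digest(",".join(map(str, values)))
        # sorted() is lexicographic for strings and numeric for integer lengths.
        out[f"unordered_{attr}"] = digest(",".join(map(str, sorted(values))))
    return out


def unordered_top_level(seqcol):
    """Digest only the unordered per-array digests, in a fixed key order."""
    level0 = layer0(seqcol)
    keys = sorted(k for k in level0 if k.startswith("unordered_"))
    return digest(",".join(f"{k}:{level0[k]}" for k in keys))


X = {"names": ["chr1", "chr2"], "lengths": [1000, 800]}
Y = {"names": ["chr2", "chr1"], "lengths": [800, 1000]}
```

Here the ordered digests of X and Y differ while their `unordered_*` digests and the unordered top-level digest agree. Because each array is sorted on its own, the pairing between a name and its length is not captured by this unordered digest.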
I mentioned in the last meeting that I had an idea of a variant of C which seems more elegant and useful. The idea is to add an intermediate level of the organisation to denote respectively the
We will then need to add another layer above the current layer 0, which will then become the new level 0. Here, we can directly observe that the two seqcols A and B are differently ordered versions of the same collection:
With such a structure, we will directly compute digests that capture the concept of an unordered seqcol, as argued for in 1b above. At layer 1, we will get more detailed information about which arrays differ in terms of order, compared to the equal unordered counterparts:
Here, we see that e.g. the With such a solution I would argue for the The usefulness of such an approach should become clear in the use case of comparing variants of the human genome of the same core version (e.g. hg38/GRCh38), with different names ('chr1' vs '1') and order. Here the |
I would advocate for the specification to allow for both ordered and unordered sequence collections: I would argue that the best solution (out of the ones described above) to support both these use cases is the order array (option B in @sveinugu comment above). The issue of uniqueness of the field used to compute the canonical ordering is very common in Bioinformatics in general and I really can't imagine anyone being surprised if we impose that the name is unique within a collection. |
Unfortunately, neither sorting variant of B is really consistent under changes. Consider a case of a series of four seqcols W, X, Y and Z that all consist of sequences with the same lengths, but where two of the sequences have the same length (456). At each step of this series, one aspect of the seqcol changes:
Option A:
Option B, canonical sorting = length (descending), names:
Option B, canonical sorting = length (descending), sequences:
In conclusion, any change in the arrays used as secondary keys in the canonical sorting algorithm will trigger an order change in the other columns (except in the |
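The effect can be reproduced with a small sketch. It assumes a canonical order of length (descending) with names as the secondary key; the example values are invented, apart from the tied length 456 mentioned above.

```python
def canonical_order(names, lengths):
    """Row indices sorted by length (descending), then by name (ascending)."""
    return sorted(range(len(names)), key=lambda i: (-lengths[i], names[i]))


lengths = [456, 456, 123]  # the first two sequences tie on length
sequences = ["s0", "s1", "s2"]  # placeholder sequence digests

before = canonical_order(["A", "B", "C"], lengths)
after = canonical_order(["Z", "B", "C"], lengths)  # only the first name changed

# The canonically sorted sequences array changes although no sequence changed.
sequences_before = [sequences[i] for i in before]
sequences_after = [sequences[i] for i in after]
```

Renaming one of the tied sequences swaps the canonical positions of both, so the canonically sorted `sequences` array gets a different digest.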
For completeness, this is how option C would handle the case above:
With option C, logic is restored at the cost of complexity. |
In the meeting yesterday, it became clear that option B was not logically coherent and elegant enough as a way to relate ordered and unordered arrays (as argued for in the previous comment). Also, the consensus seems to gravitate towards option A for simplicity. It would thus seem that not much progress has been made through the current discussion (but perhaps we have gained more clarity?). Not fully eager to let go of the logical elegance of option C, but aware of the complexity, I thought a bit more into the situation and an idea came to me: What if we support "all of the above"? Clearly, option B is not preferable, so we should not recommend this, but I believe we could in principle still support it. (Note: I accidentally submitted this comment. I am writing a longer comment providing a detailed proposal that will come later...) |
(continuing from accidentally submitted comment...) How? By realising that the options discussed above (and plenty of others) can be reformulated as a combination of a specific structure together with a specific variant of the digest algorithm (for each level in the data structure). Given the following seqcol:
We have already (more or less) defined three local variants of the digest algorithm, depending on whether we are on the first, the second or the third level, e.g.: ga4gh_seqcol_array:
ga4gh_seqcol_object:
ga4gh_refget_sequences:
So my first idea is to formalise the choice of digest algorithm as part of the data structure itself, for each level, e.g.:
The main advantage of this is that the data structure would then be self-describing in how its digest was created, allowing for a whole new level of automation (which I will return to later). Also, I think the As there are digests at all levels in our design, the
This structure to represent arrays is obviously just a suggestion. An advantage of it is, however, that the name of the array is included in the structure itself, making it self-describing. If the name of the array also is included in the digest algorithm, one would make sure that different types of arrays have unique digests (which seems smart). Going back to the use case of providing digests for unordered sequence collections. Given that we have three seqcols, A, B, and C, which have the same content, but where the sequences are ordered differently. They will then get different digests according to the algorithms described above. However, given that one defines an alternative digest algorithm that disregards the order of the sequences, such an "unordered" algorithm would provide the same digest (say D) for A, B, and C. In a tool setting (e.g. a genome browser), digest D could then be used as an identifier for a "bucket" that collects items as it makes sense in the setting (e.g. representing the unordered coordinate system "hg38") where datasets generated with all ordered seqcols A, B, and C can be visualised together on the same page. It makes sense to use an unordered digest D to represent the genome browser page "bucket", instead of one of the ordered digests A, B, or C, as the visualisation in any case is sorted first by the canonical sequence names and then by genome position. Should we provide such an unordered digest algorithm? Perhaps? Perhaps not? But if not, I would argue that we should make it easy for someone else to define it while still following the standard. Formalising the digest algorithm within the data structure would provide this. |
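One way to picture a self-describing record is a lookup from an algorithm identifier to a digest function. In the sketch below, `ga4gh_seqcol_array` is the identifier used in this thread, but its function body is only a stand-in; `example_unordered_array`, the record layout, and the JSON serialisation are all hypothetical.

```python
import base64
import hashlib
import json


def digest(text: str) -> str:
    """Truncated SHA-512, base64url-encoded (assumed stand-in digest)."""
    raw = hashlib.sha512(text.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def ordered_array(values):
    return digest(json.dumps(values, separators=(",", ":")))


def unordered_array(values):
    # Sorting first makes the digest independent of the input order.
    return digest(json.dumps(sorted(values), separators=(",", ":")))


ALGORITHMS = {
    "ga4gh_seqcol_array": ordered_array,  # identifier from the thread, stand-in body
    "example_unordered_array": unordered_array,  # hypothetical identifier
}


def digest_record(record):
    """record: {"algorithm": <identifier>, "name": <array name>, "values": [...]}"""
    algorithm = ALGORITHMS[record["algorithm"]]
    # Including the array name keeps digests of different array types distinct.
    return digest(record["name"] + ":" + algorithm(record["values"]))


def make(values, algorithm):
    return {"algorithm": algorithm, "name": "names", "values": values}


orders = [["chr1", "chr2", "chr3"], ["chr3", "chr1", "chr2"], ["chr2", "chr3", "chr1"]]
ordered = {digest_record(make(v, "ga4gh_seqcol_array")) for v in orders}
unordered = {digest_record(make(v, "example_unordered_array")) for v in orders}
```

The three orderings (playing the role of A, B, and C) yield three distinct ordered digests but a single unordered digest, which corresponds to D above.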
Note on the format of the digest algorithm formalisation I have until now assumed the value to be based on some sort of vocabulary of digest algorithms. However, this is not very useful, as one would then need to keep a register of algorithms with said identifiers. A better solution would be to provide URLs to documents describing the algorithm. But URLs are prone to change, which would either result in dead links or the need to change the URL in the data, but that would change the digest. The solution is, I believe, CURIEs, and specifically a DOI identifier to a published document describing the algorithm (e.g. deployed in https://Zenodo.org)! |
I'm not sold on this. I think I get the point, but personally, I don't see what's the big problem with just determining the order by comparing the order of the constituent elements. Sure, you have to go down one layer to get the elements and compare them, so it's one step more work than if you had a digest to compare... but how often do you really need to ask that question? And, even if that's a question that needs to be asked repeatedly, then why wouldn't you just pre-compute it, and store it in a database? What you're suggesting here doesn't seem to be accomplishing any more than that, but is doing so by adding much more complexity to the data structure and algorithms. Now you'll have to have a controlled vocabulary for the digest algorithm identifiers like "ga4gh_seqcol_array", and ideally a server somewhere that maintains a mapping of what those terms represent and how to implement them, and different servers would be incompatible if they implement different digest algorithms, etc. To me this is just way too much overhead to warrant what I believe is actually not even really a benefit, since you can solve the question of order from just pre-computing a compatibility comparison that looks at the elements. |
Unfortunately I have not had time to complete this, there is much more to the suggestion than just this use case. More will follow... |
(EDIT Oct 20: added some example content to get the point better across, as I hardly understood myself when re-reading my post some weeks later...) What makes it difficult for me to let this go is not that our solution makes it a bit more cumbersome to compare two digests, but that our solution does not really provide any digests that can be used as identifiers for a coordinate system to replace what e.g. “hg38” is used for today. Which means that some genome browsers (cough UCSC) probably will not embrace the standard. So I believe the current solution entails that some canonical ordering of hg38 will be registered as the golden standard for a browser (possibly lexicographically by sequence names, e.g. "chr1, chr10, chr11,…", as this is easy to carry out automatically). For tools that manage or analyze track files (or scripts/command line usage), all track files (e.g. BED) must carry along a reference genome identifier (put there by e.g. the mapper tool). Let's say that we have a BED file "hotspots_6afd637fc190.bed" that we know is made from seqcol "6afd637fc190", helpfully provided in an unstructured way in the file name (as there is no metadata field for the reference assembly identifier in the BED format itself). Or at least somewhat better, the tool in question has a "reference_genome" parameter or similar that stores the seqcol digest. So in order to make use of this info in a service that does not care about the ordering of the sequences (say the UCSC genome browser), one needs to run the digest through a particular function (available in a library which wraps a seqcol comparison function). This function loops through all available canonically ordered coordinate systems (stored in a list or fetched from the genome browser), and for each of them, it checks whether the canonical coordinate system is a subset of the seqcol referred to by the seqcol digest recorded with the BED file, with same lengths and names, but possibly in a different order. 
In which case, one knows that the coordinate system can be used for the analysis or visualisation. Once this is carried out and a match is found, however, the coordinate system digest that was found cannot really logically replace the currently stored seqcol digest, as it may represent a reordering. So the BED file in our example cannot be renamed to e.g. "hotspots_35fab9c10384.bed", as the coordinate system "35fab9c10384" explicitly refers to an ordered seqcol that does not match the order in the BED file, which again may cause issues in downstream tools. Currently this "un-FAIR" BED file would have been called e.g. "hotspots_hg38.bed", and this would be ok as tools that require coordinate systems would not care about the order of sequences. Hence, "hg38" in practice means an unordered coordinate system in this context. One could, of course, continue to disregard the order of the sequences in downstream tools, but this would IMO mean that we have defined another standard that doesn't really match the landscape of the field of track file analysis/visualisation. So in practice, one loses the current main purpose of the «reference genome» field, which is for the end user (as well as tools) to easily check directly (by reading the string) whether the file can be uploaded to the genome browser or otherwise be compared with other files of the same coordinate system. Being able to replace the main ordered reference genome with a digest representing an unordered coordinate system is a loss of information (one loses the sequences and the order), but at least it logically represents the same concept as the current string ("hg38") and does not add incorrect information (another ordering). 
With my suggestion, an unordered coordinate system digest would in principle contain information about the canonical ordering as part of the documentation of the digest algorithm (in the part that defines a canonical serialisation of the seqcol object), which I believe is the logically correct place for that piece of information. Also, one can carry out this information reduction offline (removing the sequence array and the order) and it will be independent of particular lists of supported digests for particular genome browsers. |
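The lookup function described in this comment might look roughly like the sketch below. The function name, the dictionary layout, and all lengths are invented for illustration; only the identifier "35fab9c10384" is borrowed from the example in the comment.

```python
def find_coordinate_systems(seqcol, canonical_systems):
    """Return ids of canonical systems whose (name, length) pairs all occur in seqcol."""
    pairs = set(zip(seqcol["names"], seqcol["lengths"]))
    matches = []
    for system_id, system in canonical_systems.items():
        system_pairs = set(zip(system["names"], system["lengths"]))
        if system_pairs <= pairs:  # subset test, so order is ignored
            matches.append(system_id)
    return matches


# Canonically (lexicographically) ordered coordinate systems known to a browser.
canonical_systems = {
    "35fab9c10384": {"names": ["chr1", "chr10", "chr2"], "lengths": [1000, 500, 800]},
    "other": {"names": ["1", "2"], "lengths": [1000, 800]},
}

# The seqcol recorded with the BED file: same content, other order, one extra sequence.
bed_seqcol = {"names": ["chr1", "chr2", "chr10", "chrM"], "lengths": [1000, 800, 500, 16]}

matches = find_coordinate_systems(bed_seqcol, canonical_systems)
```

The match tells the tool that the file is usable with that coordinate system, but, as argued above, the matched identifier still refers to a specific ordering.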
Unfortunately, the issue of providing digests for unordered coordinate systems reveals a problem with the current "columnwise" solution with all information stored in arrays, e.g.:
Note: there are some inconsistencies in the nomenclature between issues #10, which describes One problem with the current column-wise solution is that it is not possible to generate a level 0 digest for an unordered sequence collection from a level 1 object with per-array digests. This is true even if such level 1 per-array digests referred to unordered arrays, as this would break the row-wise relationships between the "cells" of information at level 2, which is crucial for generating an unordered digest. Hence, for the possible extension to unordered sequence collections, level 1 as defined today (option A above), is useless. This is also true for both variants of option C above, the first variant which doubles the number of arrays (see #5 (comment)) and the second variant which contains 'ordered' and 'unordered' sub-objects that each follow option A (see #5 (comment)), for the same reason. The underlying logical fallacy in the above discussion, and the real reason behind all the confusion is that unordered sequence collections really rely on a "row-wise" structure. The same seqcol in a "row-wise" form would look something like this:
With such a structure, it would be trivial to create an unordered digest from level 1. In the serialisation algorithm, one would just order the array items (which are all digests) alphanumerically before applying the digest algorithm. However, one would lose all the advantages that the column-wise solution gives, mainly that appending new fields will not affect the relationships that the other arrays are part of. The main reason that I am still stressing the issue of unordered sequence digests is that it represents a significant argument against a pure column-wise data structure, with a use case that is of some importance (and of major concern to me in my main scientific domain). |
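A minimal sketch of the row-wise idea, assuming each row is reduced to a digest of its name and length and that the row digests are sorted before the final digest is taken. The serialisation and the helper names are assumptions.

```python
import base64
import hashlib


def digest(text: str) -> str:
    """Truncated SHA-512, base64url-encoded (assumed stand-in digest)."""
    raw = hashlib.sha512(text.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def row_digests(seqcol):
    """One digest per row, binding each name to its own length."""
    return [digest(f"{n}:{l}") for n, l in zip(seqcol["names"], seqcol["lengths"])]


def ordered_digest(seqcol):
    return digest(",".join(row_digests(seqcol)))


def unordered_digest(seqcol):
    # Sort the row digests alphanumerically before digesting them.
    return digest(",".join(sorted(row_digests(seqcol))))


X = {"names": ["chr1", "chr2"], "lengths": [1000, 800]}
Y = {"names": ["chr2", "chr1"], "lengths": [800, 1000]}  # X reordered
mispaired = {"names": ["chr1", "chr2"], "lengths": [800, 1000]}  # lengths swapped
```

X and Y share an unordered digest but not an ordered one, while the mispaired collection differs on both, since the row-wise relationship is part of every row digest.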
To me, this is another manifestation of what I'm concluding from our order discussions: that there is no straightforward way to add order-ignoring digests. I don't quite follow your argument for why the above options we've discussed fail in this case -- but anyway I agree that switching to your "row-wise" structure is another way to address the order issue. Either way, all of these solutions we've come up with are very complex, but in my opinion, provide very little practical benefit. To me, there has still not been a satisfactory argument against simply using the compare function to ask questions of order-equivalence. Our effort trying to get to order-ignoring digests has shown that there's no easy way to do that. In contrast, the compare function is easy to write, easy to use, and provides exactly the same information (actually, provides much more information). Therefore, I have returned to my original position: the digests must be ordered to accommodate the ordered use cases, and use cases that require knowing if there is order-ignoring equivalence should simply use the comparison function to identify that.
I disagree here. In fact, my use case is I believe in line with yours, that seqcol needs to be useful for coordinate systems, and even unordered coordinate systems. You and I are both working with BED files and interested in identifying compatibility between them. Your proposals all address the issue with order-ignoring digests -- but the compare function also solves this issue! I'm not sure if you've considered if you can address your need just using the compare function. I suggest taking a bit of a different tack here. I do not think re-opening the question of 'column-wise' vs 'row-wise' is the right answer. Our decision had good rationale, such as the independent component digests, which are desirable for the ordered use case which must be the primary function of seqcol. We paused the order/unorder question to work on the compare function in the meetings to see if the compare function could be a workable solution to the use cases we were trying to answer with the order-ignoring digests. So, my suggestion is this: can you make use of the compare function to answer the questions you need to answer about unordered seqcols? The answer is obviously yes, you can, though it might be slightly less convenient, right? But I think if we do the compare function right, it doesn't even have to be much less convenient. So, the question is: what does it take for the compare function to satisfy your use case? Once that is answered, we can put the complicated and very time consuming order question to rest. |
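For illustration only, a compare function of the kind argued for here could report, per shared array, whether two collections contain the same elements and whether they appear in the same order. The function name and report layout are hypothetical and are not the comparison function defined by the project.

```python
from collections import Counter


def compare(a, b):
    """Compare two seqcols given as dicts mapping array names to lists."""
    arrays = {}
    for attr in sorted(set(a) & set(b)):
        arrays[attr] = {
            # Counter comparison ignores order but respects duplicates.
            "same_elements": Counter(a[attr]) == Counter(b[attr]),
            "same_order": a[attr] == b[attr],
        }
    return {
        "arrays": arrays,
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }


X = {"names": ["chr1", "chr2"], "lengths": [1000, 800], "sequences": ["s1", "s2"]}
Y = {"names": ["chr2", "chr1"], "lengths": [800, 1000]}

report = compare(X, Y)
```

A single call answers the order-equivalence question and also shows which arrays are missing from either side, which a bare digest match cannot do.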
@nsheff Thanks for your thorough answer. However, I have tried to clearly point out the main issue in my previous comment above (#5 (comment)): For seqcol to be useful in a FAIR manner for track files, one would need a globally unique identifier that precisely defines a coordinate system, which is in practical use, as well as logically, an unordered sequence collection of As a side-note, we should probably use In any case, a user with a track file and the related metadata could then simply use the Unfortunately, I believe the coordinate system needs to be an unordered one. And the comparison function will not help with providing an accurate metadata identifier for the coordinate system (the creation of which was the main reason I joined this project!). |
So here's my suggestion: We support both column-wise and row-wise structures (and any other type of structure really), as well as ordered and unordered serialisation algorithms (and possibly more), and any number of hash algorithms we like. However, we can leave it to others to define some of these if we do not want to. The only thing we need to add to the above suggestion is a self-describing information about the data structure, something which is already supported by JSON Schema. Collecting all of the above, this may look like follows:
And similarly at all levels and substructures. All identifiers should follow CURIE form and preferably be actionable through https://identifiers.com or https://n2t.net (names2things). As to the With this, it will be up to the server to select the algorithms and schemas to support. We will provide required lists and optional lists, which can be easily extended. Also, as all records are self-described, one can potentially implement clients that will work regardless of the contents (however, this will require that the algorithm identifiers are actually actionable, which the JSON Schema id already is. I don't know if there are any standards for actionable algorithms with runnable implementations, but if not, there should be). With this solution, we don't need to implement a solution for unordered coordinate systems now, but someone else will be able to, and it will be easy to do so later. EDIT: I suppose the |
I think I disagree here. I think coordinate systems should be ordered. In other words, I'm not convinced that they should not be allowed to have order. |
I agree that this is the core question and that it has potentially large implications (detailed in the current issue). I don't think, however, that the answer to this question should be made based on ease of implementation (for us). I know that both the UCSC Genome Browser and the Genomic HyperBrowser don't care in which order the sequences appear in the input file. Given that we want the possibility of uniquely specifying a coordinate system for each file, there are really three possibilities:
EDIT: Added a few arguments here and there. |
One idea I have been thinking about that would solve the order issue, as well as adding the possibility of seqcol subsets. It would also work independently of my metadata ideas as explained above. What if we allow the following:
Unordered coordinate systems could then be encoded by defining each sequence as a subset, as follows:
For Z_unordered, the subset digests are sorted before the final digest is computed. I believe somehow adding some metadata info in the structure that encodes the information that it is an unordered seqcol is also needed. EDIT Dec 1: Removed the sequences and topologies from the coordinate system example, as those arrays wouldn't typically be a part of defining a coordinate system, at least in the horizontal genome browser world. |
As the above issue is lengthy and full of details, I will try a shorter writeup of arguments for the need and my suggested solution to provide digests for unordered coordinate systems: So the main feature I have been advocating here is that the seqcol specification should provide globally unique and persistent identifiers for the coordinate system that a track data file (e.g. a BED file) relates to (FAIR principle F1). I have a vested interest in this issue due to being the lead developer of the FAIRtracks draft standard for genomic track metadata (see e.g. a recent blog post). It has been argued that "genome build information is an essential part of genomic track files" (which is incidentally the title of this paper), and since few track data formats include support for including genome build identifiers, at least any metadata standards should. In the context of track data, the exact sequence contents are often of lesser concern than the coordinate system, and tracks made from different patches of the same version of a reference genome are usually analysed together. Hence, the need to uniquely identify coordinate systems. Of specific importance in this regard is the need to somehow denote whether a track follows the UCSC 'chr' naming scheme, something which we in our solution can manage with the First of all, @nsheff is correct (#5 (comment)) in that the core issue to be determined here is whether this identifier should refer to an ordered or an unordered coordinate system. I argue that for most practical terms, the current vaguely defined identifiers, such as "hg38" are in practice used to refer to a coordinate system where the exact order does not count. If we were to require that a coordinate system needs to be ordered, the natural order to pick would be the lexicographic order, for several reasons, one being that this is the order required by the BigBED file format. I believe there are many problems with this approach:
This comment is a previous attempt at arguing for the need for a schema for unordered coordinate system identifiers. I will follow up with a writeup of the current suggestion for supporting identifiers for unordered coordinate systems as well as seqcol subsets. |
Can you clarify what you mean by "track file" ? In my experience this is not a very commonly used term and I'm not exactly sure what you mean. Either defining it or using a more frequent term would help me understand. |
How I suggest we solve the issue: First, there doesn't seem to be any solution for generating a digest for an unordered seqcol from our current level 1 array digests. What is needed is to somehow add support for row-wise data storage at level 2 in addition to the current column-wise solution. Here is an updated suggestion on how to achieve this: Given the following ordered seqcols X and Y.
Say we want to specify a seqcol Z that has X and Y as subsets, in that order. This could be done as follows:
So my idea is that this solution to the subset problem also represents a solution to the problem of representing unordered coordinate systems. First, one would need a way to represent the case where the subsets are unordered (which we in our example will call Z*). The solution to this is trivial; it is just to remove the
And this is really also the only thing needed in terms of data representation to solve the issue of providing digests for unordered sequence collections. The rest of the solution can be implemented on the data side by adding sequence collections where each subset only contains a single sequence, e.g. as illustrated with the unordered seqcol W:
What is lost in this solution is the possibility of supporting subsets also for unordered coordinate systems. One could possibly allow two levels of subsets, but I think that would complicate everything. In practice, one would just make use of the subsets of the corresponding ordered seqcols instead, if defined. Of course, supporting this suggestion would have consequences for the other endpoints, but I am quite sure it would not be very difficult to sort that out. As argued in the comment above, I believe it would be worth it. Also, one can add other unrelated arguments for supporting subsets, for instance that this would make it possible to see the differences between sequence collections at the subset level, which I believe will be enough for many cases. It will probably also reduce the total storage size needed for a server, due to the current redundancy at the subset level. |
E.g. a BED/WIG/GFF/BigBED/BigWIG/VCF/... file |
I have tried to clarify this in the post by using the term "track data file" instead and exemplify this with the BED file format. I don't know what else to call such files? Do you have any other suggestions? |
I agree with you 100%, and I'm fully on board with that. So, we definitely share the vision for the main point of this....but I just fundamentally disagree on a few of the points, notably:
Here, my opinion diverges. In my opinion:
But set that aside for a moment. If I go along that coordinate systems should be defined as unordered... even if I do that, to me there is one major point that has not gotten enough attention: even if we do provide an unordered digest, I do not believe this will solve the primary purpose of this, which is to "provide globally unique and persistent identifiers for the coordinate system".

The reason is this: in my experience, the most frequent incompatibilities between coordinate systems are NOT due to order mismatches, but instead to differing sequence names or collection elements contained. For example, all the commonly used "hg38" references in the wild today differ in terms of collection elements and sequence names. None of them is simply an order-differing variant of another. Therefore, an order-invariant digest will be of no use in this situation, since their order-invariant identifiers will not match anyway.

To me, the way to solve this problem is not to mint new identifiers that lack order. Instead, the problem can be addressed through the comparison function, which has the potential to address not only the order question, but all those other compatibility questions as well. I acknowledge that it's less convenient to have to execute a function instead of just looking for a perfect digest match -- however, the function better reflects the complexity of the problem we're dealing with. Those incompatibilities go beyond order and are simply too many to be solved by unique identifiers. It requires something more powerful. That's the compatibility function. So instead of having people rely on the identifier to convey everything they need, I'm arguing that we need to move people to using the compatibility function.
If it's super easy and lots of tools built around the return value of So, my conclusion is we should not bother adding the complexity of multiple types of digests, and instead, make the comparison function as fast and useful as possible, since that's what people are going to have to be using anyway. A few more thoughts about why the unordered identifiers won't be used:
|
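To make the comparison-function argument above concrete, here is a minimal sketch in Python. It is not the seqcol comparison endpoint: the function names `compare_arrays` and `compare`, the shape of the returned dictionary, and the example arrays are all assumptions made for illustration. It only shows how one function call can answer several compatibility questions (shared elements, extra elements, order agreement) that a single digest match cannot.

```python
from collections import Counter


def compare_arrays(a: list, b: list) -> dict:
    """Summarise overlap and order agreement between two arrays (hypothetical helper)."""
    ca, cb = Counter(a), Counter(b)
    shared = ca & cb  # multiset intersection; keeps only positive counts
    # Keep only values that occur in both arrays, preserving each array's order.
    a_shared = [x for x in a if x in shared]
    b_shared = [x for x in b if x in shared]
    return {
        "a_only": sum((ca - cb).values()),
        "b_only": sum((cb - ca).values()),
        "shared": sum(shared.values()),
        # None when nothing is shared, since order agreement is then undefined.
        # Values repeated a different number of times are reported as not in the same order.
        "same_order": (a_shared == b_shared) if shared else None,
    }


def compare(a: dict, b: dict) -> dict:
    """Compare two collections array by array (hypothetical, not the official output format)."""
    common = sorted(set(a) & set(b))
    return {
        "arrays_a_only": sorted(set(a) - set(b)),
        "arrays_b_only": sorted(set(b) - set(a)),
        "arrays": {name: compare_arrays(a[name], b[name]) for name in common},
    }


# Two made-up collections: same lengths, but different naming schemes.
ucsc_style = {"names": ["chr1", "chr2"], "lengths": [100, 200]}
plain_style = {"names": ["1", "2"], "lengths": [100, 200]}
result = compare(ucsc_style, plain_style)
print(result["arrays"]["names"]["shared"])        # 0
print(result["arrays"]["lengths"]["same_order"])  # True
```

In this made-up example the names do not overlap at all while the lengths agree in content and order, which is the kind of mismatch an order-invariant digest alone would not explain.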
To me, "track" refers to a row in a genome browser. I think a BED file can be represented as a track in a genome browser, but it can also just be a BED file, which I use in an analysis independent of a genome browser. As soon as you say "track data file" you're talking from the perspective of a genome browser. But since 99% of my use of these data files is not in conjunction with a genome browser, I find the terminology confusing. Those files are obviously also very useful outside of genome browsers. So, I wouldn't refer to them as tracks unless you're specifically talking about genome browsers. I usually refer to the files by data type, like "genomic interval data" , but that's specific to BED-type and you're looking for a more generic term that also includes wiggle-type tracks... I'd probably say something like "genome signal/region annotation" files, or something like that. |
So I definitely agree with this experience, which is why I believe our current idea is a great attempt at providing a solution to the situation. So when "hg38" is attached to e.g. a BED file, this typically can mean anything in terms of collection elements and sequence names, but typically not order. And the reason for this is that in this context (file metadata), you would primarily be interested in preserving the provenance (which reference genome was used to create this file).

My argument is that the use of the term "hg38" changes slightly in meaning in the context of analysis and visualisation tools. Here, you would no longer be so interested in provenance, other than wanting to maintain the existing metadata all the way to the results (and also presumably adding info about the analysis parameters, etc.). But in the analysis or visualisation itself, you are interested in the mathematical model of the space within which your data is defined, and whether the space inherent in your data files is compatible with other data files or the model inherent in the analysis/visualisation.

So what I now see more clearly is that we are really talking about two different things:

A) One is an identifier representing a sequence collection that refers to a record in a repository. This is the main scope of the seqcol standard. The ordered coordinate system as represented in a chromSizes file is an extract of this, and I agree that one would always want to carry this identifier along with a data file.

B) The other thing is an identifier for the data model that is needed to make mathematical/statistical sense of the track data. Basically, the coordinate space in which the data file is defined. This model needs to be unordered, and should really also be extended to include information such as gaps in the assembly and other gaps where data is missing. This is an analysis-oriented model in contrast to a storage-oriented model.
So to me, a "coordinate system" is a word I would use for B and for B only. It is a mathematical term. A chromSizes file is a representation of a coordinate system, not the other way round. I would also say it is really not the best representation, precisely because it contains an order which is in practice often discarded anyway for analysis (but retained for visualization/results). In addition, a mathematical model of the space in which the data file is defined should really also have information about missing data, the most important of which is the centromere regions, but also other repeating regions, etc. We have been discussing some of these aspects before, but have tended to define them out of scope. I do understand that we cannot solve everything with this, but having the possibility of order-invariant sequence collections would allow for the seqcol standard to be used to also represent a coordinate system in the mathematical sense, at least in a very basic form. |
This is getting a bit out of scope, but IMHO you just proved my point that there aren't really any better terms to use than tracks, or possibly "track (data) files", also for non-visual analysis. Here is a bit of text I just wrote for the new FAIRtracks.net website (under construction), containing my core argument:
Forgetting to think of a BED file as a track that follows the genome can have large consequences for statistical analyses, for instance by disregarding the importance of the clustering aspect of the elements along the coordinate system. Not sure if I will be able to win you over here, but at least I'm trying, one bioinformatician at a time... |
Yes I agree there's not a great existing term for that. But to me you're trying to create a new use for an old term. If you can do it, great! I am not arguing against that, just telling you how I interpreted the term. I interpreted it differently from how you used it, and as a result, I am confused when I read some of what you write. For terms like this not in common use, if you want to use them in a way different from what others do, then... you have to make that clear and define them when you use them, and it's going to cause some confusion, until you succeed in training the whole community to use the term in that way. Personally I think this particular one could be a losing battle -- 'track' is a very old term that is already embedded, at least in my mind, from decades of genome browsers. Even now that I understand what you mean, I can't think of it naturally that way. I've used that term way too much to mean something else. Maybe too late but I'd suggest it may be easier to use a new term that conveys the idea you're trying to convey, rather than using one that is already embedded in the community with a different meaning. But if you can convince everyone to broaden the meaning of track, then so be it! |
I have now realized that supporting unordered sequence collection digests in the way I argue for in the comment above (#5 (comment)) could be implemented as a convention in a specialized server using the current seqcol standard, only with the addition of a custom array. Let's name this array
This specialized server could then crawl other seqcol servers and generate and register new unordered coordinate system seqcols, probably by only looking at the It would, however, be useful if the seqcol standard included an array with subset information, priority, or similar, so that one could extract only the core sequences if one wanted. With this, I think we could finally close the issue of unordered seqcols, as I have here sketched an unofficial way to support unordered sequence collection digests in a seqcol-compliant server by just adding one additional array with additional constraints and logic. |
Well, we have been publicly arguing for such use of the term "track" since 2010 (https://doi.org/10.1186/gb-2010-11-12-r121) and did use it internally in this way 2-3 years before that. Also the community of scientists working with method development for track data is not THAT big!... Anyway, let's leave it there. |
I'm trying to get my head around your suggestion here. If I understand correctly, this hearkens back to the idea of a 'canonical order' for arrays. Here the canonical order is found by taking each element, digesting across arrays, and then sorting those digests. This contrasts with the earlier proposals, which based the order on lexicographically sorting the actual array values, which led to challenges when some arrays are not present. So I think your suggestion is good as it could get around some of those limitations we faced with specifying a canonical order. In fact, I believe you wouldn't even need the |
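The canonical-order idea described above (digest each element across arrays, sort the per-element digests, then digest the sorted list) can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 over a compact JSON serialisation stands in for whatever digest and canonicalisation the standard prescribes, and the names `digest` and `unordered_digest`, as well as the example arrays, are hypothetical.

```python
import hashlib
import json


def digest(value) -> str:
    """Hex SHA-256 of a compact, key-sorted JSON serialisation (an assumption, not the seqcol algorithm)."""
    payload = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def unordered_digest(seqcol: dict) -> str:
    """Order-invariant digest: digest each element across arrays, sort, then digest the list."""
    if not seqcol:
        raise ValueError("seqcol must contain at least one array")
    array_names = sorted(seqcol)
    n = len(seqcol[array_names[0]])
    if any(len(seqcol[name]) != n for name in array_names):
        raise ValueError("all arrays must have the same length")
    # One digest per element (row), taken across all arrays (columns).
    per_element = [
        digest({name: seqcol[name][i] for name in array_names}) for i in range(n)
    ]
    # Sorting the per-element digests removes any dependence on the input order.
    return digest(sorted(per_element))


# Made-up example: the same two elements listed in different orders.
x = {"names": ["chr1", "chr2"], "lengths": [100, 200]}
y = {"names": ["chr2", "chr1"], "lengths": [200, 100]}
print(unordered_digest(x) == unordered_digest(y))  # True
print(digest(x) == digest(y))                      # False
```

Because the sort operates on per-element digests rather than on raw array values, elements that are complete duplicates get identical digests, so their relative position does not affect the result.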
Yes, good point. That recommendation for creating unordered digests might even make it into the standard, as it will only be a piece of text and not have any consequences for normal "ordered" scenarios. |
Above, you detailed an issue with the canonical ordering idea,
Does this new way of digesting across arrays solve this? Or does this problem not occur for some other reason? |
So the previous issue was really about the idea of adding a separate In this proposal, we do not include a In conclusion: We have come to this relatively simple suggestion based on a long detour and have collected some baggage on the way that we do not really need to take into consideration if we simplify the issue, as far as I am able to discern. Please let me know if I am thinking incorrectly now. |
Would this require the |
No, I don't think so. Total duplications are not a problem, I believe, as the per-sequence digests will be the same and thus uniquely ordered. |
The discussion of order eventually became #19, #52, and others, which in short concluded:
|
According to the current definition, a "Sequence Collection" is "a set of reference sequences". However, I believe that in order to be useful, this should be changed to an "ordered set of reference sequences", as a change in the order of the collection has implications downstream. For example, a sorted BAM file will not be considered valid if one changes the order of its sequence dictionary. More generally, without a specified order, it will be up to the implementations to decide how to serialize any data that uses a collection.