diff --git a/docs/seqcol_rationale.md b/docs/seqcol_rationale.md
index ca72d04..5bcd4a0 100644
--- a/docs/seqcol_rationale.md
+++ b/docs/seqcol_rationale.md
@@ -15,11 +15,11 @@ More specifically, this document attempts to answer these questions:
 
 ## A tale of three use cases
 
-A major need that drove seqcol development is the need to see if two sequence collections match. This need is repeated across many use cases, and a main motivating factor in choosing a deterministic, content-derived identifier, because it lets you know if the content is identical by simply asking whether the digests are identical. Early on in our discussions of sequence collection use cases, it was clear that different use cases would have slightly different formulations of what is important in a sequence collection. A brief example is that in one case, all you're trying to do is assess whether the *content of the sequences* in two sets are the same -- you don't care about the order of the sequences, or the names of the sequences, or any other metadata attributes. In this case, it would be simple to just digest each sequence, sort the digests, concatenate them, and then digest that string -- and in fact, this was an early simple proposal for the sequence collection standard. But while this would work perfectly fine for the simple use case of asserting two *sets of sequences* match in content, this is not the only use case for sequence collections.
+A major need that drove seqcol development is the need to see if two sequence collections match. This need is repeated across many use cases, and it was a main motivating factor in choosing a deterministic, content-derived identifier, because it lets you know if the content is identical by simply asking whether the digests are identical. Early on in our discussions of sequence collection use cases, it was clear that different use cases would have slightly different formulations of what is important in a sequence collection. For example, a first use case is that of the *archiver*: we just need to assess whether the *content of the sequences* in two sets is the same. The archiver doesn't care about the order of the sequences, or the names of the sequences, or any other metadata attributes. In this case, it would be simple to just digest each sequence, sort the digests, concatenate them, and then digest that string -- and in fact, this was an early simple proposal for the sequence collection standard. But while this would work perfectly fine for the simple use case of asserting two *sets of sequences* match in content, this is not the only use case for sequence collections.
 
-Another use case is this: say we want to reproduce a computational pipeline that does sequence alignment. We need an identifier of the reference genome (sequence collection) used for the alignment, so that when we repeat it, we guarantee we are using the same reference. In this scenario, the sorted, concatenated sequence digests wouldn't work for two reasons: First, for many aligners, the order of the sequences matter. Therefore, the final digest representing the collection should *differ* if the underlying sequences had different order. Second, the output of the aligner is going to specify the location of each read *with respect to the name of a sequence*. A sequence collection with the same sequences, even in the same order, would yield a different alignment output file if the sequences have different names. Therefore, the final digest should *differ* if the names of the sequences do not match. To generate a digest useful for this use case, we could adjust the prior proposal by 1) *not* sorting the sequence digests prior to concatenation; and 2) adding in a *names* vector. Digesting this would yield a content-derived identifier that would match only if the analysis could be reproduced entirely. Thus, already we have two use cases with different immediate answers to question; in the first, we imagine an identifier computed from sorted sequence digests, and in the second, we imagine an identifier computed from sequences plus their names in their existing order.
+Another use case is that of the *aligner*: say we want to reproduce a computational pipeline that does sequence alignment. We need an identifier of the reference genome (sequence collection) used for the alignment, so that when we repeat it, we guarantee we are using the same reference. In this scenario, the sorted, concatenated sequence digests wouldn't work for two reasons: First, for many aligners, the order of the sequences matters. Therefore, the final digest representing the collection should *differ* if the underlying sequences had different order. Second, the output of the aligner is going to specify the location of each read *with respect to the name of a sequence*. A sequence collection with the same sequences, even in the same order, would yield a different alignment output file if the sequences have different names. Therefore, the final digest should *differ* if the names of the sequences do not match. To generate a digest useful for this use case, we could adjust the prior proposal by 1) *not* sorting the sequence digests prior to concatenation; and 2) adding in a *names* vector. Digesting this would yield a content-derived identifier that would match only if the analysis could be reproduced entirely. Thus, already we have two use cases with different immediate answers to the question; in the first, we imagine an identifier computed from sorted sequence digests, and in the second, we imagine an identifier computed from sequences plus their names in their existing order.
 
-Now, here's yet another use case not served by either of these identifiers. An analyst, further downstream, uses data aligned to a reference genome and summarized into genomic annotations. Say they have ChIP-seq data, which defines the binding locations of various transcription factors stored in BED file format. These datasets are columns of a sequence name, plus coordinate start and end for regions of interest. More than 80,000 BED files have been posted on GEO, as these results vary by cell-type, treatment, species, age, etc. The user now is interested in integrating BED files from different studies, either to visualize or analyze them together -- but it only makes sense to integrate them if the BED coordinates are *defined on the same coordinate system*. If defined on different coordinate systems, a position in one does not correspond to a position in another, and they cannot be easily integrated; additional processing would be required, depending on how different the reference genomes are. We would like the digest to somehow inform us as to whether the sequence collections are compatible at the level of a coordinate system. In this use case, the underlying sequence content is irrelevant, as long as the coordinate systems match. Therefore, a digest should consider the *names* and *lengths* of the sequences, but not the actual sequence content. This leads to a third vision of a sequence collection but where the actual base pairs don't matter; what's important is the names and lengths of those sequences.
+Now, here's yet another use case not served by either of these identifiers: the *analyst*. An analyst, further downstream, uses data aligned to a reference genome and summarized into genomic annotations. Say they have ChIP-seq data, which defines the binding locations of various transcription factors stored in BED file format. These datasets are columns of a sequence name, plus coordinate start and end for regions of interest. More than 80,000 BED files have been posted on GEO, as these results vary by cell-type, treatment, species, age, etc. The user is now interested in integrating BED files from different studies, either to visualize or analyze them together -- but it only makes sense to integrate them if the BED coordinates are *defined on the same coordinate system*. If defined on different coordinate systems, a position in one does not correspond to a position in another, and they cannot be easily integrated; additional processing would be required, depending on how different the reference genomes are. We would like the digest to somehow inform us as to whether the sequence collections are compatible at the level of a coordinate system. In this use case, the underlying sequence content is irrelevant, as long as the coordinate systems match. Therefore, a digest should consider the *names* and *lengths* of the sequences, but not the actual sequence content. This leads to a third vision of a sequence collection, one where the actual base pairs don't matter; what's important is the names and lengths of those sequences.
 
 ## Answering the question
 
@@ -29,9 +29,9 @@ An immediate solution is to let each use case define its own identifier. That wo
 
 To do this, we employed several strategies in the sequence collections standard:
 
-- First, we separate the question of "what gets included" from the rest of the standard.
-- Second, we deploy the *comparison function*, a universal way of comparing two sequence collections.
-- Third, we specify a layered algorithm for computing digests, where individual attributes are digested separately and then these are digested again to make the final digest. This provides intermediate digests that can be used for different purposes and also makes it easy to define a custom internal digests.
+- First, we **use a schema** to separate the question of "what gets included" from the rest of the standard.
+- Second, we deploy the **comparison function**, a more powerful way of comparing two sequence collections.
+- Third, we specify a **layered algorithm** for computing digests, where individual attributes are digested separately and then these are digested again to make the final digest. This provides intermediate digests that can be used for different purposes and also makes it easy to define custom internal digests.
 
 These strategies don’t solve all the problems individually, but taken together, they allowed us to design an elegant, powerful, and extensible structure that provides reasonably easy solutions to all our use cases, while maintaining a degree of interoperability among identifiers. Let's dive into each of these strategies in detail to see how it helps us get the best of both worlds.
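
Note on the text in this diff: the three use cases and the layered idea can be illustrated with a small sketch. This is not the digest algorithm defined by the standard. SHA-256 and plain JSON serialization are stand-ins chosen for illustration, and every name here (`digest`, `attribute_digests`, `identifier`, `ARCHIVER`, `ALIGNER`, `ANALYST`) is hypothetical. Each attribute is digested on its own, and a use case then picks which attribute digests feed its identifier.

```python
import hashlib
import json


def digest(text: str) -> str:
    """Hex SHA-256 of a string. A stand-in, not the standard's digest function."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def attribute_digests(collection: dict) -> dict:
    """First layer: digest each attribute array separately."""
    seq_digests = [digest(seq) for seq in collection["sequences"]]
    lengths = [len(seq) for seq in collection["sequences"]]
    return {
        "sequences": digest(json.dumps(seq_digests)),
        # Sorting discards order, which is all the archiver case needs.
        "sorted_sequences": digest(json.dumps(sorted(seq_digests))),
        "names": digest(json.dumps(collection["names"])),
        "lengths": digest(json.dumps(lengths)),
    }


def identifier(collection: dict, attributes: tuple) -> str:
    """Second layer: digest only the attribute digests a use case cares about."""
    layer = attribute_digests(collection)
    chosen = {name: layer[name] for name in attributes}
    return digest(json.dumps(chosen, sort_keys=True))


# Hypothetical attribute selections for the three use cases in the text.
ARCHIVER = ("sorted_sequences",)    # content only, order ignored
ALIGNER = ("names", "sequences")    # content, names, and order
ANALYST = ("names", "lengths")      # coordinate system only

original = {"names": ["chr1", "chr2"], "sequences": ["ACGT", "GGCC"]}
reordered = {"names": ["chr2", "chr1"], "sequences": ["GGCC", "ACGT"]}
renamed = {"names": ["1", "2"], "sequences": ["ACGT", "GGCC"]}
new_content = {"names": ["chr1", "chr2"], "sequences": ["TTTT", "GGCC"]}

print(identifier(original, ARCHIVER) == identifier(reordered, ARCHIVER))
print(identifier(original, ALIGNER) == identifier(renamed, ALIGNER))
print(identifier(original, ANALYST) == identifier(new_content, ANALYST))
```

Under these assumptions the archiver identifier ignores reordering and renaming, the aligner identifier changes when names or order change, and the analyst identifier stays the same when only the bases change. All three are built from the same intermediate digests.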