Advanced attribute qualifiers: passthru and transient attributes #86

Open
nsheff opened this issue Oct 18, 2024 · 21 comments

Comments

@nsheff
Member

nsheff commented Oct 18, 2024

In our last meeting, we discussed two additional modifiers.

This is my attempt to document the rationale behind these two attribute modifiers.


Advanced attribute qualifiers: passthru and transient attributes

For the basic seqcol attributes (names, lengths, sequences), the general algorithm and basic qualifiers (required, inherent) suffice to describe the representation. But some more nuanced attributes require additional qualifiers to describe their intention and how the server should behave. For example, sorted_name_length_pairs and sorted_sequences are intended to provide alternative tailored identifiers and comparisons, and are not necessarily useful for independent attribute lookup. Similarly, custom extra attributes, like author or alias, may be simple appendages that don't need the complex digesting procedure. In order to flag such attributes in a way that can govern slightly different server expectations, we need a couple of additional advanced attribute qualifiers. For this purpose, we introduce the passthru and transient qualifiers:

Passthru attributes

Most attributes of the canonical (level 2) seqcol representation are digested to create the level 1 representation. But sometimes, we have an attribute for which digesting makes little sense. These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation. Thus, we refer to them as passthru attributes.

Transient attributes

Most attributes of the sequence collection can be retrieved through the /attribute endpoint. However, some attributes may not be retrievable. For example, this could happen for an attribute that we intend to be used primarily as an identifier. In this case, we don't necessarily want to store the original content that went into the digest in the database, because it might be redundant. We really just want the final attribute. These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable.

The interaction between passthru and transient attributes

By definition, passthru attributes are also transient, because it makes little sense to retrieve a level 2 representation from a level 1 digest if no digest/transformation occurred (you would just be retrieving the same value you used in the request). But it is possible to have transient attributes that are not passthru; these would be attributes you do want to digest before adding to the level 1 representation, because they are either large or intended to be used as an identifier, but you don't need them to be represented in original form on the level 2 digest.
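
To make the two qualifiers concrete, here is a minimal Python sketch of the level 2 to level 1 step described above. It is only an illustration under stated assumptions: the function names (`sha512t24u`, `canonical_json`, `ingest`), the in-memory `store`, and the example values are hypothetical. The digest is assumed to be a truncated, base64url-encoded SHA-512, and the key-sorted compact JSON is a simple stand-in for whatever canonicalization an implementation really uses.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Assumed digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Compact, key-sorted JSON; a stand-in for real canonicalization."""
    text = json.dumps(value, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


def ingest(level2: dict, passthru: set, transient: set, store: dict) -> dict:
    """Build a level 1 representation from a level 2 representation."""
    level1 = {}
    for name, value in level2.items():
        if name in passthru:
            # Passthru: no digest, the level 2 value appears as-is at level 1.
            level1[name] = value
            continue
        digest = sha512t24u(canonical_json(value))
        level1[name] = digest
        if name not in transient:
            # Only non-transient content is kept for later retrieval.
            store[(name, digest)] = value
    return level1


# Hypothetical example values; the sorted_sequences entries are placeholders.
level2 = {
    "names": ["chr1", "chr2"],
    "lengths": [100, 200],
    "sorted_sequences": ["seq-digest-a", "seq-digest-b"],
    "author": "example-author",
}
store = {}
level1 = ingest(level2, passthru={"author"}, transient={"sorted_sequences"}, store=store)
```

In this sketch `author` reaches level 1 unchanged, `sorted_sequences` is digested but its content is never stored, and `names` and `lengths` are both digested and retrievable from `store`.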


How to use them

Should they be specified as a local qualifier, using a key under a property (like we did with collated), or as an object-level qualifier, specified with a keyed list of properties up one level (like required, and what we did with inherent)? (Further rationale in decision record: 2023-08-22 - Seqcol schemas MUST specify collated attributes with a local qualifier.)

I think passthru should be an object-level qualifier, since it's really only relevant for the transition stage of the object from level 2 to level 1.

I think transient could go either way. It feels a bit more local to a specific attribute, so maybe best as a local qualifier. But, I also see some value in keeping them both in the same place, since they have some similar properties. So, I'm not sure right now where I would put them.

Edit: I think both passthru and transient are really details of a particular implementation, and not inherent to an attribute itself; as such, I think both of them belong as object-level qualifiers.
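
As a rough illustration of the two placements, the sketch below writes a schema as a Python dict, with collated as a local qualifier under each property and required, inherent, transient, and passthru as object-level lists. The attribute choices and list contents are assumptions made for the example, not a proposed schema, and `check_qualifiers` is a hypothetical helper.

```python
# Illustrative schema only: attribute choices and list contents are assumptions.
schema = {
    "type": "object",
    "properties": {
        # Local qualifier: "collated" sits under the property it describes.
        "names": {"type": "array", "items": {"type": "string"}, "collated": True},
        "lengths": {"type": "array", "items": {"type": "integer"}, "collated": True},
        "sequences": {"type": "array", "items": {"type": "string"}, "collated": True},
        "sorted_sequences": {"type": "array", "items": {"type": "string"}, "collated": False},
        "author": {"type": "string"},
    },
    # Object-level qualifiers: keyed lists of property names, one level up.
    "required": ["names", "lengths", "sequences"],
    "inherent": ["names", "sequences"],
    "transient": ["sorted_sequences"],
    "passthru": ["author"],
}

OBJECT_LEVEL_QUALIFIERS = ("required", "inherent", "transient", "passthru")


def check_qualifiers(schema: dict) -> list:
    """Return (qualifier, name) pairs that refer to undeclared properties."""
    declared = set(schema["properties"])
    return [
        (qualifier, name)
        for qualifier in OBJECT_LEVEL_QUALIFIERS
        for name in schema.get(qualifier, [])
        if name not in declared
    ]


problems = check_qualifiers(schema)
```

A small check like this is one reason to keep object-level qualifiers as plain lists: they can be validated against the declared properties in one pass.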

@sveinugu
Collaborator

I think I agree with most of this. The way you explain this, it makes it even clearer to me that these are two different qualifiers and should not be conflated to one.

I like the name "transient", I think you nailed that one. Especially since the schema describes the level 2 representation, which is the representation that is transient (while the level 1 digest lingers on).

Passthru is also named well, as the level 2 representation is passed through to level 1. The only thing is the spelling of this. Wouldn't "passthrough" or "pass-through" be more "standardized"? I wasn't even aware you could spell it "passthru", but it in any case seems to be a US-specific spelling? Anyway, just raising it, and leaving the discussion to native English speakers.

As to where the qualifiers should be defined: I think a bit differently:

I think the "passthru" qualifier should be local to the attribute, as it really describes something that is essential for the definition of that attribute. An "author" should e.g. always be passthru, and any digested variant is in reality another attribute.

The "transient" qualifier is more dependent on the implementation. If an implementation wants to allow the retrieval of level-2 values for this attribute, then that doesn't change anything essential about the attribute. So I think this should be at the level of "required".

@nsheff
Member Author

nsheff commented Oct 20, 2024

On naming: "thru" is a sort of, relatively common abbreviation of "through", I guess. I used passthru because it's the name of a PHP function, which I remember using when I used to program in PHP 20 years ago... so I was just following that precedent. But the right form in language would be "pass through", so we could use "passthrough" or "pass_through" or "pass-through". It's just longer and more complicated and I liked the simplicity that the PHP team used, it just seemed familiar to me. But I don't love any of them, to be honest.

To me, the fact that passthru only applies to level 1 vs level 2 representation means it's not essential to the definition of the attribute per se, but only relevant in the context of the object. The object is what defines whether an attribute is required or not, and the object defines how it's going to treat attributes at level1 vs level2.

@sveinugu
Collaborator

To me, the fact that passthru only applies to level 1 vs level 2 representation means it's not essential to the definition of the attribute per se, but only relevant in the context of the object. The object is what defines whether an attribute is required or not, and the object defines how it's going to treat attributes at level1 vs level2.

Not sure I understand your point here. To me, the logic of the schema is that it mainly describes the contents of the level 2 representation, e.g. the type of the "sequences" is an array of strings. The default algorithm is then that the level 2 representation is canonicalized and digested. For the passthru attributes, the schema still describes the level 2 representation, but the passthru denotes that the default digest algorithm is not used. To me, the definition of an attribute should include the level 2 representation AND the algorithm to get to level 1 - both are essential to the definition of an attribute.

However, there is no formal way of specifying the digest algorithm in the schema. One could for instance imagine a flag specifying that the "sorted" attributes have a digest algorithm that is a bit more complicated than the default. The transient qualifier should not be used for this, as transient only accidentally overlaps with the sorted attributes right now – one can easily imagine attributes that are transient but not sorted in the future.

Perhaps what we really should have is a 'digest_algorithm' qualifier using a controlled vocabulary, with the current options being e.g. "seqcol_digest", "sorted_seqcol_digest" and "passthru"?

@nsheff
Member Author

nsheff commented Oct 21, 2024

Would rather not make it more complicated, I think we're already overengineering this. To me the schema is describing the level 2 representation. It need not be aware of the level 1 representation; that is outside the schema. The schema is more general, and it is usable for other purposes outside seqcol.

For 'passthru', I'm thinking from the perspective of attribute re-use. I thought qualifiers that would be carried over in case of re-use (in an external schema) should be local, whereas those that wouldn't should be global. If I want to build a schema that imported these attributes, as external definitions, I think:

  • 'collated' makes sense to carry along with the attribute; it's related specifically to the attribute, not the whole schema
  • 'required' does not (it's specific to the particular schema)
  • 'inherent' does not (it's also specific to the schema)
  • 'transient' does not
  • 'passthru' does not.

So, from that perspective, I think passthru is a global qualifier.
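
The re-use argument can be sketched as follows: when a property definition is imported into another schema, whatever is stored under the property (such as collated) comes along, while the object-level lists stay behind. This is a minimal sketch of that idea; `import_attribute` and both schemas are hypothetical.

```python
import copy

# Hypothetical source schema with one local and two object-level qualifiers.
source_schema = {
    "properties": {
        "names": {"type": "array", "items": {"type": "string"}, "collated": True},
        "author": {"type": "string"},
    },
    "required": ["names"],
    "passthru": ["author"],
}


def import_attribute(source: dict, target: dict, name: str) -> None:
    """Copy one property definition, with its local qualifiers, into target."""
    target.setdefault("properties", {})[name] = copy.deepcopy(source["properties"][name])


# The importing schema decides its own object-level qualifiers.
other_schema = {"properties": {}, "required": []}
import_attribute(source_schema, other_schema, "names")
import_attribute(source_schema, other_schema, "author")
```

After the import, `names` still carries `collated`, but nothing in `other_schema` says that `names` is required or that `author` is passthru; those decisions belong to the importing schema.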

I guess maybe the main difference is that you're seeing the schema as more tightly coupled to the seqcol protocol -- I'm seeing the schema a little more broadly.

@sveinugu
Collaborator

Right. Now I follow you. Yes, I think that makes sense.

I agree that we shouldn't overengineer things more. I'm fine with your suggestions.

My main point was that the standard shouldn't allow people to redefine a passthru attribute in a particular implementation as a digested attribute, breaking interoperability at level 1.

So as long as there is a mechanism for this, I am fine with your suggestion.

What is important now is to get this beast out of the door.

@nsheff
Member Author

nsheff commented Oct 21, 2024

My main point was that the standard shouldn't allow people to redefine a passthru attribute in a particular implementation as a digested attribute, breaking interoperability at level 1.

Yes, I agree with this in principle.

But then again, is it so bad if two implementations differ on that? Well:

  • It seems unlikely that they would differ. If an attribute is passthru, probably everyone agrees. But if not...
  • Passthru attributes are likely to be specific to the particular implementation anyway, and if they are somehow shared...
  • what's the matter if one of them calls it passthru and the other does not? All that would happen is the servers would give different values at level 1 for that attribute. is that really a problem? it would be declared in the schema as such, with one annotating the attribute as passthru and the other not.

I don't see this as an important issue, I guess.

But, could also be considered an argument in favor of bringing 'passthru' with the attribute as a local qualifier, as you originally suggested. On the other hand, if somehow there was a situation where one wanted to call it passthru and the other didn't, then maybe there's a reason for that, and then it wouldn't make sense to have it as a local qualifier.

@sveinugu
Collaborator

Not a big issue, I agree. Passthru as a global qualifier is fine with me.

@tcezard
Collaborator

tcezard commented Oct 29, 2024

Thank you both for the detailed description of these two attributes.
I do not have a strong opinion on whether these should be set as global or local qualifiers, although I do think that whichever way we define them, they should be allowed to be implementation-specific, i.e. sorted_sequences should be allowed to be passthru & transient in one implementation and not in another.

The one point I'm still unclear on is the expected behaviour of the list endpoints when given a passthru attribute. One of the main purposes of the passthru attribute is to have precomputed filtering capability in the list endpoint without having to store the underlying array. So it seems essential that the list endpoint accepts passthru attributes to filter the list of seqcol objects.
GET /list/collection/sorted_sequences/osjbsladore5snvhda

However, not all passthru attributes will necessarily be digests, and it might be odd, if not wrong, to allow the list endpoint to be filtered by alias or source.
We could of course define yet another qualifier but it seems complicated enough already.

Thoughts ?

@sveinugu
Collaborator

sveinugu commented Oct 30, 2024

EDIT: see next comment

Regarding the need for making passthru implementation-specific, then that's an argument for this qualifier as a global qualifier, so case closed on that then.

The issue I had is that there is really nothing in the schema that specifies that a passthru attribute is digested at level 1 or not. I don't know what sorted_sequences at level 1 would be if not digested, so not a real issue for that, but might theoretically be an issue for other attributes. I think a good enough solution for now is just to agree on that in the context of an attribute specification list, outside the schema.

Regarding filtering the list endpoint on passthru attributes: can't we just define that that should be supported for attributes that are of type "string", leaving other types undefined for now?

@sveinugu
Collaborator

sveinugu commented Oct 30, 2024

Now you confused me: wasn't the whole point of having two qualifiers that sorted_sequences is transient (by default) and not passthru, and that the schema defines this as an array of strings?

If sorted_sequences is allowed to be passthru on an implementation, then that necessitates a change of the type in the schema to string (its level-2 specification), breaking interoperability. But I don't see the point in allowing that.

I still don't see why you would want to make passthru implementation specific, thus I still in my heart feel it's a local qualifier (but I also think we shouldn't spend much time on this, so global is fine). Could you provide an example of the need to make passthru implementation-specific (given that sorted_sequences is not passthru)?

@nsheff
Member Author

nsheff commented Oct 30, 2024

I think I'm with Tim here. I think both attributes are implementation-specific. Rationale is from my comment above, here:

what's the matter if one of them calls it passthru and the other does not? All that would happen is the servers would give different values at level 1 for that attribute. is that really a problem? it would be declared in the schema as such, with one annotating the attribute as passthru and the other not.

I think that's one of the reasons to declare the variable as passthru in the schema.

To your comment:

If sorted_sequences is allowed to be passthru on an implementation, then that necessitates a change of the type in the schema to string (its level-2 specification), breaking interoperability. But I don't see the point in allowing that.

Why do you say that? I disagree in several ways:

  1. why does a passthru attribute have to be a string? According to the definition, "These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation." So I wasn't thinking about it the way you are, I guess. I was thinking it would be an array, passed through to level 1.
  2. even if this did change the schema (which I think it doesn't), how does that break interoperability exactly? I just don't think it's a major problem if two implementations use and declare in their schema that they have a different treatment of a particular, secondary attribute. The main, most important interoperability (of core attributes) is preserved, and even that one attribute can still probably be used similarly. This is going to happen, whether we like it or not.
  3. The point in allowing that is that one implementation might want to digest an attribute and make it retrievable, and another implementation may not. I think you're fixed on sorted_sequences as the example for this, and maybe for that one people would/should all do the same thing, and that could be part of the spec -- but that doesn't mean passthru in general should be fixed! are there other attributes for which some use cases want it to be passthru, and others don't? I presume this could happen.
  4. Finally, even if this can't or shouldn't happen, we have no way of preventing it. If we make it possible to declare attributes as passthru, someone might do that with an attribute, and someone else might choose not to. I don't see how you can prevent that, short of a fixed, global set of attribute definitions, which includes passthru status, and these are the only allowable attributes. I definitely don't think we should go there.

So my conclusion is: passthru is implementation-specific, there's nothing we can do about it anyway, and it's not really a problem for anything I can think of.

@nsheff
Member Author

nsheff commented Oct 30, 2024

The one point I'm still unclear on is the expected behaviour of the list endpoints when given a passthru attribute. One of the main purposes of the passthru attribute is to have precomputed filtering capability in the list endpoint without having to store the underlying array. So it seems essential that the list endpoint accepts passthru attributes to filter the list of seqcol objects.
However, not all passthru attributes will necessarily be digests, and it might be odd, if not wrong, to allow the list endpoint to be filtered by alias or source.
We could of course define yet another qualifier but it seems complicated enough already.

Hmm, I see what you mean. That's a tough one.

I was thinking, most of these would probably be transient attributes (not passthru attributes). It's not a problem for transient attributes that are not passthru, those would be digested and thus easily filterable.

But for attributes that are passthru, I can imagine that some would be filterable, and others would not, and that's the issue.

Regarding filtering the list endpoint on passthru attributes: can't we just define that that should be supported for attributes that are of type "string", leaving other types undefined for now?

I think this could work, but it's also a bit unsatisfying...


I guess this is causing me to rethink some of my above response to @sveinugu -- Is this the reason you were suggesting that passthru attributes have to be strings?

Could we just say that all passthru attributes have to be strings?

But even if they are strings, it still might not make sense to allow them to be filtered...

@sveinugu
Collaborator

sveinugu commented Oct 30, 2024

  1. why does a passthru attribute have to be a string? According to the definition, "These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation." So I wasn't thinking about it the way you are, I guess. I was thinking it would be an array, passed through to level 1.

No, a passthru attribute could clearly be anything. My point was in relation to the list endpoint, the filters are (as far as I can remember) only specified for level-1 digests, which are strings. Extending to passthru attributes that also happen to be strings is trivial. How to represent filtering on more complex objects in the form of a URL is more tricky, so I suggested to leave that undefined for now.

  2. even if this did change the schema (which I think it doesn't), how does that break interoperability exactly? I just don't think it's a major problem if two implementations use and declare in their schema that they have a different treatment of a particular, secondary attribute. The main, most important interoperability (of core attributes) is preserved, and even that one attribute can still probably be used similarly. This is going to happen, whether we like it or not.

I think there is a misunderstanding here. I am talking about sorted_sequences specifically, which as far as I understand the current suggestion, is a transient but not passthru attribute, and the type declared in the schema (level 2 schema) is an array of strings (the sorted sequences). Transient means that you do not need to retain the level 2 data for the attribute endpoint if you do not want to. If you do want to return the sorted sequences, you exclude sorted_sequences from the list of transient attributes.

I don't believe it makes sense to define sorted_sequences as passthru (as long as we retain the transient qualifier). The way I understand it, that would in practice mean one of two things:

a. Pass the array of strings at level 2 on to level 1
b. Redeclare the schema and level 2 type to string (the digest) and pass it on to level 1

Neither of these makes much sense to me if we retain the transient qualifier. What would be the point of either of these? Of course, you could redefine sorted_sequences as a completely different passthru attribute, losing all traces of interoperability, but that makes even less sense.

  3. The point in allowing that is that one implementation might want to digest an attribute and make it retrievable, and another implementation may not. I think you're fixed on sorted_sequences as the example for this, and maybe for that one people would/should all do the same thing, and that could be part of the spec -- but that doesn't mean passthru in general should be fixed! are there other attributes for which some use cases want it to be passthru, and others don't? I presume this could happen.

That is also my question. I am open to that being the case, but it would be easier to discuss with an example. My main point in the above was that sorted_sequences is NOT an example of this.

  4. Finally, even if this can't or shouldn't happen, we have no way of preventing it. If we make it possible to declare attributes as passthru, someone might do that with an attribute, and someone else might choose not to. I don't see how you can prevent that, short of a fixed, global set of attribute definitions, which includes passthru status, and these are the only allowable attributes. I definitely don't think we should go there.

I am not proposing any such thing in general. But I do think we should define sorted_sequences and sorted_name_length_pairs as transient by default (opening up for implementations to change this) but NOT passthru, similarly to the way we are defining names, lengths, and sequences as required. One could implement a seqcol server without e.g. names and lengths, but it would no longer be fully compliant. One could also implement a sorted_sequences attribute that somehow is passthru, but that would also be non-compliant.

My point is to define this in our canonical list of attribute definitions to support interoperability.

The comparison with the required qualifier cleared up a point of confusion for me regarding "global" and "local" qualifiers. I had been thinking that a global qualifier was something the implementations are allowed to change, while local qualifiers are part of the definition of an attribute and thus not changeable. While the last of these is true, there is nothing stopping us from defining that, for certain attributes, the global qualifiers need to be set in a particular way to be in accordance with the standard. This is what we are doing when we require the core attributes.

I now do agree with you that passthru should be a global qualifier, since it is something that is very specific to the seqcol standard (and schema), while local qualifiers should be reserved for things that are more generally applicable, in case attributes are used in other contexts. It is not a coincidence that we have been approached to standardize the collated qualifier across groups, but I don't foresee that we will be run down by people who want to standardize the passthru qualifier!

(We should btw respond to Alex on the other issue!)

Edit: I misremembered. It is the "inherent" attribute we are standardizing, hence my last comment does not make sense...

@nsheff
Member Author

nsheff commented Nov 1, 2024

Ok, I think we're on the same page.

Would this be an acceptable rule on the filtering of passthru question:

  • transient attributes must be acceptable for filtering.
  • passthru attributes MAY be acceptable for filtering.

That's all we say. Basically, according to the spec, passthru attributes need not implement filtering. but transient attributes do. Does this simple rule solve the problem?

Of course, an implementation is free to implement filtering for passthru attributes that are strings. but we don't need to mandate that.
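
A server-side check for this rule might look like the sketch below. It is hedged accordingly: `is_filterable`, the `passthru_opt_in` parameter, and the example schema are hypothetical. The sketch also assumes that ordinary digested attributes are filterable by their level 1 digest, which the rule above does not state explicitly.

```python
def is_filterable(attribute: str, schema: dict, passthru_opt_in=frozenset()) -> bool:
    """Decide whether the list endpoint accepts a filter on this attribute."""
    if attribute not in schema["properties"]:
        return False
    if attribute in schema.get("passthru", []):
        # MAY: passthru attributes are filterable only if the server opts in.
        return attribute in passthru_opt_in
    # MUST for transient attributes; assumed for other digested attributes,
    # since both have a level 1 digest string to match against.
    return True


# Hypothetical schema fragment for the example.
schema = {
    "properties": {
        "names": {"type": "array"},
        "sorted_sequences": {"type": "array"},
        "author": {"type": "string"},
    },
    "transient": ["sorted_sequences"],
    "passthru": ["author"],
}

default_answers = {name: is_filterable(name, schema) for name in schema["properties"]}
```

With no opt-in, `author` is rejected and the digested attributes are accepted; a server that wants to filter on string-valued passthru attributes would pass them in `passthru_opt_in`.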

@sveinugu
Collaborator

sveinugu commented Nov 1, 2024

Ok, I think we're on the same page.

Would this be an acceptable rule on the filtering of passthru question:

  • transient attributes must be acceptable for filtering.

  • passthru attributes MAY be acceptable for filtering.

That's all we say. Basically, according to the spec, passthru attributes need not implement filtering. but transient attributes do. Does this simple rule solve the problem?

Of course, an implementation is free to implement filtering for passthru attributes that are strings. but we don't need to mandate that.

Yes, I think this makes sense and is a simple solution, at least for now.

@tcezard
Collaborator

tcezard commented Nov 20, 2024

I think I agree with the consensus now, and I realised that I was slightly confused before about when we would be using transient and passthru, so I tried to write down what I understand about the two qualifiers. Feel free to correct and amend.

Transient attributes

A transient attribute is an attribute that only has a level 1 representation stored in the server.

Construction

A transient attribute behaves similarly to other attributes during the construction of the sequence collection, in the sense that it will generate a level 1 digest from the original data. The actual construction algorithm needs to be specified for each attribute in the attribute definition.
The transient attribute cannot be inherent, so it will not contribute to the level 0 digest.

The flexibility of the construction is meant to allow the construction of level 1 digests that will differ from the standard one, but in practice the attribute is expected to be a typed array that contains values, collated or not, that will go through an encoding similar to the one described in the standard to generate level 1 digests.

List endpoint

A transient attribute's level 1 representation can be used to list sequence collections that contain it.

collection endpoint

At level 1: The level 1 representation of a sequence collection should list the transient attributes and their level 1 representations.
At level 2: The level 2 representation of a sequence collection should list the transient attributes and their level 1 representations (TBC).

Comparison endpoint

The comparison representation of a sequence collection should list the transient attributes in the attributes section but will not list them in array_elements (TBC)

Attribute endpoint

transient attributes cannot be used with the attribute endpoint

Passthru attributes

A passthru attribute is an attribute that has the same representation at level 2 and level 1.

Construction

A passthru attribute is not constructed as it is not modified between the level 2 and level 1 representation.
The passthru attribute cannot be inherent, so it will not contribute to the level 0 digest.

List endpoint

A passthru attribute's level 1 representation may be used to list sequence collections that contain it, if the server supports it. (TBC)

collection endpoint

At level 1: The level 1 representation of a sequence collection should list the passthru attributes and their level 2 representations.
At level 2: The level 2 representation of a sequence collection should list the passthru attributes and their level 2 representations (TBC).

Comparison endpoint

The comparison representation of a sequence collection should list the passthru attributes in the attributes section but will not list them in array_elements (TBC)

Attribute endpoint

passthru attributes cannot be used with the attribute endpoint

@nsheff
Member Author

nsheff commented Nov 20, 2024

I've read through and I think everything looks right -- I'm still thinking about the collection endpoint behavior, I hadn't thought about that yet. But 1 thing I noticed is this:

passthru attributes cannot be used with the attribute endpoint

Not necessarily true, right? I think there MAY be passthru attributes that the server does allow in the list...

So what I wrote is:

  • The attribute endpoint MUST be functional for any attribute defined in the schema, except those marked as transient or passthru.
  • The endpoint MAY respond to requests for attributes marked as passthru.
  • The endpoint SHOULD NOT respond to requests for attributes marked as transient.

Edit: cross out what was wrong

@sveinugu
Collaborator

sveinugu commented Nov 20, 2024

Thanks @tcezard for this overview. Great that you reviewed all the endpoints in detail, which was lacking. I think I agree with most of the TBC points, except perhaps wording (e.g. passthru attributes at level 1 should list the level 1 representation, which happens to be exactly the same as level 2. So no need to make a special case out of this, I believe).

Three things I am unsure about:

  1. @nsheff Regarding supporting passthru attributes in the attribute endpoint: Why would you want to do that? Since the endpoint requires the user to provide the level-1 digest, which in this case would be the passthru value, then the endpoint would just return the same passthru value back to you. Am I missing something?
  2. Should the collection endpoint at level 2 list the level 1 contents of transient attributes? Since the idea was that the same attribute (say sorted_sequences) could be transient in one server but not in another, then the outputs of the collection endpoints at level 2 would differ between these servers, with one stating that the value of sorted_sequences is a digest, while the other claims it is an array. Perhaps it would be better to omit transient attributes at level 2 to not have this potential discrepancy?
  3. Is there a reason why transient or passthru attributes might not also be inherent? I don't have any use case in mind, but there is as far as I can see no technical reasons why this might not be allowed, so why add a rule for this at all?

@nsheff
Member Author

nsheff commented Nov 20, 2024

  • Resolved 1: I made a mistake; I meant the /list endpoint, not the /attribute endpoint.
  • Resolved 2: level 2 should exclude transient attributes.
  • On 3: we probably don't need to state this. People probably won't want to do this, but there's no reason why they couldn't if there was a reason.
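
Putting these resolutions together, the level 2 collection response could be assembled roughly as sketched below: digested attributes are expanded from storage, passthru values are repeated unchanged, and transient attributes are left out. This is a minimal sketch and not the PR's implementation; `collection_level2`, the `store` layout, and the placeholder digest strings are all hypothetical.

```python
def collection_level2(level1: dict, store: dict, passthru: set, transient: set) -> dict:
    """Assemble a level 2 response from a level 1 representation."""
    level2 = {}
    for name, value in level1.items():
        if name in passthru:
            # Same value at both levels; nothing to look up.
            level2[name] = value
        elif name in transient:
            # Original content was never stored, so omit it at level 2.
            continue
        else:
            # Regular attribute: expand the digest back to its content.
            level2[name] = store[(name, value)]
    return level2


# Placeholder digests and values, purely for illustration.
level1 = {
    "names": "digest-of-names",
    "lengths": "digest-of-lengths",
    "sorted_sequences": "digest-of-sorted-sequences",
    "author": "example-author",
}
store = {
    ("names", "digest-of-names"): ["chr1", "chr2"],
    ("lengths", "digest-of-lengths"): [100, 200],
}
level2 = collection_level2(
    level1, store, passthru={"author"}, transient={"sorted_sequences"}
)
```

Omitting transient attributes at level 2 means two servers that disagree on whether `sorted_sequences` is transient never return conflicting types for it; one simply has the key and the other does not.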

@nsheff
Member Author

nsheff commented Nov 20, 2024

I've updated the PR with the latest changes, please have a look.

@nsheff
Member Author

nsheff commented Nov 20, 2024

We've come to consensus on these points, and this is now explained in PR #87.
