Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add optional values field to response filtering terms #160

Merged
merged 3 commits into from
Dec 12, 2024

Conversation

gsfk
Copy link
Collaborator

@gsfk gsfk commented Aug 20, 2024

PR for issue #79, where I suggested adding a field to filtering terms, so here is the change. Targeting the main branch since dev is trailing main.

The proposal is to bring FilteringTerm in the request into better alignment with FilteringTerm in the response from various /filtering_terms endpoints. I ignore a third, similar definition of filters available at /framework/src/configuration/filteringTermSchema, this file is orphaned and is falling behind the rest of the spec.

This PR proposes an optional addition to the response; this is a non-breaking change (see this example) which among other selling points reduces the verbosity of /filtering_terms responses. See these slides for further discussion.

Motivation / use cases

An alphanumeric query has required fields id, operator, value; an example query (from the docs) is

{
    "id": "geographicOrigin",
    "operator": "=",
    "value": "England"
}

But the response filtering term discards both the operator and value fields. The expected workaround among some beacon users appears to be to package the id, operator and value into a single string so it can be returned as an id, eg:

id: geographicOrigin=England

or something similar. But this is unsupported by the specification and presents other limitations, such as verbose unstructured responses from /filtering_terms endpoints, see slides linked above for discussion. The proposed solution is to not discard the "value" field.

The proposed change was motivated by the addition of beacon v2 to the Bento platform, where we encountered a few issues with alphanumeric search:

(1) String fields not covered by a specification: In order to handle data from the Quebec covid bioank (Biobanque québécoise de la COVID-19), we added an extra property field for subject domicile, with values matching the clinical data collected:

[
    "At home",
    "In residence for the elderly (RPA)",
    "Nursing home (CHSLD)",
    "Homeless",

    ... etc...
]

Some fields (such as "homeless") map well to existing ontology terms, but others involve Quebec-specific terminology such as "CHSLD"; domicile data is stored as a string from an enum of acceptable values. We anticipate similar issues handling data from the upcoming Pan-Canadian Genome Library.

(2) Binned Search: Bento beacon is focused on public anonymous search, so some fields are searchable by predefined bins only, to inhibit re-identification attacks. Here are some example bins for BMI, again from the covid biobank:

[
    "< 20",
    "[20, 25)",
    "[25, 27)",
    "[27, 30)",
    "[30, 35)",
    "[35, 40)",
    "≥ 40"
]

but these bins are not discoverable without workarounds. Some fields may have a large number of bins, which makes the expected workaround above needlessly verbose.

Discussion

The goal here is to improve alignment between request and response filtering terms, since only the latter is discoverable.

The problem is not simply that value and operator go missing when we make a /filtering_term response, but that there is also an ambiguity in how different queries treat the id field. For OntologyFilter, id represents a value to be found in the data, but for AlphanumericFilter, id is a field to be queried. The design of response filtering terms seems to have considered only the first kind of id. Consequently, only OntologyFilter is truly discoverable, other kinds of query require workarounds outside of the specification.

What if I don't like this PR?

The discoverability problem for alphanumeric filters still exists! There are alternative solutions:

  • codify the expected workaround (squashing everything into a single string)
  • adopt the "target" proposal mentioned in issue 115: this goes some way to preserving the connection between the values being searched for and the fields where you can expect to find them.

Both of these alternatives leave us with a flattened, verbose /filtering_terms response which this PR eliminates, but these are better than changing nothing. Or, as I've heard suggested, do a complete redesign of filters. But that looks like Beacon v3, which is too far in the future.

@tb143
Copy link
Collaborator

tb143 commented Oct 28, 2024

This solution and PR has been presented to the Beacon Filters Scout and we are asking for votes on whether to accept. During the Filters Scout meeting today, there was a vote of 5 in favour of accepting and 0 against.

We want to give those who couldn't attend the meeting, or need more time to consider things, an opportunity to vote. Therefore, leave +1 or -1 before the next Filters Scout meeting on 25th November 2024. A final decision will be made then.

@redmitry
Copy link
Collaborator

I think, the proposed schema should be more strict.
I would add "minItems": 1 into "values" - we must provide at least one value.
If we want to use enumeration, we should consider "enumeration" type of the filter and put a condition it:

"dependentSchemas": {
    "values": {
        "properties": {
            "type": { "const": "enumeration" } 
        }
    }
}

So basically allow "values" only when "type" is "enumeration"

Cheers,

Dmitry

@costero-e
Copy link
Collaborator

costero-e commented Oct 29, 2024

Hi @gsfk and thanks for this PR. I'm 100% in favor of accepting this proposal, but I would first direct it towards dev and then, when we decide which milestone this fits in, merge it to main with the corresponding announcement. I know this is not a breaking change but this is not an urgent fix but a feature, so we should follow the branch management rules. Why do you want to make the request to main directly?
Regarding @redmitry's suggestion, I agree that the "minItems": 1 is useful to not let the beacons return empty arrays of values but for the "type": { "const": "enumeration" } this would restrict the queries just to the values that appear for each filtering term, which means that if you have values, then what you type in value for the filtering query is restricted to the values themselves, which wouldn't let the user make a "free search query". This would clash with "like queries" and I wouldn't do it as enumeration.

+1

@redmitry
Copy link
Collaborator

for the "type": { "const": "enumeration" } this would restrict the queries just to the values that appear for each filtering term, which means that if you have values, then what you type in value for the filtering query is restricted to the values themselves, which wouldn't let the user make a "free search query". This would clash with "like queries" and I wouldn't do it as enumeration.

Didn't get it... Do we need explicit string values for the "ontology" or "alphanumeric" filters?
I think "enumeration" is the right word for the case. If we want regex textual filters (queries) - see no much sense in a list of "possible" values. Anyway we may always postpone this for other discussion.

+1

D.

@gsfk gsfk changed the base branch from main to develop October 29, 2024 15:01
@gsfk
Copy link
Collaborator Author

gsfk commented Oct 29, 2024

Regarding the target branch, no trouble to merge to develop first. When I wrote this pr, develop was trailing main (rather than the other way around)

@mbaudis
Copy link
Member

mbaudis commented Oct 29, 2024

+1

That being said: In the long run the same structure could remodel the current structure of filtering terms since it is basically turning the current structure on its head:

  • currently: a list of individual filtering terms with a (new) path indicating where they apply
  • this mode: a list of "1 term per target model path" with embedded values supported by the endpoint

When turning this around for the ontology terms one could do the same... e.g. instead of ~800 objects for histological_diagnosis.id we would have one object w/ a list of 800 CURIEs. Fewer bytes but more prescriptive...

@antbro
Copy link

antbro commented Nov 25, 2024

Agree with the PR to better align FilteringTerm in the request with FilteringTerm in the response.
But I'd also like to see a fix for the two different ways we're using "id"
And query values should not be limited to enum values in any Beacon, as there may be more values elsewhere in a network of Beacons

Copy link
Member

@mbaudis mbaudis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this in current form and leaving additional changes (enum discussion etc.) for another time.

@mbaudis
Copy link
Member

mbaudis commented Nov 26, 2024

@costero-e Can you take up the merge? Thanks!

Copy link
Collaborator

@redmitry redmitry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve as a quick fix.
This is a new type of filter - "enumeration" see:
#160 (comment)

Copy link

@zykonda zykonda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@costero-e costero-e merged commit 1d804b6 into ga4gh-beacon:develop Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants