Skip to content
This repository has been archived by the owner on Jul 18, 2022. It is now read-only.

resultSet wrapper object for all responses #68

Closed
Tom-Shorter opened this issue May 7, 2021 · 20 comments
Closed

resultSet wrapper object for all responses #68

Tom-Shorter opened this issue May 7, 2021 · 20 comments
Labels
enhancement New feature or request
Milestone

Comments

@Tom-Shorter
Copy link

There are a few problems with the current implementation of results, for g_variants there is a problem with pagination #26, for biosamples there is a problem with aggregation/dataset identification #67. There are a few other problems but instead of going over them in detail I would like to propose a new resultSet wrapper class and all response.results are always a list of resultSet objects.

@mbaudis has also mentioned this concept in #26, #67, #63 (and possibly elsewhere that I am missing) and his description of the concept from #26 (quoted below) is almost exactly what I was envisioning.

As in the linked discussions: IMO it would be good to have always one or more resultSet objects in the response. So basically one would introduce another layer instead of now

- results => array of responseResults (e.g. VariantInSampleResponseResults)

it becomes

- results => array of resultSet objects (with info about the resultSet, e.g. dataset...) => array of responseResults (e.g. VariantInSampleResponseResults)

The only change I would make is that results would not be an array of resultSet objects but instead an object containing resultSet objects where each resultSet object is identified by a unique key. This change is purely for making life easier for UI implementers as each resultSet can now be identified without having to look at a property of the resultSet. The unique key will likely be the datasetID or other such ID. I dont think this is possible to describe within openAPI however so I am more than happy to have an array of resultSets for simplicity as it makes little difference to the final outcome

A resultSet object can be formed from any responseObject, such as a genomicVariantResponse or a cohortResponse as well as any additional response objects which will be added in the future.

The below is an idea of what properties a resultSet object might have:

  • Identifier(required) - only if an array of resultSets is chosen over an object of resultSets, each given a unique key
  • source (required) - An array of unique ID's for each source where data was gathered to create this resultSet, such as a datasetID
  • responseType (required) - This would detail if the resultSet is a dataset, a cohort, etc. etc. This could likely be implied from the unique key given to the resultSet object/ identifier property or from the request that was made to the beacon but I do not want to assume that is always the case
  • exists (required) - boolean
  • count (optional) - no. of results
  • selectionCriteria (optional) - I am unsure about many of the details around cohorts but a property within the resultSet which details the applied selection parameters would likely be useful
  • results (optional - depends on level of access) - this will be an array of results, much like what is currently returned by response.results

The inclusion of this wrapper will also be of benefit when incorporating AAI into beacon v2, @jrambla has proposed a 4 stage access model as described below:

  1. overall exists/count
  2. dataset exists/count
  3. minimal data response
  4. expanded data response

The current response model does not allow for the above as there is no field for dataset exists/count other than within the datasetAlleleResponse object as far as I am aware. Adding in the resultSet wrapper will provide a logical place for these fields to be inserted and stage 2 - "dataset exists/count" will become "resultSet exists/count". If the user is able to access stages 3 and 4 of data then the resultSet object just needs to be expanded with the results property. Each resultSet object can also provide different information for stages 3 and 4 when compared to other resultSet objects of the same type as these can be tailored to each dataset etc.

This wrapper also solves a possible implementation problem with the aggregation of data from different datasets which should not be aggregated, such as individuals and biosamples. A beacon serving 4 datasets submitted from the same research group could come across a problem where the groups standard naming convention for individuals is "ind{X}" where x is an auto-incrementing integer, therefore there would be four version of "ind1" within the beacon, if the beacon applies aggregation to these individuals then a single result could be returned that is present within the 4 datasets but upon further inspection you realise that "ind1" is actually a different individual within each of the datasets and as such shouldn't have been aggregated. This issue should not arise with variants as there exist standardised naming conventions for variants, such as rsid's and HGVS.

This situation is also possible without aggregation due to the lack of a datasetID within /biosample responses, if a request was sent to /individuals/ind1/biosamples then all responses would look like they come from the same individual whereas they actually come from 4 individuals.

A lot of thought will need to be put into how this would actually be implemented and the properties to be included in the new object.

@jrambla
Copy link
Collaborator

jrambla commented May 12, 2021

@Tom-Shorter @mbaudis @sdelatorrep as mentioned in the mail thread, I would like to see an example of the above suggestion.
The legacy DatasetAlleleResponse model is no longer applicable since we have cohorts as another collection type, therefore my assumption that applying the DatasetAlleleResponse could solve the need was incorrect/incomplete.
Having said so, the issue #62 should be combined with the suggested resultset.

@mbaudis
Copy link
Member

mbaudis commented May 12, 2021

@Tom-Shorter

The only change I would make is that results would not be an array of resultSet objects but instead an object containing resultSet objects where each resultSet object is identified by a unique key.

I understand this - we're doing something similar for cases where the objects are in a defined scope (e.g. retrieving dataset statistics, where the identifier then is the dataset_id). However, this may be problematic if having a response with a list/set ... where the scope of the identifier isnt't clear (dataset, cohort ...).

@colinveal
Copy link

@jrambla @mbaudis @Tom-Shorter @sdelatorrep So for us something like below would be ok to give us the splits of information per dataset whilst retaining a single section for results, however I'm still not sure whether dividing results into resultSets would better, what do you think @Tom-Shorter ?

"response": {
    "exists": true,
    "error": {
      "errorCode": "same as HTTP status code",
      "errorMessage": "string"
    },
    "numTotalResults": 100,
    "results": [
      {
"response": {
    "exists": true,
    "error": {
      "errorCode": "same as HTTP status code",
      "errorMessage": "string"
    },
    "numTotalResults": 0,
    "datasets":[
      {
           "datasetA": {
                 "exists":true,
                 "count":50,
                 "link":"",
                 "contact":"",
                 "etc":""
               }
       },{
           "datasetB": {
                 "exists":false,
                 "count":0,
                 "link":"",
                 "contact":"",
                 "etc":""
                }
        }
],
    "results": [
      {}]

@mbaudis
Copy link
Member

mbaudis commented May 13, 2021

Yes, this is one of the options which IMO work; having the per-resultSet aggregation data separate.

Notes:

  • It shouldn't be datasets then, but resultSets
  • I wouldn't use the current structure (I know this is just a doodle, but FYI) but keep it more like:
response:
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      biosampleCount: 116'232
      ...
    - id: datasetB
  ...
  results:
  ...

Still, the tagging of the individual response elements for their parent resultSet remains ...

@Tom-Shorter
Copy link
Author

Still, the tagging of the individual response elements for their parent resultSet remains ...

I was going to say that this depends entirely on how the unique ID for a result is created but if the unique ID is created from a concatenation of a dataset ID and the result ID then if querying a cohort endpoint the unique ID would then not suffice as it would not contain the cohort ID so tagging would still be needed.

I think many users wont use beacon as a way of obtaining data, at least not through the results array, but the data handover may be more popular due to the data being in a standard format and not a beacon result format which will require more work to parse and may or may not contain the information you actually want. Instead users might use beacon as a way to locate where they can find suitable data, using the the results as an indicator of what data is actually available and then use the data sources own interface to query the data in a more customised way. No matter how well the beacon spec is designed a beacon query will not out perform a custom interface for the querying of data and the beacon results response will unlikely be as detailed or as easy to parse as one of the many standards already out there

The below is what I envisioned when I first started talking about resultSets but I do agree that @colinveal's approach will also work:

response:
  exists: true
  numTotalResults: 100
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    - id: datasetB
  ...

This approach allows you to pull out results from any dataset/cohort you want without having to loop through each result and find out which dataset it comes from by looking at a tag.

As mentioned in my original comment this could be slightly modified to have an object of resultSets identified by their ID, this allows for more precise data look ups as instead of having to loop through a list of resultSets to find the one with a matching ID you can instead access the correct resultSet directly.

response:
  exists: true
  numTotalResults: 100
  resultSets:
    "datasetA":
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    "datasetB":
  ...

Another consideration is the aggregation of common data, such as is seen in g_variants. It does make sense to only display this information once but if I query two datasets, one 0 based and the other 1 based then the position of the same variant will differ and the common data will not be common.

In the /filtering_terms endpoint we use the ontologyResource object as a datastore to describe ontologies, maybe the same could be true for other endpoints and we have, for example, a variantResource object which holds information about all variants seen in the results and then the results are linked to a variantResource entry though the variant ID. A caveat will have to be placed on this aggregation, only where beacons know with 100% certainty that an entity in two datasets is the same should they both point to the same resource. 100% certainty in this instance would be where it has been explicitly stated by those who did the analysis that they are the same or where each individual/biosample/... is assigned a unique ID which is used to identify them throughout all studies carried out, datasets created from those studies and additional cohorts the individual is part of. Aggregation without this level of certainty will lead to errors.

@mbaudis
Copy link
Member

mbaudis commented May 14, 2021

Ahem:

one 0 based and the other 1 based ... the common data will not be common

This case should not be supported.

@Tom-Shorter
Copy link
Author

I agree, it shouldn't be supported in an ideal world but the beacon owner might not have control over the data in the beacon so they can't standardise it. Does the beacon owner then return all results using the same 0 or 1 base (I forget which is the beacon standard) knowing that the positions provided for variants would be misleading if the user was to use the data sources own query interface or do they return results that remain true to the data, eventhough this means possibly making more than one variantResource for the same variant. Either situation is not ideal and will both be of annoyance to a beacon user so it might be a case of choosing the lesser of two evils in how to handle this situation.

@colinveal
Copy link

So in summary for me any of these three would be fine, with a preference for the latter two:

response:
  ...
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      biosampleCount: 116'232
      ...
    - id: datasetB
  ...
  results:
  ...
response:
  exists: true
  numTotalResults: 100
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    - id: datasetB
  ...
response:
  exists: true
  numTotalResults: 100
  resultSets:
    "datasetA":
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    "datasetB":
  ...

@mbaudis
Copy link
Member

mbaudis commented May 14, 2021

@colinveal @Tom-Shorter @jrambla Can live with any of those, too, with 1 ~ 2 > 3, after some thinking :-)

@Tom-Shorter
Copy link
Author

^ ditto

@jrambla
Copy link
Collaborator

jrambla commented May 14, 2021 via email

@sdelatorrep
Copy link
Contributor

Hi @Tom-Shorter! Could you create a PR introducing this change? Many thanks!

@Tom-Shorter
Copy link
Author

hi @sdelatorrep, I made a pull request last week : #72 I probably haven't done everything correctly if you haven't been notified of it though so let me know.

@sdelatorrep
Copy link
Contributor

Ah, sorry, I didn't see it. I'll review it asap. Thanks!

@sdelatorrep
Copy link
Contributor

But @Tom-Shorter , this PR is only about filters, isn't it? I don't see anything about resultSet in it.

@Tom-Shorter
Copy link
Author

Oh sorry @sdelatorrep , I'm an idiot.I saw pull request and immediately thought of the filters scout group pull request I made.

In the beacon call it was agreed that it would be best for someone from your group to add the resultsets in to the spec, I'm familiar with the g_variants endpoint responses but I haven't had to implement the other endpoints yet so I thought it best that someone who was familiar with the endpoints should update the spec.

We don't mind which properties are added to the resultSet wrapper object as long as it has the exists/count so feel free to set the object up however you wish to make sure it fits in with the rest of the beacon specs.

@sdelatorrep
Copy link
Contributor

sdelatorrep commented May 19, 2021

AFAIK, the winner was option 2 (see all here):

response:
  exists: true
  numTotalResults: 100
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    - id: datasetB
  ...

I'll add this to the spec.

sdelatorrep added a commit that referenced this issue May 21, 2021
@sdelatorrep
Copy link
Contributor

Hi @Tom-Shorter! I've just created the PR implementing this. Would you mind taking a look at it? I couldn't add you as a reviewer, but I've sent you an invitation to the repo, if you want to join :) Thanks!

@mbaudis
Copy link
Member

mbaudis commented May 21, 2021

@Tom-Shorter @sdelatorrep I've just approved, but with a comment, especially regarding additional parameters for ResultSet (counts!):

in principle extend it to a subset of the BeaconDataset parameters, w/o making it explicitly a dataset

@sdelatorrep sdelatorrep added this to the version 2.0 milestone May 21, 2021
@mbaudis mbaudis added the enhancement New feature or request label May 26, 2021
sdelatorrep added a commit that referenced this issue Jun 7, 2021
@jrambla
Copy link
Collaborator

jrambla commented Jul 1, 2021

This is solved.

@jrambla jrambla closed this as completed Jul 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants