Product summary object has an incomplete "properties" set #277

Closed
4 tasks
alexdunnjpl opened this issue Mar 9, 2023 · 4 comments · Fixed by #322
Assignees
Labels
B14.0 bug Something isn't working i&t.done s.medium Medium level severity

Comments

@alexdunnjpl
Contributor

alexdunnjpl commented Mar 9, 2023

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

Currently, per @tloubrieu-jpl, the properties enumeration in the response from an endpoint like /products/?q=someQuery only covers the products on the returned page, so it may be incomplete with respect to the full set of hits.

Initially-proposed requirement: provide full enumeration of properties with every page.

Problem: This is O(n^2) in the number of products, because each page of API results would then require the API to iterate over every hit (all pages of products) returned by the db.

Proposed solution: Do not provide properties information in the endpoint used to access products. Instead, implement an endpoint to explicitly request this information, like /products/properties/?someQuery, which reduces "get the full set of properties and get all the products" to O(n) and avoids the need to mess around with caching.
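A rough way to see the difference is to count how many product documents the API would have to scan in each design. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes each page request re-scans all hits to build the full property set.

```python
import math


def docs_scanned_full_enumeration_per_page(n_products: int, page_size: int) -> int:
    """Documents scanned if every page also enumerates properties over all hits."""
    n_pages = math.ceil(n_products / page_size)
    # Each of the n_pages requests walks all n_products hits to collect properties.
    return n_pages * n_products


def docs_scanned_separate_endpoint(n_products: int) -> int:
    """Documents scanned if properties come from one dedicated request."""
    # One pass for the properties request, plus one pass in total across all pages.
    return n_products + n_products


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        per_page = docs_scanned_full_enumeration_per_page(n, page_size=100)
        separate = docs_scanned_separate_endpoint(n)
        print(f"n={n}: per-page enumeration={per_page}, separate endpoint={separate}")
```

With a fixed page size the first count grows quadratically with the number of products, while the second grows linearly.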

🕵️ Expected behavior

I expect (tentatively, pending approval)

  • properties object is not returned in summary object
  • /products/properties endpoint implemented
  • /collections/properties endpoint implemented
  • /bundles/properties endpoint implemented
@tloubrieu-jpl
Member

To move forward on that we will:

  1. investigate whether OpenSearch has a feature that does what we want, which is getting the full schema for a subset of the data without fetching all the documents.
  2. if not, create the properties endpoint without filtering capabilities for now
  3. ask for user input on how they would like to get these properties listed (by which subset)
  4. facet development might also be related to these questions

@tloubrieu-jpl
Member

tloubrieu-jpl commented Mar 21, 2023

For step 1, we could use the _field_names field (new since OpenSearch 1.3) and run an aggregation on it, as proposed on Stack Overflow (https://stackoverflow.com/questions/23378365/list-all-fields-in-an-elasticsearch-index). The aggregation command would be:

{
  "aggs": {
    "Field names": {
      "terms": {
        "field": "_field_names", 
        "size": {to be defined; the default is 10, which is too small for us}
      }
    }
  }
}

I will test that on a local deployment.
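For that local test, the request body and the parsing of the result could be sketched in Python as below. This is a sketch under assumptions, not a verified recipe: the helper names are hypothetical, the top-level "size": 0 is an addition to suppress the hits, and whether the cluster accepts a terms aggregation on _field_names is exactly what the test has to establish. The parsing relies only on the standard shape of a terms aggregation response (a list of buckets with a "key" each).

```python
import json
from typing import Any


def build_field_names_agg(size: int) -> dict[str, Any]:
    """Build the aggregation body from the comment above with a concrete size."""
    return {
        "size": 0,  # assumption: we only want the aggregation, not the hits
        "aggs": {
            "Field names": {
                "terms": {"field": "_field_names", "size": size}
            }
        },
    }


def extract_field_names(response: dict[str, Any]) -> list[str]:
    """Collect the bucket keys of the "Field names" terms aggregation."""
    buckets = (
        response.get("aggregations", {})
        .get("Field names", {})
        .get("buckets", [])
    )
    return sorted(bucket["key"] for bucket in buckets)


if __name__ == "__main__":
    # The size value here is arbitrary; the right value is still to be defined.
    print(json.dumps(build_field_names_agg(size=1000), indent=2))
```

The printed JSON can be sent as the body of a search request against the index under test.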

@tloubrieu-jpl
Member

Oops, that is an Elasticsearch feature that might not exist in OpenSearch...

@tloubrieu-jpl
Member

tloubrieu-jpl commented Mar 22, 2023

Apparently, since version 5, Elasticsearch no longer allows _field_names to be used in an aggregation. It did not work in OpenSearch either (tested locally with OpenSearch v2.6).

The proposed alternative is to use the _mapping endpoint, which cannot be restricted to a subset of the data based on search criteria.

To move forward, I suggest proceeding to step 2: create a new /properties endpoint that returns the list of available properties for the full dataset.

I am assigning the ticket to @alexdunnjpl for that job.
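As an illustration of the _mapping route, the sketch below flattens a mapping response into a sorted list of field names. It assumes the usual response shape of GET /<index>/_mapping, that is {index: {"mappings": {"properties": {...}}}}, and joins nested object fields with a dot. The function names are hypothetical, and any translation between registry field names and the API's dot notation is not covered here.

```python
from typing import Any


def flatten_properties(properties: dict[str, Any], prefix: str = "") -> list[str]:
    """Recursively list leaf field names, joining nested object fields with dots."""
    names: list[str] = []
    for name, definition in properties.items():
        path = f"{prefix}.{name}" if prefix else name
        nested = definition.get("properties")
        if nested:
            # Object field: descend into its sub-properties.
            names.extend(flatten_properties(nested, path))
        else:
            names.append(path)
    return names


def field_names_from_mapping(mapping_response: dict[str, Any]) -> list[str]:
    """Merge the field names of every index present in a _mapping response."""
    names: set[str] = set()
    for index_body in mapping_response.values():
        properties = index_body.get("mappings", {}).get("properties", {})
        names.update(flatten_properties(properties))
    return sorted(names)
```

Because the mapping describes the whole index, the result is the same whatever the search criteria, which matches the limitation noted above.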

Some specification for the new end-point:

  • The format of the properties should follow the dot notation, see https://nasa-pds.github.io/pds-api/guides/search/endpoints.html#fields-dot-notation
  • The offered formats (Accept header) should be consistent with those offered by the /products endpoints
  • Keep it simple
  • The endpoint does not support any path or query parameters; it always returns the same result unless more data is loaded into the registry. If a request takes too much time or too many resources (more than 1 s), the application should set the endpoint's cache parameters so that the request is not processed every time. A 1-hour cache should be OK. To be confirmed.
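The caching idea in the last point could look roughly like the time-based cache below. This is a minimal, framework-agnostic sketch: the class name and the loader callable are hypothetical, the 3600-second default mirrors the 1-hour figure that is still to be confirmed, and a real service might rely on its web framework's caching or on HTTP cache headers instead.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Keep one computed value and reload it once it is older than ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], T],
        ttl_seconds: float = 3600.0,  # 1 hour, as suggested above (to be confirmed)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Only hit the backend when there is no value or the value is stale.
            self._value = self._loader()
            self._loaded_at = now
        return self._value  # type: ignore[return-value]
```

The loader would be whatever function builds the property list, so repeated requests within the hour reuse the stored result.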

Status: 🏁 Done
4 participants