Skip to content
This repository has been archived by the owner on Dec 22, 2022. It is now read-only.

As a developer, I want to utilize ElasticSearch performance robustness for API response time requirements. #13

Closed
2 tasks done
tloubrieu-jpl opened this issue Apr 1, 2021 · 17 comments
Assignees
Labels

Comments

@tloubrieu-jpl
Copy link
Member

tloubrieu-jpl commented Apr 1, 2021

Motivation

I want the best request performances elasticSearch can give me so that I can request millions of records (e.g. products of a collection) as quickly as possible.

Additional Details

This includes:

  • only pulling the necessary fields from elasticSearch
  • buffering requests, e.g. for the properties of collection's products, instead of doing one request per product in the registry index, all the product's properties of one page can be request in a single request.

Acceptance Criteria

Given the duration of the equivalent optimized requests to elasticSearch
When I perform the same request through the API
Then I expect the duration (excluding network time between client and api) is not longer than '20%' the elasticSearch requests.

Given an API request
When I perform the underlying elasticSearch request (shown in logs)
Then I expect no extra fields to be pulled from elasticSearch

Given an API request to the end-point /collections/{lidvid}/products
When I perform the request
Then I expect only 2 calls to elasticSearch to be done (per page)

@tloubrieu-jpl tloubrieu-jpl added requirement the current issue is a requirement needs:triage labels Apr 1, 2021
@jordanpadams jordanpadams changed the title As an API user, I want the best request performances elasticSearch can give me As an API user, I want an average response time for queries of 1 second. Apr 18, 2021
@jordanpadams jordanpadams changed the title As an API user, I want an average response time for queries of 1 second. As an API user, I want an average query response time 1 second and max of 10 seconds. Apr 18, 2021
@jordanpadams jordanpadams changed the title As an API user, I want an average query response time 1 second and max of 10 seconds. As a developer, I want to utilize ElasticSearch performance robustness for API response time requirements. Apr 18, 2021
@al-niessner
Copy link
Contributor

Does this mean you want the registry to be roughly as fast as giving the same query directly to elastic search (ES)? If not, then please explain.

The 20% rule is problematic as well. It may be able to hold true for large searches, but not for short quick ones. If the Java overhead is a fixed cost of 20 ms then a 1 ms ES will be 21 ms through the registry. I doubt you will notice it, but it violates your acceptance criteria. I appreciate the 20% but maybe 1.2 * ES time + 0.5 seconds would allow for fixed costs.

@tloubrieu-jpl
Copy link
Member Author

@al-niessner we want to improve the performance by doing the 2 items given in additionnal details, and yes the api should be roughly as fast as the elasticsearch request.

Don't worry about the 20%, this was a random attempt to have a quantifiable acceptance criteria. We can adjust that, your proposoal sounds file to me. Note that we would like all requests to be less that 1 or 2 second, so the .5s fixed cost is maybe to much.

Actuaully an easier acceptance criteria would be to measure a request with the existing development and check that it is better and how much better. I have one example in a jupyter notebook which made multiple pages for /collections/{lidvid}/products which now takes 13 minutes. That is a good reference that we can check after the development is done.

@tloubrieu-jpl
Copy link
Member Author

@al-niessner will test what the best stratagy to request elasticsearch for multple id (of products).

no properties will be returned unless the user explicitly request them in field= parameter.

to get all attributes we can use field=*.

@tloubrieu-jpl
Copy link
Member Author

@al-niessner the fetchSource method that you were mentionning (https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/6.0.1/org/elasticsearch/search/builder/SearchSourceBuilder.html#fetchSource-java.lang.String:A-java.lang.String:A-) is also the one I had in mind, the SearchSourceBuilder object is already used in multiple places in the code, for example

so you can easily add the call to the fetchSrouce method there.

The one which need to be updated more heavily is this line

GetRequest getProductRequest = new GetRequest(this.collectionProductRelationships

If you confirm that doing a single request for multiple product ids is more efficient, then you will anyway have to re-write this part, I was thinking by caching the result of the request on the multiple product ids when needed. We can discuss that later.

Thanks

@al-niessner
Copy link
Contributor

al-niessner commented May 20, 2021

@jordanpadams @tloubrieu-jpl

Using just elasticsearch and its URL interface, tested individual fetch versus group fetch. The individual test was performed like:

for lidvid in lidvids: http://localhost:9200/registry/search?q=lidvid

For the group it was done like this:

http://localhost:9200/registry/search?q='+'.join(lidvids)

Using the harvest test data set which has a total of 10 lidvids. Iterated over the two 10000 times just to rinse out odd glitches.

Statistics:

  • Individual (one at a time but times are for all 10)

    • mean: 30 ms
    • median: 25 ms
    • sum: 300 seconds
  • Grouped

    • mean: 5.6 ms
    • median: 5.0 ms
    • sum: 55.9 seconds

For large numbers, grouped becomes significant. For small numbers (pagination about 10) it make little difference as they are both sub second.

@tloubrieu-jpl
Copy link
Member Author

Thanks @al-niessner , for the reference test that I use to check on the performance improvement (see https://github.com/NASA-PDS/pds-api-notebook/blob/main/notebooks/pds-api-client-ovirs-part1-explore-a-collection.ipynb) I am using pages of 500 products.

@al-niessner
Copy link
Contributor

@jordanpadams @tloubrieu-jpl

Do we have a path to use that notebook with a large database to test changes for improvement to the code on this issue? I would like to run it before making changes for a baseline then again as code develops. If that number (14.8 minutes from what I saw) does not improve or worsens it would be best to detect that earlier rather than later.

@jordanpadams
Copy link
Member

@al-niessner here is the test registry / API we have populated online: http://pds-gamma.jpl.nasa.gov/api/

@jordanpadams
Copy link
Member

and agreed. having a test script we can run and check against some time threshold would be excellent. then we can tune down that threshold as we continue to refine the response time

@tloubrieu-jpl
Copy link
Member Author

@al-niessner the elasticsearch used for the pds-gamma deployment is only accessible from pds-gamma. So we will need to deploy the development version on pds-gamma which is ok with me. I will share the instruction to update the deployment with you.

@al-niessner
Copy link
Contributor

@tloubrieu-jpl @jordanpadams

Have a working algorithm that replaces the pagination with group collects of data. However it is not behaving as expected. Two things can be wrong: One, my expectation is erroneous. Two, the class gov.nasa.jpl.pds.model.Products needs expansion/rewrite.

After fighting with elasticsearch and the basic algorithm for collecting multiple entries given a list of lidvids, found the problem area. Lets start with the setup:

curl -silent --header 'Accept: application/json' "http://localhost:8080/bundles/urn:nasa:pds:izenberg_pdart14_meap::1.0/products?limit=5&start=0&fields=ops:Label_File_Info/ops:md5_checksum&only-summary=false"

The important part of this is that it wants all products of a bundle of which 65 are found and return just the first 5. Most importantly, return only the field "ops:Label_File_Info/ops:md5_checksum". Side note, I have no idea what only-summary is supposed to mean.

Unexpectedly, the return from the curl is:

{"summary":
  {"start":0,"limit":5,"sort":[],"properties":["ops:Label_File_Info/ops:md5_checksum"]},
 "data":[
     {"investigations":[],"observing_system_components":[],"targets":[],"metadata":{},"properties":null},
     {"investigations":[],"observing_system_components":[],"targets":[],"metadata":{},"properties":null},
     {"investigations":[],"observing_system_components":[],"targets":[],"metadata":{},"properties":null},
     {"investigations":[],"observing_system_components":[],"targets":[],"metadata":{},"properties":null},
     {"investigations":[],"observing_system_components":[],"targets":[],"metadata":{},"properties":null}
   ]
}

Note that there are 5 responses as expected but not the field. Two days to get here, added some log information for debugging and this is what the code is doing:

2021-06-03 22:22:43.394  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************* working on a hit
2021-06-03 22:22:43.394  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************ map: {ops:Label_File_Info/ops:md5_checksum=0085b62fa3a17b7b2aecb8e3997e9f79}
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************* working on a hit
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************ map: {ops:Label_File_Info/ops:md5_checksum=35a8e678eca82460279fbc06dca303ba}
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************* working on a hit
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************ map: {ops:Label_File_Info/ops:md5_checksum=f3973a40ff344e636e895f5cc0984c6d}
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************* working on a hit
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************ map: {ops:Label_File_Info/ops:md5_checksum=38323ea80c17fe1f3365945801b69e1e}
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************* working on a hit
2021-06-03 22:22:43.432  INFO 55 --- [0.3-8080-exec-1] g.n.p.a.e.c.MyProductsApiBareController  : ************ map: {ops:Label_File_Info/ops:md5_checksum=cbd8f69c1ddbc3d21627836bbbee227c}

This means the output is found as expected and only the requested field is returned. However it is not being added to the product in this code block:

results.addDataItem (ElasticSearchUtil.ESentityProductToAPIProduct(
    				objectMapper.convertValue(response, EntityProduct.class)));

Which is creating a gov.nasa.jpl.pds.model.Product. However it is not including the field that was requested and is in turn returning an empty JSON object.

My expectation, which may be wrong, is that the user wanted to see just the field requested and the API should have reduced the output to just that field(s) given. However, we seem to rarely do what the user says but rather what we thing they really meant to say which means my expectation is erroneous.

The gov.nasa.jpl.pds.model.Product requires specific format so that users that request something it does not understand it ignores it. What it is not a List<HashMap<String,Object>> I do not know. However, the further Product behaves from List<HashMap<>> makes the logic of what is said and what is meant far more complicated.

So, does my expectation wrong or does gov.nasa.jpl.pds.model.Product or do both need correction?

@jordanpadams
Copy link
Member

My expectation, which may be wrong, is that the user wanted to see just the field requested and the API should have reduced the output to just that field(s) given.

this is correct. fields means return only those fields and we should pass that to elasticsearch to only return those fields. so somewhere in this code we need to define that in what we pass to the elasticsearch object

@jordanpadams
Copy link
Member

oops @al-niessner ☝️

@al-niessner
Copy link
Contributor

Change of expectation: some slashes become dots so run it through the filter. Then change them back again before building product.

If product still does not build after expectation change, then chase it.

@tloubrieu-jpl
Copy link
Member Author

ok, that sounds good @al-niessner

I would like to let you know that I also tried the fetchContext thing there:

That works.

And I also updated this utility function to filter fields

public static Map<String, PropertyValues> getFilteredProperties(

This was to get rid of the blob.

That might not be as useful though if they are already filtered at the elasticSearch query level...

And created this ProductBusinessObject class to hide some of the interaction with the ElasticSearch database.

@tloubrieu-jpl
Copy link
Member Author

Thank you @al-niessner I confirm performance is back to 11 minute. I will merge. Thanks

@jordanpadams
Copy link
Member

closed per #45

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants