As a developer, I want to utilize ElasticSearch performance robustness for API response time requirements. #13
Comments
Does this mean you want the registry to be roughly as fast as giving the same query directly to Elasticsearch (ES)? If not, then please explain. The 20% rule is problematic as well. It may hold true for large searches, but not for short, quick ones. If the Java overhead is a fixed cost of 20 ms, then a 1 ms ES query will take 21 ms through the registry. I doubt you will notice it, but it violates your acceptance criteria. I appreciate the 20%, but maybe 1.2 * ES time + 0.5 seconds would allow for fixed costs.
@al-niessner we want to improve the performance by doing the 2 items given in additional details, and yes, the API should be roughly as fast as the Elasticsearch request. Don't worry about the 20%; this was a somewhat arbitrary attempt at a quantifiable acceptance criterion. We can adjust that, and your proposal sounds fine to me. Note that we would like all requests to be less than 1 or 2 seconds, so the 0.5 s fixed cost is maybe too much. Actually, an easier acceptance criterion would be to measure a request with the existing development and check that the new one is better, and by how much. I have one example in a Jupyter notebook which requests multiple pages from /collections/{lidvid}/products and currently takes 13 minutes. That is a good reference that we can check after the development is done.
@al-niessner will test what the best strategy is to request Elasticsearch for multiple ids (of products). No properties will be returned unless the user explicitly requests them in the field= parameter. To get all attributes we can use field=*.
@al-niessner the fetchSource method that you were mentioning (https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/6.0.1/org/elasticsearch/search/builder/SearchSourceBuilder.html#fetchSource-java.lang.String:A-java.lang.String:A-) is also the one I had in mind. The SearchSourceBuilder object is already used in multiple places in the code, for example Line 75 in db52a57
so you can easily add the call to the fetchSource method there. The one which needs to be updated more heavily is this line: Line 89 in db52a57
If you confirm that doing a single request for multiple product ids is more efficient, then you will have to rewrite this part anyway; I was thinking of caching the result of the request on the multiple product ids when needed. We can discuss that later. Thanks
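For reference, a minimal sketch of what that fetchSource call could look like (the class, variable, and field names here are illustrative assumptions, not the actual registry code at the lines cited above):

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Minimal sketch: restrict what Elasticsearch returns by adding a
// fetchSource(includes, excludes) call to the SearchSourceBuilder.
// The "lidvid" field name and requestedFields argument are assumptions.
public class FetchSourceSketch {
    public static SearchSourceBuilder buildSearch(String lidvid, String[] requestedFields) {
        SearchSourceBuilder builder = new SearchSourceBuilder();
        builder.query(QueryBuilders.termQuery("lidvid", lidvid));
        // includes = only the fields the user asked for via the field= parameter;
        // passing null instead returns the full _source.
        builder.fetchSource(requestedFields, null);
        return builder;
    }
}
```

With field=* the includes array would simply be left null so the full document comes back.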
Using just Elasticsearch and its URL interface, I tested individual fetch versus group fetch. The individual test was performed like:
For the group it was done like this:
Using the harvest test data set, which has a total of 10 lidvids, I iterated over the two approaches 10000 times just to rinse out odd glitches. Statistics:
For large numbers, grouped becomes significant. For small numbers (pagination of about 10) it makes little difference, as they are both sub-second.
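The exact curl commands and timing numbers did not survive in this thread; purely as an illustration of the two strategies being compared, here is a rough Java-client equivalent (the index name "registry" and field name "lidvid" are assumptions, and the original test used the URL interface, not the Java client):

```java
import java.io.IOException;
import java.util.List;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Illustrative sketch of the two fetch strategies being timed above.
public class FetchStrategies {

    // One round trip per lidvid ("individual" fetch).
    static void individualFetch(RestHighLevelClient client, List<String> lidvids) throws IOException {
        for (String lidvid : lidvids) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("lidvid", lidvid));
            SearchResponse response = client.search(
                    new SearchRequest("registry").source(source), RequestOptions.DEFAULT);
            // ... consume response.getHits() ...
        }
    }

    // A single round trip for the whole list ("group" fetch).
    static SearchResponse groupFetch(RestHighLevelClient client, List<String> lidvids) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.termsQuery("lidvid", lidvids))
                .size(lidvids.size());
        return client.search(new SearchRequest("registry").source(source), RequestOptions.DEFAULT);
    }
}
```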
Thanks @al-niessner. For the reference test that I will use to check on the performance improvement (see https://github.com/NASA-PDS/pds-api-notebook/blob/main/notebooks/pds-api-client-ovirs-part1-explore-a-collection.ipynb), I am using pages of 500 products.
Do we have a path to use that notebook with a large database to test changes for improvement to the code on this issue? I would like to run it before making changes to get a baseline, then again as the code develops. If that number (14.8 minutes from what I saw) does not improve, or worsens, it would be best to detect that earlier rather than later.
@al-niessner here is the test registry / API we have populated online: http://pds-gamma.jpl.nasa.gov/api/ |
and agreed. having a test script we can run and check against some time threshold would be excellent. then we can tune down that threshold as we continue to refine the response time |
@al-niessner the Elasticsearch used for the pds-gamma deployment is only accessible from pds-gamma. So we will need to deploy the development version on pds-gamma, which is OK with me. I will share the instructions to update the deployment with you.
I have a working algorithm that replaces the pagination with grouped collection of data. However, it is not behaving as expected. Two things could be wrong: one, my expectation is erroneous; two, the class gov.nasa.jpl.pds.model.Products needs expansion/rewrite. After fighting with Elasticsearch and the basic algorithm for collecting multiple entries given a list of lidvids, I found the problem area. Let's start with the setup:
The important part of this is that it asks for all products of a bundle, of which 65 are found, and returns just the first 5. Most importantly, it should return only the field "ops:Label_File_Info/ops:md5_checksum". Side note: I have no idea what only-summary is supposed to mean. Unexpectedly, the return from the curl is:
Note that there are 5 responses as expected, but not the field. It took two days to get here; I added some log information for debugging, and this is what the code is doing:
This means the output is found as expected and only the requested field is returned. However, it is not being added to the product in this code block:
Which is creating a gov.nasa.jpl.pds.model.Product. However, it is not including the field that was requested and is in turn returning an empty JSON object. My expectation, which may be wrong, is that the user wanted to see just the field(s) requested, and the API should have reduced the output to just those. However, we seem to rarely do what the user says but rather what we think they really meant to say, which would mean my expectation is erroneous. The gov.nasa.jpl.pds.model.Product requires a specific format, so when users request something it does not understand, it ignores it. So, is my expectation wrong, does gov.nasa.jpl.pds.model.Product need correction, or do both?
this is correct. fields means return only those fields, and we should pass that to Elasticsearch to only return those fields. so somewhere in this code we need to define that in what we pass to the Elasticsearch object
oops @al-niessner ☝️
Change of expectation: some slashes become dots, so run it through the filter, then change them back again before building the product. If the product still does not build after the expectation change, then chase it.
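In case it helps, a tiny hypothetical helper illustrating the slash/dot round trip being described (which separator belongs on which side is exactly the expectation being adjusted here):

```java
// Hypothetical helper for the slash/dot round trip: convert the field name
// before filtering against Elasticsearch, then convert back before the
// value is attached to the product.
public final class FieldNames {
    private FieldNames() {}

    // e.g. "ops:Label_File_Info/ops:md5_checksum" <-> "ops:Label_File_Info.ops:md5_checksum"
    public static String slashesToDots(String field) {
        return field.replace('/', '.');
    }

    public static String dotsToSlashes(String field) {
        return field.replace('.', '/');
    }
}
```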
OK, that sounds good @al-niessner. I would like to let you know that I also tried the fetchContext thing there: Line 202 in 22e7957
That works. And I also updated this utility function to filter fields: Line 125 in 5e47e95
This was to get rid of the blob. That might not be as useful, though, if the fields are already filtered at the Elasticsearch query level... And I created this ProductBusinessObject class to hide some of the interaction with the Elasticsearch database.
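A sketch in the spirit of that utility function (the key-matching rule for blob fields is an assumption; the real function lives at the line referenced above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a source-map filter: drop heavyweight blob entries from the
// Elasticsearch _source map before it is turned into a Product.
// Matching on "blob" in the key name is an assumed convention.
public final class SourceFilter {
    private SourceFilter() {}

    public static Map<String, Object> withoutBlobs(Map<String, Object> sourceAsMap) {
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : sourceAsMap.entrySet()) {
            if (!entry.getKey().contains("blob")) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }
}
```

As noted in the comment above, this becomes redundant once fetchSource already excludes those fields on the Elasticsearch side.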
Thank you @al-niessner, I confirm performance is back to 11 minutes. I will merge. Thanks
closed per #45
Motivation
I want the best request performance Elasticsearch can give me, so that I can request millions of records (e.g. products of a collection) as quickly as possible.
Additional Details
This includes:
Acceptance Criteria
Given the duration of the equivalent optimized request to Elasticsearch
When I perform the same request through the API
Then I expect the duration (excluding network time between client and API) to be no more than 20% longer than the Elasticsearch request
Given an API request
When I perform the underlying Elasticsearch request (shown in logs)
Then I expect no extra fields to be pulled from Elasticsearch
Given an API request to the endpoint /collections/{lidvid}/products
When I perform the request
Then I expect only 2 calls to Elasticsearch to be made (per page)
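A rough, self-contained sketch of how the first criterion could be spot-checked (both URLs are placeholders, and network time between client and API is not subtracted here, so treat any numbers as indicative only):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Time the same logical query through the API and directly against
// Elasticsearch, then compare against the 20% overhead budget above.
public class ResponseTimeCheck {
    // Placeholder URLs -- substitute the real deployment, e.g. the pds-gamma API.
    private static final String API_URL = "http://localhost:8080/api/collections/some-lidvid/products";
    private static final String ES_URL = "http://localhost:9200/registry/_search";

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        long apiMillis = timeGet(client, API_URL);
        long esMillis = timeGet(client, ES_URL);
        System.out.printf("api=%d ms, es=%d ms, within 20%% budget: %b%n",
                apiMillis, esMillis, apiMillis <= 1.2 * esMillis);
    }

    private static long timeGet(HttpClient client, String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        long start = System.nanoTime();
        client.send(request, HttpResponse.BodyHandlers.discarding());
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```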