
Broad api calls #20

Merged · merged 8 commits into RECETOX:main · Jul 16, 2021
Conversation

@xtrojak (Collaborator) commented on Jul 13, 2021

Improved API call efficiency: where possible, broader calls are made (instead of obtaining a single result, e.g. InChIKey, we obtain multiple attributes at once). Results are stored in a cache for a single spectrum's data, as an obtained result can be reused by another job. This was implemented in Annotator.execute_job_with_cache.
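
For illustration, a minimal stub of the broad-call shape (get_compound and its payload are invented here, not the project's real API):

```python
import asyncio

# Hypothetical service stub: one broad call returns several attributes at
# once, instead of one value per request.
async def get_compound(name: str) -> dict:
    await asyncio.sleep(0)       # stands in for a single HTTP round trip
    return {"inchikey": "...", "formula": "...", "smiles": "..."}

async def demo():
    attributes = await get_compound("water")
    # both attributes are now available without any further API call
    return attributes["inchikey"], attributes["formula"]

asyncio.run(demo())
```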

Another cache, for async requests, was implemented using an asynchronous version of lru_cache. Closes #8.
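
A minimal sketch of such an async cache, assuming asyncstdlib's lru_cache (the package named in the review below); query_service is a stand-in:

```python
import asyncio
import asyncstdlib

# asyncstdlib.lru_cache memoizes coroutine results, so repeated identical
# requests are answered without re-awaiting the underlying call.
@asyncstdlib.lru_cache(maxsize=128)
async def query_service(url: str) -> str:
    await asyncio.sleep(0)                       # pretend network latency
    return f"response for {url}"

async def main():
    first = await query_service("https://example.org/api?q=water")
    again = await query_service("https://example.org/api?q=water")  # cache hit
    assert first == again

asyncio.run(main())
```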

Additionally, the implementation of individual services was simplified where possible (mostly by separating calls from parsing), though this introduced issue #19.
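
A sketch of the calls/parsing split under an assumed aiohttp-style session; both function names are illustrative:

```python
# Network side and parsing side become separately testable units.
async def query(session, url: str) -> str:
    # the call knows nothing about the response layout
    async with session.get(url) as response:
        return await response.text()

def parse_inchikey(payload: str) -> str:
    # parsing is a pure function, unit-testable without a server
    return payload.strip().splitlines()[0]
```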

@xtrojak requested review from martenson and hechth on Jul 13, 2021, 06:59
libs/utils/HashableDict.py — review thread (outdated, resolved)
@hechth (Member) left a comment

Code looks good, only problem is that asyncstdlib is not on conda!

@hechth self-requested a review on Jul 14, 2021, 06:23
@martenson (Member) left a comment

Seems fine. I've added simple test data input and output files.

The efficiency of the double caching with both execute_job_with_cache and the @lru_cache method decorator is unclear to me. Testing with a significant dataset could clarify this, but is not required at the moment in my opinion.

Beware that the default lru_cache maxsize is unlimited.

@martenson (Member) commented

Oh, and we have a 500MB .msp file in the UMSA library for some serious testing: https://umsa.cerit-sc.cz/library/list#folders/F1c84aa7fc4490e6d/datasets/9a34fd777b6c8572

```
@@ -6,6 +9,7 @@ class Converter:
    def __init__(self, session):
        self.session = session

    @lru_cache
```
Member commented on this diff:

Are you sure self is hashable here? It contains the session.

@xtrojak (Collaborator, Author) replied:

You are right; we have to make sure the default hash, based on id, is sufficient here. We will look into that in #1 (cache testing).
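
For context, a toy example of why the question matters: functools.lru_cache on a method hashes self as part of the cache key, and the default identity-based object.__hash__ applies only while the class defines no __eq__ (the names below are illustrative, not the project's code):

```python
from functools import lru_cache

class Converter:
    def __init__(self, session):
        self.session = session          # never hashed by lru_cache itself

    @lru_cache                           # cache key is (self, value)
    def convert(self, value: str) -> str:
        return value.upper()

c = Converter(session={})                # even an unhashable session is fine
assert c.convert("abc") == c.convert("abc")   # second call is a cache hit
```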

@xtrojak (Collaborator, Author) commented on Jul 16, 2021

> Seems fine. I've added simple test data input and output files.
>
> The efficiency of the double caching with both execute_job_with_cache and the @lru_cache method decorator is unclear to me. Testing with a significant dataset could clarify this, but is not required at the moment in my opinion.

The current design requires the specification of a triple (source, target, service), which defines that the user wants to obtain the target attribute based on the source attribute using a particular service. In this PR, I have changed the queries to obtain as much data as possible from a single API call. In execute_job_with_cache we cache these data for a single spectrum; then, for another triple requesting an already cached attribute, we can simply return it. Note that the request itself would actually differ from the previous one, so lru_cache would not help here.

lru_cache, on the other hand, is used to avoid executing exactly the same request multiple times. This mostly applies when spectra in a single .msp file share an attribute (which seems to be the case).
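
A toy walk-through of how the two layers interact (all names, signatures, and data are made up, not the project's real code):

```python
from functools import lru_cache

@lru_cache                       # layer 2: dedupes identical broad requests
def broad_call(source_value, service):
    # stands in for one API request returning several attributes at once
    return {"inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "formula": "H2O"}

def execute_job_with_cache(job, spectrum, cache):
    source, target, service = job
    if target not in cache:              # layer 1: per-spectrum attributes
        cache.update(broad_call(spectrum[source], service))
    return cache[target]

cache = {}
spectrum = {"name": "water"}
execute_job_with_cache(("name", "inchikey", "pubchem"), spectrum, cache)
# a different triple for the same spectrum is served by layer 1;
# broad_call (and hence lru_cache) is not consulted a second time:
execute_job_with_cache(("name", "formula", "pubchem"), spectrum, cache)
assert broad_call.cache_info().misses == 1
```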

> Beware that the default lru_cache maxsize is unlimited.

It seems the default is actually 128 (the default value of the maxsize argument).
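
A quick check against functools (a bare @lru_cache defaults to maxsize=128; maxsize=None is the unbounded variant):

```python
from functools import lru_cache

@lru_cache                      # bare decorator: equivalent to maxsize=128
def square(x):
    return x * x

square(2)
assert square.cache_info().maxsize == 128    # bounded, not unlimited

@lru_cache(maxsize=None)        # the genuinely unbounded variant
def cube(x):
    return x * x * x
```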

@martenson merged commit 59dbb1e into RECETOX:main on Jul 16, 2021
Development

Successfully merging this pull request may close these issues.

A more mature caching approach (#8)