DynamoDB batch retrieval does slow sequential gets (online_read) #2247
Comments
cc @vlin-lgtm
@adchia, may I know what needs to be done? I would like to contribute to this issue.
It appears we have the list of primary keys (entity_ids), so we can switch to BatchGetItem instead.
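A minimal sketch of what this could look like with boto3 (the table name and the `entity_id` partition-key attribute are hypothetical, not Feast's actual schema):

```python
def build_request_items(table_name, entity_ids):
    # Shape the RequestItems payload that BatchGetItem expects:
    # one entry per table, each holding the list of primary keys.
    # "entity_id" is a hypothetical partition-key attribute name.
    return {
        table_name: {
            "Keys": [{"entity_id": {"S": eid}} for eid in entity_ids]
        }
    }

def batch_read(table_name, entity_ids):
    # One BatchGetItem round trip instead of one GetItem per entity.
    import boto3  # imported lazily; assumed available at runtime
    client = boto3.client("dynamodb")
    resp = client.batch_get_item(
        RequestItems=build_request_items(table_name, entity_ids)
    )
    return resp["Responses"].get(table_name, [])
```

This trades N network round trips for one, which is where most of the latency of the sequential loop goes.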
@adchia @vlin-lgtm @Vandinimodi1595 switching to BatchGetItem could work. One alternative could be to write a custom batch reader. Another alternative is using threads to read items in parallel; the SageMaker SDK Ingestion Manager could be taken as an example of doing this, as it works similarly to the previous option, calling put_record on a set of indexes. If you're open, I'm happy to collaborate on this.
DynamoDB's record size limit is 400KB. Since we have the list of primary keys, we know the maximum number of records we would get back, and we can assume each record is at most 400KB. We can call BatchGetItem in chunks sized accordingly.
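Making that arithmetic concrete: BatchGetItem also caps each call at 100 keys and 16 MB of returned data, so under the worst-case assumption of 400 KB per item, about 40 keys per call is a safe chunk size. A pure-Python sketch:

```python
# DynamoDB limits (from the service docs): 400 KB per item,
# and BatchGetItem returns at most 100 items / 16 MB per call.
MAX_ITEM_KB = 400
MAX_RESPONSE_KB = 16 * 1024

# Worst case, every item is 400 KB, so cap chunks at 16 MB / 400 KB = 40.
SAFE_CHUNK = min(100, MAX_RESPONSE_KB // MAX_ITEM_KB)

def chunk_keys(entity_ids, size=SAFE_CHUNK):
    """Split the key list so each BatchGetItem call stays under the limits."""
    return [entity_ids[i:i + size] for i in range(0, len(entity_ids), size)]
```

Each chunk then becomes one BatchGetItem call in a simple for-loop.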
@vlin-lgtm I was also considering the case of a secondary index; this might be useful when you want to access attributes other than the primary key.
@TremaMiguel, I am not sure if secondary indices are relevant, but I don't know Feast's codebase well enough to be certain. We are not looking to create a generic wrapper for the BatchQuery API for DynamoDB. A for-loop of BatchGet calls should suffice.
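One detail a for-loop of BatchGet calls has to handle: BatchGetItem can return `UnprocessedKeys` when throughput runs out, and the documented pattern is to loop until that set drains. A sketch (the client is passed in, so any boto3 DynamoDB client or a stub works):

```python
def batch_get_with_retries(client, table_name, keys):
    # BatchGetItem may leave some keys unprocessed under load;
    # keep re-requesting UnprocessedKeys until the response is drained.
    items = []
    request = {table_name: {"Keys": keys}}
    while request:
        resp = client.batch_get_item(RequestItems=request)
        items.extend(resp["Responses"].get(table_name, []))
        request = resp.get("UnprocessedKeys") or {}
    return items
```

A production version would add exponential backoff between retries, as AWS recommends.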
@vlin-lgtm @adchia apologies, I had misunderstood this issue as being about the sequential calls in online_write_batch; a similar issue could be raised for online_read, as it iteratively processes the entity keys. I still believe the approach mentioned above could help. What are your thoughts?
Thanks @TremaMiguel, sounds good. #2351 appears to be a dup of this issue. I believe doing a BatchGet, or a for-loop of BatchGet calls, is good enough. What is your use case? This is used for online serving; I don't think there will be a case where we need to fetch features for a lot of entities online in bulk. Multi-threading and multi-processing can add unnecessary complexity. In any case, thanks a lot for wanting to contribute. Please feel free to open a PR. 🙏
Agreed with @vlin-lgtm on all points! BatchGet is the largest and simplest win here. Threads are fine too once you're already using batch gets, though I think that will have much less of an impact. You can see how we expose multithreading in https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/online_stores/datastore.py via write_concurrency.
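If threads are added on top of batch gets, a small pool issuing one batch call per chunk would mirror the write_concurrency idea. A stdlib-only sketch, where `fetch_batch` stands in for whatever callable performs the actual DynamoDB BatchGetItem (hypothetical name, not a Feast API):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_read(batches, fetch_batch, max_workers=10):
    # Issue one batch call per chunk from a small thread pool and
    # flatten the results; pool.map preserves the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_batch, batches)
    return [item for batch in results for item in batch]
```

Since the per-batch work is network-bound, threads (rather than processes) are the idiomatic choice here.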
Thanks @adchia and @vlin-lgtm, I'll start working on this.
The current DynamoDB implementation does sequential gets (https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/online_stores/dynamodb.py#L163).
Possible Solution
A better approach is to use a multi-get operation, or at least to run these queries in parallel and collect the results.