
DynamoDB batch retrieval does slow sequential gets (online_read) #2247

Closed
adchia opened this issue Jan 27, 2022 · 12 comments · Fixed by #2371

@adchia
Collaborator

adchia commented Jan 27, 2022

The current DynamoDB implementation does sequential gets (https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/online_stores/dynamodb.py#L163)

Possible Solution

A better approach would be to use a multi-get operation, or at least to run these queries in parallel and collect the results.
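
For illustration, the per-key pattern looks roughly like this (simplified sketch; the table name and the `entity_id` key attribute are placeholders, not the exact Feast internals):

```python
import boto3

def online_read_sequential(table_name, document_ids):
    # Simplified illustration of the current behavior: one GetItem round trip per entity key.
    table = boto3.resource("dynamodb").Table(table_name)
    items = []
    for document_id in document_ids:
        response = table.get_item(Key={"entity_id": document_id})
        items.append(response.get("Item"))
    return items
```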

@adchia
Collaborator Author

adchia commented Jan 27, 2022

cc @vlin-lgtm

adchia added the good first issue label on Jan 27, 2022
@Vandinimodi1595

@adchia, may I know what needs to be done? I would like to contribute to this issue.

@vlin-lgtm

@adchia, may I know what needs to be done? I would like to contribute to this issue.

It appears we have the list of primary keys (entity_ids), so we can switch to BatchGetItem instead.
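
A minimal sketch of that switch with boto3 (table name and the `entity_id` key attribute are placeholders):

```python
import boto3

def batch_get(table_name, document_ids):
    # One BatchGetItem round trip for the whole set of keys, instead of N GetItem calls.
    dynamodb = boto3.resource("dynamodb")
    response = dynamodb.batch_get_item(
        RequestItems={
            table_name: {"Keys": [{"entity_id": doc_id} for doc_id in document_ids]}
        }
    )
    return response["Responses"].get(table_name, [])
```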

@vlin-lgtm

^^ @Vandinimodi1595

@TremaMiguel
Member

@adchia @vlin-lgtm @Vandinimodi1595 switching to BatchGetItem could throw a ValidationException because:

A single operation can retrieve up to 16 MB of data, which can contain as many as 100 items. BatchGetItem returns a partial result if the response size limit is exceeded

One alternative could be to write a custom BatchReader that automatically handles batched reads from DynamoDB; boto3's BatchWriter could be taken as an example for this, since it handles unprocessed items. A rough sketch is below.
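
A rough sketch of what such a reader could do, resubmitting UnprocessedKeys the way BatchWriter resubmits unprocessed writes (function name, retry policy, and the `entity_id` attribute are hypothetical):

```python
import boto3

def batch_get_with_retries(table_name, keys, max_attempts=5):
    # keys are in low-level attribute-value format, e.g. {"entity_id": {"S": "..."}}.
    client = boto3.client("dynamodb")
    request = {table_name: {"Keys": keys}}
    items = []
    for _ in range(max_attempts):
        response = client.batch_get_item(RequestItems=request)
        items.extend(response["Responses"].get(table_name, []))
        unprocessed = response.get("UnprocessedKeys") or {}
        if not unprocessed:
            break
        # Real code would back off before resubmitting, as the DynamoDB docs recommend.
        request = unprocessed
    return items
```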

Another alternative is using threads to read items in parallel; a sketch follows. The SageMaker SDK's Ingestion Manager could be taken as an example; it works similarly to the previous option, calling put_record on a set of indexes.
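
A sketch of that threaded alternative (pool size and the `entity_id` attribute are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

def parallel_get(table_name, document_ids, max_workers=10):
    # Keep the per-key GetItem calls, but issue them concurrently; boto3 clients are thread safe.
    client = boto3.client("dynamodb")

    def _get(doc_id):
        response = client.get_item(
            TableName=table_name,
            Key={"entity_id": {"S": doc_id}},  # low-level attribute-value format
        )
        return response.get("Item")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_get, document_ids))
```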

If you're open to it, I'm happy to collaborate on this.

@vlin-lgtm

vlin-lgtm commented Feb 28, 2022

DynamoDB's record size limit is 400 KB. Since we have a list of primary keys, we know the maximum number of records we would get back, and we can assume each record is at most 400 KB. We can call BatchGetItem in micro batches so that a ValidationException won't be thrown.

@TremaMiguel
Member

TremaMiguel commented Feb 28, 2022

@vlin-lgtm I was also considering the case of a secondary index; this might be useful when you want to access attributes other than the primary key. For example:

user_group    location    value
100           usa         value_1.com
100           mexico      value_2.com
100           brazil      value_3.com

@vlin-lgtm

@TremaMiguel, I am not sure secondary indices are relevant, but I don't know Feast's codebase well enough to be certain.
The code snippet in the description gets features given a set of entity ids. Entity ids are primary keys. I don't know where Feast uses secondary indices, or whether it uses them at all.

We are not looking to create a generic wrapper for the BatchQuery API for DynamoDB.

A for-loop to BatchGet 40 entities at a time is sufficient for this.
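
Something along these lines, assuming roughly 40 keys per call keeps each request comfortably under the 100-item / 16 MB caps (names and the `entity_id` attribute are placeholders):

```python
import boto3

def batch_get_in_chunks(table_name, document_ids, chunk_size=40):
    # Call BatchGetItem over micro batches so a single request stays under the limits.
    dynamodb = boto3.resource("dynamodb")
    items = []
    for start in range(0, len(document_ids), chunk_size):
        chunk = document_ids[start:start + chunk_size]
        response = dynamodb.batch_get_item(
            RequestItems={table_name: {"Keys": [{"entity_id": d} for d in chunk]}}
        )
        items.extend(response["Responses"].get(table_name, []))
    return items
```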

@TremaMiguel
Member

TremaMiguel commented Mar 2, 2022

@vlin-lgtm @adchia apologies, I had misunderstood and thought this issue was about the sequential calls in online_write_batch. A similar issue could be raised for online_read, since it iteratively processes the entity_keys: #2351

I still believe the approach mentioned above could help:

Another alternative is using threads to read items in parallel. The SageMaker SDK's Ingestion Manager could be taken as an example; it works similarly to the previous option, calling put_record on a set of indexes.

What are your thoughts?

@vlin-lgtm

Thanks @TremaMiguel,

Sounds good. #2351 appears to be a dup of this issue.

I believe doing a BatchGet or a for-loop of BatchGets is good enough. What is your use case? This is used for online serving; I don't think there will be a case where we need to fetch features for a large number of entities in bulk online. Multi-threading and multi-processing can add unnecessary complexity.

In any case, thanks a lot for wanting to contribute. Please feel free to open a PR. 🙏

adchia changed the title from "DynamoDB batch retrieval does slow sequential gets" to "DynamoDB batch retrieval does slow sequential gets (online_read)" on Mar 4, 2022
@adchia
Collaborator Author

adchia commented Mar 4, 2022

Agreed with @vlin-lgtm on all points! BatchGet is the largest and simplest win here.

Threads are fine too once batch gets are in place, though I think they'll have much less of an impact. You can see how we expose multithreading in https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/online_stores/datastore.py via write_concurrency.
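
For reference, a rough sketch of how that knob surfaces in the Datastore online store config (the class name and import path are my reading of the code; treat as illustrative):

```python
from feast.infra.online_stores.datastore import DatastoreOnlineStoreConfig

# Illustrative only: the Datastore online store exposes a write_concurrency setting;
# a similar knob could be added to the DynamoDB store for parallel reads.
online_config = DatastoreOnlineStoreConfig(write_concurrency=50)
```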

@TremaMiguel
Member

Thanks @adchia and @vlin-lgtm, I'll start working on this
