
DynamoDB batch retrieval does slow sequential gets (online_read) #2247

Closed
adchia opened this issue Jan 27, 2022 · 12 comments · Fixed by #2371

@adchia
Collaborator

adchia commented Jan 27, 2022

The current DynamoDB implementation does sequential gets (https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/online_stores/dynamodb.py#L163)

Possible Solution

A better approach would be to use a multi-get operation, or at least to run these queries in parallel and collect the results.
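
For illustration, the per-key pattern looks roughly like this (simplified sketch; the table name and the `entity_id` key attribute are placeholders, not the exact Feast internals):

```python
import boto3

def online_read_sequential(table_name, document_ids):
    # Simplified illustration of the current behavior: one GetItem round trip per entity key.
    table = boto3.resource("dynamodb").Table(table_name)
    items = []
    for document_id in document_ids:
        response = table.get_item(Key={"entity_id": document_id})
        items.append(response.get("Item"))
    return items
```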

@adchia
Collaborator Author

adchia commented Jan 27, 2022

cc @vlin-lgtm

adchia added the good first issue label on Jan 27, 2022
@Vandinimodi1595

@adchia, may I know what needs to be done? I would like to contribute to this issue.

@vlin-lgtm

@adchia, may I know what needs to be done? I would like to contribute to this issue.

It appears we have the list of primary keys (entity_ids), so we can switch to BatchGetItem instead.
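
A minimal sketch of that switch with boto3 (table name and the `entity_id` key attribute are placeholders):

```python
import boto3

def batch_get(table_name, document_ids):
    # One BatchGetItem round trip for the whole set of keys, instead of N GetItem calls.
    dynamodb = boto3.resource("dynamodb")
    response = dynamodb.batch_get_item(
        RequestItems={
            table_name: {"Keys": [{"entity_id": doc_id} for doc_id in document_ids]}
        }
    )
    return response["Responses"].get(table_name, [])
```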

@vlin-lgtm

^^ @Vandinimodi1595

@TremaMiguel
Member

@adchia @vlin-lgtm @Vandinimodi1595 switching to BatchGetItem could throw a ValidationException because:

A single operation can retrieve up to 16 MB of data, which can contain as many as 100 items. BatchGetItem returns a partial result if the response size limit is exceeded

One alternative could be to write a custom BatchReader that automatically handles batched reads from DynamoDB; boto3's BatchWriter could be taken as an example for this, since it handles unprocessed items. A rough sketch is below.
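
A rough sketch of what such a reader could do, resubmitting UnprocessedKeys the way BatchWriter resubmits unprocessed writes (function name, retry policy, and the `entity_id` attribute are hypothetical):

```python
import boto3

def batch_get_with_retries(table_name, keys, max_attempts=5):
    # keys are in low-level attribute-value format, e.g. {"entity_id": {"S": "..."}}.
    client = boto3.client("dynamodb")
    request = {table_name: {"Keys": keys}}
    items = []
    for _ in range(max_attempts):
        response = client.batch_get_item(RequestItems=request)
        items.extend(response["Responses"].get(table_name, []))
        unprocessed = response.get("UnprocessedKeys") or {}
        if not unprocessed:
            break
        # Real code would back off before resubmitting, as the DynamoDB docs recommend.
        request = unprocessed
    return items
```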

Another alternative is using threads to read items in parallel; a sketch follows. The SageMaker SDK's Ingestion Manager could be taken as an example; it works similarly to the previous option, calling put_record on a set of indexes.
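
A sketch of that threaded alternative (pool size and the `entity_id` attribute are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

def parallel_get(table_name, document_ids, max_workers=10):
    # Keep the per-key GetItem calls, but issue them concurrently; boto3 clients are thread safe.
    client = boto3.client("dynamodb")

    def _get(doc_id):
        response = client.get_item(
            TableName=table_name,
            Key={"entity_id": {"S": doc_id}},  # low-level attribute-value format
        )
        return response.get("Item")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_get, document_ids))
```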

If you're open to it, I'm happy to collaborate on this.

@vlin-lgtm

vlin-lgtm commented Feb 28, 2022

DynamoDB's record size limit is 400 KB. Since we have a list of primary keys, we know the maximum number of records we would get back, and we can assume each record is at most 400 KB. We can call BatchGetItem in micro batches so that a ValidationException won't be thrown.

@TremaMiguel
Member

TremaMiguel commented Feb 28, 2022

@vlin-lgtm I was also considering the case of a secondary index; this might be useful when you want to access attributes other than the primary key. For example:

user_group    location    value
100           usa         value_1.com
100           mexico      value_2.com
100           brazil      value_3.com

@vlin-lgtm

@TremaMiguel, I am not sure secondary indices are relevant, but I don't know Feast's codebase well enough to be certain.
The code snippet in the description gets features given a set of entity ids. Entity ids are primary keys. I don't know where Feast uses secondary indices, or whether it uses them at all.

We are not looking to create a generic wrapper for the BatchQuery API for DynamoDB.

A for-loop to BatchGet 40 entities at a time is sufficient for this.
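
Something along these lines, assuming roughly 40 keys per call keeps each request comfortably under the 100-item / 16 MB caps (names and the `entity_id` attribute are placeholders):

```python
import boto3

def batch_get_in_chunks(table_name, document_ids, chunk_size=40):
    # Call BatchGetItem over micro batches so a single request stays under the limits.
    dynamodb = boto3.resource("dynamodb")
    items = []
    for start in range(0, len(document_ids), chunk_size):
        chunk = document_ids[start:start + chunk_size]
        response = dynamodb.batch_get_item(
            RequestItems={table_name: {"Keys": [{"entity_id": d} for d in chunk]}}
        )
        items.extend(response["Responses"].get(table_name, []))
    return items
```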

@TremaMiguel
Member

TremaMiguel commented Mar 2, 2022

@vlin-lgtm @adchia apologies, I had misunderstood and thought this issue was about the sequential calls in online_write_batch. A similar issue could be raised for online_read, since it iteratively processes the entity_keys: #2351

I still believe the approach mentioned above could help:

Another alternative is using threads to read items in parallel. The SageMaker SDK's Ingestion Manager could be taken as an example; it works similarly to the previous option, calling put_record on a set of indexes.

What are your thoughts?

@vlin-lgtm

Thanks @TremaMiguel,

Sounds good. #2351 appears to be a dup of this issue.

I believe doing a BatchGet or a for-loop of BatchGets is good enough. What is your use case? This is used for online serving; I don't think there will be a case where we need to fetch features for a large number of entities in bulk online. Multi-threading and multi-processing can add unnecessary complexity.

In any case, thanks a lot for wanting to contribute. Please feel free to open a PR. 🙏

adchia changed the title from "DynamoDB batch retrieval does slow sequential gets" to "DynamoDB batch retrieval does slow sequential gets (online_read)" on Mar 4, 2022
@adchia
Collaborator Author

adchia commented Mar 4, 2022

Agreed with @vlin-lgtm on all points! BatchGet is the largest and simplest win here.

Threads are fine too once batch gets are in place, though I think they'll have much less of an impact. You can see how we expose multithreading in https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/online_stores/datastore.py via write_concurrency.
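
For reference, a rough sketch of how that knob surfaces in the Datastore online store config (the class name and import path are my reading of the code; treat as illustrative):

```python
from feast.infra.online_stores.datastore import DatastoreOnlineStoreConfig

# Illustrative only: the Datastore online store exposes a write_concurrency setting;
# a similar knob could be added to the DynamoDB store for parallel reads.
online_config = DatastoreOnlineStoreConfig(write_concurrency=50)
```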

@TremaMiguel
Member

Thanks @adchia and @vlin-lgtm, I'll start working on this
