
modify batch size when cycle detected #127

Closed · snopoke wants to merge 2 commits from the sk/modify-batch-size branch
Conversation

@snopoke (Contributor) commented Oct 24, 2019:

This is a workaround for the issue described in #107 (comment).

The other PR dealing with this, #124, is still the better long-term solution, but it requires more effort.

@snopoke requested a review from esoergel on October 24, 2019 15:34
@esoergel (Contributor) left a comment:

Approved unless there is an issue with nondeterministic ordering

batch = self.get(resource, params)
# Re-fetch the total count if it is unknown or we have paged past it.
if not total_count or total_count == 'unknown' or fetched >= total_count:
    total_count = int(batch['meta']['total_count']) if batch['meta']['total_count'] else 'unknown'
    fetched = 0

fetched += len(batch['objects'])
# Only yield objects that were not in the previous batch, and remember
# this batch's ids so the next iteration can skip the overlap.
new_in_batch = [obj for obj in batch['objects'] if obj['id'] not in last_batch_ids]
last_batch_ids = {obj['id'] for obj in batch['objects']}

@esoergel (Contributor) commented on this code:

Checking my understanding here: besides the initial request, each subsequent request is going to be a strict superset of the previous one's results, so there's no need to keep track of last_batch_ids beyond the most recent request.

Or is there a possibility of, say, non-deterministic ordering, which could result in objects in batch 1 that aren't also in batch 2?

For instance, this should work fine:
Batch 1 - fetch 5 items starting at date A: [A1, A2, A3, B1, B2]
Batch 2 - fetch 5 items starting at date B: [B1, B2, B3, B4, B5]
Batch 3 - fetch 7 items starting at date B: [B1, B2, B3, B4, B5, B6, B7]

But if the ordering is inconsistent beyond date, there could be issues:
Batch 1 - fetch 5 items starting at date A: [A1, A2, A3, B1, B2]
Batch 2 - fetch 5 items starting at date B: [B2, B3, B4, B5, B6] # missing B1!
Batch 3 - fetch 7 items starting at date B: [B1, B2, B3, B4, B5, B6, B7] # B1 will be yielded a second time
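
To make that second scenario concrete, here is a toy simulation (not the PR's code; the batches are hard-coded to match the example above) showing how deduplicating against only the previous batch yields B1 twice:

# Toy reproduction of the failure mode: dedup against only the previous
# batch misses duplicates when ordering within the same date is unstable.
batches = [
    ['A1', 'A2', 'A3', 'B1', 'B2'],              # batch 1, from date A
    ['B2', 'B3', 'B4', 'B5', 'B6'],              # batch 2, from date B: missing B1!
    ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7'],  # batch 3, from date B
]

yielded = []
last_batch_ids = set()
for batch in batches:
    new_in_batch = [obj for obj in batch if obj not in last_batch_ids]
    last_batch_ids = set(batch)
    yielded.extend(new_in_batch)

print(yielded)  # B1 appears twice:
# ['A1', 'A2', 'A3', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B1', 'B7']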

@snopoke (Author) replied:

Interesting question. I was not able to determine from any documentation whether or not the sort order is deterministic; in practice, however, we've seen it to be so.

@snopoke (Author) added:

Another possible way to break the cycle would be to compare the filter parameters. If they are identical, then we would increase the batch size. I think that would work regardless of sorting.

Contributor:

That's a good idea, and it seems like a clearer approach too. Though it would still need to exclude duplicates when fetching overlapping data.

Contributor:

> Another possible way to break the cycle would be to compare the filter parameters. If they are identical, then we would increase the batch size.

I think this is a great idea. Not only would it work regardless of sorting, it would also pre-empt the problem. In Ethan's example, in Batch 2, when we get to B5 we already know we must increase the batch size. If we keep going through B6 and B7 to C1, then Batch 3 will start at C1 at worst, and D1 at best.

We would want a max batch size; if we reach it, we know we have an unrecoverable problem, so it's probably best to make it high.
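
A rough sketch of that strategy (hypothetical names and cap value; not code from this PR): keep the previous request's filter parameters around, and double the limit whenever the next request's parameters come out identical.

# Hypothetical sketch of the proposed strategy, not code from this PR.
# If the next request's filter params match the last request exactly,
# we are cycling, so double the batch size up to a hard cap.
MAX_BATCH_SIZE = 10000  # hitting this would mean an unrecoverable problem

def adjust_params(params, last_params):
    if params == last_params:
        new_limit = params['limit'] * 2
        if new_limit > MAX_BATCH_SIZE:
            raise ValueError('cycle detected and batch size cap reached')
        params = dict(params, limit=new_limit)
    return params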

@snopoke closed this Nov 27, 2019
@snopoke deleted the sk/modify-batch-size branch April 21, 2022 09:37
3 participants