-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
modify batch size when cycle detected #127
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved unless there is an issue with nondeterministic ordering
batch = self.get(resource, params) | ||
if not total_count or total_count == 'unknown' or fetched >= total_count: | ||
total_count = int(batch['meta']['total_count']) if batch['meta']['total_count'] else 'unknown' | ||
fetched = 0 | ||
|
||
fetched += len(batch['objects']) | ||
new_in_batch = [obj for obj in batch['objects'] if obj['id'] not in last_batch_ids] | ||
last_batch_ids = {obj['id'] for obj in batch['objects']} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking my understanding here:
Besides the initial request, each subsequent request is going to be a strict superset of results, therefore there's no need to keep track of last_batch_ids
beyond the most recent request.
Or is there a possibility of say, non-deterministic ordering which could result in objects in batch 1 that aren't also in batch 2?
For instance, this should work fine:
Batch 1 - fetch 5 items starting at date A: [A1, A2, A3, B1, B2]
Batch 2 - fetch 5 items starting at date B: [B1, B2, B3, B4, B5]
Batch 3 - fetch 7 items starting at date B: [B1, B2, B3, B4, B5, B6, B7]
But if the ordering is inconsistent beyond date, there could be issues:
Batch 1 - fetch 5 items starting at date A: [A1, A2, A3, B1, B2]
Batch 2 - fetch 5 items starting at date B: [B2, B3, B4, B5, B6] # missing B1!
Batch 3 - fetch 7 items starting at date B: [B1, B2, B3, B4, B5, B6, B7] # B1 will be yielded a second time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting question. I was not able to determine from any documentation whether or not the sort order would deterministic however in practice we've seen it to be so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible way to break the cycle would be to compare the filter parameters. If they are identical then we would increase the batch size. I think that would work regardless of sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Seems a bit clearer of an approach too. Though this would still need to exclude duplicates when fetching overlapping data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible way to break the cycle would be to compare the filter parameters. If they are identical then we would increase the batch size.
I think this is a great idea. Not only would it work regardless of sorting, it would also pre-empt the problem. In the Ethan's example, in Batch 2, when we get to B5 we already know we must increase the batch size. If we keep going through B6 and B7 to C1, then Batch 3 will start at C1 at worst, and D1 at best.
We would want a max batch size, and if we reach it we would know we have an unrecoverable problem, so probably best to make it high.
This is a workaround to the issue described here #107 (comment)
The other PR dealing with this is still a better long term solution but requires more effort: #124