modify batch size when cycle detected #127

snopoke · 2019-10-24T15:12:28Z

This is a workaround for the issue described in #107 (comment).

The other PR dealing with this is still the better long-term solution, but it requires more effort: #124

esoergel

Approved unless there is an issue with nondeterministic ordering

esoergel · 2019-10-24T17:30:02Z

commcare_export/commcare_hq_client.py

                batch = self.get(resource, params)
                if not total_count or total_count == 'unknown' or fetched >= total_count:
                    total_count = int(batch['meta']['total_count']) if batch['meta']['total_count'] else 'unknown'
                    fetched = 0

-                fetched += len(batch['objects'])
+                new_in_batch = [obj for obj in batch['objects'] if obj['id'] not in last_batch_ids]
+                last_batch_ids = {obj['id'] for obj in batch['objects']}


Checking my understanding here:
Apart from the initial request, each subsequent request returns a strict superset of the previous request's results, so there's no need to keep track of last_batch_ids beyond the most recent request.

Or is there a possibility of, say, non-deterministic ordering, which could result in objects in batch 1 that aren't also in batch 2?

For instance, this should work fine:
Batch 1 - fetch 5 items starting at date A: [A1, A2, A3, B1, B2]
Batch 2 - fetch 5 items starting at date B: [B1, B2, B3, B4, B5]
Batch 3 - fetch 7 items starting at date B: [B1, B2, B3, B4, B5, B6, B7]

But if the ordering is inconsistent beyond date, there could be issues:
Batch 1 - fetch 5 items starting at date A: [A1, A2, A3, B1, B2]
Batch 2 - fetch 5 items starting at date B: [B2, B3, B4, B5, B6] # missing B1!
Batch 3 - fetch 7 items starting at date B: [B1, B2, B3, B4, B5, B6, B7] # B1 will be yielded a second time
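
To make the two scenarios concrete, here is a minimal sketch that replays them through the same kind of filtering as the diff above. The `dedupe_against_last_batch` helper and the plain string IDs are hypothetical stand-ins for illustration, not code from this PR.

```python
def dedupe_against_last_batch(batches):
    """Yield object IDs, skipping any seen in the immediately preceding batch.

    Hypothetical helper that mirrors the `last_batch_ids` filtering in the
    diff; each batch is simply a list of object IDs.
    """
    last_batch_ids = set()
    for batch in batches:
        for obj_id in batch:
            if obj_id not in last_batch_ids:
                yield obj_id
        # Only the most recent batch is remembered.
        last_batch_ids = set(batch)


# Ordering is consistent within a date: each batch overlaps the previous one.
consistent = [
    ["A1", "A2", "A3", "B1", "B2"],
    ["B1", "B2", "B3", "B4", "B5"],
    ["B1", "B2", "B3", "B4", "B5", "B6", "B7"],
]

# Ordering is inconsistent within a date: batch 2 is missing B1.
inconsistent = [
    ["A1", "A2", "A3", "B1", "B2"],
    ["B2", "B3", "B4", "B5", "B6"],
    ["B1", "B2", "B3", "B4", "B5", "B6", "B7"],
]

consistent_result = list(dedupe_against_last_batch(consistent))
inconsistent_result = list(dedupe_against_last_batch(inconsistent))
```

In the consistent case every ID comes out exactly once. In the inconsistent case B1 comes out twice, because it is no longer in `last_batch_ids` by the time batch 3 arrives.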

Interesting question. I was not able to determine from any documentation whether the sort order is deterministic; however, in practice we've seen it to be so.

Another possible way to break the cycle would be to compare the filter parameters. If they are identical then we would increase the batch size. I think that would work regardless of sorting.

That's a good idea. It seems like a clearer approach too, though it would still need to exclude duplicates when fetching overlapping data.
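
A rough sketch of how that could look, assuming a date-based filter and an in-memory stand-in for the API. `RECORDS`, `fetch`, and `iterate` are hypothetical names used only for illustration; they are not the client's real interface. Duplicates are still filtered against the previous batch only, so this shares the ordering assumption discussed above.

```python
# Hypothetical data: (id, date) pairs, sorted by date.
RECORDS = [("A1", "A"), ("B1", "B"), ("B2", "B"), ("B3", "B"), ("C1", "C")]


def fetch(since, limit):
    """Stand-in for an API request: records with date >= since, capped at limit."""
    return [record for record in RECORDS if record[1] >= since][:limit]


def iterate(start, limit=2):
    """Yield each ID once, growing the batch size when the filter stops advancing."""
    since = start
    last_batch_ids = set()
    while True:
        batch = fetch(since, limit)
        for obj_id, _date in batch:
            # Overlapping data: skip anything already seen in the previous batch.
            if obj_id not in last_batch_ids:
                yield obj_id
        if len(batch) < limit:
            return  # fewer results than requested, so nothing is left
        last_batch_ids = {obj_id for obj_id, _date in batch}
        next_since = batch[-1][1]
        if next_since == since:
            # The next request would use identical filter parameters:
            # we are in a cycle, so widen the batch to get past this date.
            limit *= 2
        since = next_since


result = list(iterate("A"))
```

With a starting batch size of 2, the three records dated B cannot fit in one batch, so the filter stays at B, the batch size doubles to 4, and iteration moves on to C.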

> Another possible way to break the cycle would be to compare the filter parameters. If they are identical then we would increase the batch size.

I think this is a great idea. Not only would it work regardless of sorting, it would also pre-empt the problem. In Ethan's example, in Batch 2, when we get to B5 we already know we must increase the batch size. If we keep going through B6 and B7 to C1, then Batch 3 will start at C1 at worst, and D1 at best.

We would want a max batch size. If we reach it, we would know we have an unrecoverable problem, so it is probably best to make it high.
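
As a sketch of the cap, assuming the batch size is doubled each time a cycle is detected: the `grow_batch_size` helper, the `BatchSizeLimitError` exception, and the default of 10,000 are all hypothetical and chosen only for illustration.

```python
class BatchSizeLimitError(Exception):
    """Raised when the batch size cannot grow any further (hypothetical)."""


def grow_batch_size(current, maximum=10_000, factor=2):
    """Return the next batch size, or raise once the cap has been reached."""
    if current >= maximum:
        # Already at the cap and still stuck: treat as unrecoverable.
        raise BatchSizeLimitError(
            "still cycling at the maximum batch size of %d" % maximum
        )
    return min(current * factor, maximum)
```

Clamping to `maximum` gives one last attempt at the cap before the error is raised on the following call.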

snopoke added 2 commits October 24, 2019 17:10

make it an int

5e8a4a0

detect when we are stuck in a loop and increase batch size to break out

b86111d

snopoke requested a review from esoergel October 24, 2019 15:34

esoergel approved these changes Oct 24, 2019


snopoke closed this Nov 27, 2019

snopoke deleted the sk/modify-batch-size branch April 21, 2022 09:37


esoergel Oct 24, 2019

snopoke Oct 25, 2019

snopoke Oct 25, 2019

esoergel Oct 25, 2019

kaapstorm Jun 23, 2020
