Use inserted_at to determine which forms/cases need updating #159

orangejenny · 2020-09-27T15:22:44Z

https://dimagi-dev.atlassian.net/browse/USH-277

A client is reporting that occasionally properties aren't syncing correctly. @calellowitz helpfully came up with a theory that this is due to ES lag:

case updated at 9:27
DET run at 9:30 (causes checkpoint at 9:30)
case updated in ES at 9:32 (date modified is still 9:27)
next DET run checks only for cases modified after 9:30

The client is running the DET frequently, every 5 minutes.

This PR changes the date filtering to use inserted_at, the time the pillow inserted the item, instead of the server modified time. I don't think this will cause the exported last modified date to change, since it only updates the filters for getting forms/cases and then the pagination.
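A minimal sketch of the failure mode, for illustration only. It assumes the checkpoint is taken from the last document seen rather than from the time of the run, and that a second case on a faster partition is indexed before the lagging one. All names (`Doc`, `fetch_since`, `run_twice`) and the second case's timestamps are hypothetical and are not part of commcare-export.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Doc:
    doc_id: str
    modified: datetime    # server modified time, set when the case is saved
    indexed_on: datetime  # time the pillow wrote the doc into ES


def t(hhmm):
    """Build a timestamp on an arbitrary example day."""
    return datetime.strptime("2020-09-23 " + hhmm, "%Y-%m-%d %H:%M")


def fetch_since(docs, now, since, key):
    """Return docs visible in ES at `now` whose `key` is after `since`, sorted by `key`."""
    visible = [d for d in docs if d.indexed_on <= now]
    hits = [d for d in visible if since is None or getattr(d, key) > since]
    return sorted(hits, key=lambda d: getattr(d, key))


def run_twice(docs, key):
    """Simulate two DET runs (9:30 and 9:35) that checkpoint on `key`."""
    first = fetch_since(docs, t("09:30"), None, key)
    checkpoint = getattr(first[-1], key)  # value from the last doc seen
    second = fetch_since(docs, t("09:35"), checkpoint, key)
    return [d.doc_id for d in first + second]


docs = [
    Doc("case-a", modified=t("09:27"), indexed_on=t("09:32")),  # lagging partition
    Doc("case-b", modified=t("09:29"), indexed_on=t("09:29")),  # fast partition
]

legacy = run_twice(docs, "modified")     # case-a is never exported
indexed = run_twice(docs, "indexed_on")  # case-a is picked up by the second run
print(legacy, indexed)
```

Filtering on the modified date skips `case-a` because its modified time is already behind the checkpoint by the time it becomes visible; filtering on `indexed_on` cannot skip it, since the value is assigned when the document becomes visible.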

This will cause all cases/forms to be resynced if the case or xform mappings change, although those mappings are quite stable. I'd guess it'll also cause resyncing when ES is upgraded, which might be a performance problem for us and/or for clients, although planning for that seems better than living with the existing bug.

snopoke

@orangejenny ES lag should only be an issue if one Kafka partition is ahead of another. The DET uses the modification time of the last seen document as the filter value for the next batch and not the 'time of run'.

Either way I think this is a good change if we are OK with the reindexing issue.

Also FYI:

commcare_export/commcare_minilinq.py

orangejenny · 2020-09-29T14:33:15Z

ES lag should only be an issue if one Kafka partition is ahead of another.

@snopoke How often does that happen? Do you think that makes it less plausible that this change will address the reported issue?

snopoke · 2020-09-29T14:47:45Z

Looking at Datadog it seems to happen often enough that it could certainly be an issue: https://app.datadoghq.com/notebook/278293/change-lag-by-kafka-partition-case-form-pillows

orangejenny · 2020-09-29T16:42:15Z

@snopoke Thank you! There's one specific example from the client, which is a case that the client's sql db says was last modified 2020-09-20 17:41, but the real last modified date, from the API, is 2020-09-23 17:49. Moving the notebook's date range to the afternoon of the 23rd, there's a 2.5-hour lag for cases starting around 18:15, so it does seem plausible that lag caused that case update to get missed.

snopoke · 2020-09-30T07:28:14Z

I think I linked the wrong PR in my previous comment. Here's the correct link: #124

snopoke

Build still failing

commcare_export/commcare_minilinq.py

orangejenny · 2020-10-01T20:20:24Z

still working out some test failures

czue · 2021-01-23T14:41:42Z

@czue on that point, what do you think of using the old sorting for any existing tables and automatically using the new sorting for any new table as well as any time you pass in --start-over? That would allow us to roll it out in one go though it does make it a bit less visible to users.

This seems like a great idea to me. I can't imagine any scenario where users would want visibility into this weird internal behavior, so hiding it seems like better overall UX.

snopoke · 2021-01-27T13:12:00Z

@czue @orangejenny this is ready for review. Changes as follows:

- make pagination mode configurable within code
- store pagination mode in the checkpoint
- decide at start up which mode to use based on command line args and previous checkpoints
  - existing tables will continue with the previous pagination mode unless --start-over or --since are used
  - new tables will always use the new mode
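The start-up decision described in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the enum members, the function name `choose_pagination_mode`, and its parameters are hypothetical, and treating a checkpoint with no stored mode as legacy is an assumption.

```python
from enum import Enum
from typing import Optional


class PaginationMode(Enum):
    # Hypothetical member names, chosen for this sketch only.
    LEGACY = "legacy"          # sort/filter by the server modified date
    INDEXED_ON = "indexed_on"  # sort/filter by indexed_on


def choose_pagination_mode(
    start_over: bool,
    since: Optional[str],
    has_checkpoint: bool,
    checkpoint_mode: Optional[PaginationMode],
) -> PaginationMode:
    """Pick the pagination mode from CLI arguments and the previous checkpoint."""
    # --start-over or --since: previous state is ignored, so use the new mode.
    if start_over or since is not None:
        return PaginationMode.INDEXED_ON
    # New table (no checkpoint yet): always use the new mode.
    if not has_checkpoint:
        return PaginationMode.INDEXED_ON
    # Existing table: keep whatever mode the checkpoint recorded.
    # Assumption: a checkpoint with no recorded mode predates this change.
    return checkpoint_mode or PaginationMode.LEGACY


# An existing table whose checkpoint has no stored mode keeps the old behavior.
print(choose_pagination_mode(False, None, True, None))
```

Keeping the mode in the checkpoint means an existing table never switches sort keys in the middle of an incremental export; it only moves to the new mode when the export is restarted.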

czue

Wow, this looks great.

Caveats: I only reviewed code from 0f76ab9, and some of the changes to the tests made my head hurt a bit and I didn't fully take the time to understand them (but mostly made sense/looked good).

No need to change this - but I was wondering if we can safely flip from legacy to indexed_on in a regular checkpointed update. I think you would still sort/filter by the old value, but then theoretically you could use the indexed_on of the last doc to start the next round. There's probably an edge case in there that doesn't make it perfect (two docs, one has higher modified on and the other has higher indexed on that happen to fall right at the barrier?). You could also use an earlier indexed_on, but then you run the risk of doing extra work the next time around... 🤷‍♂️

commcare_export/commcare_minilinq.py

czue · 2021-01-27T13:53:01Z

commcare_export/commcare_minilinq.py

-SUPPORTED_RESOURCES = {
-    'form', 'case', 'user', 'location', 'application', 'web-user'
-}
+class FormFilterSinceParams(object):


(assume this was just restoring the old class)

czue · 2021-01-27T14:05:48Z

tests/test_cli.py

@@ -421,7 +429,7 @@ class MockCheckpointingClient(CommCareHqClient):
    to return mocked data.

    Note this client needs to be re-initialized after use."""
-    def __init__(self, mock_data):
+    def     __init__(self, mock_data):


was this intentional?

(I think it got fixed in a later commit also)

czue · 2021-01-27T14:06:58Z

tests/test_cli.py

+        try:
+            objects = mock_requests.pop(key)
+        except KeyError:
+            print(mock_requests.keys())


nit: if this is intentional could consider adding a comment explaining why it's being printed out. print statements usually trigger a warning in my brain.

not intentional, I'll remove it. Thanks

czue · 2021-01-27T14:14:01Z

tests/test_cli.py

+        self._check_checkpoint(checkpoint_manager, '2012-04-24T05:13:01', 'doc 2')
+
+    def test_cli_pagination_since(self, writer, all_db_checkpoint_manager):
+        """Test that we use to the new pagination mode when using 'since'"""


this confused me for a moment since it seems odd that you'd ever have a checkpoint and then manually override it. what's the use case for that?

not sure if there is a use case but it's not disallowed so just wanted to test it. I also realized now that if you pass in --since or --until then we don't do checkpointing at all. I'm going to update the test to reflect that.

czue

Oops, sorry meant to approve.

proteusvacuum

Very nice! Thanks for pushing this past the finish line.

snopoke · 2021-01-27T15:06:03Z

No need to change this - but I was wondering if we can safely flip from legacy to indexed_on in a regular checkpointed update. I think you would still sort/filter by the old value, but then theoretically you could use the indexed_on of the last doc to start the next round. There's probably an edge case in there that doesn't make it perfect (two docs, one has higher modified on and the other has higher indexed on that happen to fall right at the barrier?). You could also use an earlier indexed_on, but then you run the risk of doing extra work the next time around...

This is what I was suggesting here: #159 (comment) but it's not straightforward to get the ID of the last doc from tables since they may not have a suitable column to sort on and the ID may also not be just the doc ID (in the case of exporting form repeats or cases from forms). We'd also then need to look up that doc in the API to get the indexed_on date. It's not impossible but I think it would at minimum require an intermediate release to ensure that the doc ID and date column are present in all tables.

czue

thanks for follow ups!

czue · 2021-01-28T13:58:51Z

🎊

mikecjohn

The link to the wiki in the log warning message points to a page that does not exist. Where did you want it to go? It currently points to: "https://wiki.commcarehq.org/display/commcarepublic/CommCare+Export+Tool+Release+Notes"

tobiasmcnulty · 2021-02-25T20:03:30Z

The wiki and pypi all seem to refer back to GitHub Releases for release notes:

czue · 2021-02-26T07:11:16Z

Oops, thanks for the heads up! Confirming the last link is the correct one. #177

orangejenny added 2 commits September 27, 2020 11:05

Replaced server_date_modified with inserted_at for cases

6b9b3f9

Replaced server_modified_on with inserted_at for forms

9321587

orangejenny requested review from snopoke and calellowitz September 27, 2020 15:22

snopoke requested changes Sep 28, 2020

commcare_export/commcare_minilinq.py

Simplified FormFilterSinceParams query

39c06df

Replaced other usage of server_modified_missing

dbfb8fa

snopoke mentioned this pull request Sep 30, 2020

increase limit for case api dimagi/commcare-hq#28461

Merged

Removed FormFilterSinceParams

ab9a9f4

snopoke requested changes Oct 1, 2020

commcare_export/commcare_minilinq.py

Removed unneeded secondary sort

7f03f74

orangejenny force-pushed the jls/inserted_at branch from e345ad8 to 1a3261b Compare October 1, 2020 19:39

orangejenny force-pushed the jls/inserted_at branch 7 times, most recently from 4f02fa0 to 8d5ea26 Compare October 3, 2020 18:07

orangejenny added 2 commits October 3, 2020 14:08

Removed received_on from mock clients now that's it's not used to sort

6db6e91

Updated FakeDateFormSession now that FormFilterSinceParams is gone

9438bf3

orangejenny force-pushed the jls/inserted_at branch from 8d5ea26 to 6dbad16 Compare October 3, 2020 18:10

orangejenny added 2 commits October 3, 2020 14:10

Updated xlsx queries to use indexed_on instead of server dates

2f94eeb

Replaced inserted_at, the ES name, with indexed_on, the API name

14baad6

orangejenny force-pushed the jls/inserted_at branch from 6dbad16 to 14baad6 Compare October 3, 2020 18:10

snopoke added 7 commits January 25, 2021 14:23

refactor paginators & params

0f76ab9

configurable pagination mode

5838363

support old and new pagination modes

7a00b74

Merge branch 'master' into jls/inserted_at

4d78850

store and retrieve pagination mode from checkpoint

a9c2b8c

test pagination at CLI level

5c4b2b0

add log warning

8d9d98d

snopoke requested review from czue and removed request for snopoke and calellowitz January 27, 2021 13:09

czue reviewed Jan 27, 2021

czue approved these changes Jan 27, 2021

proteusvacuum approved these changes Jan 27, 2021

snopoke added 4 commits January 27, 2021 17:06

remove whitespace

eb23b7f

remove checkpoint manager when using 'since'

8f57ed9

remove print

67f0624

allow checkpointing with since / until in integration tests

f7db1c9

snopoke approved these changes Jan 27, 2021

czue approved these changes Jan 28, 2021

snopoke merged commit 9f82c71 into master Jan 28, 2021

snopoke deleted the jls/inserted_at branch January 28, 2021 13:03

tobiasmcnulty mentioned this pull request Feb 10, 2021

pin commcare-export==1.5.0 caktus/commcare-utilities#54

Merged

mikecjohn reviewed Feb 25, 2021

snopoke mentioned this pull request May 20, 2021

Switch to using date added to ES #107

Closed
