Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use inserted_at to determine which forms/cases need updating #159

Merged
merged 32 commits into from
Jan 28, 2021

Conversation

orangejenny
Copy link
Contributor

https://dimagi-dev.atlassian.net/browse/USH-277

A client is reporting that occasionally properties aren't syncing correctly. @calellowitz helpfully came up with a theory that this is due to ES lag:

  1. case updated at 9:27
  2. DET run at 9:30 (causes checkpoint at 9:30)
  3. case updated in ES at 9:32 (date modified is still 9:27)
  4. next DET run checks only for cases modified after 9:30

The client is running the DET frequently, every 5 minutes.

This PR changes the date filtering to use inserted_at, the time the pillow inserted the item, instead of the server modified time. I don't think this will cause the exported last modified date to change, since it only updates the filters for getting forms/cases and then the pagination.

This will cause all case/forms to be resynced if the case or xform mappings change, although those mappings are quite stable. I'd guess it'll also cause resyncing when ES is upgraded, which might be a performance problem for us and/or for clients, although planning for that that seems better than living with the existing bug.

Copy link
Contributor

@snopoke snopoke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orangejenny ES lag should only be an issue if one Kafka partition is ahead of the another. The DET uses the modification time of the last seen document as the filter value for the next batch and not the 'time of run'.

Either way I think this is a good change if we are OK with the reindexing issue.

Also FYI:

commcare_export/commcare_minilinq.py Outdated Show resolved Hide resolved
@orangejenny
Copy link
Contributor Author

ES lag should only be an issue if one Kafka partition is ahead of the another.

@snopoke How often does that happen? Do you think that makes it less plausible that this change will address the reported issue?

@snopoke
Copy link
Contributor

snopoke commented Sep 29, 2020

Looking at Datadog it seems to happen often enough that it could certainly be an issue: https://app.datadoghq.com/notebook/278293/change-lag-by-kafka-partition-case-form-pillows

@orangejenny
Copy link
Contributor Author

@snopoke Thank you! There's one specific example from the client, which is a case that the client's sql db says was last modified 2020-09-20 17:41, but the real last modified date, from the API, is 2020-09-23 17:49. Moving the notebook's date range to the afternoon of the 23rd, there's a 2.5-hour lag for cases starting around 18:15, so it does seem plausible that lag caused that case update to get missed.

@snopoke
Copy link
Contributor

snopoke commented Sep 30, 2020

I think I linked the wrong PR in my previous comment. Here's the correct link: #124

Copy link
Contributor

@snopoke snopoke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build still failing

commcare_export/commcare_minilinq.py Outdated Show resolved Hide resolved
@orangejenny
Copy link
Contributor Author

still working out some test failures

@orangejenny orangejenny force-pushed the jls/inserted_at branch 7 times, most recently from 4f02fa0 to 8d5ea26 Compare October 3, 2020 18:07
@czue
Copy link
Member

czue commented Jan 23, 2021

@czue on that point, what do you think of using the old sorting for any existing tables and automatically using the new sorting for any new table as well as any time you pass in --start-over? That would allow us to roll it out in one go though it does make it a bit less visible to users.

This seems like a great idea to me. I can't imagine any scenario where users would want visibility into this weird internal behavior, so hiding it seems like better overall UX.

@snopoke snopoke requested review from czue and removed request for snopoke and calellowitz January 27, 2021 13:09
@snopoke
Copy link
Contributor

snopoke commented Jan 27, 2021

@czue @orangejenny this is ready for review. Changes as follows:

  • make pagination mode configurable within code
  • store pagination mode in the checkpoint
  • decide at start up which mode to use based on command line args and previous checkpoints
    • existing tables will continue with the previous pagination mode unless --start-over or --since are used
    • new tables will always use the new mode

Copy link
Member

@czue czue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this looks great.

Caveats: I only reviewed code from 0f76ab9, and some of the changes to the tests made my head hurt a bit and I didn't fully take the time to understand them (but mostly made sense/looked good).

No need to change this - but I was wondering if we can safely flip from legacy to indexed_on in a regular checkpointed update. I think you would still sort/filter by the old value, but then theoretically you could use the indexed_on of the last doc to start the next round. There's probably an edge case in there that doesn't make it perfect (two docs, one has higher modified on and the other has higher indexed on that happen to fall right at the barrier?). You could also use an earlier indexed_on, but then you run the risk of doing extra work the next time around... 🤷‍♂️

commcare_export/commcare_minilinq.py Outdated Show resolved Hide resolved
SUPPORTED_RESOURCES = {
'form', 'case', 'user', 'location', 'application', 'web-user'
}
class FormFilterSinceParams(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(assume this was just restoring the old class)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -421,7 +429,7 @@ class MockCheckpointingClient(CommCareHqClient):
to return mocked data.

Note this client needs to be re-initialized after use."""
def __init__(self, mock_data):
def __init__(self, mock_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this intentional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think it got fixed in a later commit also)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try:
objects = mock_requests.pop(key)
except KeyError:
print(mock_requests.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if this is intentional could consider adding a comment explaining why it's being printed out. print statements usually trigger a warning in my brain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not intentional, I'll remove it. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._check_checkpoint(checkpoint_manager, '2012-04-24T05:13:01', 'doc 2')

def test_cli_pagination_since(self, writer, all_db_checkpoint_manager):
"""Test that we use to the new pagination mode when using 'since'"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this confused for a moment since it seems odd that you'd ever have a checkpoint and then manually override it. what's the use case for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if there is a use case but it's not disallowed so just wanted to test it. I also realized now that if you pass in --since or --until then we don't do checkpointing at all. I'm going to update the test to reflect that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@czue czue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry meant to approve.

Copy link
Contributor

@proteusvacuum proteusvacuum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Thanks for pushing this past the finish line.

@snopoke
Copy link
Contributor

snopoke commented Jan 27, 2021

No need to change this - but I was wondering if we can safely flip from legacy to indexed_on in a regular checkpointed update. I think you would still sort/filter by the old value, but then theoretically you could use the indexed_on of the last doc to start the next round. There's probably an edge case in there that doesn't make it perfect (two docs, one has higher modified on and the other has higher indexed on that happen to fall right at the barrier?). You could also use an earlier indexed_on, but then you run the risk of doing extra work the next time around...

This is what I was suggesting here: #159 (comment) but it's not straight forward to get the ID of the last doc from tables since they may not have a suitable column to sort on and the ID may also not be just the doc ID (in the case of exporting form repeats or cases from forms). We'd also then need to look up that doc in the API to get the indexed_on date. It's not impossible but I think it would at minimum require an intermediate release to ensure that the doc ID and date column are present in all tables.

Copy link
Member

@czue czue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for follow ups!

@snopoke snopoke merged commit 9f82c71 into master Jan 28, 2021
@snopoke snopoke deleted the jls/inserted_at branch January 28, 2021 13:03
@czue
Copy link
Member

czue commented Jan 28, 2021

🎊

Copy link

@mikecjohn mikecjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link in the log warning message to the Wiki does not exist, where did you want it to go? It currently points to: "https://wiki.commcarehq.org/display/commcarepublic/CommCare+Export+Tool+Release+Notes"

@tobiasmcnulty
Copy link

@czue
Copy link
Member

czue commented Feb 26, 2021

Oops, thanks for the heads up! Confirming the last link is the correct one. #177

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants