Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/backfill duplicate handling #72

Merged
merged 6 commits into from
Feb 28, 2024
Merged

Conversation

adekbadek
Copy link
Member

@adekbadek adekbadek commented Feb 27, 2024

All Submissions:

Changes proposed in this Pull Request:

Fixes two issues:

  1. If and event is not persisted in the event log, it should not be processed.
  2. When backfilling reader_registration events, disregard current_page_url metadata.

By accident this branch also contains #71, but it won't hurt to add that to the alpha release.

How to test the changes in this Pull Request:

  1. On alpha,
  2. Trigger a new reader_registered event by registering as a new reader on the site
  3. Observe in the Event Log that the event has a metadata.current_page_url key in the data property
  4. Run the backfill for reader_registered events (e.g. wp newspack-network data-backfill reader_registered --start=2024-02-26 --end=2024-12-31 --verbose)
  5. Observe a new reader_registered event added, even though a reader_registered for this reader is already present
  6. Observe that the new event does not have the metadata.current_page_url key
  7. Switch to this branch, repeat the testing steps, observe the duplicate event is not added

Other information:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@adekbadek adekbadek force-pushed the fix/backfill-duplicate-handling branch from 8f51d4c to 58d0fba Compare February 27, 2024 09:12
@adekbadek adekbadek changed the base branch from trunk to alpha February 27, 2024 10:25
Copy link
Contributor

@chickenn00dle chickenn00dle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and looks good for the most part. I am running into one issue while testing though. With these changes checked out, I see that all events will trigger exactly one duplicate before being ignore by the data backfill script.

For example, Here I have created 3 users (test4, test5, and test6@example.com), and each have triggered an initial reader_registered event:

Screenshot 2024-02-27 at 10 28 14

When I run the backfill script the first time, each one triggers an additional reader_registered event:

Screenshot 2024-02-27 at 10 30 12

It isn't until the second time I run this command that duplicates are actually ignored:

Screenshot 2024-02-27 at 10 30 54

Is this expected?

@adekbadek
Copy link
Member Author

Not expected, you've uncovered a shortcoming of this solution. It assumes the only metadata missing will be current_page_url, but in your example there's also a referer metadata key. The solution should be to disregard all data. 62fdf8d should fix it.

Copy link
Contributor

@chickenn00dle chickenn00dle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm still running into an issue, although the behavior has slightly changed. This time, everything but the first of a batch of user registrations is being duplicated. For example, here is test7, 8, and 9. Running the backfill script creates duplicate events for test 8 and 9:

Screenshot 2024-02-27 at 10 57 01

@adekbadek adekbadek force-pushed the fix/backfill-duplicate-handling branch from ca136ae to ee7be33 Compare February 27, 2024 19:34
@adekbadek
Copy link
Member Author

This is a bigger issue than I thought! The code is comparing timestamps. The event will be saved a few milliseconds after the actual registration. Thus when looking for an event at time of the registration, it won't be found. Fixed in ee7be33

Copy link
Contributor

@chickenn00dle chickenn00dle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @adekbadek!

This is working now so gonna give a thumbs up.

I did have one question/nit which is can we simply disregard/drop timestamp as well if the only relevant information we need to consider are email and date? Definitely not a blocker though and something we can revisit if necessary.

@adekbadek
Copy link
Member Author

Good suggestion! For registrations that makes sense, because there should be just one per email. The timestamp-range solution should also work for other events, though, and for these disregarding the timestamp altogether is not desirable. There might be many of them, with only the timestamp as differentiator (e.g. donations).

@adekbadek adekbadek merged commit bbce13b into alpha Feb 28, 2024
3 checks passed
@adekbadek adekbadek deleted the fix/backfill-duplicate-handling branch February 28, 2024 07:15
matticbot pushed a commit that referenced this pull request Feb 28, 2024
# [1.3.0-alpha.3](v1.3.0-alpha.2...v1.3.0-alpha.3) (2024-02-28)

### Bug Fixes

* backfill duplicate handling; woo links in wp-admin-bar ([#71](#71), [#72](#72)) ([bbce13b](bbce13b))
@matticbot
Copy link
Contributor

🎉 This PR is included in version 1.3.0-alpha.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

matticbot pushed a commit that referenced this pull request Feb 28, 2024
# [1.3.0](v1.2.0...v1.3.0) (2024-02-28)

### Bug Fixes

* backfill duplicate handling; woo links in wp-admin-bar ([#71](#71), [#72](#72)) ([bbce13b](bbce13b))
* cli commands ([#73](#73)) ([ff563ac](ff563ac))
* memberships sync ([#63](#63)) ([0a54f1d](0a54f1d))
* missing woocommerce-memberships handling ([76dbdf7](76dbdf7))

### Features

* add hub bookmarks to nodes ([#56](#56)) ([54700a0](54700a0))
* add option to manually sync users  ([#53](#53)) ([3ec1b19](3ec1b19))
* display membership plans from the network ([#59](#59)) ([6fa01fd](6fa01fd))
* enhance network subscriptions and orders search ([#55](#55)) ([bcb0615](bcb0615))
* Node connection using a link ([#58](#58)) ([721f8b2](721f8b2))
* **wp-cli:** add command to backfill events ([#51](#51)) ([13d2498](13d2498))
* **wp-cli:** add process pending webhook requests command ([#67](#67)) ([7dbd8dc](7dbd8dc))
* **wp-cli:** sync all events from the Hub ([#65](#65)) ([dc595ca](dc595ca))
@matticbot
Copy link
Contributor

🎉 This PR is included in version 1.3.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@matticbot
Copy link
Contributor

🎉 This PR is included in version 1.4.0-alpha.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@matticbot
Copy link
Contributor

🎉 This PR is included in version 1.5.0-alpha.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@matticbot
Copy link
Contributor

🎉 This PR is included in version 1.5.0-epic-data-integrity-improvements.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@matticbot
Copy link
Contributor

🎉 This PR is included in version 1.5.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants