Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent saving duplicate rows in spec11 pipeline #1810

Merged
merged 2 commits into from
Dec 15, 2022

Conversation

sarahcaseybot
Copy link
Collaborator

@sarahcaseybot sarahcaseybot commented Oct 7, 2022

This change is Reviewable

@sarahcaseybot sarahcaseybot requested a review from weiminyu October 7, 2022 19:16
Copy link
Collaborator

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @sarahcaseybot)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

                            .setDomainRepoId(input.getKey().domainRepoId())
                            .setRegistrarId(input.getKey().registrarId())
                            .setId(IdService.allocateId())

IdService is not working correctly on BEAM right now. Each worker has their own sequence generator so there will be collisions.

If we do this, we need to wait until IdService is fixed.

Code quote:

.setId(IdService.allocateId()

core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 184 at r1 (raw file):

                }));

    spec11ThreatMatches.apply("Prevent Fusing", Reshuffle.viaRandomKey());

spec11ThreatMatches = spec11ThreatMatches.apply(...)

The apply method returns a new instance.

It may read better to switch order with the transformId declaration and chain the remaining transformations here.
This way we won't need this assignment.

Code quote:

spec11Threat

core/src/test/java/google/registry/beam/spec11/Spec11PipelineTest.java line 291 at r1 (raw file):

        .transact(
            () -> {
              ImmutableList<Spec11ThreatMatch> sqlThreatMatches =

This name shadows the instance variable. Better rename it.

Code quote:

sqlThreatMatches

@sarahcaseybot sarahcaseybot force-pushed the fixSpec11DuplicateRows branch 2 times, most recently from 736170b to 458040e Compare October 14, 2022 18:42
Copy link
Collaborator Author

@sarahcaseybot sarahcaseybot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

IdService is not working correctly on BEAM right now. Each worker has their own sequence generator so there will be collisions.

If we do this, we need to wait until IdService is fixed.

Ok, I will wait for b/201547855 to be fixed before merging this. Or do you recommend I do this another way?


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 184 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

spec11ThreatMatches = spec11ThreatMatches.apply(...)

The apply method returns a new instance.

It may read better to switch order with the transformId declaration and chain the remaining transformations here.
This way we won't need this assignment.

Done.


core/src/test/java/google/registry/beam/spec11/Spec11PipelineTest.java line 291 at r1 (raw file):

Previously, weiminyu (Weimin Yu) wrote…

This name shadows the instance variable. Better rename it.

Done.

@sarahcaseybot sarahcaseybot added the do not merge Do not merge this PR. label Oct 14, 2022
@jianglai jianglai requested a review from weiminyu October 14, 2022 18:46
Copy link
Collaborator

@jianglai jianglai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, sarahcaseybot wrote…

Ok, I will wait for b/201547855 to be fixed before merging this. Or do you recommend I do this another way?

I'm working on making it possible to bring your own ID allocation provider in #1809.

Given that it probably will be a while before we can start using SQL to allocate the ID, I wonder if we should look into calling the non-GAE datastore API here to allocate the ID in Beam. If that works, we might just migrate to that, for the Beam pipeline, and for the main application, which will make it less important (or urgent) to use SQL to allocate IDs.

@jianglai jianglai requested a review from ptkach October 17, 2022 17:48
Copy link
Collaborator

@jianglai jianglai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @ptkach and @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, jianglai (Lai Jiang) wrote…

I'm working on making it possible to bring your own ID allocation provider in #1809.

Given that it probably will be a while before we can start using SQL to allocate the ID, I wonder if we should look into calling the non-GAE datastore API here to allocate the ID in Beam. If that works, we might just migrate to that, for the Beam pipeline, and for the main application, which will make it less important (or urgent) to use SQL to allocate IDs.

@ptkach FYI. If switching to SQL sequence is going to take some time, we should look into if we can use the non-GAE datastore API to allocate IDs to unblock this use case.

@ptkach
Copy link
Collaborator

ptkach commented Oct 18, 2022

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @ptkach and @weiminyu)

core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, jianglai (Lai Jiang) wrote…
@ptkach FYI. If switching to SQL sequence is going to take some time, we should look into if we can use the non-GAE datastore API to allocate IDs to unblock this use case.

@jianglai @CydeWeys I'm looking at this whole allocate id thing, however there seems to be a lot of history and complexity to it, so it's gonna take time for me to fully assess the problem. However, maybe a dumb question, but still - is there any reason we don't want to update data type and start using UUID generator? This is how it commonly done in similar circumstances?

@jianglai jianglai requested a review from CydeWeys October 18, 2022 22:06
Copy link
Collaborator

@jianglai jianglai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, ptkach (Pavlo Tkach) wrote…

@jianglai @CydeWeys I'm looking at this whole allocate id thing, however there seems to be a lot of history and complexity to it, so it's gonna take time for me to fully assess the problem. However, maybe a dumb question, but still - is there any reason we don't want to update data type and start using UUID generator? This is how it commonly done in similar circumstances?

Isn't UUID a combination of letters and numbers? We need a long allocated in this case.

Copy link
Collaborator

@jianglai jianglai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, jianglai (Lai Jiang) wrote…

Isn't UUID a combination of letters and numbers? We need a long allocated in this case.

Also, perhaps more importantly, the ID must be generated from the server side to prevent collision.

@ptkach
Copy link
Collaborator

ptkach commented Oct 19, 2022

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)

core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, jianglai (Lai Jiang) wrote…
Also, perhaps more importantly, the ID must be generated from the server side to prevent collision.

I realize it's currently long and UUID is a string 😀, which is why I mentioned "update data type". Sorry I wasn't clear enough. I meant update all the places where the data type is expected as long to actually accept a string UUID, then update schemas and that should probably do the trick.

By random UUID generator I suppose we can use java.util.UUID. randomUUID which is immutable and provides enough entropy for it to be thread-safe. So that should cover beam case.

Am I missing something?

Copy link
Collaborator

@jianglai jianglai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, ptkach (Pavlo Tkach) wrote…

I realize it's currently long and UUID is a string 😀, which is why I mentioned "update data type". Sorry I wasn't clear enough. I meant update all the places where the data type is expected as long to actually accept a string UUID, then update schemas and that should probably do the trick.

By random UUID generator I suppose we can use java.util.UUID. randomUUID which is immutable and provides enough entropy for it to be thread-safe. So that should cover beam case.

Am I missing something?

Hmmm, I'm not sure if we would want to entertain the idea of changing the primary ID type for most of the entities... It seems like a risky migration that would undoubtedly require some downtime to perform. It certainly warrants some discussion to weight the pros and cons.

But still, as I mentioned earlier, I doubt an client-side allocation scheme would ever work, due to the chance of collision between multiple instances. We are not just talking about thread safety here, but rather distributed workers that all try to allocate globally unique IDs at the same time. This applies to Beam works, or just multiple Nomulus instances serving different clients concurrently. Is this achievable without some sort of client-server coordination (which is what Datastore gives us)?

Copy link
Collaborator

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 2 unresolved discussions (waiting on @CydeWeys and @ptkach)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, jianglai (Lai Jiang) wrote…

Hmmm, I'm not sure if we would want to entertain the idea of changing the primary ID type for most of the entities... It seems like a risky migration that would undoubtedly require some downtime to perform. It certainly warrants some discussion to weight the pros and cons.

But still, as I mentioned earlier, I doubt an client-side allocation scheme would ever work, due to the chance of collision between multiple instances. We are not just talking about thread safety here, but rather distributed workers that all try to allocate globally unique IDs at the same time. This applies to Beam works, or just multiple Nomulus instances serving different clients concurrently. Is this achievable without some sort of client-server coordination (which is what Datastore gives us)?

UUID in practice can run into bad states with lots of collisions. See ramsey/uuid#80, at some point someone saw hundreds of collisions out of one million calls. And someone reported this as late as 09/2020.

@ptkach
Copy link
Collaborator

ptkach commented Oct 20, 2022

Reviewable status: 0 of 3 files reviewed, 2 unresolved discussions (waiting on @CydeWeys and @ptkach)

core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, jianglai (Lai Jiang) wrote…
UUID in practice can run into bad states with lots of collisions. See ramsey/uuid#80, at some point someone saw hundreds of collisions out of one million calls. And someone reported this as late as 09/2020.

UUID generator can not run into a collision in our case, it's just not possible. According to wiki - 50% probability of at least one collision can be achieved (after) ... generating 1 billion UUIDs per second for about 85 years. The issue mentioned is just implementational bug for this particular library. Same issue can happen with datastore's allocateId implementation, postgres sequence nextVal or any other implementation

Copy link
Member

@CydeWeys CydeWeys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 2 unresolved discussions (waiting on @ptkach and @weiminyu)


core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java line 178 at r1 (raw file):

Previously, ptkach (Pavlo Tkach) wrote…

UUID generator can not run into a collision in our case, it's just not possible. According to wiki - 50% probability of at least one collision can be achieved (after) ... generating 1 billion UUIDs per second for about 85 years. The issue mentioned is just implementational bug for this particular library. Same issue can happen with datastore's allocateId implementation, postgres sequence nextVal or any other implementation

For the record, changing all IDs across our codebase from longs to Strings would be a lot of work that we simply don't want to take on right now.

@@ -39,7 +40,7 @@
@Index(name = "spec11threatmatch_tld_idx", columnList = "tld"),
@Index(name = "spec11threatmatch_check_date_idx", columnList = "checkDate")
})
public class Spec11ThreatMatch extends ImmutableObject implements Buildable {
public class Spec11ThreatMatch extends ImmutableObject implements Buildable, Serializable {

Check failure

Code scanning / CodeQL

No clone method

No clone method, yet implements Cloneable.
@sarahcaseybot sarahcaseybot removed the do not merge Do not merge this PR. label Dec 14, 2022
@sarahcaseybot
Copy link
Collaborator Author

Tested this in alpha and the pipeline succeeded. However, spec11 only returns an empty file in alpha.

Copy link
Collaborator

@weiminyu weiminyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ptkach and @sarahcaseybot)

@sarahcaseybot sarahcaseybot merged commit 4ede5f0 into google:master Dec 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants