-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent saving duplicate rows in spec11 pipeline #1810
Prevent saving duplicate rows in spec11 pipeline #1810
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @sarahcaseybot)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
.setDomainRepoId(input.getKey().domainRepoId()) .setRegistrarId(input.getKey().registrarId()) .setId(IdService.allocateId())
IdService is not working correctly on BEAM right now. Each worker has their own sequence generator so there will be collisions.
If we do this, we need to wait until IdService is fixed.
Code quote:
.setId(IdService.allocateId()
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 184 at r1 (raw file):
})); spec11ThreatMatches.apply("Prevent Fusing", Reshuffle.viaRandomKey());
spec11ThreatMatches = spec11ThreatMatches.apply(...)
The apply method returns a new instance.
It may read better to switch order with the transformId declaration and chain the remaining transformations here.
This way we won't need this assignment.
Code quote:
spec11Threat
core/src/test/java/google/registry/beam/spec11/Spec11PipelineTest.java
line 291 at r1 (raw file):
.transact( () -> { ImmutableList<Spec11ThreatMatch> sqlThreatMatches =
This name shadows the instance variable. Better rename it.
Code quote:
sqlThreatMatches
736170b
to
458040e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, weiminyu (Weimin Yu) wrote…
IdService is not working correctly on BEAM right now. Each worker has their own sequence generator so there will be collisions.
If we do this, we need to wait until IdService is fixed.
Ok, I will wait for b/201547855 to be fixed before merging this. Or do you recommend I do this another way?
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 184 at r1 (raw file):
Previously, weiminyu (Weimin Yu) wrote…
spec11ThreatMatches = spec11ThreatMatches.apply(...)
The apply method returns a new instance.
It may read better to switch order with the transformId declaration and chain the remaining transformations here.
This way we won't need this assignment.
Done.
core/src/test/java/google/registry/beam/spec11/Spec11PipelineTest.java
line 291 at r1 (raw file):
Previously, weiminyu (Weimin Yu) wrote…
This name shadows the instance variable. Better rename it.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, sarahcaseybot wrote…
Ok, I will wait for b/201547855 to be fixed before merging this. Or do you recommend I do this another way?
I'm working on making it possible to bring your own ID allocation provider in #1809.
Given that it probably will be a while before we can start using SQL to allocate the ID, I wonder if we should look into calling the non-GAE datastore API here to allocate the ID in Beam. If that works, we might just migrate to that, for the Beam pipeline, and for the main application, which will make it less important (or urgent) to use SQL to allocate IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @ptkach and @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, jianglai (Lai Jiang) wrote…
I'm working on making it possible to bring your own ID allocation provider in #1809.
Given that it probably will be a while before we can start using SQL to allocate the ID, I wonder if we should look into calling the non-GAE datastore API here to allocate the ID in Beam. If that works, we might just migrate to that, for the Beam pipeline, and for the main application, which will make it less important (or urgent) to use SQL to allocate IDs.
@ptkach FYI. If switching to SQL sequence is going to take some time, we should look into if we can use the non-GAE datastore API to allocate IDs to unblock this use case.
@jianglai @CydeWeys I'm looking at this whole allocate id thing, however there seems to be a lot of history and complexity to it, so it's gonna take time for me to fully assess the problem. However, maybe a dumb question, but still - is there any reason we don't want to update data type and start using UUID generator? This is how it commonly done in similar circumstances? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, ptkach (Pavlo Tkach) wrote…
@jianglai @CydeWeys I'm looking at this whole allocate id thing, however there seems to be a lot of history and complexity to it, so it's gonna take time for me to fully assess the problem. However, maybe a dumb question, but still - is there any reason we don't want to update data type and start using UUID generator? This is how it commonly done in similar circumstances?
Isn't UUID a combination of letters and numbers? We need a long
allocated in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, jianglai (Lai Jiang) wrote…
Isn't UUID a combination of letters and numbers? We need a
long
allocated in this case.
Also, perhaps more importantly, the ID must be generated from the server side to prevent collision.
I realize it's currently long and UUID is a string 😀, which is why I mentioned "update data type". Sorry I wasn't clear enough. I meant update all the places where the data type is expected as long to actually accept a string UUID, then update schemas and that should probably do the trick. By random UUID generator I suppose we can use Am I missing something? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @CydeWeys, @ptkach, and @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, ptkach (Pavlo Tkach) wrote…
I realize it's currently long and UUID is a string 😀, which is why I mentioned "update data type". Sorry I wasn't clear enough. I meant update all the places where the data type is expected as long to actually accept a string UUID, then update schemas and that should probably do the trick.
By random UUID generator I suppose we can use
java.util.UUID. randomUUID
which is immutable and provides enough entropy for it to be thread-safe. So that should cover beam case.Am I missing something?
Hmmm, I'm not sure if we would want to entertain the idea of changing the primary ID type for most of the entities... It seems like a risky migration that would undoubtedly require some downtime to perform. It certainly warrants some discussion to weight the pros and cons.
But still, as I mentioned earlier, I doubt an client-side allocation scheme would ever work, due to the chance of collision between multiple instances. We are not just talking about thread safety here, but rather distributed workers that all try to allocate globally unique IDs at the same time. This applies to Beam works, or just multiple Nomulus instances serving different clients concurrently. Is this achievable without some sort of client-server coordination (which is what Datastore gives us)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 2 unresolved discussions (waiting on @CydeWeys and @ptkach)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, jianglai (Lai Jiang) wrote…
Hmmm, I'm not sure if we would want to entertain the idea of changing the primary ID type for most of the entities... It seems like a risky migration that would undoubtedly require some downtime to perform. It certainly warrants some discussion to weight the pros and cons.
But still, as I mentioned earlier, I doubt an client-side allocation scheme would ever work, due to the chance of collision between multiple instances. We are not just talking about thread safety here, but rather distributed workers that all try to allocate globally unique IDs at the same time. This applies to Beam works, or just multiple Nomulus instances serving different clients concurrently. Is this achievable without some sort of client-server coordination (which is what Datastore gives us)?
UUID in practice can run into bad states with lots of collisions. See ramsey/uuid#80, at some point someone saw hundreds of collisions out of one million calls. And someone reported this as late as 09/2020.
UUID generator can not run into a collision in our case, it's just not possible. According to wiki - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 2 unresolved discussions (waiting on @ptkach and @weiminyu)
core/src/main/java/google/registry/beam/spec11/Spec11Pipeline.java
line 178 at r1 (raw file):
Previously, ptkach (Pavlo Tkach) wrote…
UUID generator can not run into a collision in our case, it's just not possible. According to wiki -
50% probability of at least one collision can be achieved (after) ... generating 1 billion UUIDs per second for about 85 years.
The issue mentioned is just implementational bug for this particular library. Same issue can happen with datastore's allocateId implementation, postgres sequence nextVal or any other implementation
For the record, changing all IDs across our codebase from longs to Strings would be a lot of work that we simply don't want to take on right now.
458040e
to
ab53da4
Compare
@@ -39,7 +40,7 @@ | |||
@Index(name = "spec11threatmatch_tld_idx", columnList = "tld"), | |||
@Index(name = "spec11threatmatch_check_date_idx", columnList = "checkDate") | |||
}) | |||
public class Spec11ThreatMatch extends ImmutableObject implements Buildable { | |||
public class Spec11ThreatMatch extends ImmutableObject implements Buildable, Serializable { |
Check failure
Code scanning / CodeQL
No clone method
ab53da4
to
204f32a
Compare
Tested this in alpha and the pipeline succeeded. However, spec11 only returns an empty file in alpha. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ptkach and @sarahcaseybot)
This change is