Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify scope of eventID's uniqueness #391

Merged
merged 1 commit into from
May 16, 2019

Conversation

alanconway
Copy link
Contributor

This is for issue #331

Signed-off-by: Alan Conway aconway@redhat.com

Copy link
Contributor

@cneijenhuis cneijenhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pushing this 👍

spec.md Outdated
information such as the type of the event source, the organization
publishing the event, the process that produced the event, and some unique
identifiers. The exact syntax and semantics behind the data encoded in the URI
is event producer defined.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you dropping these two sentences? They give a lot of motivation for why a URI is chosen at all. If you don't want to include information such as..., then you could also go straight for a UUID and be done with it 😉

I think it is very valuable to describe what to put into the source, and why a hierarchical data structure was chosen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the important thing here is that source can be a URI with an internet-unique authority so you don't have to resort to UUIDs for uniqueness. As I read it, the only normative use of the source is to uniquely identify a producer.

The design of source URIs will depend on the application. It might include "information such as type/organization/process..." but it might not. This spec doesn't seem a good place for advice on URI design, which is a topic in its own right.

I'm ok with putting it back if you feel strongly about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the existing text because it's a little less strict in that the value doesn't necessarily have to be a valid address on the web. For example, they could use http://myserver/.... is that is ok for their setup. We purposely didn't want to get too precise here to allow for flexibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check my latest text. I want to make it clear that you MAY have an internet-unique authority and URI here but you also MAY also have something of application specific scope. Otherwise it's hard to answer the original question "what is the scope in which source+id is unique" in any meaningful way.

Copy link
Collaborator

@duglin duglin Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanconway can you elaborate on what problem you're trying to solve with this change? In the end, aren't these all just opaque identifiers since we're not asking the receiver to do anything with these URI-references - like de-reference them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to clarify the scope of uniqueness, the "U" in URI. URIs have a standard authority component which provides well-known, internet-wide uniqueness guarantees. I'm trying to clarify that a "source" can be URI-unique if it has a URI authority OR it can be an authority-less reference path which is only unique in some application-defined context. Both are useful uniqueness guarantees for different applications.

We could leave it unstated since it's implied by using a URI-reference, but this PR is requesting "Clarify scope of eventID's uniqueness" so it seems to need saying.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My interpretation of the request for clarity around uniqueness was more around simple string comparison type of things :-) So, while the semantics behind these values might be interesting in some cases, I think what's more important (from a spec perspective) is that people know if they do the equivalent of ce.source + ce.id they'll get a unique value within the scope of that producer, and can use it appropriately. Getting into the semantics of the strings, while interesting, doesn't really change what the code a consumer would write - I don't think anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I've restored the original comments on "source" - with the exception of saying "identifies" rather than "describes" and changed the "id" part along the lines you suggested.

spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
@alanconway
Copy link
Contributor Author

Updated to deal with most of the comments above, left some conversations unresolved where I need more feedback.

spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
@alanconway
Copy link
Contributor Author

I've minimized the changes and updated in line with comments - I think the only open point of discussion is whether type is part of the dedup or not. As my changes read now it is not.

spec.md Outdated Show resolved Hide resolved
@alanconway alanconway force-pushed the unique-id branch 2 times, most recently from 6171d4f to f836f27 Compare February 26, 2019 16:11
spec.md Outdated Show resolved Hide resolved
@alanconway
Copy link
Contributor Author

alanconway commented Feb 27, 2019 via email

@cneijenhuis
Copy link
Contributor

@alanconway I think we agree on:

  • Having two different events with the same identification is a failure state
  • Someone should to be responsible to make sure that doesn't happen

What we don't agree on is who should be responsible. You're saying it should be the Event Producer. I don't agree.

Let me present a few practical examples:

  • I'm creating IoT devices to be embedded into rooms. Each IoT device is configured to know what building and room it is in. A source may be http://acmeinc.com/buildings/1/rooms/100/sensors/temperature/1. How can the IoT device (the event producer) know whether there is another IoT device with the same source, or not? It can't. The responsibility to configure all IoT devices correctly does not lie with the IoT device itself.
  • I'm creating an Open Source project producing events. In the config, I'm asking for a unique URI. If two groups inside ACME Inc decide to deploy separate instances of the project with both acmeinc.com as the URI, is it the responsibility of the Open Source project to figure that out?
  • I'm a cloud provider. I'm allowing you to completely tear down a service (say, a blob storage bucket) and re-create it from scratch with the same name, as if it never existed. I'm clearly documenting that if you re-create the service with name, events might clash.
    • In a test deployment, I can tearing down all my services, including the event producer (the bucket) and all consumers. Everything is fine. People enjoy that feature and frequently re-create buckets for integration testing.
    • If someone decides to tear down only the bucket, but keep the consumers alive, things will go wrong.
    • Again, it is not the event producer/cloud providers fault that this was done.
  • An event producer is restored from backup, with old data. How is the event producer supposed to know which id's have already been sent?

All the event producer can do is to promise that it'll be unique within their scope. You can not make the event producer responsible for the whole application, because the event producer likely isn't in charge of it.

@alanconway
Copy link
Contributor Author

@cneijenhuis I think we agree on everything, but my wording was messy - have a look at the update.

Here's what I'm trying to say:

  1. Producers MUST generate unique "id" attributes for their own messages only.
  2. Producer identifiers (i.e. "source" attribute) MUST be unique in a wider context that I'm calling the "application". We don't define how that's accomplished, but we give examples of standard URL and URN options, as well as simple application-defined strings or paths.
  3. Given 1+2, consumers can safely assume that events with identical source+id are duplicates.

@evankanderson
Copy link
Contributor

evankanderson commented Mar 2, 2019

Thinking about this a bit more, it seems like it should be the type+source+id triple that is unique.

Firestore is a good example: for a document update, it may fire some of:

Event Type Trigger
onCreate Triggered when a document is written to for the first time.
onUpdate Triggered when a document already exists and has any value changed.
onDelete Triggered when a document with data is deleted.
onWrite Triggered when onCreate, onUpdate or onDelete is triggered.

Note that onWrite and onUpdate may fire with the same source for the same occurrence. Using the same event ID allows downstream consumers to correlate onWrite events with creates and updates, for example.

@alanconway
Copy link
Contributor Author

alanconway commented Mar 4, 2019 via email

@cneijenhuis
Copy link
Contributor

@alanconway Sorry for the late reply. To follow your number-list:

On 3.: I agree, this is the point where the spec needs to be more specific, and clearly spell out when a consumer can deduplicate a message.

On 1.: Yes. But that is already in the spec.

On 2.: I agree we should provide guidance on that. But I think the primer is a much better place for it. We can only advise readers on how to set up their application properly, so that they don't run into problems given 3.
But putting a MUST into the spec is a pretty high bar. I fear the effect is that event producers will have to jump through additional hoops to comply with such a MUST, despite them (in many situations) having little control over the application as a whole.

@cneijenhuis
Copy link
Contributor

Also, to go back to the discussion started in #331 : Like @Tapppi I am not convinced that source + id is a good choice to clarify "scope".

It also goes against the implicit design of the spec, given by the current examples.

The current examples emphasize a unique type, but not necessarily a unique source. Given the examples, I would design a type of org.example.user.created, and a source of /users/123. I wouldn't go for https://example.org/users/123, especially if no such URL exists (maybe the current URL is https://api.example.org/v2/users/123, but what happens if I change to v3 or - like GitHub - deprecate my REST-ish API in favor of GraphQL?).

I think using DNS for uniqueness is a good idea in general, but I favor the convention of using reverse-DNS, as done in many programming languages for this use case. A package name com.acmeinc is as unique http://acmeinc.com, but my class Foo reads much nicer as com.acmeinc.Foo than http://acmeinc.com/Foo, and it doesn't imply that I could somehow access Foo via that URL.

Furthermore, I think it is much more important for the event type to be globally unique than for the source. I have no data to back this up, but I'm predicting that message routing will happen primarily on type, and only secondarily on source.
As an example, as a consumer (not middleware!) I'd choose to consume events from org.example.* or com.acmeinc.* - if they both happen to define a /user/123, I won't notice. Even if I consume events from both, I'll likely treat those events completely differently, so the non-globally-unique source won't be a problem in practice.

Summary: I think the spec, as it is currently written, implicitly favors type+source+id as the "unique" scope. Design-wise, I personally prefer trying to make the type the thing that ensures globally uniqueness.

@duglin
Copy link
Collaborator

duglin commented Mar 7, 2019

@alanconway is this one ready for review or did you want to reply to @cneijenhuis's comments first?

spec.md Outdated Show resolved Hide resolved
@alanconway
Copy link
Contributor Author

alanconway commented Mar 7, 2019 via email

@alanconway
Copy link
Contributor Author

alanconway commented Mar 7, 2019 via email

@duglin
Copy link
Collaborator

duglin commented Mar 14, 2019

@evankanderson wrote:

it seems like it should be the eventid+source+id triple that is unique.

I think there's a typo in there since eventid and id are the same, no? Just eventid is the old name for id.


I tend to agree with @alanconway that including type is unnecessary and could actually be problematic if someone chooses to send two messages with the same source and id but different types. Given that id is meant to be unique for the same producer, I'm not sure how someone is supposed to interpret these two messages.

@evankanderson
Copy link
Contributor

Sorry, that was a typo, I meant eventtype + source + id

@cathyhongzhang
Copy link

I also agree with @alanconway :
"There's no reason to bring type into it - a producer that generates unique
IDs is responsible for never using the same ID on non-duplicate events, and
by definition duplicate events will have the same type (and every other
attribute that matters)"

It creates unnecessary complication for the event consumer if we add type. Since the spec already defines that event ID is unique within the scope of the event source, then as long as we ensure source is globally unique, then "source+eventID" will make the event globally unique.

As to the example of Firestore, it seems the problem can be solved by assigning different event IDs to
onWrite and onUpdate firing with the same source.

@alanconway alanconway force-pushed the unique-id branch 2 times, most recently from 5129089 to 808c16f Compare March 27, 2019 17:18
@duglin
Copy link
Collaborator

duglin commented Apr 17, 2019

@alanconway can you rebase this?

On last week's call we discussed this a bit and the overall consensus of the group is that id + source should be sufficient for uniqueness. There didn't appear to be a desire to add type into the mix. However, there was a brief discussion around how using non-unique source values could lead to duplicates - so a value like producer would not be wise, but a DNS name or UUID would be. I believe the text in this PR here (https://github.com/cloudevents/spec/pull/391/files#diff-958e7270f96f5407d7d980f500805b1bR189) tries to address this.

Overall, what do people think about this? Aside from the merge-conflict I think it's ready for review/consideration - I'll add it to the call this week, but if you have concerns please voice them in the PR.

@alanconway alanconway force-pushed the unique-id branch 2 times, most recently from 6552d0c to 985490d Compare April 17, 2019 20:52
@alanconway
Copy link
Contributor Author

Rebased and tightened the text a bit.

spec.md Outdated Show resolved Hide resolved
@alanconway
Copy link
Contributor Author

alanconway commented Apr 17, 2019 via email

alanconway added a commit to alanconway/cloudevents-spec that referenced this pull request Apr 24, 2019
See discussion at issue cloudevents#326.
This PR is base on pR cloudevents#391 as it depends on those changes.
@alanconway
Copy link
Contributor Author

Note: the CI build failure above is cause by a broken link to the cloudevent logo, nothing to do with my changes AFAIK.

alanconway added a commit to alanconway/cloudevents-spec that referenced this pull request Apr 24, 2019
See discussion at issue cloudevents#326.
This PR is base on pR cloudevents#391 as it depends on those changes.
@alanconway alanconway mentioned this pull request Apr 24, 2019
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
@duglin
Copy link
Collaborator

duglin commented May 9, 2019

@alanconway on last week's call we agreed with this general direction. Could you address the merge conflict and any outstanding comments? The test associated in here might change based on @deissnerk's PR ( #420 ) but we may just have to see which goes in first and then the other will have to adjust accordingly. Also, we may need to adjust the Primer too.

@alanconway alanconway force-pushed the unique-id branch 2 times, most recently from 79ba923 to 7856f13 Compare May 9, 2019 20:40
@alanconway
Copy link
Contributor Author

Updated and rebased. I have re-worded the source definition to use the terms defined in #420 - so at this point this PR really needs to go in after #420 as currently the source definition talks about sources with many producers, which contradicts other places where producer==source.

Reword "id" and "source" to clarify uniqueness requirements.
Examples to show different approaches to generating unique source/IDs
Clarify producer/consumer responsibilities.

Signed-off-by: Alan Conway <aconway@redhat.com>
URNs, DNS authorities or an application-specific scheme to create
unique `source` identifiers.

A source MAY include more than one producer. In that case the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the word "include" the right word? Perhaps "use" or "leverage"? Not a big deal since I think I get the point, but "include" might sound like we're getting include impl details and suggesting that a producer is a sub-component of a source - which it may not be.

@duglin
Copy link
Collaborator

duglin commented May 15, 2019

Just one minor question, otherwise LGTM

@alanconway
Copy link
Contributor Author

alanconway commented May 15, 2019 via email

@duglin
Copy link
Collaborator

duglin commented May 15, 2019

Let's just leave it as is then unless someone thinks of some better word.

@duglin
Copy link
Collaborator

duglin commented May 16, 2019

Approved on the 5/16 call

@duglin duglin merged commit b61f315 into cloudevents:master May 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants