Clarify scope of eventID's uniqueness #391

alanconway · 2019-02-20T17:47:54Z

This is for issue #331

Signed-off-by: Alan Conway aconway@redhat.com

cneijenhuis

Thank you for pushing this 👍

cneijenhuis · 2019-02-20T20:08:32Z

spec.md

-  information such as the type of the event source, the organization
-  publishing the event, the process that produced the event, and some unique
-  identifiers. The exact syntax and semantics behind the data encoded in the URI
-  is event producer defined.


Why are you dropping these two sentences? They give a lot of motivation for why a URI is chosen at all. If you don't want to include information such as..., then you could also go straight for a UUID and be done with it 😉

I think it is very valuable to describe what to put into the source, and why a hierarchical data structure was chosen.

To me, the important thing here is that source can be a URI with an internet-unique authority so you don't have to resort to UUIDs for uniqueness. As I read it, the only normative use of the source is to uniquely identify a producer.

The design of source URIs will depend on the application. It might include "information such as type/organization/process..." but it might not. This spec doesn't seem a good place for advice on URI design, which is a topic in its own right.

I'm ok with putting it back if you feel strongly about it.

I prefer the existing text because it's a little less strict in that the value doesn't necessarily have to be a valid address on the web. For example, they could use http://myserver/.... if that is ok for their setup. We purposely didn't want to get too precise here to allow for flexibility.

Check my latest text. I want to make it clear that you MAY have an internet-unique authority and URI here but you MAY also have something of application-specific scope. Otherwise it's hard to answer the original question "what is the scope in which source+id is unique" in any meaningful way.

@alanconway can you elaborate on what problem you're trying to solve with this change? In the end, aren't these all just opaque identifiers since we're not asking the receiver to do anything with these URI-references - like de-reference them?

I'm trying to clarify the scope of uniqueness, the "U" in URI. URIs have a standard authority component which provides well-known, internet-wide uniqueness guarantees. I'm trying to clarify that a "source" can be URI-unique if it has a URI authority OR it can be an authority-less reference path which is only unique in some application-defined context. Both are useful uniqueness guarantees for different applications.

We could leave it unstated since it's implied by using a URI-reference, but this PR is requesting "Clarify scope of eventID's uniqueness" so it seems to need saying.

My interpretation of the request for clarity around uniqueness was more around simple string comparison type of things :-) So, while the semantics behind these values might be interesting in some cases, I think what's more important (from a spec perspective) is that people know that if they do the equivalent of ce.source + ce.id they'll get a unique value within the scope of that producer, and can use it appropriately. Getting into the semantics of the strings, while interesting, doesn't really change the code a consumer would write - I don't think anyway.
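To make the "equivalent of ce.source + ce.id" idea concrete, here is a minimal sketch of a consumer-side comparison key. It assumes events are plain dictionaries with `source` and `id` entries; `EventKey`, `dedup_key` and the sample values are hypothetical names for illustration, not part of the spec or any SDK. It also shows why the sketch uses a pair rather than literal string concatenation.

```python
from typing import NamedTuple


class EventKey(NamedTuple):
    """Hypothetical comparison key built from two context attributes."""
    source: str
    id: str


def dedup_key(event: dict) -> EventKey:
    # Treat both values as opaque strings; nothing is parsed or dereferenced.
    return EventKey(event["source"], event["id"])


# Two made-up events from different sources.
a = {"source": "/sensors/1", "id": "23"}
b = {"source": "/sensors/12", "id": "3"}

# Literal concatenation yields "/sensors/123" for both events...
concatenation_collides = a["source"] + a["id"] == b["source"] + b["id"]
# ...while comparing the (source, id) pair keeps them apart.
pairs_differ = dedup_key(a) != dedup_key(b)
```

Whether a real consumer uses a tuple, a separator, or a hash is a design choice; the only point here is that the two attributes are compared together.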

+1 I've restored the original comments on "source" - with the exception of saying "identifies" rather than "describes" and changed the "id" part along the lines you suggested.

spec.md

alanconway · 2019-02-20T21:54:56Z

Updated to deal with most of the comments above, left some conversations unresolved where I need more feedback.

spec.md

alanconway · 2019-02-21T15:56:36Z

I've minimized the changes and updated in line with comments - I think the only open point of discussion is whether type is part of the dedup or not. As my changes read now it is not.

spec.md

alanconway · 2019-02-27T15:56:44Z

On Tue, Feb 26, 2019 at 12:03 PM Christoph Neijenhuis wrote:

In spec.md:

   information such as the type of the event source, the organization
   publishing the event, the process that produced the event, and some unique
-  identifiers. The exact syntax and semantics behind the data encoded in the URI
-  is event producer defined.
+  identifiers.
+
+  The exact syntax of `source`, and the scope in which it is unique,
+  depends on the application. Applications range from a single service

I don't think we should talk about "application" here, but only about event producer.

It makes no sense to say that 'source' is unique in the scope of a producer - the source *identifies* the producer. The only way an event consumer can tell if events are from "the same producer" or "different producers" is to compare source fields. So uniqueness of source is a consideration for larger application design.

See https://github.com/cloudevents/spec/blob/master/primer.md#design-goals especially and later can be connected to create new applications.

An (implementation of an) event producer MAY be part of many applications.

The author of the event producer may not know what applications it will be part of after a few years.

That's exactly why I mention application. We know that source+id MUST be unique for events from a single producer. If your application consists of a single producer and directly connected consumers, then the source name can be anything you like and you are done. To build applications that include multiple producers and loosely coupled event delivery (routing, store/forward etc.), producers must use 'source' names that will be unique across all the producers that might ever have events routed to the same consumer. Otherwise there is no way for the *consumer* to know whether events with the same source+id are duplicates, or distinct events from different producers that happen to use the same source name.

cneijenhuis · 2019-02-28T08:53:49Z

@alanconway I think we agree on:

- Having two different events with the same identification is a failure state
- Someone should be responsible for making sure that doesn't happen

What we don't agree on is who should be responsible. You're saying it should be the Event Producer. I don't agree.

Let me present a few practical examples:

1. I'm creating IoT devices to be embedded into rooms. Each IoT device is configured to know what building and room it is in. A source may be http://acmeinc.com/buildings/1/rooms/100/sensors/temperature/1. How can the IoT device (the event producer) know whether there is another IoT device with the same source, or not? It can't. The responsibility to configure all IoT devices correctly does not lie with the IoT device itself.
2. I'm creating an Open Source project producing events. In the config, I'm asking for a unique URI. If two groups inside ACME Inc decide to deploy separate instances of the project, both with acmeinc.com as the URI, is it the responsibility of the Open Source project to figure that out?
3. I'm a cloud provider. I'm allowing you to completely tear down a service (say, a blob storage bucket) and re-create it from scratch with the same name, as if it never existed. I'm clearly documenting that if you re-create the service with the same name, events might clash.
   - In a test deployment, I can tear down all my services, including the event producer (the bucket) and all consumers. Everything is fine. People enjoy that feature and frequently re-create buckets for integration testing.
   - If someone decides to tear down only the bucket, but keep the consumers alive, things will go wrong.
   - Again, it is not the event producer's/cloud provider's fault that this was done.
4. An event producer is restored from backup, with old data. How is the event producer supposed to know which ids have already been sent?

All the event producer can do is to promise that it'll be unique within their scope. You can not make the event producer responsible for the whole application, because the event producer likely isn't in charge of it.

alanconway · 2019-03-01T19:15:20Z

@cneijenhuis I think we agree on everything, but my wording was messy - have a look at the update.

Here's what I'm trying to say:

1. Producers MUST generate unique "id" attributes for their own messages only.
2. Producer identifiers (i.e. "source" attribute) MUST be unique in a wider context that I'm calling the "application". We don't define how that's accomplished, but we give examples of standard URL and URN options, as well as simple application-defined strings or paths.
3. Given 1+2, consumers can safely assume that events with identical source+id are duplicates.
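The last point can be sketched from the consumer's side. This is only an illustration under assumptions: events are plain dictionaries, the `Deduplicator` class and the `urn:example:...` source values are invented for the example, and the unbounded in-memory set stands in for whatever storage a real consumer would use.

```python
class Deduplicator:
    """Remembers (source, id) pairs that were already processed."""

    def __init__(self):
        # Simplification: an unbounded in-memory set, fine for a sketch only.
        self._seen = set()

    def is_duplicate(self, event):
        key = (event["source"], event["id"])
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


dedup = Deduplicator()
first = {"source": "urn:example:producer-a", "id": "1"}
redelivered = dict(first)  # same source and id, e.g. a retry
other_producer = {"source": "urn:example:producer-b", "id": "1"}

results = [dedup.is_duplicate(e) for e in (first, redelivered, other_producer)]
print(results)  # [False, True, False]
```

The third event reuses id "1" but is not flagged, which only works if the two producers really do have distinct `source` values, as the second point requires.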

evankanderson · 2019-03-02T07:23:57Z

Thinking about this a bit more, it seems like it should be the type+source+id triple that is unique.

Firestore is a good example: for a document update, it may fire some of:

| Event Type | Trigger |
| --- | --- |
| onCreate | Triggered when a document is written to for the first time. |
| onUpdate | Triggered when a document already exists and has any value changed. |
| onDelete | Triggered when a document with data is deleted. |
| onWrite | Triggered when onCreate, onUpdate or onDelete is triggered. |

Note that onWrite and onUpdate may fire with the same source for the same occurrence. Using the same event ID allows downstream consumers to correlate onWrite events with creates and updates, for example.
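A rough sketch of what this means for a consumer that de-duplicates: the two events below are made up (the source path and id are hypothetical, only the type names come from the table), and `count_distinct` is an illustrative helper, not an API from any library.

```python
def count_distinct(events, key_fields):
    """Count events that differ in at least one of the given attributes."""
    return len({tuple(e[f] for f in key_fields) for e in events})


# One occurrence (a document update) reported under two event types,
# deliberately sharing the same id so consumers can correlate them.
events = [
    {"type": "onUpdate", "source": "/documents/users/123", "id": "abc"},
    {"type": "onWrite", "source": "/documents/users/123", "id": "abc"},
]

by_source_id = count_distinct(events, ("source", "id"))
by_type_source_id = count_distinct(events, ("type", "source", "id"))
print(by_source_id, by_type_source_id)  # 1 2
```

Keyed on source+id alone, the second event would look like a duplicate of the first; keyed on type+source+id, both survive.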

alanconway · 2019-03-04T13:40:55Z

On Sat, Mar 2, 2019 at 2:24 AM Evan Anderson wrote:

Thinking about this a bit more, it seems like it should be the eventid+source+id triple that is unique.

What is "eventid"? My understanding is that the event context attribute named "id", defined here: https://github.com/alanconway/cloudevents-spec/blob/master/spec.md#L218, is an identifier for the event within the scope of a given producer, which is identified by the attribute named "source". I don't see "eventid" in the spec.

Firestore is a good example: for a document update, it may fire some of: Event Type Trigger onCreate Triggered when a document is written to for the first time. onUpdate Triggered when a document already exists and has any value changed. onDelete Triggered when a document with data is deleted. onWrite Triggered when onCreate, onUpdate or onDelete is triggered. Note that onWrite and onUpdate may fire with the same source for the same occurrence. Using the same event ID allows downstream consumers to correlate onWrite events with creates and updates, for example.

It is the responsibility of a source implementation to ensure unique "id" values per event. If there are concurrent event streams then there are 2 choices:

1. A single source serializes the concurrent events into a single stream, and generates unique ids for them all.
2. Concurrent event streams are represented as separate sources to preserve concurrency and eliminate the need for synchronization.

What's the benefit of adding a 3rd layer to the scheme?
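The two choices could look roughly like this. It is a sketch under assumptions: the class names, the counter-based ids and the `/orders` source values are all invented for illustration, and a real producer might use UUIDs or any other id scheme instead.

```python
import itertools
import threading


class SerializedSource:
    """Choice 1: one source, one id sequence, guarded by a lock."""

    def __init__(self, source):
        self.source = source
        self._lock = threading.Lock()
        self._counter = itertools.count(1)

    def next_event(self, data):
        # Concurrent callers synchronize here so ids never repeat.
        with self._lock:
            event_id = str(next(self._counter))
        return {"source": self.source, "id": event_id, "data": data}


class PerStreamSource:
    """Choice 2: each stream is its own source with its own id sequence."""

    def __init__(self, base, stream_name):
        self.source = f"{base}/{stream_name}"
        self._counter = itertools.count(1)

    def next_event(self, data):
        # No cross-stream synchronization: ids may repeat across streams,
        # but the source differs, so the (source, id) pair stays unique.
        return {"source": self.source, "id": str(next(self._counter)), "data": data}


shared = SerializedSource("/orders")
stream_a = PerStreamSource("/orders", "stream-a")
stream_b = PerStreamSource("/orders", "stream-b")

shared_ids = [shared.next_event(None)["id"] for _ in range(3)]
ev_a = stream_a.next_event("x")
ev_b = stream_b.next_event("y")
```

In both variants the (source, id) pair is unique without looking at the event type.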

cneijenhuis · 2019-03-06T13:26:21Z

@alanconway Sorry for the late reply. To follow your number-list:

On 3.: I agree, this is the point where the spec needs to be more specific, and clearly spell out when a consumer can deduplicate a message.

On 1.: Yes. But that is already in the spec.

On 2.: I agree we should provide guidance on that. But I think the primer is a much better place for it. We can only advise readers on how to set up their application properly, so that they don't run into problems given 3.
But putting a MUST into the spec is a pretty high bar. I fear the effect is that event producers will have to jump through additional hoops to comply with such a MUST, despite them (in many situations) having little control over the application as a whole.

cneijenhuis · 2019-03-06T13:58:05Z

Also, to go back to the discussion started in #331 : Like @Tapppi I am not convinced that source + id is a good choice to clarify "scope".

It also goes against the implicit design of the spec, given by the current examples.

The current examples emphasize a unique type, but not necessarily a unique source. Given the examples, I would design a type of org.example.user.created, and a source of /users/123. I wouldn't go for https://example.org/users/123, especially if no such URL exists (maybe the current URL is https://api.example.org/v2/users/123, but what happens if I change to v3 or - like GitHub - deprecate my REST-ish API in favor of GraphQL?).

I think using DNS for uniqueness is a good idea in general, but I favor the convention of using reverse-DNS, as done in many programming languages for this use case. A package name com.acmeinc is as unique as http://acmeinc.com, but my class Foo reads much nicer as com.acmeinc.Foo than http://acmeinc.com/Foo, and it doesn't imply that I could somehow access Foo via that URL.

Furthermore, I think it is much more important for the event type to be globally unique than for the source. I have no data to back this up, but I'm predicting that message routing will happen primarily on type, and only secondarily on source.
As an example, as a consumer (not middleware!) I'd choose to consume events from org.example.* or com.acmeinc.* - if they both happen to define a /user/123, I won't notice. Even if I consume events from both, I'll likely treat those events completely differently, so the non-globally-unique source won't be a problem in practice.
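As an illustration of type-based selection, the sketch below assumes shell-style wildcard patterns matched with Python's `fnmatch` module; the thread does not define a pattern syntax, so that choice, the `matching_pattern` helper and the sample events are all assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical subscriptions expressed as reverse-DNS type prefixes.
SUBSCRIPTIONS = ("org.example.*", "com.acmeinc.*")


def matching_pattern(event):
    """Return the first subscription pattern matching the event type, if any."""
    for pattern in SUBSCRIPTIONS:
        if fnmatchcase(event["type"], pattern):
            return pattern
    return None


# Both producers happen to use the same, non-global source and id.
events = [
    {"type": "org.example.user.created", "source": "/user/123", "id": "1"},
    {"type": "com.acmeinc.user.created", "source": "/user/123", "id": "1"},
    {"type": "net.other.user.created", "source": "/user/123", "id": "1"},
]

matches = [matching_pattern(e) for e in events]
print(matches)  # ['org.example.*', 'com.acmeinc.*', None]
```

The first two events are told apart by type alone even though source and id are identical, which is the situation described above.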

Summary: I think the spec, as it is currently written, implicitly favors type+source+id as the "unique" scope. Design-wise, I personally prefer trying to make the type the thing that ensures global uniqueness.

duglin · 2019-03-07T01:49:05Z

@alanconway is this one ready for review or did you want to reply to @cneijenhuis's comments first?

spec.md

alanconway · 2019-03-07T19:30:26Z

On Thu, Mar 7, 2019 at 12:18 PM Scott Nichols wrote:

In spec.md:

 ### id
 * Type: `String`
-* Description: ID of the event. The semantics of this string are explicitly
-  undefined to ease the implementation of producers. Enables deduplication.
+* Description: Identifies of the event, enables de-duplication. The
+  format of this string is determined by the producer. Each producer
+  MUST generate unique `id` values for its own events, `id` values
+  from different producers might clash. Consumers MAY assume that
+  events with identical `id` and `source` values are duplicates, and

I would say that eventtype is also required to understand if the event is a duplicate. So: id + eventtype + source combine to be unique.

I don't think type belongs in de-duplication. Type describes a class of events with similar semantics, it does not define where they come from. Many independent producers can produce events of the same type.

The only agent that can reliably generate unique IDs is a producer, because it *produces* the events. A single producer can only ensure its own IDs are unique, so to check for duplicates you absolutely must look at:

- the id, because that's the only bit of data that varies on a per-event basis.
- the source, because that identifies the producer, and different producers generate ids independently.

There's no reason to bring type into it - a producer that generates unique IDs is responsible for never using the same ID on non-duplicate events, and by definition duplicate events will have the same type (and every other attribute that matters).

That said, if the consensus is to include type we can. I don't see any benefit to it, but apart from the extra complexity it doesn't cause a problem.

alanconway · 2019-03-07T19:34:17Z

I think it warrants review to decide if it makes sense as it stands based on using source+id. We need to work thru the current discussion about adding type as part of the identifier. It should be straightforward to update the text once we've reached a consensus.

duglin · 2019-03-14T12:45:56Z

@evankanderson wrote:

it seems like it should be the eventid+source+id triple that is unique.

I think there's a typo in there since eventid and id are the same, no? Just eventid is the old name for id.

I tend to agree with @alanconway that including type is unnecessary and could actually be problematic if someone chooses to send two messages with the same source and id but different types. Given that id is meant to be unique for the same producer, I'm not sure how someone is supposed to interpret these two messages.

evankanderson · 2019-03-14T16:33:11Z

Sorry, that was a typo, I meant eventtype + source + id

cathyhongzhang · 2019-03-14T18:46:13Z

I also agree with @alanconway :
"There's no reason to bring type into it - a producer that generates unique
IDs is responsible for never using the same ID on non-duplicate events, and
by definition duplicate events will have the same type (and every other
attribute that matters)"

It creates unnecessary complication for the event consumer if we add type. Since the spec already defines that event ID is unique within the scope of the event source, then as long as we ensure source is globally unique, then "source+eventID" will make the event globally unique.

As to the example of Firestore, it seems the problem can be solved by assigning different event IDs to
onWrite and onUpdate firing with the same source.

duglin · 2019-04-17T15:26:08Z

@alanconway can you rebase this?

On last week's call we discussed this a bit and the overall consensus of the group is that id + source should be sufficient for uniqueness. There didn't appear to be a desire to add type into the mix. However, there was a brief discussion around how using non-unique source values could lead to duplicates - so a value like producer would not be wise, but a DNS name or UUID would be. I believe the text in this PR here (https://github.com/cloudevents/spec/pull/391/files#diff-958e7270f96f5407d7d980f500805b1bR189) tries to address this.

Overall, what do people think about this? Aside from the merge-conflict I think it's ready for review/consideration - I'll add it to the call this week, but if you have concerns please voice them in the PR.

alanconway · 2019-04-17T20:53:07Z

Rebased and tightened the text a bit.

spec.md

alanconway · 2019-04-17T21:00:12Z

On Wed, Apr 17, 2019 at 11:26 AM Doug Davis wrote:

@alanconway can you rebase this?

Done, and tightened up the text a bit.

On last week's call we discussed this a bit and the overall consensus of the group is that id + source should be sufficient for uniqueness. There didn't appear to be a desire to add type into the mix. However, there was a brief discussion around how using non-unique source values could lead to duplicates - so a value like producer would not be wise, but a DNS name or UUID would be. I believe the text in this PR here ( https://github.com/cloudevents/spec/pull/391/files#diff-958e7270f96f5407d7d980f500805b1bR189) tries to address this.

Yep the new text is: An application MUST assign a distinct `source` to each distinct producer. The application MAY use UUIDs, URNs, DNS authorities or an application-specific scheme to create unique identifiers. There are examples of each approach.
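For illustration only, the three families of `source` values mentioned in that text might be produced along these lines. The helper names, the `acmeinc.com` authority and the path segments are hypothetical; the spec text does not prescribe any of them.

```python
import uuid


def uuid_source():
    """A URN built from a random UUID, e.g. 'urn:uuid:...'."""
    return uuid.uuid4().urn


def dns_source(authority, path):
    """A URI whose uniqueness comes from a DNS authority the deployer controls."""
    return f"https://{authority}/{path.lstrip('/')}"


def app_source(*parts):
    """An authority-less path, unique only within the application that assigns it."""
    return "/" + "/".join(parts)


examples = [
    uuid_source(),
    dns_source("acmeinc.com", "/buildings/1/rooms/100"),
    app_source("cluster-a", "billing"),
]
```

Which approach fits depends on who assigns the value: a UUID needs no coordination, a DNS authority relies on existing ownership, and an application-specific path relies on whoever configures the producers.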

Overall, what do people think about this? Aside from the merge-conflict I think it's ready for review/consideration - I'll add it to the call this week, but if you have concerns please voice them in the PR.

On a related note, I added a plea to #326 - make eventId optional. It was closed but I feel unjustly :)

See discussion at issue cloudevents#326. This PR is based on PR cloudevents#391 as it depends on those changes.

alanconway · 2019-04-24T19:33:33Z

Note: the CI build failure above is caused by a broken link to the cloudevent logo, nothing to do with my changes AFAIK.

spec.md

duglin · 2019-05-09T01:04:29Z

@alanconway on last week's call we agreed with this general direction. Could you address the merge conflict and any outstanding comments? The test associated in here might change based on @deissnerk's PR ( #420 ) but we may just have to see which goes in first and then the other will have to adjust accordingly. Also, we may need to adjust the Primer too.

alanconway · 2019-05-09T20:42:58Z

Updated and rebased. I have re-worded the source definition to use the terms defined in #420 - so at this point this PR really needs to go in after #420 as currently the source definition talks about sources with many producers, which contradicts other places where producer==source.

Reword "id" and "source" to clarify uniqueness requirements.
Examples to show different approaches to generating unique source/IDs.
Clarify producer/consumer responsibilities.

Signed-off-by: Alan Conway <aconway@redhat.com>

duglin · 2019-05-15T13:21:43Z

spec.md

+  URNs, DNS authorities or an application-specific scheme to create
+  unique `source` identifiers.
+
+  A source MAY include more than one producer. In that case the


Is the word "include" the right word? Perhaps "use" or "leverage"? Not a big deal since I think I get the point, but "include" might sound like we're getting into impl details and suggesting that a producer is a sub-component of a source - which it may not be.

duglin · 2019-05-15T13:22:42Z

Just one minor question, otherwise LGTM

alanconway · 2019-05-15T14:03:05Z

On Wed, May 15, 2019 at 3:21 PM Doug Davis wrote:

In spec.md:

+- Description: Identifies the context in which an event
+  happened. Often this will include information such as the type of
+  the event source, the organization publishing the event or the
+  process that produced the event. The exact syntax and semantics
+  behind the data encoded in the URI is defined by the event producer.
+
+  Producers MUST ensure that `source` + `id` is unique for each
+  distinct event.
+
+  An application MAY assign a unique `source` to each distinct
+  producer, which makes it easy to produce unique IDs since no other
+  producer will have the same source. The application MAY use UUIDs,
+  URNs, DNS authorities or an application-specific scheme to create
+  unique `source` identifiers.
+
+  A source MAY include more than one producer. In that case the

Is the word "include" the right word? Perhaps "use" or "leverage"? Not a big deal since I think I get the point, but "include" might sound like we're getting into impl details and suggesting that a producer is a sub-component of a source - which it may not be.

I'm fine with another word: "encompass", "comprise", "involve", "contain"?

The source isn't a thing in itself, it is a group of related producers, so I'm not too sure about "use" or "leverage" - there's no separate source object that acts on the producers. I don't object however.

duglin · 2019-05-15T15:23:39Z

Let's just leave it as is then unless someone thinks of some better word.

duglin · 2019-05-16T23:47:38Z

Approved on the 5/16 call

alanconway mentioned this pull request Feb 20, 2019

Clarify scope of eventID's uniqueness #331

Closed

cneijenhuis reviewed Feb 20, 2019

duglin reviewed Feb 20, 2019
