
Add dataref attribute and describe Claim Check Pattern #377

Merged
5 commits merged into cloudevents:master on Apr 14, 2019

Conversation

cneijenhuis
Contributor

This resulted mostly from the discussion around #364, but is also related to #373.

The Claim Check Pattern is used for different purposes, including large payload size and security concerns.

This is intentionally "hand-wavy" about the actual retrieval of the payload.
The advantage is that, this way, any auth and storage mechanism can be used, e.g. the blob storage of a cloud provider protected by its auth mechanisms.
The downside is that it is hard for a middleware or an SDK to retrieve the payload. A non-protected HTTP(S) endpoint would be simpler.
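
For illustration, here is a minimal sketch (as Python dicts; the storage URL, attribute values, and specversion are made up) of the same hypothetical event carried in-band via `data` and out-of-band via the proposed `dataref`:

```python
# Illustrative only: an event carrying the payload in-band vs. one using the
# proposed `dataref` attribute. The storage URL and other values are made up.
event_in_band = {
    "specversion": "0.3",          # illustrative version
    "type": "com.example.object.changed",
    "source": "/example/source",
    "id": "A234-1234-1234",
    "datacontenttype": "application/json",
    "data": {"changes": ["..."]},  # potentially large payload
}

event_with_claim_check = {
    "specversion": "0.3",
    "type": "com.example.object.changed",
    "source": "/example/source",
    "id": "A234-1234-1234",
    # The payload is stored elsewhere; how it is retrieved (auth, storage)
    # is deliberately left unspecified by the proposal.
    "dataref": "https://storage.example.com/events/A234-1234-1234",
}
```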

@JemDay I would appreciate it if you could give this a read :)


Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@rperelma
Contributor

This is a really good start, @cneijenhuis, thanks for the proposal!

spec.md Outdated
@@ -281,6 +281,27 @@ help intermediate gateways determine how to route the events.
As defined by the term [Data](#data), CloudEvents MAY include domain-specific
information about the occurrence. When present, this information will be
encapsulated within the `data` attribute.
The `dataref` attribute can be used to reference another location where this
Collaborator

I like the direction of this. My comments are editorial in nature:

  • it feels like most of this should be put under the dataref section instead of here, because most of it applies only when dataref is present. Then we could just add a pointer to the dataref attribute from here.
  • s/can/MAY/ in this sentence

Contributor Author

Thanks for the careful review!

I have moved everything except the paragraph that describes the interaction between data and dataref into the dataref section: 3b923d4

spec.md Outdated
@@ -281,6 +281,27 @@ help intermediate gateways determine how to route the events.
As defined by the term [Data](#data), CloudEvents MAY include domain-specific
information about the occurrence. When present, this information will be
encapsulated within the `data` attribute.
The `dataref` attribute can be used to reference another location where this
information is stored. Known as the "Claim Check Pattern", the `dataref`
attribute can be used for different purposes:
Collaborator

  • s/can/MAY/ I think
  • s/different purposes/a variety of purposes, including:/

spec.md Outdated
`dataref` attribute.
* If the consumer wants to verify that the information has not been tampered
with, it can retrieve it from a trusted source using the `dataref` attribute.
* If the information MUST only be viewed by trusted consumers (e.g. personally
Collaborator

s/MUST/is to/

spec.md Outdated
Both the `data` and `dataref` attribute MAY exist at the same time. A middleware
MAY drop the `data` attribute when the `dataref` attribute exists, it MAY add
the `dataref` attribute and drop the `data` attribute, or it MAY add the `data`
attribute by using the `dataref` attribute.
Collaborator

Just to complete all options... is there ever a case where they can drop dataref?

Contributor Author

If we allow a middleware to drop the dataref, a consumer can no longer verify that the data hasn't been tampered with. Less significantly, another middleware downstream cannot (easily) drop the data attribute if it wants to shrink the message size.

I can't come up with a good use case for a middleware to drop it. Did you have one in mind?

Collaborator

Nope - was just asking for completeness. Thanks


We need to be clear here on whether the dataref includes the information carried in the data attribute. If the dataref refers to information in addition to what is carried in data, then the middleware should not drop the data attribute. There are cases where it will result in a performance enhancement if the event producer sends some key information in-band in the data attribute and other "might-need" large information out-of-band via dataref. Extra latency is a drawback of serverless technology, so optimizing latency is important. There might be other use cases too. We may need to consider such use cases and allow data to be carried both in-band in the data attribute and out-of-band via the dataref attribute.

Contributor Author

> We need to be clear here on whether the dataref includes the information carried in the data attribute.

I thought this was already clear from the previous line, but apparently it is not. The information (as the spec calls it) in data and the information retrieved via dataref MUST be exactly the same.

I'll try to rephrase it.

> If the dataref refers to information in addition to what is carried in data, then the middleware should not drop the data attribute.

It must not.

> There are cases where it will result in a performance enhancement if the event producer sends some key information in-band in the data attribute and other "might-need" large information out-of-band via dataref.

Yes, this is a completely valid use case in itself, but it is not what dataref is. dataref is a reference to (what would be in) data.
If one wants to put some information within data, and some outside of it, that is completely fine - the references can be contained within data.
The point of dataref is that it allows a middleware (including an SDK) to switch between in-band and out-of-band without affecting consumers or producers.
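
As a rough sketch of that last point, assuming hypothetical `upload_payload`/`download_payload` storage helpers and events represented as plain Python dicts, a middleware might switch between in-band and out-of-band like this:

```python
import json

# Hypothetical size threshold for the next hop; not from the spec.
MAX_INLINE_SIZE = 64 * 1024


def upload_payload(event_id, payload):
    """Hypothetical helper: persist the payload somewhere and return a URI for it."""
    # e.g. write to a blob store keyed by the event id
    return f"https://storage.example.com/events/{event_id}"


def download_payload(uri):
    """Hypothetical helper: fetch the payload from a trusted source."""
    raise NotImplementedError("storage-specific retrieval goes here")


def to_out_of_band(event):
    """Middleware step: drop `data` and add `dataref` when the payload is large."""
    data = event.get("data")
    if data is not None and len(json.dumps(data)) > MAX_INLINE_SIZE:
        event = dict(event)
        event["dataref"] = upload_payload(event["id"], data)
        del event["data"]
    return event


def to_in_band(event):
    """Middleware step: re-inline the payload for consumers that only read `data`."""
    if "data" not in event and "dataref" in event:
        event = dict(event)
        event["data"] = download_payload(event["dataref"])
    return event
```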

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
spec.md Outdated

* Constraints:
* OPTIONAL


Adding this attribute will help avoid sending large amounts of data in-band. But as raised in the meeting discussion, there could be cases that need multiple URI-refs, so we will need to consider how to define the type.

spec.md Outdated

* Constraints:
* OPTIONAL


Does the URI give information on how to obtain the data from that URI? I assume the URI could point to different types of storage mechanisms, such as block storage or a DB, etc. The mechanism to retrieve the info could differ. From an event consumer's point of view, I need to know the type, or how to retrieve the data.

Contributor Author

For the security-focused use cases, this is explicitly an anti-goal:

  • Only a pre-approved consumer knows how to retrieve the data, so that no one can listen in.
  • A consumer will retrieve the data from a trusted source, to make sure it hasn't been tampered with. It must explicitly not try to retrieve the data in any other way.

I agree that it would be nice for the message-too-large use case to have pre-defined mechanisms, but my guess is that it would be vendor-specific. E.g. Azure would use Blob Storage, AWS would use S3, etc. I'm not sure how to pull that off properly, or whether the URI isn't already enough info to know if it is e.g. an Azure Blob or an AWS S3 reference.

@duglin
Collaborator

duglin commented Feb 12, 2019

@clemensv I believe you had some ideas for a proposal to support this by reusing the existing data property - did you want to add it here for the call this week so we could discuss it?

@Tapppi
Contributor

Tapppi commented Feb 14, 2019

Since the spec has thus far mainly been concerned with features of the events that require support/definition themselves, not with requiring additional capabilities from consumers, is this a required capability (being an attribute in the main spec), or is it OK to drop (or even reject, e.g. an HTTP endpoint returning status code 400) an event because it uses dataref instead of data? Either way, I think there should be language to clarify that.

To put it another way: you might not support all transports, and even if you support the HTTP transport, you will not support all event formats in a single endpoint accepting CloudEvents. It is quite clear how to respond in that situation, but I don't think it is clear here, since you know about the attribute (it being in the main spec and all) but you simply can't or don't support it. Are you still compliant if you don't accept the event because you need to validate or look at the contents of the data attribute instead of just forwarding the message? The HTTP binding is clarified like this in two places:

> All implementations of this specification MUST support the non-batching JSON event format, but MAY support any additional, including proprietary, formats.

> If a receiver detects the CloudEvents media type, but with an event format that it cannot handle, for instance application/cloudevents+avro, it MAY still treat the event as binary and forward it to another party as-is.

This would not be a problem for me if it were an extension (because you don't have to support it) or included in the data attribute, because it could not then be "the exact and whole data payload" replacing the actual data, as it is in this proposal.

@duglin raised a point about clarifying the responsibilities of consumers and producers more generally regarding MUST etc. in the spec, but I think it is particularly thorny in this case, because it ascribes to consumers capabilities other than the processing of a single event.

@clemensv
Contributor

clemensv commented Feb 15, 2019

I find the dataref attribute problematic for a few reasons.

Events are generally notifications about something, and that something is usually pointed at. As a matter of principle, I have issues with event payloads so large that the claim check pattern is even required.

Example: Azure Blob events

Azure Blob events are about, and point to, files that might be gigabytes in size, and the events might be sent to an audience of dozens, but the event per se is super compact and contains further custom information about that blob.

Likewise, any event coming out of a solution such as a CRM system might point to a complex and large object, but that pointer will be accompanied by further metadata that allows the handler to determine whether following the link even makes sense; i.e., just like with the blob event, that would sit in the data for the handler to understand.

An additional complication for a generic mechanism is network scoping. Since dataref is and must be mutually exclusive with data, the event is not completely delivered to the subscriber when the URL can't be resolved. Since events might be routed and traverse different network scopes, there would have to be a rule for ensuring that any receiver can resolve the payload, and I don't know what that might look like.

I'm indeed disputing that there is a real use-case for events with giant payloads that warrant the use of the claim check pattern in a generic fashion and where putting a link into data, accompanied by a description of the referenced object, is not the better choice.

@duglin
Collaborator

duglin commented Feb 18, 2019

@Tapppi I opened: #388 so we remember to address your concerns.

@duglin
Collaborator

duglin commented Feb 18, 2019

I believe in our first demo we used a form of the claim-check pattern, because there was a URL to the image in the event's data attribute. Do people think that we ran into any interop issues during that time where having a well-defined spot (either a new property, or a location in data) would have made life easier?

@cneijenhuis
Contributor Author

@clemensv I'm a bit too lazy to make a POC, but there is at least a chance I can make an Azure Blob event larger than 64KB. There are several ways I can blow up the size of the event:

  • Azure blob storage supports folders. I can create a deeply nested folder structure with long folder names. These will then show up in the subject and data.url fields.
  • The clientRequestId is user-provided
  • The content type is user-provided
  • The account name is user-provided (part of both topic and url)

I'm sure all of these are limited, and I might not succeed in creating an event >64KB - but there could at least be a blob storage implementation that allows both long folder names and deep nesting, such that I end up with both subject and data.url approaching 32KB each (NTFS apparently supports 32K characters in a path, for example).

My overall point being: The moment you include user-provided data in your event, you can end up with an unpredictable variation in your event data size.

> Since events might be routed and traverse different network scopes, there would have to be a rule for ensuring that any receiver can resolve the payload

Yes, this is a problem. An approach could be that the intermediary (who routes from one network scope to another) needs to change the dataref in such cases.

> I'm indeed disputing that there is a real use-case for events with giant payloads that warrant the use of the claim check pattern in a generic fashion and where putting a link into data, accompanied by a description of the referenced object, is not the better choice.

As far as I can see, that link is already always part of a CloudEvent via the source attribute.

In the Blob Storage example, the events are simply BlobCreated and BlobDeleted. For any change I'll always have to go back to the source and fetch everything.
That is fine for Blobs, but for something that has more structure and meaning, it is useful if the events contain actual info about what has changed.

Given your CRM example, maybe it points back to a customer account. But that customer account contains a large amount of information. If something changes, your event should include what has actually changed. Maybe it is just the phone number (small), or maybe a phone call transcript (usually small, but maybe big) has been added.

In commerce, we often have objects (products, orders, ...) that are big (in the MBs), but an event consumer wants to know which part has changed. The consumer prefers not to download the whole thing each time, but to be able to work directly from the event. We therefore send, along with the event, the changes that have been performed. These are usually much smaller than the object itself, but can easily be a few hundred KB.
Even if we only pointed back to the sub-paths that have changed, these can exceed 64KB. E.g. our products may contain price information for each physical store the product is sold in. If, in a marketing campaign, the price is changed in hundreds of stores, just pointing to the changed prices (without the actual new price) can exceed 64KB.
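
For illustration, a sketch of what such a change payload might look like (as a Python dict with made-up field names, not our actual format); with hundreds of entries the delta alone approaches the 64KB limit even though it is far smaller than the full product:

```python
# Made-up shape of a "changes" payload for a price update; the field names are
# illustrative and not taken from any actual commercetools event format.
price_change_event_data = {
    "productId": "prod-123",
    "changes": [
        {
            "path": f"/variants/1/prices/store-{store}",
            "op": "replace",
            "value": {"currency": "EUR", "amountCents": 1999},
        }
        for store in range(300)  # one entry per store touched by the campaign
    ],
}
```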

@cathyhongzhang

To support interoperability between all types of event producers and event consumers, it is better to define the ref so that all entities along the way know where and how to retrieve the large payload. I think the difficult point is how we should define the ref to ensure that any event consumer can successfully retrieve the large payload based on this "ref", since there are different media and mechanisms in which the payload can be stored. Maybe we can list all existing media/mechanisms and define the ref for each type? We can always update the list when we find a new type.

@duglin
Collaborator

duglin commented Feb 24, 2019

In thinking about this one, a couple of things come to mind:

  • at some point in the spec's lifecycle we may find that we've left the stage of gathering the "minimum required" to claim success and started to enter the realm of "might be nice to haves", where it almost borders on "scope creep". I'm not 100% sure we're at that stage yet, but I do wonder about it when I think about this feature, because we were able to support the claim-check pattern just fine for the first interop event and I don't recall any pain around it.
  • is a single URL sufficient? What if there are multiple large objects for a particular event? Do we need to support an array? How do people know which URL points to what? Are we getting into the semantics of the payload at that point?
  • how does someone access the data behind that URL, since we're not including any auth info? That implies some out-of-band knowledge, which makes me wonder whether the receiver would know how to extract the URL from the data anyway.
  • who do we expect to process this property? The PR talks about middleware, but can we assume it has the auth necessary to do so?
  • someone mentioned wanting this URL outside of data, but this is meant to be an alternative to placing a large chunk of info into data - are we somehow violating our model now?

I can definitely appreciate the desire for a consistent/interoperable location for this information; however, I question the ability of anything but the receiving application (or some other entity that understands the type and data) to meaningfully know when (and how) to retrieve the remote data. And if that's the case, then that same code will probably know exactly how to handle an application-specific claim-check pattern encoded within the data property.

I guess I'm now at the position where I'd like to get more data on how needed this really is before we add it as a 'core' property. My biggest concern is adding a feature that may not really be as useful as we think, and then we're stuck with it. But if we start out with it as an extension, then we can always upgrade it to a core spec property later w/o breaking people.

@rperelma
Contributor

Very well put, @duglin I agree 100%

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@cneijenhuis cneijenhuis force-pushed the claim-check-pattern branch 2 times, most recently from 891f498 to b4df7bb on March 20, 2019 at 09:02
Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@cneijenhuis cneijenhuis force-pushed the claim-check-pattern branch from b4df7bb to b4c18a1 on March 20, 2019 at 09:03
@JemDay
Contributor

JemDay commented Mar 27, 2019

Now that we have the `datacontentencoding` and `datacontenttype` attributes, I'm curious whether we can leverage them to indicate that the data attribute contains a reference as opposed to the actual data - I believe this is in line with @clemensv's earlier comment.

@cneijenhuis
Contributor Author

It is an interesting approach; I think it could work if we want to go with XOR semantics.

However, in this PR I'm proposing OR semantics (i.e. allowing both data and dataref to be present). The producer may decide to include both, because then a middleware can decide to drop data (e.g. because it is too large, or because it wants to enforce stronger security on the next hop), and a consumer can opt in to retrieving the data from a trusted source to make sure it hasn't been tampered with (maybe some consumers don't want to do this at all, or only for some events).
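
As a rough sketch of that opt-in verification, assuming a hypothetical `fetch_from_trusted_source` helper and events as plain Python dicts:

```python
def fetch_from_trusted_source(uri):
    """Hypothetical helper: retrieve the payload over an authenticated channel."""
    raise NotImplementedError("auth and storage specifics go here")


def verified_data(event):
    """Return the payload, re-fetching it from the trusted source when the
    consumer wants proof that the in-band copy was not tampered with."""
    if "dataref" in event:
        trusted = fetch_from_trusted_source(event["dataref"])
        if "data" in event and event["data"] != trusted:
            raise ValueError("in-band data does not match the trusted copy")
        return trusted
    return event.get("data")
```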

@JemDay Anyway, if you want to push the Claim Check pattern forward, feel free to take over from me.

@@ -0,0 +1,73 @@
# Dataref (Claim Check Pattern)

As defined by the term [Data](spec.md#data), CloudEvents MAY include domain-specific
Collaborator

The CI failure noticed that this needs to be ../spec since we're in the extensions dir now. I think there are a few more below too.

@duglin
Collaborator

duglin commented Apr 2, 2019

Where are we on this? Ready for review? (aside from the minor CI issue)

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@cneijenhuis
Contributor Author

Yes, it is ready for review. However, I'll probably not join this week's call, or only for the first few minutes, so I suggest we review it the week after.

@duglin
Collaborator

duglin commented Apr 14, 2019

Approved on the 4/11 call.

@duglin duglin merged commit 3ea7805 into cloudevents:master Apr 14, 2019
@duglin
Collaborator

duglin commented Apr 15, 2019

Jem/Christoph were the voting members who voiced support for this.
