
Add dataref attribute and describe Claim Check Pattern #377

Merged
5 commits merged into cloudevents:master on Apr 14, 2019

Conversation

cneijenhuis
Contributor

This resulted mostly from the discussion around #364, but is also related to #373.

The Claim Check Pattern is used for different purposes, including large payload size and security concerns.

This is intentionally "hand-wavy" about the actual retrieval of the payload.
The advantage is that, this way, any auth and storage mechanism can be used, e.g. the blob storage of a cloud provider protected by its auth mechanisms.
The downside is that it is hard for a middleware or an SDK to retrieve the payload. A non-protected HTTP(S) endpoint would be simpler.
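
For illustration, here is a minimal sketch (as Python dicts; the storage URL, attribute values, and specversion are made up) of the same hypothetical event carried in-band via `data` and out-of-band via the proposed `dataref`:

```python
# Illustrative only: an event carrying the payload in-band vs. one using the
# proposed `dataref` attribute. The storage URL and other values are made up.
event_in_band = {
    "specversion": "0.3",          # illustrative version
    "type": "com.example.object.changed",
    "source": "/example/source",
    "id": "A234-1234-1234",
    "datacontenttype": "application/json",
    "data": {"changes": ["..."]},  # potentially large payload
}

event_with_claim_check = {
    "specversion": "0.3",
    "type": "com.example.object.changed",
    "source": "/example/source",
    "id": "A234-1234-1234",
    # The payload is stored elsewhere; how it is retrieved (auth, storage)
    # is deliberately left unspecified by the proposal.
    "dataref": "https://storage.example.com/events/A234-1234-1234",
}
```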

@JemDay I would appreciate it if you could give this a read :)


Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@rperelma
Contributor

This is a really good start, @cneijenhuis, thanks for the proposal!

spec.md Outdated
@@ -281,6 +281,27 @@ help intermediate gateways determine how to route the events.
As defined by the term [Data](#data), CloudEvents MAY include domain-specific
information about the occurrence. When present, this information will be
encapsulated within the `data` attribute.
The `dataref` attribute can be used to reference another location where this
Collaborator

I like the direction of this. My comments are editorial in nature:

  • it feels like most of this should be put under the dataref section instead of here, because most of it applies only when dataref is present. Then we could just add a pointer to the dataref attribute from here.
  • s/can/MAY/ in this sentence

Contributor Author

Thanks for the careful review!

I have moved everything except the paragraph that describes the interaction between data and dataref into the dataref section: 3b923d4

spec.md Outdated
@@ -281,6 +281,27 @@ help intermediate gateways determine how to route the events.
As defined by the term [Data](#data), CloudEvents MAY include domain-specific
information about the occurrence. When present, this information will be
encapsulated within the `data` attribute.
The `dataref` attribute can be used to reference another location where this
information is stored. Known as the "Claim Check Pattern", the `dataref`
attribute can be used for different purposes:
Collaborator

  • s/can/MAY/ I think
  • s/different purposes/a variety of purposes, including:/

spec.md Outdated
`dataref` attribute.
* If the consumer wants to verify that the information has not been tampered
with, it can retrieve it from a trusted source using the `dataref` attribute.
* If the information MUST only be viewed by trusted consumers (e.g. personally
Collaborator

s/MUST/is to/

spec.md Outdated
Both the `data` and `dataref` attribute MAY exist at the same time. A middleware
MAY drop the `data` attribute when the `dataref` attribute exists, it MAY add
the `dataref` attribute and drop the `data` attribute, or it MAY add the `data`
attribute by using the `dataref` attribute.
Collaborator

Just to complete all options... is there ever a case where they can drop dataref?

Contributor Author

If we allow a middleware to drop the dataref, a consumer can no longer verify that the data hasn't been tampered with. Less significantly, another middleware downstream cannot (easily) drop the data attribute if it wants to shrink the message size.

I can't come up with a good use case for a middleware to drop it. Did you have one in mind?

Collaborator

Nope - was just asking for completeness. Thanks


We need to be clear here on whether the dataref includes the information carried in the data attribute. If the dataref refers to information in addition to what is carried in data, then the middleware should not drop the data attribute. There are cases where it will result in a performance enhancement if the event producer sends some key information in-band in the data attribute and other "might-need" large information out-of-band via dataref. Extra latency is a drawback of serverless technology, so optimizing latency is important. There might be other use cases too. We may need to consider such use cases and allow data to be carried both in-band in the data attribute and out-of-band via the dataref attribute.

Contributor Author

> We need to be clear here on whether the dataref includes the information carried in the data attribute.

I thought this was already clear from the previous line, but apparently it is not. The information (as the spec calls it) in data and the information retrieved via dataref MUST be exactly the same.

I'll try to rephrase it.

> If the dataref refers to information in addition to what is carried in data, then the middleware should not drop the data attribute.

It must not.

> There are cases where it will result in a performance enhancement if the event producer sends some key information in-band in the data attribute and other "might-need" large information out-of-band via dataref.

Yes, this is a completely valid use case in itself, but it is not what dataref is. dataref is a reference to (what would be in) data.
If one wants to put some information within data, and some outside of it, that is completely fine - the references can be contained within data.
The point of dataref is that it allows a middleware (including an SDK) to switch between in-band and out-of-band without affecting consumers or producers.
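
As a rough sketch of that last point, assuming hypothetical `upload_payload`/`download_payload` storage helpers and events represented as plain Python dicts, a middleware might switch between in-band and out-of-band like this:

```python
import json

# Hypothetical size threshold for the next hop; not from the spec.
MAX_INLINE_SIZE = 64 * 1024


def upload_payload(event_id, payload):
    """Hypothetical helper: persist the payload somewhere and return a URI for it."""
    # e.g. write to a blob store keyed by the event id
    return f"https://storage.example.com/events/{event_id}"


def download_payload(uri):
    """Hypothetical helper: fetch the payload from a trusted source."""
    raise NotImplementedError("storage-specific retrieval goes here")


def to_out_of_band(event):
    """Middleware step: drop `data` and add `dataref` when the payload is large."""
    data = event.get("data")
    if data is not None and len(json.dumps(data)) > MAX_INLINE_SIZE:
        event = dict(event)
        event["dataref"] = upload_payload(event["id"], data)
        del event["data"]
    return event


def to_in_band(event):
    """Middleware step: re-inline the payload for consumers that only read `data`."""
    if "data" not in event and "dataref" in event:
        event = dict(event)
        event["data"] = download_payload(event["dataref"])
    return event
```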

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
spec.md Outdated

* Constraints:
* OPTIONAL


Adding this attribute will help avoid sending large amounts of data in-band. But as raised in the meeting discussion, there could be cases that need multiple URI-refs, so we will need to consider how to define the type.

spec.md Outdated

* Constraints:
* OPTIONAL


Does the URI give information on how to obtain the data from that URI? I assume the URI could point to different types of storage mechanisms, such as block storage or a DB, etc. The mechanism to retrieve the info could differ. From an event consumer's point of view, I need to know the type, or how to retrieve the data.

Contributor Author

For the security-focused use cases, this is explicitly an anti-goal:

  • Only a pre-approved consumer knows how to retrieve the data, so that no one can listen in.
  • A consumer will retrieve the data from a trusted source, to make sure it hasn't been tampered with. It must explicitly not try to retrieve the data in any other way.

I agree that it would be nice for the message-too-large use case to have pre-defined mechanisms, but my guess is that it would be vendor-specific. E.g. Azure would use Blob Storage, AWS would use S3, etc. I'm not sure how to pull that off properly, or whether the URI isn't already enough info to know if it is e.g. an Azure Blob or an AWS S3 reference.

@duglin
Collaborator

duglin commented Feb 12, 2019

@clemensv I believe you had some ideas for a proposal to support this by reusing the existing data property - did you want to add it here for the call this week so we could discuss it?

@Tapppi
Contributor

Tapppi commented Feb 14, 2019

Since the spec has thus far mainly been concerned with features of the events that require support/definition themselves, not with requiring additional capabilities from consumers, is this a required capability (being an attribute in the main spec), or is it OK to drop (or even reject, e.g. an HTTP endpoint returning status code 400) an event because it uses dataref instead of data? Either way, I think there should be language to clarify that.

To put it another way: you might not support all transports, and even if you support the HTTP transport, you will not support all event formats in a single endpoint accepting CloudEvents. It is quite clear how to respond in that situation, but I don't think it is clear here, since you know about the attribute (it being in the main spec and all) but you simply can't or don't support it. Are you still compliant if you don't accept the event because you need to validate or look at the contents of the data attribute instead of just forwarding the message? The HTTP binding is clarified like this in two places:

> All implementations of this specification MUST support the non-batching JSON event format, but MAY support any additional, including proprietary, formats.

> If a receiver detects the CloudEvents media type, but with an event format that it cannot handle, for instance application/cloudevents+avro, it MAY still treat the event as binary and forward it to another party as-is.

This would not be a problem for me if it were an extension (because you don't have to support it) or included in the data attribute, because it could not then be "the exact and whole data payload" replacing the actual data, as it is in this proposal.

@duglin raised a point about clarifying the responsibilities of consumers and producers more generally regarding MUST etc. in the spec, but I think it is particularly thorny in this case, because it ascribes to consumers capabilities other than the processing of a single event.

@clemensv
Contributor

clemensv commented Feb 15, 2019

I find the dataref attribute problematic for a few reasons.

Events are generally notifications about something, and that something is usually pointed at. As a matter of principle, I have issues with event payloads so large that the claim check pattern is even required.

Example: Azure Blob events

Azure Blob events are about, and point to, files that might be gigabytes in size, and the events might be sent to an audience of dozens, but the event per se is super compact and contains further custom information about that blob.

Likewise, any event coming out of a solution such as a CRM system might point to a complex and large object, but that pointer will be accompanied by further metadata that allows the handler to determine whether following the link even makes sense; i.e., just like with the blob event, that would sit in the data for the handler to understand.

An additional complication for a generic mechanism is network scoping. Since dataref is and must be mutually exclusive with data, the event is not completely delivered to the subscriber when the URL can't be resolved. Since events might be routed and traverse different network scopes, there would have to be a rule for ensuring that any receiver can resolve the payload, and I don't know what that might look like.

I'm indeed disputing that there is a real use-case for events with giant payloads that warrant the use of the claim check pattern in a generic fashion and where putting a link into data, accompanied by a description of the referenced object, is not the better choice.

@duglin
Collaborator

duglin commented Feb 18, 2019

@Tapppi I opened: #388 so we remember to address your concerns.

@duglin
Collaborator

duglin commented Feb 18, 2019

I believe in our first demo we used a form of the claim-check pattern, because there was a URL to the image in the event's data attribute. Do people think that we ran into any interop issues during that time where having a well-defined spot (either a new property, or a location in data) would have made life easier?

@cneijenhuis
Contributor Author

@clemensv I'm a bit too lazy to make a POC, but there is at least a chance I can make an Azure Blob event larger than 64KB. There are several ways I can blow up the size of the event:

  • Azure blob storage supports folders. I can create a deeply nested folder structure with long folder names. These will then show up in the subject and data.url fields.
  • The clientRequestId is user-provided
  • The content type is user-provided
  • The account name is user-provided (part of both topic and url)

I'm sure all of these are limited, and I might not succeed in creating an event >64KB - but there could at least be a blob storage implementation that allows both long folder names and deep nesting, such that I end up with both subject and data.url approaching 32KB each (NTFS apparently supports 32K characters in a path, for example).

My overall point being: The moment you include user-provided data in your event, you can end up with an unpredictable variation in your event data size.

> Since events might be routed and traverse different network scopes, there would have to be a rule for ensuring that any receiver can resolve the payload

Yes, this is a problem. An approach could be that the intermediary (who routes from one network scope to another) needs to change the dataref in such cases.

> I'm indeed disputing that there is a real use-case for events with giant payloads that warrant the use of the claim check pattern in a generic fashion and where putting a link into data, accompanied by a description of the referenced object, is not the better choice.

As far as I can see, that link is already always part of a CloudEvent via the source attribute.

In the Blob Storage example, the events are simply BlobCreated and BlobDeleted. For any change I'll always have to go back to the source and fetch everything.
That is fine for Blobs, but for something that has more structure and meaning, it is useful if the events contain actual info about what has changed.

Given your CRM example, maybe it points back to a customer account. But that customer account contains a large amount of information. If something changes, your event should include what has actually changed. Maybe it is just the phone number (small), or maybe a phone call transcript (usually small, but maybe big) has been added.

In commerce, we often have objects (products, orders, ...) that are big (in the MBs), but an event consumer wants to know which part has changed. The consumer prefers not to download the whole thing each time, but to be able to work directly from the event. We therefore send, along with the event, the changes that have been performed. These are usually much smaller than the object itself, but can easily be a few hundred KB.
Even if we only pointed back to the sub-paths that have changed, these can exceed 64KB. E.g. our products may contain price information for each physical store the product is sold in. If, in a marketing campaign, the price is changed in hundreds of stores, just pointing to the changed prices (without the actual new price) can exceed 64KB.
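
For illustration, a sketch of what such a change payload might look like (as a Python dict with made-up field names, not our actual format); with hundreds of entries the delta alone approaches the 64KB limit even though it is far smaller than the full product:

```python
# Made-up shape of a "changes" payload for a price update; the field names are
# illustrative and not taken from any actual commercetools event format.
price_change_event_data = {
    "productId": "prod-123",
    "changes": [
        {
            "path": f"/variants/1/prices/store-{store}",
            "op": "replace",
            "value": {"currency": "EUR", "amountCents": 1999},
        }
        for store in range(300)  # one entry per store touched by the campaign
    ],
}
```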

@cathyhongzhang

To support interoperability between all types of event producers and event consumers, it is better to define the ref so that all entities along the way know where and how to retrieve the large payload. I think the difficult point is how we should define the ref to ensure that any event consumer can successfully retrieve the large payload based on this "ref", since there are different media and mechanisms in which the payload can be stored. Maybe we can list all existing media/mechanisms and define the ref for each type? We can always update the list when we find a new type.

@duglin
Collaborator

duglin commented Feb 24, 2019

In thinking about this one, a couple of things come to mind:

  • at some point in the spec's lifecycle we may find that we've left the stage of gathering the "minimum required" to claim success and started to enter the realm of "might be nice to haves", where it almost borders on "scope creep". I'm not 100% sure we're at that stage yet, but I do wonder about it when I think about this feature, because we were able to support the claim-check pattern just fine for the first interop event and I don't recall any pain around it.
  • is a single URL sufficient? What if there are multiple large objects for a particular event? Do we need to support an array? How do people know which URL points to what? Are we getting into the semantics of the payload at that point?
  • how does someone access the data behind that URL, since we're not including any auth info? That implies some out-of-band knowledge, which makes me wonder whether the receiver would know how to extract the URL from the data anyway.
  • who do we expect to process this property? The PR talks about middleware, but can we assume it has the auth necessary to do so?
  • someone mentioned wanting this URL outside of data, but this is meant to be an alternative to placing a large chunk of info into data - are we somehow violating our model now?

I can definitely appreciate the desire for a consistent/interoperable location for this information; however, I question the ability of anything but the receiving application (or some other entity that understands the type and data) to meaningfully know when (and how) to retrieve the remote data. And if that's the case, then that same code will probably know exactly how to handle an application-specific claim-check pattern encoded within the data property.

I guess I'm now at the position where I'd like to get more data on how needed this really is before we add it as a 'core' property. My biggest concern is adding a feature that may not really be as useful as we think, and then we're stuck with it. But if we start out with it as an extension, then we can always upgrade it to a core spec property later w/o breaking people.

@rperelma
Contributor

Very well put, @duglin I agree 100%

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@cneijenhuis cneijenhuis force-pushed the claim-check-pattern branch 2 times, most recently from 891f498 to b4df7bb on March 20, 2019 at 09:02
Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@cneijenhuis cneijenhuis force-pushed the claim-check-pattern branch from b4df7bb to b4c18a1 on March 20, 2019 at 09:03
@JemDay
Contributor

JemDay commented Mar 27, 2019

Now that we have the `datacontentencoding` and `datacontenttype` attributes, I'm curious whether we can leverage them to indicate that the data attribute contains a reference as opposed to the actual data - I believe this is in line with @clemensv's earlier comment.

@cneijenhuis
Contributor Author

It is an interesting approach; I think it could work if we want to go with XOR semantics.

However, in this PR I'm proposing OR semantics (i.e. allowing both data and dataref to be present). The producer may decide to include both, because then a middleware can decide to drop data (e.g. because it is too large, or because it wants to enforce stronger security on the next hop), and a consumer can opt in to retrieving the data from a trusted source to make sure it hasn't been tampered with (maybe some consumers don't want to do this at all, or only for some events).
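
As a rough sketch of that opt-in verification, assuming a hypothetical `fetch_from_trusted_source` helper and events as plain Python dicts:

```python
def fetch_from_trusted_source(uri):
    """Hypothetical helper: retrieve the payload over an authenticated channel."""
    raise NotImplementedError("auth and storage specifics go here")


def verified_data(event):
    """Return the payload, re-fetching it from the trusted source when the
    consumer wants proof that the in-band copy was not tampered with."""
    if "dataref" in event:
        trusted = fetch_from_trusted_source(event["dataref"])
        if "data" in event and event["data"] != trusted:
            raise ValueError("in-band data does not match the trusted copy")
        return trusted
    return event.get("data")
```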

@JemDay Anyway, if you want to push the Claim Check pattern forward, feel free to take over from me.

@@ -0,0 +1,73 @@
# Dataref (Claim Check Pattern)

As defined by the term [Data](spec.md#data), CloudEvents MAY include domain-specific
Collaborator

The CI failure noticed that this needs to be ../spec since we're in the extensions dir now. I think there are a few more below too.

@duglin
Collaborator

duglin commented Apr 2, 2019

Where are we on this? Ready for review? (aside from the minor CI issue)

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@cneijenhuis
Contributor Author

Yes, it is ready for review. However, I'll probably not join this week's call, or only for the first few minutes, so I suggest we review it the week after.

@duglin
Collaborator

duglin commented Apr 14, 2019

Approved on the 4/11 call.

@duglin duglin merged commit 3ea7805 into cloudevents:master Apr 14, 2019
@duglin
Collaborator

duglin commented Apr 15, 2019

Jem/Christoph were the voting members who voiced support for this.
