-
Notifications
You must be signed in to change notification settings - Fork 584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dataref attribute and describe Claim Check Pattern #377
Conversation
Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
This is a really good start, @cneijenhuis, thanks for the proposal! |
spec.md
Outdated
@@ -281,6 +281,27 @@ help intermediate gateways determine how to route the events. | |||
As defined by the term [Data](#data), CloudEvents MAY include domain-specific | |||
information about the occurrence. When present, this information will be | |||
encapsulated within the `data` attribute. | |||
The `dataref` attribute can be used to reference another location where this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the direction of this. My comments are editorial in nature:
- it feels like most of this should be put under the
dataref
section instead of here because most of it applies to only when dataref is present. Then we could just add a pointer to the dataref attribute from here. - s/can/MAY/ in this sentence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the careful review!
I have moved everything except the paragraph that describes the interaction between data
and dataref
into the dataref
section: 3b923d4
spec.md
Outdated
@@ -281,6 +281,27 @@ help intermediate gateways determine how to route the events. | |||
As defined by the term [Data](#data), CloudEvents MAY include domain-specific | |||
information about the occurrence. When present, this information will be | |||
encapsulated within the `data` attribute. | |||
The `dataref` attribute can be used to reference another location where this | |||
information is stored. Known as the "Claim Check Pattern", the `dataref` | |||
attribute can be used for different purposes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- s/can/MAY/ I think
- s/different purposes/a variety of purposes, including:/
spec.md
Outdated
`dataref` attribute. | ||
* If the consumer wants to verify that the information has not been tampered | ||
with, it can retrieve it from a trusted source using the `dataref` attribute. | ||
* If the information MUST only be viewed by trusted consumers (e.g. personally |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/MUST/is to/
spec.md
Outdated
Both the `data` and `dataref` attribute MAY exist at the same time. A middleware | ||
MAY drop the `data` attribute when the `dataref` attribute exists, it MAY add | ||
the `dataref` attribute and drop the `data` attribute, or it MAY add the `data` | ||
attribute by using the `dataref` attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to complete all options... is there ever a case where they can drop dataref
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allow a middleware to drop the dataref
, a consumer can not verify anymore that the data
hasn't been tampered with. Less significantly, another middleware downstream can not (easily) drop the data
attribute, if it wants to shrink the message size.
I can't come up with a good use case for a middleware to drop it. Did you have one in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope - was just asking for completeness. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be clear here on whether the dataref includes information carried in data attribute. If the dataref refers to information in addition to what is carried in data, then the middleware should not drop the data attribute. There are cases that it will result in performance enhancement if the event producer sends some key data information in-band in data attribute and other "might-need" large size information out-of-band via dataref. Extra latency is a drawback of serverless technology, so optimizing latency is important. There might be other use cases too. We may need to consider such use cases and allow data to be carried in both in-band data attribute and out-of-band dataref attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be clear here on whether the dataref includes information carried in data attribute.
I thought it should already be clear from the previous line, but apparently it is not. The information (as the spec calls it) of data
and the one retrieved via dataref
MUST be exactly the same.
I'll try to rephrase it.
If the dataref refers to information in addition to what is carried in data, then the middleware should not drop the data attribute.
It must not.
There are cases that it will result in performance enhancement if the event producer sends some key data information in-band in data attribute and other "might-need" large size information out-of-band via dataref.
Yes, this is a completely valid use case in itself, but not what dataref
is. dataref
is a reference to (what would be in) data
.
If one wants to put some information within data
, and some outside of it, that is completely fine - the references can be contained within data
.
The point of dataref
is that it allows a middleware (including a SDK) to switch between in-band and out-of-band without affecting consumers or producers.
Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
535f390
to
3b923d4
Compare
spec.md
Outdated
|
||
* Constraints: | ||
* OPTIONAL | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this attribute will help avoid sending large amount of data in-band. But as raised in the meeting discussion, there could be cases that need multiple URI-ref. So we will need to consider how to define the type
spec.md
Outdated
|
||
* Constraints: | ||
* OPTIONAL | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the URI give information on how to obtain the data information from that URI? I assume the URI could point to different types of storing mechanism, such as block storage or DB etc.. The mechanism to retrieve the info could be different. From an event consumer's point of view, I need to know the type or how to retrieve the data info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the security focused use cases, this is explicitly an anti-goal:
- Only a pre-approved consumer knows how to retrieve the data, so that no one can listen in.
- A consumer will retrieve the data from a trusted source, to make sure it hasn't been tampered with. It must explicitly not try to retrieve the data in any other way.
I agree that it would be nice for the message-too-large use case to have pre-defined mechanisms, but my guess is that it would be vendor-specific. E.g. Azure would use Blob Storage, AWS would use S3 etc. I'm not sure how to pull that off properly, or if the URI isn't already enough info to know if it is e.g. a Azure Blob or a AWS S3 reference.
@clemensv I believe you had some ideas for a proposal to support this by reusing the existing |
Since the spec is thus far mainly concerned with features of the events that require support/definition themselves, not requiring additional capabilities from consumers, is this a required capability (being an attribute in the main spec) or is it ok to drop (or even reject, e.g. HTTP endpoint returning status code To say it another way, you might not support all transports and even if you support HTTP transport, you will not support all event formats in a single endpoint accepting cloudevents. It is quite clear how to respond in that situation but I don't think it is clear here since you know about the attribute (it being in the main spec and all), but you simply can't or don't support it. Are you still compliant if you don't accept the event because you need to validate or look at the contents of the data attribute instead of just forwarding the message? The HTTP binding is clarified like this in two places:
This would not be a problem for me if it was an extension (because you don't have to support it) or included in the data attribute, because it could not then be "the exact and whole data payload" replacing the actual @duglin raised a point about clarifying the responsibilities of consumers and producers more generally regarding |
I find the Events are generally notifications about something and that something is usually being pointed at. As a matter of principle I'm having issues with payloads for an event that are so large that the claim check pattern is even required. Example: Azure Blob events Azure Blob events are about and point to files that might be Gigabytes in size, and the events might be sent to an audience of dozens, but the event per-se is super compact, and contains further custom information about that blob. Likewise, any event coming out of a solution such as a CRM system might point to a complex and large object, but that pointer will be accompanied by further metadata that will allow the handler to determine whether following the link even makes sense, i.e. just like with the blob even, that would sit in the And additional complication for a generic mechanism is network scoping. Since I'm indeed disputing that there is a real use-case for events with giant payloads that warrant the use of the claim check pattern in a generic fashion and where putting a link into |
I believe in our first demo we used a form of the claim-check pattern because there was a URL to the image in the event's |
@clemensv I'm a bit too lazy to make a POC, but there is at least a chance I can make a Azure Blob event become larger than 64KB. There are several ways I can blow up the size of the event:
I'm sure all of these are limited, and it may be I might not succeed with creating an event >64KB - but at least there could be an implementation of a blob storage that allows both long folder names and deep nesting that I can end up with both My overall point being: The moment you include user-provided data in your event, you can end up with an unpredictable variation in your event data size.
Yes, this is a problem. An approach could be that the intermediary (who routes from one network scope to another) needs to change the
As far as I can see, that link is already always part of a CloudEvent via the In the Blob Storage example, the events are simply Given your CRM example, maybe it points back to a customer account. But that customer account contains a large amount of information. If something changes, your event should include what has actually changed. Maybe it is just the phone number (small), or maybe a phone call transcript (usually small, but maybe big) has been added. In commerce, we often have objects (products, orders, ...) that are big (in MBs), but an event consumer wants to know which part has changed. The consumer prefers not to download the whole thing each time, but be able to work directly from the event. We therefore send, along with the event, the changes that have been performed. These are usually much smaller than the object itself, but can easily be a few hundert KB. |
To support interoperability between all types of event producers and event consumers, it is better to define the ref. so that all entities along the way know where and how to retrieve the large payload. I think the difficult point is how we should define the ref to ensure that any event consumer can successfully retrieve the large payload based on this "ref" since there are different medias and mechanisms the payload can be stored. Maybe we can list all existing medias/mechanisms and define the ref for each type? We can always update the list when we find a new type. |
In thinking about this one, a couple of things come to mind:
I can definitely appreciate the desire for a consistent/interoperable location for this information, however, I question the ability for anything but the receiving application (or some other entity that understands I guess I'm at the position now where I'd like to get more data on how needed this really is before we add it as a 'core' property. My biggest concern is adding a feature that may not really be as useful as we think and then we're stuck with it. But, if we start out with it being an extension then we can always upgrade it to a core spec property later w/o breaking people. |
Very well put, @duglin I agree 100% |
Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
891f498
to
b4df7bb
Compare
Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
b4df7bb
to
b4c18a1
Compare
Now we have datacontentencoding and 'datacontenttype' attributes i'm curious if we can now leverage them to indicate that the data attribute contains a reference as opposed to the actual data - i believe this is in line with @clemensv's earlier comment. |
It is an interesting approach, I think it could work if we want to go with XOR semantics. However, in this PR I'm proposing OR semantics (i.e. allowing both @JemDay Anyway, if you want to push the Claim Check pattern forward, feel free to take over from me. |
extensions/dataref.md
Outdated
@@ -0,0 +1,73 @@ | |||
# Dataref (Claim Check Pattern) | |||
|
|||
As defined by the term [Data](spec.md#data), CloudEvents MAY include domain-specific |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI failure noticed that this needs to be ../spec
since we're in the extensions
dir now. I think there are a few more below too.
Where are we on this? Ready for review? (aside from the minor CI issue) |
Yes, it is ready for review. However I'll probably not join this weeks call, or only for the first few minutes, so I suggest we review it the week after. |
Approved on the 4/11 call. |
Jem/Christoph were the voting members who voiced support for this. |
This resulted mostly out of the discussion around #364 but is also related to #373
The Claim Check Pattern is used for different purposes, including large payload size and security concerns.
This is on purpose "hand-wavy" around the actual retrieval of the payload.
The advantage is that in this way, any auth and storage can be used, e.g. the blob storage of a cloud provider and the protection by their auth mechanisms.
The downside is that it is hard for a middleware or a SDK to retrieve the payload. A non-protected HTTP(S) endpoint would be simpler.
@JemDay I would appreciate it if you could give this a read :)
Signed-off-by: Christoph Neijenhuis christoph.neijenhuis@commercetools.de