Content-Type MUST be used on all PUT/POST/PATCH requests? #70

Closed
kjetilk opened this issue Sep 25, 2019 · 50 comments

Comments

@kjetilk
Member

kjetilk commented Sep 25, 2019

The LDP spec says

5.2.3.6 LDP servers SHOULD use the Content-Type request header to determine the request representation's format when the request has an entity body.

I'm not sure how the SHOULD has been justified there, since I struggle to see any other valid way, so it seems we should have a MUST there.

@pmcb55

pmcb55 commented Sep 26, 2019

Couldn't the server use content sniffing to 'guess' the content type? I'm not sure if that counts as a 'valid way', but it seems to have been common enough in the past I believe...

@RubenVerborgh
Contributor

RubenVerborgh commented Sep 26, 2019

Couldn't the server use content sniffing to 'guess' the content type?

Security issue; let's not do that. Agree on MUST.

@csarven
Member

csarven commented Sep 26, 2019

[Saw this on my phone.. literally rushed to get my laptop to respond.. Ruben beat me to it by a second.]

https://tools.ietf.org/html/rfc7231#section-3.1.1.5 is fairly clear about potential issues around content sniffing.

@csarven
Member

csarven commented Sep 26, 2019

MUST raises the bar, so to speak, compared with SHOULD, but it will remove potential issues and any ambiguity. That's a quality on its own, and it sets what to expect from clients. So, +1 to MUST.

@pmcb55

pmcb55 commented Sep 26, 2019

Yeah, +1 to MUST from me too (I was only trying to point out that content sniffing was perhaps another 'valid way', I certainly wasn't trying to suggest that we ever rely on it, absolutely not :) !)

@csarven
Member

csarven commented Sep 26, 2019

For the sake of documentation and to offer an answer for why SHOULD and to eventually resolve this issue:

The server may not want to completely rely on what a client claims in the request. It gives a way out I suppose to do some verification. For authenticated agents, this would be less of a concern.

@acoburn
Member

acoburn commented Sep 26, 2019

The SHOULD-level requirement in LDP is used in the context of clients sending POST requests -- requests with an entity body -- to a server. If this is increased to MUST, what are the implications if a client does not send a Content-Type header?

For example, if a client attempts to create an LDP-NR and does not send a Content-Type, does that mean that the request must be rejected? If the request is accepted, does that mean that subsequent responses for that resource cannot use, for example, Content-Type: application/octet-stream?

@kjetilk
Member Author

kjetilk commented Sep 26, 2019

For example, if a client attempts to create an LDP-NR and does not send a Content-Type, does that mean that the request must be rejected?

Yes, that has been my understanding. There is basically no legitimate way to tell what media type there is without the header, so a 400 response results. We should have structured error message bodies for those cases though.
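
As a rough illustration of the strict reading described here, the check could look like the sketch below. The name `check_content_type` and the decision to let an empty body through without a header are assumptions made for this example; neither the LDP nor the Solid spec defines them.

```python
from typing import Mapping, Optional

# Methods this issue is concerned with.
METHODS_WITH_BODY = {"PUT", "POST", "PATCH"}


def check_content_type(method: str, headers: Mapping[str, str], body: bytes) -> Optional[int]:
    """Return 400 if the request should be rejected, or None if it may proceed."""
    if method.upper() not in METHODS_WITH_BODY:
        return None
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    # Assumption: only requests that actually carry a body need the header.
    if body and not normalised.get("content-type", "").strip():
        return 400
    return None
```

A server following this sketch would answer a body-carrying POST without the header with 400 and would never look inside the payload to guess its type.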

@pmcb55

pmcb55 commented Sep 26, 2019

So @csarven - could you elaborate a little on what you mean below: (as I don't understand what you mean at all):

The server may not want to completely rely on what a client claims in the request. It gives a way out I suppose to do some verification.

And @acoburn, when you say:

The SHOULD-level requirement in LDP is used in the context of clients sending POST requests -- requests with an entity body -- to a server.

...I think that's clear from @kjetilk's original comment that created this issue. The question is, does anyone know a good justification for the LDP guys deciding on the Content-Type header being a SHOULD here instead of a MUST, right?

'Cos I agree with @kjetilk that 'the implications if a client does not send a Content-Type header' should be simply a 400 response. The only other option I see for processing an incoming request without a Content-Type header is content sniffing, which everyone (so far) agrees would be a really, really bad idea.

So again, are we missing something here that the LDP spec editors realised, and that justified their making the Content-Type header a SHOULD instead of a MUST?

@csarven
Member

csarven commented Sep 27, 2019

If this is increased to MUST, what are the implications if a client does not send a Content-Type header?

Noting here that changing requirements around Content-Type will need to be factored in TSE (The Solid Ecosystem), especially the HTTP section.

@csarven
Member

csarven commented Sep 27, 2019

So @csarven - could you elaborate a little on what you mean below: (as I don't understand what you mean at all):

The server may not want to completely rely on what a client claims in the request. It gives a way out I suppose to do some verification.

(I'm only speculating a possible reason because I don't have a citation.)

We can consider the public-write or append case where a client may end up using an incorrect Content-Type value - not matching the payload. Noting here that if a server wants to ensure that a resource is eventually served accurately, it may end up content-sniffing (whether or not Content-Type has a value that is valid for the payload). This is neither encouraged nor part of the agreement for Content-Type use.

"[..] does anyone know a good justification [..]"

@TallTed @bblfish may want to chime in.

@bblfish
Contributor

bblfish commented Sep 27, 2019

Protocols have to say what must happen for things to work correctly. One can work one's way around bad clients, but then one is taking risks, which inevitably will be taken advantage of at some point by folks trying to break security.

I believe this is related to deontic logics, and at some point I'll see if I can find a good explanation of this in those terms. Until then I'd go with the above recipe.

@kjetilk
Member Author

kjetilk commented Sep 27, 2019

We can consider the public-write or append case where a client may end up using an incorrect Content-Type value - not matching the payload.

Yup, but that's where we slap the $ from https://www.w3.org/DesignIssues/HTTPFilenameMapping.html on it, I think.

@csarven
Member

csarven commented Sep 27, 2019

slap the $

Implementation detail... useful if there is a Content-Type.

@RubenVerborgh
Contributor

Implementation detail..

Not for all backends.

@timbl has indicated that he wants https://www.w3.org/DesignIssues/HTTPFilenameMapping.html to evolve into a specification that makes Solid also interoperable on a filesystem level.

@csarven
Member

csarven commented Sep 27, 2019

I was saying that in context of TSE. As for it being part of "file-based-solid-spec", okay. Perhaps I shouldn't have used "..." because the "if" is important. I'm not sure how HTTPFilenameMapping is expected to work for the no Content-Type case, unless of course there is some fallback to .ttl and/or check Link header for rel RDFSource or something (but still wouldn't know if the payload is actually Turtle or something else). And, that brings things back to non-turtle-rdf.ttl being conneg'd for Turtle.. So, I think Content-Type is quite critical (MUST) for file-based-solid-spec.

@TallTed
Contributor

TallTed commented Sep 27, 2019

Forgive me for not tracking down all citations; I'm on a wet-string connection at the moment.

As noted above, the cited quote of 5.2.3.6 from LDP is within the 5.2.3 HTTP POST section (so, not a global LDP rule).

Note also, from 6.2 HTTP 1.1 --

6.2.6 When the Content-Type request header is absent from a request, LDP servers might infer the content type by inspecting the entity body contents ([RFC7231] section 3.1.1.5).

I believe that the RFCs and W3 specs upon which LDP was built -- and which inheritance should be persisted in Solid -- state that POST (and PUT) without a Content-Type at least MAY (and my personal feeling is SHOULD) be accepted and treated as if submitted with Content-Type: application/octet-stream. I don't understand the apparently perceived issue with so doing.

The server doesn't parse, transform, or otherwise manipulate the payload of such submissions. Turtle content is left as Turtle; it is not parsed and loaded to a back-end RDF store. Clients requesting text/turtle from that resource (via Accept) don't get that, even though the content of the file is structured as Turtle. Clients requesting application/ld+json don't get a JSON-LD transformation of the Turtle. Clients including */* or application/octet-stream in their GET request's Accept list will get the raw resource, with Content-Type: application/octet-stream, and what they choose to do with that is their own lookout.

What am I missing?
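
The retrieval behaviour described in this comment can be sketched roughly as follows. This is a simplified illustration, not a full content-negotiation implementation: the function name `accepts` is hypothetical, q-values are ignored, and the Accept header is assumed to be present.

```python
def accepts(accept_header: str, stored_type: str = "application/octet-stream") -> bool:
    """Return True if the Accept header admits the stored media type.

    Simplification: parameters such as q-values are dropped, so a range
    listed with q=0 would still count as acceptable here.
    """
    stored_main = stored_type.split("/")[0]
    for item in accept_header.split(","):
        media_range = item.split(";")[0].strip().lower()
        if media_range in ("*/*", stored_type, f"{stored_main}/*"):
            return True
    return False
```

Under this sketch, `accepts("text/turtle")` is false for a resource stored as application/octet-stream, while a request that lists `*/*` or `application/octet-stream` gets the raw bytes back.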

@acoburn
Member

acoburn commented Sep 27, 2019

FWIW, the Trellis server does, effectively, what @TallTed describes:

A POST with an LDP-NR link header, an entity body and no content-type header will store the resource with Content-Type: application/octet-stream. Subsequent requests for that LDP-NR return a resource with a Content-Type: application/octet-stream response header.

A POST with no LDP link header, an entity body and no content-type header is accepted as an LDP-NR and will respond with a Content-Type: application/octet-stream header.

A POST with an LDP-RS link header, an entity body and no content-type header is rejected immediately.

Nowhere in that process is there any content sniffing. LDP-NRs are always treated (internally) as opaque byte arrays, and LDP-RSs must be parseable (if an entity body is present), based on the provided content-type.

@csarven
Member

csarven commented Sep 28, 2019

@TallTed Thanks for the feedback and giving the thread a bit more juice :)

I don't understand the apparently perceived issue with so doing.

AFAICT, the perceived issue wasn't particularly about application/octet-stream but given that it is a MAY and the alternative option that's mentioned in https://tools.ietf.org/html/rfc7231#section-3.1.1.5 :

or examine the data to determine its type

and the paragraph that discusses misconfigurations and potential risks.

What am I missing?

Nothing. The discussion just went in the direction to eliminate ambiguity by forcing clients to always indicate their intentions.

I now see that my comment in #70 (comment) wasn't adequate either because it was only in context of eliminating ambiguity as much as possible based on earlier discussion (re: "content sniffing"). So with that as a premise, MAY for application/octet-stream probably didn't matter. We need to revisit.

@csarven
Member

csarven commented Sep 28, 2019

@acoburn thanks for sharing!

The first two cases leading to LDP-NR/application/octet-stream are already covered by LDP/RFC and so can be left as is (as @TallTed and you already indicated).

As for:

A POST with an LDP-RS link header, an entity body and no content-type header is rejected immediately.

Is the reasoning based on https://www.w3.org/TR/ldp/#ldpc-post-createrdf :

If any requested interaction model cannot be honored, the server MUST fail the request.

Does Trellis consider Content-Type as a requirement to check if the interaction model can be honoured? Or is this separate?

We can acknowledge that RDF payload can make its way into a server but can only be reused via */* or application/octet-stream. Perhaps more specifically, I'd like to understand what are the use cases for @TallTed's remark about clients:

what they choose to do with that is their own lookout.

Put differently, why should RDF payload without indicating an interaction model through Link header or Content-Type be accepted, whereas a request that does specify the interaction model (LDP-RS in Link header) gets rejected:

A POST with an LDP-RS link header, an entity body and no content-type header is rejected immediately.

What am I missing?

@acoburn
Member

acoburn commented Sep 28, 2019

A POST with an LDP-RS link header, an entity body and no content-type header is rejected immediately.

Actually I was incorrect about this. A POST with an LDP-RS link header and no content type is accepted as text/turtle.

Within Trellis, LDP-NRs and LDP-RSs follow entirely different code paths and are stored in different subsystems. LDP-NRs are just opaque byte streams while LDP-RSs need to be parsed and validated before being persisted: hence, there is a need to know the content type of the HTTP entity body.

Thus far, I have treated these "what to do if the client isn't completely explicit" heuristics as implementation decisions, but if the Solid specification chooses to weigh in on or clarify these behaviors, it would be pretty simple to make adjustments, if adjustments need to be made.

RDF payload without indicating an interaction model through Link header or Content-Type be accepted

In this case, the reason the request is accepted has to do with the fact that it is accepted as an LDP-NR. This also raises a slightly different issue: if a client does not supply an LDP interaction model, what interaction model (if the request is accepted) should be assigned to the resource? Here, Trellis makes a decision based on the Content-Type header, choosing either an LDP-NR or an LDP-RS (if there is no Content-Type header, as stated above, an LDP-NR is chosen).
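
The heuristics described in this comment could be summarised in a sketch like the one below. It is a paraphrase of the prose, not Trellis's actual code: the function name `resolve` and the contents of `RDF_TYPES` are assumptions, and media type parameters such as charset are not handled.

```python
from typing import Optional, Tuple

LDP_RS = "http://www.w3.org/ns/ldp#RDFSource"
LDP_NR = "http://www.w3.org/ns/ldp#NonRDFSource"

# Assumed set of RDF syntaxes the server can parse.
RDF_TYPES = {"text/turtle", "application/ld+json", "application/n-triples"}


def resolve(link_type: Optional[str], content_type: Optional[str]) -> Tuple[str, str]:
    """Return (interaction model, effective media type) for a POST."""
    if link_type == LDP_RS:
        # No Content-Type on an LDP-RS: fall back to Turtle.
        return LDP_RS, content_type or "text/turtle"
    if link_type == LDP_NR:
        return LDP_NR, content_type or "application/octet-stream"
    # No interaction model supplied: decide from the Content-Type alone.
    if content_type in RDF_TYPES:
        return LDP_RS, content_type
    return LDP_NR, content_type or "application/octet-stream"
```

With neither header present, the sketch yields an LDP-NR typed as application/octet-stream, which matches the "accepted as an LDP-NR" case above.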

@TallTed
Contributor

TallTed commented Sep 30, 2019

@acoburn --

Actually I was incorrect about this. A POST with an LDP-RS link header and no content type is accepted [by Trellis] as text/turtle.

That doesn't seem quite the right action. An LDP-RS link header does give you a strong hint of the client's intent, but it would be equally valid attached to a JSON-LD payload as a Turtle payload -- and indeed, a payload in any other RDF serialization (though only JSON-LD and Turtle are required to be supported by LDP servers). Minimally, it seems that Trellis should test whether the payload is Turtle or JSON-LD (or neither), and take appropriate next steps...

@acoburn
Member

acoburn commented Sep 30, 2019

@TallTed the reasoning behind assuming that an LDP-RS is Turtle (rather than any other format) stems from an inversion of this statement:

4.3.2.1 LDP servers must respond with a Turtle representation of the requested LDP-RS when the request includes an Accept header specifying text/turtle, unless HTTP content negotiation requires a different outcome.

In other words, treat representations as Turtle unless there is a reason to think differently. I would also be very cautious about any sort of content sniffing, as has already been pointed out above.

@TallTed
Contributor

TallTed commented Sep 30, 2019

@acoburn - Speaking as a participant in the LDP WG that produced that spec, I don't think any of us expected any of its statements to be inverted in such a way.

Further, I don't understand how telling Servers "you must deliver a Turtle representation of an LDP-RS when Turtle is requested (i.e., when the request includes Accept: text/turtle), unless conneg indicates another representation of the LDP-RS is preferred" turns into telling Servers "every LDP-RS submitted by a Client must be (treated as) Turtle unless you're told otherwise by the Client", or into telling Clients "every LDP-RS you submit must be (or will be treated as) Turtle unless you tell the Server otherwise". That is especially so given that we explicitly required Servers to accept both JSON-LD (5.2.3.14) and TTL (5.2.3.5) if they accepted POST at all (5.2.3).

Still further, we didn't recommend content sniffing, though we said, in a non-normative section, that "When the Content-Type request header is absent from a request, LDP servers might infer the content type by inspecting the entity body contents ([RFC7231] section 3.1.1.5)." Note also that RFC7231 says in the third paragraph of 3.1.1.5 "If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type."

In other words -- the expectation is that you would either examine the data, or assume that it is application/octet-stream -- not that you would assume that it is text/turtle.

In sum -- it is dangerous to pluck any single statement from any spec, and more so to use your own interpretation of the inverse of such statement.
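
Of the two options RFC 7231 allows when Content-Type is absent, the non-sniffing one (assume application/octet-stream) is simple to express. A minimal sketch, with the hypothetical function name `effective_media_type` chosen for this example:

```python
from typing import Mapping

DEFAULT_TYPE = "application/octet-stream"


def effective_media_type(headers: Mapping[str, str]) -> str:
    """Return the declared media type, or the RFC 7231 fallback if none is given."""
    for name, value in headers.items():
        # Header names are case-insensitive; an empty value counts as absent.
        if name.lower() == "content-type" and value.strip():
            # Drop parameters such as "; charset=utf-8" and normalise case.
            return value.split(";")[0].strip().lower()
    return DEFAULT_TYPE
```

The other permitted option, examining the data, is deliberately left out of the sketch, since that is the content sniffing the thread wants to avoid.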

@csarven
Member

csarven commented Sep 30, 2019

A POST with an LDP-RS link header and no content type is accepted as text/turtle.

If the payload is in JSON-LD, what does Trellis do as it accepts the request? What happens when the created resource is requested i) without an Accept header, ii) with Accept: application/ld+json ?

@acoburn
Member

acoburn commented Sep 30, 2019

@TallTed my position on this is that the LDP specification is silent on this matter. As a consequence, a server's behavior is an implementation decision. If another specification were to define what ought to be done in this case, I would be happy to follow that, and I believe that is the basic purpose of this issue. I really have no opinion on what ought to be done in this case.

@csarven if the POST payload is JSON-LD, then the client-submitted request is parsed as JSON-LD (provided that the request includes a Content-Type header). Provided that the request succeeds, then subsequent GET requests (i) without an Accept header would return text/turtle while (ii) with an Accept: application/ld+json would return application/ld+json.

@csarven
Member

csarven commented Sep 30, 2019

@acoburn

(provided that the request includes a Content-Type header)

is a different case. I'm trying to understand this better:

A POST with an LDP-RS link header and no content type is accepted as text/turtle.

If the request has an LDP-RS Link header and message body in JSON-LD, what does "accepted as text/turtle" entail? Can you expand on that process?

@acoburn
Member

acoburn commented Sep 30, 2019

Internally, for LDP-RSs, a create/update request involves a parsing stage before the data is persisted. That is, the byte stream (java.io.InputStream) that is an incoming RDF serialization (in this case, a JSON-LD document) is translated into Java objects, specifically a org.apache.commons.rdf.api.Dataset with some additional metadata.

That parsing stage must succeed before the RDF is persisted into the storage layer. In fact, between parsing and storage, there is also a validation step. For instance, LDP imposes certain rules, and if those rules are violated, the request will be rejected. (Shape validation of various sorts happens at this point)

But as for the parsing stage, the parser needs to be told exactly how to parse the InputStream: is it text/turtle or application/n-triples or application/ld+json? Is it an RDF serialization that is not supported (e.g. application/rdf+xml or application/trig)? If that parsing stage fails, the request fails.

So, when I write:

accepted as text/turtle

I mean that, in the absence of an explicit Content-Type from the client, the RDF parser uses a Turtle-based reader. So in this case, if a Turtle-based reader is used (because the client didn't supply a Content-Type) but the HTTP entity body is actually something else (e.g. JSON-LD), then the request will simply fail with a parsing error. A log entry will be generated and the server will respond with a 400 Bad Request.
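
The parsing stage described above might be sketched like this. It is an illustration in Python rather than the Java that Trellis uses, and every name in it is hypothetical: `toy_parse` is only a stand-in for a real RDF parser, the 201 status on success is an assumption, and treating an unsupported syntax as 400 is also an assumption.

```python
from typing import Callable, Optional

SUPPORTED = {"text/turtle", "application/n-triples", "application/ld+json"}


class UnsupportedSyntax(Exception):
    """Raised when the declared RDF syntax cannot be parsed by this server."""


def choose_syntax(content_type: Optional[str]) -> str:
    if content_type is None:
        return "text/turtle"  # default reader when the client sent no Content-Type
    if content_type not in SUPPORTED:
        raise UnsupportedSyntax(content_type)
    return content_type


def handle_rdf_body(body: bytes, content_type: Optional[str],
                    parse: Callable[[bytes, str], object]) -> int:
    """Return an HTTP status; `parse` must raise ValueError on malformed input."""
    try:
        parse(body, choose_syntax(content_type))
    except (UnsupportedSyntax, ValueError):
        return 400  # parsing failed, so nothing is persisted
    return 201


def toy_parse(body: bytes, syntax: str) -> None:
    # Stand-in parser: only notices a JSON-looking body handed to the Turtle reader.
    if syntax == "text/turtle" and body.lstrip().startswith(b"{"):
        raise ValueError("not Turtle")
```

In this sketch a JSON-LD body sent without a Content-Type goes to the Turtle reader, fails to parse, and results in 400. No sniffing is involved.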

@acoburn
Member

acoburn commented Oct 1, 2019

One thing to add as more of a pragmatic point is that, by having the RDF parser default to Turtle, that means that an empty InputStream can be treated as valid RDF, since an empty file is also a valid Turtle document. So a request with no body can be treated as a valid incoming RDF document.

@pmcb55

pmcb55 commented Oct 1, 2019

Great discussion, but I remain a MUST-er (i.e. Content-Type MUST be used). That means a Solid server receiving a POST with an LDP-RS link header and no content type rejects it immediately (intuitively, I just don't like the implicit assumption of Turtle, as it feels (as with all implicit assumptions) 'dangerous' somehow).

@acoburn's point on being able to handle empty streams is interesting though, as it does seem a little strange to MUST provide a Content-Type if I know I'm passing an empty body. But I still think just a non-normative note to suggest clients provide Content-Type: text/turtle in that (presumably edge-) case is fine too...

So @TallTed, given your experience on the LDP spec, do you personally think a MUST is justified here (i.e. in the Solid spec), or would we be losing something if we mandate that (apart from the obvious 'burden' it places on the client to be explicit in what it wants to actually do!)?

@kjetilk
Member Author

kjetilk commented Oct 7, 2019

Indeed... I mean, we should always strive for requiring as little as possible, and I have been advocating that requiring just a little more than any Web server does would be a good idea, but it seems to me that authn sets requirements at both ends.

@dmitrizagidulin
Member

@TallTed Ah, got it. Yeah, definitely needs an auth client. (And I don't think that's against the spirit of the LDP spec, since, ahem, the LDP spec left authentication as out of scope :) )

@Mitzi-Laszlo assigned kjetilk and unassigned csarven on Oct 11, 2019
@RubenVerborgh changed the title from "Content-Type MUST be used?" to "Content-Type MUST be used on all PUT/POST/PATCH requests?" on Oct 29, 2019
@kjetilk
Member Author

kjetilk commented Oct 30, 2019

Proposal following the F2F meeting of 2019-10-30, with @csarven, @timbl and @kjetilk present:

We found it is advantageous to avoid lack of clarity in content types, since a fallback to defaults like application/octet-stream would leave apps unable to determine the content type, and therefore unable to present a suitable UX for it.

Users of basic UAs (e.g. curl) should be prevented from skipping the content type, because that may cause subsequent problems for apps using the data.

The cost of requiring clients to submit a content type is thus much lower than the cost of requiring servers and clients to deal with the consequences of wrong or useless content types.

This points towards a strict interpretation, i.e. MUST.

@TallTed
Contributor

TallTed commented Oct 30, 2019

Does this mean that the uploads must include content types that are known to the solid server? Or just ANY content type (which might well include application/octet-stream)?

How does the server test whether the content type a client specifies for a resource is appropriate to its content? If the server does not test such, how does this requirement prevent "wrong or useless content types"?

I do not see how simply forcing a content type to be present achieves the stated goals -- i.e., guaranteeing a suitable UX for that content, or that apps will know how to work with such content types.

@kjetilk
Member Author

kjetilk commented Oct 30, 2019

Does this mean that the uploads must include content types that are known to the solid server?

No.

It can't prevent a bad client from adding bad data. If the client has no idea about the content type, it will fall back to application/octet-stream, but then the client needs to understand that other clients will not be able to use it for anything. I didn't write "guarantee", I wrote "advantageous" :-)

It is more about making the chance of breakage smaller, not about preventing breakage entirely.

@TallTed
Contributor

TallTed commented Oct 30, 2019

"Other clients will not be able to use it for anything" overstates the impact. "Other clients" may inspect the payload of .ttl typed as application/octet-stream, discover that it's actually JSON-LD, and take appropriate action.

"Making the chance of breakage smaller" sounds to me much more like a SHOULD rule, than a MUST rule, as the latter will tend to lead to expectations that will not necessarily be satisfied.

This decision is not mine in the end, but I hope that the documentation around it will be clear about actual effect, as opposed to wishful effect.
