Feedback on new implementation of Prefer header in pywb #7
Comments
Thanks for taking on this work! Here is feedback from the Memento team in Los Alamos:
=> The terms "rewritten", "raw", "banner-only" are risky in that there could be potential for other applications of
=> Note that http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html included terms to convey semantics that don't seem to be covered by the terms introduced here. Ultimately, we as a community should decide what makes sense and what doesn't in this regard.
=> There is the question of which negotiation is handled first: datetime or Prefer. The Memento RFC, which was published prior to the existence of Prefer, does state that datetime negotiation is handled prior to any other content negotiation, by which was meant prior to e.g. language, format, etc. negotiation. Given the goal of Prefer in the current context, it seems that this rule should also apply to Prefer, i.e. datetime negotiation first, Prefer next. Even though Prefer isn't really considered negotiation ...
=> The pywb implementation of Pattern 1.3 is really problematic from the perspective of Memento clients. Clients decide that a resource is a Memento on the basis of the existence of a Memento-Datetime header and a link with rel="original", see [1] of http://mementoweb.org/guide/resourcetype/. Strictly speaking, a client could use Memento-Datetime only to make that determination. But, when doing so, the client does not know what the original (URI-R) is and hence cannot continue its time travel, e.g. to obtain another Memento of the same resource or visit the original on the live web. The link with rel="original" is in that sense essential for Mementos (URI-M) and also for TimeGates (URI-G).
Although I appreciate the idea of having a prefix to the If so, white-listing types or re-write that should not be applied seems clumsy. If some new type or rewrite comes along (e.g. modifying As with the I guess EDIT to further explain my confusion, isn't |
I think two orthogonal aspects are being intertwined here:
Regarding (1): I referred to the use of the Regarding (2): The blog post indeed comes from the perspective of expressing various degrees of "rawness" and, as such, in the approach described, multiple terms can indeed be combined. It doesn't necessarily have to be that way, and I did not suggest it had to be that way in the above. That is why I indicated that "Ultimately, we as a community should decide what makes sense and what doesn't with this regard". That's also why we have asked for feedback from the community ever since the blog post was published. In my opinion, the questions at this point are:
A few more detailed considerations:
I hope we can get some reactions to this all. |
Ah, sorry @hvdsomp for not picking up that the blog post was meant to be illustrative rather than suggestive. I have two concrete use-cases. The first is that, while attempting to integrate Mementos from multiple sources into a proxy service, I wanted to be able to request the un-rewritten entity, because none of that should be necessary in proxy mode. I had imagined this means the original headers as-is as well, but I now realise I'd not thought that through. The second is more of a convenience, in that it would be nice to run a proxy service for users and for generating screenshots of archived resources, and in the latter case I'd want to switch off the banner. Of course, I could just set up a separate endpoint for that (in fact I'd probably do that anyway to separate out the load!), so I could live without it if it's problematic. So, I seem to just be reiterating the main use cases covered here, but this selection of use cases does not seem broad enough to answer all the questions we have. We could do with hearing from more users and use cases, I guess. I have other cases that I would like to cover eventually, but they are very immature use cases that may or may not fit here. For example, as well as the standard WARC content, I also have (but have no way to give access to):
We've also considered making available:
However, given I was kind of surprised by this statement: "When using a Memento client, no rewriting is needed for replay." I'm thinking I may have entirely missed the point. |
Re: Namespace Re: original-content, original-links, echo-original-headers The effect of combining preferences means that there would be 6 different combinations that an implementation needs to support, and a client needs to understand, and that's not covering any other preferences. The use cases for each of the 6 preferences is unclear. I would recommend avoiding combining preferences unless there is a clear use case for having these combinations. Perhaps a better way to think about the Prefer is not as 'dimensions of rawness', but rather as format selection. A memento can be in Format A, or Format B, etc.. If a format is not available, a different format can be provided instead. The current implementation has taken this approach: A memento can be in the There may be some formats that are extensible, like screenshot vs screenshot + clickable map, for example, but even then I'd hesitate to start combining preferences, unless absolutely needed. Re: Re: http headers Re: HTTP proxy mode and memento and Prefer Perhaps there should be a Memento specification for proxy mode, since A client using pywb in proxy mode could do something like this to receive a raw memento at specified date.
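A minimal sketch of such a proxy-mode request, under stated assumptions: the proxy address `localhost:8080` and the target URL are placeholders, and the datetime is conveyed with the `Accept-Datetime` header from RFC 7089. The request is only built, not sent, so nothing here depends on a running pywb instance.

```python
import urllib.request

# Placeholders (assumptions, not taken from the thread).
PROXY = "localhost:8080"
TARGET = "http://example.com/"


def build_raw_memento_request(url, accept_datetime):
    """Build (but do not send) a proxied request for a raw memento."""
    req = urllib.request.Request(url)
    # Datetime negotiation, as defined by the Memento protocol.
    req.add_header("Accept-Datetime", accept_datetime)
    # Ask for the unaltered memento.
    req.add_header("Prefer", "raw")
    # Route the request through the (hypothetical) pywb proxy.
    req.set_proxy(PROXY, "http")
    return req


req = build_raw_memento_request(TARGET, "Thu, 01 Jun 2017 00:00:00 GMT")
# To actually send it against a real proxy:
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers.get("Memento-Datetime"))
#     print(resp.headers.get("Preference-Applied"))
```

A client would then inspect `Memento-Datetime` and `Preference-Applied` on the response to learn which memento and which format it actually received.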
This usage pattern happens to be very close to Pattern 1.3 behavior, but isn't quite, and perhaps it should have a separate name. Of course the client knows that it's connecting via a proxy, so there should be no confusion there. Re: negotiation order
Rereading the descriptions at: https://mementoweb.github.io/rfc-extensions/raw-memento/#rawness I can understand the intent of this to set up a kind of constraint system on independent dimensions of rewriting. In practice though, there aren't really independent dimensions, but only a few formats that make sense and have practical applications. I thought it may be useful to list these:
These correspond to the While there is roughly a 'url rewriting' and 'content rewriting' settings, they are not independent dimensions, as url rewriting implies content rewriting. Headers need to be modified whenever any content rewriting happens. One additional format to consider, I think this is something @ibnesayeed is interested in:
Other than that, I'm not sure there are any other options here, without delving into very implementation-specific details of rewriting systems. Removing just the banner, for example, while keeping the rewriting system in place, may not be a desirable option to expose for security reasons (and a server can always set an empty banner if they desire). Perhaps entirely client-side rewriting approaches (such as work being done by @ibnesayeed) will require some other type of hybrid format, or maybe that would be better handled by receiving a full WARC record? (the Possible AdditionsHere's a summary of some possible additional preferences, based on the comments and thoughts so far. (Names are just preliminary and with no determination on a possible prefix). (From Andy's suggestions):
(My thoughts so far):
|
One approach to namespace terms is to use
This way we will only need to register one |
A multi-valued attribute as I described above has another advantage as it allows composition of behavior on different aspects such as headers, payload, and banner separately. Hence, fewer unique keywords can yield many possible combinations. |
Kudos on moving forward with applying Prefer dynamics, @ikreymer. I would like to echo the need for namespacing of these Prefer values with the expectation that the terms may have different semantics in other contexts, as @hvdsomp mentioned. In talking with @phonedude about this GitHub issue, we discussed using a preference model with a state-of-the-art basis, e.g., IA and most replay systems serve rewritten URI-Ms with a banner visible by default, so build the Prefer semantics on a subtractive model (e.g., a "raw" Preference is "less processed") and an additive model (e.g., a derived screenshot representation per @anjackson). It seems that some Prefer terms and concepts will work well when combined while others will result in ambiguity, e.g., I'd like to see a combination of @ibnesayeed and @hvdsomp's approaches combining both I have a use case (again, like @anjackson) that is not yet clearly defined but with which I planned on leveraging the existing investigations of Memento+Prefer. My exploration lies in sending Prefer to URI-Ts with values about the mementos themselves (ideally pre-calculated) that would affect the set of mementos returned. For example, something like |
Thanks @ikreymer for starting this thread and applying these ideas! Re: Re: Pie-in-the-sky Ideas I do not have a use case for this, but I also envision a scenario where someone might want a memento digitally signed by an archive for legal purposes. |
(Apologies for slow response; I have been out of operation for a few days) Some observations regarding the very constructive above discussion: => Some things that are suggested could - and IMO should - be handled using regular content negotiation, not => I am a bit hesitant regarding @ikreymer's stance that we shouldn't support requests for which we currently don't see a use case. It's hard to predict future use cases. Also, boxing ourselves in from the outset regarding a framework that is all about adding expressiveness to Memento negotiation seems at odds with its very intent. Hence, I am in favor of an approach that syntactically allows conveying preferences that supports degrees of freedom and extensibility, even if various of the preferences that could be expressed would not be supported by web archives. As long as the archives can say that they did not honor the preferences, IMO, all is good. In support of this reasoning, I will just quote @ikreymer who sounds a bit hesitant to reject a potential preference: Removing just the banner, for example, while keeping the rewriting system in place, may not be a desirable option to expose for security reasons (and a server can always set an empty banner if they desire). I really would like to see a more "open and extensible" syntax approach. With that regard, I also like @ibnesayeed's => When considering potential uses for => A clarification for @anjackson re Memento and URL rewriting: Memento clients do not need rewritten URIs. Current Memento clients clearly can handle rewritten URLs because they're all over; but they can also work with un-rewritten URLs. In both cases, prior to continuing a time travel session, the "nature" of the URL to which the client will travel is checked by looking at its HTTP header. If that has a Memento-Datetime, then the URL is a URI-M and the client uses the URI-R from the link rel="original" to continue traveling. 
If it doesn't have Memento-Datetime, then the URL is a URI-R and can directly be used for traveling. In both cases, URI-R is used for time traveling. |
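The check described in this comment can be sketched roughly as follows. This is an illustration only: `classify_resource` is a hypothetical helper, and the `Link` parsing is deliberately simplified (it assumes no `<` characters inside parameter values).

```python
import re

# One Link entry: <target> followed by its parameters up to the next "<".
_LINK_ITEM = re.compile(r'<([^>]*)>([^<]*)')
# The rel parameter, quoted or unquoted.
_REL = re.compile(r'\brel\s*=\s*(?:"([^"]*)"|([^;,\s]+))', re.IGNORECASE)


def classify_resource(headers):
    """Return (kind, uri_r): kind is 'URI-M' or 'URI-R'.

    uri_r is the rel="original" target when one is present, else None.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    if "memento-datetime" not in lowered:
        # No Memento-Datetime: the URL itself is treated as the URI-R.
        return "URI-R", None
    for target, params in _LINK_ITEM.findall(lowered.get("link", "")):
        match = _REL.search(params)
        rels = []
        if match:
            rels = (match.group(1) or match.group(2) or "").lower().split()
        if "original" in rels:
            return "URI-M", target
    # A memento without rel="original": time travel cannot continue.
    return "URI-M", None
```

The last return value is the problematic case raised earlier in the thread: the client knows it has a memento but has no URI-R to continue traveling with.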
I think this syntax is illegal as per the ABNF grammar of the Prefer header. However, you can have something like:
Alternatively, along the lines of
Note that various operators might not be legal outside, but it should be perfectly valid to add them inside the quoted value. How the value is parsed and consumed will depend on how the |
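To illustrate why a quoted value is convenient, here is a minimal sketch of a `Prefer` parser that keeps everything inside the quotes intact. It is not a full RFC 7240 implementation: `parse_prefer` is a hypothetical helper, preference parameters after `;` are dropped, and backslash escapes inside quoted strings are not handled. The `memento-variant` name is the one used elsewhere in this thread.

```python
def _split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, buf, quoted = [], [], False
    for ch in text:
        if ch == '"':
            quoted = not quoted
        if ch == sep and not quoted:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts


def parse_prefer(value):
    """Map each preference token to its (unquoted) value, or None."""
    prefs = {}
    for item in _split_outside_quotes(value, ","):
        # Keep only the preference itself; drop any ";"-separated parameters.
        head = _split_outside_quotes(item, ";")[0].strip()
        if not head:
            continue
        name, sep, val = head.partition("=")
        val = val.strip()
        if len(val) >= 2 and val[0] == '"' and val[-1] == '"':
            val = val[1:-1]
        prefs[name.strip().lower()] = val if sep else None
    return prefs
```

Operators such as `=` and `&` inside the quoted string pass through untouched, so their interpretation can be left to whatever grammar is eventually agreed on for the value.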
https://egr.vcu.edu/media/school-of-engineering/egr-main/img/layout/chem-icon.svg becomes: https://archive.is/MIq1y/2b24413af4630e8917a9f2dd2d9b5a634f5293eb.svg Replay of original headers isn't going to be defined for that URI.
http://archive.is/download/MIq1y.zip There is currently no machine-readable mapping from http://archive.is/MIq1y --> http://archive.is/download/MIq1y.zip but there should be, and that should be done with a rel type (which would also allow different MIME types, both WARC (when it gets one) and "application/webpackage+cbor", etc.). |
I am all OK with @phonedude's proposal to use a |
I believe @ibnesayeed's Prefer approach to be a good one that encourages extensibility, namely, using two different syntactic structures:
This would require only two preferences to be registered with IANA and allows for variability of both mementos and TimeMaps in arbitrary dimensions with conditions we currently do not foresee. The ability, as provided by these syntax, for conditions to be quantitative beyond boolean (e.g., values within a range instead of simply present/processed or not) is particularly appealing. |
Would it be possible to avoid the technology name prefixes ( Maybe. 😸 |
Hey @machawk1 and @ibnesayeed: Forgive me for being thick, but would it be possible to fully elaborate the proposed
I would prefer (sic) to focus on the |
To illustrate various approaches, let me first list a few variations on four different aspects of the response. These variations are in no way exhaustive and should be taken for the illustration purpose only. Some of these might be better suited to be negotiated differently.
These variations can be put together in various ways of which some may be semantically illegal (or mutually exclusive) and some may not be very useful. These can be used in two modes:
Either of these modes can be added in one of the two scopes as described below:
If we use the namespaced scope, variations can be organized in two ways as described below:
The
|
I agree that we should discuss the |
@ibnesayeed: thanks for this detailed proposal. could you next map these proposed approaches to existing archives? e.g., for IA (w/o any "_" modifiers) would it be: Preference-Applied: memento-variant="header-original rewrite-hyperlinks rewrite-requisites rewrite-location banner-inline" anything missing? and if that's IA, what would archive.is look like? |
The above proposal describes five meaningful ways to utilize the
We can use the following abbreviations to simplify examples.
Since the format of the OpenWayback Default implementation of OWB rewrites various links/references, includes an inline banner, and provides original response headers, but does not fix any JS issues.
PyWB Default implementation of PyWB rewrites various links/references, serves banner as the isolated parent document and serves Memento inside it in an iframe at a different URL (using
Archive.is Archive.is rewrites various links/references, includes an inline banner, serves a serialized rendered DOM, and mutes all the JS on the page, but does not provide original response headers.
A typical original unrewritten (or raw) response (that is usually served using the
|
@ibnesayeed: thanks for the additional info. borrowing from the archive.is example, my initial preference is for the simplicity of "NF": NF> Preference-Applied: memento-variant="rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsmuted payload-render banner-inline" but I recognize that "NS" is more extensible: NS> Preference-Applied: memento-variant="rewrite=location&hyperlinks&requisites payload=jsmuted&render banner=inline" what do others think? |
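The two styles quoted above carry the same information for this example, which a short sketch can make concrete. `ns_to_nf` is a hypothetical helper; it assumes that `&` joins several values under one name and that groups are separated by whitespace, as in the quoted `Preference-Applied` lines, and it does not handle any other operators.

```python
def ns_to_nf(ns_value):
    """Expand an NS-style value into the flat NF-style feature list."""
    features = []
    for group in ns_value.split():
        name, sep, values = group.partition("=")
        if not sep:
            # A bare name is already in NF form.
            features.append(name)
            continue
        features.extend(f"{name}-{v}" for v in values.split("&") if v)
    return " ".join(features)


ns = "rewrite=location&hyperlinks&requisites payload=jsmuted&render banner=inline"
print(ns_to_nf(ns))
```

For the archive.is example this reproduces the NF string quoted in the comment, so for simple cases the choice between the two is mostly about how much structure clients are expected to parse.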
Global styles, both GC and GA, are difficult to extend as each time they will require a very careful choice of preference name that does not step on something more generic with a different semantic meaning. The latter has another issue of registering way too many preferences. While NC does not have the problem of registering too many preferences (as it registers only one, i.e., Namespaced-Atomic styles, both NF and NS eliminate some of the issues described above. NF is simpler, but less expressive than NS. Due to the atomic nature of features it might be desired to express how certain features are to be combined. By bringing some structure using operators, we can express many preferences more effectively. For example, in NF when we say:
It is not clear that the client wants to rewrite both
By including some relational operators between the name and value (as illustrated earlier) in the NS style we can express some more useful preferences such as inequalities or negations as following:
By including some parameters with values (as illustrated earlier) in the NS style we can express some more useful preferences such as viewport dimensions of a screenshot or thumbnail size:
It is not always necessary to have values associated with each name (that means the syntax of NS is a superset of NF):
If we write a grammar for |
Thanks everyone for all the feedback, and thanks to @ibnesayeed for all the detailed examples. Going back to the original purpose for this, to create a standardized, interoperable API, I think perhaps it is important to indicate when there is no interoperability, e.g. when comparing any rewriting systems from one software to another. The A rewritten memento from openwayback vs pywb vs archiveis is going to be a different result, so why not just state this explicitly, by adding But then, taking a step back from this even further, given the use cases and suggested formats, I am sort of wondering why a rather esoteric approach like the Prefer/Preference-Applied header is needed at all here. I would suggest that what is being described as "content negotiation" here can in all honesty be described as a "search query". With the exception of a few cases mentioned above (e.g. HTTP/S proxy mode), I think most of the requested functionality can and should be implemented with a standard search query API, with a format like: This would work for both memento and timemap queries. For memento, the response could be a JSON response listing all the possible results, with best match being first, as is typical in a search query. This would make the API much more accessible and flexible, allowing multiple result formats, as well as a possible endpoint to list the available formats. No need to create custom header parsers, or register anything with IANA. I think it's fair to say that most client users and server implementers would find it much easier to use standard URL query parsing than to introduce custom parsing of a new header. Ease of implementation and conformance with existing expectations about APIs is more likely to lead to adoption and interoperability.
If the purpose of this exercise is just to find the possibility of interoperability then that can be handled in many different ways (some of which might not be the right way of interacting with an HTTP system):
The A URL query parameter-based approach has some issues:
|
To me, the point of this is to retrieve a resource in a known definitive format, so that it can be treated in a particular way, for example compared to other resources in the same format. (I believe that was also an original use case: making it easier to compare mementos.) If two mementos are in
I am struggling to find a good use case for when a client would be 'not too picky' about a response.
Too rigid? How many complex search systems with boolean logic, etc.. have been built using url query arguments? :)
But isn't the whole point that we're defining a new API specification? The query params will be part of the spec. There are existing tools that can help with this, unlike with the Prefer header. Overloading the behavior of existing URLs via HTTP headers is not any easier (for the implementor or client) than defining a new API endpoint.
I agree with @ikreymer that it is important for a client to understand unambiguously what it is receiving. The Prefer/Preference-Applied provides that functionality, if defined appropriately. I am not at all in favor of a query API approach as suggested by @ikreymer because it entails introducing a new API from scratch rather than leveraging the Memento "API" that is meanwhile widely adopted. Prefer/Preference-Applied is a logical step to add expressiveness (via additional negotiation) to the Memento protocol. I sincerely hope we can achieve a solution along that path. |
I might have misunderstood it, because I thought @ikreymer was using "ambiguity" to refer to the "soft content negotiation" nature of In my opinion, a client should be able to ask for a memento's "preferred" variation, which the server should try to fulfill as best as possible, but return a memento even if some (or all) preferences were not applied. However, it is very important for the server to tell the client precisely and "unambiguously" about the response returned (using Here are a few examples where unfulfilled preferences might still be more useful than returning nothing:
I would not vote for redefining a URL query parameter-based API that breaks existing systems. There are situations where negotiating header-based preference is not possible, for example, clicking a link to open a memento in a browser or including a memento inside of an iFrame using |
We seem to have strayed a long way from the original use cases, and I'm not convinced the complexity is justified. Firstly, I agree with @hvdsomp in that the whole point of this was to add the ability to negotiate between However, I disagree with some of the analysis from @ibnesayeed - specifically, the distinction between 'atomic' and 'composite' is rather artificial and I don't think it will age well. If we look at the examples (and ignore that
The latter 'atomic' variant appears to be preferred, but requires the client to enumerate the ways in which the content should be original. Is there really only the header and the payload? What about the protocol version? Even if this is an exhaustive set (which I accept it probably is in this case), there is nothing more fundamental about having to list the conceptual components of an HTTP response. It's just longer. The
I agree with @ikreymer in that I do not think it is helpful to consider So I'd like to be able to ask for |
@anjackson would you prefer the GC style (e.g., Whatever mechanism we eventually agree on, I would like a way to request raw mementos, but their location headers rewritten. |
No strong opinion. I've tried looking at the examples and the currently registered preferences. I think I'm repeating someone else's suggestion, but perhaps NC is easier to standardise as it's just a single top-level preference. |
I wanted to briefly mention a new implementation of Prefer header that is being added to pywb, as part of work for the UK Web Archive.
(The implementation is currently available on this branch in https://github.com/ukwa/pywb)
Here's a brief summary of this implementation.
TimeGate and URL-M both accept Prefer header
As a compromise between the previously suggested options, both the TimeGate and the URL-M support the Prefer header in pywb and respond accordingly. The exact behavior depends on the memento negotiation pattern in use, as explained below.
Supported Preferences
The following preferences are supported:
Prefer: rewritten -- fully rewritten content as needed for replay. URLs may be rewritten throughout the content and other custom changes, such as a banner, may be injected into the memento.
Prefer: raw -- original, unaltered memento. No content is altered. Certain hop-by-hop headers may be prefixed with X-Archive-Orig-
Prefer: banner-only -- A banner is inserted into the <head> element in one continuous block, but the content is otherwise unaltered. No links are rewritten, and no other content is modified. Certain hop-by-hop headers may be prefixed with X-Archive-Orig- but headers are otherwise not rewritten. The banner can be easily detected with start and end markers. This preference is especially useful for proxy mode.
Memento Negotiation Patterns
Since pywb actually supports multiple memento negotiation patterns defined in RFC 7089, it makes sense to have the Prefer header behavior also correspond to the negotiation pattern already in use.
Pattern 2.1 -- 302 Style Negotiation (spec)
When using 302 (*) style negotiation in pywb, the Prefer header results in a redirect to the 'canonical' url representing that format. The redirect happens when the Prefer header is present on either a URL-G or a URL-M request. The Preference-Applied header is served with the response.
Pattern 2.2 -- 200 Style Negotiation (spec)
When using 200 style negotiation, the Prefer header can also be applied on URL-G or URL-M, and the desired resource is served directly, with the correct Preference-Applied header. The Content-Location header is set with the canonical representation of the resource.
This mode is the default in pywb 2.0
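The difference between the two patterns can be summarized in a small sketch of the response metadata. This is not pywb code: `prefer_response` is a hypothetical helper, and attaching `Preference-Applied` to the redirect itself in Pattern 2.1 is an assumption based on the description above. The 307 status follows the note that pywb uses 307 rather than 302.

```python
def prefer_response(pattern, applied, canonical_url):
    """Return (status, headers) for a request that carried a Prefer header."""
    if pattern == "2.1":
        # Redirect style: send the client to the canonical URL for the format.
        return 307, {
            "Location": canonical_url,
            "Preference-Applied": applied,
        }
    if pattern == "2.2":
        # 200 style: serve the content directly and name its canonical URL.
        return 200, {
            "Content-Location": canonical_url,
            "Preference-Applied": applied,
        }
    raise ValueError(f"unsupported pattern: {pattern}")


status, headers = prefer_response(
    "2.2", "raw", "http://host/prefix/20170601000000id_/http://example.com/"
)
```

In both cases the client can read `Preference-Applied` to confirm which format it was given, and the canonical URL lets it request that format again without a `Prefer` header.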
Pattern 1.3 -- 200 Style Negotiation (spec)
The Pattern 1.3 pattern (**) is the proxy mode behavior, where the user connects to pywb via an HTTP/S proxy and no url rewriting is performed. The Prefer header is also supported in this mode, and the Preference-Applied header is returned in the response. Since URL-M = URL-G = URL-R in this mode, no redirect or alternative Content-Location is included. The Prefer header is especially useful for requesting different format resources since no unique canonical urls exist.
This mode only supports raw and banner-only preferences. If Prefer: rewritten is requested, the response is actually the banner-only memento, e.g. Preference-Applied: banner-only
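The proxy-mode fallback can be written down as a tiny decision function. This is only a sketch of the behavior described above, not pywb's implementation: `applied_in_proxy_mode` is a hypothetical name, and returning `None` for a missing or unrecognized preference is an assumption, since that case is not specified here.

```python
PROXY_SUPPORTED = ("raw", "banner-only")


def applied_in_proxy_mode(prefer_header):
    """Return the value to send back in Preference-Applied, or None."""
    requested = (prefer_header or "").strip().lower()
    if requested in PROXY_SUPPORTED:
        return requested
    if requested == "rewritten":
        # No URL rewriting happens in proxy mode, so the closest
        # available format is the banner-only memento.
        return "banner-only"
    # Assumption: anything else is left for the server default.
    return None
```

Because the server reports `banner-only` rather than echoing `rewritten`, a client can tell that its preference was not honored exactly.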
Canonical Url Representations
For non-proxy mode replay (Pattern 2.1 and 2.2), each preference corresponds to a 'canonical' url:
raw - http://host/prefix/<timestamp>id_/<url>
banner-only - http://host/prefix/<timestamp>bn_/<url>
The canonical representation for rewritten also changes depending on whether pywb is running in framed or frameless replay:
rewritten is http://host/prefix/<timestamp>mp_/<url>
rewritten is http://host/prefix/<timestamp>/<url>
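As a rough illustration, the mapping from preference to canonical URL could be expressed like this. `canonical_url` is a hypothetical helper, not a pywb function. Note the labeled assumption: the text above does not say which of the two rewritten forms belongs to which replay mode, and the sketch assumes `mp_` is the framed one.

```python
# Modifiers taken from the canonical urls listed above.
MODIFIERS = {"raw": "id_", "banner-only": "bn_"}


def canonical_url(prefix, timestamp, url, preference, framed=True):
    """Build the canonical URL for a preference in non-proxy replay."""
    if preference == "rewritten":
        # Assumption: framed replay uses mp_, frameless uses no modifier.
        modifier = "mp_" if framed else ""
    else:
        modifier = MODIFIERS[preference]  # KeyError for unknown preferences
    return f"{prefix}/{timestamp}{modifier}/{url}"


print(canonical_url("http://host/prefix", "20170601000000",
                    "http://example.com/", "raw"))
```

The same function gives the `Location` target for Pattern 2.1 and the `Content-Location` value for Pattern 2.2.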
Request for feedback
Let me know if there is any feedback on this implementation, or other thoughts.
Some of this may be particular to the pywb implementation, but some of this behavior may make sense to standardize further or change.
If anyone is interested in code, here are a few unit tests that test for this behavior:
Prefer header patterns 2.1 and 2.2
Proxy mode tests, including Prefer header patterns 1.3
Notes
*: pywb actually uses 307 redirects instead of 302
**: pywb almost supports pattern 1.3 fully, but cannot include a link to rel=original in proxy mode, as it is not available. Nevertheless, the behavior is otherwise identical to pattern 1.3, but perhaps there should be another name for it?