Feedback on new implementation of Prefer header in pywb #7

Open
ikreymer opened this issue Mar 1, 2018 · 31 comments

Comments

@ikreymer

ikreymer commented Mar 1, 2018

I wanted to briefly mention a new implementation of Prefer header that is being added to pywb, as part of work for the UK Web Archive.

(The implementation is currently available on this branch in https://github.com/ukwa/pywb)

Here's a brief summary of this implementation.

TimeGate and URL-M both accept Prefer header

As a compromise between the previously suggested options, both the TimeGate and the URL-M support Prefer header in pywb and respond accordingly. The exact behavior is dependent on the memento negotiation pattern that is in use, as explained below.

Supported Preferences

The following preferences are supported:

  • Prefer: rewritten -- fully rewritten content as needed for replay. URLs may be rewritten throughout the content, and other custom changes, such as a banner, may be injected into the memento.

  • Prefer: raw -- original, unaltered memento. No content is altered. Certain hop-by-hop headers may be prefixed with X-Archive-Orig-

  • Prefer: banner-only -- A banner is inserted into the <head> element in one continuous block, but the content is otherwise unaltered. No links are rewritten, and no other content is modified. Certain hop-by-hop headers may be prefixed with X-Archive-Orig- but headers are otherwise not rewritten. The banner can be easily detected with start and end markers. This preference is especially useful for proxy mode.

Memento Negotiation Patterns

Since pywb actually supports multiple memento negotiation patterns defined in RFC 7089, it makes sense to have the Prefer header behavior also correspond to the negotiation pattern already in use.

Pattern 2.1 -- 302 Style Negotiation (spec)

When using 302 (*) style negotiation in pywb, the Prefer header results in a redirect to the 'canonical' url representing that format. The redirect happens when the Prefer header is present on either a URL-G or a URL-M request. The Preference-Applied header is served with the response.

Pattern 2.2 -- 200 Style Negotiation (spec)

When using 200 style negotiation, the Prefer header can also be applied on URL-G or URL-M, and the desired resource is served directly, with the correct Preference-Applied header. The Content-Location header is set with the canonical representation of the resource.

This mode is the default in pywb 2.0
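To make the difference between the two patterns concrete, here is a minimal Python sketch of the response a server might build once it has decided which preference to apply. This is not pywb's actual code: the function name and its parameters are hypothetical, and the canonical URL is assumed to be computed elsewhere. The 307 status follows the note below that pywb uses 307 rather than 302.

```python
def prefer_response(pattern, applied, canonical_url):
    """Return (status, headers) for a request that carried a Prefer header.

    pattern: "2.1" (redirect style) or "2.2" (200 style).
    applied: the preference the server decided to apply, e.g. "raw".
    canonical_url: canonical URL for that preference (assumed to be built elsewhere).
    """
    headers = {"Preference-Applied": applied}
    if pattern == "2.1":
        # Redirect style: send the client to the canonical URL for the format.
        headers["Location"] = canonical_url
        return 307, headers
    if pattern == "2.2":
        # 200 style: the resource is served directly and Content-Location
        # names its canonical representation.
        headers["Content-Location"] = canonical_url
        return 200, headers
    raise ValueError(f"unsupported pattern: {pattern}")
```

In both cases the client can read Preference-Applied to learn which format it actually received.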

Pattern 1.3 -- 200 Style Negotiation (spec)

Pattern 1.3 (**) is the proxy mode behavior, where the user connects to pywb via an HTTP/S proxy and no url rewriting is performed. The Prefer header is also supported in this mode, and the Preference-Applied header is returned in the response. Since URL-M = URL-G = URL-R in this mode, no redirect or alternative Content-Location is included. The Prefer header is especially useful for requesting different format resources since no unique canonical urls exist.

This mode only supports raw and banner-only preferences. If Prefer: rewritten is requested, the response is actually the banner-only memento, e.g. Preference-Applied: banner-only
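The proxy-mode fallback can be sketched in a few lines of Python. The function name is hypothetical, and the handling of an unrecognized preference (returning None, meaning no Preference-Applied header) is an assumption, since the description above only covers the three named preferences.

```python
# Preferences that proxy mode can serve as requested, per the description above.
PROXY_MODE_PREFERENCES = ("raw", "banner-only")


def applied_in_proxy_mode(requested):
    """Return the value for Preference-Applied in proxy mode, or None."""
    if requested in PROXY_MODE_PREFERENCES:
        return requested
    if requested == "rewritten":
        # No URL rewriting happens in proxy mode, so banner-only is served instead.
        return "banner-only"
    # Assumption: unknown preferences are ignored rather than rejected.
    return None
```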

Canonical Url Representations

For non-proxy mode replay (Pattern 2.1 and 2.2), each preference corresponds to a 'canonical' url:

  • raw - http://host/prefix/<timestamp>id_/<url>
  • banner-only - http://host/prefix/<timestamp>bn_/<url>

The canonical representation for rewritten also changes depending on whether pywb is running in framed or frameless replay:

  • If framed (the default), canonical url for rewritten is http://host/prefix/<timestamp>mp_/<url>
  • If in frameless, canonical url for rewritten is http://host/prefix/<timestamp>/<url>
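As an illustration, the mapping from preference to canonical url could be written as the following Python sketch. The function and parameter names are made up for this example; only the id_, bn_ and mp_ modifiers and the url layout come from the lists above.

```python
# Modifier placed directly after the timestamp for each non-rewritten preference.
MODIFIERS = {"raw": "id_", "banner-only": "bn_"}


def canonical_url(prefix, timestamp, url, preference, framed=True):
    """Build the canonical url for a preference.

    prefix: e.g. "http://host/prefix" (no trailing slash).
    framed: True for framed replay (the default), False for frameless.
    """
    if preference == "rewritten":
        # Framed replay uses mp_, frameless replay uses no modifier at all.
        modifier = "mp_" if framed else ""
    else:
        modifier = MODIFIERS[preference]  # KeyError for unknown preferences
    return f"{prefix}/{timestamp}{modifier}/{url}"
```

For example, canonical_url("http://host/prefix", "20140716200243", "http://example.com/", "raw") gives http://host/prefix/20140716200243id_/http://example.com/.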

Request for feedback

Let me know if there is any feedback on this implementation, or other thoughts.
Some of this may be particular to the pywb implementation, but some of this behavior may make sense to standardize further or change.

If anyone is interested in code, here are a few unit tests that test for this behavior:

Notes

*: pywb actually uses 307 redirects instead of 302
**: pywb almost supports pattern 1.3 fully, but can not include a link to rel=original in proxy mode, as it is not available. Nevertheless, the behavior is otherwise identical to pattern 1.3, but perhaps there should be another name for it?

@hvdsomp

hvdsomp commented Mar 1, 2018

Thanks for taking on this work!

Here is feedback from the Memento team in Los Alamos:

=> The terms "rewritten", "raw", "banner-only" are risky in that other applications of Prefer could end up using them too, especially if they are not registered as per "The Registry of Preferences" of https://www.rfc-editor.org/rfc/rfc7240.txt. Apart from that, it would be nice to give the terms some kind of "branding" that refers to web archiving, memento applications. For these purposes, we had used original- for the terms we proposed in http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html. Doing so also provides a kind of extensibility mechanism, i.e. all terms with the same prefix relate to the same framework.

=> Note that http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html included terms to convey semantics that don't seem to be covered by the terms introduced here. Ultimately, we as a community should decide what makes sense and what doesn't with this regard.

=> There is the question of which negotiation is handled first: datetime or Prefer. The Memento RFC, which was published prior to the existence of Prefer, does state that datetime negotiation is handled prior to any other content negotiation, by which was meant prior to e.g. language, format, etc negotiation. Given the goal of Prefer in the current context, it seems that this rule should also apply to Prefer, i.e. datetime negotiation first, Prefer next. Even though Prefer isn't really considered negotiation ...

=> The pywb implementation of Pattern 1.3 is really problematic from the perspective of Memento clients. Clients decide that a resource is a Memento on the basis of the existence of a Memento-Datetime header and a link with rel="original", see [1] of http://mementoweb.org/guide/resourcetype/. Strictly speaking, a client could use Memento-Datetime only to make that determination. But, when doing so, the client does not know what the original (URI-R) is and hence can not continue its time travel, e.g. to obtain another Memento of the same resource or visit the original on the live web. The rel="original" link is in that sense essential for Mementos (URI-M) and also for TimeGates (URI-G).

@anjackson

anjackson commented Mar 2, 2018

Although I appreciate the idea of having a prefix to the Prefer options, I'm having trouble understanding original-*. If you just want the raw response, as per the time of capture, how do you specify that? Do you need all three of original-content, original-links, original-headers?

If so, white-listing types of re-write that should not be applied seems clumsy. If some new type of rewrite comes along (e.g. modifying video or embedded media tags to aid playback) how does that work? Do we now need to define an original-media-elements mode and start using it? Or am I misunderstanding it?

As with the banner-only option, it seems there may be a preference for whitelisting the re-writes you want rather than the other way around?

I guess rewritten is a way of saying 'all rewriting is fine', and I don't know how to express that using your original-* either, but maybe that's fine and that's what not Prefer-ing anything means?

EDIT to further explain my confusion, isn't original-content, original-links the same as original-content? It's not the original content if the links have been re-written, is it?

@hvdsomp

hvdsomp commented Mar 2, 2018

I think two orthogonal aspects are being intertwined here:

  1. a suggestion to prefix the terms used for Prefer in Memento/archive related applications
  2. the semantics that are conveyed using Prefer in such applications

Regarding (1): I referred to the use of the original- prefix in the aforementioned blog post merely as an illustration of prefixing terms an sich, i.e. carving out a "namespace" of terms that pertain to the same framework/application. I did not suggest original- was the term to be used.

Regarding (2): The blog post indeed comes from the perspective of expressing various degrees of "rawness" and, as such, in the approach described, multiple terms can indeed be combined. It doesn't necessarily have to be that way, and I did not suggest it had to be that way in the above. That is why I indicated that "Ultimately, we as a community should decide what makes sense and what doesn't with this regard". That's also why we have asked for feedback from the community ever since the blog post was published.

In my opinion, the questions at this point are:

  1. Do we want to prefix the terms and, if so, which prefix should we pick?
  2. What are the semantics we want to convey using Prefer for memento/archive applications, i.e. what kind of Mementos do we want to be able to request? The three types of Mementos covered by Ilya's write-up make sense to me; yet see also below. Do they make sense to others?

A few more detailed considerations:

  • It seems to me that, in all 3 cases, X- headers may be provided. So, there is no way to request a raw Memento without X- headers? Is everyone OK with that?
  • How do the proposed terms apply to Mementos in web archives that only collect screenshot Mementos?
  • The description of rewritten states "fully rewritten content as needed for replay". When using a Memento client, no rewriting is needed for replay.
  • Note that Prefer also allows to do stuff like "banner=yes" ; "rewritten=no" ; xheaders="yes"

I hope we can get some reactions to this all.

@anjackson

Ah, sorry @hvdsomp for not picking up that the blog post was meant to be illustrative rather than suggestive.

I have two concrete use-cases. The first is that, while attempting to integrate Mementos from multiple sources into a proxy service, I wanted to be able to request the un-rewritten entity, because none of that should be necessary in proxy mode. I had imagined this means the original headers as-is as well, but I now realise I'd not thought that through.

The second is more of a convenience, in that it would be nice to run a proxy service for users and for generating screenshots of archived resources, and in the latter case I'd want to switch off the banner. Of course, I could just set up a separate endpoint for that (in fact I'd probably do that anyway to separate out the load!), so I could live without it if it's problematic.

So, I seem to just be reiterating the main use cases covered here, but this selection of use cases does not seem broad enough to answer all the questions we have.

We could do with hearing from more users and use cases, I guess.

I have other cases that I would like to cover eventually, but they are very immature use cases that may or may not fit here. For example, as well as the standard WARC content, I also have (but have no way to give access to):

  1. The screenshot we took during the browser rendering of the original page, during the harvesting process.
  2. The thumbnail version of the screenshot we took in (1.)
  3. The screenshot we took in (1.) but with an image map overlayed so it's clickable.
  4. The HTML DOM from the browser at 'on-ready' that we stored when we rendered the web page.

We've also considered making available:

  1. A rendered screenshot of the archived version of the given resource.
  2. The re-written HTML, with particularly problematic elements (e.g. a Google Map panel) replaced with the relevant portion of the screenshot.

However, given I was kind of surprised by this statement: "When using a Memento client, no rewriting is needed for replay." I'm thinking I may have entirely missed the point.

@ikreymer
Author

ikreymer commented Mar 3, 2018

Re: Namespace
I agree that a namespace, such as memento- or webarchive- or wa- should be used to avoid confusion. original- is probably not a good choice.

Re: original-content, original-links, echo-original-headers
I think this approach is problematic for several reasons, as is thinking of this as dimensions of rawness (also discussed in #1 and #2).

The effect of combining preferences means that there would be 6 different combinations that an implementation needs to support, and a client needs to understand, and that's not covering any other preferences. The use cases for each of the 6 combinations are unclear.

I would recommend avoiding combining preferences unless there is a clear use case for having these combinations.

Perhaps a better way to think about the Prefer is not as 'dimensions of rawness', but rather as format selection. A memento can be in Format A, or Format B, etc.. If a format is not available, a different format can be provided instead. The current implementation has taken this approach: A memento can be in the rewritten format, in banner-only format, and raw format.

There may be some formats that are extensible, like screenshot vs screenshot + clickable map, for example, but even then I'd hesitate to start combining preferences, unless absolutely needed.

Re: rewritten format
This preference/format is defined so that there is a name given to the default format that is suitable for replay. A client may choose to use this format to save and serve to the user later, or perform some analysis on the modifications. Without going into implementation specific details, it may be hard to make this more specific, but perhaps one could add rewritten, pywb=2.0.2 or something like that to indicate the rewriting engine used. Again, there should be specific use cases for adding additional details.

Re: http headers
Unfortunately, it is not possible to send unaltered HTTP headers as they may be interpreted by the server, especially hop-by-hop headers. A possible solution would be to add a Prefer: archival-record format, where the format is a full WARC record in the body of the response. This would be easy to support for most web archives, and would allow for cleanly sending original headers + original payload.

Re: HTTP proxy mode and memento and Prefer
Regarding the almost-Pattern 1.3 use case, it should be noted that this is specifically a client connecting via HTTP/S proxy mode. HTTP/S proxy mode is an important way to access web archive content, used by British Library, and also oldweb.today, and others.

Perhaps there should be a Memento specification for proxy mode, since Prefer and Accept-Datetime are arguably more useful when using proxy mode, since there is no other way to specify a format or a date.

A client using pywb in proxy mode could do something like this to receive a raw memento at specified date.

curl -x pywb:8080 -H "Prefer: raw" -H "Accept-Datetime: Wed, 16 Jul 2014 20:02:43 GMT" http://example.com/

This usage pattern happens to be very close to Pattern 1.3 behavior, but isn't quite, and perhaps it should have a separate name. Of course the client knows that it's connecting via a proxy, so there should be no confusion there.

Re: negotiation order
The datetime negotiation can be thought of as happening first (technically they happen at the same time). I think this would only be an issue if a certain preference existed for only certain mementos. Currently, there isn't an example of this, but it is something that should be considered.

@ikreymer
Author

ikreymer commented Mar 3, 2018

Rereading the descriptions at: https://mementoweb.github.io/rfc-extensions/raw-memento/#rawness I can understand the intent of this to set up a kind of constraint system on independent dimensions of rewriting. In practice though, there aren't really independent dimensions, but only a few formats that make sense and have practical applications. I thought it may be useful to list these:

  • No content rewriting (raw). Certain HTTP headers still need to be prefixed.

  • Content rewriting to insert an informational banner only. HTTP headers such as Content-Length, Content-Encoding (content needs to be decoded) need to be updated. Other headers prefixed as needed. This mode is useful for HTTP/S proxy mode replay access.

  • Content rewriting + url rewriting for replay + banner insertion. URLs are rewritten, as well as other content in the HTML, custom JS may be inserted for client side rewriting, and an informational banner added. Content-related headers are altered as in previous mode, but also links in Location: headers are updated. This mode is useful for standard rewriting replay access.

These correspond to the raw, banner-only, and rewritten modes.

While there are roughly 'url rewriting' and 'content rewriting' settings, they are not independent dimensions, as url rewriting implies content rewriting. Headers need to be modified whenever any content rewriting happens.

One additional format to consider, I think this is something @ibnesayeed is interested in:

  • No Content Rewriting. HTTP headers prefixed when needed, AND links in Location header rewritten. This is a convenience to make it easier to follow redirect mementos and is useful for a client-side aggregator. Perhaps it should be called raw-rewrite-redirects?

Other than that, I'm not sure there are any other options here, without delving into very implementation-specific details of rewriting systems.

Removing just the banner, for example, while keeping the rewriting system in place, may not be a desirable option to expose for security reasons (and a server can always set an empty banner if they desire).

Perhaps entirely client-side rewriting approaches (such as work being done by @ibnesayeed) will require some other type of hybrid format, or maybe that would be better handled by receiving a full WARC record? (the archive-record idea)

Possible Additions

Here's a summary of some possible additional preferences, based on the comments and thoughts so far.

(Names are just preliminary and with no determination on a possible prefix).

(From Andy's suggestions):

  • screenshot - A full size screenshot, suitable for displaying at normal size
  • thumbnail - A thumbnail size screenshot, suitable for displaying a thumbnail
  • rendered-dom - A static DOM snapshot retrieved from document.outerHTML and probably with all <script> tags removed, to be fully static.
  • screenshot+imagemap - A screenshot with image map?

(My thoughts so far):

  • raw-rewrite-redirects - almost like raw, but Location header also rewritten
  • archive-record - Serve the entire WARC (or ARC?) record in the HTTP response, to allow further processing by the client as needed. Can be useful for retrieving original HTTP headers and allow the downstream client to perform any rewriting/processing. May want a version with and without revisit-resolving.

@ibnesayeed

One approach to namespace terms is to use name="value1 value2" style along the lines of the rel attribute in the Link header. This way, we will not introduce too many attribute keywords to the Prefer header. One such term can be memento-variant which will have different values described above, e.g.,

Prefer: memento-variant=raw
Prefer: memento-variant="rewritten"
Prefer: memento-variant="screenshot thumbnail"

This way we will only need to register one Prefer extension (i.e., memento-variant) that will encapsulate all current and future variations in the form of well-defined values and we do not need to worry about the potential collision in other contexts. Parsing such preference will also be easier as all the variants will be available under the same term rather than looking for certain patterns in preference names.
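To show what parsing such a preference might look like, here is a rough Python sketch. Note that memento-variant is only the name proposed in this comment, not a registered preference, and the helper name is hypothetical. The regular expression is a simplification of the Prefer grammar: it assumes no earlier preference in the header carries a quoted value containing a comma.

```python
import re

# Matches memento-variant=token or memento-variant="space separated values"
# at the start of the header value or right after a comma.
_VARIANT = re.compile(
    r'(?:^|,)\s*memento-variant\s*=\s*(?:"([^"]*)"|([^\s;,"]+))',
    re.IGNORECASE,
)


def memento_variants(prefer_value):
    """Return the list of memento-variant values, or [] if the preference is absent."""
    match = _VARIANT.search(prefer_value)
    if match is None:
        return []
    quoted, bare = match.groups()
    value = quoted if quoted is not None else bare
    # Multiple variants are separated by whitespace, as with rel values.
    return value.split()
```

With this, memento_variants('memento-variant="screenshot thumbnail"') yields ['screenshot', 'thumbnail'], while a header without the preference yields an empty list.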

@ibnesayeed

ibnesayeed commented Mar 3, 2018

A multi-valued attribute as I described above has another advantage as it allows composition of behavior on different aspects such as headers, payload, and banner separately. Hence, fewer unique keywords can yield many possible combinations.

@machawk1

machawk1 commented Mar 4, 2018

Kudos on moving forward with applying Prefer dynamics, @ikreymer. I would like to echo the need for namespacing of these Prefer values with the expectation that the terms may have different semantics in other contexts, as @hvdsomp mentioned.

In talking with @phonedude about this GitHub issue, we discussed using a preference model with a state-of-the-art basis, e.g., IA and most replay systems serve rewritten URI-Ms with a banner visible by default, so build the Prefer semantics on a subtractive model (e.g., a "raw" Preference is "less processed") and an additive model (e.g., a derived screenshot representation per @anjackson).

It seems that some Prefer terms and concepts will work well when combined while others will result in ambiguity, e.g., raw+banner-only could not be combined based on the current definition.

I'd like to see a combination of @ibnesayeed and @hvdsomp's approaches combining both rel-like attributes with namespacing and semantic (Prefer: memento-variant="banner"), readable memento attribute specification (Prefer: banner="yes").

I have a use case (again, like @anjackson) that is not yet clearly defined but with which I planned on leveraging the existing investigations of Memento+Prefer. My exploration lies in sending Prefer to URI-Ts with values about the mementos themselves (ideally pre-calculated) that would affect the set of mementos returned. For example, something like Prefer: damage<0.5 would return a TimeMap containing only the URI-Ms that met this condition. While this is currently pie-in-the-sky, the syntax we define here ought to be extensible for use cases like this where additional attributes about the memento/TimeMap requested are specified via Prefer.

@shawnmjones
Contributor

Thanks @ikreymer for starting this thread and applying these ideas!

Re: archive-record
As someone studying the content of mementos as captured, I really like Ilya's archive-record concept. For my work, if archive-record existed, I may not need raw. Seeing as I usually arrive at URI-Ms via a TimeMap I also have a requirement to negotiate against the URI-M itself. It would be nice if there were a form of archive-record that dealt with redirects. I'm still thinking through how that might work.

Re: Pie-in-the-sky Ideas
While we're thinking about pie-in-the-sky ideas, why not introduce a preference that would return a Web Archive Transformation (WAT) record for the given URI-M that has the metadata, links, and other content extracted and converted to JSON?

I do not have a use case for this, but I also envision a scenario where someone might want a memento digitally signed by an archive for legal purposes.

@hvdsomp

hvdsomp commented Mar 7, 2018

(Apologies for slow response; I have been out of operation for a few days)

Some observations regarding the very constructive above discussion:

=> Some things that are suggested could - and IMO should - be handled using regular content negotiation, not Prefer. The archive-record is the best example. All it takes is a MIME type for WARC, which unfortunately - as per https://kris-sigur.blogspot.com/2016/05/warc-mime-type.html - does not exist. But one could obviously be defined. In the end, WARC is some kind of web package for which an archival use case is already being considered. Web packages for web resources will be requested via content negotiation. No reason WARC could not be requested that way. Similar consideration holds IMO for @shawnmjones WAT idea. Bottom line: If things can be addressed using regular content negotiation, they should.

=> I am a bit hesitant regarding @ikreymer's stance that we shouldn't support requests for which we currently don't see a use case. It's hard to predict future use cases. Also, boxing ourselves in from the outset regarding a framework that is all about adding expressiveness to Memento negotiation seems at odds with its very intent. Hence, I am in favor of an approach that syntactically allows conveying preferences and supports degrees of freedom and extensibility, even if some of the preferences that could be expressed would not be supported by web archives. As long as the archives can say that they did not honor the preferences, IMO, all is good. In support of this reasoning, I will just quote @ikreymer who sounds a bit hesitant to reject a potential preference: "Removing just the banner, for example, while keeping the rewriting system in place, may not be a desirable option to expose for security reasons (and a server can always set an empty banner if they desire)." I really would like to see a more "open and extensible" syntax approach. In that regard, I also like @ibnesayeed's memento-variant idea and would appreciate it if @machawk1 could provide some more details about his perspective. I am not sure I follow his above proposal, which seems like a combo of @ibnesayeed's and mine.

=> When considering potential uses for Prefer and web archives, I think we should focus on cases that one would want to support in an interoperable manner across web archives, across web archive softwares. After all, this is about extending the expressiveness of the Memento protocol, which is an interoperable framework for time-based access to web resources. Prefer could at any point also be used in non-interoperable ways for special use cases in (a) specific archive(s).

=> A clarification for @anjackson re Memento and URL rewriting: Memento clients do not need rewritten URIs. Current Memento clients clearly can handle rewritten URLs because they're all over; but they can also work with un-rewritten URLs. In both cases, prior to continuing a time travel session, the "nature" of the URL to which the client will travel is checked by looking at its HTTP header. If that has a Memento-Datetime, then the URL is a URI-M and the client uses the URI-R from the link rel="original" to continue traveling. If it doesn't have Memento-Datetime, then the URL is a URI-R and can directly be used for traveling. In both cases, URI-R is used for time traveling.
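The client-side check described here can be sketched in Python as follows. The function names are hypothetical, the headers are assumed to arrive as a plain dict, and the Link parsing is a simplified regular expression rather than a complete implementation of the Link header grammar.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
# Quoted parameter values may contain commas (e.g. datetime="Wed, 16 Jul ...").
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[^\s=;,"]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)')
_REL = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,"]+))', re.IGNORECASE)


def original_uri(link_value):
    """Return the target of the first link whose rel includes "original", else None."""
    for target, params in _LINK.findall(link_value):
        rel = _REL.search(params)
        if rel is None:
            continue
        quoted, bare = rel.groups()
        rels = (quoted if quoted is not None else bare).lower().split()
        if "original" in rels:
            return target
    return None


def travel_uri(url, headers):
    """Return the URI-R to use for continued time travel, or None if unknown."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "memento-datetime" in lowered:
        # url is a URI-M: the URI-R must come from the rel="original" link.
        return original_uri(lowered.get("link", ""))
    # No Memento-Datetime: url is itself a URI-R.
    return url
```

A None result for a response that does carry Memento-Datetime corresponds to the proxy-mode problem raised earlier in the thread: the client knows it has a Memento but cannot tell which URI-R to continue with.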

@ibnesayeed

@machawk1: For example, something like Prefer: damage<0.5 would return a TimeMap containing only the URI-Ms that met this condition.

I think this syntax is illegal as per the ABNF grammar of the Prefer header. However, you can have something like:

Prefer: maximum-memento-damage-threshold=0.5

Alternatively, along the lines of memento-variant, we can have something for the TimeMap to allow filtering:

Prefer: timemap-variant="we-do-it no-pagination memento-damage<0.5 scheme=http|https status=2xx duplicate=last"

Note that various operators might not be legal outside, but it should be perfectly valid to add them inside the quoted value. How the value is parsed and consumed will depend on how the timemap-variant is defined. We can have spaces as a separator of various attributes while other operators (without any spaces) to specify values to those attributes, if applicable.
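A rough Python sketch of how such a quoted value might be consumed is below. Everything here is an assumption made for illustration: timemap-variant is not defined anywhere yet, and the set of operators, the use of | for alternatives, and the function name are all hypothetical.

```python
import re

# attribute, operator, operand with no spaces in between, e.g. memento-damage<0.5
_CONDITION = re.compile(r'^([A-Za-z][A-Za-z0-9-]*)(<=|>=|<|>|=)(.+)$')


def parse_variant_value(value):
    """Split a variant value into bare flags and (name, operator, operands) conditions."""
    flags, conditions = [], []
    for token in value.split():  # spaces separate the attributes
        match = _CONDITION.match(token)
        if match is None:
            flags.append(token)
        else:
            name, operator, operand = match.groups()
            # Assumption: "|" separates alternative operands.
            conditions.append((name, operator, operand.split("|")))
    return flags, conditions
```

Applied to the example value above, this gives the flags we-do-it and no-pagination plus four conditions, the first being ("memento-damage", "<", ["0.5"]).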

@phonedude

  • I'd like to echo @hvdsomp's comments re: use cases. Unless there are certain combinations that are self-contradictory, we should allow them. An archive is still able to choose to honor only certain subsets. For example, archive.is, webcitation.org, and other archives that dedup binary files by storing them under a hash value, e.g.:

https://egr.vcu.edu/media/school-of-engineering/egr-main/img/layout/chem-icon.svg

becomes:

https://archive.is/MIq1y/2b24413af4630e8917a9f2dd2d9b5a634f5293eb.svg

Replay of original headers isn't going to be defined for that URI.

  • Similarly, archive.is isn't going to be able to provide a "banner+raw" combo for:

http://archive.is/MIq1y

  • For "archive-record", I think we're going to need to define a rel because a MIME type won't be descriptive enough by itself. Setting aside the question of transitive closure of embedded resources and/or links from a WARC file for the particular URI-M, some archives 1) don't use WARC, and 2) provide zips. For example, from the archive.is URI above, the following is an "archive-record", but the MIME type is just "application/zip":

http://archive.is/download/MIq1y.zip

There is currently no machine-readable mapping from http://archive.is/MIq1y --> http://archive.is/download/MIq1y.zip but there should be, and that should be done with a rel type (which would also allow different MIME types, both WARC (when it gets one) and "application/webpackage+cbor", etc.).

@hvdsomp

hvdsomp commented Mar 7, 2018

I am all OK with @phonedude's proposal to use a rel to point at packages. A similar approach was recently discussed to point at packages containing various components of a dataset in scholarly communication. Bottom line remains the same: this does not require a Prefer approach.

@machawk1

machawk1 commented Mar 7, 2018

I believe @ibnesayeed's Prefer approach to be a good one that encourages extensibility, namely, using two different syntactic structures:

  • Prefer: memento-variant="no-banner rewritten" and
  • Prefer: timemap-variant="memento-damage<0.5 someOtherAttribute=foo|bar|qux baz"

This would require only two preferences to be registered with IANA and allows for variability of both mementos and TimeMaps in arbitrary dimensions with conditions we currently do not foresee. The ability, as provided by this syntax, for conditions to be quantitative beyond boolean (e.g., values within a range instead of simply present/processed or not) is particularly appealing.

@BigBlueHat

Would it be possible to avoid the technology name prefixes (memento- and timemap-) in the new preference names? It seems (if possible) that avoiding technology specific naming and preferring concept specific naming might provide for others to use these conceptual preferences in other (yet unexplored) ways while still providing the preference values needed for this specific use case.

Maybe. 😸

@hvdsomp

hvdsomp commented Mar 8, 2018

Hey @machawk1 and @ibnesayeed: Forgive me for being thick, but would it be possible to fully elaborate the proposed memento-variant approach in the sense of:

  1. Showing how a client would request using Prefer a combo of: banner/no banner; rewrite/no rewrite; xheaders/no xheaders.
  2. Show how the server would indicate with Preference-Applied whether or not it had honored the client preferences.

I would prefer (sic) to focus on the memento-variant stuff, for now, because we have real use cases in that realm. I think the timemap- stuff needs quite some more thinking.

@ibnesayeed

ibnesayeed commented Mar 8, 2018

To illustrate various approaches, let me first list a few variations on four different aspects of the response. These variations are in no way exhaustive and should be taken for the illustration purpose only. Some of these might be better suited to be negotiated differently.

Header Variants    Description
original           Include original response headers
request            Include corresponding original request headers

Payload Variants   Description
original           Original unchanged response payload
mobile             Mobile-friendly version of the page
amp                AMP variant of the page
jsfixed            Rewrite JS and/or include extra JS to fix some replay issues
jsmuted            Disable all the JS on the page both inline and external
render             Rendered and serialized DOM with any JS and WebComponents
screenshot         A full page screenshot image
thumbnail          A fixed width and height thumbnail image
metadata           A structured description and metadata (potentially in JSON)

Rewrite Variants   Description
hyperlinks         Rewrite hyperlinks
requisites         Rewrite page requisite references such as images and CSS
location           Rewrite Location header

Banner Variants    Description
none               Do not include any banners
inline             Add inline banner markup
isolated           Use iframes or other techniques to include a banner

These variations can be put together in various ways of which some may be semantically illegal (or mutually exclusive) and some may not be very useful. These can be used in two modes:

  1. Atomic - Pick and choose zero or more variants from each aspect and supply them as separate tokens. This approach is flexible and future-proof, but verbose, and it may yield invalid combinations. E.g.,
    • header-original payload-original banner-none
    • header-original rewrite-location payload-render banner-isolated
    • payload-original payload-render (Invalid)
  2. Composite - Define a handful of unique modes that are useful and representative of some hand-picked combinations of the atomic variants. This approach would require an extension each time we come up with new use cases. E.g.,
    • raw same as header-original payload-original banner-none
    • rewritten same as header-original rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsfixed banner-isolated
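
A minimal Python sketch of how the two modes relate: composite names are expanded into their atomic tokens, and a combination is checked against known conflicts. The composite mapping is copied from the two examples above; the `CONFLICTS` list holds only the one pair marked invalid above, and the function names are hypothetical.

```python
# Composite modes and their atomic equivalents, as listed in the examples above.
COMPOSITES = {
    "raw": ["header-original", "payload-original", "banner-none"],
    "rewritten": [
        "header-original", "rewrite-location", "rewrite-hyperlinks",
        "rewrite-requisites", "payload-jsfixed", "banner-isolated",
    ],
}

# Assumption: only the pair flagged "(Invalid)" above is treated as conflicting.
CONFLICTS = [{"payload-original", "payload-render"}]


def expand(tokens):
    """Replace composite names with atomic tokens, keeping first-seen order."""
    atoms = []
    for token in tokens:
        for atom in COMPOSITES.get(token, [token]):
            if atom not in atoms:
                atoms.append(atom)
    return atoms


def is_valid(tokens):
    """Return False if the expanded tokens contain a known conflicting pair."""
    atoms = set(expand(tokens))
    return not any(pair <= atoms for pair in CONFLICTS)


print(expand(["raw"]))
print(is_valid(["payload-original", "payload-render"]))
```

A real specification would need a fuller conflict list; this only shows that the atomic mode pushes validation work onto whoever parses the tokens.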

Either of these modes can be added in one of the two scopes as described below:

  1. Global - Adding preferences as top-level attributes of the Prefer header. This will introduce too many preferences in the registry that are very specific to Memento and might not be very useful in other contexts. Adding more variations will require an extension. This will require inline name-spacing. E.g.,
    • Prefer: memento-header-original memento-payload-original memento-banner-none (in atomic mode)
    • Prefer: memento-raw (in composite mode)
  2. Namespaced - Adding all of the Memento related preferences under a single top-level memento-variant attribute of the Prefer header as values. This will require registration of a single new attribute while different variations can be described in the specification with the possibility of future additions. E.g.,
    • Prefer: memento-variant="header-original payload-original banner-none" (in atomic mode)
    • Prefer: memento-variant=raw (in composite mode)
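
To make the difference between the two scopes concrete, here is a small Python sketch that serializes the same tokens both ways. The helper names are hypothetical, and the header strings are formatted exactly as written in the examples above, which may not match whatever grammar is finally adopted.

```python
def prefer_global(tokens, prefix="memento-"):
    """Global scope: every token becomes its own prefixed top-level preference."""
    return "Prefer: " + " ".join(prefix + token for token in tokens)


def prefer_namespaced(tokens, name="memento-variant"):
    """Namespaced scope: all tokens become the value of one preference."""
    value = " ".join(tokens)
    if " " in value:
        # Multiple tokens are quoted, as in the atomic-mode example above.
        value = f'"{value}"'
    return f"Prefer: {name}={value}"


atomic = ["header-original", "payload-original", "banner-none"]
print(prefer_global(atomic))
print(prefer_namespaced(atomic))
print(prefer_global(["raw"]))
print(prefer_namespaced(["raw"]))
```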

If we use the namespaced scope, variations can be organized in two ways as described below:

  1. Flat - In this organization variation values are added as a flat list of tokens. This is more like the rel attribute of the Link header. Tokens are longer as they are prefixed with their corresponding aspect names. Parsing this organization is simpler, but it has limited flexibility in terms of expressiveness. While we can have optional parameters for the memento-variant, it will be difficult to associate those parameters with specific variations. E.g.,
    • Prefer: memento-variant="header-original header-request rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsfixed banner-isolated"
  2. Structured - In this organization we can introduce optional operators, such as =, !=, ~=, >, >=, <, and <= between aspect names and their corresponding values (without any spaces), | and & between any two values of the same aspect, and ~ for additional parameters. This approach can do everything that the flat organization can do, often using shorter tokens, but it also allows more expressive preferences, such as supplying dimensions of the desired thumbnail or asking for resources that are not 5xx. E.g.,
    • Prefer: memento-variant="header=original&request rewrite=location&hyperlinks&requisites payload=jsfixed banner=inline|isolated"
    • Prefer: memento-variant="payload=thumbnail~300x200 response!=5xx banner=none"
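
The relationship between the two organizations can be sketched in Python: flat tokens are grouped by their aspect prefix and joined with `&`. This assumes every flat token has the form `aspect-variant` and that tokens of the same aspect are meant to be combined with AND; `flat_to_structured` is a hypothetical helper, not part of any implementation.

```python
def flat_to_structured(value):
    """Convert 'aspect-variant ...' tokens into 'aspect=variant&variant ...'."""
    groups = {}  # dicts keep insertion order, so aspects stay in first-seen order
    for token in value.split():
        # Split on the first hyphen only: 'rewrite-location' -> ('rewrite', 'location')
        aspect, _, variant = token.partition("-")
        groups.setdefault(aspect, []).append(variant)
    return " ".join(
        aspect + "=" + "&".join(variants) for aspect, variants in groups.items()
    )


flat = ("header-original header-request rewrite-location rewrite-hyperlinks "
        "rewrite-requisites payload-jsfixed banner-isolated")
print(flat_to_structured(flat))
```

Going the other way is not always possible: a structured value such as `banner=inline|isolated` or `payload=thumbnail~300x200` has no flat equivalent.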

The Preference-Applied header can use similar notation to acknowledge the aspects of the namespaced preferences that were applied. The syntax of the Preference-Applied header is the same as that of the Prefer header, with one exception: Prefer allows optional parameters separated by semicolons (which we did not use anywhere). E.g.,

  • Prefer: memento-variant="header-original header-request rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsfixed banner-isolated" (in flat organization)
    • Preference-Applied: memento-variant="header-original rewrite-location rewrite-hyperlinks rewrite-requisites banner-isolated" (some preferences are not honored)
  • Prefer: memento-variant="header=original&request rewrite=location&hyperlinks&requisites payload=jsfixed banner=inline|isolated" (in structured organization)
    • Preference-Applied: memento-variant="header=original rewrite=location&hyperlinks&requisites banner=inline" (some preferences are not honored)
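
On the client side, finding out which preferences were not honored is a matter of comparing the two token lists. A minimal Python sketch for the flat organization follows; the helper names are hypothetical, and the parsing is deliberately naive (it assumes a single `memento-variant` preference per header line).

```python
def variant_tokens(header_line):
    """Extract the memento-variant tokens from a Prefer/Preference-Applied line."""
    _, _, value = header_line.partition("memento-variant=")
    return value.strip().strip('"').split()


def not_honored(prefer, applied):
    """Return requested tokens that do not appear in Preference-Applied."""
    applied_set = set(variant_tokens(applied))
    return [token for token in variant_tokens(prefer) if token not in applied_set]


prefer = ('Prefer: memento-variant="header-original header-request rewrite-location '
          'rewrite-hyperlinks rewrite-requisites payload-jsfixed banner-isolated"')
applied = ('Preference-Applied: memento-variant="header-original rewrite-location '
           'rewrite-hyperlinks rewrite-requisites banner-isolated"')
print(not_honored(prefer, applied))
```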

@ibnesayeed

@hvdsomp: I would prefer (sic) to focus on the memento-variant stuff, for now, because we have real use cases in that realm. I think the timemap- stuff needs quite some more thinking.

I agree that we should discuss the timemap-variant in a separate ticket. However, having that in mind as a possibility can help us shape the format that can be reused.

@phonedude

@ibnesayeed: thanks for this detailed proposal. could you next map these proposed approaches to existing archives? e.g., for IA (w/o any "_" modifiers) would it be:

Preference-Applied: memento-variant="header-original rewrite-hyperlinks rewrite-requisites rewrite-location banner-inline"

anything missing? and if that's IA, what would archive.is look like?

@ibnesayeed

ibnesayeed commented Mar 14, 2018

The above proposal describes five meaningful ways to utilize the Prefer header using different choices. The potential merits and demerits of these choices are briefly noted in their descriptions.

Prefer/
├── Global/
│   ├── Composite
│   └── Atomic
└── Namespaced/
    ├── Composite
    └── Atomic/
        ├── Flat
        └── Structured

We can use the following abbreviations to simplify examples.

GC => Global-Composite
GA => Global-Atomic
NC => Namespaced-Composite
NF => Namespaced-Atomic-Flat
NS => Namespaced-Atomic-Structured

Since the format of the Prefer and Preference-Applied headers is almost the same, let's illustrate a default response (as currently served, without any modifiers such as id_) under the different settings below, using just the latter header for brevity's sake.

OpenWayback

Default implementation of OWB rewrites various links/references, includes an inline banner, and provides original response headers, but does not fix any JS issues.

GC> Preference-Applied: memento-rewritten
GA> Preference-Applied: memento-header-original memento-rewrite-location memento-rewrite-hyperlinks memento-rewrite-requisites memento-banner-inline
NC> Preference-Applied: memento-variant="rewritten"
NF> Preference-Applied: memento-variant="header-original rewrite-location rewrite-hyperlinks rewrite-requisites banner-inline"
NS> Preference-Applied: memento-variant="header=original rewrite=location&hyperlinks&requisites banner=inline"

PyWB

Default implementation of PyWB rewrites various links/references, serves banner as the isolated parent document and serves Memento inside it in an iframe at a different URL (using mp_ modifier), provides original response headers, and fixes many JS issues using Wombat.js (with some contributions from @N0taN3rd's work).

GC> Preference-Applied: memento-rewritten
GA> Preference-Applied: memento-header-original memento-rewrite-location memento-rewrite-hyperlinks memento-rewrite-requisites memento-payload-jsfixed memento-banner-isolated
NC> Preference-Applied: memento-variant="rewritten"
NF> Preference-Applied: memento-variant="header-original rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsfixed banner-isolated"
NS> Preference-Applied: memento-variant="header=original rewrite=location&hyperlinks&requisites payload=jsfixed banner=isolated"

Archive.is

Archive.is rewrites various links/references, includes an inline banner, serves a serialized rendered DOM, and mutes all the JS on the page, but does not provide original response headers.

GC> Preference-Applied: memento-rewritten
GA> Preference-Applied: memento-rewrite-location memento-rewrite-hyperlinks memento-rewrite-requisites memento-payload-jsmuted memento-payload-render memento-banner-inline
NC> Preference-Applied: memento-variant="rewritten"
NF> Preference-Applied: memento-variant="rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsmuted payload-render banner-inline"
NS> Preference-Applied: memento-variant="rewrite=location&hyperlinks&requisites payload=jsmuted&render banner=inline"

A typical original unrewritten (or raw) response (that is usually served using the id_ modifier), where supported, will look something like the following.

GC> Preference-Applied: memento-raw
GA> Preference-Applied: memento-header-original memento-payload-original memento-banner-none
NC> Preference-Applied: memento-variant="raw"
NF> Preference-Applied: memento-variant="header-original payload-original banner-none"
NS> Preference-Applied: memento-variant="header=original payload=original banner=none"
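
The five renderings above follow mechanically from one description of the response. As an illustration, this Python sketch produces all five from a composite name plus an aspect-to-variants mapping; `render_styles` and its arguments are hypothetical names chosen for this example.

```python
def render_styles(composite, aspects):
    """Render one response description in the GC, GA, NC, NF, and NS styles."""
    atoms = [f"{aspect}-{variant}"
             for aspect, variants in aspects.items() for variant in variants]
    flat = " ".join(atoms)
    structured = " ".join(
        aspect + "=" + "&".join(variants) for aspect, variants in aspects.items()
    )
    return {
        "GC": f"memento-{composite}",
        "GA": " ".join(f"memento-{atom}" for atom in atoms),
        "NC": f'memento-variant="{composite}"',
        "NF": f'memento-variant="{flat}"',
        "NS": f'memento-variant="{structured}"',
    }


raw = {"header": ["original"], "payload": ["original"], "banner": ["none"]}
for style, value in render_styles("raw", raw).items():
    print(f"{style}> Preference-Applied: {value}")
```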

@phonedude
Copy link

@ibnesayeed: thanks for the additional info. borrowing from the archive.is example, my initial preference is for the simplicity of "NF":

NF> Preference-Applied: memento-variant="rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsmuted payload-render banner-inline"

but I recognize that "NS" is more extensible:

NS> Preference-Applied: memento-variant="rewrite=location&hyperlinks&requisites payload=jsmuted&render banner=inline"

what do others think?

@ibnesayeed

Global styles, both GC and GA, are difficult to extend because each extension requires a very careful choice of preference name that does not step on something more generic with a different semantic meaning. GA has the additional issue of registering far too many preferences.

While NC does not have the problem of registering too many preferences (as it registers only one, i.e., memento-variant), it has the challenge of recognizing many useful combinations that are future-proof.

Namespaced-Atomic styles, both NF and NS, eliminate some of the issues described above. NF is simpler, but less expressive than NS. Due to the atomic nature of the features, it might be desirable to express how certain features are to be combined. By bringing in some structure using operators, we can express many preferences more effectively.

For example, in NF when we say:

Prefer: memento-variant="rewrite-hyperlinks rewrite-requisites banner-isolated banner-inline"

It is not clear that the client wants to rewrite both hyperlinks AND requisites, but only wants one of the isolated OR inline banners. This same preference can be expressed in a more meaningful way in the NS style by using the & and | logical operators between any two values, as follows:

Prefer: memento-variant="rewrite=hyperlinks&requisites banner=isolated|inline"

By including some relational operators between the name and value (as illustrated earlier) in the NS style we can express some more useful preferences, such as inequalities or negations, as follows:

Prefer: memento-variant="response!=5xx damage<0.5"

By including some parameters with values (as illustrated earlier) in the NS style we can express some more useful preferences such as viewport dimensions of a screenshot or thumbnail size:

Prefer: memento-variant="payload=screenshot~600x500|thumbnail~300x200"

It is not always necessary to have values associated with each name (that means the syntax of NS is a superset of NF):

Prefer: memento-variant="include-serviceworker banner=none"

If we write a grammar for memento-variant's value parser, it can potentially be reused in timemap-variant later.
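
A rough Python sketch of what such a value parser could look like, covering the NS examples above. Since the thread does not define operator precedence, this assumes `&` binds tighter than `|`; the output shape and all names (`TERM`, `parse_term`, `parse_variant`) are hypothetical.

```python
import re

# A term is a name, an optional relational operator, and a raw value expression.
TERM = re.compile(r"([A-Za-z][\w-]*)(!=|~=|>=|<=|=|>|<)?(.*)")


def parse_term(term):
    """Parse one whitespace-delimited term of a memento-variant value."""
    match = TERM.fullmatch(term)
    if match is None or (match.group(2) is None and match.group(3)):
        raise ValueError(f"malformed term: {term!r}")
    name, op, rest = match.groups()
    if op is None:
        # Bare token without a value, as in the flat (NF) organization.
        return {"name": name, "op": None, "alternatives": []}
    # Assumption: '|' separates alternatives, '&' joins values within one
    # alternative, and '~' attaches a parameter to a single value.
    alternatives = [
        [value.partition("~")[::2] for value in alternative.split("&")]
        for alternative in rest.split("|")
    ]
    return {"name": name, "op": op, "alternatives": alternatives}


def parse_variant(value):
    """Parse a whole memento-variant value into a list of terms."""
    return [parse_term(term) for term in value.split()]


for term in parse_variant("payload=screenshot~600x500|thumbnail~300x200 damage<0.5"):
    print(term)
```

Because bare tokens are accepted, any NF value parses under the same grammar, which matches the superset claim above.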

@ikreymer
Author

ikreymer commented Mar 16, 2018

Thanks everyone for all the feedback, and thanks to @ibnesayeed for all the detailed examples.

Going back to the original purpose for this, to create a standardized, interoperable API, I think it is important to indicate when there is no interoperability, e.g., when comparing rewriting systems from one software to another. The memento-rewritten label implies that there is no guarantee of interoperability with any other format, except perhaps from the same service. While these classifications (rewrite-requisites, payload-jsfixed, etc.) may be interesting in a study of rewriting systems, there is no interoperability when these attributes are applied, no matter how you slice it. Adding these additional attributes will only serve to confuse someone into believing there is a standardized format in the response if a certain preference is applied.
For other formats, like screenshots or WARC records, there is a defined response format (specified via MIME type) and that should be made clear.

A rewritten memento from openwayback vs pywb vs archiveis is going to be a different result, so why not just state this explicitly, by adding memento-server="pywb" or memento-server="openwayback", or by using a custom field such as memento-variant="rewritten-pywb" to indicate that there is no equivalency?

But then, taking a step back from this even further, given the use cases and suggested formats, I am sort of wondering why a rather esoteric approach like the Prefer/Preference-Applied header is needed at all here. I would suggest that what is being described as "content negotiation" here can in all honesty be described as a "search query".

With the exception of a few cases mentioned above (e.g., HTTP/S proxy mode), I think most of the requested functionality can and should be implemented with a standard search query API, with a format like: /memento?url=...&format=raw or /timemap?url=...&damage=<0.5 or /memento?url=...&format=image:320x200

This would work for both memento and timemap queries. For memento, the response could be a JSON response listing all the possible results, with best match being first, as is typical in a search query.
There could also be a shortcut, maybe /best-memento?url...&format=raw that automatically redirects to the 'best result' as a convenience.
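
For illustration, a client could build such query URLs with nothing beyond the Python standard library. The endpoint and parameter names below are taken from the examples above and are not part of any existing API; `query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def query_url(endpoint, url, **params):
    """Build '/<endpoint>?url=...&key=value' with percent-encoded values."""
    return f"/{endpoint}?" + urlencode({"url": url, **params})


print(query_url("memento", "http://example.com/", format="raw"))
print(query_url("timemap", "http://example.com/", damage="<0.5"))
```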

This would make the API much more accessible and flexible, allowing multiple result formats, as well as possible endpoint to list the available formats. No need to create custom header parsers, or register anything with IANA.

I think it's fair to say that most client users and server implementers would find it much easier to use standard URL query parsing than to introduce custom parsing of a new header. Ease of implementation and conformance with existing expectations about APIs are more likely to lead to adoption and interoperability.

@ibnesayeed

ibnesayeed commented Mar 16, 2018

If the purpose of this exercise is just to find the possibility of interoperability then that can be handled in many different ways (some of which might not be the right way of interacting with an HTTP system):

  • Feature detection -- by looking at certain signatures in the response and recognizing whether interoperability is possible
  • Server header -- identifying the software used and its version number then having some out-of-band information about its capabilities
  • Alternate links -- the server provides various rel=alternate links to other variations, but clients will have very little clue which alternate link leads to which variation unless they dereference it

The Prefer header is not necessarily for checking for interoperability (though it can be used for that), instead it is for "soft content negotiation" where the client is not too picky about the response, but would "prefer" if certain preferences were applied and the server informs whether or not certain preferences were applied, then the client can decide what to do with the received response.

A URL query parameter-based approach has some issues:

  • It is a more rigid way of asking for a resource with a combination of preference parameters which the server either has the exact combination or not. If it cannot honor all the preferences listed in different URL parameters, it will return a failure response. One can think of redirecting to a URL with the closest set of parameters. However, it means the client now needs to understand parameters in the redirected URL (to achieve something like Preference-Applied) which is not elegant and often not possible in web browsers where redirects are handled transparently by the browser.
  • It requires all archival replay systems to agree on a set of well-defined URL query parameter names, which would be out-of-band information and not a very HTTP way of achieving interoperability.
  • It forces URL-encoding of the URI-R segment in URI-Ms, which would be a breaking change because a URI-M of the format /memento/<14-digits>/<plain-original-uri-with-potential-query-params> will not be an option anymore.
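
The encoding concern in the last point can be demonstrated with the Python standard library. In this sketch the `url` and `format` parameter names come from the query API suggested earlier in the thread, and the URI-R is a made-up example that happens to carry a `format` parameter of its own.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical URI-R that has its own query parameters.
uri_r = "http://example.com/?a=1&format=pdf"

# Unencoded: the URI-R's parameters leak into the API's parameter space.
naive = parse_qs("url=" + uri_r + "&format=raw")
print(naive)

# Encoded: the URI-R survives as a single opaque value.
safe = parse_qs(urlencode({"url": uri_r, "format": "raw"}))
print(safe)
```

In the unencoded case the URI-R is truncated at its first `&` and the server sees two `format` values, so encoding is unavoidable under a query-parameter design.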

@ikreymer
Author

ikreymer commented Mar 16, 2018

If the purpose of this exercise is just to find the possibility of interoperability then that can be handled in many different ways (some of which might not be the right way of interacting with an HTTP system):

To me, the point of this is to retrieve a resource in a known, definitive format, so that it can be treated in a particular way, for example, compared to other resources in the same format. (I believe that was also an original use case: making it easier to compare mementos.) If two mementos are in raw HTML format, they can easily be compared. If they are two images of the same size, they can also be compared more easily. Two mementos both rewritten by pywb can be compared as well, knowing that the pywb-style rewriting, whatever it is, is applied consistently. But if they're two images of differing sizes or a memento rewritten by pywb vs a memento rewritten by archiveis, comparing them will require more work -- possibly significantly more work, and the results may not be as accurate. I think that's why the formats (I hesitate to call them preferences now) should be defined as unambiguously as possible.

The Prefer header is not necessarily for checking for interoperability (though it can be used for that), instead it is for "soft content negotiation" where the client is not too picky about the response, but would "prefer" if certain preferences were applied and the server informs whether or not certain preferences were applied, then the client can decide what to do with the received response.

I am struggling to find a good use case for when a client would be 'not too picky' about a response.
Generally, ambiguity just means more work for the client! I don't think this is a desirable behavior for any API, frankly. If I search for "images" and I get images sometimes, but sometimes PDFs and text files, that means more work for the client to filter out the results. If I wanted "images or PDF or text file", I'd search for that. But if I want to search for images, and the server returns a 404, I can decide if I want to try a different format, or not. A concise search query makes things easier for the client, while 'not being too picky' just leads to unpredictable behavior that seems broken.

It is a more rigid way of asking for a resource with a combination of preference parameters which the server either has the exact combination or not. If it cannot honor all the preferences listed in different URL parameters, it will return a failure response. One can think of redirecting to a URL with the closest set of parameters. However, it means the client now needs to understand parameters in the redirected URL (to achieve something like Preference-Applied) which is not elegant and often not possible in web browsers where redirects are handled transparently by the browser.

Too rigid? How many complex search systems with boolean logic, etc. have been built using URL query arguments? :)
I think looking at an existing search query API, perhaps from something like Solr is a good place to start rather than inventing a new header, grammar and syntax for something that's already been done many times.

It requires all archival replay systems to agree on a set of well-defined URL query parameter names, which would be out-of-band information and not a very HTTP way of achieving interoperability.
It forces URL-encoding of the URI-R segment in URI-Ms, which would be a breaking change because a URI-M of the format /memento/<14-digits>/<plain-original-uri-with-potential-query-params> will not be an option anymore.

But isn't the whole point that we're defining a new API specification? The query params will be part of the spec. There are existing tools that can help with this, unlike with the Prefer header. Overloading the behavior of existing URLs via HTTP headers is not any easier (for the implementor or client) than defining a new API endpoint.

@hvdsomp

hvdsomp commented Mar 19, 2018

I agree with @ikreymer that it is important for a client to understand unambiguously what it is receiving. The Prefer/Preference-Applied provides that functionality, if defined appropriately.

I am not at all in favor of a query API approach as suggested by @ikreymer because it entails introducing a new API from scratch rather than leveraging the Memento "API" that is meanwhile widely adopted. Prefer/Preference-Applied is a logical step to add expressiveness (via additional negotiation) to the Memento protocol. I sincerely hope we can achieve a solution along that path.

@ibnesayeed

ibnesayeed commented Mar 19, 2018

I might have misunderstood it, because I thought @ikreymer was using "ambiguity" to refer to the "soft content negotiation" nature of Prefer-based content negotiation. I thought he was suggesting that there are hardly any good use cases where the client might be OK with receiving anything other than precisely what it "preferably" asked for. That means the server either fulfills the request with everything the client asked for or returns a response that says, "sorry, we can't do that". If this is the case, then the Prefer header is not the right thing for the purpose.

In my opinion, client should be able to ask for a memento's "preferred" variation, which the server should try to fulfill as best as possible, but returns a memento even if some (or all) preferences were not applied. However, it is very important for the server to tell the client precisely and "unambiguously" about the response returned (using Preference-Applied, unless obvious, also, the client should not draw any conclusions from the URI). When the client receives the response, there is no guarantee that it is exactly what the client was looking for, but the client knows what it received, so it can decide what to do with the response.

Here are a few examples where a response with unfulfilled preferences might still be more useful than returning nothing:

  • Suppose a client requests a thumbnail with dimensions 300x200, but the server has recently generated thumbnails of pages in 250x250, so it returns one of those from its cache instead and tells the client what it is returning. While the client would have "preferred" the size it asked for, the one returned might still be useful.
  • Suppose there is a tool that compares mementos from different archives. It might ask for the original unmodified responses with no banners included, and prefer to have original response and request headers included for the maximum amount of metadata to base comparisons on. One or more archives might not be able to honor all preferences, but the client can still report best-effort differences. For instance, if the URLs were rewritten, a byte-by-byte comparison may not be useful, but a visual HTML diff can still be meaningful. If one of those archives also included a banner, the comparison tool can inform users that some differences might be due to the added banner, as long as the server told the client that it included the banner (which was not preferred).
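
The second example can be sketched as client logic in Python. This is only an illustration of "the client knows what it received, so it can decide what to do": the flat token names come from the earlier variant tables, while `comparison_plan`, the method labels, and the warning text are all hypothetical.

```python
def comparison_plan(applied_tokens):
    """Choose a comparison strategy from the tokens in Preference-Applied."""
    applied = set(applied_tokens)
    rewritten = any(token.startswith("rewrite-") for token in applied)
    banner_added = bool(applied & {"banner-inline", "banner-isolated"})

    plan = {
        # Byte-level comparison only makes sense when nothing was rewritten.
        "method": "visual-diff" if rewritten else "byte-diff",
        "warnings": [],
    }
    if banner_added:
        plan["warnings"].append("archive banner included; some differences may be due to it")
    return plan


print(comparison_plan(["header-original", "payload-original", "banner-none"]))
print(comparison_plan(["rewrite-hyperlinks", "rewrite-requisites", "banner-inline"]))
```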

I would not vote for redefining a URL query parameter-based API that breaks existing systems. There are situations where negotiating header-based preference is not possible, for example, clicking a link to open a memento in a browser or including a memento inside of an iFrame using src. However, in these cases we do not want fine-grained variations as the requirements are almost fixed (and these are often local decisions), so existing _ modifier-based non-standard technique would work fine.

@anjackson

We seem to have strayed a long way from the original use cases, and I'm not convinced the complexity is justified.

Firstly, I agree with @hvdsomp in that the whole point of this was to add the ability to negotiate between rewritten and raw within the Memento protocol. I don't want to move this use case to a separate API.

However, I disagree with some of the analysis from @ibnesayeed - specifically, the distinction between 'atomic' and 'composite' is rather artificial and I don't think it will age well. If we look at the examples (and ignore that banner-none is implied by payload-original :-) )...

raw same as header-original payload-original banner-none

The latter 'atomic' variant appears to be preferred, but requires the client to enumerate the ways in which the content should be original. Is there really only the header and the payload? What about the protocol version? Even if this is an exhaustive set (which I accept it probably is in this case), there is nothing more fundamental about having to list the conceptual components of an HTTP response. It's just longer.

The rewritten case is worse:

rewritten same as header-original rewrite-location rewrite-hyperlinks rewrite-requisites payload-jsfixed banner-isolated

I agree with @ikreymer in that rewritten is not the same across implementations, and so it won't work in the 'atomic' form where we have to list all the ways we'd like the content to be rewritten. It also does not cope well if new rewrite tactics appear over time. To me the semantics of rewritten are clear -- do your best to make this work in a normal browser. Forcing a decomposition is unhelpful and creates the impression of hundreds of variants that no implementation supports.

I do not think it is helpful to consider raw and rewritten as formats. I don't think they match the semantics of Content-Type at all well. e.g. a screenshot of an archival resource could be handled like that, but the screenshot taken of the original host during the crawl process could not (and is better handled as a memento-variant).

So I'd like to be able to ask for raw and rewritten and know which I get back. If the implementation chooses to give more specific details, that's fine. I would like the standard to be really clear about what raw means, but allow precisely what rewritten means to be left to the implementation.

@ibnesayeed

@anjackson would you prefer the GC style (e.g., Prefer: memento-rewritten) or NC (e.g., Prefer: memento-variant="rewritten")?

Whatever mechanism we eventually agree on, I would like a way to request raw mementos, but their location headers rewritten.

@anjackson

No strong opinion. I've tried looking at the examples and the currently registered preferences.

I think I'm repeating someone else's suggestion, but perhaps NC is easier to standardise as it's just a single top-level preference.
