Converting Media Formats

Issue

Many web archives contain data formats that cannot be rendered by the current software environments used for access. This is especially true for media formats that have fallen out of use, such as Flash video, Apple QuickTime video, and MIDI audio.

This document describes an approach for dealing with legacy media and implements it for the case of Flash video as an extension to the Webrecorder tools warcit and pywb.

Layering during access

A web archive replay system is well placed to decide how legacy media should be presented to the user. If access happens via an emulated software environment that can handle the legacy media (for instance Oldweb.Today), the original data from the web archive can be presented to the user unchanged.

In cases where the material is accessed with a later software environment in which the formats in question have been declared obsolete, the replay system can pick migrated versions of the media data and have the accessing browser display it in the right place.

Previous Work

1. WARC specification

The WARC 1.0 specification includes a “conversion” record, which

[…] shall contain an alternative version of another record’s content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc)

(See WARC 1.0 specification section 6.8)

This defines a replacement of resources per URL, so that, for example, a Flash video can have an alternative version based on free/open codecs that can be replayed in today’s browsers. This is not enough to provide a general replacement mechanism:

1.1 The replay system needs a way to determine for which resources a “conversion” is available. Querying a WARC’s index for every resource served, in case a conversion is available, would be wasteful; limiting the query to certain MIME types or URLs would limit the general applicability of the system. Instead, a separate index for conversion records would need to be integrated with general web archive indexing.
1.2 The replay system needs to express preferences for what type of conversion should be displayed by the browser used for access.
1.3 Legacy media usually is not embedded into a page via a simple HTML tag. Instead, complex client-side processes are common, such as opaque and proprietary plug-ins like Flash or QuickTime, sometimes dependent on JavaScript for checking plug-in availability. Just serving a converted version of legacy media data won’t lead to the desired result. To solve this issue, the replay system requires information about which parts of the HTML document embedding legacy media need to be rewritten: HTML pages, defined by their URLs and timestamps, embed media data, also defined by URL and timestamp, at a certain place in the DOM tree, defined by an XPath expression or a CSS selector.
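The information described in the last point can be pictured as a small data structure. The following is only an illustrative sketch: the class names, field names, and example URLs are hypothetical and not part of the WARC specification or of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """A captured resource, identified by URL and 14-digit timestamp."""
    url: str
    timestamp: str  # e.g. "20100315120000"


@dataclass(frozen=True)
class EmbedLocation:
    """Where on a page a piece of legacy media is embedded (hypothetical)."""
    page: Capture
    media: Capture
    selector: str  # CSS selector (or XPath) of the embedding element


def embeds_for_page(embeds, page):
    """Return all embed locations recorded for one page capture."""
    return [e for e in embeds if e.page == page]


page = Capture("http://example.com/show.html", "20100315120000")
embeds = [
    EmbedLocation(page,
                  Capture("http://example.com/clip.flv", "20100315120003"),
                  "object#player"),
    EmbedLocation(Capture("http://example.com/other.html", "20100401000000"),
                  Capture("http://example.com/other.flv", "20100401000002"),
                  "embed.video"),
]
found = embeds_for_page(embeds, page)
print([e.selector for e in found])
```

A replay system holding such entries would know, for each page capture, which element to rewrite and which media capture it stands for.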

2. youtube-dl

The popular tool and Python library youtube-dl downloads media and metadata from various streaming media services, such as YouTube and Vimeo. It is not designed for web archiving purposes but is general enough to allow for integration with different frameworks. The metadata records it generates contain some of the information defined in (1.3) above, in particular:

A media ID that identifies the media on the hosting service. When media are embedded in a target page via the popular <iframe> technique, this identifier can be used to construct the URL of the iframe document that the replay system could then replace.
In the case of video, media dimensions and duration are available, providing the replay system with parameters to construct HTML for embedding the media.
Encoding information (video and audio codecs used), providing the replay system with possible choices for the client.

(All metadata fields youtube-dl specifies are listed in the project's source code.)
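As a rough sketch of how a replay system might use these fields, the example below derives an iframe URL and a MIME type with a codecs parameter from a hand-written metadata dictionary. The field names (id, ext, width, height, duration, vcodec, acodec) follow youtube-dl's metadata; the sample values, the helper names, and the small extension-to-MIME table are assumptions made for illustration.

```python
# Hand-written sample, not real youtube-dl output.
info = {
    "id": "VIDEO_ID",
    "ext": "mp4",
    "width": 640,
    "height": 360,
    "duration": 93,
    "vcodec": "avc1.64001F",
    "acodec": "mp4a.40.2",
}

# Minimal lookup table for this sketch only.
MIME_BY_EXT = {"mp4": "video/mp4", "webm": "video/webm", "flv": "video/x-flv"}


def iframe_url(info):
    """Build the URL of the iframe document a YouTube embed would load."""
    return "https://www.youtube.com/embed/" + info["id"]


def source_type(info):
    """Build a MIME type with codecs parameter for an HTML5 <source> tag."""
    mime = MIME_BY_EXT[info["ext"]]
    # youtube-dl uses the string "none" when a track is absent.
    codecs = [c for c in (info.get("vcodec"), info.get("acodec"))
              if c and c != "none"]
    if codecs:
        mime += '; codecs="' + ", ".join(codecs) + '"'
    return mime


print(iframe_url(info))
print(source_type(info))
```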

3. youtube-dl metadata in WARC

A previous version of pywb introduced the option to replace embeds from popular media hosting sites with versions downloaded via youtube-dl and corresponding metadata in WARC. The metadata would be stored in JSON format as a WARC metadata record under a synthetic metadata:// URL. The metadata URL would match the URL of the page with the embedded video and be findable via a regular CDX lookup. Every page accessed would therefore also require a second lookup for any metadata records.

During replay, pywb would replace the original embed HTML code with media players fitting the video format downloaded by youtube-dl. The standard HTML5 media players provided natively by browsers would be used for video files in HTML5-supported formats, while the popular open source Flash video player FlowPlayer was used for FLV videos.

The same approach to youtube-dl based capture and storage in WARC was later adopted by wpull and Brozzler. Both are capture-only tools, and pywb was (and remains, to our knowledge!) the only tool that could replay 'external' videos captured through youtube-dl. For this reason, wpull and Brozzler implemented the same metadata format to be compatible with pywb. This enabled pywb to replay WARC files generated with these tools. Other replay systems like OpenWayback didn’t support replay of youtube-dl based captures and in some cases would fall back to making captured videos accessible separately from the pages in which they had been embedded.

4. Capturing and replaying HLS/MPEG-DASH videos with Webrecorder

The goal of Webrecorder/pywb has always been to provide the highest-fidelity browser-based capture possible. When video content was served via Flash, it was necessary to use youtube-dl to download the videos, as rewriting Flash players for replay was not possible.

When online video (and audio) switched to HTML5 standards, it became possible to rewrite and capture <video> and <audio> directly. With additional improvements to the client-side rewriting system, Webrecorder was now able to capture custom HTML5-based video players along with other custom JavaScript content.

However, when HTML5 video became commonplace, several adaptive streaming formats were also introduced, namely HLS and MPEG-DASH. Both formats specify a way to split up a single video stream into chunks, and would allow for small time spans of a video to be available at different levels of quality. The browser chooses the next chunk to be loaded based on network performance, controlled by JavaScript that is looking up the available options in a standardized manifest. With these adaptive streaming techniques, the media sources connected to a regular HTML5 <video> tag would actually be modified on the fly, with the loaded chunks being pieced together via complex JavaScript.

Choosing different chunks of video dynamically presented a problem for web archiving, where reproducibility is paramount. Fortunately, two key workarounds were possible:

First, many large streaming sites, including YouTube, Vimeo, and SoundCloud, provide a parameter to disable HLS/DASH streaming and serve only a single video stream using their native player. Client-side rewriting during capture can simply toggle this option on, resulting in a single video stream. The outcome of this approach can be considered the most desirable, but it requires custom work per site and hence is only available on a few known sites.

For other sites using HLS/DASH, a different rewriting approach was taken. During capture, the HLS/DASH manifest is rewritten such that only one stream is advertised to the video player JavaScript, usually the highest resolution below a maximum threshold (to avoid defaulting to ultra HD formats). The resolution chosen is added to the WARC record as metadata. During replay, using the metadata from the WARC record, the same single stream is chosen as was used during capture. Because only one resolution is served to the browser, the same stream URLs are requested during capture and replay.
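The manifest rewriting described above can be sketched for an HLS master playlist as follows. This is a simplified illustration and not the code used by pywb: the function name, the default threshold of 1080 lines, and the sample playlist are assumptions, and only the RESOLUTION attribute of each #EXT-X-STREAM-INF tag is considered.

```python
import re

STREAM_INF = "#EXT-X-STREAM-INF:"
RESOLUTION = re.compile(r"RESOLUTION=(\d+)x(\d+)")


def rewrite_hls_master(manifest, max_height=1080):
    """Keep only one variant stream: the tallest one not above max_height.

    Returns (rewritten_manifest, chosen_height). If no variant qualifies,
    the manifest is returned unchanged and chosen_height is None.
    """
    lines = manifest.splitlines()
    best_index, best_height = None, -1
    for i, line in enumerate(lines):
        if not line.startswith(STREAM_INF):
            continue
        match = RESOLUTION.search(line)
        if not match:
            continue
        height = int(match.group(2))
        if best_height < height <= max_height:
            best_index, best_height = i, height

    if best_index is None:
        return manifest, None

    out = []
    skip_uri = False
    for i, line in enumerate(lines):
        if line.startswith(STREAM_INF):
            if i == best_index:
                out.append(line)
            else:
                skip_uri = True  # drop this tag and the URI line after it
            continue
        if skip_uri:
            skip_uri = False
            continue
        out.append(line)
    return "\n".join(out) + "\n", best_height


sample = "\n".join([
    "#EXTM3U",
    "#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360",
    "low/index.m3u8",
    "#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720",
    "mid/index.m3u8",
    "#EXT-X-STREAM-INF:BANDWIDTH=14000000,RESOLUTION=3840x2160",
    "uhd/index.m3u8",
])
rewritten, height = rewrite_hls_master(sample)
print(height)
print(rewritten)
```

The returned height stands in for the value that would be stored as metadata with the WARC record, so that replay can select the same stream again.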

5. British Library custom framework

The British Library was using a custom framework for replacing legacy Flash video embeds with embeds of transcoded versions of these videos for access. Since each of the pages in question had only a single video embedded, the required metadata just needed to match the page’s URL and archiving timestamp with a transcoded media file.

While this system is effective, it is specialized to a single use case, is implemented as custom JavaScript injected into OpenWayback replay, and stores all data outside of the usual formats and locations that would be integrated into web archiving workflows: in separate JSON and video files.

6. Rhizome Nastynets restoration

In 2018, Rhizome restored Nastynets, a blog run by artists between 2006 and 2012, containing more than 30,000 posts and featuring hundreds of embeds of legacy media formats like QuickTime, FLV, SunAudio, and MIDI audio. A transcoding and rendering pipeline based on ffmpeg and timidity was created for linear media embeds, creating a variety of access and preservation formats (from FFV1 video and FLAC audio to WebM VP9 video and Opus audio). Before converting files available on local disk to WARC, the legacy HTML was rewritten to embed transcoded and rendered versions of media files.

While "hard rewriting" makes sense for this particular project (the available HTML files were far from usable in a web archive for numerous reasons), a generalized approach would need to add the media replacement as a function of the replay system. Converting single media files to a set of access and preservation formats, as commonly done in other fields of digital preservation, increases the likelihood that future access fixes for a web resource remain possible.

(The Nastynets restoration was done in cooperation with digital media conservator Anne Krause.)

New integrated approach

For Webrecorder/pywb, the HTML5 capture/replay explained above has worked well in enabling browser-based capture and replay of HTML5 video and audio.

However, for legacy media embeds or content captured using separate tools, outside of Webrecorder's symmetrical archiving system, a system for adding external replacements during replay is still necessary. In addition, to support legacy formats, a conversion process is necessary to transcode media into HTML5-ready formats. An additional metadata lookup, similar to the youtube-dl metadata format, would still be needed to store a list of all available embeds/transclusions from a given containing HTML page.

The British Library has commissioned Webrecorder to define and implement a working solution for their corpus, which mostly has to deal with Flash FLV video.

Requirements

Given the experience from previous approaches, a new process would need to fulfill the following requirements:

Support multiple format conversions created through a conversion workflow and stored in WARC conversion records.
Be fully integrated with a web archive storage and replay system.
Provide a safe path to upgrading media embeds should new transcoding be required in the future.
Provide an initial toolset to perform the required actions.

Metadata

The youtube-dl metadata JSON format was initially a useful starting point for specifying multiple formats of embedded content and their MIME types. However, the full youtube-dl format includes data not needed for archiving and lacks other necessary information, such as the location of an object on the page, support for multiple objects, and other metadata specific to conversions. The JSON format will likely evolve further away from the original youtube-dl inspired JSON.

Indexing and matching metadata

Since metadata for media replacement needs to be matched with the pages that embed the media, WARC records will be created with a metadata:// URL matching the URL and timestamp of the page they should be applied to.

The records will contain a JSON structure describing each media embed that should be replaced on the page, listed by the URL of the media file to be replaced. This allows for adding replacement information over time and overwriting outdated replacement information.

Classic CDX lookup procedures can be applied, using the standard WARC header fields WARC-Target-URI and WARC-Date, and matching the closest date.
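The lookup can be sketched as below. This is a minimal illustration under several assumptions: the exact form of the metadata:// URL, the shape of the index entries, and the layout of the JSON payload (the "embeds" and "formats" keys) are all hypothetical and do not describe the format actually written by warcit or read by pywb.

```python
from datetime import datetime


def metadata_url(page_url):
    """Hypothetical mapping from a page URL to its metadata:// URL."""
    return "metadata://" + page_url.split("://", 1)[-1]


def parse_ts(ts):
    """Parse a 14-digit CDX-style timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest(index, url, timestamp):
    """Return the index entry for url whose date is closest to timestamp."""
    target = parse_ts(timestamp)
    candidates = [e for e in index if e["url"] == url]
    if not candidates:
        return None
    return min(candidates, key=lambda e: abs(parse_ts(e["timestamp"]) - target))


# Two metadata records for the same page, written at different times.
index = [
    {"url": "metadata://example.com/show.html",
     "timestamp": "20100315120000",
     "embeds": {"http://example.com/clip.flv": {"formats": ["video/webm"]}}},
    {"url": "metadata://example.com/show.html",
     "timestamp": "20180601000000",
     "embeds": {"http://example.com/clip.flv":
                {"formats": ["video/webm", "video/mp4"]}}},
]

entry = closest(index, metadata_url("http://example.com/show.html"),
                "20100316000000")
print(entry["timestamp"])
```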

Pointing to several transcoded formats

The metadata for each media embed that should be replaced can point to several data sources in different transcoded formats. On access, the replay system can generate the required HTML code to embed them as a fallback chain and inject it into the currently viewed page. This fallback chain also includes references to the original media and preservation formats. So even if access conditions have changed such that the transcoded media files are no longer playable and the web archive has not yet been updated, users can still download the data in a preservation format and transcode it themselves. The availability of this metadata will also allow automatic conversion processes to be designed that operate directly on WARC records.
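One possible shape of such a fallback chain is sketched below: an HTML5 <video> element with one <source> per transcoded format, in order of preference, followed by download links for the preservation copy and the original file. The function name, the wrapper markup, and the example URLs are assumptions; in a real replay system the URLs would additionally be rewritten to point into the archive.

```python
from html import escape


def fallback_chain_html(sources, downloads, width, height):
    """Build embed HTML with ordered <source> tags plus download links.

    sources:   list of (url, mime_type), most preferred first
    downloads: list of (url, label) for preservation/original files
    """
    parts = [
        '<div class="media-replacement">',
        '  <video controls width="%d" height="%d">' % (width, height),
    ]
    for url, mime in sources:
        # The browser plays the first <source> whose type it supports.
        parts.append('    <source src="%s" type="%s">'
                     % (escape(url, quote=True), escape(mime, quote=True)))
    parts.append('  </video>')
    # Links sit outside <video> so they stay visible if no source plays.
    for url, label in downloads:
        parts.append('  <a href="%s" download>%s</a>'
                     % (escape(url, quote=True), escape(label)))
    parts.append('</div>')
    return "\n".join(parts)


html_code = fallback_chain_html(
    sources=[("http://example.com/clip.webm", "video/webm"),
             ("http://example.com/clip.mp4", "video/mp4")],
    downloads=[("http://example.com/clip.mkv", "Preservation copy"),
               ("http://example.com/clip.flv", "Original Flash video")],
    width=640, height=360,
)
print(html_code)
```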

Storage of metadata and transcoded media

Media replacement metadata will be stored in WARC metadata records under a metadata:// URL scheme matching the page the replacement should become active on.

Transcoded media will be stored in WARC conversion records that reference the original WARC records they are alternatives to, in adherence to the WARC standard.
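To make the two record types concrete, the sketch below serializes a conversion record and a metadata record by hand using only the standard library. The header names (WARC-Type, WARC-Record-ID, WARC-Date, WARC-Target-URI, WARC-Refers-To) come from the WARC specification; the helper names, target URIs, JSON layout, and placeholder payload are assumptions. Production code would normally use a WARC library rather than this hand-rolled writer.

```python
import json
import uuid
from datetime import datetime, timezone


def new_record_id():
    """Create a WARC-Record-ID in the usual urn:uuid form."""
    return "<urn:uuid:%s>" % uuid.uuid4()


def build_record(record_type, target_uri, content_type, block, refers_to=None):
    """Serialize one uncompressed WARC/1.0 record as bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", new_record_id()),
        ("WARC-Date", now),
        ("WARC-Target-URI", target_uri),
    ]
    if refers_to:
        # Links a conversion back to the record it was derived from.
        headers.append(("WARC-Refers-To", refers_to))
    headers.append(("Content-Type", content_type))
    headers.append(("Content-Length", str(len(block))))
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


# Stand-in for the record ID of the originally captured FLV response.
original_id = new_record_id()

conversion = build_record(
    "conversion", "http://example.com/clip.flv", "video/webm",
    b"webm",  # placeholder bytes, not a real video
    refers_to=original_id,
)

payload = json.dumps(
    {"http://example.com/clip.flv": {"formats": ["video/webm"]}}
).encode("utf-8")
metadata = build_record(
    "metadata", "metadata://example.com/show.html", "application/json", payload
)

print(conversion.decode("utf-8", "replace"))
```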

Tooling

Based on the existing use-case at British Library, with web sites containing Flash video embeds existing as files on disk, an extension to the Webrecorder tool warcit makes the most sense. This includes a media conversion pipeline based on ffmpeg.
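As an illustration of what one step of such a pipeline might look like, the sketch below builds ffmpeg command lines that transcode a Flash video into two HTML5-ready formats. It is not warcit's actual configuration: the function name, the choice of output formats, and the file paths are assumptions, and the commands are only constructed here, not executed.

```python
from pathlib import PurePosixPath


def ffmpeg_commands(src):
    """Return ffmpeg argument lists for WebM and MP4 access copies of src."""
    path = PurePosixPath(src)
    webm = str(path.with_suffix(".webm"))
    mp4 = str(path.with_suffix(".mp4"))
    return {
        # VP9 video with Opus audio in a WebM container.
        "webm": ["ffmpeg", "-i", src,
                 "-c:v", "libvpx-vp9", "-c:a", "libopus", webm],
        # H.264 video with AAC audio; faststart moves the index to the
        # front of the file so playback can begin before download ends.
        "mp4": ["ffmpeg", "-i", src,
                "-c:v", "libx264", "-c:a", "aac",
                "-movflags", "+faststart", mp4],
    }


commands = ffmpeg_commands("site/media/clip.flv")
for name, cmd in commands.items():
    print(name, " ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed with the corresponding encoders.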

The existing video replacement mechanisms in pywb will be re-activated and updated to work with the new metadata that is not based on youtube-dl conventions. Direct support for Flash video via the FlowPlayer will be removed, using only HTML5 video.

Examples and Technical Documentation

Please see the warcit conversion and transclusion documentation for how to convert legacy content and create WARCs usable in pywb.

Pywb Deployment

The replay system is currently automatically supported in the ukwa-pywb setup of pywb. It will eventually be merged into the mainline pywb release. Media transclusions will be enabled and available on replay without requiring any additional configuration, other than adding the WARCs created by warcit.
