
Synchronized Narration #83

Closed · wants to merge 8 commits into from

Conversation

@llemeurfr
Contributor

This is a first draft of a spec related to Synchronized Narrations, a JSON equivalent of Media Overlays.


## Mime Type and file extension
Member

"MIME type" is deprecated in favor of "media type"

[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
https://www.iana.org/assignments/media-types/media-types.xhtml


## Declaring Synchronized Narration documents in a Manifest

Each Synchronized Narration document used in a publication **must** be declared in the Readium Webpub Manifest as an `alternate` resource with the proper media type.
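For illustration, a minimal sketch of what such a declaration could look like in a `readingOrder` item (the media type value is a placeholder, since the registration was still under discussion in this PR):

```json
{
  "href": "text/chapter1.html",
  "type": "text/html",
  "alternate": [
    {
      "href": "sync/chapter1.json",
      "type": "application/vnd.readium.syncnarr+json"
    }
  ]
}
```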
Member

I always considered alternate to be an alternative rendition of a particular resource. The Synchronized Narration object seems more like an augmentation of the resource.

But maybe I'm wrong on the semantics of alternate?

Comment on lines +58 to +59
"textRef": "/text/chapter1.html",
"audioRef": "/audio/chapter1.mp3",
Member

The term ref is not used anywhere else in RWPM, I think. How about href?

Member

Although href is a special case of hypertextual ref, which maybe doesn't make sense here.

Member

Also: the leading slash should be removed.

Member

Personally I feel uneasy about hoisting "base" URLs into a higher level of the data serialisation hierarchy, for different media types (e.g. text and audio ... but what about others, and what about cases where not all resources of a given media type have the same base URL?). Also, this competes with the implicit base URL of the JSON resource itself, and its 'self' override (if any) ... and potentially its 'base' as well (see prior discussion linked below)

readium/architecture#109 (comment)

Member

Also, this competes with the implicit base URL of the JSON resource itself, and its 'self' override (if any) ... and potentially its 'base'

When I mentioned a "base URL" during the last call, I was actually referring to a self link, so that relative hrefs would work like in a regular RWPM.

but what about others, and what about cases where not all resources of a given media type have the same base URL?)

If on different servers, then I guess they need to be absolute URLs.

@mickael-menu
Member

mickael-menu commented Jun 2, 2021

I feel that there's an opportunity to generalize the synchronization to any media type.

The spec mentions that we could extend it later by adding more media type-specific properties:

In case we decide to extend the structure to image and video, using image and video would be consistent with the latest work of the W3C CG.

But why not make it media-type agnostic instead? This could support all these use cases from the start:

  • Small illustrations or sign-language videos explaining words or utterances in a text.
  • Audio narration over a comic book.
  • Subtitles over video-based publications.

Even text-on-text synchronization could open interesting possibilities:

  • Synchronizing a publication and its translation, useful for:
    • Displaying the two versions side-by-side, to practice learning a language.
    • Displaying an accurate translation of a paragraph, when reading an ebook in a foreign language.
  • Synchronizing a publication and a commentary, for example to display the notes side by side or in a margin.
    • I'm talking about "published author commentary" not user annotations. Think classical texts annotated with explanations for studies.

If we go down that road, "Synchronized Media" might be more accurate than "Narration".

I think for this we just need to:

  • rename text to source, master or primary
  • rename audio to secondary or something else
  • rename narration
  • add a way to specify the media type of the two resources to know how to interpret the fragments

I'm also in favor of having full hrefs in the narration items, because it enables having several secondary resources for a single reading order item. This would be important for use cases like illustrations or sign-language videos.
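A hypothetical sketch of what a media-type-agnostic document could look like along these lines (every property name here is illustrative, nothing is settled):

```json
{
  "primary": { "href": "text/chapter1.html", "type": "text/html" },
  "secondary": { "href": "audio/chapter1.mp3", "type": "audio/mpeg" },
  "sync": [
    { "primary": "#paragraph1", "secondary": "#t=0.0,1.2" },
    { "primary": "#paragraph2", "secondary": "#t=1.2,3.4" }
  ]
}
```

The top-level media types would tell a reading system how to interpret each fragment identifier (element IDs for HTML, media fragments for audio or video).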

@danielweck
Member

"Synchronized Rendition" might be more accurate than "Narration"

"Sync Media" is probably a better choice than "Sync Rendition". The former is already the chosen term for the W3C draft which succeeds "Sync Narration", the latter has been in use since EPUB3 "Multiple Renditions" which has different semantics.

See:

https://w3c.github.io/sync-media-pub/sync-media.html#media-objects

https://w3c.github.io/epub-specs/epub33/multi-rend/

@HadrienGardeur
Collaborator

We've had a lot of discussions in the past about media overlays and we really need to dig back in there and read things before we merge any new document at this point.

The following documents are currently in our architecture repo:

There are also many issues that mention media overlays: https://github.com/readium/architecture/issues?q=is%3Aissue+media+overlay

But why not make it media-type agnostic instead? This could support all these use cases from the start

I've already made a similar proposal back in 2019, based on our RWPM Link Object, see: readium/architecture#88

I believe that we could find a middle ground between this proposal (a Synchronized Media document essentially based on our core model for RWPM) and something more specialized (similar to what we have in our architecture repo or this PR).

@HadrienGardeur
Collaborator

I always considered alternate to be an alternative rendition of a particular resources. The Synchronized Narration object seems more like an augmentation of the resource.

But maybe I'm wrong on the semantics of alternate?

You're completely right that alternate and Synchronized Media/Media Overlays are very closely related to one another.

There's one major difference between the two of them:

  • alternate in readingOrder or resources is limited to resource-level alternates
  • Synchronized Media/Media Overlay operate at a fragment (or sub-resource) level

There are a few other places where we also work with fragments:

One could argue that for a Divina with guided navigation, Synchronized Media would not be useful as you could express the same type of information purely with alternate.

This is one of the reasons why we need to move beyond the EPUB point of view on this and think about a more generic approach that applies to all fragments across all media.
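To illustrate the difference with a sketch (both snippets illustrative): a resource-level alternate swaps in a whole replacement resource, while a Synchronized Media entry pairs fragments of two resources.

```json
{
  "href": "text/chapter1.html",
  "type": "text/html",
  "alternate": [
    { "href": "audio/chapter1.mp3", "type": "audio/mpeg" }
  ]
}
```

```json
{
  "text": "#paragraph1",
  "audio": "#t=0.0,1.2"
}
```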

llemeurfr and others added 2 commits June 4, 2021 15:56
typo correction

Co-authored-by: Mickaël Menu <mickael.menu@gmail.com>
typo correction

Co-authored-by: Mickaël Menu <mickael.menu@gmail.com>
@llemeurfr
Contributor Author

There are several items to settle on, which we can tackle in this order I guess:

  • do we drop the proposed textRef and audioRef, which have the drawbacks described in previous comments?
  • do we express a notion of "primary" (singular) and "secondary" (plural) resources, which opens the path to text-to-text mapping? if yes, how?
  • do we use simplified link objects, with href and type, instead of text + audio properties, and children instead of sub-narration?
  • in this case, what happens to structural semantics, i.e. the role property?
  • do we move from alternate to a more specific property in the resource referencing the syncnarr structure?
  • do we replace "sync narration" by "sync media"?

@mickael-menu
Member

This issue might be relevant and shows there's a need for synchronization besides text-to-audio: w3c/publishingcg#20

@m-abs

m-abs commented Sep 27, 2021

Hi,

This issue might be relevant and shows there's a need for synchronization besides text-to-audio: w3c/publishingcg#20

I'm the author of that issue, and of this discussion: #74

Maybe we could make Synchronized Narration for comics and magazines like this:

```json
{
  "imageRef": "images/chapter1.jpeg",
  "audioRef": "audio/chapter1.mp3",
  "narration": [
    {
      "image": "#xywh=percent:5,5,15,15",
      "audio": "#t=0.0,1.2"
    },
    {
      "image": "#xywh=percent:20,20,25,25",
      "audio": "#t=1.2,3.4"
    },
    {
      "image": "#xywh=percent:5,45,30,30",
      "audio": "#t=3.4,5.6"
    }
  ]
}
```

Or should DiViNa's guided navigation be extended with an audio property?

Something like:

"guided": [
  {
    "href": "http://example.org/page1.jpeg",
    "audio": "http://example.org/page1.mp3#t=0,11",
    "title": "Page 1"
  },
  {
    "href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
    "audio": "http://example.org/page1.mp3#t=11,25",
    "title": "Panel 1"
  },
  {
    "href": "http://example.org/page1.jpeg#xywh=300,200,310,200",
    "audio": "http://example.org/page1.mp3#t=25,102",
    "title": "Panel 2"
  }
]

I don't like the name audio but couldn't come up with something better.
I'm also a bit worried it is too verbose, generating larger-than-needed JSON files.

@mickael-menu
Member

@m-abs Did you read this #83 (comment)? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.

@m-abs

m-abs commented Sep 28, 2021

@m-abs Did you read this #83 (comment)? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.

Sorry, I must have missed it last night.

#83 (comment):

  • rename text to source, master or primary
  • rename audio to secondary or something else

Could there be a use case where one would need more than just the two sources?
Maybe replace text and audio with links, an array of Link objects or href strings?

It could be mixing two texts as suggested plus background music, or a comic book frame + a speech bubble text + background music, or a way to implement #49.

I'm also in favor of having full hrefs in the narration items, because it enables having several secondary resources for a single reading order item. This would be important for use cases like illustrations or sign-language videos.

I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)

@HadrienGardeur
Collaborator

I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)

@m-abs What was your approach regarding resources? A single mapping document for the whole publication or one per resource?

@m-abs

m-abs commented Sep 28, 2021

@m-abs What was your approach regarding resources? A single mapping document for the whole publication or one per resource?

A single document for the whole publication.

We made a dump of the structure of DAISY 2.02 audiobooks (both with and without text) to our private JSON format in a single file.
This JSON contains a resource map from local path/URI to server path.

Some of these JSON files became very big and caused problems with our client app and for our users.

@mickael-menu
Member

mickael-menu commented Sep 28, 2021

Could there be a use-case there one would need more than just the two sources?
Maybe replace text and audio with links an array of Link objects or href strings?

I guess that would be fine, to have either a single link or a link array in secondary, as long as we have only a single primary/leading resource.

I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)

By full HREFs I meant having the path of the resource relative to the self link, not necessarily a full URL. For example:

`#xywh=percent:5,5,15,15` -> `images/one.png#xywh=percent:5,5,15,15`
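A sketch of a sync item using such resource-relative hrefs, which is also what would allow several secondary resources per primary fragment (property names hypothetical, following the renaming discussed above):

```json
{
  "primary": "text/chapter1.html#word42",
  "secondary": [
    "audio/chapter1.mp3#t=1.2,3.4",
    "video/sign-language.mp4#t=1.2,3.4"
  ]
}
```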

@alexwhb

alexwhb commented Feb 22, 2022

Just for clarity's sake, since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface, so that you could interpret other synchronization formats as well.

@alexwhb

alexwhb commented Feb 22, 2022

Also, on a more technical note (and correct me if this is the wrong place to discuss this), I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired. It would also be useful in our use case to have the ability to stream audio instead of playing downloaded media. Just some thoughts for potentially making the implementation more flexible for these types of use cases.

One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing, etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU. The few options I considered were:

  1. Just have a function get triggered, say, every 300 ms or so, and check if the currently highlighted word is still in the range of the media overlay timecode... if so, do nothing; if not, find the next one in the range. This solution is okay but can feel a bit laggy at times, especially if you are doing a word-by-word highlight. Obviously you can turn down the delay, but then you start eating more CPU.

  2. The second option I thought of was doing a postDelayed with a runnable, where the delay is always equal to the word duration, so you get called back when the current word should no longer be highlighted. This avoids using a lot of CPU, since you are not adding any overhead that the message loop is not already incurring. The issue with this implementation is that there is a time drift that happens using this method, because of the intrinsic delay of message handling and executing functions. After using this for about 30 seconds or maybe a minute, you definitely notice the synchronization getting off. So ideally you could use this method but have some sort of calculation to see how much internal timing error there is and adjust for that (see the sketch below).
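A minimal Kotlin sketch of option 2 with drift correction, assuming a player that exposes its current position and a precomputed list of word spans (all names are hypothetical, not a Readium API):

```kotlin
import android.os.Handler
import android.os.Looper

data class WordSpan(val id: String, val startMs: Long, val endMs: Long)

class HighlightScheduler(
    private val spans: List<WordSpan>,
    private val currentPositionMs: () -> Long, // e.g. backed by MediaPlayer.currentPosition
    private val highlight: (WordSpan) -> Unit
) {
    private val handler = Handler(Looper.getMainLooper())
    private var index = 0

    fun start() = scheduleNext()

    private fun scheduleNext() {
        if (index >= spans.size) return
        val span = spans[index]
        highlight(span)
        index++
        // Re-anchor each callback on the real playback clock instead of
        // accumulating fixed word durations, so message-loop latency
        // cannot drift the highlight out of sync.
        val delayMs = (span.endMs - currentPositionMs()).coerceAtLeast(0L)
        handler.postDelayed({ scheduleNext() }, delayMs)
    }
}
```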

Any thoughts on the above notes?

@mickael-menu
Member

mickael-menu commented Feb 23, 2022

Just for clarity's sake, since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface, so that you could interpret other synchronization formats as well.

Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.

Also, on a more technical note (and correct me if this is the wrong place to discuss this), I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired.

As far as I know nothing prevents mixing local and remote resources in a RWPM, so the sync file could reference remote resources.

One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing, etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU.

I'll start working on TTS next week and will have more intel then, but:

  • As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?
  • I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling.
    EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)

@alexwhb

alexwhb commented Feb 23, 2022

@mickael-menu Fantastic stuff. Thanks for the prompt response.

Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.

This makes a lot more sense. Thanks for the clarification.

As far as I know nothing prevents mixing local and remote resources in a RWPM, so the sync file could reference remote resources.

This is very good to know. I was actually not aware of that.

As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?

You are totally correct here. The SMIL spec just lets you define IDs that you'd like to highlight. In our case we have a proprietary parser that wraps every word in our ebooks with a unique ID, and we can then reference those word IDs in the SMIL file and link them up to the timecode where they appear in their respective audio files. Obviously the word level is arbitrary and could be swapped out for a sentence or a paragraph if desired, or an arbitrary range.

I do wonder if there'd be a simple way to change the overlay style from word to sentence to paragraph, if our books contained the IDs for all three types and the timecode ranges. That'd be a feature I'd really like, but I've not spent enough time looking at the SMIL spec to see if that's currently supported.

I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling.
EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)

This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but am not fully seeing how this would work with media overlays? It's too bad it doesn't take a timecode range or our jobs would be done. 😂

Really excited to see the TTS stuff.  

@mickael-menu
Member

This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but am not fully seeing how this would work with media overlays? It's too bad it doesn't take a timecode range or our jobs would be done. 😂

Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium though, in the context of synchronized narration.

@alexwhb

alexwhb commented Feb 23, 2022

Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium though, in the context of synchronized narration.

I do have the timestamps of each word. 🤩 I'll send you an example section of one of our SMIL files as a reference in a bit. I'll dig one up.

Also, I just found the implementation on the JS side. Maybe we can glean some insight from it.

Also, I know you are busy with other things, so don't let me distract you. 😄 I'm kinda thinking out loud here.

@danielweck
Member

Also see: readium/architecture#181

@danielweck
Member

Superseded by #95?

@HadrienGardeur deleted the update/sync-narr branch June 26, 2024 08:43