
Synchronized Narration #83

Closed · wants to merge 8 commits into from

Conversation

@llemeurfr
Contributor

This is a first draft of a spec related to Synchronized Narrations, a JSON equivalent of Media Overlays.


## Mime Type and file extension
Member

"MIME type" is deprecated in favor of "media type"

[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
https://www.iana.org/assignments/media-types/media-types.xhtml


## Declaring Synchronized Narration documents in a Manifest

Each Synchronized Narration document used in a publication **must** be declared in the Readium Webpub Manifest as an `alternate` resource with the proper media type.
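For illustration, a minimal sketch of what such a declaration could look like in a `readingOrder` item (the media type value is a placeholder, since the registration was still under discussion in this PR):

```json
{
  "href": "text/chapter1.html",
  "type": "text/html",
  "alternate": [
    {
      "href": "sync/chapter1.json",
      "type": "application/vnd.readium.syncnarr+json"
    }
  ]
}
```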
Member

I always considered alternate to be an alternative rendition of a particular resource. The Synchronized Narration object seems more like an augmentation of the resource.

But maybe I'm wrong on the semantics of alternate?

Comment on lines +58 to +59
"textRef": "/text/chapter1.html",
"audioRef": "/audio/chapter1.mp3",
Member

The term ref is not used anywhere else in RWPM, I think. How about href?

Member

Although href is a special case of hypertextual ref, which maybe doesn't make sense here.

Member

Also: the leading slash should be removed.

Member

Personally I feel uneasy about hoisting "base" URLs into a higher level of the data serialisation hierarchy, for different media types (e.g. text and audio ... but what about others, and what about cases where not all resources of a given media type have the same base URL?). Also, this competes with the implicit base URL of the JSON resource itself, and its 'self' override (if any) ... and potentially its 'base' as well (see prior discussion linked below)

readium/architecture#109 (comment)

Member

Also, this competes with the implicit base URL of the JSON resource itself, and its 'self' override (if any) ... and potentially its 'base'

When I mentioned a "base URL" during the last call, I was actually referring to a self link, so that relative hrefs would work like in a regular RWPM.

but what about others, and what about cases where not all resources of a given media type have the same base URL?)

If on different servers, then I guess they need to be absolute URLs.

@mickael-menu
Member

mickael-menu commented Jun 2, 2021

I feel that there's an opportunity to generalize the synchronization to any media type.

The spec mentions that we could extend it later by adding more media type-specific properties:

In case we decide to extend the structure to image and video, using image and video would be consistent with the latest work of the W3C CG.

But why not make it media-type agnostic instead? This could support all these use cases from the start:

  • Small illustrations or sign-language videos explaining words or utterances in a text.
  • Audio narration over a comic book.
  • Subtitles over video-based publications.

Even text-on-text synchronization could open interesting possibilities:

  • Synchronizing a publication and its translation, useful for:
    • Displaying the two versions side-by-side, to practice learning a language.
    • Displaying an accurate translation of a paragraph, when reading an ebook in a foreign language.
  • Synchronizing a publication and a commentary, for example to display the notes side by side or in a margin.
    • I'm talking about "published author commentary" not user annotations. Think classical texts annotated with explanations for studies.

If we go down that road, "Synchronized Media" might be more accurate than "Narration".

I think for this we just need to:

  • rename text to source, master or primary
  • rename audio to secondary or something else
  • rename narration
  • add a way to specify the media type of the two resources to know how to interpret the fragments

I'm also in favor of having full hrefs in the narration items, because it enables having several secondary resources for a single reading order item. This would be important for use cases like illustrations or sign-language videos.
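A hypothetical sketch of what a media-type-agnostic document could look like along these lines (every property name here is illustrative, nothing is settled):

```json
{
  "primary": { "href": "text/chapter1.html", "type": "text/html" },
  "secondary": { "href": "audio/chapter1.mp3", "type": "audio/mpeg" },
  "sync": [
    { "primary": "#paragraph1", "secondary": "#t=0.0,1.2" },
    { "primary": "#paragraph2", "secondary": "#t=1.2,3.4" }
  ]
}
```

The top-level media types would tell a reading system how to interpret each fragment identifier (element IDs for HTML, media fragments for audio or video).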

@danielweck
Member

"Synchronized Rendition" might be more accurate than "Narration"

"Sync Media" is probably a better choice than "Sync Rendition". The former is already the chosen term for the W3C draft which succeeds "Sync Narration", the latter has been in use since EPUB3 "Multiple Renditions" which has different semantics.

See:

https://w3c.github.io/sync-media-pub/sync-media.html#media-objects

https://w3c.github.io/epub-specs/epub33/multi-rend/

@HadrienGardeur
Collaborator

We've had a lot of discussions in the past about media overlays and we really need to dig back in there and read things before we merge any new document at this point.

The following documents are currently in our architecture repo:

There are also many issues that mention media overlays: https://github.com/readium/architecture/issues?q=is%3Aissue+media+overlay

But why not make it media-type agnostic instead? This could support all these use cases from the start

I've already made a similar proposal back in 2019, based on our RWPM Link Object, see: readium/architecture#88

I believe that we could find a middle ground between this proposal (a Synchronized Media document essentially based on our core model for RWPM) and something more specialized (similar to what we have in our architecture repo or this PR).

@HadrienGardeur
Collaborator

I always considered alternate to be an alternative rendition of a particular resources. The Synchronized Narration object seems more like an augmentation of the resource.

But maybe I'm wrong on the semantics of alternate?

You're completely right that alternate and Synchronized Media/Media Overlays are very closely related to one another.

There's one major difference between the two of them:

  • alternate in readingOrder or resources is limited to resource-level alternates
  • Synchronized Media/Media Overlay operate at a fragment (or sub-resource) level

There are a few other places where we also work with fragments:

One could argue that for a Divina with guided navigation, Synchronized Media would not be useful as you could express the same type of information purely with alternate.

This is one of the reasons why we need to move beyond the EPUB point of view on this and think about a more generic approach that applies to all fragments across all media.
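To illustrate the difference with a sketch (both snippets illustrative): a resource-level alternate swaps in a whole replacement resource, while a Synchronized Media entry pairs fragments of two resources.

```json
{
  "href": "text/chapter1.html",
  "type": "text/html",
  "alternate": [
    { "href": "audio/chapter1.mp3", "type": "audio/mpeg" }
  ]
}
```

```json
{
  "text": "#paragraph1",
  "audio": "#t=0.0,1.2"
}
```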

llemeurfr and others added 2 commits June 4, 2021 15:56
typo correction

Co-authored-by: Mickaël Menu <mickael.menu@gmail.com>
typo correction

Co-authored-by: Mickaël Menu <mickael.menu@gmail.com>
@llemeurfr
Contributor Author

There are several items to settle on, which we can tackle in this order I guess:

  • do we drop the proposed textRef and audioRef, which have the drawbacks described in previous comments?
  • do we express a notion of "primary" (singular) and "secondary" (plural) resources, which opens the path to text-to-text mapping? if yes, how?
  • do we use simplified link objects, with href and type, instead of text + audio properties, and children instead of sub-narration?
  • in this case, what happens to structural semantics, i.e. the role property?
  • do we move from alternate to a more specific property in the resource referencing the syncnarr structure?
  • do we replace "sync narration" by "sync media"?

@mickael-menu
Member

This issue might be relevant and shows there's a need for synchronization besides text-to-audio: w3c/publishingcg#20

@m-abs

m-abs commented Sep 27, 2021

Hi,

This issue might be relevant and shows there's a need for synchronization besides text-to-audio: w3c/publishingcg#20

I'm the author of that issue, and of this discussion: #74

Maybe we could make Synchronized Narration for comics and magazines like this:

```json
{
  "imageRef": "images/chapter1.jpeg",
  "audioRef": "audio/chapter1.mp3",
  "narration": [
    {
      "image": "#xywh=percent:5,5,15,15",
      "audio": "#t=0.0,1.2"
    },
    {
      "image": "#xywh=percent:20,20,25,25",
      "audio": "#t=1.2,3.4"
    },
    {
      "image": "#xywh=percent:5,45,30,30",
      "audio": "#t=3.4,5.6"
    }
  ]
}
```

Or should DiViNa's guided navigation be extended with an audio property?

Something like:

"guided": [
  {
    "href": "http://example.org/page1.jpeg",
    "audio": "http://example.org/page1.mp3#t=0,11",
    "title": "Page 1"
  },
  {
    "href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
    "audio": "http://example.org/page1.mp3#t=11,25",
    "title": "Panel 1"
  },
  {
    "href": "http://example.org/page1.jpeg#xywh=300,200,310,200",
    "audio": "http://example.org/page1.mp3#t=25,102",
    "title": "Panel 2"
  }
]

I don't like the name audio but couldn't come up with something better.
I'm also a bit worried it is too verbose, generating larger-than-needed JSON files.

@mickael-menu
Member

@m-abs Did you read this #83 (comment)? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.

@m-abs

m-abs commented Sep 28, 2021

@m-abs Did you read this #83 (comment)? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.

Sorry, I must have missed it last night.

#83 (comment):

  • rename text to source, master or primary
  • rename audio to secondary or something else

Could there be a use case where one would need more than just the two sources?
Maybe replace text and audio with links, an array of Link objects or href strings?

It could be mixing two texts as suggested plus background music, or a comic book frame + a speech bubble text + background music, or a way to implement #49.

I'm also in favor of having full hrefs in the narration items, because it enables having several secondary resources for a single reading order item. This would be important for use cases like illustrations or sign-language videos.

I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)

@HadrienGardeur
Collaborator

I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)

@m-abs What was your approach regarding resources? A single mapping document for the whole publication or one per resource?

@m-abs

m-abs commented Sep 28, 2021

@m-abs What was your approach regarding resources? A single mapping document for the whole publication or one per resource?

A single document for the whole publication.

We made a dump of the structure of DAISY 2.02 audiobooks (both with and without text) to our private JSON format in a single file.
This JSON contains a resource map from local path/URI to server path.

Some of these JSON files became very big and caused problems with our client app and for our users.

@mickael-menu
Member

mickael-menu commented Sep 28, 2021

Could there be a use-case there one would need more than just the two sources?
Maybe replace text and audio with links an array of Link objects or href strings?

I guess that would be fine, to have either a single link or a link array in secondary, as long as we have only a single primary/leading resource.

I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)

By full HREFs I meant having the path of the resource relative to the self link, not necessarily a full URL. For example:

`#xywh=percent:5,5,15,15` -> `images/one.png#xywh=percent:5,5,15,15`
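A sketch of a sync item using such resource-relative hrefs, which is also what would allow several secondary resources per primary fragment (property names hypothetical, following the renaming discussed above):

```json
{
  "primary": "text/chapter1.html#word42",
  "secondary": [
    "audio/chapter1.mp3#t=1.2,3.4",
    "video/sign-language.mp4#t=1.2,3.4"
  ]
}
```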

@alexwhb

alexwhb commented Feb 22, 2022

Just for clarity's sake, since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface, so that you could interpret other synchronization formats as well.

@alexwhb

alexwhb commented Feb 22, 2022

Also, on a more technical note (and correct me if this is the wrong place to discuss this), I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired. It would also be useful in our use case to have the ability to stream audio instead of playing downloaded media. Just some thoughts for potentially making the implementation more flexible for these types of use cases.

One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing, etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU. The few options I considered were:

  1. Just have a function get triggered, say, every 300 ms or so, and check if the currently highlighted word is still in the range of the media overlay timecode... if so, do nothing; if not, find the next one in the range. This solution is okay but can feel a bit laggy at times, especially if you are doing a word-by-word highlight. Obviously you can turn down the delay, but then you start eating more CPU.

  2. The second option I thought of was doing a postDelayed with a runnable, where the delay is always equal to the word duration, so you get called back when the current word should no longer be highlighted. This avoids using a lot of CPU, since you are not adding any overhead that the message loop is not already incurring. The issue with this implementation is that there is a time drift that happens using this method, because of the intrinsic delay of message handling and executing functions. After using this for about 30 seconds or maybe a minute, you definitely notice the synchronization getting off. So ideally you could use this method but have some sort of calculation to see how much internal timing error there is and adjust for that (see the sketch below).
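A minimal Kotlin sketch of option 2 with drift correction, assuming a player that exposes its current position and a precomputed list of word spans (all names are hypothetical, not a Readium API):

```kotlin
import android.os.Handler
import android.os.Looper

data class WordSpan(val id: String, val startMs: Long, val endMs: Long)

class HighlightScheduler(
    private val spans: List<WordSpan>,
    private val currentPositionMs: () -> Long, // e.g. backed by MediaPlayer.currentPosition
    private val highlight: (WordSpan) -> Unit
) {
    private val handler = Handler(Looper.getMainLooper())
    private var index = 0

    fun start() = scheduleNext()

    private fun scheduleNext() {
        if (index >= spans.size) return
        val span = spans[index]
        highlight(span)
        index++
        // Re-anchor each callback on the real playback clock instead of
        // accumulating fixed word durations, so message-loop latency
        // cannot drift the highlight out of sync.
        val delayMs = (span.endMs - currentPositionMs()).coerceAtLeast(0L)
        handler.postDelayed({ scheduleNext() }, delayMs)
    }
}
```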

Any thoughts on the above notes?

@mickael-menu
Member

mickael-menu commented Feb 23, 2022

Just for clarity's sake, since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface, so that you could interpret other synchronization formats as well.

Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.

Also, on a more technical note (and correct me if this is the wrong place to discuss this), I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired.

As far as I know nothing prevents mixing local and remote resources in a RWPM, so the sync file could reference remote resources.

One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing, etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU.

I'll start working on TTS next week and will have more intel then, but:

  • As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?
  • I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling.
    EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)

@alexwhb

alexwhb commented Feb 23, 2022

@mickael-menu Fantastic stuff. Thanks for the prompt response.

Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.

This makes a lot more sense. Thanks for the clarification.

As far as I know nothing prevents mixing local and remote resources in a RWPM, so the sync file could reference remote resources.

This is very good to know. I was actually not aware of that.

As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?

You are totally correct here. The SMIL spec just lets you define IDs that you'd like to highlight. In our case we have a proprietary parser that wraps every word in our ebooks with a unique ID, and we can then reference those word IDs in the SMIL file and link them up to the timecode where they appear in their respective audio files. Obviously the word level is arbitrary and could be swapped out for a sentence or a paragraph if desired, or an arbitrary range.

I do wonder if there'd be a simple way to change the overlay style from word to sentence to paragraph, if our books contained the IDs for all three types and the timecode ranges. That'd be a feature I'd really like, but I've not spent enough time looking at the SMIL spec to see if that's currently supported.

I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling.
EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)

This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but am not fully seeing how this would work with media overlays? It's too bad it doesn't take a timecode range or our jobs would be done. 😂

Really excited to see the TTS stuff.  

@mickael-menu
Member

This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but am not fully seeing how this would work with media overlays? It's too bad it doesn't take a timecode range or our jobs would be done. 😂

Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium though, in the context of synchronized narration.

@alexwhb

alexwhb commented Feb 23, 2022

Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium though, in the context of synchronized narration.

I do have the timestamps of each word. 🤩 I'll send you an example section of one of our SMIL files as a reference in a bit. I'll dig one up.

Also, I just found the implementation on the JS side. Maybe we can glean some insight from it.

Also, I know you are busy with other things, so don't let me distract you. 😄 I'm kinda thinking out loud here.

@danielweck
Member

Also see: readium/architecture#181

@danielweck
Member

Superseded by #95?

@HadrienGardeur deleted the update/sync-narr branch June 26, 2024 08:43