# Synchronized Narration #83
The use of textRef and audioRef is totally open to discussion. This is not how Thorium implements it.
> ## Mime Type and file extension
"MIME type" is deprecated in favor of "media type"
[RFC2046] specifies that Media Types (formerly known as MIME types) and Media
Subtypes will be assigned and listed by the IANA.
https://www.iana.org/assignments/media-types/media-types.xhtml
> ## Declaring Synchronized Narration documents in a Manifest
>
> Each Synchronized Narration document used in a publication **must** be declared in the Readium Webpub Manifest as an `alternate` resource with the proper media type.
I always considered `alternate` to be an alternative rendition of a particular resource. The Synchronized Narration object seems more like an augmentation of the resource. But maybe I'm wrong about the semantics of `alternate`?
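For context, a declaration along the lines the draft describes might look like this. This is only a sketch: the media type and the exact shape of the Link Object are assumptions, not something settled by this PR.

```typescript
// Sketch of a reading order item declaring a Synchronized Narration
// document as an `alternate` resource. The media type is an assumption;
// the draft has not finalized one.
const readingOrderItem = {
  href: "text/chapter1.html",
  type: "text/html",
  alternate: [
    {
      href: "sync/chapter1.json",
      type: "application/vnd.readium.narration+json" // hypothetical
    }
  ]
};
```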
"textRef": "/text/chapter1.html", | ||
"audioRef": "/audio/chapter1.mp3", |
The term `ref` is not used anywhere else in RWPM, I think. How about `href`?
Although `href` is a special case of *hypertextual ref*, which maybe doesn't make sense here.
Also: the leading slash should be removed.
Personally I feel uneasy about hoisting "base" URLs into a higher level of the data serialisation hierarchy for different media types (e.g. text and audio, but what about others, and what about cases where not all resources of a given media type share the same base URL?). Also, this competes with the implicit base URL of the JSON resource itself, its `self` override (if any), and potentially its `base` as well (see prior discussion linked below).
> Also, this competes with the implicit base URL of the JSON resource itself, and its `self` override (if any) … and potentially its `base`

When I mentioned a "base URL" during the last call, I was actually referring to a `self` link, so that relative hrefs would work like in a regular RWPM.

> but what about others, and what about cases where not all resources of a given media type have the same base URL?

If they are on different servers, then I guess they need to be absolute URLs.
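As an illustration of the `self`-link approach mentioned above, relative hrefs can be resolved with the standard WHATWG URL API; the manifest URL below is made up for the example.

```typescript
// Resolve a relative href against the manifest's `self` link, so that
// relative paths behave as they do in a regular RWPM.
const selfHref = "https://example.org/publication/manifest.json"; // hypothetical `self` link
const audioUrl = new URL("audio/chapter1.mp3", selfHref).toString();
// => "https://example.org/publication/audio/chapter1.mp3"

// An absolute URL passes through untouched, which covers the case of
// resources hosted on different servers.
const remoteUrl = new URL("https://cdn.example.com/chapter1.mp3", selfHref).toString();
// => "https://cdn.example.com/chapter1.mp3"
```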
I feel that there's an opportunity to generalize the synchronization to any media type. The spec mentions that we could extend it later by adding more media-type-specific properties: […]

But why not make it media-type agnostic instead? This could support all these use cases from the start: […]

Even text-on-text synchronization could open interesting possibilities: […]

If we go down that road, "Synchronized Media" might be more accurate than "Narration". I think for this we just need to: […]

I'm also in favor of having full hrefs in the […]
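To make the generalization concrete, a media-type-agnostic entry might simply pair any two fragment-addressable resources. This is purely a sketch of the idea above; every property name here is invented for illustration.

```typescript
// Hypothetical media-type-agnostic sync entries: keys describe roles
// rather than media types, so text-to-audio and text-to-text pairs
// share the same shape. All names are invented.
const syncEntries = [
  { source: "chapter1.html#para4", target: "chapter1.mp3#t=12.5,18.0" }, // text-to-audio
  { source: "chapter1.html#para4", target: "chapter1-fr.html#para4" }    // text-to-text
];
```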
"Sync Media" is probably a better choice than "Sync Rendition". The former is already the chosen term for the W3C draft which succeeds "Sync Narration", the latter has been in use since EPUB3 "Multiple Renditions" which has different semantics. See: https://w3c.github.io/sync-media-pub/sync-media.html#media-objects |
We've had a lot of discussions in the past about media overlays, and we really need to dig back in there and read things before we merge any new document at this point. The following documents are currently in our architecture repo: […]
There are also many issues that mention media overlays: https://github.com/readium/architecture/issues?q=is%3Aissue+media+overlay
I've already made a similar proposal back in 2019, based on our RWPM Link Object; see readium/architecture#88. I believe that could find a middle ground between this proposal (a Synchronized Media document essentially based on our core model for RWPM) and something more specialized (similar to what we have in our architecture repo or this PR).
You're completely right that […]. There's one major difference between the two of them: […]

There are a few other places where we also work with fragments: […]

One could argue that for a Divina with guided navigation, Synchronized Media would not be useful, as you could express the same type of information purely with `guided`. This is one of the reasons why we need to move beyond the EPUB point of view on this and think about a more generic approach that applies to all fragments across all media.
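Since media fragments end up carrying the actual synchronization data in all of these cases, here is a rough sketch of parsing the two fragment schemes used in this thread. These helpers are hypothetical and not part of any Readium API.

```typescript
// Parse W3C Media Fragments (https://www.w3.org/TR/media-frags/):
// temporal fragments like "#t=1.2,3.4" and spatial fragments like
// "#xywh=percent:5,5,15,15". Both helpers return null on no match.
function parseTimeFragment(fragment: string): { start: number; end: number } | null {
  const match = fragment.match(/^#t=([\d.]+),([\d.]+)$/);
  return match ? { start: parseFloat(match[1]), end: parseFloat(match[2]) } : null;
}

function parseXywhFragment(
  fragment: string
): { unit: string; x: number; y: number; w: number; h: number } | null {
  const match = fragment.match(/^#xywh=(?:(pixel|percent):)?(\d+),(\d+),(\d+),(\d+)$/);
  if (!match) return null;
  const [, unit = "pixel", x, y, w, h] = match;
  return { unit, x: +x, y: +y, w: +w, h: +h };
}

parseTimeFragment("#t=1.2,3.4");              // { start: 1.2, end: 3.4 }
parseXywhFragment("#xywh=percent:5,5,15,15"); // { unit: "percent", x: 5, y: 5, w: 15, h: 15 }
```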
There are several items to settle on, which we can tackle in this order, I guess: […]
This issue might be relevant; it shows there's a need for synchronization besides text-to-audio: w3c/publishingcg#20
Hi,

I'm the author of that issue and this discussion #74 about […]. Maybe we could make […]:

```json
{
  "imageRef": "images/chapter1.jpeg",
  "audioRef": "audio/chapter1.mp3",
  "narration": [
    {
      "image": "#xywh=percent:5,5,15,15",
      "audio": "#t=0.0,1.2"
    },
    {
      "image": "#xywh=percent:20,20,25,25",
      "audio": "#t=1.2,3.4"
    },
    {
      "image": "#xywh=percent:5,45,30,30",
      "audio": "#t=3.4,5.6"
    }
  ]
}
```

Or should DiViNa's `guided` […]? Something like:

```json
"guided": [
  {
    "href": "http://example.org/page1.jpeg",
    "audio": "http://example.org/page1.mp3#t=0,11",
    "title": "Page 1"
  },
  {
    "href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
    "audio": "http://example.org/page1.mp3#t=11,25",
    "title": "Panel 1"
  },
  {
    "href": "http://example.org/page1.jpeg#xywh=300,200,310,200",
    "audio": "http://example.org/page1.mp3#t=25,102",
    "title": "Panel 2"
  }
]
```

I don't like the name […]
@m-abs Did you read this #83 (comment)? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.
Sorry, I must have missed it last night.

Could there be a use-case where one would need more than just the two sources? It could be mixing two texts (as suggested) plus background music, or a comic book frame + a speech bubble text + background music, or a way to implement #49.

I'm worried this could result in very large […]
@m-abs What was your approach regarding resources? A single mapping document for the whole publication, or one per resource?
A single document for the whole publication. We made a dump of the structure of […]. Some of these JSON files became very big and caused problems in our client app and for our users.
I guess that would be fine, to have either a single link or a link array in […].

By full HREFs I meant having the path of the resource relative to the […]
Just for clarity's sake, since I'm late to the party here: what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could provide some sort of easily swappable interface so that you could interpret other synchronization formats as well.
Also, on a more technical note (correct me if this is the wrong place to discuss this): I wonder about the best way to resolve connected media such as audio. Packaging ebooks with potentially lots of audio files is not ideal, since your packaged EPUB could end up being several GB of data for a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired. It would also be useful in our use case to have the ability to stream audio instead of playing downloaded media. Just some thoughts for potentially making the implementation more flexible for these types of use cases.

One other thought I had: I got a basic implementation working about a year back where I could highlight the word that was playing, etc. One issue I ran into, though, was how to keep the timecode check in sync with the playing audio without using too much CPU. The few options I considered were: […]

Any thoughts on the above notes?
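Not something the thread settled on, but for illustration, one low-CPU option is to drive the highlight from the audio element's `timeupdate` event instead of polling on a timer. This is a browser-side sketch with made-up segment data.

```typescript
// Sync highlighting to playback without a busy polling loop.
// `timeupdate` fires roughly every 250 ms in most browsers; if finer
// granularity is needed, a requestAnimationFrame loop can run only
// while the audio is actually playing.
const audio = document.querySelector<HTMLAudioElement>("audio")!;
const segments = [
  { id: "word-1", start: 0.0, end: 1.2 }, // made-up IDs and timecodes
  { id: "word-2", start: 1.2, end: 3.4 },
];

function highlightAt(time: number): void {
  for (const seg of segments) {
    const el = document.getElementById(seg.id);
    el?.classList.toggle("highlight", time >= seg.start && time < seg.end);
  }
}

audio.addEventListener("timeupdate", () => highlightAt(audio.currentTime));
```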
Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the sync file would be a JSON representation of SMIL or other synchronization formats.

As far as I know, nothing prevents mixing local and remote resources in a RWPM, so the sync file could reference remote resources.

I'll start working on TTS next week and will have more intel then, but: […]
@mickael-menu Fantastic stuff. Thanks for the prompt response.

This makes a lot more sense. Thanks for the clarification.

This is very good to know. I was actually not aware of that.

You are totally correct here. The SMIL spec just lets you define IDs that you'd like to highlight. In our case we have a proprietary parser that wraps every word in our ebooks with a unique ID, and we can then reference those word IDs in the SMIL file and link them up to the timecode where they appear in their respective audio files. Obviously the word level is arbitrary and could be swapped out for a sentence or paragraph if desired, or an arbitrary range. I do wonder if there'd be a simple way to change the overlay style from word to sentence to paragraph if our books contained the IDs for all three types and the timecode ranges. That's a feature I'd really like, but I've not spent enough time looking at the SMIL spec to see if that's currently supported.

This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but I'm not fully seeing how it would work with media overlays. It's too bad it doesn't take a timecode range, or our jobs would be done. 😂 Really excited to see the TTS stuff.
Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium, though, in the context of synchronized narration.
I do have the timestamps of each word. 🤩 I'll send you an example section of one of our SMIL files as a reference in a bit; I'll dig one up. I also just found the implementation on the JS side. Maybe we can glean some insight from it. Also, I know you are busy with other things, so don't let me distract you. 😄 I'm kinda thinking out loud here.
Also see: readium/architecture#181
Superseded by #95?
This is a first draft of a spec related to Synchronized Narration, a JSON equivalent of EPUB Media Overlays.