Plans for Rhubarb Lip Sync 2 #95
Comments
@DanielSWolf Any plans for Unity 3D support?
Once Rhubarb supports keyframe animation, I'd love to add plugins for various 3D packages, similar to the way it currently supports 2D tools. When that time comes, I'll be happy about any support. Given how little free time I have, however, it will probably take me several years to get there.
For Unity 3D, having some abstract Timeline support with the right phoneme to play (and weights) would be more than enough for most devs to implement according to their own needs. We, for example, use the Unity Spine SDK combined with skin composition for visemes and expressions.
As a Java developer, I am very excited for a JVM-compatible version. The Vosk stack might be a good substitute for Sphinx; I am using it in Java right now and am getting very good results. There is even a PR adding phoneme labels and timestamps to the data stream: alphacep/vosk-api#528. I have recently written a wrapper around the published executables in Java. I am building real-time robotic interaction software and am using Rhubarb for real-time TTS -> audio -> Rhubarb -> synced animation + audio. I came here to ask for a live update of any visemes detected as they are found. If I had live updates, I could much more tightly synchronize the initiation of speech and the execution of that speech. Moving forward, would it be possible to consider whether live updates are a feature worth adding?
In case other Java developers get this far and are disappointed that Rhubarb 2 will apparently be in Rust: I was able to duplicate the functionality of the basic Rhubarb viseme generation using Vosk. I have a small standalone example for you: https://github.com/madhephaestus/TextToSpeechASDRTest.git I was able to use the partial results with the word timing to calculate the timing of the phonemes (after looking up the phonemes in a phoneme dictionary). I then down-mapped the phonemes to visemes and stored the visemes in a list with timestamps. The timestamped visemes process in a static 200 ms, and then the audio can begin playing with the mouth movements synchronized precisely with the phoneme start times precomputed ahead of time. Compare this to Rhubarb, which takes as long to run as the audio file is long. This is a complete implementation for my uses, so if anyone else needs lip-syncing in Java, have a look at that example.
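The down-mapping step described above can be sketched in a few lines of plain Java. This is an illustrative reconstruction, not code from the linked repository; the phoneme-to-viseme table, record names, and class name are all made up for the example.

```java
import java.util.*;

// Sketch: map timed phonemes (e.g. derived from Vosk word timings plus a
// pronunciation dictionary) to timestamped visemes. The table below is a
// simplified illustration, not Rhubarb's actual mapping rules.
public class VisemeMapper {
    public record TimedPhoneme(double startSeconds, String phoneme) {}
    public record TimedViseme(double startSeconds, String viseme) {}

    private static final Map<String, String> PHONEME_TO_VISEME = Map.of(
        "P", "MBP", "B", "MBP", "M", "MBP",
        "F", "FV",  "V", "FV",
        "AA", "open", "IY", "wide"
    );

    public static List<TimedViseme> map(List<TimedPhoneme> phonemes) {
        List<TimedViseme> result = new ArrayList<>();
        for (TimedPhoneme p : phonemes) {
            String viseme = PHONEME_TO_VISEME.getOrDefault(p.phoneme(), "rest");
            // Collapse runs of identical visemes so the mouth doesn't re-trigger.
            if (result.isEmpty() || !result.get(result.size() - 1).viseme().equals(viseme)) {
                result.add(new TimedViseme(p.startSeconds(), viseme));
            }
        }
        return result;
    }
}
```

Because everything is precomputed against the phoneme start times, playback only needs to walk the resulting list while the audio plays.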
Sorry for the late reply. Rhubarb simply isn't designed for real-time applications; see e.g. #22. I'm glad you found a working solution though!
Just came across this; it's really cool, nice work. I think if version 2 is going to be in Rust, making this into a real-time library shouldn't be very hard: instead of having it write to a file, if it just returns the JSON, you can go a very long way with this. Just wondering: is there an ETA for v2?
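Until a return-the-result API exists, Rhubarb 1.x's default export is tab-separated cues, one `start time<TAB>mouth shape` pair per line, which is easy to consume from code today. A minimal Java sketch, assuming that format (the class and record names are mine):

```java
import java.util.*;

// Sketch: parse Rhubarb 1.x's default TSV output, where each line is
// "<start time in seconds><TAB><mouth shape letter>".
public class TsvCues {
    public record Cue(double start, String shape) {}

    public static List<Cue> parse(String tsv) {
        List<Cue> cues = new ArrayList<>();
        for (String line : tsv.split("\\R")) {   // split on any line break
            if (line.isBlank()) continue;
            String[] parts = line.split("\t");
            cues.add(new Cue(Double.parseDouble(parts[0]), parts[1]));
        }
        return cues;
    }
}
```

Reading the tool's stdout this way works with the executable as published, without waiting for v2.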
Also looking forward to v2, this is a very useful tool already. If it can do streaming/chunking of some sort, that would be totally amazing! The detailed documentation is very much appreciated. |
Tuning in for v2 |
This issue is a collection of ideas and decisions regarding Rhubarb Lip Sync 2.
Full rewrite
Rhubarb 2 will be a full rewrite rather than a series of iterative improvements over version 1.x. This is necessary because it will use a completely different technology stack (see Programming languages and Build tool).
I'm currently working on a proof of concept to make sure that the basic ideas work out. Once that's done, I'll start working towards an MVP version of Rhubarb 2. This version will not contain all features discussed here, nor even all features currently found in version 1.x. My idea is to have versions 1.x and 2.x coexist for some time, during which I'll add new features to the 2.x versions, while only fixing major bugs in the 1.x versions. Once we've reached feature parity, I'll deprecate the 1.x versions.
Multiple languages
Rhubarb 1.x only supports English dialog. Support for additional languages has often been requested, but due to a number of technical limitations, adding it to Rhubarb 1.x would require a rewrite of most of its code.
The architecture of Rhubarb 2 will be language-agnostic from the start. This means that adding more languages should be possible at any time with minimal code changes.
Graphical user interface
In addition to the CLI, Rhubarb 2 will have a GUI with the following features:
This should satisfy the following use cases:
Exact dialog through forced alignment
Rhubarb 1.x allows the user to specify the dialog text of a recording. However, this text is merely used to guide the speech recognition step. Due to limitations in the speech recognition engine, Rhubarb often recognizes incorrect words even if the correct words were specified.
Rhubarb 2 will allow the user to specify exact dialog that is aligned with the recording without an additional recognition step. This should have the following advantages:
Mouth shapes
Rhubarb 1.x supports 6 basic mouth shapes and up to 3 extended mouth shapes, all of which are pre-defined. Rhubarb 2 will still rely on pre-defined mouth shapes and will use the same 6 basic mouth shapes. However, I'm planning to increase the number of supported extended mouth shapes. This will allow for smoother lip sync animation if desired.
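For reference, the 1.x shape inventory can be written down as a small enum. The one-line descriptions are a paraphrase of the Rhubarb 1.x documentation, so treat them as a rough guide rather than authoritative definitions.

```java
// Sketch: Rhubarb 1.x's mouth shape inventory — 6 basic shapes (A–F)
// plus up to 3 optional extended shapes (G, H, X). Descriptions are
// approximate paraphrases of the 1.x documentation.
public enum MouthShape {
    A("Closed mouth, as for P, B, M"),
    B("Slightly open mouth with clenched teeth"),
    C("Open mouth"),
    D("Wide open mouth"),
    E("Slightly rounded mouth"),
    F("Puckered lips, as for W, OO"),
    G("Upper teeth touching lower lip, as for F, V (extended, optional)"),
    H("Tongue raised behind teeth, as for L (extended, optional)"),
    X("Idle/rest position (extended, optional)");

    public final String description;
    MouthShape(String description) { this.description = description; }
    public boolean isExtended() { return ordinal() >= G.ordinal(); }
}
```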
Currently, mouth shapes are named using single letters. This is based on the tradition of hand-written exposure sheets, but may be unnecessarily cryptic in its digital form. I'm thinking about adopting a more intuitive naming scheme, similar to the visemes used by Amazon Polly. Desirable features for this naming scheme:
- Names that tie in with established conventions, such as MBP (Preston Blair / Papagayo), p (Amazon Polly), m, or similar.
- Names that allow for variants such as m- or similar.
- Names that allow for transition shapes such as m_to_e.
Eventual support for keyframe (3D) animation
Rhubarb 1.x only supports limited animation (also known as replacement animation), which holds each mouth shape until it is replaced by the next one. This approach is a good fit for most 2D animation, but is ill-suited for 3D animation or mesh-based 2D animation.
Rhubarb 2 will be designed in a way that allows keyframe-based export to be added at a later time. However, this feature has low priority and won't be included in the first versions.
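Purely as an illustration of what keyframe-based export could mean (this is not a committed design, and all names and the blend scheme are hypothetical): replacement-style cues could be turned into per-shape weight keyframes with a short linear crossfade, which a 3D package can then interpolate.

```java
import java.util.*;

// Illustrative sketch only: at each shape switch, emit four weight
// keyframes so the outgoing shape fades to 0 while the incoming shape
// fades to 1 over a fixed blend duration centered on the switch time.
public class KeyframeSketch {
    public record Key(double time, String shape, double weight) {}

    public static List<Key> crossfade(double switchTime, String from, String to, double blend) {
        return List.of(
            new Key(switchTime - blend / 2, from, 1.0),
            new Key(switchTime + blend / 2, from, 0.0),
            new Key(switchTime - blend / 2, to, 0.0),
            new Key(switchTime + blend / 2, to, 1.0)
        );
    }
}
```

With blend set to 0, this degenerates to the hold-until-replaced behavior of replacement animation.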
CLI changes
Here is an incomplete list of probable changes to Rhubarb's CLI:
- Option names will change from camel case (--extendedShapes) to kebab case (--extended-shapes). This format is much more common.
Programming languages
Rhubarb 1.x is written in C++. While this language is both powerful and efficient, it has a number of severe shortcomings. Most notably, it requires a lot of boilerplate code and it's very easy to make unnoticed mistakes such as overriding the wrong special member functions or using the wrong mechanism for passing arguments.
After a lot of research, I've decided to use Kotlin as the main programming language, with some C++ code for performance-critical operations and third-party libraries. Below is a feature matrix I created for the three hottest contenders: Kotlin, Go, and Rust. Empty cells indicate that I didn't investigate an aspect for the given language.

tl;dr: Kotlin has all the features I was looking for. Rust was a very strong contender, but I wanted a modern, React-style UI framework for the GUI, which excluded Rust. Go looked promising at the start, but revealed numerous weaknesses on closer inspection.

Edit:
After additional research, I've decided to go with Rust as a programming language. The main argument against Rust was the lack of good GUI frameworks, which now exist. On the whole, Rust feels much more natural for the kind of program I'm writing:
¹ In 2020, most of the Rust team was laid off (see Wikipedia). Since then, a Rust Foundation has been founded, and all major IT companies have joined.
² Including Java packages.
³ Using CXX
⁴ Using JetBrains' brand-new Compose for Desktop
⁵ The overview site Are we GUI yet is sadly outdated. There are, in fact, several viable options; I'm currently leaning towards egui.
⁶ Doesn't seem to be as robust as NPM.
Build tool
I've chosen Gradle as the build tool for Rhubarb. It fulfils the following requirements:
Speech processing