
I decided to abandon this framework for the time being #226

Closed

ErfolgreichCharismatisch opened this issue Sep 30, 2018 · 24 comments

@ErfolgreichCharismatisch commented Sep 30, 2018

Reasons

  • I don't come close to synthesizing comprehensible wave files, and I don't know where to begin fixing it.
  • There is no manual for hparams.py for beginners.
  • There is hardly any support in this forum.
  • The wiki is unfinished.
  • Computation takes a long time, and without clear information about whether the current configuration will work out (for instance, in a preview), I cannot improve the output. Here too, no helpers on this board.
  • Average loss getting lower? -> means nothing. Loss getting lower? -> means nothing. Step 45000? -> means nothing. How do you even determine anything?

I am disappointed, honestly.

@tugstugi

Could you share your dataset? Maybe something is wrong with it?

@ErfolgreichCharismatisch (Author) commented Sep 30, 2018

I cannot share it, because it is not open source. What I did was:

  1. cut high-quality audio at silences longer than X ms (see the sketch below),
  2. have Google speech recognition convert each clip to a text snippet (Filename|transcribed text|copy of transcribed text),
  3. correct those text snippets manually.

It would help tremendously to have a good tutorial on how to create good input files, so that they are not a source of error.
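A minimal sketch of step 1, assuming pydub is installed; the paths, thresholds, and metadata layout here are illustrative choices, not something this repo prescribes:

```python
# Split a long recording at silences and emit clip files plus metadata
# stubs in the Filename|text|text layout described above.
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("source.wav")  # placeholder input path
chunks = split_on_silence(
    audio,
    min_silence_len=500,             # the "X ms" threshold; tune per speaker
    silence_thresh=audio.dBFS - 16,  # 16 dB below average loudness = "silence"
    keep_silence=200,                # keep some padding at each cut
)

os.makedirs("wavs", exist_ok=True)
with open("metadata.csv", "w", encoding="utf-8") as meta:
    for i, chunk in enumerate(chunks, start=1):
        name = f"clip-{i:05d}"
        chunk.export(os.path.join("wavs", name + ".wav"), format="wav")
        meta.write(f"{name}||\n")    # transcript columns filled in by ASR later
```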

@tugstugi commented Sep 30, 2018

How long is your dataset? I made myself a 5-hour Mongolian dataset and trained successfully. The only things I had to change were to lower fmin and to update the vocabulary for Mongolian. I also resampled the audio files to 22050 Hz to keep them compatible with LJSpeech.
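For reference, a hedged sketch of that resampling step, assuming librosa and soundfile are installed (the file names are placeholders):

```python
# Resample one clip to 22050 Hz for compatibility with LJSpeech defaults.
import librosa
import soundfile as sf

wav, sr = librosa.load("input.wav", sr=22050)  # librosa resamples on load
sf.write("input_22050.wav", wav, sr)
```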

@ErfolgreichCharismatisch (Author)

About 2.2 hours. Yet several people have claimed success with datasets of even under one hour.

@tugstugi

Could you share at least a few audio samples from your dataset?

@ErfolgreichCharismatisch (Author)

What are you aiming at in those files?

@tugstugi

Maybe look for obvious errors? OK, I give up :)

@ErfolgreichCharismatisch (Author)

Which errors are you talking about? Not only can I not share them for legal reasons, it also wouldn't help anyone else. It would help tremendously to have a good tutorial, covering all kinds of languages and input sources, on how to create good input files so that they are not a source of error.

@Rayhane-mamah (Owner)

Hi, first off, thanks @tugstugi for your assistance, much appreciated :)

@ErfolgreichCharismatisch I am sorry you feel that way. Let me just correct a few misunderstandings here and there:

  • Nature of the repo: This is not a multi-purpose framework where you can bring any sort of data and expect it to work properly. This repo is only a word-for-word implementation of the Tacotron-2 paper, with a few extras of my own where I felt some change was helpful. The repo, as described in the README, only supports the LJSpeech and M-AILABS datasets; any external data requires cleaning and intervention on your end to make it consistent with the repo.
  • Hardware: To run such networks properly, or any other cutting-edge deep learning networks for that matter, one needs the right hardware. I, for example, train both of this repo's models in less than 6 days (2 to 4 days each).
  • Experience: Please note that this repo is not meant to be, like I said earlier, a black-box-style framework where you just call some command and expect it to handle your special case without mistakes. While we try to make usage easier and generalize the code as much as possible, we also try to keep maximum consistency with the paper, because that's the main objective of this repo. With that said, a basic knowledge of the paper's content is a must, and knowledge of some related work is also helpful. All required papers are shared inside the papers folder. Knowledge of signal processing basics helps too, since those parameters highly impact the overall results of your models, and users should not fear making changes to the different pipeline parts as needed (preprocessing, model, etc.). Finally, this work is merely a basis that others can build on and improve.
  • Time: Please keep in mind that people have time constraints and are not always available for assistance; that's why others who have faced the same problems usually step in to help, which I highly appreciate. Please know that this project is merely one of my multiple hobby projects, and that I am also alternating between school and work, so time and focus are hard to find to ensure 100% maintenance. With that said, this work is 100% capable of training efficient Tacotron-2 models that compare to DeepMind's results when used properly. People have also shown that it works pretty well with languages other than English.
  • Improvements: This repo, despite hitting its original goals, is still far from perfect, and of course we plan on filling the gaps (finishing the wiki, updating the README, etc.). Any help on that end is much appreciated; I generally accept any help that does not conflict with the path this work is following. And of course, opening duplicate issues that have already been answered is surely not the best way to help beginners understand what they're doing wrong. Thanks for providing your opinion, however; it will most certainly help improve this project.

At the end of this long, boring comment, I will simply give my quick notes that you may find helpful:

  • Metrics: You're right, loss values are absolute; they are not normalized across every dataset/user, so the raw loss value is not very useful by itself (or is it?). For that reason, we provide the TensorBoard logs, which you can access by typing `tensorboard --logdir=path/to/logs/dir`. A multitude of loss curves, embeddings (option not yet committed), data distributions, etc. are provided for training progress analysis. Like I said earlier, this requires some knowledge. Other, more sophisticated control is done visually by reviewing all the plots directories inside your logs folder. Most importantly, the alignment plots are a great visual indicator of whether your model is converging or not.

  • Hparams: Usually you don't have to change anything in the hparams; 90% of those parameters are already optimized for you. The only key parameters to look out for are the batch sizes, the number of outputs per step, and the audio parameters, which depend on your data and are really a matter of trial and error (that's why we provide a notebook file that you can use to test your target quality before training).

  • Your model: From quickly reading other issues, I gather your model is not learning a proper alignment (it does not know how to read the input text properly), which is the one reason your model outputs only noise at synthesis. Possible reasons are a batch_size smaller than 32, or data that is not big enough. If your batch_size is small, increase it to 32. Then increase outputs_per_step to 3 if you face OOM. If OOM is still present, decrease max_mel_frames until the problem is fixed. Note that increasing outputs_per_step and decreasing max_mel_frames will impact model quality negatively. That's why we recommend using proper hardware (at least an 8GB GPU). See the sketch below.
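A sketch of those overrides, assuming the repo's root-level hparams.py exposes a tf.contrib.training.HParams object (whose parse() accepts "name=value" strings); the parameter names are the ones used in this thread, so verify them against your hparams.py:

```python
from hparams import hparams  # the repo's own hparams.py (assumed layout)

# Apply the adjustments suggested above, in order of preference:
hparams.parse("tacotron_batch_size=32")  # keep the batch size at 32 if memory allows
hparams.parse("outputs_per_step=3")      # cuts memory per step, at some quality cost
hparams.parse("max_mel_frames=900")      # lower further only if OOM persists
```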

Thanks for trying our work, we hope to have some positive feedback from your end in the future!

@ErfolgreichCharismatisch (Author) commented Sep 30, 2018

Dear @Rayhane-mamah, put yourself in my shoes, or any other beginner's. What kind of tutorial would help you get started with your own data?
PS: Never use the phrase "I am sorry you feel that way".

@Thien223 commented Oct 1, 2018

@ErfolgreichCharismatisch
Dear friend,

I think this is a good place to start.

You are going too fast. Why not use his data first, try to understand it, and then use yours?

Like you, I'm a newbie. I started by getting his code and making it run. At the beginning, it did not run, for various reasons (even though I had changed nothing). I looked for the problems (they were issues with my machine, such as not having a GPU, or some packages not being installed correctly...).

Once it ran, I tried changing things, a little at a time, observed the difference, and learned what each part of the code is used for.

Now I can run the project even with Korean (and only 30 minutes of training data, which I'm trying to reduce further).

Be patient, friend. You can do it. There is no problem with the code. (I have not checked the updated version.)

@Hayes515 commented Oct 1, 2018

Hi @Rayhane-mamah,
I used the LJ dataset and did not modify the value of tacotron_batch_size. Training the Tacotron model works fine, but when I continue with the WaveNet model, OOM occurs. After decreasing wavenet_batch_size to 2, the OOM error disappeared. I am not sure whether changing wavenet_batch_size has other bad influences. The Tacotron model will take one more day to finish training; I will start training the WaveNet model tomorrow afternoon.

@ErfolgreichCharismatisch (Author)

@Hayes515 Please create your own thread.

@ErfolgreichCharismatisch (Author) commented Oct 1, 2018

@tdplaza Great that it works for you. How do you go about creating a new corpus? Please be detailed.

@tacobeer commented Oct 2, 2018

Testing this model can be challenging at first. I went through that, too.

But when it comes to speech synthesis, this framework is one of the best places on GitHub to learn about it, and I know of no one else who would make a repository of this level available non-commercially.

@Thien223 commented Oct 2, 2018

After getting the code to run fine, I realized that to apply it to my own corpus, I had to prepare the metadata.csv file properly and write a module to preprocess the text.

The metadata.csv file holds two pieces of information per line: the wav file name and the text.

See how the program reads and processes this info in the build_from_path function.

My dataset's transcript has a different format. So instead of changing the transcript to match the metadata.csv format, I changed the function to read my transcript.txt and return exactly what the function expects (the text, the wav path, and the index). A sketch follows below.
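A hedged sketch of that adaptation, assuming a tab-separated "filename<TAB>text" transcript; the helper name and exact tuple order are illustrative, so check what build_from_path actually consumes in your checkout:

```python
import os

def read_custom_transcript(transcript_path, wav_dir):
    """Yield (text, wav_path, index) items from a 'name<TAB>text' file."""
    with open(transcript_path, encoding="utf-8") as f:
        for index, line in enumerate(f, start=1):
            name, text = line.rstrip("\n").split("\t", 1)
            yield text, os.path.join(wav_dir, name + ".wav"), index
```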

As for text processing, I realized that the processing module has one task: transform input text into a sequence array (the inverse function, sequence to text, is not used for training; it is only used for logging information, so you can bypass it). Before transforming, the english_cleaners converts numbers, special characters, monetary amounts, etc. to text. So I had to find a module that transforms numbers to text and deals with currency and special characters, and integrate it into another module that transforms Korean text to sequences.
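A toy, self-contained sketch of that cleaner plus text-to-sequence idea (the repo's real cleaners and symbol set live in its text utilities; everything here is illustrative):

```python
import re

# Toy digit-to-word table standing in for a real number-normalization module.
_DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
           "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def clean(text):
    # Spell out each digit; a real cleaner also handles full numbers,
    # currency, abbreviations, etc.
    return re.sub(r"\d", lambda m: _DIGITS[m.group()] + " ", text.lower())

# Symbol inventory -> integer IDs (replace with your language's symbols).
symbols = list(" abcdefghijklmnopqrstuvwxyz'.,?!")
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_sequence(text):
    return [symbol_to_id[c] for c in clean(text) if c in symbol_to_id]

print(text_to_sequence("3 cats"))  # digits become words before encoding
```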

When the code did not run fine, I had to find where the problem was. Using the melspectrogram function to convert a wav to a mel, and inv_mel_spectrogram to convert a mel back to a wav, I applied them directly to the mels generated by the preprocessing step (the .npy files) to test whether preprocessing had run well.

I also used these functions directly to convert a wav file to an audio array, the audio array to a mel array, and then the mel back to a wav file, to check that they are functioning well.
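A sketch of that round-trip check, assuming the repo's datasets/audio.py exposes load_wav, save_wav, melspectrogram, and inv_mel_spectrogram taking an hparams object; verify the exact signatures and paths in your checkout:

```python
import numpy as np
from hparams import hparams
from datasets import audio

# wav -> mel -> wav: if the reconstruction is intelligible, the audio
# parameters (sample_rate, fmin, n_fft, ...) suit your data.
wav = audio.load_wav("sample.wav", hparams.sample_rate)
mel = audio.melspectrogram(wav, hparams)
recon = audio.inv_mel_spectrogram(mel, hparams)
audio.save_wav(recon, "sample_reconstructed.wav", hparams.sample_rate)

# The same check on a preprocessed target (.npy); training targets are
# often stored time-major, hence the transpose.
mel_npy = np.load("training_data/mels/mel-00001.npy").T
audio.save_wav(audio.inv_mel_spectrogram(mel_npy, hparams),
               "target_check.wav", hparams.sample_rate)
```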

These are some simple tricks. There are a lot of things you can do to debug; you have to do it yourself. For example, if your GPU has less memory, you have to calculate how many batches it can process at once.

Check the shape of the mel spectrograms and multiply by the number of batches and the number of samples per batch, say:

[80, 1200] x 32 x 48

where

  • 80: mel channels
  • 1200: mel frames
  • 32: number of batches
  • 48: number of samples per batch

A mel spectrogram has float32 or int32 data type, so you can calculate how much memory one number holds, and how much memory all the batches hold. One float32 occupies 32 bits, so the mel above occupies 32 x 96,000 bits ≈ 0.384 MB. To process the batches above, you would then need about 589.824 MB. This is the way I would do it if I were you (a worked version follows below).

Here, everything is easy: you can decrease the mel channels, drop utterances with many mel frames, decrease the batch size, or cut down the number of samples per batch. Try changing them slowly, see how it affects the mel quality, and choose what is best for your machine.
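A worked version of that estimate, using the illustrative numbers from above:

```python
mel_channels, mel_frames = 80, 1200    # one mel spectrogram's shape
batches, samples_per_batch = 32, 48    # illustrative numbers from above
bytes_per_float32 = 4                  # 32 bits

one_mel = mel_channels * mel_frames * bytes_per_float32  # 384,000 bytes
total = one_mel * batches * samples_per_batch

print(f"one mel:     {one_mel / 1e6:.3f} MB")  # 0.384 MB
print(f"all batches: {total / 1e6:.3f} MB")    # 589.824 MB
```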

@ErfolgreichCharismatisch (Author) commented Oct 2, 2018

This is awesome, @tdplaza. Do you want to add anything, @Rayhane-mamah?

@gsoul commented Oct 2, 2018

@ErfolgreichCharismatisch
With all due respect, it seems to me that you are confusing enterprise-level support with a hobby-level open-source project that is free of charge. None of the participants here owe you anything. So I think it would be highly appreciated if you could change your tone to a more polite one, and show your appreciation for all the hard work that was put into making this repository a reality.

Alternatively, you could pay Rayhane-mamah for a consultation, if he has the time, where he'll be able to answer any of your questions. I'm not aware of Rayhane-mamah's rates, but for enterprise-level support, especially in ML, I think 200-500 USD/hr is not something extraordinary.

Please let us know if you prefer the consultation, so that the community wouldn't spend more time here, as you'd get all the needed answers in the private consultation.

@m-toman (Contributor) commented Oct 2, 2018

I agree with @gsoul - I've been really impressed by how many hours Rayhane-mamah has put into this. There were Saturdays when my mail account was simply flooded with notifications from this repo, where he had seemingly been answering issues for 5+ hours straight. That's unpaid time that could just as well be spent with family, earning money, or developing :) instead of answering.

Generally, we are now in the luxurious situation that deep learning enthusiasts are rushing into the field and producing lots of open-source material.
When I did my doctorate in speech synthesis, there was more or less only HTS, Festival, and MaryTTS to choose from - with a much steeper learning curve (it's crazy how many hundreds of thousands of lines of C++ and Scheme code Tacotron replaces).
To dig deeper into TTS in general, I can recommend Simon King's site (http://www.speech.zone/) or "Text-to-Speech Synthesis" by Paul Taylor.

@ErfolgreichCharismatisch (Author) commented Oct 2, 2018

@gsoul, with all due respect, it seems to me that you are confusing asking for a tutorial with demanding documentation for a paid product. I don't owe you anything either. So why don't you watch your own tone and actually contribute something useful, like @tdplaza? I honestly cannot show appreciation for something that doesn't work for me. Also, where is your appreciation? You came here just to lecture me, unsolicited, while you didn't do anything to deserve that position in the first place - nobody does.

We both know that Rayhane-mamah did not publish this only because he is such a great guy. He wants to put this on his résumé. And it would work far better for him if he made entry simpler: more forks, more exposure, more job offers.

Also, how would a private consultation benefit anyone else?

@ErfolgreichCharismatisch (Author) commented Oct 2, 2018

@m-toman I am pretty sure this is a great framework, but if I cannot make it work and people won't puzzle together a tutorial, I couldn't care less how much work he put in. And if you were honest, you would say the same.

Exactly because he used to support people so often, probably with the same answers, there is an even bigger incentive to expand the wiki and point beginners to it instead of repeating himself.

Again, why don't you - being qualified, with a PhD in the field - expand the wiki with how you made this framework work?

http://www.speech.zone is quite impressive, actually.

@m-toman (Contributor) commented Oct 2, 2018

I'm not involved with this framework, except that I fixed a small bug. So I don't see why exactly my free time (to write the thing) should be worth less than yours (to figure things out)?
Consider that I could (and actually do) get paid for doing speech synthesis work instead... or could just go and play with my daughter - who is more charming in asking for my time than you ;).

Generally, I haven't worked with it much beyond running the default LJ training (which more or less just worked as described).

@ErfolgreichCharismatisch (Author)

@m-toman It is not about balancing each other's effort and time invested. I wouldn't mind you only playing with your daughter and not showing up here again to pester those who actually care enough about this project to help beginners.

@Rayhane-mamah (Owner)

Alright, this has been going on long enough.

Dear @ErfolgreichCharismatisch.

We happily accept all sorts of criticism as long as it's constructive and delivered in a polite manner; I personally encourage such feedback, as it helps me improve my work, and with it other works based on it. However, we do NOT tolerate any form of disrespect, something you have shown on multiple occasions. Community is one of the most important aspects of open source, and it would be a shame for a bad actor to ruin this experience for the entire group. I am thus revoking your access to commenting or opening any further issues on this repository.

As stated earlier, your remarks will most certainly help improve this project, and we will make sure to make our work easier to use for others. While I believe your intentions are genuinely good, your execution seems to be the worst. To make sure I am not being unfair (and because feedback is usually beneficial to all of us), here are the remarks on your attitude that led me to take such drastic measures:

  • Disrespecting other users/contributors on multiple occasions despite being called out for it. GitHub is a platform for interactive development and improvement, not for heated conversations.
  • Unprofessional interactions with others.
  • Expecting others to spend their valuable time doing work for you, despite it being obvious that you are not making the proper effort on your end.
  • Taking a superior tone with other users (like yourself) when making your requests.
  • Not taking into consideration the fact that other users (again, like yourself) have life/work/studies to attend to, and that some cannot always contribute to the OS community.

Please also keep in mind that most OS projects you will find out there are not 100% what you're looking for, and that you will need to make your own modifications (which translates to time) to make them suit your needs. Please also do not expect continuous support from contributors, as most of them are doing hobby projects on the side because they are passionate about what they do and want to share this passion with others.

With that said, if you do not like our work, you are still free to use others' works; no one is forcing you to use ours, I believe? In fact, here are some awesome other contributions that you can use:

I apologize to anyone offended by this "issue" and thank you @tdplaza @m-toman @gsoul @Piligram for your assistance and contributions!

@ErfolgreichCharismatisch Please avoid bringing a negative attitude into others' repos, as it is no fun for anyone.
Thanks for your understanding; sorry it had to come down to this.


Other than that,

@Hayes515 A batch_size of 2 with WaveNet is usually not a big issue; it will just take longer to converge, but I believe I stabilized the gradients as best as possible to allow the model to hit a proper minimum. Of course, if you do face problems with it, please open an issue and we'll look into it. Thanks for reaching out! :)
