
Deep Audiobook Tuner (DAT)

A system that generates apt, emotionally pertinent, unique sequences of music for audiobooks based on the current narrative, with the aim of improving the user experience while remaining accurate, cost-efficient, and time-saving.

This repository covers the inner workings of DAT. Check out the Flask application built for this project at https://github.com/jendcruz22/DeepAudiobookTunerApp

This project was made in collaboration with:

Table of Contents:

  1. About
  2. Folder Structure
  3. Installation
  4. Setup
  5. Datasets used
  6. Notebooks
  7. Results
  8. References

About

Audiobooks are used regularly by a large audience. However, most audiobooks have no background music or, in some cases, only very generic soundtracks. This system aims to generate unique and emotionally relevant soundtracks for audiobook recordings.

To extract sentiments from the audiobook, we use a hybrid sentiment analysis approach consisting of both text and audio sentiment analysis. The text sentiment model is a product of transfer learning on Google's BERT language model. Both the text and audio models have been trained on four emotions: Anger, Happiness, Neutral, and Sadness.
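How the two models' predictions are fused is not detailed here; below is a minimal sketch in Python, assuming both models output softmax probabilities over the four emotions and that fusion is a simple weighted average (the function name, weights, and example values are illustrative, not DAT's actual implementation).

import numpy as np

EMOTIONS = ["Anger", "Happiness", "Neutral", "Sadness"]

def fuse_sentiments(text_probs, audio_probs, text_weight=0.5):
    # Weighted average of the two models' softmax outputs;
    # the weighting scheme is an illustrative assumption.
    fused = text_weight * np.asarray(text_probs) + (1 - text_weight) * np.asarray(audio_probs)
    return EMOTIONS[int(np.argmax(fused))]

# Both models lean towards "Sadness", so the fused label is "Sadness".
print(fuse_sentiments([0.1, 0.1, 0.2, 0.6], [0.05, 0.15, 0.3, 0.5]))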

To perform text sentiment analysis, we require transcripts of the audiobook. We use IBM's Watson Speech to Text to transcribe the audiobooks.
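As a rough sketch, transcription with the ibm-watson Python SDK looks like the following (the audio path is a placeholder; the credentials come from the .env file described in the Setup section):

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("your_api_key")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("your_url")

# Send an audio clip to the service and join the per-segment transcripts.
with open("assets/audiobooks/sample_clip.mp3", "rb") as audio:
    response = stt.recognize(audio=audio, content_type="audio/mp3").get_result()

transcript = " ".join(
    segment["alternatives"][0]["transcript"] for segment in response["results"]
)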

The audio sentiment model is a fully connected dense neural network with four hidden layers. Its input is a set of audio features extracted from the audiobooks using Librosa.
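A minimal sketch of this pipeline is shown below, assuming MFCCs averaged over time as the input features and illustrative layer sizes (the actual feature set and architecture details live in the audio sentiment notebooks):

import librosa
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def extract_features(path, n_mfcc=40):
    # Load the clip and average its MFCCs over time into a single vector.
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

# Fully connected network with four hidden layers and a four-way
# softmax over Anger, Happiness, Neutral, and Sadness.
model = Sequential([
    Dense(256, activation="relu", input_shape=(40,)),
    Dense(128, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])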

For music generation, we've implemented bearpelican's approach. They created a music generation model using transformers, built with the fastai library. We use their MusicTransformer model, which uses Transformer-XL to take a sequence of music notes and predict the next note. A huge thank you to bearpelican; do check out their project.
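To steer the pretrained model towards an emotion, a seed clip matching the detected emotion can be fed to it. The sketch below shows only the seed-selection step, assuming the hand-labeled MIDI files (see Datasets used) are grouped into per-emotion subfolders; the actual generation call goes through musicautobot's MusicTransformer.

import random
from pathlib import Path

SEED_DIR = Path("assets/music_generation_data/datasets/vg-midi-annotated")

def pick_seed(emotion):
    # Assumes one subfolder per emotion, e.g. vg-midi-annotated/sadness/;
    # the real folder layout may differ.
    candidates = list((SEED_DIR / emotion.lower()).glob("*.mid"))
    return random.choice(candidates)

seed_midi = pick_seed("Sadness")
# seed_midi is then passed to the pretrained MusicTransformer, which
# extends the note sequence one predicted note at a time.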

Given below is the workflow of our system:

[workflow diagram]

Folder Structure

deep-audiobook-tuner
├───assets
│   ├───audiobooks
│   ├───audio_sentiment_data_v1
│   ├───audio_sentiment_data_v2
│   │   ├───datasets
│   │   ├───data_features
│   │   ├───models
│   │   └───pickles
│   ├───music_generation_data
│   │   ├───datasets
│   │   │   └───vg-midi-annotated
│   │   ├───models
│   │   └───pickles
│   ├───temp
│   └───text_sentiment_data
│       ├───datasets
│       └───models
│
├───deepaudiobooktuner
│   ├───music_generation
│   │   └───music_transformer
│   ├───sentiment_analysis
│   └───utils
│
├───examples
│
├───images
│
├───notebooks
│   ├───demo
│   ├───music_generation
│   └───sentiment_analysis
│       ├───audio_segmentation
│       ├───audio_sentiment_analysis_v1
│       │   └───feature_ext_and_dataprep
│       ├───audio_sentiment_analysis_v2
│       │   └───feature_ext_and_dataprep
│       ├───audio_transcription
│       ├───text_sentiment_analysis
│       └───text_sentiment_analysis_v2
│
└───tests

Installation

Install the requirements for TensorFlow before running the following commands.

Run pip install -r requirements.txt to install all the required libraries (Python 3.7).

Or

Create a Conda environment: conda env create -f environment.yml
(This method requires TensorFlow 2.4 to be installed separately in the environment.
Run conda activate deepaudiobooktuner and then pip install tensorflow==2.4.1.)

Additional requirements:

  • FFmpeg is available here.
  • The package midi2audio requires a sound font, which can be downloaded here. The sound font should be placed in deep-audiobook-tuner/assets/music_generation_data/soundfont/ (refer to the folder structure). A usage sketch follows this list.
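Once the sound font is in place, rendering a generated MIDI file to audio with midi2audio is a one-liner (the sound font and file names below are placeholders):

from midi2audio import FluidSynth

# Point FluidSynth at the sound font placed in the step above.
fs = FluidSynth("assets/music_generation_data/soundfont/sound_font.sf2")
fs.midi_to_audio("generated_track.mid", "generated_track.wav")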

Setup

To run this project, the following API key and models are required.

Transcription API key

The transcription process is done using a cloud service, specifically IBM's Watson Speech to Text. To use this service, an API key is required. Create a free account and obtain your API key and URL. Save these values in a file called .env in the root directory, as shown below:

api_key = 'your_api_key'
url = 'your_url'
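One way to read these values at runtime is with the python-dotenv package (whether DAT loads them this way or parses the file directly is an implementation detail):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
api_key = os.getenv("api_key")
url = os.getenv("url")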

Music generation model

The music generation model trained by bearpelican is available here. This model is to be placed in deep-audiobook-tuner/assets/music_generation_data/models/ (refer to the folder structure).

Text sentiment analysis model

A pre-trained text sentiment analysis model is available here. This model is to be placed in deep-audiobook-tuner/assets/text_sentiment_data/models/neubias_bert_model/ (refer to the folder structure).

Datasets used

  • Text Sentiment Analysis

    The DailyDialog, ISEAR, and Emotion-Stimulus datasets were combined to create a single dataset with four labels: Anger, Happiness, Neutrality, and Sadness (a sketch of this combination step follows this list). We trained, validated, and tested our model on this dataset; the accuracy obtained is discussed in the Results section.

  • Audio Sentiment Analysis

    A combination of three datasets was used: the TESS, RAVDESS, and SAVEE datasets. We trained our model on these datasets for the following emotions: Anger, Happiness, Neutral, and Sadness. The model was then validated and tested; the accuracy obtained is discussed in the Results section.

  • Music Generation

    We used a pre-trained model for music generation, but we required the model to generate music conditioned on emotion. For this, we built a small dataset of video-game piano music, hand-labeled according to the emotions our system uses. The music generation model uses this dataset as its seed input. The dataset is located at deep-audiobook-tuner/assets/music_generation_data/datasets/vg-midi-annotated (refer to the folder structure).
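As referenced under Text Sentiment Analysis above, combining the three text datasets amounts to mapping each source's labels onto the four shared classes and concatenating the rows. A hedged pandas sketch, with hypothetical file and column names:

import pandas as pd

LABELS = {"anger": 0, "happiness": 1, "neutrality": 2, "sadness": 3}

# Hypothetical CSVs: each source dataset exported to (text, emotion) rows,
# with its original labels already renamed to the four shared classes.
frames = [
    pd.read_csv("dailydialog.csv"),
    pd.read_csv("isear.csv"),
    pd.read_csv("emotion_stimulus.csv"),
]

data = pd.concat(frames, ignore_index=True)
data = data[data["emotion"].isin(LABELS)]   # drop any other emotion classes
data["label"] = data["emotion"].map(LABELS)
data.to_csv("combined_text_sentiment.csv", index=False)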

Some examples of our system are available in the examples directory.

Results

Given below are the accuracy metrics of our sentiment analysis models.

[Accuracy plots: Text-Based Sentiment Analysis and Audio-Based Sentiment Analysis]

References

[1] Google's BERT model
[2] Ktrain wrapper for Keras
[3] Speech-Emotion-Analyzer
[4] Musicautobot
