The main objective of this project is to listen to Japanese music on a Raspberry Pi 3 at home and while driving my car. The motivation: both Google Home and Amazon Alexa have very limited options for Japanese music subscription services outside Japan, and I wanted my kids to be exposed to Japanese music on a regular basis.
To start the smart speaker, either say your wake word (e.g. "Alexa") or push the arcade button. Once you finish talking, the Google Speech API converts your speech to text, my custom gradient boosting model predicts the intent (e.g. stream a certain radio station, search and stream music on YouTube, increase the volume, skip to the next song), and the corresponding command is executed.
I used the Google Speech API for ASR (speech-to-text) and a scikit-learn gradient boosting model to capture intent. Open JTalk handles text-to-speech. The actual music-streaming pieces depend on other people's hard work (e.g. a Radiko script, youtube-dl).
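For orientation, the request loop looks roughly like this. It is a minimal sketch built on the aiy.cloudspeech API from the 2017 Voice Kit; predict_intent() and handle() are hypothetical stubs standing in for the actual code in raspbian_aiy_smart_speaker.

import aiy.audio
import aiy.cloudspeech
import aiy.voicehat

def predict_intent(text):
    # Stand-in for the pickled gradient boosting model built by gbt.py
    return 'youtube'

def handle(intent, text):
    # Stand-in dispatch; the real handlers shell out to the Radiko script,
    # mpsyt, the mixer, etc.
    print(intent, text)

def main():
    recognizer = aiy.cloudspeech.get_recognizer()
    button = aiy.voicehat.get_button()
    aiy.audio.get_recorder().start()   # the recognizer reads from this recorder
    while True:
        button.wait_for_press()        # or trigger on the wake word instead
        text = recognizer.recognize()  # Google Speech API: speech -> text
        if text:
            handle(predict_intent(text), text)

if __name__ == '__main__':
    main()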
Hardware
- Raspberry Pi 3
- Google AIY Voice Kit
- Micro USB charger (1.5A)
- Micro SD card
(I spent about $40 at Micro Center)
Software
- Raspbian
- Python 3.4+
- Google AIY
- Google Cloud Platform subscription for Google Speech API
- Open JTalk
Set up Google AIY Voice Kit
- Follow the official AIY Voice Kit tutorial to assemble the hardware and set up the Google Speech API.
- Clone the Google AIY repo into your home directory:
git clone https://github.com/google/aiyprojects-raspbian.git
- Overwrite aiyprojects-raspbian/src with the contents of raspbian_aiy_smart_speaker from this repo, which adds Japanese language support for the Google Speech API, text-to-speech, and my smart speaker code.
- Enable the service so the speaker starts automatically on boot:
sudo mv my_cloudspeech.service /lib/systemd/system/
sudo systemctl enable my_cloudspeech.service
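- Optionally, start the service right away and check that it is running:
sudo systemctl start my_cloudspeech.service
sudo systemctl status my_cloudspeech.service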
Configure Open JTalk
- Install Open JTalk:
sudo apt-get update
sudo apt-get install open-jtalk open-jtalk-mecab-naist-jdic hts-voice-nitech-jp-atr503-m001
- Download a different voice (the "Mei" voice from MMDAgent):
wget https://sourceforge.net/projects/mmdagent/files/MMDAgent_Example/MMDAgent_Example-1.6/MMDAgent_Example-1.6.zip/download -O MMDAgent_Example-1.6.zip
unzip MMDAgent_Example-1.6.zip MMDAgent_Example-1.6/Voice/*
sudo cp -r MMDAgent_Example-1.6/Voice/mei/ /usr/share/hts-voice
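- To test Open JTalk, synthesize a short phrase from the command line (the dictionary path below is where the Debian package installs it, and the mei voice is the one copied above):
echo "こんにちは" | open_jtalk -x /var/lib/mecab/dic/open-jtalk/naist-jdic -m /usr/share/hts-voice/mei/mei_normal.htsvoice -ow /tmp/test.wav
aplay /tmp/test.wav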
Set up Radiko script and YouTube add-on
- Install dependencies for Radiko:
sudo apt-get install rtmpdump swftools libxml2-utils libav-tools
- Install mplayer for Radiko playback:
sudo apt-get install mplayer
- Install YouTube add-on:
sudo pip3 install mps-youtube youtube-dl
- Install vlc for YouTube playback:
sudo apt-get install vlc
- Set vlc as the default player for mps-youtube. At the mpsyt prompt, run:
mpsyt
set player vlc
set playerargs
exit
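- To confirm the YouTube setup end to end, search and play from the mpsyt prompt (a leading / searches YouTube, and a number plays that result; the artist below is just an example):
mpsyt
/米津玄師
1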
Deploy machine learning model
- Install dependencies for scikit-learn:
source env/bin/activate
sudo apt-get install liblapack-dev
sudo apt-get install build-essential python-dev python-setuptools python-numpy python-scipy libatlas-dev libatlas3gf-base
sudo pip3 install --user --install-option="--prefix=" -U scipy scikit-learn
sudo pip3 install pandas janome
- Run gbt.py to build the model (alternatively, train it on another 32-bit machine; a model pickled on a 64-bit machine may not unpickle correctly on the Pi's 32-bit OS).
DONE!
For capturing intent, I tried Gradient Boosting (scikit-learn), XGBoost, and an LSTM (Keras/TensorFlow). While the LSTM with word embeddings (trained on Japanese Wikipedia) had slightly higher accuracy, the model was too large to deploy on a Raspberry Pi 3. After some trial and error, I settled on the Gradient Boosting model because it was the simplest to deploy.
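For reference, here is a minimal sketch of what an intent classifier along these lines could look like: janome for Japanese tokenization, TF-IDF features, and scikit-learn's GradientBoostingClassifier. The phrases, intent labels, and the intent_model.pkl filename are made-up examples, not the actual contents of gbt.py.

import pickle
from janome.tokenizer import Tokenizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

janome_tokenizer = Tokenizer()

def tokenize(text):
    # Split Japanese text into word surface forms with janome
    return [token.surface for token in janome_tokenizer.tokenize(text)]

# Made-up training examples: spoken phrase -> intent label
phrases = ['ラジオをつけて', 'TBSラジオを流して', 'ユーチューブで音楽をかけて', '音量を上げて', '次の曲']
intents = ['radio', 'radio', 'youtube', 'volume_up', 'skip']

model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, lowercase=False),
    GradientBoostingClassifier(),
)
model.fit(phrases, intents)

# Pickle the fitted pipeline so the speaker can load it at startup
with open('intent_model.pkl', 'wb') as f:
    pickle.dump(model, f)

print(model.predict(['音量を上げて']))  # expected: ['volume_up']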