A full-stack web app that visualizes audio features and lyrical content and recommends songs with a tree-based machine learning algorithm, based on the user's selection of artists and songs, using R packages that wrap the Genius and Spotify APIs.
This fall semester, I built a web app that combines data visualizations with a tree-based machine learning algorithm to recommend songs to users based on their selection of up to five songs. I used the spotifyr and geniusr packages to retrieve information about songs, including audio features and lyrics.
Creating a radar chart was relatively new to me, especially since I wanted to showcase eight different audio features for each song. I used a function, radarchart2, that I found online and rescaled each audio feature to a 0 to 100 range so that all features share a common scale. I also used geom_ridges() to offer a different way to compare the audio features. The “lyrical content” section was created with bind_tf_idf(), which weights how important each word, tokenized with unnest_tokens(), is across the selected song(s). Each time a user clicks the “Save selected song” button, these visualizations are updated.
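The rescaling step can be sketched in base R as follows. The write-up does not say how the 0 to 100 scaling was done, so this assumes simple min-max scaling against fixed bounds per feature. The helper `rescale_0_100`, the `bounds` list, the tempo ceiling of 250, and the sample values are all hypothetical and only for illustration.

```r
# Rescale a numeric vector to 0-100 given assumed lower and upper bounds.
rescale_0_100 <- function(x, lo, hi) {
  100 * (x - lo) / (hi - lo)
}

# Invented audio features for two songs (not real Spotify data).
songs <- data.frame(
  track = c("Song A", "Song B"),
  danceability = c(0.80, 0.35),  # reported by Spotify on a 0-1 scale
  loudness = c(-6, -30),         # decibels, typically between -60 and 0
  tempo = c(120, 90),            # beats per minute
  stringsAsFactors = FALSE
)

# Assumed bounds per feature; the tempo ceiling is a guess, not a Spotify limit.
bounds <- list(
  danceability = c(0, 1),
  loudness = c(-60, 0),
  tempo = c(0, 250)
)

scaled <- songs
for (feature in names(bounds)) {
  lo <- bounds[[feature]][1]
  hi <- bounds[[feature]][2]
  scaled[[feature]] <- rescale_0_100(songs[[feature]], lo, hi)
}

print(scaled)
```

With every feature on the same 0 to 100 axis, a loud but slow song and a quiet but fast one can be compared on a single radar chart without one feature dominating the plot.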
- Modeling with training and testing datasets
For the training and testing datasets, I asked nine friends to each give me a list of 10-15 songs that they like to listen to. Then, applying the get_related_artists() function to those lists, I created a dataset of “related artists” whose popularity and total followers both fall between the first and third quartiles. This keeps the data to a reasonable size, and from it I used the get_artist_audio_features() function to pull the audio features of every song by these artists. Given the large number of observations in this new dataset, I then randomly selected 15 songs using sample_n(), which my friends were asked to listen to and mark as enjoyed or not enjoyed. Once they responded, I compiled all the data with a binary variable “like” that indicates whether they liked the song (1) or not (0).
Because I wanted to incorporate both audio features and lyrics frequency to predict and recommend new songs, I created a new variable, ‘success rate’, calculated for each friend as follows: (number of times a word appeared across the songs my friend A liked) / (total number of words across all songs that A listened to). I used randomForest() and varImpPlot() to identify the six most important variables that determine the “like” variable: lyrics frequency, tempo, valence, acousticness, loudness, and liveness. I then tried bagging, boosting, and regression trees to decide which of the three to use for the modeling. Boosting, for which I used adaboost(), repeatedly produced errors, so between the remaining two I chose regression trees with rpart(), which gave a higher accuracy rate on average than bagging. However, processing the lyrical content took far too long.
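The ‘success rate’ formula can be sketched in base R as below. This is one reading of the formula, computed per word for a single friend. The `tokens` table (one row per word occurrence, the shape unnest_tokens() would give) and the `success_rate` helper are hypothetical, and the words are invented.

```r
# One row per word occurrence for one friend; songs and words are invented.
tokens <- data.frame(
  song = c("s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3"),
  like = c(1, 1, 1, 0, 0, 1, 1, 1),
  word = c("love", "night", "love", "rain", "love", "night", "fire", "love"),
  stringsAsFactors = FALSE
)

# For each word: (occurrences in liked songs) / (total words in all rated songs).
success_rate <- function(tokens) {
  total_words <- nrow(tokens)
  liked_counts <- table(tokens$word[tokens$like == 1])
  rates <- as.numeric(liked_counts) / total_words
  names(rates) <- names(liked_counts)
  sort(rates, decreasing = TRUE)
}

rates <- success_rate(tokens)
print(rates)
```

Words that only appear in disliked songs (here "rain") get no entry, which under this reading is equivalent to a success rate of zero.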
Although lyrical content (or ‘success rate’) was the most important variable in creating the model, it could not be computed for every song, because the geniusr package does not cover as many songs as spotifyr in terms of lyrics availability. For that reason, and more importantly to reduce the running time of the code, I decided to drop the ‘success rate’ variable and consider only the other five variables, all of which are audio features. The resulting accuracy rate from regression trees typically ranged from around 20% to 60%.
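A minimal sketch of fitting a tree on the five audio features and measuring accuracy on held-out songs is shown below. The write-up does not show the actual rpart() call, so the formula, the 70/30 split, and the use of `method = "class"` for the binary “like” outcome are assumptions. The data is random and the rule that generates `like` is invented so the tree has something to learn.

```r
library(rpart)

set.seed(42)
n <- 200

# Synthetic stand-in for the compiled friend data; values are random, not real.
songs <- data.frame(
  tempo = runif(n, 60, 180),
  valence = runif(n),
  acousticness = runif(n),
  loudness = runif(n, -30, 0),
  liveness = runif(n)
)
# Invented labelling rule: upbeat, loud songs are "liked".
songs$like <- factor(as.integer(songs$valence > 0.5 & songs$loudness > -15))

# Hold out 30% of the songs for testing.
train_idx <- sample(n, size = round(0.7 * n))
train <- songs[train_idx, ]
test <- songs[-train_idx, ]

# Fit a classification tree on the five audio features.
fit <- rpart(like ~ tempo + valence + acousticness + loudness + liveness,
             data = train, method = "class")

# Accuracy = share of held-out songs whose predicted label matches the true one.
predicted <- predict(fit, newdata = test, type = "class")
accuracy <- mean(predicted == test$like)
print(accuracy)
```

On real listening data the labels are far noisier than this invented rule, which is consistent with the much lower accuracy range reported above.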
- Implementation
In creating the actual function, I designated a table of artists and their Spotify IDs as its input. From there, the function would retrieve data in the same order that I followed with my friends.
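One step in that order, keeping only related artists whose popularity and followers fall between the first and third quartiles and then drawing a random sample, might look like the sketch below. It uses base R rather than the author's actual code. The `artists` table is invented (in the app it would come from get_related_artists()), `between_quartiles` is a hypothetical helper, and `sample()` on row indices stands in for sample_n().

```r
set.seed(1)

# Invented related-artist table for illustration only.
artists <- data.frame(
  name = paste("Artist", 1:12),
  popularity = c(12, 25, 33, 41, 48, 52, 57, 63, 70, 78, 85, 96),
  followers = c(900, 4000, 15000, 22000, 36000, 51000, 64000, 80000,
                150000, 420000, 900000, 5000000),
  stringsAsFactors = FALSE
)

# TRUE for values lying between the first and third quartiles (inclusive).
between_quartiles <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))
  x >= q[[1]] & x <= q[[2]]
}

# Keep artists that are mid-range on both popularity and followers.
mid_artists <- artists[between_quartiles(artists$popularity) &
                         between_quartiles(artists$followers), ]

# Draw a random subset of rows, the base R counterpart of sample_n().
picked <- mid_artists[sample(nrow(mid_artists), 3), ]
print(picked)
```

Filtering on both quartile ranges drops both obscure and very famous artists, which keeps the number of follow-up API calls manageable.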
There are quite a few data visualization features that could not be part of the final app, as the spotifyr package could only pull so much data. For example, I wanted users to be able to log in to their accounts, so that, using their streaming history with and without personalized playlists, the app could identify their top 5 songs of the past six months and their most recently played songs, through both of which the users could find their “mood”. I attempted to use JavaScript in Visual Studio Code, as recommended on the Spotify developers website. However, because that approach exclusively uses the Web API, which does not cover some features of the spotifyr package that I wanted to use, I decided not to include the login feature.
Instead, users are asked to choose between 3 and 5 songs. One of the many perks of this design is that users are not required to log in: non-Spotify users can freely use the web app as well, which makes it more accessible and may broaden their musical interests. Another benefit is that users can learn something about their songs that they had not noticed before. For example, from the audio features chart, they can find which of their chosen songs have higher acousticness or danceability. Also, had the initial design been implemented, the recommended songs might not have fully reflected the users’ preferences, because the sentiment or mood of the top 5 songs could range more widely than that of the most recently played songs, so the tree model would have to cover too much variation in audio features. Here, because users pick the songs themselves, they are free to choose whatever they want to hear in the moment, which increases the chance of an accurate recommendation.
The geniusr package pulls data from genius.com, a song lyrics platform that unfortunately does not cover many non-English songs. Because of this gap in lyrics availability, not every song that my friends chose or that I recommended to them had lyrics whose word frequencies I could analyze with bind_tf_idf().
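To make concrete what bind_tf_idf() computes when lyrics are available, here is a hand-rolled base R sketch of the same quantities: term frequency is a word's count divided by the total words in its song, and inverse document frequency is the natural log of the number of songs divided by the number of songs containing the word. The `counts` table and the `tf_idf` helper are hypothetical. In the app itself the tidytext function does this work.

```r
# Word counts per song, the shape unnest_tokens() plus a count would give.
counts <- data.frame(
  song = c("s1", "s1", "s2", "s2"),
  word = c("love", "rain", "love", "fire"),
  n = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

tf_idf <- function(counts) {
  # Total number of words in each song.
  doc_totals <- tapply(counts$n, counts$song, sum)
  # Number of songs in the collection.
  n_docs <- length(unique(counts$song))
  # Number of songs in which each word appears.
  docs_with_word <- tapply(counts$song, counts$word,
                           function(s) length(unique(s)))

  counts$tf <- counts$n / as.numeric(doc_totals[counts$song])
  counts$idf <- log(n_docs / as.numeric(docs_with_word[counts$word]))
  counts$tf_idf <- counts$tf * counts$idf
  counts
}

result <- tf_idf(counts)
print(result)
```

A word such as "love" that appears in every selected song gets an idf of zero, so only words that distinguish one song from the others stand out in the chart.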
The algorithm function I made runs many code chunks that pull data and save it as new datasets. Because of this, rendering the final output from the function can take a while. The same holds for lyrical content, as mentioned before: including it would lengthen the running time even more, because the function has to process every row of lyrics for each of ten songs. So I tried my best to keep the code as concise as possible. The running time ranges from 4 to 10 seconds per action button click.
I learned a great deal from building this app, especially about design and feasibility. Working on both backend and frontend development, I was both developer and user, trying to find which features would work most efficiently. Along with the features described above, I was excited to learn how to create a search bar that renders a list of songs for whatever artist a user types in, and tables that react only to the songs that the user chooses, because these two are the backbone of the remaining analysis and code in the project.
Although the project does not include everything that I had hoped for, it does its job well: it achieves my original objective of recommending new songs to users based on their music history, as reflected in the songs they select. From online research and exploration of Spotify Web API-based web apps and spotifyr-based data science projects, I could not find any existing web app built with spotifyr and/or geniusr prior to my own. With this in mind, I hope this project serves as a starting point for web app development using the spotifyr package, focusing on data visualization, machine learning, or both.
I would like to extend my sincere gratitude to my friends for contributing to the cross-validation process and experimenting with the app before its official launch, and to Professor Alex Lyford at Middlebury College for his continued support and guidance. This project would not have been possible without all of you.
Some links I used for inspiration are:
- https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/
- https://stackoverflow.com/questions/49968975/shiny-pickerintput-choices-based-on-search-bar
- https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a#:~:text=Loudness%20values%20are%20averaged%20across,to%201.0%20the%20attribute%20value
- https://mastering-shiny.org/action-tidy.html