
A Data Science Project Using Spotify

forthebadge forthebadge

Python Version Uses Json Uses Numpy Uses Pandas Machine Learning Uses Pytorch Uses Matplotlib Ideas Welcome

Introduction

This is a data science side project. First, I gathered information and track details for a few playlists from Spotify. The details are accessed using Spotipy, which provides wrapper functions for Spotify's RESTful API, and are then used to analyse the differences between some of the playlists. Before using Spotipy, remember to get the Client ID and Client Secret from the Spotify Developer website.

After the initial analysis, I used machine learning techniques to train models that predict whether a user will like a song in a playlist, using the dataset obtained via the Spotify API. Click on the Machine Learning section to find out more!

Table of contents

Getting keys for Spotify

Back to top

As the code uses the Spotipy library, make sure you have Spotipy installed first. All methods in Spotipy require authorization using a Client ID and Client Secret, which are available through Spotify's Developer site. All you have to do is create an account, create a new app and then obtain the Client ID and Client Secret.

screenshot of spotify developer site

Once you have both the Client ID and Client Secret, put them in a JSON file. In my case, I placed them in authorization.json. This is how it should look in the file:

{"client_id": "your_client_id",
"client_secret": "your_client_secret"}

Accessing the URI

Back to top

Spotify has a URI (Uniform Resource Identifier) for every track, album, playlist, etc. For the purpose of analysing the playlists, you will need the URI of each playlist. The URI is what you use to communicate with the Spotify API and retrieve the information.

To get the URI:

  • Go to the three dots icon
  • Click on Share
  • Click Copy Spotify URI

Screenshot of getting the Spotify uri

Once you have the URIs, put them into a JSON file again, along with any other keys and values for each playlist. I've placed them in a file called playlists_like_dislike.json, which looks like this:

[{"uri":"spotify:playlist:37i9dQZF1DX1T2fEo0ROQ2",
"like":true, "purpose":"meditation"},
 {"uri":"spotify:playlist:4wibn1cPPP9m7WPiv7KF5Z",
 "like":true, "purpose":"feels"},
{"uri":"spotify:playlist:317O0e8iWJLClLGDKtieRe",
"like":false, "purpose":"house"},
{"uri":"spotify:playlist:04cqQXOsOibWJhmHTMTIrG",
 "like":true, "purpose":"workout"}]

For each playlist, besides the uri, I've also included like, for whether I like the playlist or not, and purpose, for what occasion it's for. Yes, I have a playlist for feels.
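As a rough sketch (assuming the sp client from the previous section), the playlists can be loaded from that JSON file and their tracks fetched with Spotipy's playlist_items, paging through the results:

```python
# A rough sketch: load playlists_like_dislike.json and pull each playlist's
# tracks. Assumes `sp` is the authenticated spotipy.Spotify client from above.
import json

with open("playlists_like_dislike.json") as f:
    playlists = json.load(f)

for pl in playlists:
    results = sp.playlist_items(pl["uri"])   # first page of tracks
    tracks = results["items"]
    while results["next"]:                   # page through the remaining tracks
        results = sp.next(results)
        tracks.extend(results["items"])
    print(pl["purpose"], pl["like"], len(tracks), "tracks")
```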

Gif of feels

Accessing the features and other attributes

Back to top

This part is a bit lengthy. It explains the different features of each track. Click on (Skip) if you prefer to go to the next part. For each playlist, you can get track details like the artists, IDs and titles. Each track also has a set of attributes called features. Examples of features used in the analysis later are loudness, liveness, energy, etc. For more information, check out Spotify.

On top of that, Spotify also has a number of audio attributes for each track, namely bars, beats, sections, tatums and segments. I didn't analyse them, but here's the link if you want more information.
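For reference, here's a hedged sketch of how the audio features can be pulled for a playlist's tracks with Spotipy's audio_features endpoint and collected into a pandas DataFrame (the variable names are illustrative):

```python
# A rough sketch: fetch the audio features of each track in a playlist.
# Assumes `sp` is the authenticated client and `tracks` is the list of
# playlist items gathered above.
import pandas as pd

track_ids = [item["track"]["id"] for item in tracks if item["track"]]

features = []
for i in range(0, len(track_ids), 100):   # audio_features takes up to 100 ids per call
    features.extend(sp.audio_features(track_ids[i:i + 100]))

df = pd.DataFrame(features)
print(df[["danceability", "energy", "acousticness", "instrumentalness"]].head())
```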

Comparing playlists

Back to top

I analysed two playlists that I listen to in different circumstances. One is a playlist for meditation and the other is for when I am working out.

Radar chart comparing two playlists

From the radar chart, it's clear that the two playlists differ across several features. The meditation playlist scored quite high in acousticness and instrumentalness, while the workout playlist scored higher in energy, danceability and valence. The workout playlist is also slightly higher in speechiness, tempo, liveness and loudness.

I was surprised by the small difference in speechiness. Meditation tracks barely have any voice in them, if at all, while workout tracks are all songs. It turns out that a high speechiness score means a track is composed mostly of spoken words, e.g. a talk show or audiobook. If both speech and music are present simultaneously, the score will be lower than for purely spoken-word recordings.
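In case you're wondering how a radar chart like this can be drawn, here's a minimal matplotlib sketch (not the notebook's exact code); meditation_df and workout_df are assumed DataFrames of audio features built as described above:

```python
# A minimal sketch of a radar chart comparing the mean feature values of two
# playlists. `meditation_df` and `workout_df` are assumed DataFrames.
import numpy as np
import matplotlib.pyplot as plt

features = ["danceability", "energy", "speechiness", "acousticness",
            "instrumentalness", "liveness", "valence"]

med_means = meditation_df[features].mean().tolist()
work_means = workout_df[features].mean().tolist()

# One angle per feature; repeat the first value to close the polygon
angles = np.linspace(0, 2 * np.pi, len(features), endpoint=False).tolist()
angles += angles[:1]
med_means += med_means[:1]
work_means += work_means[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, med_means, label="Meditation")
ax.fill(angles, med_means, alpha=0.25)
ax.plot(angles, work_means, label="Workout")
ax.fill(angles, work_means, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(features)
ax.legend(loc="upper right")
plt.show()
```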

Let's have a look at them in the form of a bar chart.

Stacked bar chart comparing two playlists

The bar chart provides an alternative view of the differences between the meditation and workout playlists. Meditation differs a lot from Workout in Acousticness, Instrumentalness, Danceability, Energy and Valence. Maybe we should test the differences using a statistical test? Before that, let's look at another bar chart to highlight the differences.

Bar chart to show the difference

This bar chart highlights which features each playlist scored higher in. Meditation scored higher in acousticness, instrumentalness and mode, while workout scored higher in everything else.

Further analysis

Back to top

I wanted to test the differences between the two playlists on a few variables. The following are my hypotheses:

Null hypothesis: There is no difference between the playlists in the variables tested.
Alternative hypothesis: There is a difference between the playlists in the variables tested.

Before selecting a statistical test for the hypotheses, it's common to check for assumptions to decide on which test to use. So, here I've decided to test homogeneity of variance, normality of variables and correlation between the variables.

Homogeneity of variance

Back to top

I used Levene's test to check the homogeneity of variance of each variable between the two playlists. The results are:

Levene's stats for danceability is 2.01, p-value is 0.15831
Levene's stats for energy is 31.25, p-value is 0.0
Levene's stats for loudness_norm is 10.93, p-value is 0.00113
Levene's stats for mode is 1.34, p-value is 0.24831
Levene's stats for speechiness is 59.83, p-value is 0.0
Levene's stats for acousticness is 0.09, p-value is 0.76784
Levene's stats for instrumentalness is 15.91, p-value is 9e-05
Levene's stats for liveness is 24.44, p-value is 0.0
Levene's stats for valence is 84.96, p-value is 0.0
Levene's stats for tempo_norm is 10.65, p-value is 0.0013

From the results, it's clear that homogeneity of variance is violated.
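For reference, here's a rough sketch of how numbers like these can be produced with scipy; med and work are hypothetical DataFrames holding each playlist's feature columns:

```python
# A rough sketch of Levene's test per feature. `med` and `work` are assumed
# DataFrames of the meditation and workout playlists' audio features.
from scipy import stats

features = ["danceability", "energy", "loudness_norm", "mode", "speechiness",
            "acousticness", "instrumentalness", "liveness", "valence", "tempo_norm"]

for feature in features:
    stat, p = stats.levene(med[feature], work[feature])
    print(f"Levene's stats for {feature} is {round(stat, 2)}, p-value is {round(p, 5)}")
```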

Correlation between variables

Back to top

Correlation matrix of meditation variables

Correlation matrix of workout variables

From the results, we can see that the variables within each group are only weakly correlated with one another. In other words, independence of the variables can be assumed. Note that this is not the same as testing for correlation between, say, the "Energy" of the meditation playlist and that of the workout playlist. For that, we can assume independence as the tracks are not related in any way.
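A short sketch of how such a correlation matrix can be plotted (using matplotlib only; meditation_df is an assumed DataFrame of that playlist's features):

```python
# A rough sketch: correlation matrix of the audio features within one playlist.
import matplotlib.pyplot as plt

features = ["danceability", "energy", "loudness_norm", "speechiness",
            "acousticness", "instrumentalness", "liveness", "valence", "tempo_norm"]
corr = meditation_df[features].corr()   # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(features)))
ax.set_xticklabels(features, rotation=90)
ax.set_yticks(range(len(features)))
ax.set_yticklabels(features)
fig.colorbar(im)
plt.tight_layout()
plt.show()
```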

Normality of variables

Back to top

Next, I used the Shapiro-Wilk test to check the normality of each variable's distribution. It turns out only loudness is normally distributed.

Meditation's danceability p-value is 0.0; Workout's danceability p-value is 0.030599
Meditation's energy p-value is 7e-06; Workout's energy p-value is 0.008363
Meditation's loudness_norm p-value is 0.161003; Workout's loudness_norm p-value is 0.078042
Meditation's mode p-value is 0.0; Workout's mode p-value is 0.0
Meditation's speechiness p-value is 0.00011; Workout's speechiness p-value is 0.0
Meditation's acousticness p-value is 0.0; Workout's acousticness p-value is 0.0
Meditation's instrumentalness p-value is 0.0; Workout's instrumentalness p-value is 0.0
Meditation's liveness p-value is 0.0; Workout's liveness p-value is 0.0
Meditation's valence p-value is 0.0; Workout's valence p-value is 6.8e-05
Meditation's tempo_norm p-value is 0.0; Workout's tempo_norm p-value is 0.0
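A rough sketch of how these checks can be run with scipy's shapiro, using the same features list and hypothetical med/work DataFrames as the Levene sketch above:

```python
# A rough sketch of the Shapiro-Wilk normality checks per feature.
from scipy import stats

for feature in features:
    _, p_med = stats.shapiro(med[feature])
    _, p_work = stats.shapiro(work[feature])
    print(f"Meditation's {feature} p-value is {round(p_med, 6)}; "
          f"Workout's {feature} p-value is {round(p_work, 6)}")
```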

So, I decided to have a look at the distribution of the variables.

The graphs show that the distributions of the data are far from normal. As the Shapiro-Wilk test suggested, loudness is probably the only variable with a somewhat normal distribution. Mode has a bimodal distribution and is therefore not a continuous variable, so it should be dropped. The remaining variables are quite obviously skewed, however. Let's try to transform the data.

Data transformation

Back to top

When a dataset isn't normally distributed, it is common to transform it so that parametric tests can be used. There are many ways to do this: for example, a log transformation, a square root transformation, a Box-Cox transformation, etc. Here, I tried MinMaxScaler from sklearn, a log transformation, square and square root transformations (separately, but I've overwritten them so they're no longer in the notebook), and a Box-Cox transformation. However, none gave satisfactory results, as seen in the distributions below. So, I decided to use a non-parametric test instead.

Distribution for after log transformation
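For reference, here's an illustrative sketch of the kinds of transformations tried (not the exact notebook code); df stands for an assumed DataFrame of audio features:

```python
# A rough sketch of transformations on one skewed feature: log, square root,
# Box-Cox, and MinMax scaling. `df` is an assumed DataFrame of audio features.
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

x = df["acousticness"].to_numpy()

log_x = np.log1p(x)                     # log(1 + x), safe when x contains zeros
sqrt_x = np.sqrt(x)                     # square root transformation
boxcox_x, _ = stats.boxcox(x + 1e-6)    # Box-Cox needs strictly positive values

scaled = MinMaxScaler().fit_transform(x.reshape(-1, 1))  # rescale to [0, 1]
```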

Hypothesis testing

Back to top

Given that the data are not normally distributed even before transformation and there is no homogeneity of variance, parametric tests, which normally assume normally distributed data, could not be used. Even after the different transformations, the distributions still weren't satisfactory. So, I decided not to transform the data and to try non-parametric tests instead. The Wilcoxon signed-rank test is a non-parametric test, but it is sensitive to non-symmetrical data, so it wasn't used. After doing some research, I decided to use the sign test.

The sign test, short for the paired-samples sign test, is a non-parametric test used to detect a difference in medians between paired or matched observations. It is normally used to test the same cohort of participants under two conditions or time points; here, the two different samples are treated as a matched pair. Furthermore, because I'm going to conduct multiple tests, there's a possibility of inflating the Type I error. To reduce it, I decided to use the Bonferroni correction: the alpha level is set to the original alpha level (0.05) divided by the number of tests. Here is more information on the sign test. More information on the Bonferroni adjustment.

The sign test compares medians. Therefore, the hypotheses are:

Null hypothesis: There is no difference in the medians of the variables between the two playlists.
Alternative hypothesis: There is a difference in the medians of the variables between the two playlists.
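Here's a hedged sketch of how the test could be run with statsmodels' sign_test and a Bonferroni-adjusted alpha; med and work are the same hypothetical DataFrames as before, and the tracks are paired by position (truncated to the shorter playlist), which is only one way of forming the matched pairs described above:

```python
# A rough sketch of a paired-samples sign test per feature with a
# Bonferroni-corrected alpha. `med` and `work` are assumed DataFrames.
from statsmodels.stats.descriptivestats import sign_test

features = ["danceability", "energy", "loudness_norm", "speechiness",
            "acousticness", "instrumentalness", "valence"]
alpha = 0.05 / len(features)             # Bonferroni correction

n = min(len(med), len(work))             # pair tracks by position
for feature in features:
    diff = med[feature].to_numpy()[:n] - work[feature].to_numpy()[:n]
    _, p = sign_test(diff)               # tests whether the median difference is 0
    verdict = "significant" if p < alpha else "not significant"
    print(f"Difference in {feature.capitalize()} is {verdict} (p={p}).")
```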

Results:

Difference in Danceability is significant (p=1.5777218104420236e-30).
Difference in Energy is significant (p=1.5777218104420236e-30).
Difference in Loudness_norm is significant (p=0.00239462935750601).
Difference in Speechiness is significant (p=1.911357536027706e-15).
Difference in Acousticness is significant (p=1.5777218104420236e-30).
Difference in Instrumentalness is significant (p=1.5934990285464443e-28).
Difference in Valence is significant (p=7.969072864542664e-27).

Conclusion

Back to top

This has been a fun project to do! A brief recap: I selected a few playlists from Spotify and, from these, picked the Meditation and Workout playlists to analyse. Spotify provides different features for each track, like energy, acousticness, etc. I analysed the differences between the two playlists graphically. Then, I tested some assumptions of the data to decide which statistical test to use to quantify the differences. The datasets weren't normally distributed, so I tried to transform the data using a few methods. None of them gave satisfying results, so I resorted to a non-parametric test instead, specifically the sign test. The test shows that the two playlists differ in terms of Danceability, Energy, Loudness, Speechiness, Acousticness, Instrumentalness and Valence. The data science-y part might be done for now, but there are a few directions I'm thinking of taking this project. If you have any ideas, let me know!

Machine Learning

So, to train ML models to predict whether I'll like a song, I aggregated track data from 8 different playlists, including a mix of playlists that I like and dislike. Since the data is labelled, I'll be using supervised machine learning. First, I used Logistic Regression (LR) and K-Nearest Neighbours (KNN) from sklearn. As I'm interested in deep learning, I decided to employ PyTorch as a classifier to predict my taste in music, too.

Data Distribution

First, let's have a look at the shape of the distributions of the variables used.

Distribution for larger dataset

From the distributions of the different variables, it's fairly obvious that the songs I like have rather low scores in terms of danceability, energy, loudness and tempo. Songs that I'm into have rather high scores in instrumentalness and acousticness. I have a suspicion that this might be caused by the meditation playlist. It looks like there isn't much difference from the tracks that I dislike in terms of duration and sections. I'm hesitant to comment on the differences in tempo, bars and segments as there is quite a large amount of overlap. Normally, I would normalise the variables, but we're not doing hypothesis testing at the moment, so we'll leave them be for now.

Sklearn

Sklearn, also known as Scikit-learn, is a machine learning library in Python. There are many tools in there. Here's an ultra-brief explanation of the algorithms I used:

| Algorithms | Description |
| --- | --- |
| Logistic Regression | Predicts discrete values for a set of independent variables using the logit function |
| KNN | Assumes that similar objects exist in close proximity and computes distances between objects to assign them into groups |

Other than the two algorithms, I also used classification_report from sklearn, which tells you the accuracy, precision, recall, f1-score, etc. Pretty neat, huh?

One more thing: it wouldn't be good practice to train and test your model on the same set of data. Imagine building a classifier for food types and then training it on sausages only. What will happen to the model? It'll only ever be able to tell you if something is a sausage or not. Not useful, right?

Gif of hotdog Identifier

So, to circumvent this pitfall, I used train_test_split from sklearn.model_selection.

```python
from sklearn.model_selection import train_test_split

# x holds the independent variables (the audio features)
# y holds the dependent variable, i.e. what you're trying to predict
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, test_size=0.3, random_state=0
)
```
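From there, fitting the two models and getting the classification reports looks roughly like this (the hyperparameters shown are illustrative, not necessarily those used in the notebook):

```python
# A rough sketch of training LR and KNN and printing classification reports.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

lr = LogisticRegression(max_iter=1000).fit(x_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)

print("Logistic regression:")
print(classification_report(y_test, lr.predict(x_test)))

print("KNN:")
print(classification_report(y_test, knn.predict(x_test)))
```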

Classification report results

Logistic regression:

| | Precision | Recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.91 | 0.99 | 0.95 | 104 |
| 1 | 0.98 | 0.81 | 0.89 | 53 |
| accuracy | | | 0.93 | 157 |
| macro avg | 0.94 | 0.90 | 0.92 | 157 |
| weighted avg | 0.93 | 0.93 | 0.93 | 157 |

KNN:

| | Precision | Recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.67 | 0.79 | 0.73 | 104 |
| 1 | 0.37 | 0.25 | 0.30 | 53 |
| accuracy | | | 0.61 | 157 |
| macro avg | 0.52 | 0.52 | 0.51 | 157 |
| weighted avg | 0.57 | 0.61 | 0.58 | 157 |

Accuracy is often used as a metric to judge a model. It's easy to see why: it's a measurement of how accurate your model is. However, the caveat is that it's not useful on an imbalanced dataset, and it also depends on what you're testing. For example, a model might have high accuracy at identifying virus-negative people when what actually matters is identifying the virus-positive ones. People therefore also use other measurements such as precision, recall and f1-score. All of them are useful, but which one to focus on depends on several factors: what are you testing? Is there an imbalanced dataset problem?

An imbalanced dataset is common. It happens when you have more datapoints for one class than the other. It's usually more of a problem when you have 90% of the datapoints from one class and only 10% from the other: even if your model manages to predict 90% of the majority class correctly, it hasn't really learnt anything. In this particular dataset, the split between the two classes is not 50/50, as can be seen from the graph below, but it's not as extreme as 90/10 either. To err on the side of caution, I decided that f1-score is my priority, especially when it comes to predicting liked (labelled "1") tracks. LR's f1 for class 1 is 0.89, which is considerably higher than KNN's. Moreover, all the other numbers for LR are higher than those for KNN. LR is clearly the better model here. Note: there are other ways to handle an imbalanced dataset.

Picture of different classes

Prediction

LR was selected to make the final prediction and the results are shown below.

| | Precision | Recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.91 | 0.98 | 0.94 | 341 |
| 1 | 0.95 | 0.81 | 0.88 | 181 |
| accuracy | | | 0.92 | 522 |
| macro avg | 0.93 | 0.90 | 0.91 | 522 |
| weighted avg | 0.92 | 0.92 | 0.92 | 522 |

The accuracy was 0.92 and the overall scores look pretty good to me. Let's try PyTorch next time!
