Predicting Song popularity in Spotify using KNN
If there’s one thing I can’t live without, it’s not my phone or my laptop: it’s music. I love music and getting lost in it. My inspiration for this project was finding out which features of a song make it popular. The main objective of this project is, given quantifiable metrics for a song, namely acousticness, danceability, instrumentalness, and tempo, to predict its commercial success. Our system takes as input a subset of songs and their associated metadata, including genre, year, artist, rhythm/tempo, and instrumentation, and predicts the popularity of each song. Since we conclude whether a song is a hit or a non-hit, this is a classification problem. You can find the code I used in my GitHub repo.
Requirements
R==4.0.3
caret==6.0-86
gmodels==2.18.1
ggplot2==3.3.2
Data set
27K rows and 14 columns. You can download the data from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
Name | Description |
---|---|
acousticness | Numerical - Confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
danceability | Numerical- Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | Numerical - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. |
duration_ms | Numerical - Duration of the track in milliseconds. |
instrumentalness | Numerical - Predicts whether a track contains no vocals. The closer the value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks. |
valence | Numerical - Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
popularity | Numerical - The higher the value, the more popular the song is. Ranges from 0 to 1. |
liveness | Numerical - Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live. Ranges from 0 to 1. |
loudness | Numerical - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
speechiness | Numerical - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
year | Numerical - Ranges from 1921 to 2020 |
mode | Categorical - (0 = Minor, 1 = Major) |
key | Categorical - All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on… |
artists | Categorical - List of artists mentioned |
genre | Categorical - Genre of the song |
Using k-nearest neighbours to predict the song popularity
- Step 1: Import the dataset from Kaggle
- Step 2: Clean the dataset
- Keep only the subset of features that will help in predicting the popularity of a song.
- Convert popularity (numeric data) to a categorical value.
```r
spotify <- subset(spotify, select = c(acousticness, danceability, energy,
                                      instrumentalness, liveness, loudness,
                                      speechiness, tempo, valence, popularity))

# Recode on a numeric copy: writing 'Y' into the column would turn it into a
# character vector and break the later < and == comparisons
pop <- spotify$popularity
spotify$popularity <- factor(ifelse(pop > 0.5, "Y",
                             ifelse(pop < 0.5, "N", "N/Y")))
```
- Step 3: Data Splitting
- The KNN algorithm is applied to the training data set and the results are verified on the test data set.
- I used 75% of the data for training and 25% for testing.
- After obtaining the training and testing sets, we keep the testing set's actual labels aside as a separate data frame, so the model's predictions can be compared with them.
- The predictor variables are acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence. The target variable is popularity.
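The split can be sketched with caret's createDataPartition(), which preserves the class proportions in both halves. The toy data frame below is a synthetic stand-in for the cleaned Spotify frame (its size and column names are assumptions for illustration):

```r
library(caret)

set.seed(123)
# Synthetic stand-in for the cleaned Spotify frame
toy <- data.frame(danceability = runif(200),
                  energy       = runif(200),
                  popularity   = factor(sample(c("N", "Y"), 200, replace = TRUE)))

# 75% of the rows go to training, stratified by the popularity class
intrain  <- createDataPartition(toy$popularity, p = 0.75, list = FALSE)
training <- toy[intrain, ]
testing  <- toy[-intrain, ]
```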
Step 4: Data Pre-Processing With Caret
- We need to pre-process our data before we can use it for modeling.
- The caret package in R provides a number of useful data transforms.
- Training transforms can be prepared and applied automatically during model evaluation.
- Transforms applied during training are prepared using preProcess() and passed to the train() function via its preProcess argument.
- Combining the scale and center transforms for preprocessing will standardize the data.
- The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.
- The center transform calculates the mean for an attribute and subtracts it from each value.
- Attributes after preprocessing will have a mean value of 0 and a standard deviation of 1.
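As a quick sanity check of what the two transforms do, here is the same standardization done by hand in base R (independent of caret):

```r
# center: subtract the mean; scale: divide by the standard deviation
x <- c(2, 4, 6, 8, 10)
z <- (x - mean(x)) / sd(x)

mean(z)   # 0 (up to floating-point error)
sd(z)     # 1
```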
Step 5: Model Training and Tuning
- The trainControl function is used to control the parameters passed to train.
- trainControl lets us estimate model performance through resampling methods such as cross-validation ("cv"), repeated cross-validation ("repeatedcv"), and the bootstrap ("boot").
- The "repeatedcv" method we used controls the number of folds and the number of repetitions for repeated K-fold cross-validation.
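Setting up the repeated 10-fold cross-validation looks like this. The train() call is shown as a comment because it assumes the training frame from the splitting step, and tuneLength = 20 (which makes caret scan odd k values starting at 5) is my assumption about how the tuning grid in the output was produced:

```r
library(caret)

# 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Sketch of the model fit (assumes `training` from the data-splitting step):
# knnFit <- train(popularity ~ ., data = training, method = "knn",
#                 trControl  = trctrl,
#                 preProcess = c("center", "scale"),
#                 tuneLength = 20)
```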
Step 6: How to choose the value of K to improve performance
- Time to fit a KNN model using caret with the preprocessed values.
- From the output of the KNN model, the maximum accuracy (0.8868028) is achieved at k = 23, which caret selects as the final model.
- We can observe the accuracy for the different values of k.
```
k-Nearest Neighbors

20716 samples
    9 predictor
    3 classes: 'N', 'N/Y', 'Y'

Pre-processing: centered (9), scaled (9)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 18645, 18644, 18645, 18644, 18644, 18645, ...
Resampling results across tuning parameters:

   k  Accuracy   Kappa
   5  0.8858047  0.5350387
   7  0.8844851  0.5232202
   9  0.8842440  0.5169325
  11  0.8854505  0.5149945
  13  0.8852578  0.5103663
  15  0.8865932  0.5122967
  17  0.8860142  0.5060265
  19  0.8862234  0.5042290
  21  0.8861751  0.5010014
  23  0.8868028  0.5021004
  25  0.8866096  0.4995803
  27  0.8866900  0.4975168
  29  0.8864164  0.4961109
  31  0.8858049  0.4922270
  33  0.8859177  0.4922341
  35  0.8850005  0.4872441
  37  0.8851774  0.4868491
  39  0.8851775  0.4858074
  41  0.8851774  0.4835469
  43  0.8846143  0.4797119

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 23.
```

```r
plot(knnFit)
```
- Step 7: Making predictions
- Having built the KNN model on the training set, we can check its prediction accuracy on the test set using a confusion matrix.
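The prediction step comes down to two calls, predict() and confusionMatrix(). Here is a self-contained miniature of that workflow using iris as a stand-in for the Spotify frame (the actual run below uses the training and testing frames from Step 3):

```r
library(caret)

set.seed(42)
# Stand-in data: iris instead of the Spotify frame
idx      <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[idx, ]
testing  <- iris[-idx, ]

knnFit <- train(Species ~ ., data = training, method = "knn",
                preProcess = c("center", "scale"),
                trControl  = trainControl(method = "cv", number = 5))

# Predict classes for the held-out rows, then cross-tabulate vs. the truth
knnPredict <- predict(knnFit, newdata = testing)
confusionMatrix(knnPredict, testing$Species)
mean(knnPredict == testing$Species)   # overall accuracy
```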
- Step 8: Interpretation of the results and prediction accuracy achieved
- The accuracy of our model on the testing set is about 88%.
- We can visualise the model’s performance using a confusion matrix.
- We can also evaluate the accuracy, precision, and recall from the confusion matrix to judge the performance of the KNN algorithm.
```
Confusion Matrix and Statistics

          Reference
Prediction    N  N/Y    Y
       N    468    4  203
       N/Y    0    0    0
       Y    616   10 5604

Overall Statistics

               Accuracy : 0.8794
                 95% CI : (0.8714, 0.887)
    No Information Rate : 0.841
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.4659
 Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                     Class: N Class: N/Y Class: Y
Sensitivity           0.43173   0.000000   0.9650
Specificity           0.96444   1.000000   0.4299
Pos Pred Value        0.69333        NaN   0.8995
Neg Pred Value        0.90112   0.997972   0.6993
Prevalence            0.15699   0.002028   0.8410
Detection Rate        0.06778   0.000000   0.8116
Detection Prevalence  0.09776   0.000000   0.9022
Balanced Accuracy     0.69809   0.500000   0.6975
```

```r
mean(knnPredict == testing$popularity)
# [1] 0.8793628
```
Overall insights obtained from the implemented project
- The overall accuracy of the model is about 88%, so it is reasonable to conclude that KNN models can be trained on audio feature data to predict popularity.
- Sensitivity is 0.9650 for popular songs (class Y) and 0.43173 for unpopular songs (class N).
- Specificity is 0.4299 for popular songs and 0.96444 for unpopular songs.