Spotify Data Analysis | Nisha Selvarajan

Predicting Song popularity in Spotify using KNN

If there’s one thing I can’t live without, it isn’t my phone or my laptop: it’s music. I love music and getting lost in it. My inspiration for this project was to find out which features of a song make it popular. The main objective is, given quantifiable metrics for a song, namely acousticness, danceability, instrumentalness, and tempo, to predict its commercial success. The system takes as input a set of songs and their associated metadata, including genre, year, artist, rhythm/tempo, and instrumentation, and predicts each song’s popularity. Deciding whether a song is a hit or not makes this a classification problem. You can find the code I used in my GitHub repo.

Requirements

R==4.0.3
caret==6.0-86
gmodels==2.18.1
ggplot2==3.3.2

Data set

27K rows, with 14 columns. You can download the data from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

Names Description
acousticness Numerical - Confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability Numerical- Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy Numerical - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
duration_ms Numerical - Duration of the track in milliseconds.
instrumentalness Numerical - Predicts whether a track contains no vocals. The closer the value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks.
valence Numerical - Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
popularity Numerical - The higher the value, the more popular the song is. Ranges from 0 to 1.
liveness Numerical - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness Numerical - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
speechiness Numerical - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
year Numerical - Ranges from 1921 to 2020
mode Categorical - (0 = Minor, 1 = Major)
key Categorical - All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…
artists Categorical - List of artists mentioned
genre Categorical - Genre of the song

Using k-nearest neighbours to predict the song popularity

  • Step 1: Import the dataset from Kaggle
  • Step 2: Clean the dataset
    • Clean the dataset to include only the subset of features that help predict song popularity.
    • Convert popularity (numeric) to a categorical value.
# Keep only the audio features used as predictors, plus the target
spotify <- subset(spotify, select = c(acousticness, danceability,
                                      energy, instrumentalness, liveness,
                                      loudness, speechiness, tempo, valence,
                                      popularity))

# Bin the numeric popularity score into three classes. Do the numeric
# comparisons before overwriting the column: assigning 'Y' first would
# coerce the whole column to character and make the later comparisons
# unreliable (they would be string comparisons).
pop <- spotify$popularity
spotify$popularity <- factor(ifelse(pop > 0.5, 'Y',
                             ifelse(pop < 0.5, 'N', 'N/Y')))
  • Step 3: Data Splitting
    • The KNN algorithm is applied to the training data set and the results are verified on the test data set.
    • I used 25% of the data for testing and 75% for training.
    • After obtaining the training and testing data sets, we keep the actual popularity labels of the test set aside so the predictions can be compared against them.
    • The predictor variables are acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence. The target variable is popularity.
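The 75/25 split can be sketched in base R on a toy data frame (the actual pipeline presumably uses caret's createDataPartition, which additionally stratifies the split by class):

```r
# Minimal sketch of a 75/25 train/test split; the toy data frame stands in
# for the real Spotify data
set.seed(42)
toy <- data.frame(acousticness = runif(100),
                  popularity   = factor(sample(c("N", "Y"), 100, replace = TRUE)))

train_idx <- sample(seq_len(nrow(toy)), size = 0.75 * nrow(toy))
training  <- toy[train_idx, ]   # 75 rows
testing   <- toy[-train_idx, ]  # 25 rows
```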
  • Step 4: Data Pre-Processing With Caret

    • We need to pre-process our data before we can use it for modeling.
    • The caret package in R provides a number of useful data transforms.
    • Training transforms can be prepared and applied automatically during model evaluation.
    • Transforms applied during training are prepared using preProcess() and passed to the train() function via the preProcess argument.
    • Combining the center and scale transforms for preprocessing will standardize the data.
    • The center transform calculates the mean of an attribute and subtracts it from each value.
    • The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
    • After preprocessing, each attribute will have a mean of 0 and a standard deviation of 1.
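The center and scale transforms described above amount to the classic z-score; base R's scale() does both in one call, which makes it easy to see the effect on a single attribute:

```r
# Standardize a small vector of (made-up) tempo values: subtract the mean,
# then divide by the standard deviation
x <- c(120, 128, 95, 140, 110)
z <- scale(x, center = TRUE, scale = TRUE)

mean(z)  # effectively 0
sd(as.vector(z))  # exactly 1
```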
  • Step 5: Model Training and Tuning

    • The trainControl function is used to control the parameters for train.
    • trainControl allows model performance to be estimated through resampling methods such as cross-validation ("cv"), repeated cross-validation ("repeatedcv"), and the bootstrap ("boot").
    • The "repeatedcv" method we use performs repeated K-fold cross-validation; its repeats argument controls the number of repetitions of the resampling.
  • Step 6: How to choose a value of k to improve performance

    • Time to fit a KNN model using caret with the preprocessed values.
    • From the output of the KNN model below, the maximum accuracy (0.8868028) is achieved at k = 23.
    • We can observe the accuracy for different values of k.
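The training call behind the output below can be sketched as follows. This is a sketch, not the author's exact code: it assumes the `training` data frame from Step 3 and uses the `knnFit` name that appears in the plot call further down; tuneLength = 20 is inferred from the 20 values of k in the printed grid.

```r
library(caret)

set.seed(42)

# Repeated 10-fold cross-validation, 3 repeats (Step 5)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fit KNN; the center/scale standardization from Step 4 is applied inside
# train(), and tuneLength controls how many values of k caret tries
knnFit <- train(popularity ~ .,
                data       = training,
                method     = "knn",
                trControl  = ctrl,
                preProcess = c("center", "scale"),
                tuneLength = 20)
```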
 k-Nearest Neighbors
 20716 samples
     9 predictor
     3 classes: 'N', 'N/Y', 'Y'

 Pre-processing: centered (9), scaled (9)
 Resampling: Cross-Validated (10 fold, repeated 3 times)
 Summary of sample sizes: 18645, 18644, 18645, 18644, 18644, 18645, ...
 Resampling results across tuning parameters:

   k   Accuracy   Kappa
    5  0.8858047  0.5350387
    7  0.8844851  0.5232202
    9  0.8842440  0.5169325
   11  0.8854505  0.5149945
   13  0.8852578  0.5103663
   15  0.8865932  0.5122967
   17  0.8860142  0.5060265
   19  0.8862234  0.5042290
   21  0.8861751  0.5010014
   23  0.8868028  0.5021004
   25  0.8866096  0.4995803
   27  0.8866900  0.4975168
   29  0.8864164  0.4961109
   31  0.8858049  0.4922270
   33  0.8859177  0.4922341
   35  0.8850005  0.4872441
   37  0.8851774  0.4868491
   39  0.8851775  0.4858074
   41  0.8851774  0.4835469
   43  0.8846143  0.4797119

 Accuracy was used to select the optimal model using the largest value.
 The final value used for the model was k = 23
plot(knnFit)

  • Step 7: Making predictions
    • We apply the model built on the training set to the test set. After making predictions, we can check their accuracy using a confusion matrix.
  • Step 8: Interpretation of the results and prediction accuracy achieved.
    • The accuracy of our model on the testing set is 88%.
    • We can visualise the model’s performance using a confusion matrix.
    • The confusion matrix also lets us evaluate accuracy, precision, and recall to assess the performance of the KNN algorithm.
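A minimal sketch of these two steps, using the `knnFit`, `testing`, and `knnPredict` names that appear elsewhere in the post:

```r
library(caret)

# Step 7: predict class labels for the held-out test set
knnPredict <- predict(knnFit, newdata = testing)

# Step 8: confusion matrix plus overall accuracy, kappa, and
# per-class statistics
confusionMatrix(knnPredict, testing$popularity)
```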
Confusion Matrix and Statistics

           Reference
 Prediction    N  N/Y    Y
        N    468    4  203
        N/Y    0    0    0
        Y    616   10 5604

 Overall Statistics

                Accuracy : 0.8794
                  95% CI : (0.8714, 0.887)
     No Information Rate : 0.841
     P-Value [Acc > NIR] : < 2.2e-16

                   Kappa : 0.4659

  Mcnemar's Test P-Value : < 2.2e-16

 Statistics by Class:

                      Class: N Class: N/Y Class: Y
 Sensitivity           0.43173   0.000000   0.9650
 Specificity           0.96444   1.000000   0.4299
 Pos Pred Value        0.69333        NaN   0.8995
 Neg Pred Value        0.90112   0.997972   0.6993
 Prevalence            0.15699   0.002028   0.8410
 Detection Rate        0.06778   0.000000   0.8116
 Detection Prevalence  0.09776   0.000000   0.9022
 Balanced Accuracy     0.69809   0.500000   0.6975
                    
mean(knnPredict == testing$popularity)
 0.8793628

Overall insights obtained from the implemented project

  • The overall accuracy of the model is 88%. It is reasonable to conclude that KNN models can be trained on audio-feature data to predict popularity.
  • Sensitivity for popular songs (class Y) is 0.9650 and for unpopular songs (class N) is 0.43173.
  • Specificity for popular songs is 0.4299 and for unpopular songs is 0.96444.
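These headline numbers can be recomputed directly from the printed confusion matrix (rows are predictions, columns are the reference labels), which is a useful sanity check:

```r
# The confusion matrix from Step 8, re-entered as a base R matrix
cm <- matrix(c(468,   4,  203,
                 0,   0,    0,
               616,  10, 5604),
             nrow = 3, byrow = TRUE,
             dimnames = list(pred = c("N", "N/Y", "Y"),
                             ref  = c("N", "N/Y", "Y")))

accuracy <- sum(diag(cm)) / sum(cm)        # 0.8794
sens_Y   <- cm["Y", "Y"] / sum(cm[, "Y"])  # 0.9650 (recall for popular songs)
sens_N   <- cm["N", "N"] / sum(cm[, "N"])  # 0.4317 (recall for unpopular songs)
```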