Music Recommendation System

In [251]:
df_final.to_csv('/content/drive/My Drive/dfFinal.csv',index=False)

Milestone 2

Now that we have explored the data, let's apply different algorithms to build recommendation systems.

Note: Use the shorter version of the data, i.e., the data after the cutoffs as used in Milestone 1.

Load the dataset

In [252]:
df_final =  pd.read_csv('/content/drive/MyDrive/dfFinal.csv')

Popularity-Based Recommendation Systems

Let's take the count and sum of play counts of the songs and build the popularity recommendation systems based on the sum of play counts.

In [47]:
# Calculating average play_count

average_count = df_final.groupby('song_id').mean()['play_count']# Hint: Use groupby function on the song_id column

# Calculating the frequency a song is played
play_freq = df_final.groupby('song_id').count()['play_count']     # Hint: Use groupby function on the song_id column
In [48]:
# Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count': average_count, 'play_freq': play_freq, })

# Let us see the first five records of the final_play dataset
final_play.head()
Out[48]:
avg_count play_freq
song_id
21 1.622642 265
22 1.492424 132
52 1.729216 421
62 1.728070 114
93 1.452174 115
In [49]:
# Set the figure size
plt.figure(figsize = (30, 10))

sns.barplot(x = final_play.index,
            y = 'play_freq',
            data = final_play,
            estimator = np.mean)

# Set the y label of the plot
plt.ylabel('Play Frequency') 
#set the x label
plt.xlabel('Songs')
#Set the title of the plot
plt.title("Showing Play Frequency of Songs in the Final Play Dataset")

# Show the plot
plt.show()

Now, let's create a function to find the top n songs for a recommendation based on the average play count of song. We can also add a threshold for a minimum number of playcounts for a song to be considered for recommendation.

In [50]:
# Build the function to find top n songs
def top_n_songs(data, n, min_plays):
    
    # Finding songs with interactions greater than the minimum number of interactions
    recommendations = data[data['play_freq'] > min_plays]
    
    # Sorting values with respect to the average rating
    recommendations = recommendations.sort_values(by = 'avg_count', ascending = False)
    
    return recommendations.index[:n]
In [51]:
final_play.dtypes
Out[51]:
avg_count    float64
play_freq      int64
dtype: object
In [52]:
final_play.head()
Out[52]:
avg_count play_freq
song_id
21 1.622642 265
22 1.492424 132
52 1.729216 421
62 1.728070 114
93 1.452174 115
In [53]:
# Recommend top 10 songs using the function defined above
list(top_n_songs(final_play, 40, 500))
Out[53]:
[5531,
 2220,
 352,
 4448,
 1334,
 8092,
 8138,
 7416,
 8582,
 4152,
 605,
 4639,
 6175,
 703,
 1118,
 8612,
 6189,
 6293,
 2091,
 9931,
 5367]
In [54]:
song_ids = [5531, 2220, 352, 4448, 1334, 8092, 8138, 7416, 8582, 4152]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[54]:
['The Scientist by Coldplay',
 'Dog Days Are Over (Radio Edit) by Florence + The Machine',
 'Sehr kosmisch by Harmonia',
 'Fireflies by Charttraxx Karaoke',
 'Revelry by Kings Of Leon',
 'Drop The World by Lil Wayne / Eminem',
 'Use Somebody by Kings Of Leon',
 'OMG by Usher featuring will.i.am',
 'Secrets by OneRepublic',
 'Hey_ Soul Sister by Train']
In [55]:
# Recommend top 10 songs using the function defined above
# Experimenting with threshold of min plays of 1
list(top_n_songs(final_play, 10, 1))
Out[55]:
[7224, 8324, 6450, 9942, 5531, 5653, 8483, 2220, 657, 614]
In [56]:
song_ids = [7224, 8324, 6450, 9942, 5531, 5653, 8483, 2220, 657, 614]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[56]:
['Sehr kosmisch by Harmonia',
 'Luvstruck by Southside Spinners',
 'Secrets by OneRepublic',
 'Transparency by White Denim',
 'Greece 2000 by Three Drives',
 "You're The One by Dwight Yoakam",
 'Brave The Elements by Colossal',
 "Victoria (LP Version) by Old 97's",
 'The Big Gundown by The Prodigy',
 'Video Killed The Radio Star by The Buggles']

TOP n SONGS - OBSERVATIONS

We can see with a threshold of 500 - I tried to see if it could produce a top 40 songs but there are only 21 songs with a min_play higher than 500. So for a top 10 this is OK and truly representative of very popular songs. But we must be aware that if we wanted to produce a list of say 40 songs, we must use a lower threshold.

I also experimented with very low thresholds of minmum plays and even without it at all. In the case of no or a very low threshold (we can see from the plot that most songs have at least 100 plays) that the top n songs are then calculated on average play_count. The top song - Victoria by Old 97's (Yeah I've never heard of them either!) has an average count of 3.38 and a play frequency of only 107. But with a threshold of 500, the top song is 'Secrets' by One Republic with a lower average play_count of 2.31 but a much higher number of listeners at 618.

So if we are really looking for just the most popular songs, we should use a higher threshold and let the order of these be sorted by average play_count rather than letting the top songs be sorted by the average play count which are so closely grouped in the lower thresholds that it's not really doing a great sorting job.

We can use the top_n_songs and apply it to further filters such as countries or areas, also we can apply it to the catalogue of a particular artist to find their most popular songs or indeed even of a song title along with the songwriter (if we had this information) so that people could find the most popular versions of a song that has been covered many times.

User User Similarity-Based Collaborative Filtering

To build the user-user-similarity-based and subsequent models we will use the "surprise" library.

In [62]:
# Install the surprise package using pip. Uncomment and run the below code to do the same
!pip install surprise 
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 KB 13.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: joblib>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-surprise->surprise) (1.2.0)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from scikit-surprise->surprise) (1.21.6)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.8/dist-packages (from scikit-surprise->surprise) (1.7.3)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp38-cp38-linux_x86_64.whl size=3366457 sha256=2a931623d7799c6a0f3c7d35818a22e3749a525c670392e893e895988f4f728c
  Stored in directory: /root/.cache/pip/wheels/af/db/86/2c18183a80ba05da35bf0fb7417aac5cddbd93bcb1b92fd3ea
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1
In [63]:
# Import necessary libraries
# To compute the accuracy of models
from surprise import accuracy

# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing KFold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

Some useful functions

Below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.

Think About It: Which metric should be used for this problem to compare different models?

In [64]:
# The function to calulate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    # Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x : x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[ : k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    # Mean of all the predicted precisions are calculated
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
    
    accuracy.rmse(predictions)

    # Command to print the overall precision
    print('Precision: ', precision)

    # Command to print the overall recall
    print('Recall: ', recall)
    
    # Formula to compute the F-1 score
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))

Think About It: In the function precision_recall_at_k above the threshold value used is 1.5. How precision and recall are affected by changing the threshold? What is the intuition behind using the threshold value of 1.5?

The threshold of 1.5 is to represent a user playing a song 1.5 times. This becomes the boundary that makes this regression problem become more like a clsasification problem. We are saying here that if a user plays the song 1.5 times or more, then they are considered to like the song was predicted - it would be considered a successful prediction.

A threshold lower that this would prove difficult as playing a song 1 time doesn't really let us know if the listener likes it or not. But playing a song more than once is a pretty good indication that they were interested in it and that the prediction had some value to them.

I experimented with changing the threshold to 2 then 3 then 1. 1 gave us precision and recall scores of almost 100% but this is not of much value to how well the model is working. 2 and 3 were much less, tiny scores because it's so rare that a user plays a song 3 times. It seems the best number is 1.5 for generally seeing how the model is performing.

In [65]:
# Instantiating Reader scale with expected rating scale 
reader = Reader(rating_scale=(0, 5)) #use rating scale (0, 5)

# Loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader) # Take only "user_id","song_id", and "play_count"

# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size= 0.4, random_state = 42) # Take test_size = 0.4

Think About It: How changing the test size would change the results and outputs?

In [136]:
# Build the default user-user-similarity model
sim_options = {'name': 'cosine',
               'user_based': True}

# KNN algorithm is used to find desired similar items
sim_user_user = KNNBasic(sim_options = sim_options, verbose = False, random_state = 1)

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user) # Use sim_user_user model
RMSE: 1.0878
Precision:  0.396
Recall:  0.692
F_1 score:  0.504

**Observations and Insights: We can see our RMSE is 1.09. Precision seems to be low at 39% meaning that only 39% of the songs receommended were of interest to the user. Recall is at 69% and this means of all the songs the listener would play more than once, that 69% of them are being recommended. F1 - which is the harmonic mean of Precision and Recall is 50%. We want to really focus on the Precision in this project since we want the listener to trust the recommendations and continue to find music that they enjoy. If the top songs being recommended to them are of no interest they will lost trust in the platform and potentially go to another.

We will see if we can get the precision higher and the RMSE scores lower, even a little bit by tuning the hyperparameters of the model.

In [138]:
# Predicting play_count for a sample user with a listened song
sim_user_user.predict(6958, 1671, r_ui = 2, verbose = True) # Use user id 6958 and song_id 1671
user: 6958       item: 1671       r_ui = 2.00   est = 1.80   {'actual_k': 40, 'was_impossible': False}
Out[138]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.8009387435128914, details={'actual_k': 40, 'was_impossible': False})
In [228]:
df_final.loc[df_final['song_id'] == 1671]
Out[228]:
user_id song_id play_count title release artist_name year text
215 6958 1671 2 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
9351 45386 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
13595 22749 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
15504 51415 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
22112 74334 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
... ... ... ... ... ... ... ... ...
1908629 67765 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
1909779 42427 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
1920487 37911 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
1964017 35114 1671 1 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service
1971708 10807 1671 2 Sleeping In (Album) Give Up Postal Service 2003 Sleeping In (Album) Give Up Postal Service

150 rows × 8 columns

In [139]:
# Predicting play_count for a sample user with a song not-listened by the user
sim_user_user.predict(6958, 3232, verbose = True) # Use user_id 6958 and song_id 3232
user: 6958       item: 3232       r_ui = None   est = 1.64   {'actual_k': 40, 'was_impossible': False}
Out[139]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.6386860897998294, details={'actual_k': 40, 'was_impossible': False})
In [229]:
df_final.loc[df_final['song_id'] == 3232]
Out[229]:
user_id song_id play_count title release artist_name year text
473 27018 3232 2 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
14230 69587 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
24925 2399 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
49597 39613 3232 2 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
75803 40245 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
... ... ... ... ... ... ... ... ...
1947290 22588 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
1957040 8986 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
1966257 21484 3232 2 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
1982595 30034 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...
1992742 68669 3232 1 Life In Technicolor ii Viva La Vida - Prospekt's March Edition Coldplay 2008 Life In Technicolor ii Viva La Vida - Prospekt...

142 rows × 8 columns

Observations and Insights: We can see that model estimated the user would listen to 'Sleeping In' by The Postal Service 1.8 times when in reality they did listen 2 times (worth noting that the play count is only given in integers so we will rarely get an exact match given that the prediction is given as a float number). This is pretty good.

In the prediction whether the same user would listen to Life In Technicolor by Coldplay it predicted they would listen 1.6 times. This seems reasonable just from knowledge of the songs myself. The songs are not too disimilar in genre and sound, though arguably someone who liked the Postal Service might be a little more niche in their taste than a big popular band like Coldplay. But still, the songs are close enough and the prediction seems reasonable purely from a musical point of view, as we don't know what the actual play count on this was as they have not heard the song.

Now, let's try to tune the model and see if we can improve the model performance.

In [234]:
# # Setting up parameter grid to tune the hyperparameters
# param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
#               'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
#                              'user_based': [True], "min_support": [2, 4]}
#               }

# #Performing 3-fold cross-validation to tune the hyperparameters
# gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)

# #Fitting the data
# gs.fit(data) # Use entire data for GridSearch

# #Best RMSE score
# print(gs.best_score['rmse'])

# #Combination of parameters that gave the best RMSE score
# print(gs.best_params['rmse'])
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
1.0459372898453758
{'k': 30, 'min_k': 9, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 2}}
In [141]:
# Train the best model found in above gridsearch
# Build the default user-user-similarity model
sim_options = {'name': 'pearson_baseline',
               'user_based': True}

# KNN algorithm is used to find desired similar items
sim_user_user_opt = KNNBasic(sim_options = sim_options, k=30, min_k = 9, verbose = False, random_state = 1, min_support = 2)
#(sim_options=sim_options, k=20, min_k=6, verbose=False)

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user_opt.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user_opt) # Use sim_user_user model
RMSE: 1.0521
Precision:  0.413
Recall:  0.721
F_1 score:  0.525

Observations and Insights: We have gained a bit of Precision as it is now 42% up from 39%. Recall is also higher at 72% and our RMSE score is lower at 1.0521 down from 1.0878. This shows that the tuning of the hyperparamaters has increased the performance of the model.

In [142]:
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2
sim_user_user_opt.predict(6958, 1671, r_ui = 2, verbose = True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.96   {'actual_k': 24, 'was_impossible': False}
Out[142]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.962926073914969, details={'actual_k': 24, 'was_impossible': False})
In [143]:
# Predict the play count for a song that is not listened to by the user (with user_id 6958)
sim_user_user_opt.predict(6958, 3232, verbose = True)
user: 6958       item: 3232       r_ui = None   est = 1.45   {'actual_k': 10, 'was_impossible': False}
Out[143]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.4516261428486725, details={'actual_k': 10, 'was_impossible': False})

Observations and Insights: We can see the prediction of the user listening to 'Sleeping In' by The Postal Service is more accurate with this optimized model predicting the plays at 1.96 where the actual was 2. This is up from the baseline model which had it predicted at 1.8.

It also predicts the same user playing 'Life in Technicolor' by Coldplay 1.45 times down from 1.64 as it predicted in the baseline model. We don't know the real number of plays this listener would play of this song since they haven't heard it, but a lower number from the more accurate model seems to tie in with my thoughts from a SME (Subject Matter Expert) perspective.

We can conclude that this optimized model is better than the baseline one and seems quite accurate.

Think About It: Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain song?

In [144]:
# Use inner id 0
sim_user_user_opt.get_neighbors(0, k=5)
Out[144]:
[42, 1131, 17, 186, 249]

Below we will be implementing a function where the input parameters are:

  • data: A song dataset
  • user_id: A user-id against which we want the recommendations
  • top_n: The number of songs we want to recommend
  • algo: The algorithm we want to use for predicting the play_count
  • The output of the function is a set of top_n items recommended for the given user_id based on the given algorithm
In [266]:
def get_recommendations(data, user_id, top_n, algo):
    
    # Creating an empty list to store the recommended product ids
    recommendations = []
    
    # Creating an user item interactions matrix 
    user_item_interactions_matrix = data.pivot_table(index = 'user_id', columns = 'song_id', values = 'play_count')

    # Extracting those business ids which the user_id has not visited yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # Looping through each of the business ids which user_id has not interacted yet
    for song_id in non_interacted_products:
        
        # Predicting the ratings for those non visited restaurant ids by this user
        est = algo.predict(user_id, song_id).est
        
       
        
        # Appending the predicted ratings
        recommendations.append((song_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key = lambda x : x[1], reverse = True)

    return recommendations[:top_n] # Returing top n highest predicted rating products for this user
In [267]:
# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_user_user_opt)
In [268]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['song_id', 'predicted_plays'])
Out[268]:
song_id predicted_plays
0 5531 2.553335
1 317 2.518269
2 4954 2.406776
3 8635 2.396606
4 5943 2.390723
In [274]:
song_ids = [5531, 317, 4954, 8635, 5943]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[274]:
['The Maestro by Beastie Boys',
 "You've Got The Love by Florence + The Machine",
 'Undo by Björk',
 'Secrets by OneRepublic',
 'Una Confusion by LU']

Observations and Insights: Here we have the top 5 songs for our user 6958 recommended by the user_user_sim_opt algorithm and ranking them in order of the predicted plays. The first 4 songs in this recommendation list look pretty good - similar style and sound all in the indie / electro / indiepop and indie dance genres but I must say that the 5th song does not seem to fit in as it is a spanish pop song very different in genre and general artistic style.

Correcting the play_counts and Ranking the above songs

In [270]:
def ranking_songs(recommendations, final_play):
  # Sort the songs based on play counts
  ranked_songs = final_play.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending = False)[['play_freq']].reset_index()

  # Merge with the recommended songs to get predicted play_count
  ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns = ['song_id', 'predicted_plays']), on = 'song_id', how = 'inner')
   #ranked_products = ranked_products.merge(pd.DataFrame(recommendations, columns = ['prod_id', 'predicted_ratings']), on = 'prod_id', how = 'inner')
  # Rank the songs based on corrected play_count
  ranked_songs['corrected_plays'] = ranked_songs['predicted_plays'] - 1 / np.sqrt(ranked_songs['play_freq'])

  # Sort the songs based on corrected play_counts
  ranked_songs = ranked_songs.sort_values('corrected_plays', ascending = False)
  
  return ranked_songs

Think About It: In the above function to correct the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting?

Here we are taking into account the number of users that have listened to a song as well as the number of times they have played it. A song with a high number of plays but by only a small number of users should not be considered to have a higher ranking than a song with an average to high play count but by many many different users. So we do this calculation to adjust the predicted plays by taking this into account.

Here we are subtracting the 1/sqrt(play_freq) so that the corrected plays will all come down a bit by different amounts. We could in this case also add the 1/sqrt(play_freq) so that the corrected plays will all be a bit higher as there is no upper limit of play count.

In [271]:
#Applying the ranking_songs function on the final_play data
ranking_songs(recommendations, final_play)
Out[271]:
song_id play_freq predicted_plays corrected_plays
0 5531 618 2.553335 2.513109
2 317 411 2.518269 2.468943
1 5943 423 2.390723 2.342101
3 4954 183 2.406776 2.332854
4 8635 155 2.396606 2.316284

Observations and Insights: Here the top 5 songs using the ranking_songs function has still produced the same top 5 songs but the order of them has changed slightly due to the new corrected play count predictions.

Item Item Similarity-based collaborative filtering recommendation systems

In [287]:
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance
sim_options = {'name': 'cosine',
               'user_based': False}
# Defining nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options=sim_options,verbose=False, random_state = 1)

# Train the algorithm on the train set
algo_knn_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k=10
precision_recall_at_k(algo_knn_item)
RMSE: 1.0394
Precision:  0.307
Recall:  0.562
F_1 score:  0.397

Observations and Insights: In this model that uses the cosine similarity nearest neighbours between songs, we can see our baseline model has an RMSE of 1.0394 which is lower (and better) than both the models in our user similarity model. However Precision is lower than both at 30% and so is Recall at 56% and the F1 at 39%. As we are quite interested in Precision we definitely want to improve this 30% though it is interesting to see that our error distance is lower on this model.

In [151]:
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user
algo_knn_item.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.36   {'actual_k': 20, 'was_impossible': False}
Out[151]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.3614157231762556, details={'actual_k': 20, 'was_impossible': False})
In [152]:
#Finding users who have not listened to song_id 1671

specific_song = 1671

# Create a Boolean mask for specific song_id
mask = df_final['song_id'] == specific_song

# Group by user_id and aggregate the mask using the `any` function
grouped = df.groupby('user_id')['song_id'].agg(listened_to_song=lambda x: any(x == song))

# Filter the grouped data to only show users who have not listened to the song
result = grouped[grouped['listened_to_song'] == False]

result
Out[152]:
listened_to_song
user_id
11 False
17 False
57 False
84 False
120 False
... ...
76300 False
76307 False
76331 False
76342 False
76347 False

3154 rows × 1 columns

In [285]:
# Predict the play count for a user that has not listened to the song (with song_id 1671)
algo_knn_item.predict(76300, 1671, verbose = True)
user: 76300      item: 1671       r_ui = None   est = 2.03   {'actual_k': 25, 'was_impossible': False}
Out[285]:
Prediction(uid=76300, iid=1671, r_ui=None, est=2.03143303412422, details={'actual_k': 25, 'was_impossible': False})

Observations and Insights: The prediction for our user that listened to 'Sleeping In' by the Postal Service 2 times, is only predicted to listen 1.34 times in this model. This is less accurate than the previous model which had the predictions at 1.80 and 1.96. It has also predicted that a user 76300 who has never heard this song would listen to it 2.03 times but we are not feeling too confident in this model.

In [154]:
# Apply grid search for enhancing model performance

# Setting up parameter grid to tune the hyperparameters
#param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              #'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                             # 'user_based': [False], "min_support": [2, 4]}
              #}

# Performing 3-fold cross-validation to tune the hyperparameters
#gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)

# Fitting the data
#gs.fit(data)

# Best RMSE score
#print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
#print(gs.best_params['rmse'])

Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameters here.

In [286]:
# Apply the best modle found in the grid search
sim_options = {'name': 'pearson_baseline',
               'user_based': False}
# Defining nearest neighbour algorithm
algo_knn_item_opt = KNNBasic(sim_options=sim_options, k = 30, min_k = 6, verbose=False, random_state = 1, min_support = 2)

# Train the algorithm on the train set
algo_knn_item_opt.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k=10
precision_recall_at_k(algo_knn_item_opt)
RMSE: 1.0328
Precision:  0.408
Recall:  0.665
F_1 score:  0.506

Observations and Insights: The tuning of the hyperparameters has helped improve this model a little bit. In the RMSE score however there is only a very slight improvement from 1.0394 to 1.0328. However Precision has improved from 30% to 41% and Recall has improved from 56% to 66.5%. It is still not performing as well as the User to User similarity however in terms of these scores.

In [156]:
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
algo_knn_item_opt.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.96   {'actual_k': 10, 'was_impossible': False}
Out[156]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.9634957386781853, details={'actual_k': 10, 'was_impossible': False})
In [288]:
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
algo_knn_item_opt.predict(6958, 3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.28   {'actual_k': 10, 'was_impossible': False}
Out[288]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.2759946618244609, details={'actual_k': 10, 'was_impossible': False})

Observations and Insights: Wow the predicted plays on this user listening to 'Sleeping In' by the Postal Service has improved a lot. The baseline model had it predicting 1.36 plays but this tuned model predicted 1.96 plays which is very close to the actual number of plays - 2.

For this same user to listen to 'Life In Technicolor' by Coldplay it has a prediction of 1.28. We don't know how many times they actually would play it because they have never heard it.

In [158]:
# Find five most similar items to the item with inner id 0
algo_knn_item_opt.get_neighbors(0, k=5)
Out[158]:
[124, 523, 173, 205, 65]
In [159]:
# Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine
sim_item_recommendations = get_recommendations(df_final, 6958, 5, algo_knn_item_opt)
In [160]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(sim_item_recommendations, columns=['song_id', 'predicted_plays'])
Out[160]:
song_id predicted_plays
0 2342 2.653903
1 5101 2.386577
2 139 2.313727
3 7519 2.270864
4 8099 2.212702
In [161]:
# Applying the ranking_songs function
ranking_songs(sim_item_recommendations, final_play)
Out[161]:
song_id play_freq predicted_plays corrected_plays
4 2342 111 2.653903 2.558987
2 5101 130 2.386577 2.298871
3 139 119 2.313727 2.222057
1 7519 168 2.270864 2.193712
0 8099 275 2.212702 2.152399
In [289]:
song_ids = [2342, 5101, 139, 7519, 8099]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[289]:
['I Got Mine by The Black Keys',
 'Toxic by Britney Spears',
 'A Dustland Fairytale by The Killers',
 'White Sky by Vampire Weekend',
 'Alaska by Camera Obscura']

Observations and Insights: We have an entirely different set of song recommendatinos for this user with the song to song based similiarity algorithn comparered to the user to user similiarity. These songs seem to all fit in together again apart from one very pop track - Toxic by Britney Spears. Maybe our user does have a pop leaning too. The predicted plays of the top songs here roughly in the same sort of range as in the user to user similarity.

Model Based Collaborative Filtering - Matrix Factorization

Model-based Collaborative Filtering is a personalized recommendation system, the recommendations are based on the past behavior of the user and it is not dependent on any additional information. We use latent features to find recommendations for each user.

In [162]:
# Build baseline model using svd
svd = SVD(random_state=1)

# Training the algorithm on the train set
svd.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd)
RMSE: 1.0252
Precision:  0.41
Recall:  0.633
F_1 score:  0.498

Observations We have the lowest RMSE score of all our models so far. Precision is 41% which is the same as the user to user optimized whic is the best so far and Recall is 63% which is only better than the song-song baseline model. F1 is ~50% which again is only better than the song-song baseline model. So the RMSE and Precision of this model but its Recall lets it down.

In [290]:
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2
svd.predict(6958, 1671, r_ui = 2, verbose = True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.27   {'was_impossible': False}
Out[290]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.267473397214638, details={'was_impossible': False})
In [164]:
# Making a prediction for the user who has not listened to the song (song_id 3232)
#Finding users who have not listened to song_id 3232

specific_song = 3232

# Create a Boolean mask for specific song_id
mask = df_final['song_id'] == specific_song

# Group by user_id and aggregate the mask using the `any` function
grouped = df.groupby('user_id')['song_id'].agg(listened_to_song=lambda x: any(x == song))

# Filter the grouped data to only show users who have not listened to the song
result = grouped[grouped['listened_to_song'] == False]

result
Out[164]:
listened_to_song
user_id
11 False
17 False
57 False
84 False
120 False
... ...
76300 False
76307 False
76331 False
76342 False
76347 False

3154 rows × 1 columns

In [165]:
svd.predict(90, 3232, verbose = True)
user: 90         item: 3232       r_ui = None   est = 1.68   {'was_impossible': False}
Out[165]:
Prediction(uid=90, iid=3232, r_ui=None, est=1.683210478325189, details={'was_impossible': False})
In [291]:
#we also know that our test user 6958 has not listened to song 3232
svd.predict(6958, 3232, verbose = True)
user: 6958       item: 3232       r_ui = None   est = 1.56   {'was_impossible': False}
Out[291]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.5561675084403663, details={'was_impossible': False})

OBSERVATIONS Even though the model looks good from its scores, the individual prediction for the user 6958 to listen to 'Sleeping In' by The Postal Service 1671 is only 1.27 which is lower than all the other models and further away from the actual play count of 2. This svd model also predicts that this same user would listen to 'Life In Technicolor' by Coldplay ID 3232 1.56 times.

Improving matrix factorization based recommendation system by tuning its hyperparameters

In [166]:
# # Set the parameter space to tune
# param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
#               'reg_all': [0.2, 0.4, 0.6]}

# # Performe 3-fold grid-search cross-validation
# gs_= GridSearchCV(SVD, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# # Fitting data
# gs_.fit(data)
# # Best RMSE score
# print(gs_.best_score['rmse'])
# # Combination of parameters that gave the best RMSE score
# print(gs_.best_params['rmse'])
1.0120897909030624
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}

Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters here.

In [167]:
# Building the optimized SVD model using optimal hyperparameters
# Build the optimized SVD model using optimal hyperparameter search
svd_optimized = SVD(n_epochs=30, lr_all=0.01, reg_all=0.2)

# Train the algorithm on the train set
svd_optimized.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd_optimized)
RMSE: 1.0143
Precision:  0.412
Recall:  0.633
F_1 score:  0.499

Observations and Insights: RMSE has improved slightly and come down from 1.0252 to 1.0143. Precision is roughly the same though is 0.2% higher at 41.2%. Recall remains the same as the baseline model at 63.3% and F1 remains basically the same having only improved by 0.1% from the baseline model.

In [292]:
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671
svd_optimized.predict(6958, 1671, r_ui = 2, verbose = True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.35   {'was_impossible': False}
Out[292]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.3541160193632484, details={'was_impossible': False})
In [169]:
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
svd_optimized.predict(6958, 3232, verbose = True)
user: 6958       item: 3232       r_ui = None   est = 1.45   {'was_impossible': False}
Out[169]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.4478129164989462, details={'was_impossible': False})

Observations and Insights: The tuned svd model has improved slightly by predicting the play count for our test user on 'Sleeping In' by The Postal Service at 1.35 which is up from 1.27 that the baseline model. But it is still quite far away from the actual play count of 2 and other models have predicted much closer.

In [170]:
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm
svd_recommendations = get_recommendations(df_final, 6958, 5, svd_optimized)
In [171]:
pd.DataFrame(svd_recommendations, columns=['song_id', 'predicted_plays'])
Out[171]:
song_id predicted_plays
0 7224 2.331359
1 8324 2.102053
2 6450 2.075093
3 5653 2.019267
4 5531 1.976763
In [172]:
# Ranking songs based on above recommendations
ranking_songs(svd_recommendations, final_play)
Out[172]:
song_id play_freq predicted_plays corrected_plays
2 7224 107 2.331359 2.234685
4 8324 96 2.102053 1.999991
3 6450 102 2.075093 1.976078
0 5531 618 1.976763 1.936538
1 5653 108 2.019267 1.923042
In [293]:
song_ids = [7224, 8324, 6450, 5531, 5653]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[293]:
['Secrets by OneRepublic',
 'Transparency by White Denim',
 'Brave The Elements by Colossal',
 "Victoria (LP Version) by Old 97's",
 'The Big Gundown by The Prodigy']

Observations and Insights: The top 5 recommendations using the optimised matrix factorisation model (svd) is an interesting list where the songs again tend to fall into the same kind of genre. However they are again a completely new unique set of songs with the exception of 'Secrets' by One Republic which was also in the user user similarity recommendations.

Interestingly 'Victoria' by The Old 97s is in there which was in the Top 10 Songs based on popularity when the threshold for minumum plays was very low. However, this list of recommended songs from a SME point of view tend to fit quite well together and indeed actually fit quite well with all of the other top 5 recommendations that the other models have produced with the exception of those very poppy songs mentioned earlier.

Cluster Based Recommendation System

In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.

In [298]:
# Make baseline clustering model
clust_baseline = CoClustering(random_state = 1)

# Training the algorithm on the train set
clust_baseline.fit(trainset)

# Let us compute precision@k and recall@k with k = 10
precision_recall_at_k(clust_baseline)
RMSE: 1.0487
Precision:  0.397
Recall:  0.582
F_1 score:  0.472
In [174]:
# Making prediction for user_id 6958 and song_id 1671
clust_baseline.predict(6958, 1671, r_ui = 2, verbose = True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.29   {'was_impossible': False}
Out[174]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.2941824757363074, details={'was_impossible': False})
In [175]:
# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
clust_baseline.predict(6958, 3232, verbose = True)
user: 6958       item: 3232       r_ui = None   est = 1.48   {'was_impossible': False}
Out[175]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.4785259100797417, details={'was_impossible': False})

Improving clustering-based recommendation system by tuning its hyper-parameters

In [176]:
# # Set the parameter space to tune
# param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}

# # Performing 3-fold grid search cross-validation
# gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# # Fitting data
# gs.fit(data)
# # Best RMSE score
# print(gs.best_score['rmse'])
# # Combination of parameters that gave the best RMSE score
# print(gs.best_params['rmse'])
1.06093984730647
{'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}

Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters here.

In [297]:
# Train the tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u = 5,n_cltr_i = 5, n_epochs = 10, random_state = 1)

clust_tuned.fit(trainset)

precision_recall_at_k(clust_tuned)
RMSE: 1.0654
Precision:  0.394
Recall:  0.566
F_1 score:  0.465

Observations and Insights: The Clustering baseline model has an RMSE score of 1.0487 and yet the tuned model has a higher RMSE of 1.0654. The model seems to have decreased in performance on every score from tuning the hyperparamaters as the baseline Precision was 37.9% and tuned it's 37.4%. Recall was 0.582 and now it's 0.566 and F1 is down from 47.2% to 45.7%. So clearly tuning the paramaters here did not improve the model and the baseline model scores are from the best.

The prediction of our user to play 'Sleeping In' by The Postal Service was 1.29 which is far away from the 2 actual plays. The other models have outperformed this model. The tuned model predicted it even lower at 0.84 which is even less that one play! This model is proving itself very unreliable and I would not be confident using it.

In [294]:
# Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671
clust_tuned.predict(6958, 1671, r_ui = 2, verbose = True)
user: 6958       item: 1671       r_ui = 2.00   est = 0.84   {'was_impossible': False}
Out[294]:
Prediction(uid=6958, iid=1671, r_ui=2, est=0.8386051957794463, details={'was_impossible': False})
In [179]:
# Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
clust_tuned.predict(6958, 3232, verbose = True)
user: 6958       item: 3232       r_ui = None   est = 2.24   {'was_impossible': False}
Out[179]:
Prediction(uid=6958, iid=3232, r_ui=None, est=2.2357375431480007, details={'was_impossible': False})

Observations and Insights:_

Implementing the recommendation algorithm based on optimized CoClustering model

In [303]:
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = get_recommendations(df_final, 6958, 5, clust_tuned)

Correcting the play_count and Ranking the above songs

In [301]:
# Ranking songs based on the above recommendations
ranking_songs(clustering_recommendations, final_play)
Out[301]:
song_id play_freq predicted_plays corrected_plays
2 7224 107 3.094797 2.998124
4 8324 96 2.311498 2.209436
1 9942 150 2.215039 2.133390
0 5531 618 2.124563 2.084337
3 4831 97 2.123783 2.022248
In [299]:
# displaying the top 5 songs recommended by the tuned clustering algorithm

song_ids = [8324, 5531, 4831, 352, 8831]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[299]:
['Dog Days Are Over (Radio Edit) by Florence + The Machine',
 'Heaven Must Be Missing An Angel by Tavares',
 'Secrets by OneRepublic',
 "Bigger Isn't Better by The String Cheese Incident",
 'The Big Gundown by The Prodigy']
In [302]:
# displaying the top 5 songs recommended by the clustering baseline algorithm

song_ids = [7224, 8324, 9942, 5531, 4831]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Out[302]:
['Heaven Must Be Missing An Angel by Tavares',
 'Secrets by OneRepublic',
 'Greece 2000 by Three Drives',
 "Victoria (LP Version) by Old 97's",
 'The Big Gundown by The Prodigy']

Observations and Insights: The top 5 recommended songs using the optimized clustering algorithm is quite interesting! It contains a few of the songs found in other models' recommendations such as 'Secret' by One Republic and 'The Big Gundown' by The Prodigy and 'Dog Days Are Over' by Florece and the Machine. But then it has two other songs that are 1970s disco and funk songs - very different!

Given that the optimised clustering algorithm had worse scores than the baseline one I tried the recommendations with the baseline clustering algorithm too and things got weirder! It was a mixture of everything we've seen so far! Heaven Must be Missing an Angel - the disco funk one 'Secrets' - this has come up in a few of the models including the popularity one 'Victoria' which is the Old 97s song that comes up in the popularity model and the svd one. 'The Big Gundown' by the Prodigy which we have seen in some of the other models too. 'Greece 2000' which is a european pop rave kind of track and doesn't fit very well with even the other songs of a more dance nature.

I do not have much confidence in the clustering models for this project. They have given bad scores, bad predictions and quite strange recommendations.

Content Based Recommendation Systems

Think About It: So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?

In [182]:
df_small = df_final
In [183]:
df_small
Out[183]:
user_id song_id play_count title release artist_name year text
200 6958 447 1 Daisy And Prudence Distillation Erin McKeown 2000 Daisy And Prudence Distillation Erin McKeown
202 6958 512 1 The Ballad of Michael Valentine Sawdust The Killers 2004 The Ballad of Michael Valentine Sawdust The Ki...
203 6958 549 1 I Stand Corrected (Album) Vampire Weekend Vampire Weekend 2007 I Stand Corrected (Album) Vampire Weekend Vamp...
204 6958 703 1 They Might Follow You Tiny Vipers Tiny Vipers 2007 They Might Follow You Tiny Vipers Tiny Vipers
205 6958 719 1 Monkey Man You Know I'm No Good Amy Winehouse 2007 Monkey Man You Know I'm No Good Amy Winehouse
... ... ... ... ... ... ... ... ...
1999734 47786 9139 1 Half Of My Heart Battle Studies John Mayer 0 Half Of My Heart Battle Studies John Mayer
1999736 47786 9186 1 Bitter Sweet Symphony Bitter Sweet Symphony The Verve 1997 Bitter Sweet Symphony Bitter Sweet Symphony Th...
1999745 47786 9351 2 The Police And The Private Live It Out Metric 2005 The Police And The Private Live It Out Metric
1999755 47786 9543 1 Just Friends Back To Black Amy Winehouse 2006 Just Friends Back To Black Amy Winehouse
1999765 47786 9847 1 He Can Only Hold Her Back To Black Amy Winehouse 2006 He Can Only Hold Her Back To Black Amy Winehouse

117876 rows × 8 columns

In [184]:
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
df_small['text'] = df_small['title'] + ' ' + df_small['release'] + ' ' + df_small['artist_name']
In [185]:
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]
In [186]:
#Let us drop the duplicate records from title column
df_small = df_small.drop_duplicates(subset = ['title'])
In [187]:
# Set the title column as the index
df_small = df_small.set_index('title')
In [188]:
# See the first 5 records of the df_small dataset
df_small.head()
Out[188]:
user_id song_id play_count text
title
Daisy And Prudence 6958 447 1 Daisy And Prudence Distillation Erin McKeown
The Ballad of Michael Valentine 6958 512 1 The Ballad of Michael Valentine Sawdust The Ki...
I Stand Corrected (Album) 6958 549 1 I Stand Corrected (Album) Vampire Weekend Vamp...
They Might Follow You 6958 703 1 They Might Follow You Tiny Vipers Tiny Vipers
Monkey Man 6958 719 1 Monkey Man You Know I'm No Good Amy Winehouse
In [189]:
# Create the series of indices from the data
indices = pd.Series(df_small.index)

indices[ : 5]
Out[189]:
0                 Daisy And Prudence
1    The Ballad of Michael Valentine
2          I Stand Corrected (Album)
3              They Might Follow You
4                         Monkey Man
Name: title, dtype: object
In [190]:
# Importing necessary packages to work with text data
import nltk

# Download punkt library
nltk.download("punkt")

# Download stopwords library
nltk.download("stopwords")

# Download wordnet 
nltk.download("wordnet")

# Import regular expression
import re

# Import word_tokenizer
from nltk import word_tokenize

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Import stopwords
from nltk.corpus import stopwords

# Import CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

We will create a function to pre-process the text data:

In [191]:
import nltk
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Out[191]:
True
In [192]:
# Function to tokenize the text
def tokenize(text):
    
    text = re.sub(r"[^a-zA-Z]"," ", text.lower())
    
    tokens = word_tokenize(text)
    
    words = [word for word in tokens if word not in stopwords.words("english")]  # Use stopwords of english
    
    text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]

    return text_lems
In [193]:
# Create tfidf vectorizer 
tfidf = TfidfVectorizer(tokenizer = tokenize) 
In [194]:
# Fit_transfrom the above vectorizer on the text column and then convert the output into an array
song_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()
In [195]:
pd.DataFrame(song_tfidf)
Out[195]:
0 1 2 3 4 5 6 7 8 9 ... 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
556 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
557 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
558 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
559 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
560 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

561 rows × 1437 columns

In [196]:
type(song_tfidf)
Out[196]:
numpy.ndarray
In [197]:
# Compute the cosine similarity for the tfidf above output
similar_songs = cosine_similarity(song_tfidf, song_tfidf)
similar_songs
Out[197]:
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
In [200]:
# Function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):
    
    recommended_songs = []
    
    # Getting the index of the song that matches the title
    idx = indices[indices == title].index[0]

    # Creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # Getting the indexes of the 10 most similar songs
    top_10_indexes = list(score_series.iloc[1 : 11].index)
    print(top_10_indexes)
    
    # Populating the list with the titles of the best 10 matching songs
    for i in top_10_indexes:
        recommended_songs.append(list(df_small.index)[i])
        
    return recommended_songs

Finally, let's create a function to find most similar songs to recommend for a given song.

Recommending 10 songs similar to Learn to Fly

In [202]:
# Make the recommendation for the song with title 'Learn To Fly'
recommendations('Learn To Fly', similar_songs)
[509, 234, 423, 345, 394, 370, 371, 372, 373, 375]
Out[202]:
['Everlong',
 'The Pretender',
 'Nothing Better (Album)',
 'From Left To Right',
 'Lifespan Of A Fly',
 'Under The Gun',
 'I Need A Dollar',
 'Feel The Love',
 'All The Pretty Faces',
 'Bones']
In [305]:
df_final.loc[df_final['title'] == 'From Left To Right']
Out[305]:
user_id song_id play_count title release artist_name year text
481 42302 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
2435 6901 4739 2 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
2887 12040 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
4521 10820 4739 3 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
4868 34039 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
... ... ... ... ... ... ... ... ...
112153 28099 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
113526 72251 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
114659 61350 4739 3 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
115310 6881 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip
116419 52592 4739 1 From Left To Right Corymb Boom Bip 2003 From Left To Right Corymb Boom Bip

112 rows × 8 columns

Observations and Insights: The content based recommendations has produced some interesting results. We don't have RMSE and Precision and Recall scores to go by with this model, but we can create a list of recommendations based on a song.

Here we have used Foo Fighters' 'Learn To Fly' as our song to create recommendations with. Bearing in mind it can only use the title, artist name and album name to go by for creating the similiarity matrix.

  • The first two songs are also very popular songs by Foo Fighters so these make sense.
  • From Left to Right is song by The Postal Service (they keep cropping up don't they?!) and although this band is much more obscure than Foo Fighters, and they are much more indie than rock, this song is one of their more popular songs.
  • 'Lifespan Of A Fly' is a very bizzare quirky female singer-songwriter song. I am struggling to see any connection between this song and 'Learn To Fly'.
  • Then there are 3 songs by The Killers in these last 5 songs. I can see a connection between liking Foo Fighters and liking The Killers. Especially in 2011 when the scene was still active and Foo Fighters at that time were encaptured in that scene (even though they have transversed through many scenes and always remained very much the same).
  • Feel The Love by Cut Copy is again in that similar indie scene of that time.

All in all I'd say this is a pretty decent top 10 recommendation though there are 5 songs which are all the same two artists.

In [206]:
# Build the default user-user-similarity model
print("sim_user_user:")
precision_recall_at_k(sim_user_user)
print("")
#user user optimized
print("sim_user_user_opt:")
precision_recall_at_k(sim_user_user_opt)
print("")
# Using the baseline similarity measure for item-item based collaborative filtering
print("algo_knn_item:")
precision_recall_at_k(algo_knn_item)
print("")

print("algo_knn_item_opt:")
precision_recall_at_k(algo_knn_item_opt)
print("")
# Baseline matrix factorization based recommendation system
print("svd:")
precision_recall_at_k(svd)
print("")
print("svd_optimized:")
precision_recall_at_k(svd_optimized)
print("")
# Make baseline clustering model
print("clust_baseline:")
precision_recall_at_k(clust_baseline)
print("")
print("clust_tuned:")
precision_recall_at_k(clust_tuned)
sim_user_user:
RMSE: 1.0878
Precision:  0.396
Recall:  0.692
F_1 score:  0.504

sim_user_user_opt:
RMSE: 1.0521
Precision:  0.413
Recall:  0.721
F_1 score:  0.525

algo_knn_item:
RMSE: 1.0394
Precision:  0.307
Recall:  0.562
F_1 score:  0.397

algo_knn_item_opt:
RMSE: 1.0328
Precision:  0.408
Recall:  0.665
F_1 score:  0.506

svd:
RMSE: 1.0252
Precision:  0.41
Recall:  0.633
F_1 score:  0.498

svd_optimized:
RMSE: 1.0143
Precision:  0.412
Recall:  0.633
F_1 score:  0.499

clust_baseline:
RMSE: 1.0487
Precision:  0.397
Recall:  0.582
F_1 score:  0.472

clust_tuned:
RMSE: 1.0733
Precision:  0.393
Recall:  0.545
F_1 score:  0.457

Conclusion and Recommendations:

  • Refined Insights - What are the most meaningful insights from the data relevant to the problem?

The most meaninful insights I have found are that some of the models make pretty good predictions and good recommendations. This is without lots of specific tuning and experimenting with thresholds and paramaters.

It is also clear that working with the low play counts ie. play counts of 5 or less, I'm not sure this is the best way to use the dataset. We might have been better to cap all the playcounts above 5 so that they could stil be used instead of being cut out.

  • Comparison of various techniques and their relative performance - How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

From this table we can see the best RMSE score is from the svd optimised model, but the best Precision, Recall and F1 scores are all from the user-user-optimized model. So the user-user optimized model seems to be the most robust. It is quite evident in music that people do tend to fall into having a 'taste' in music, so people that you find that have similar taste in music. We have all had favourite radio and club DJs that seem to always choose records that we like but that our friend doesn't because they have a different taste.

Using user similarity does seem like a sensible option from this point of view.

I was also quite impressed with the recommendations produced by the item contents model. It would be interesting to explore this further by having the lyrics to songs, we could find simliarity in lyrical themes of songs.

It would also be interesting to have the data of the songwriter. This could be great for finding cover versions of the same song but done by different artists in different styles.

Another idea would be to use the natural language packege look at online reviews of songs or albums for more content based recommendations.

  • Proposal for the final solution design - What model do you propose to be adopted? Why is this the best solution to adopt?

I would propose to use the user-user similarity model as the basis for recommendations. But I would also look at a hybrid model that also used the svd matrix estimation with some extra tuning. And also to look at using the content based recommendations.

For cold starts - when a user is new and we have no information about them or know what songs they like, then the most obvious thing to do seems to be to use the popularity based recommendations. As soon as they hit a play count of more than 1 for a song we can start using that to invoke other models as we gather more information about which songs they like and with which other users their taste seems to align.