df_final.to_csv('/content/drive/My Drive/dfFinal.csv',index=False)
Now that we have explored the data, let's apply different algorithms to build recommendation systems.
Note: Use the shorter version of the data, i.e., the data after the cutoffs as used in Milestone 1.
df_final = pd.read_csv('/content/drive/MyDrive/dfFinal.csv')
Let's take the play frequency and the average play count of each song and build a popularity-based recommendation system on top of them.
# Calculating average play_count
average_count = df_final.groupby('song_id').mean()['play_count']# Hint: Use groupby function on the song_id column
# Calculating the frequency a song is played
play_freq = df_final.groupby('song_id').count()['play_count'] # Hint: Use groupby function on the song_id column
# Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count': average_count, 'play_freq': play_freq, })
# Let us see the first five records of the final_play dataset
final_play.head()
# Set the figure size
plt.figure(figsize = (30, 10))
sns.barplot(x = final_play.index,
y = 'play_freq',
data = final_play,
estimator = np.mean)
# Set the y label of the plot
plt.ylabel('Play Frequency')
#set the x label
plt.xlabel('Songs')
#Set the title of the plot
plt.title("Showing Play Frequency of Songs in the Final Play Dataset")
# Show the plot
plt.show()
Now, let's create a function to find the top n songs for recommendation based on the average play count of a song. We can also add a threshold for a minimum number of play counts for a song to be considered for recommendation.
# Build the function to find top n songs
def top_n_songs(data, n, min_plays):
# Finding songs with interactions greater than the minimum number of interactions
recommendations = data[data['play_freq'] > min_plays]
# Sorting values with respect to the average rating
recommendations = recommendations.sort_values(by = 'avg_count', ascending = False)
return recommendations.index[:n]
final_play.dtypes
final_play.head()
# Recommend top songs using the function defined above (experimenting with n = 40 and a minimum of 500 plays)
list(top_n_songs(final_play, 40, 500))
song_ids = [5531, 2220, 352, 4448, 1334, 8092, 8138, 7416, 8582, 4152]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
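Since this ID-to-title lookup is repeated several more times below, a small helper could replace it. This is only a sketch; the function name is hypothetical and it assumes df_final keeps the 'song_id', 'title' and 'artist_name' columns.
# Hypothetical helper: map a list of song_ids to "title by artist" strings
def songs_to_titles(ids, data):
    subset = data.loc[data['song_id'].isin(ids)].drop_duplicates(subset = 'title')
    return (subset['title'] + " by " + subset['artist_name']).tolist()
# Example usage: songs_to_titles([5531, 2220, 352], df_final)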
# Recommend top 10 songs using the function defined above
# Experimenting with threshold of min plays of 1
list(top_n_songs(final_play, 10, 1))
song_ids = [7224, 8324, 6450, 9942, 5531, 5653, 8483, 2220, 657, 614]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
With a threshold of 500 minimum plays, I tried to produce a top 40, but there are only 21 songs with a play frequency above 500. For a top 10 this threshold is fine and truly representative of very popular songs, but if we wanted a longer list of, say, 40 songs, we would have to use a lower threshold.
I also experimented with very low thresholds of minimum plays, and with no threshold at all. With no threshold or a very low one (the plot shows that most songs have at least 100 plays), the top n songs are ranked purely on average play_count. The top song, 'Victoria' by Old 97's (yeah, I've never heard of them either!), has an average count of 3.38 but a play frequency of only 107. With a threshold of 500, the top song is 'Secrets' by One Republic, with a lower average play_count of 2.31 but a much larger number of listeners at 618.
So if we are really looking for the most popular songs, we should use a higher play-frequency threshold and then order the results by average play_count. With low thresholds, the average play counts are so closely grouped that they don't do a great sorting job on their own.
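A quick check (a sketch, reusing the final_play dataframe built above) confirms how the choice of threshold limits the pool of candidate songs:
# Count how many songs clear each minimum-play-frequency threshold
for min_plays in (100, 200, 500):
    print(min_plays, (final_play['play_freq'] > min_plays).sum())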
We can use top_n_songs with further filters such as countries or regions. We could also apply it to the catalogue of a particular artist to find their most popular songs, or even to a song title along with the songwriter (if we had this information) so that people could find the most popular versions of a song that has been covered many times.
To build the user-user-similarity-based and subsequent models we will use the "surprise" library.
# Install the surprise package using pip (run this cell once if the library is not already installed)
!pip install surprise
# Import necessary libraries
# To compute the accuracy of models
from surprise import accuracy
# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader
# Class for loading datasets
from surprise.dataset import Dataset
# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV
# For splitting the data in train and test dataset
from surprise.model_selection import train_test_split
# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic
# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD
# For implementing KFold cross-validation
from surprise.model_selection import KFold
# For implementing clustering-based recommendation system
from surprise import CoClustering
Below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.
Think About It: Which metric should be used for this problem to compare different models?
# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
from collections import defaultdict  # used below to group the predictions by user
def precision_recall_at_k(model, k = 30, threshold = 1.5):
"""Return precision and recall at k metrics for each user"""
# First map the predictions to each user.
user_est_true = defaultdict(list)
# Making predictions on the test data
predictions=model.test(testset)
for uid, _, true_r, est, _ in predictions:
user_est_true[uid].append((est, true_r))
precisions = dict()
recalls = dict()
for uid, user_ratings in user_est_true.items():
# Sort user ratings by estimated value
user_ratings.sort(key = lambda x : x[0], reverse = True)
# Number of relevant items
n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
# Number of recommended items in top k
n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])
# Number of relevant and recommended items in top k
n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
for (est, true_r) in user_ratings[ : k])
# Precision@K: Proportion of recommended items that are relevant
# When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0
precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
# Recall@K: Proportion of relevant items that are recommended
# When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0
recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
# Mean of all the predicted precisions are calculated
precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)
# Mean of all the predicted recalls are calculated
recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
accuracy.rmse(predictions)
# Command to print the overall precision
print('Precision: ', precision)
# Command to print the overall recall
print('Recall: ', recall)
# Formula to compute the F-1 score
print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))
Think About It: In the function precision_recall_at_k above the threshold value used is 1.5. How precision and recall are affected by changing the threshold? What is the intuition behind using the threshold value of 1.5?
The threshold of 1.5 represents a user playing a song 1.5 times. This becomes the boundary that turns this regression problem into something more like a classification problem: if a user plays the song 1.5 times or more, they are considered to like the song that was predicted, and the prediction counts as successful.
A threshold lower than this would be difficult to interpret, as playing a song once doesn't really tell us whether the listener liked it or not. Playing a song more than once is a fairly good indication that they were interested in it and that the recommendation had some value to them.
I experimented with changing the threshold to 2, then 3, then 1. A threshold of 1 gave precision and recall scores of almost 100%, but that tells us little about how well the model is working. Thresholds of 2 and 3 gave much smaller scores, because it is rare for a user to play a song three times. So 1.5 seems the best value for getting a general sense of how the model is performing.
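Once the testset and a fitted model (for example the sim_user_user model built further below) exist, this experiment can be reproduced with a quick sweep; this is only a sketch, not the exact cells that were run:
# Sweep the relevance threshold to see its effect on precision and recall
for t in (1, 1.5, 2, 3):
    print("threshold =", t)
    precision_recall_at_k(sim_user_user, threshold = t)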
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5)) #use rating scale (0, 5)
# Loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader) # Take only "user_id","song_id", and "play_count"
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size= 0.4, random_state = 42) # Take test_size = 0.4
Think About It: How changing the test size would change the results and outputs?
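One way to explore this (a sketch, reusing the rating data loaded above and a default cosine user-user model) is to re-split with different test sizes and compare the RMSE:
# Compare RMSE for different train/test splits
for ts in (0.2, 0.3, 0.4):
    tr, te = train_test_split(data, test_size = ts, random_state = 42)
    algo = KNNBasic(sim_options = {'name': 'cosine', 'user_based': True}, verbose = False)
    algo.fit(tr)
    print("test_size =", ts)
    accuracy.rmse(algo.test(te))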
# Build the default user-user-similarity model
sim_options = {'name': 'cosine',
'user_based': True}
# KNN algorithm is used to find desired similar items
sim_user_user = KNNBasic(sim_options = sim_options, verbose = False, random_state = 1)
# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user) # Use sim_user_user model
Observations and Insights: We can see our RMSE is 1.09. Precision seems low at 39%, meaning that only 39% of the songs recommended were of interest to the user. Recall is at 69%, which means that of all the songs the listener would play more than once, 69% are being recommended. F1, the harmonic mean of Precision and Recall, is 50%. We want to focus on Precision in this project, since we want the listener to trust the recommendations and continue to find music they enjoy. If the top songs recommended to them are of no interest, they will lose trust in the platform and potentially move to another.
We will see if we can push the precision higher and the RMSE lower, even a little, by tuning the hyperparameters of the model.
# Predicting play_count for a sample user with a listened song
sim_user_user.predict(6958, 1671, r_ui = 2, verbose = True) # Use user id 6958 and song_id 1671
df_final.loc[df_final['song_id'] == 1671]
# Predicting play_count for a sample user with a song not-listened by the user
sim_user_user.predict(6958, 3232, verbose = True) # Use user_id 6958 and song_id 3232
df_final.loc[df_final['song_id'] == 3232]
Observations and Insights:
We can see that the model estimated the user would listen to 'Sleeping In' by The Postal Service 1.8 times when in reality they listened 2 times (worth noting that play counts are only given as integers, so we will rarely get an exact match given that the prediction is a float). This is pretty good.
For the prediction of whether the same user would listen to 'Life In Technicolor' by Coldplay, it predicted 1.6 plays. This seems reasonable just from my own knowledge of the songs. The two are not too dissimilar in genre and sound, though arguably someone who liked The Postal Service might be a little more niche in their taste than a big popular band like Coldplay. Still, the songs are close enough, and the prediction seems reasonable purely from a musical point of view; we don't know the actual play count, as the user has not heard the song.
Now, let's try to tune the model and see if we can improve the model performance.
# # Setting up parameter grid to tune the hyperparameters
# param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
# 'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
# 'user_based': [True], "min_support": [2, 4]}
# }
# #Performing 3-fold cross-validation to tune the hyperparameters
# gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)
# #Fitting the data
# gs.fit(data) # Use entire data for GridSearch
# #Best RMSE score
# print(gs.best_score['rmse'])
# #Combination of parameters that gave the best RMSE score
# print(gs.best_params['rmse'])
# Train the best model found in the above grid search
# Build the optimized user-user-similarity model with the best parameters
sim_options = {'name': 'pearson_baseline',
'user_based': True,
'min_support': 2}
# KNN algorithm is used to find desired similar items
sim_user_user_opt = KNNBasic(sim_options = sim_options, k = 30, min_k = 9, verbose = False, random_state = 1)
# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user_opt.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user_opt) # Use the sim_user_user_opt model
Observations and Insights: We have gained a little Precision: it is now 42%, up from 39%. Recall is also higher at 72%, and our RMSE is lower at 1.0521, down from 1.0878. This shows that tuning the hyperparameters has improved the performance of the model.
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2
sim_user_user_opt.predict(6958, 1671, r_ui = 2, verbose = True)
# Predict the play count for a song that is not listened to by the user (with user_id 6958)
sim_user_user_opt.predict(6958, 3232, verbose = True)
Observations and Insights: We can see the prediction of the user listening to 'Sleeping In' by The Postal Service is more accurate with this optimized model, which predicts 1.96 plays where the actual was 2, up from the baseline model's prediction of 1.8.
It also predicts the same user playing 'Life in Technicolor' by Coldplay 1.45 times, down from the 1.64 predicted by the baseline model. We don't know the real number of plays for this song since the listener hasn't heard it, but a lower number from the more accurate model ties in with my expectations from an SME (Subject Matter Expert) perspective.
We can conclude that this optimized model is better than the baseline one and seems quite accurate.
Think About It: Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain song?
# Use inner id 0 (note: for this user-based model, these are the 5 most similar users, not songs)
sim_user_user_opt.get_neighbors(0, k=5)
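The neighbours are returned as inner ids of the training set. A small sketch to translate them back to raw user ids (since this is a user-based model, the neighbours are users):
# Map the inner ids of the 5 nearest neighbours back to raw user ids
nearest_inner = sim_user_user_opt.get_neighbors(0, k = 5)
print([trainset.to_raw_uid(inner_id) for inner_id in nearest_inner])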
Below we will implement a function whose input parameters are the interaction data, the user_id, the number of recommendations top_n, and the trained algorithm.
def get_recommendations(data, user_id, top_n, algo):
# Creating an empty list to store the recommended product ids
recommendations = []
# Creating an user item interactions matrix
user_item_interactions_matrix = data.pivot_table(index = 'user_id', columns = 'song_id', values = 'play_count')
# Extracting those song ids which the user_id has not listened to yet
non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
# Looping through each of the song ids which user_id has not interacted with yet
for song_id in non_interacted_products:
# Predicting the play_count for the songs not yet listened to by this user
est = algo.predict(user_id, song_id).est
# Appending the predicted play_count
recommendations.append((song_id, est))
# Sorting the predicted play_counts in descending order
recommendations.sort(key = lambda x : x[1], reverse = True)
return recommendations[:top_n] # Returning top n songs with the highest predicted play_count for this user
# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_user_user_opt)
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['song_id', 'predicted_plays'])
song_ids = [5531, 317, 4954, 8635, 5943]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Observations and Insights: Here we have the top 5 songs for our user 6958 recommended by the sim_user_user_opt algorithm, ranked in order of predicted plays. The first 4 songs in this recommendation list look pretty good: a similar style and sound, all in the indie / electro / indie-pop and indie-dance genres. But I must say the 5th song does not seem to fit, as it is a Spanish pop song that is very different in genre and general artistic style.
def ranking_songs(recommendations, final_play):
# Sort the songs based on play counts
ranked_songs = final_play.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending = False)[['play_freq']].reset_index()
# Merge with the recommended songs to get predicted play_count
ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns = ['song_id', 'predicted_plays']), on = 'song_id', how = 'inner')
# Rank the songs based on corrected play_count
ranked_songs['corrected_plays'] = ranked_songs['predicted_plays'] - 1 / np.sqrt(ranked_songs['play_freq'])
# Sort the songs based on corrected play_counts
ranked_songs = ranked_songs.sort_values('corrected_plays', ascending = False)
return ranked_songs
Think About It: In the above function to correct the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting?
Here we take into account the number of users who have listened to a song as well as the number of times they played it. A song with a high play count from only a small number of users should not rank above a song with an average-to-high play count from many different users, so we adjust the predicted plays to account for this.
We subtract 1/sqrt(play_freq) so that every corrected prediction comes down by a different amount, with songs heard by few listeners penalised the most; this gives a more conservative estimate. We could instead add 1/sqrt(play_freq), so that all the corrected plays would be a little higher, as there is no upper limit on play count.
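A quick numeric illustration (with made-up numbers) of how the correction separates two songs with the same predicted play count:
import numpy as np
# Two hypothetical songs, both predicted at 1.9 plays, but with different audience sizes
for play_freq in (20, 500):
    print(play_freq, round(1.9 - 1 / np.sqrt(play_freq), 2))
# The song heard by only 20 users drops to about 1.68, while the one heard by 500 users stays near 1.86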
#Applying the ranking_songs function on the final_play data
ranking_songs(recommendations, final_play)
Observations and Insights: Here the ranking_songs function has produced the same top 5 songs, but their order has changed slightly due to the corrected play count predictions.
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance
sim_options = {'name': 'cosine',
'user_based': False}
# Defining nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options=sim_options,verbose=False, random_state = 1)
# Train the algorithm on the train set
algo_knn_item.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(algo_knn_item)
Observations and Insights: In this model, which uses cosine similarity between songs (item-item nearest neighbours), our baseline has an RMSE of 1.0394, lower (and better) than both of the user-user models. However, Precision is lower than both at 30%, as are Recall at 56% and F1 at 39%. As we care most about Precision, we definitely want to improve that 30%, though it is interesting that the error is lower for this model.
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user
algo_knn_item.predict(6958, 1671, r_ui=2, verbose=True)
#Finding users who have not listened to song_id 1671
specific_song = 1671
# Create a Boolean mask for specific song_id
mask = df_final['song_id'] == specific_song
# Group by user_id and aggregate the mask using the `any` function
grouped = df_final.groupby('user_id')['song_id'].agg(listened_to_song = lambda x: any(x == specific_song))
# Filter the grouped data to only show users who have not listened to the song
result = grouped[grouped['listened_to_song'] == False]
result
# Predict the play count for a user that has not listened to the song (with song_id 1671)
algo_knn_item.predict(76300, 1671, verbose = True)
Observations and Insights: Our user, who listened to 'Sleeping In' by The Postal Service 2 times, is predicted to listen only 1.34 times by this model. This is less accurate than the previous models, which predicted 1.80 and 1.96. It has also predicted that user 76300, who has never heard this song, would listen to it 2.03 times, but we are not feeling too confident in this model.
# Apply grid search for enhancing model performance
# Setting up parameter grid to tune the hyperparameters
#param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
#'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
# 'user_based': [False], "min_support": [2, 4]}
#}
# Performing 3-fold cross-validation to tune the hyperparameters
#gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)
# Fitting the data
#gs.fit(data)
# Best RMSE score
#print(gs.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
#print(gs.best_params['rmse'])
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameters here.
# Apply the best model found in the grid search
sim_options = {'name': 'pearson_baseline',
'user_based': False,
'min_support': 2}
# Defining nearest neighbour algorithm
algo_knn_item_opt = KNNBasic(sim_options = sim_options, k = 30, min_k = 6, verbose = False, random_state = 1)
# Train the algorithm on the train set
algo_knn_item_opt.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(algo_knn_item_opt)
Observations and Insights: Tuning the hyperparameters has helped this model a little. The RMSE shows only a very slight improvement, from 1.0394 to 1.0328, but Precision has improved from 30% to 41% and Recall from 56% to 66.5%. In terms of these scores it is still not performing as well as the user-user similarity models, however.
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
algo_knn_item_opt.predict(6958, 1671, r_ui=2, verbose=True)
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
algo_knn_item_opt.predict(6958, 3232, verbose=True)
Observations and Insights: Wow, the predicted plays for this user listening to 'Sleeping In' by The Postal Service have improved a lot. The baseline model predicted 1.36 plays, but this tuned model predicts 1.96, which is very close to the actual play count of 2.
For the same user listening to 'Life In Technicolor' by Coldplay it predicts 1.28 plays. We don't know how many times they would actually play it, because they have never heard it.
# Find five most similar items to the item with inner id 0
algo_knn_item_opt.get_neighbors(0, k=5)
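These are again inner ids; to see which songs they actually are, we can map them back to raw song_ids and look up the titles (a sketch reusing df_final and the trainset from above):
# Translate the inner item ids of the 5 nearest neighbours into song titles
neighbour_ids = [trainset.to_raw_iid(i) for i in algo_knn_item_opt.get_neighbors(0, k = 5)]
df_final.loc[df_final['song_id'].isin(neighbour_ids), ['song_id', 'title', 'artist_name']].drop_duplicates('song_id')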
# Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine
sim_item_recommendations = get_recommendations(df_final, 6958, 5, algo_knn_item_opt)
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(sim_item_recommendations, columns=['song_id', 'predicted_plays'])
# Applying the ranking_songs function
ranking_songs(sim_item_recommendations, final_play)
song_ids = [2342, 5101, 139, 7519, 8099]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Observations and Insights: We have an entirely different set of song recommendations for this user with the item-item similarity algorithm compared to the user-user similarity. These songs again seem to fit together, apart from one very pop track, 'Toxic' by Britney Spears; maybe our user does have a pop leaning too. The predicted plays of the top songs here are roughly in the same range as in the user-user similarity.
Model-based Collaborative Filtering is a personalized recommendation approach: the recommendations are based on the past behaviour of the user and do not depend on any additional information. We use latent features to find recommendations for each user.
# Build baseline model using svd
svd = SVD(random_state=1)
# Training the algorithm on the train set
svd.fit(trainset)
# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd)
Observations and Insights: We have the lowest RMSE score of all our models so far. Precision is 41%, the same as the optimized user-user model, which is the best so far, and Recall is 63%, which is only better than the item-item baseline model. F1 is ~50%, again only better than the item-item baseline. So the RMSE and Precision of this model are strong, but its Recall lets it down.
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2
svd.predict(6958, 1671, r_ui = 2, verbose = True)
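To make the latent-feature idea concrete, the same estimate can be rebuilt by hand from the factors the model learned. This is only a sketch; it assumes both ids appear in the trainset and that the estimate is not clipped to the rating scale:
# Reconstruct the SVD estimate: global mean + user bias + item bias + dot product of latent vectors
import numpy as np
inner_u = trainset.to_inner_uid(6958)
inner_i = trainset.to_inner_iid(1671)
manual_est = (trainset.global_mean
              + svd.bu[inner_u] + svd.bi[inner_i]
              + np.dot(svd.qi[inner_i], svd.pu[inner_u]))
print(round(manual_est, 2))  # should match the .est value printed above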
# Making a prediction for the user who has not listened to the song (song_id 3232)
#Finding users who have not listened to song_id 3232
specific_song = 3232
# Create a Boolean mask for specific song_id
mask = df_final['song_id'] == specific_song
# Group by user_id and aggregate the mask using the `any` function
grouped = df_final.groupby('user_id')['song_id'].agg(listened_to_song = lambda x: any(x == specific_song))
# Filter the grouped data to only show users who have not listened to the song
result = grouped[grouped['listened_to_song'] == False]
result
svd.predict(90, 3232, verbose = True)
#we also know that our test user 6958 has not listened to song 3232
svd.predict(6958, 3232, verbose = True)
Observations and Insights: Even though the model looks good from its scores, the individual prediction for user 6958 listening to 'Sleeping In' by The Postal Service (song_id 1671) is only 1.27, which is lower than all the other models and further from the actual play count of 2. This SVD model also predicts that the same user would listen to 'Life In Technicolor' by Coldplay (song_id 3232) 1.56 times.
# # Set the parameter space to tune
# param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
# 'reg_all': [0.2, 0.4, 0.6]}
# # Performe 3-fold grid-search cross-validation
# gs_= GridSearchCV(SVD, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# # Fitting data
# gs_.fit(data)
# # Best RMSE score
# print(gs_.best_score['rmse'])
# # Combination of parameters that gave the best RMSE score
# print(gs_.best_params['rmse'])
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters here.
# Build the optimized SVD model using the optimal hyperparameters found in the grid search above
svd_optimized = SVD(n_epochs=30, lr_all=0.01, reg_all=0.2)
# Train the algorithm on the train set
svd_optimized.fit(trainset)
# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd_optimized)
Observations and Insights: RMSE has improved slightly, coming down from 1.0252 to 1.0143. Precision is roughly the same, just 0.2% higher at 41.2%. Recall remains the same as the baseline model at 63.3%, and F1 is basically unchanged, improving by only 0.1% over the baseline.
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671
svd_optimized.predict(6958, 1671, r_ui = 2, verbose = True)
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
svd_optimized.predict(6958, 3232, verbose = True)
Observations and Insights: The tuned SVD model has improved slightly, predicting the play count for our test user on 'Sleeping In' by The Postal Service at 1.35, up from the baseline model's 1.27. But it is still quite far from the actual play count of 2, and other models have predicted much closer.
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm
svd_recommendations = get_recommendations(df_final, 6958, 5, svd_optimized)
pd.DataFrame(svd_recommendations, columns=['song_id', 'predicted_plays'])
# Ranking songs based on above recommendations
ranking_songs(svd_recommendations, final_play)
song_ids = [7224, 8324, 6450, 5531, 5653]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Observations and Insights: The top 5 recommendations from the optimised matrix factorisation (SVD) model make an interesting list where the songs again tend to fall into the same kind of genre. However, they are again a completely new set of songs, with the exception of 'Secrets' by One Republic, which also appeared in the user-user similarity recommendations.
Interestingly, 'Victoria' by the Old 97's is in there, which was in the top 10 songs based on popularity when the threshold for minimum plays was very low. From an SME point of view, this list of recommended songs fits quite well together, and indeed fits quite well with all the other models' top 5 recommendations, with the exception of the very poppy songs mentioned earlier.
In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.
# Make baseline clustering model
clust_baseline = CoClustering(random_state = 1)
# Training the algorithm on the train set
clust_baseline.fit(trainset)
# Let us compute precision@k and recall@k with k = 30
precision_recall_at_k(clust_baseline)
# Making prediction for user_id 6958 and song_id 1671
clust_baseline.predict(6958, 1671, r_ui = 2, verbose = True)
# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
clust_baseline.predict(6958, 3232, verbose = True)
# # Set the parameter space to tune
# param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}
# # Performing 3-fold grid search cross-validation
# gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
# # Fitting data
# gs.fit(data)
# # Best RMSE score
# print(gs.best_score['rmse'])
# # Combination of parameters that gave the best RMSE score
# print(gs.best_params['rmse'])
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters here.
# Train the tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u = 5,n_cltr_i = 5, n_epochs = 10, random_state = 1)
clust_tuned.fit(trainset)
precision_recall_at_k(clust_tuned)
Observations and Insights: The clustering baseline model has an RMSE of 1.0487, yet the tuned model has a higher RMSE of 1.0654. The model has decreased in performance on every score after tuning the hyperparameters: baseline Precision was 37.9% versus 37.4% tuned, Recall fell from 0.582 to 0.566, and F1 is down from 47.2% to 45.7%. So clearly tuning the parameters did not improve the model here, and even the baseline scores are far from the best.
The baseline's prediction of our user playing 'Sleeping In' by The Postal Service was 1.29, which is far from the 2 actual plays; the other models have outperformed this one. The tuned model predicted it even lower, at 0.84, which is less than one play! This model is proving very unreliable and I would not be confident using it.
# Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671
clust_tuned.predict(6958, 1671, r_ui = 2, verbose = True)
# Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
clust_tuned.predict(6958, 3232, verbose = True)
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = get_recommendations(df_final, 6958, 5, clust_tuned)
# Ranking songs based on the above recommendations
ranking_songs(clustering_recommendations, final_play)
# displaying the top 5 songs recommended by the tuned clustering algorithm
song_ids = [8324, 5531, 4831, 352, 8831]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
# displaying the top 5 songs recommended by the clustering baseline algorithm
song_ids = [7224, 8324, 9942, 5531, 4831]
df_filtered = df_final.loc[df_final['song_id'].isin(song_ids)]
df_deduplicated = df_filtered.drop_duplicates(subset='title')
df_deduplicated['song_info'] = df_deduplicated.apply(lambda x: x['title'] + " by " + x['artist_name'], axis=1)
song_titles = df_deduplicated['song_info'].tolist()
song_titles
Observations and Insights: The top 5 recommended songs from the optimized clustering algorithm are quite interesting! The list contains a few songs found in other models' recommendations, such as 'Secrets' by One Republic, 'The Big Gundown' by The Prodigy and 'Dog Days Are Over' by Florence and the Machine. But it also has two 1970s disco and funk songs, which are very different!
Given that the optimised clustering algorithm scored worse than the baseline one, I also tried the recommendations with the baseline clustering algorithm, and things got weirder: it was a mixture of everything we've seen so far. 'Heaven Must Be Missing an Angel' (the disco-funk one), 'Secrets' (which has come up in several models including the popularity one), 'Victoria' (the Old 97's song from the popularity model and the SVD one), 'The Big Gundown' by The Prodigy (seen in some of the other models too), and 'Greece 2000', a European pop-rave kind of track that doesn't fit well even with the other more dance-oriented songs.
I do not have much confidence in the clustering models for this project. They have given bad scores, bad predictions and quite strange recommendations.
Think About It: So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?
# Work on a copy so that adding new columns does not modify df_final
df_small = df_final.copy()
df_small
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
df_small['text'] = df_small['title'] + ' ' + df_small['release'] + ' ' + df_small['artist_name']
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]
#Let us drop the duplicate records from title column
df_small = df_small.drop_duplicates(subset = ['title'])
# Set the title column as the index
df_small = df_small.set_index('title')
# See the first 5 records of the df_small dataset
df_small.head()
# Create the series of indices from the data
indices = pd.Series(df_small.index)
indices[ : 5]
# Importing necessary packages to work with text data
import nltk
# Download punkt library
nltk.download("punkt")
# Download stopwords library
nltk.download("stopwords")
# Download wordnet
nltk.download("wordnet")
# Import regular expression
import re
# Import word_tokenizer
from nltk import word_tokenize
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
# Import stopwords
from nltk.corpus import stopwords
# Import CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
We will create a function to pre-process the text data:
import nltk
nltk.download('omw-1.4')
# Function to tokenize the text
def tokenize(text):
text = re.sub(r"[^a-zA-Z]"," ", text.lower())
tokens = word_tokenize(text)
words = [word for word in tokens if word not in stopwords.words("english")] # Use stopwords of english
text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]
return text_lems
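A quick sanity check of the tokenizer on a made-up string (lower-cased, punctuation stripped, stopwords removed, plurals lemmatized):
# Example: stopwords such as "to" and "is" are removed, and "fighters" is lemmatized to "fighter"
tokenize("Learn To Fly - There Is Nothing Left To Lose - Foo Fighters")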
# Create tfidf vectorizer
tfidf = TfidfVectorizer(tokenizer = tokenize)
# Fit_transfrom the above vectorizer on the text column and then convert the output into an array
song_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()
pd.DataFrame(song_tfidf)
type(song_tfidf)
# Compute the cosine similarity for the tfidf output above
from sklearn.metrics.pairwise import cosine_similarity  # in case it was not imported earlier
similar_songs = cosine_similarity(song_tfidf, song_tfidf)
similar_songs
# Function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):
recommended_songs = []
# Getting the index of the song that matches the title
idx = indices[indices == title].index[0]
# Creating a Series with the similarity scores in descending order
score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)
# Getting the indexes of the 10 most similar songs
top_10_indexes = list(score_series.iloc[1 : 11].index)
print(top_10_indexes)
# Populating the list with the titles of the best 10 matching songs
for i in top_10_indexes:
recommended_songs.append(list(df_small.index)[i])
return recommended_songs
Finally, let's use this function to find the most similar songs to recommend for a given song.
Recommending 10 songs similar to Learn to Fly
# Make the recommendation for the song with title 'Learn To Fly'
recommendations('Learn To Fly', similar_songs)
df_final.loc[df_final['title'] == 'From Left To Right']
Observations and Insights:
The content-based recommender has produced some interesting results. We don't have RMSE, Precision and Recall scores to go by with this model, but we can create a list of recommendations based on a song.
Here we have used Foo Fighters' 'Learn To Fly' as the seed song for the recommendations, bearing in mind the model can only use the title, artist name and album name to build the similarity matrix.
All in all I'd say this is a pretty decent top 10 recommendation, though 5 of the songs are by the same two artists.
# Compare the performance of all the models built above
print("sim_user_user:")
precision_recall_at_k(sim_user_user)
print("")
#user user optimized
print("sim_user_user_opt:")
precision_recall_at_k(sim_user_user_opt)
print("")
# Using the baseline similarity measure for item-item based collaborative filtering
print("algo_knn_item:")
precision_recall_at_k(algo_knn_item)
print("")
print("algo_knn_item_opt:")
precision_recall_at_k(algo_knn_item_opt)
print("")
# Baseline matrix factorization based recommendation system
print("svd:")
precision_recall_at_k(svd)
print("")
print("svd_optimized:")
precision_recall_at_k(svd_optimized)
print("")
# Make baseline clustering model
print("clust_baseline:")
precision_recall_at_k(clust_baseline)
print("")
print("clust_tuned:")
precision_recall_at_k(clust_tuned)
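To compare everything at a glance, these numbers could be tabulated instead of read off the printouts. This is only a sketch: it assumes precision_recall_at_k is modified to return (rmse, precision, recall, f1) rather than just printing them.
# Hypothetical comparison table (requires precision_recall_at_k to return its metrics)
models = {'sim_user_user': sim_user_user, 'sim_user_user_opt': sim_user_user_opt,
          'algo_knn_item': algo_knn_item, 'algo_knn_item_opt': algo_knn_item_opt,
          'svd': svd, 'svd_optimized': svd_optimized,
          'clust_baseline': clust_baseline, 'clust_tuned': clust_tuned}
results = pd.DataFrame([precision_recall_at_k(m) for m in models.values()],
                       index = models.keys(), columns = ['RMSE', 'Precision@30', 'Recall@30', 'F1@30'])
results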
The most meaningful insight I have found is that some of the models make pretty good predictions and good recommendations, even without lots of specific tuning and experimenting with thresholds and parameters.
It is also clear that working only with low play counts (i.e. play counts of 5 or less) may not be the best way to use the dataset. We might have been better off capping all the play counts above 5 so that those records could still be used instead of being cut out.
From this comparison we can see the best RMSE score is from the optimised SVD model, but the best Precision, Recall and F1 scores all come from the optimized user-user model.
So the optimized user-user model seems to be the most robust. It is quite evident in music that people tend to have a 'taste', so other users with a similar taste are a good source of recommendations. We have all had favourite radio and club DJs who always seem to choose records that we like but that a friend doesn't, because they have a different taste.
Using user similarity therefore seems like a sensible option from this point of view.
I was also quite impressed with the recommendations produced by the content-based model. It would be interesting to explore this further with song lyrics, so we could find similarity in the lyrical themes of songs.
It would also be interesting to have songwriter data. This could be great for finding cover versions of the same song performed by different artists in different styles.
Another idea would be to use natural language processing on online reviews of songs or albums for more content-based recommendations.
I would propose the user-user similarity model as the basis for recommendations, but I would also look at a hybrid approach that uses the SVD matrix factorisation with some extra tuning, as well as the content-based recommendations.
For cold starts, when a user is new and we have no information about which songs they like, the most obvious approach is to use the popularity-based recommendations. As soon as they hit a play count of more than 1 for a song, we can start using that to invoke the other models as we gather more information about which songs they like and which other users their taste aligns with.