The context: Why is this problem important to solve?
Music is important. I know someone whose life music literally saved: when they were in a very dark time, hearing 'Astral Weeks' by Van Morrison for the first time made them realise the beauty that can exist in the world, and that maybe it was worth sticking around for.
Music is spiritual, psychological, emotional, inspiring and evocative, and it can express what we are feeling when we are unable to. It is not a product to be consumed but a necessity for the soul.
Finding the music that speaks to you is so important, and an artist being able to reach their own niche listeners is vital to keeping new music thriving. Gone are the days when your friend makes you a mixtape or you record the top 40 from the Sunday night chart on the radio!
And with so many of the smaller (and larger!) music venues closed down, venues that once hosted the thriving live scenes where you would discover new bands and artists in your local town or city, streaming apps are now the main way for listeners to find their music and for artists to find their fans.
The objectives: The intended goal is to explore this dataset and come up with ideas on the best ways to give song recommendations to a user.
The key questions: What should a list of recommended songs look like? Should the top ten all be based on the same idea, or should a combination of ideas make up the perfect list of recommended songs?
The problem formulation: We can use data science to help find some of these recommendations. Using large datasets of users listening to songs, we can find users who are similar to other users. For example, a top song of User A will be a great recommendation for User B, who shares a very similar music taste.
We can also look for hidden variables that affect who likes to listen to what - patterns only the maths can find that we can't, though we might be able to make sense of them once the maths finds them! We can use data science to create all sorts of varied models that will hopefully give us a great and varied list of recommended songs that a user will not only be happy to listen to, but that might just change, or even save, their life.
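As a minimal sketch of this user-to-user idea (a toy play-count matrix with made-up numbers, purely for illustration; the real data arrives below), cosine similarity can score how alike two listeners are:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Toy user-song matrix: rows are users, columns are songs, values are play counts
plays = np.array([
    [5, 3, 0, 1],   # User A
    [4, 3, 0, 1],   # User B - tastes much like User A
    [0, 0, 5, 4],   # User C - very different taste
])
# Pairwise cosine similarity between the users' play-count vectors
print(cosine_similarity(plays).round(2))
# Users A and B come out highly similar, so User A's top songs are
# reasonable recommendations for User B, and vice versa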
The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.
song_data
song_id - A unique id given to every song
title - Title of the song
release - Name of the album the song was released on
artist_name - Name of the artist
year - Year of release
count_data
user_id - A unique id given to the user
song_id - A unique id given to the song
play_count - Number of times the song was played
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the rubric shared for each milestone. Unlike previous courses, it does not follow the pattern of graded questions in different sections; instead, it gives you a direction on what steps need to be taken to reach a feasible solution. Please note that this is just one way of doing it. There can be other 'creative' ways to solve the problem, and we encourage you to explore them as an 'optional' exercise.
In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')
# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd
# Basic library for data visualization
import matplotlib.pyplot as plt
# Slightly advanced library for data visualization
import seaborn as sns
# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity
# A dictionary subclass that returns a default value instead of raising a KeyError
from collections import defaultdict
# A performance metric from sklearn
from sklearn.metrics import mean_squared_error
# Importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/song_data.csv')
# See top 10 records of count_df data
count_df.head(10)
# See top 10 records of song_df data
song_df.head(10)
# See the info of the count_df
count_df.info()
# We will look at the shape of the count_df
count_df.shape
# Checking for missing values in the count_df
print(count_df.isna().sum())
# See the info of the song_df data
song_df.info()
# We will look at the shape of the song_df
song_df.shape
# Checking for missing values in song_df
print(song_df.isna().sum())
# Let's see how many unique values there are in each column of the count_df
count_df.nunique()
#Let's see how many unique values there are in each column of the song_df
song_df.nunique()
# Let's check to see how many duplicates there are in the song_df
song_df.duplicated().sum()
# Let's check to see how many duplicates there are in the count_df
count_df.duplicated().sum()
In the COUNT dataset we can see that the unnamed index column ('Unnamed: 0') and play_count are integer data types. The song and user IDs are object types, as they are encrypted; we should convert these to numerical IDs later to work with them more easily.
There are no duplicates and no missing values in the Count dataset.
In the SONG dataset we can see that all of the columns are object types, because they hold names or encrypted IDs, apart from the year column, which is an integer giving the year the song was released.
There are 498 duplicates in the Song dataset and 20 missing values.
I propose dropping the duplicate rows and the missing-value rows from this dataset, and converting the user and song IDs to numerical values.
# Dropping the missing value rows from the dataset
song_df = song_df.dropna()
# Checking all columns now have 999980 entries after dropping the missing values
song_df.info()
# Checking there are no longer any missing values
song_df.isna().sum()
# Dropping the duplicates in the song_df
song_df.drop_duplicates(inplace=True)
# Checking the values are now all (999980 - 498) = 999,482 after dropping the duplicates
song_df.info()
Observations and Insights: We now have the Song dataset without the 498 duplicates and without the 20 missing-value rows, with each column having 999,482 entries.
The Song dataset gives us more information about the song - connecting the song_ID with the Title, Artist, Album (release) and Year of release of the song. Whereas the Count dataset shows us which users have played which songs and how many times. It would be nice to merge these two datasets into one so we can use the extra song information for our recommendation models later.
We should also look at encoding the User and Song IDs into numerical values so we can more easily work with them inside our models that we will create.
The column 'Unnamed: 0' is of no use to us, so we will drop it.
# Left merge the count_df and song_df data on "song_id". Drop duplicates from song_df data simultaneously
merged_df = pd.merge(count_df, song_df.drop_duplicates(['song_id']), on="song_id", how="left")
# Drop the column 'Unnamed: 0'
merged_df = merged_df.drop(columns =['Unnamed: 0'])
# Checking the shape of the merged_df
merged_df.shape
# Let's see what the merged dataframe looks like by displaying the first 10 rows
merged_df.head(10)
Think About It: As the user_id and song_id are encrypted, can they be encoded as numeric features?
# Apply label encoding for "user_id" and "song_id"
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Fit transform the user_id column
le.fit(merged_df['user_id'])
merged_df['user_id'] = le.transform(merged_df['user_id'])
# Fit transform the song_id column
le.fit(merged_df['song_id'])
merged_df['song_id'] = le.transform(merged_df['song_id'])
# Let's check our new user_id and song_id labels by loading up the first 10 rows
merged_df.head(10)
Think About It: The data contains users who have listened to very few songs, and songs that have been listened to by very few users. Is it necessary to filter the data so that it contains only users who have listened to a good number of songs, and songs with a good number of listeners?
# Get the column containing the users
users = merged_df.user_id
# Create a dictionary from users to their number of songs
ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90
# Create a list of users who need to be removed
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = merged_df.loc[~merged_df.user_id.isin(remove_users)]
# Get the column containing the songs
songs = df.song_id
# Create a dictionary from songs to their number of users
ratings_count = dict()
for song in songs:
    # If we already have the song, just add 1 to its rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[song] = 1
# We want a song to have been listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120
remove_songs = []
for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)

df_final = df.loc[~df.song_id.isin(remove_songs)]
# Keep only records with a play_count of 5 or fewer
df_final = df_final[df_final['play_count'] <= 5]
# Check the shape of the data
df_final.shape
# Let's check that this final_df looks right by loading up the first 10 rows
df_final.head(10)
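As an aside, the counting-and-filtering loops above can be written more compactly with pandas value_counts. A sketch of an equivalent, vectorised version (the variable names here are illustrative, and the cutoffs match the ones used above):
# Keep users with at least 90 interactions
USER_CUTOFF, SONG_CUTOFF = 90, 120
user_counts = merged_df['user_id'].value_counts()
df_v = merged_df[merged_df['user_id'].isin(user_counts[user_counts >= USER_CUTOFF].index)]
# Then keep songs with at least 120 listeners among those users
song_counts = df_v['song_id'].value_counts()
df_v = df_v[df_v['song_id'].isin(song_counts[song_counts >= SONG_CUTOFF].index)]
# Finally, apply the same play-count cap
df_v = df_v[df_v['play_count'] <= 5]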
Total number of unique user id
# Display total number of unique user_id
df_final['user_id'].nunique()
Total number of unique song id
# Display total number of unique song_id
df_final['song_id'].nunique()
# Display total number of unique song titles
df_final['title'].nunique()
Total number of unique artists
# Display total number of unique artists
df_final['artist_name'].nunique()
Observations and Insights: We now have a much smaller and slicker final dataset to work with.
Our user and song IDs are nice, shorter numeric values. We only have songs that have been listened to by at least 120 users and users that have listened to at least 90 songs.
We have also dropped records where a user played a song more than 5 times. This seems counter-intuitive for building our recommendation models, but the incidence of users playing a song more than 5 times is very low, so including these rarer events would make the user-song matrix very big while leaving it sparse.
Comparing the counts above, we can see there are slightly fewer unique titles than unique song IDs: two songs share the same title. This is not many, but we should be careful to use song_id in the models rather than just the title, in case we happen to use a title that is shared by another song.
We can also see that there must be artists with multiple songs, as there are more than double the number of songs as artists.
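These claims are easy to sanity-check directly. A quick sketch (df here is the user-filtered frame from before the play-count cap, so it still contains the higher counts):
# Fraction of records with a play count above 5 (the records dropped by the cap)
print((df['play_count'] > 5).mean())
# Titles shared by more than one song_id
ids_per_title = df_final.groupby('title')['song_id'].nunique()
print(ids_per_title[ids_per_title > 1])
# Average number of distinct songs per artist
print(df_final.groupby('artist_name')['song_id'].nunique().mean())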
# Let's see what the distribution of play counts looks like
plt.figure(figsize = (12, 4))
sns.countplot(x="play_count", data=df_final)
plt.tick_params(labelsize = 10)
plt.title("Distribution of Play Count ", fontsize = 10)
plt.xlabel("Plays", fontsize = 10)
plt.ylabel("Occurence of No. of Plays", fontsize = 10)
plt.show()
Observations: It is interesting that a play count of 1 is by far the most frequent, and that each higher count occurs less often than the one before it. As playing a song once is not a particularly strong indication of liking it, we will have to hope we still have enough data at the higher play counts when we create our recommendation models later.
Most interacted songs
# Display the top 10 songs that have been listened to by the most users
most_played = df_final.groupby(['song_id','title']).size().sort_values(ascending = False)[:10]
most_played
Most interacted users
# Display the top 10 users who have listened to the most songs
#df_final['user_id'].value_counts().head(10)
top_user = df_final.groupby(['user_id']).size().sort_values(ascending = False)[:10]
top_user
# Display the top 10 artists that have had the most listeners
top_artist = df_final.groupby('artist_name').size().sort_values(ascending = False)[:10]
top_artist
df.shape
df_final.shape
Observations and Insights: We can now see some popularity lists taken from the final dataset we are using:
The top 10 songs show the songs that have had the most listeners (not taking into account how many times an individual listener has played each song).
The top 10 users show the users who have listened to the most songs, not how many times they have played those individual songs.
The top 10 artists show the artists that have had the most listeners, again not counting repeat plays. Interestingly, these numbers are much higher than the song counts. Coldplay, for example, have had 5,317 listeners, whereas Yellow, a Coldplay song, has only had 583 of them. This is probably because Coldplay had released 4 studio albums and 2 live albums in the 10 years up to this data being collected: there are a lot of Coldplay songs for users to listen to, the band was still relevant in this period, and it just happens that Yellow, from their earliest release, is the most popular.
For all of the top 10 artists, their plays far exceed any of the top songs, suggesting they have deep back-catalogues rather than a single brand-new hit.
Songs played by year of release
# Count the number of plays per year of release
count_songs = df_final.groupby('year').count()['title']
count = pd.DataFrame(count_songs)
# Drop the first row, which corresponds to songs with no known release year (recorded as year 0)
count.drop(count.index[0], inplace = True)
count.tail()
# Create the plot
# Set the figure size
plt.figure(figsize = (30, 10))
sns.barplot(x = count.index,
y = 'title',
data = count,
estimator = np.median)
# Set the y label of the plot
plt.ylabel('number of titles played')
# Show the plot
plt.show()
Plays drop sharply for songs released in 2010, possibly because those songs had not yet had time to gain popularity. Songs that had been around for a year were the most popular, followed by those from the two years before that. This might say something about the trends and fashions in music at the time: the most popular music of a certain scene released in 2007 was still relevant and similar to the most popular music released in 2009. It is also possible that a new trend was just starting in 2010 but had not got going yet, or that the old trend was dying out and users were starting to diversify what they listened to across other years.
Think About It: What other insights can be drawn using exploratory data analysis?
Potential techniques: What different techniques should be explored?
The objective here is to create a list of 10 songs to recommend to a user. These do not have to be songs they have never heard before, only songs we are quite sure they will want to listen to. I think the best approach is to treat this problem as you would if you were tasked with making a mixtape for someone: you want to include songs you know they already love, songs similar to those, a few crowd-pleasers, and a few discoveries they would be unlikely to find on their own.
Overall solution design: What is the potential solution design?
To accommodate this diverse set of recommendations, I propose combining several different models, such as: a rank-based popularity model; similarity-based collaborative filtering, finding users with tastes like yours (as described in the problem formulation); a model-based approach such as matrix factorisation, to surface the 'hidden variables' mentioned earlier; and a content-based model that uses the song metadata (title, album, artist, year). A sketch of the simplest of these, the rank-based model, follows.
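Here, a rank-based model scores each song by its average play count, keeping only songs with enough listeners for the average to be trustworthy (the function name and the 100-interaction threshold are illustrative assumptions, not fixed requirements):
# Rank songs by average play count, restricted to songs with enough
# interactions for the average to be meaningful
def top_n_songs(data, n=10, min_interactions=100):
    stats = data.groupby('song_id')['play_count'].agg(['mean', 'count'])
    stats = stats[stats['count'] >= min_interactions]
    return stats.sort_values('mean', ascending=False).head(n)
top_n_songs(df_final, n=10)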
Measures of success: What are the key measures of success to compare the different potential techniques?
Regarding the models, we can hold out part of the data for training and testing so that we can measure how well they perform, and we can use RMSE and other metrics to compare tuned models to their counterparts.
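For example, treating the capped play counts as implicit ratings, RMSE on a held-out test set could be computed like this (a sketch; the 'model' here just predicts each song's training-set mean play count and is only a placeholder for the real models):
from sklearn.model_selection import train_test_split
# Hold out 20% of the interactions for evaluation
train, test = train_test_split(df_final, test_size=0.2, random_state=42)
# Placeholder predictions: each song's mean play count in the training set,
# falling back to the global mean for songs unseen in training
song_means = train.groupby('song_id')['play_count'].mean()
preds = test['song_id'].map(song_means).fillna(train['play_count'].mean())
rmse = np.sqrt(mean_squared_error(test['play_count'], preds))
print(f'Baseline RMSE: {rmse:.4f}')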
However, there is something else we must take into consideration at a higher level of this project.
What we must bear in mind is that music should not always be a popularity contest. It may be tempting to use a most-popular-is-best kind of model, but if we do, we risk drowning out the smaller, independent artists and the brand-new ones. We must not be seduced by a bigger-is-better approach. People do not measure their love of a song by whether it is popular; popularity only means they have more chance of hearing it. And indeed, it is something very special to find a song first, or one that is lost to time, or something obscure that you get to be the one to introduce to your friends.