Music Recommendation System

Milestone 1

Problem Definition

The context: Why is this problem important to solve?

Music is important. I know someone whose life was quite literally saved by music: in a very dark time, hearing 'Astral Weeks' by Van Morrison for the first time made them realise the beauty that can exist in the world, and that maybe it was worth sticking around for.

Music is spiritual, psychological, emotional, inspiring and evocative, and it can express what we are feeling when we are unable to. It is not a product to be consumed but a necessity for the soul. Finding the music that speaks to you is so important, and being an artist who can reach your own niche listener is special and vital to keeping new music thriving. Gone are the days when your friend made you a mix tape or you recorded the top 40 from the Sunday night chart on the radio! And with so many of the smaller (and larger!) music venues closed down that once supported a thriving live music scene, essential for discovering new bands and artists in your local town or city, streaming apps are the new way of finding your music, and of artists finding their fans.

The objectives: The intended goal is to explore this dataset and come up with ideas on the best ways to give song recommendations to a user.

The key questions: What should a list of recommended songs look like? Should the top ten all be based on the same idea, or should a combination of ideas make up the perfect list of recommended songs?

The problem formulation: We can use data science to help find these recommendations. Using large datasets of users listening to songs, we can find users who are similar to other users. For example, a top song of User A will be a great recommendation for a User B who shares a very similar music taste.
We can also look for hidden (latent) variables that affect who likes to listen to what: things only the maths can find that we can't, though we might be able to make sense of them once the maths finds them! With these tools we can create all sorts of varied models that will hopefully give us a great and varied list of recommended songs that a user will not only be happy to listen to, but that might just change, or even save, their life.
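For a toy illustration of the user-similarity idea (the vectors and play counts below are made up purely for this example), cosine similarity scores how closely two users' play-count vectors point in the same direction:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical play counts for three users over the same five songs
user_a = np.array([[5, 0, 3, 1, 0]])
user_b = np.array([[4, 0, 2, 2, 0]])
user_c = np.array([[0, 5, 0, 0, 4]])

print(cosine_similarity(user_a, user_b))  # close to 1: similar taste
print(cosine_similarity(user_a, user_c))  # 0 here: no overlap at all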

Data Dictionary

The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.

song_data

song_id - A unique id given to every song

title - Title of the song

release - Name of the released album

artist_name - Name of the artist

year - Year of release

count_data

user_id - A unique id given to the user

song_id - A unique id given to the song

play_count - Number of times the song was played

Data Source

http://millionsongdataset.com/

Important Notes

  • This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give you a direction on what steps need to be taken to get a feasible solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem, and we encourage you to feel free and explore them as an 'optional' exercise.

  • In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

  • The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

  • All the outputs in the notebook are just for reference and can be different if you follow a different approach.

  • There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.

Importing Libraries and the Dataset

In [201]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [202]:
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary output that does not raise a key error
from collections import defaultdict

# A performance metric from sklearn
from sklearn.metrics import mean_squared_error

Load the dataset

In [203]:
# Importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/song_data.csv')

Understanding the data by viewing a few observations

In [204]:
# See top 10 records of count_df data
count_df.head(10)
Out[204]:
Unnamed: 0 user_id song_id play_count
0 0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1
1 1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2
2 2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
3 3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1
4 4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1
5 5 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5
6 6 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODXRTY12AB0180F3B 1
7 7 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFGUAY12AB017B0A8 1
8 8 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFRQTD12A81C233C0 1
9 9 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOHQWYZ12A6D4FA701 1
In [205]:
# See top 10 records of song_df data
song_df.head(10)
Out[205]:
song_id title release artist_name year
0 SOQMMHC12AB0180CB8 Silent Night Monster Ballads X-Mas Faster Pussy cat 2003
1 SOVFVAK12A8C1350D9 Tanssi vaan Karkuteillä Karkkiautomaatti 1995
2 SOGTUKN12AB017F4F1 No One Could Ever Butter Hudson Mohawke 2006
3 SOBNYVR12A8C13558C Si Vos Querés De Culo Yerba Brava 2003
4 SOHSBXH12A8C13B0DF Tangle Of Aspens Rene Ablaze Presents Winter Sessions Der Mystic 0
5 SOZVAPQ12A8C13B63C Symphony No. 1 G minor "Sinfonie Serieuse"/All... Berwald: Symphonies Nos. 1/2/3/4 David Montgomery 0
6 SOQVRHI12A6D4FB2D7 We Have Got Love Strictly The Best Vol. 34 Sasha / Turbulence 0
7 SOEYRFT12AB018936C 2 Da Beat Ch'yall Da Bomb Kris Kross 1993
8 SOPMIYT12A6D4F851E Goodbye Danny Boy Joseph Locke 0
9 SOJCFMH12A8C13B0C2 Mama_ mama can't you see ? March to cadence with the US marines The Sun Harbor's Chorus-Documentary Recordings 0

Let us check the data types and missing values of each column

In [206]:
# See the info of the count_df 

count_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   Unnamed: 0  int64 
 1   user_id     object
 2   song_id     object
 3   play_count  int64 
dtypes: int64(2), object(2)
memory usage: 61.0+ MB
In [207]:
# We will look at the shape of the count_df

count_df.shape
Out[207]:
(2000000, 4)
In [208]:
# Checking for missing values in the count_df

print(count_df.isna().sum())
Unnamed: 0    0
user_id       0
song_id       0
play_count    0
dtype: int64
In [209]:
# See the info of the song_df data

song_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   song_id      1000000 non-null  object
 1   title        999985 non-null   object
 2   release      999995 non-null   object
 3   artist_name  1000000 non-null  object
 4   year         1000000 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 38.1+ MB
In [210]:
# We will look at the shape of the song_df

song_df.shape
Out[210]:
(1000000, 5)
In [211]:
# Checking for missing values in song_df

print(song_df.isna().sum())
song_id         0
title          15
release         5
artist_name     0
year            0
dtype: int64
In [212]:
# Let's see how many unique values there are in each column of the count_df

count_df.nunique()
Out[212]:
Unnamed: 0    2000000
user_id         76353
song_id         10000
play_count        295
dtype: int64
In [213]:
# Let's see how many unique values there are in each column of the song_df

song_df.nunique()
Out[213]:
song_id        999056
title          702428
release        149288
artist_name     72665
year               90
dtype: int64
In [214]:
# Let's check to see how many duplicates there are in the song_df

song_df.duplicated().sum()
Out[214]:
498
In [215]:
# Let's check to see how many duplicates there are in the count_df

count_df.duplicated().sum()
Out[215]:
0

In the COUNT dataset we can see that the unnamed index and the play_count columns are integer data types. The song and user IDs are object types as they are encrypted strings; we should probably convert these to numerical IDs later to make them easier to work with.

There are no duplicates and no missing values in the Count dataset.

In the SONG dataset we can see that all of the columns are object types, because they hold names or encrypted IDs, apart from the year column, which is an integer giving the year the song was released.

There are 498 duplicates in the Song dataset and 20 missing values.

I propose dropping the duplicate and missing-value rows from this dataset, and converting the user and song IDs to numerical values.

In [216]:
# Dropping the missing value rows from the dataset

song_df = song_df.dropna()
In [217]:
# Checking all columns now have 999980 entries after dropping the missing values

song_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999980 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   song_id      999980 non-null  object
 1   title        999980 non-null  object
 2   release      999980 non-null  object
 3   artist_name  999980 non-null  object
 4   year         999980 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 45.8+ MB
In [218]:
# Checking there are no longer any missing values

song_df.isna().sum()
Out[218]:
song_id        0
title          0
release        0
artist_name    0
year           0
dtype: int64
In [219]:
# Dropping the duplicates in the song_df

song_df.drop_duplicates(inplace=True)
In [220]:
# Checking the values are now all (999980 - 498) = 999,482 after dropping the duplicates

song_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999482 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   song_id      999482 non-null  object
 1   title        999482 non-null  object
 2   release      999482 non-null  object
 3   artist_name  999482 non-null  object
 4   year         999482 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 45.8+ MB

Observations and Insights: So we now have the Song dataset without the 498 duplicates and without the 20 missing-value rows, with every column now holding 999,482 entries.

The Song dataset gives us more information about each song, connecting the song_id with the title, artist, album (release) and year of release, whereas the Count dataset shows which users have played which songs and how many times. It makes sense to merge these two datasets into one so we can use the extra song information in our recommendation models later.

We should also look at encoding the user and song IDs as numerical values so we can work with them more easily inside our models.

The column 'Unnamed: 0' is of no use to us, so we will drop it.

In [221]:
# Left merge the count_df and song_df data on "song_id". Drop duplicates from song_df data simultaneously
merged_df = pd.merge(count_df, song_df.drop_duplicates(['song_id']), on="song_id", how="left")

# Drop the column 'Unnamed: 0'
merged_df = merged_df.drop(columns =['Unnamed: 0'])
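As a quick optional sanity check: a left merge keeps every row of count_df, so any song_id that was dropped from song_df would surface here as missing metadata.

# Count records whose song_id found no match in song_df
print(merged_df['title'].isna().sum())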
In [222]:
# Checking the shape of the merged_df

merged_df.shape
Out[222]:
(2000000, 7)
In [223]:
# Let's see what the merged dataframe looks like by displaying the first 10 rows

merged_df.head(10)
Out[223]:
user_id song_id play_count title release artist_name year
0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1 The Cove Thicker Than Water Jack Johnson 0
1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2 Entre Dos Aguas Flamenco Para Niños Paco De Lucia 1976
2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1 Stronger Graduation Kanye West 2007
3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1 Constellations In Between Dreams Jack Johnson 2005
4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1 Learn To Fly There Is Nothing Left To Lose Foo Fighters 1999
5 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5 Apuesta Por El Rock 'N' Roll Antología Audiovisual Héroes del Silencio 2007
6 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODXRTY12AB0180F3B 1 Paper Gangsta The Fame Monster Lady GaGa 2008
7 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFGUAY12AB017B0A8 1 Stacked Actors There Is Nothing Left To Lose Foo Fighters 1999
8 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFRQTD12A81C233C0 1 Sehr kosmisch Musik von Harmonia Harmonia 0
9 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOHQWYZ12A6D4FA701 1 Heaven's gonna burn your eyes Hôtel Costes 7 by Stéphane Pompougnac Thievery Corporation feat. Emiliana Torrini 2002

Think About It: As the user_id and song_id are encrypted, can they be encoded to numeric features?

In [224]:
# Apply label encoding for "user_id" and "song_id"
# Label Encoding
from sklearn.preprocessing import LabelEncoder  
le = LabelEncoder()

# Fit transform the user_id column

le.fit(merged_df['user_id'])
merged_df['user_id'] = le.transform(merged_df['user_id'])

# Fit transform the song_id column

le.fit(merged_df['song_id'])
merged_df['song_id'] = le.transform(merged_df['song_id'])
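One thing to note: because the same le object is re-fitted on song_id, the user_id mapping is overwritten, so the encoded user ids can no longer be translated back. If we ever needed that, one encoder per column keeps the mapping available via inverse_transform. A small illustrative aside (the ids below are sampled from the data purely for demonstration):

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(['SOAKIMP12A8C130995', 'SOBBMDR12A8C13253B', 'SOAKIMP12A8C130995'])

print(encoded)                                  # e.g. [0 1 0]
print(demo_encoder.inverse_transform(encoded))  # back to the original ids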
In [225]:
# Let's check our new user_id and song_id labels by loading up the first 10 rows

merged_df.head(10)
Out[225]:
user_id song_id play_count title release artist_name year
0 54961 153 1 The Cove Thicker Than Water Jack Johnson 0
1 54961 413 2 Entre Dos Aguas Flamenco Para Niños Paco De Lucia 1976
2 54961 736 1 Stronger Graduation Kanye West 2007
3 54961 750 1 Constellations In Between Dreams Jack Johnson 2005
4 54961 1188 1 Learn To Fly There Is Nothing Left To Lose Foo Fighters 1999
5 54961 1239 5 Apuesta Por El Rock 'N' Roll Antología Audiovisual Héroes del Silencio 2007
6 54961 1536 1 Paper Gangsta The Fame Monster Lady GaGa 2008
7 54961 2056 1 Stacked Actors There Is Nothing Left To Lose Foo Fighters 1999
8 54961 2220 1 Sehr kosmisch Musik von Harmonia Harmonia 0
9 54961 3046 1 Heaven's gonna burn your eyes Hôtel Costes 7 by Stéphane Pompougnac Thievery Corporation feat. Emiliana Torrini 2002

Think About It: The data contains users who have listened to very few songs, and songs that have been listened to by very few users. Is it necessary to filter the data so that it only contains users who have listened to a good number of songs, and songs that have a good number of listeners?

In [226]:
# Get the column containing the users
users = merged_df.user_id

# Create a dictionary from users to their number of songs
ratings_count = dict()

for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1    
In [227]:
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_ratings in ratings_count.items():
    
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = merged_df.loc[ ~ merged_df.user_id.isin(remove_users)]
In [228]:
# Get the column containing the songs
songs = df.song_id

# Create a dictionary from songs to their number of users
ratings_count = dict()

for song in songs:
    # If we already have the song, just add 1 to their rating count
    if song in ratings_count:
        ratings_count[song] += 1
    
    # Otherwise, set their rating count to 1
    else:
        ratings_count[song] = 1    
In [229]:
# We want a song to have been listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120

remove_songs = []

for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)

df_final = df.loc[ ~ df.song_id.isin(remove_songs)]
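For reference, the two counting-and-filtering loops above can be written more compactly with pandas value_counts; this sketch (my own variable names) is equivalent:

# Keep users with at least 90 interactions
user_counts = merged_df['user_id'].value_counts()
df_alt = merged_df[merged_df['user_id'].isin(user_counts[user_counts >= 90].index)]

# Then keep songs with at least 120 listeners among those users
song_counts = df_alt['song_id'].value_counts()
df_final_alt = df_alt[df_alt['song_id'].isin(song_counts[song_counts >= 120].index)]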
In [230]:
# Drop records with play_count greater than 5
df_final = df_final[df_final['play_count'] <= 5]
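As a quick aside of my own, we can check how rare these high play counts actually are, using the user-filtered df from above:

# Fraction of records with a play count above 5
print((df['play_count'] > 5).mean())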
In [231]:
# Check the shape of the data
df_final.shape
Out[231]:
(117876, 7)
In [232]:
# Let's check that df_final looks right by loading up the first 10 rows

df_final.head(10)
Out[232]:
user_id song_id play_count title release artist_name year
200 6958 447 1 Daisy And Prudence Distillation Erin McKeown 2000
202 6958 512 1 The Ballad of Michael Valentine Sawdust The Killers 2004
203 6958 549 1 I Stand Corrected (Album) Vampire Weekend Vampire Weekend 2007
204 6958 703 1 They Might Follow You Tiny Vipers Tiny Vipers 2007
205 6958 719 1 Monkey Man You Know I'm No Good Amy Winehouse 2007
206 6958 892 1 Bleeding Hearts Hell Train Soltero 0
209 6958 1050 5 Wet Blanket Old World Underground_ Where Are You Now? Metric 2003
213 6958 1480 1 Fast As I Can Monday Morning Cold Erin McKeown 2000
215 6958 1671 2 Sleeping In (Album) Give Up Postal Service 2003
216 6958 1752 1 Gimme Sympathy Gimme Sympathy Metric 2009

Exploratory Data Analysis

Let's check the total number of unique users, songs, artists in the data

Total number of unique user id

In [233]:
# Display total number of unique user_id
df_final['user_id'].nunique()
Out[233]:
3155

Total number of unique song id

In [234]:
# Display total number of unique song_id
df_final['song_id'].nunique()
Out[234]:
563
In [246]:
# Display total number of unique song titles
df_final['title'].nunique()
Out[246]:
561

Total number of unique artists

In [235]:
# Display total number of unique artists
df_final['artist_name'].nunique()
Out[235]:
232

Observations and Insights: We now have a much smaller and slicker final dataset to work with.

Our user and song IDs are nice, shorter numeric values. We only have songs that have been listened to by at least 120 users and users that have listened to at least 90 songs.

We have also dropped records of a song being played by a user more than 5 times. This seems counter-intuitive for building our recommendation models, but the incidence of users playing a song more than 5 times is very low, so including these rarer events would add little information while inflating an already large and sparse user-to-song matrix.

We can see in the end we have

  • 3,155 unique users
  • 563 unique song IDs
  • 561 unique song titles
  • 232 unique artists

This tells us that a couple of titles are shared by more than one song_id; the quick check below confirms which. It is not many, but we should be careful to use song_id in the models rather than the title alone, in case a title we use is shared by another song.
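# Illustrative check: titles that map to more than one song_id
title_counts = df_final.groupby('title')['song_id'].nunique()
print(title_counts[title_counts > 1])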

We can also see that some artists must have multiple songs in the data, as there are more than double the number of songs as artists.

In [236]:
# Let's see what the distribution of play counts looks like

plt.figure(figsize = (12, 4))
sns.countplot(x="play_count", data=df_final)

plt.tick_params(labelsize = 10)
plt.title("Distribution of Play Count ", fontsize = 10)
plt.xlabel("Plays", fontsize = 10)
plt.ylabel("Occurence of No. of Plays", fontsize = 10)
plt.show()
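As a numeric companion to the plot:

# The same distribution as raw counts
print(df_final['play_count'].value_counts().sort_index())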

Observations: It is interesting to see that a play count of 1 is by far the most frequent, and that the frequency falls with each higher count. As playing a song once is not a great indication of someone liking it, we will have to hope there is still enough data at the higher play counts when we create our recommendation models later.

Let's find out about the most-interacted-with songs and users

Most interacted songs

In [237]:
# Display the top 10 songs that have been listened to by the most users

most_played = df_final.groupby(['song_id','title']).size().sort_values(ascending = False)[:10]
most_played
Out[237]:
song_id  title                         
8582     Use Somebody                      751
352      Dog Days Are Over (Radio Edit)    748
2220     Sehr kosmisch                     713
1118     Clocks                            662
4152     The Scientist                     652
5531     Secrets                           618
4448     Fireflies                         609
6189     Creep (Explicit)                  606
6293     Yellow                            583
1334     Hey_ Soul Sister                  570
dtype: int64
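As an alternative ranking (a sketch of my own), we could weight songs by total plays rather than by the number of listeners:

most_played_weighted = (
    df_final.groupby(['song_id', 'title'])['play_count']
            .sum()
            .sort_values(ascending=False)
            .head(10)
)
print(most_played_weighted)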

Most interacted users

In [253]:
# Display the top 10 users who have listened to the most songs

#df_final['user_id'].value_counts().head(10)
top_user = df_final.groupby(['user_id']).size().sort_values(ascending = False)[:10]
top_user
Out[253]:
user_id
61472    243
15733    227
37049    202
9570     184
23337    177
10763    176
9097     175
26616    175
43041    174
65994    171
dtype: int64
In [254]:
# Display the top 10 artists that have had the most listeners

top_artist = df_final.groupby('artist_name').size().sort_values(ascending = False)[:10]
top_artist
                  
Out[254]:
artist_name
Coldplay                  5317
The Killers               4128
Florence + The Machine    2896
Kings Of Leon             2864
the bird and the bee      2387
LCD Soundsystem           2168
Vampire Weekend           2145
Justin Bieber             2130
Octopus Project           1825
Soltero                   1691
dtype: int64
In [241]:
df.shape
Out[241]:
(438390, 7)
In [242]:
df_final.shape
Out[242]:
(117876, 7)

Observations and Insights: We can now see some popularity lists taken from the final dataset we are using:

  • The top 10 songs list shows us the songs that have had the most listeners (not taking into account how many times an individual listener played the song).

  • The top 10 users list shows us the users who have listened to the most songs, not taking into account how many times they played those individual songs.

  • The top 10 artists list shows us the artists with the most listens, not taking into account how many times a listener may have played an individual song by that artist. Interestingly, these numbers are much higher than the song counts. Take Coldplay: they have 5,317 listens in total, whereas Yellow, a Coldplay song, accounts for only 583 of them. This is probably because Coldplay had released 4 studio albums and 2 live albums in the 10 years before this data was collected, so there are many Coldplay songs for users to listen to; they were still relevant in this period, and it just happens that Yellow, from their earliest release, is the most popular.

For all of the top 10 artists, total plays far exceed those of any single top song, suggesting these are established acts with large back catalogues rather than brand new artists.

Songs played by year of release

In [243]:
# Count the number of plays of titles grouped by their release year
count_songs = df_final.groupby('year').count()['title']

count = pd.DataFrame(count_songs)

# Drop the first row, which holds songs with an unknown release year (year 0)
count.drop(count.index[0], inplace = True)

count.tail()
Out[243]:
title
year
2006 7592
2007 13750
2008 14031
2009 16351
2010 4087
In [244]:
# Create the plot

# Set the figure size
plt.figure(figsize = (30, 10))

sns.barplot(x = count.index,
            y = 'title',
            data = count,
            estimator = np.median)

# Set the y label of the plot
plt.ylabel('number of titles played') 

# Show the plot
plt.show()

Observations and Insights: Here we can see the popularity of songs by the year they were released. This data was collected in January 2011, and it is interesting to see that there is a large dip in plays for songs from the latest year, 2010, compared to the years preceding it.

This is possibly because songs released in 2010 had not yet had enough time to gain popularity. Songs that had been around for a year were the most popular, followed by those from the two years before that. This might say something about the trends and fashions in music at the time: the most popular music of a certain scene released in 2007 was still relevant and similar to the most popular music released in 2009. It is also possible that a new trend was just starting in 2010 but hadn't got going yet, or that an older trend was dying out and users were starting to diversify in what they listened to from other years.

Think About It: What other insights can be drawn using exploratory data analysis?

Proposed approach

Potential techniques: What different techniques should be explored?

The objective here is to create a list of 10 songs to recommend to a user. These do not have to be songs they have never heard before, only songs we are quite sure they will want to listen to. I think the best approach is to treat this whole problem as you would if you were tasked with making a mix-tape for someone. You want to include:

  • songs they know and love
  • songs they might not know but should because everybody loves them
  • songs from artists you know they love but that they might not know this particular song
  • an old classic that might complement their modern taste or vice versa
  • something from a brand new artist you think they will love so that they can feel ahead of - or just on the curve

Overall solution design: What is the potential solution design?

To accommodate this diverse set of recommendations, I propose combining several models, such as:

  • user-to-user similarity - to find what other users with a similar taste are listening to (a minimal sketch follows this list)
  • graph network - to find out which songs seem to be the key to other songs that users like
  • song-to-song similarity - this could be as simple as using songs from the same artist and/or album; if you like 3 songs from an artist, you are quite likely to like another of their songs
  • matrix estimation to find latent variables - we may well be able to surface relationships between certain users and certain songs driven by factors we had not thought of
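For illustration, here is a minimal sketch of the user-to-user similarity idea, assuming the filtered df_final from above (the variable names are my own):

# Build a user x song matrix of play counts, with 0 for unheard songs
interactions = df_final.pivot_table(index='user_id', columns='song_id',
                                    values='play_count', fill_value=0)

# Cosine similarity between every pair of users
user_sim = cosine_similarity(interactions)

# For one (arbitrary) user, find the ids of their 5 most similar neighbours,
# skipping the first entry of the sorted list, which is the user themselves
target_pos = 0
neighbour_positions = user_sim[target_pos].argsort()[::-1][1:6]
print(interactions.index[neighbour_positions])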

Measures of success: What are the key measures of success to compare different potential techniques?

In regards to the models, we can train them on a training set so that we can measure how well they are performing, and we can use RMSE and other scores to compare the performance of tuned models against their counterparts.
However, there is something else we must take into consideration in a higher-level view of this project.
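For example, a toy illustration of the RMSE measure, using the mean_squared_error import from the top of the notebook (the values here are made up):

actual = [1, 5, 3, 2, 4]
predicted = [1.2, 4.5, 2.8, 2.4, 3.6]

rmse = np.sqrt(mean_squared_error(actual, predicted))
print(round(rmse, 3))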

What we must bear in mind is that music should not always be a popularity contest. It may be tempting to use a most-popular-is-best kind of model, but if we do this we risk drowning out the smaller, independent artists and the brand new artists. We must not be seduced by a bigger/better approach. People do not measure their love of a song by its popularity; popularity normally only means they had more chance of hearing it. Indeed, it is something very special to find a song first, or one that is lost to time, or something obscure that you get to be the one to introduce to your friends.