Content-Based Recommender System Python

We built a basic recommendation system in an earlier post, linked below for reference. Please go through it first, as we will be working with the same TMDB dataset. https://machinemantra.in/2020/10/26/movie-recommendation-system-using-python/

Need for Content Based Filtering

The suggestions made by a recommender engine help us narrow down a search according to our own preferences.

  • The need for content-based filtering arises as we become more selective in our choices and preferences.
  • We do not wish to waste time searching for similar content.
  • It gives us a smart way to deal with that problem.
  • We can rely on the suggestions it makes.

What are Content-Based Recommender Systems

They make recommendations based on the descriptive attributes of items. To put it simply: Content = Description.

In content-based methods, the ratings and buying behavior of users are combined with the content information available for the items. For example, if Jane likes the movie Terminator, other science fiction movies with similar genre keywords, such as Alien and Predator, will be recommended.

What becomes the training data?

  • The item descriptions, labeled with ratings, are used as training data to create a user-specific classification or regression modeling problem.
  • The class (or dependent) variable corresponds to the specified ratings or buying behavior.
  • These training documents are used to create a classification or regression model that is specific to the user.
  • This user-specific model is used to predict whether the corresponding individual will like an item for which her rating or buying behavior is unknown.

Advantages and Disadvantages of Content-Based Recommender System

Advantages:

  • Recommendations can be made for new items, even when sufficient rating data are not available for that item.
  • Even if there is no rating history at all for a particular item, it can still be recommended.

Disadvantages:

  • Recommendations become very specific to the individual user; the community view is ignored. Content-based systems are also not effective at making recommendations for new users.
  • It is usually important to have a large number of ratings available for the target user in order to make robust predictions without overfitting.

Let Us Get Started

To build a content-based recommender system, we will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on those scores. The plot description is given in the overview feature of our dataset.

The overview column first needs to be converted from words to vectors.

A vector is an object that has both a magnitude and a direction. Giving each word a direction and magnitude provides a way to represent the text in space and draw meaningful insights from it.

We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview. The TF-IDF weight is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus.

The first term is the Term Frequency (TF), which measures how frequently a term occurs in a document: the number of times a word appears in a document, divided by the total number of words in that document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

The second term is the Inverse Document Frequency (IDF), which measures how important a term is. It is computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

While computing TF, all terms are considered equally important. However, a few terms such as and, the, is, of and that may appear many times but carry little significance. Hence, we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
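
To make the two formulas concrete, here is a minimal sketch in plain Python that multiplies TF by IDF for a few words. The three short "overviews" are made up for illustration, and the helper names tf, idf and tf_idf are hypothetical. Note that scikit-learn's TfidfVectorizer uses a smoothed variant of IDF by default, so its numbers will differ from this textbook version.

```python
import math

# Toy corpus: each string stands in for a movie overview (made-up text).
docs = [
    "a robot is sent back in time",
    "a crew fights an alien on a ship",
    "a soldier hunts an alien in the jungle",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Term frequency: count of the term / total number of terms in the document.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: ln(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "a" appears in every document, so its IDF is ln(1) = 0 and its weight vanishes.
print(tf_idf("a", tokenized[1], tokenized))
# "alien" appears in 2 of 3 documents, "robot" in only 1, so "robot" is weighted higher.
print(tf_idf("alien", tokenized[1], tokenized))
print(tf_idf("robot", tokenized[0], tokenized))
```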

Applying TF-IDF to our example, we obtain a matrix in which each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie, as before. This weighting reduces the importance of words that occur frequently in plot overviews and, therefore, their influence on the final similarity score.

We have 20,978 different words used to describe the 4,803 movies in our dataset.
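
A minimal sketch of this step with scikit-learn's TfidfVectorizer is shown below. Since the TMDB files are not loaded here, a tiny hand-made DataFrame stands in for the real one; the variable name df and the overview texts are assumptions for illustration only. On the full dataset the resulting shape would be the movie and word counts quoted above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the TMDB DataFrame; the overviews are made up for this sketch.
df = pd.DataFrame({
    "title": ["The Terminator", "Alien", "Predator", "Notting Hill"],
    "overview": [
        "A cyborg assassin is sent back in time to kill a woman.",
        "The crew of a spaceship is hunted by a deadly alien creature.",
        "Soldiers in a jungle are hunted by a deadly alien creature.",
        None,  # a missing overview, to show the NaN handling
    ],
})

# Replace missing overviews with an empty string so the vectorizer does not fail.
df["overview"] = df["overview"].fillna("")

# stop_words="english" drops very common words such as "the", "is" and "of".
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(df["overview"])

# Rows = movies, columns = distinct words kept in the vocabulary.
print(tfidf_matrix.shape)
```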

Similarity Score

With this matrix, we can now compute a similarity score. There are several options for this, such as the Euclidean distance, the Pearson correlation and the cosine similarity. It is often a good idea to experiment with different metrics.

We will use the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate. Mathematically, for two vectors x and y, it is defined as follows:

cosine(x, y) = (x · y) / (||x|| * ||y||)

Here x · y is the dot product of the two vectors and ||x|| is the length (Euclidean norm) of x.
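
A minimal sketch of cosine similarity in plain Python: the dot product of two vectors divided by the product of their lengths. The function name cosine and the example vectors are hypothetical; the first call illustrates what "independent of magnitude" means.

```python
import math

def cosine(x, y):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

v = [1.0, 2.0, 0.0]
# Same direction, twice the length: the score is (up to rounding) 1.
print(cosine(v, [2.0, 4.0, 0.0]))
# No shared non-zero component: the score is 0.
print(cosine(v, [0.0, 0.0, 3.0]))
```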

Since the TF-IDF vectorizer normalizes each vector to unit length by default, calculating the dot product directly gives us the cosine similarity score. Therefore, we will use sklearn's linear_kernel() instead of cosine_similarity(), since it is faster.
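
The sketch below checks this claim on a toy example: with TfidfVectorizer's default settings, linear_kernel() and cosine_similarity() return the same matrix. The three overviews are made up, and the variable name cosine_sim is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews standing in for the real dataset.
overviews = [
    "A cyborg assassin is sent back in time.",
    "The crew of a spaceship is hunted by a deadly alien.",
    "Soldiers in a jungle are hunted by a deadly alien.",
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# linear_kernel computes plain dot products between every pair of rows.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Because the rows already have unit length, this matches the cosine similarity.
print(np.allclose(cosine_sim, cosine_similarity(tfidf_matrix)))
print(cosine_sim.shape)  # one row and one column per movie
```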

We are going to define a function that takes a movie title as input and outputs a list of the 10 most similar movies. For this, we first need a reverse mapping from movie titles to DataFrame indices, i.e., a mechanism to identify the index of a movie in our metadata DataFrame given its title.

lambda signifies an anonymous function. In this case, the function takes the single argument x and returns x[1] (i.e., the item at index 1 in x), which is the similarity score used for sorting.

We can now define our recommendation function. These are the steps we'll follow:
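
One way to sketch such a reverse mapping is a pandas Series indexed by title. The toy DataFrame and the names df and indices are assumptions; keeping only the first row for a repeated title is a choice made for this sketch, not something the dataset requires.

```python
import pandas as pd

# Toy stand-in for the metadata DataFrame.
df = pd.DataFrame({"title": ["The Terminator", "Alien", "Predator"]})

# Series whose index is the title and whose values are the row positions.
indices = pd.Series(df.index, index=df["title"])
# Keep only the first row for any title that appears more than once.
indices = indices[~indices.index.duplicated(keep="first")]

print(indices["Alien"])  # 1
```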
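
As a small illustration, such a lambda can serve as the sort key for (position, similarity score) pairs. The scores below are made up.

```python
# (position, similarity score) pairs for one movie; the scores are made up.
sim_scores = [(0, 1.0), (1, 0.12), (2, 0.47), (3, 0.0)]

# key=lambda x: x[1] sorts by the score; reverse=True puts the highest first.
ranked = sorted(sim_scores, key=lambda x: x[1], reverse=True)
print(ranked)  # [(0, 1.0), (2, 0.47), (1, 0.12), (3, 0.0)]
```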

  • Get the index of the movie given its title.
  • Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.
  • Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
  • Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
  • Return the titles corresponding to the indices of the top elements.
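
The steps above can be sketched as a single function. This is a minimal version under stated assumptions, not a definitive implementation: the toy DataFrame, its made-up overviews, the function name get_recommendations and the top_n parameter are all hypothetical, and titles are assumed to be unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the TMDB data; the overviews are made up for this sketch.
df = pd.DataFrame({
    "title": ["The Terminator", "Alien", "Predator", "Notting Hill"],
    "overview": [
        "A cyborg assassin is sent back in time to kill a woman.",
        "The crew of a spaceship is hunted by a deadly alien creature.",
        "Soldiers in a jungle are hunted by a deadly alien creature.",
        "A bookshop owner falls in love with a famous actress.",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(df["overview"].fillna(""))
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(df.index, index=df["title"])  # assumes unique titles

def get_recommendations(title, cosine_sim=cosine_sim, top_n=10):
    # 1. Index of the movie that matches the title.
    idx = indices[title]
    # 2. (position, score) pairs against every movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # 3. Sort by score (the second element), highest first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # 4. Skip the first entry (the movie itself), then keep the next top_n.
    sim_scores = sim_scores[1:top_n + 1]
    # 5. Map positions back to titles.
    movie_indices = [pair[0] for pair in sim_scores]
    return df["title"].iloc[movie_indices]

print(get_recommendations("Alien", top_n=2))
```

In this toy data, Predator shares the words "hunted", "deadly", "alien" and "creature" with Alien, so it comes out on top. On the full dataset the default top_n=10 returns the ten most similar titles.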

With that, we are done with our recommender system based on the plot description, i.e., the overview of the movie. Further improvements can still be made by basing recommendations on cast, crew and genre.

Sources:

http://www.tfidf.com/

https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system

https://en.wikipedia.org/wiki/Cosine_similarity
