A picture is worth a thousand floats
On using image and word embeddings for improving movie recommendations
In my last post, I detailed how I engineered an end-to-end movie recommender system. The recommendations were generated using a collaborative topic model (a hybrid model that combines matrix factorization and topic modeling), and I found that it provided superior recommendations compared to matrix factorization alone. Additionally, the model successfully alleviates the item cold-start problem thanks to the inclusion of a movie’s content (movie scripts in this case) in the model.
In this post, I aim to improve the recommender system by using embeddings of movie content. Embeddings are a powerful tool in a data scientist’s toolkit, providing a framework for transforming text, images, video, and audio into dense vector representations. The transformation from an item to its embedding vector can be learned in several different ways, but the primary objective in all cases is to produce a latent space in which two vectors that are ‘close together’ have similar meanings. Several metrics can be used to measure how close two vectors are in the latent space, but cosine similarity is a popular choice. Thanks to ChatGPT and other major AI breakthroughs, embeddings are all the rage these days, and there are many excellent resources on how embedding models are learned and applied in practice (see e.g. Hugging Face, OpenAI, LinkedIn, Etsy). Therefore, in this post, I will primarily focus on using pre-trained embedding models to represent a movie’s content.
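As a quick, self-contained illustration of what ‘close’ means here (a toy example with made-up vectors, not output from any model in this post), cosine similarity can be computed directly with NumPy:
import numpy as np
# two made-up embedding vectors, purely for illustration
v1 = np.array([0.2, 0.8, 0.1])
v2 = np.array([0.25, 0.75, 0.05])
# cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)  # close to 1.0, so the vectors point in similar directions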
Specifically, I will discuss the following topics in this post:
- How to create image embeddings of movie posters using pre-trained models available through the FiftyOne open-source toolkit
- How to create word embeddings of movie descriptions and movie scripts using pre-trained models available through Hugging Face
- How to visualize the embedding space by projecting the vectors onto a lower-dimensional (2D) plane via UMAP (or t-SNE)
- How to use cosine similarity to find the top-k similar movies in each of the embedding spaces and compare the results to the topic model we built previously
- Finally, I will conclude with the key findings from this work and discuss the next steps for leveraging embeddings in recommenders
I will show the core code that I used to generate the embeddings and visualizations, but the complete code can be found on GitHub.
Getting started: Downloading pre-trained embedding models, loading in the dataset, and forming a FiftyOne-specific dataset
To get started, we will install the packages and models required to transform the raw movie poster and text data into embeddings. We will use a pre-trained word embedding model from sentence-transformers which is available on the Hugging Face hub and a pre-trained image embedding model available from the FiftyOne open-source toolkit. We will also use the FiftyOne app to project the embeddings into a 2D space and then visualize/explore this 2D space.
# import the usual suspects
import pickle  # used later to load the pre-built LDA vectors
import numpy as np
import pandas as pd
# load FiftyOne packages for image embeddings and embedding visualization
import fiftyone.brain as fob
import fiftyone as fo
import fiftyone.zoo as foz
# load sentence-transformer packages for word embeddings
# and nearest-neighbor search
from sentence_transformers import SentenceTransformer, util
In this post, I will only use one embedding model to construct the text embeddings (all-MiniLM-L6-v2) and one embedding model to construct the image embeddings (resnet50-imagenet-torch). Both of these models were trained on massive datasets and have shown good out-of-the-box performance on a range of general tasks. In a future post, I will assess multiple pre-trained models as well as potential gains in performance from fine-tuning the models on our specific dataset.
# download pretrained models
model_wrd = SentenceTransformer('all-MiniLM-L6-v2')
model_img = foz.load_zoo_model("resnet50-imagenet-torch")
Speaking of our dataset, the one I used for constructing the word and image embeddings is shown below. It contains entries for 32,173 movies, where each entry contains relevant movie metadata (title, year, IMDb rating), the full movie script, a short text description of the movie, and a poster for the movie. The data was synthesized from numerous sources, including TMDB.org, IMDb.com, and springfieldspringfield.co.uk. Please refer to my previous post for more details on how I wrangled this dataset.
# print out a few entries from the embeddings dataset
df_embedding = pd.read_csv('df_embeddings.csv', index_col=[0])
df_embedding.head()
The last bit of set-up is to build a FiftyOne-specific dataset which will be fed into FiftyOne’s app for visualizing the embedding space projections. The following code snippet is what I used to create this FiftyOne dataset, including the addition of several Fields for each sample that can be used to color-code and filter the data points.
data_path = "path_to_downloaded_tmdb_posters\\posters\\"
# initialize a FiftyOne dataset
fo_dataset = fo.Dataset()
# Add samples and sample fields to the dataset
for j in range(len(df_embedding)):
    file_str = data_path + df_embedding.iloc[j]['poster_img']
    sample = fo.Sample(filepath=file_str)
    fo_dataset.add_sample(sample)
    # add fields to each sample
    sample['genre'] = df_embedding.iloc[j]['genre'].split(',')[0]
    sample['movie_year'] = df_embedding.iloc[j]['movie_year']
    sample['movie_title'] = df_embedding.iloc[j]['movie_title']
    sample['average_rating'] = df_embedding.iloc[j]['average_rating']
    sample['num_votes'] = df_embedding.iloc[j]['num_votes']
    sample.save()
Building the word and image embeddings
By far the easiest part of this entire project has been constructing the word and image embeddings. This is thanks to the many extremely easy-to-use and immensely powerful embedding models available to the ML/DS community. The following code block shows the one-liners needed to transform the raw scripts and poster images into their respective vector spaces. I also load in the latent topics obtained from the Latent Dirichlet Allocation (LDA) model (i.e. the transformation of the movie scripts into the latent vector space learned by LDA). See my previous post for how this “embedding” model was learned.
# transform movie descriptions to the word embedding space (~10 minutes)
embeddings_tmdb_desc = [
    model_wrd.encode(jdesc) for jdesc in df_embedding['tmdb_description']
]
# transform movie scripts to the word embedding space (~120 minutes)
embeddings_scripts = [
    model_wrd.encode(jscript) for jscript in df_embedding['script_text']
]
# transform movie posters to the image embedding space (~200 minutes)
embeddings_posters = fo_dataset.compute_embeddings(model_img)
# load in the pre-built LDA embedding vectors (i.e. topic distributions)
with open("..\\movie_recsys\\model_building_out\\Xtran.txt", "rb") as f:
    embeddings_LDA = pickle.load(f)
embeddings_LDA = embeddings_LDA[list(df_embedding['script_id']), :]
Now that we have projected the raw text and image data into their respective vector spaces, we can move on to visualizing the embeddings. To accomplish this, I used the fob.compute_visualization() method to project the embeddings from their high-dimensional vector spaces onto a 2D plane. The function uses UMAP by default, but t-SNE can also be used. Finally, we launch the FiftyOne app by calling session = fo.launch_app(fo_dataset).
# embeddings of poster data
results = fob.compute_visualization(
    fo_dataset,
    embeddings=embeddings_posters,
    num_dims=2,
    brain_key="poster_embeddings",
    method="umap",
    verbose=True,
    seed=51,
)
... # repeat for other embedding spaces
session = fo.launch_app(fo_dataset)
Visualizations of the embedding spaces
First, we take a look at the 2D projection of the LDA embeddings of the movie script data, colored by primary movie genre (left image in the figure below). We can clearly see the emergence of two distinct clusters in this space. Upon inspection of these data points, I determined that the smaller cluster on the left consists predominantly of Bollywood films (films produced in India). There also appears to be some local clustering by genre, which could potentially be enhanced by reducing the number of genre labels (e.g. by combining the ‘Music’ and ‘Musical’ genres into a single ‘Music’ genre). Also interesting: if we color-code the data points by release year (right image in the figure below), we see that older movies are strongly clustered in a few different places in this vector space.
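For reference, collapsing near-duplicate genre labels would be a one-line pandas operation (a hypothetical cleanup step, not something applied in this post; the contents of genre_map are illustrative):
# hypothetical cleanup: merge near-duplicate genre labels before color-coding the plots
genre_map = {'Musical': 'Music'}  # extend with other overlapping labels as needed
primary_genre = df_embedding['genre'].str.split(',').str[0].replace(genre_map)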
Next, I visualized the word embeddings formed by the MiniLM model as well as the image embeddings formed by the ResNet model (left and right images in the figure below, respectively). In the word embedding plot, we see a few smaller clusters in addition to a large central cluster. Upon inspection of these smaller clusters, it appears that the embedding model has strongly grouped together some rather niche topics. For example, one cluster contains films purely about wine. Another cluster only contains animated DC superhero movies, mainly Batman. In contrast to the LDA vector space, the Bollywood cluster is less distinct.
In the image embedding plot, I show a subset of neighboring images in the vector space to illustrate the model’s capabilities. It seems as if the image model focuses too much on the backgrounds of the posters rather than their more substantive content. For example, in the lower left of the embedding space, all of the posters contain a brick wall. Similarly, in the lower right (resp. upper left) of the embedding space, all of the images have a solid yellow (resp. solid black) background.
Finding similar movies in the embedding space
In the final part of this post, we use cosine similarity to find the four nearest neighbors in each of the vector spaces for a given input movie. The four input movies we use as test cases are “Remember the Titans,” “Little Giants,” “Green Book,” and “Finding Nemo.” In each case, I show the movie title, year, genres, short description, and poster for the input movie, but I only show the movie title, year, and poster for the calculated similar movies.
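The exact lookup code is in the GitHub repo; the snippet below is a minimal sketch of how it can be done with the util module imported earlier (the top_k_similar helper and the choice of ‘Finding Nemo’ as the query are mine for illustration):
# minimal sketch: find the k most similar movies to a query movie via cosine similarity
def top_k_similar(query_idx, embeddings, k=4):
    emb = np.asarray(embeddings, dtype=np.float32)
    # semantic_search ranks the corpus by cosine similarity to the query vector
    hits = util.semantic_search(emb[query_idx][None, :], emb, top_k=k + 1)[0]
    # drop the query movie itself, which is returned as its own nearest neighbor
    return [h['corpus_id'] for h in hits if h['corpus_id'] != query_idx][:k]

# e.g. the four nearest neighbors of "Finding Nemo" in the script embedding space
query_idx = int(np.flatnonzero(df_embedding['movie_title'] == 'Finding Nemo')[0])
print(df_embedding.iloc[top_k_similar(query_idx, embeddings_scripts)]['movie_title'])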
There are a few general remarks we can draw from these experiments:
- In general, all four models return a different list of similar movies for a given input movie.
- The word embeddings of just the movie description produced the least diverse recommendations when compared to the word embeddings of the entire movie script and the embeddings obtained from the LDA model.
- The image embedding model does a remarkable job of finding posters that look similar to the input poster, but I would not say that it produces relevant movie recommendations. With the exception of Finding Nemo, most of the similar movie posters have little content in common with the input movie. An example use case for the image embedding model could be personalizing which poster to show a user when multiple posters are available for a movie.
- All three text-based models provide reasonable movie recommendations for the input movie, but the LDA model seems to provide the most robust and diverse recommendations. As pointed out last time, the LDA model is excellent at finding the primary genre (e.g. a sports film) and the secondary genre (e.g. sports-comedy or sports-drama).
Future work
The image and word embeddings by themselves, straight out of the box, do not provide a substantial lift in recommendation quality compared with the LDA model we built previously. However, note that the LDA model was trained specifically on the script dataset and, therefore, had access to most of the data. Moreover, the LDA model was trained on a bag-of-words matrix created after extensive preprocessing of the script text, whereas no preprocessing was applied to the scripts before creating the word embeddings with the MiniLM model.
The power of these embeddings arises when they are used as input features to a deep neural network designed for ranking candidate items. A nice advantage of a deep-learning architecture is that it is relatively straightforward to concatenate the item and user embeddings with additional dense features, such as user geographic location, age, etc. Implementing such a deep-learning model will be the subject of a future post.
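As a rough illustration of that idea, here is a minimal sketch of such a ranking network, assuming PyTorch; the layer sizes, the 384-dimensional item embedding (the output size of all-MiniLM-L6-v2), and the dense features are placeholders rather than a final design:
# minimal sketch of a ranking network that concatenates embeddings with dense features
import torch
import torch.nn as nn

class RankingModel(nn.Module):
    def __init__(self, user_dim=64, item_dim=384, num_dense=4, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(user_dim + item_dim + num_dense, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted relevance score for the user/movie pair
        )

    def forward(self, user_emb, item_emb, dense_feats):
        # concatenate the user embedding, the movie embedding, and the dense features
        x = torch.cat([user_emb, item_emb, dense_feats], dim=-1)
        return self.mlp(x).squeeze(-1)

# score a batch of 8 user/movie pairs with 4 placeholder dense features (age, location, ...)
model = RankingModel()
scores = model(torch.randn(8, 64), torch.randn(8, 384), torch.randn(8, 4))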