# CS 498: Computational Advertising Homework 4 solution

\$30.00

Original Work ?
Category:

## Question 1 (2 points)

While simrank is suitable for search engines, how would you adapt it for display ads where
are there are no queries?

## Question 2 (2 points)

Consider the following methods to compute user and item biases in neighborhood models
for recommendation.
1
Figure 1: Question 4: figure illustrating the integration of the binary view of the rating
matrix, into matrix factorization
bi =
P
u∈R(i)
(rui − µ)
λ2 + |R(i)|
bu =
P
i∈R(u)
(rui − µ − bi)
λ3 + |R(u)|
where R(i) denotes the set of users who have rated on item i, and R(u) denotes the set
of items rated by user u. Explain the purpose of constants λ2 and λ3.

## Question 3 (2 points)

In the matrix factorization model for recommendation, how do you determine the number
of latent factors?

## Question 4 (3 points)

Consider figure 1 which illustrates factoring in the binary view of the rating matrix, into
matrix factorization. Why does the information from the (dense) binary view (which is
derived from the rating matrix) produce a 7% improvement in the results over direct matrix
factorization on the rating matrix?
2

## Question 5 (7 + 14 = 21)

The objective of this assignment is to design a movie recommender system, i.e., recommend
movies to users using item-item neighborhood models. The dataset includes ratings provided
by users to some movies, and movie metadata that includes title, overview, etc.

Let the set of users be represented by U = {u ∈ U} and set of movies by represented by
I = {i ∈ I}. First, estimate movie biases {bi
: i ∈ I} by averaging over users that rated the
movie, and then estimate user biases by averaging residuals over movies rated by the user.

bi =
P
u∈R(i)
(rui − µ)
|R(i)|
bu =
P
i∈R(u)
(rui − µ − bi)
|R(u)|

where µ is the global mean calculated across all the ratings available in the dataset. The
rating for user u and movie i is predicted as:
rˆui = bu + bi + µ +
P
j∈R(u)
sij × (ruj − buj )
P
j∈R(u)
sij
where sij denotes the similarity between movies i and j. Note that the neighborhood R(u)
is computed over all items rated by user u. In this assignment, we explore two approaches
to compute movie-movie similarities.

1. Pearson correlation: Estimate movie-movie similarity using empirical Pearson correlation coefficient on shared support of items i and j, defined as:
sij =
P
u∈U(i,j)
(rui − bui)(ruj − buj )
r P
u∈U(i,j)
(rui − bui)
2
· (ruj − buj )
2
where U(i, j) is the set of users who have rated movies i and j, and bui = bu + bi + µ

2. Content similarity:
Movie metadata is a collection of words, sij is the cosine similarity between the two
movie documents i and j. In order to calculate cosine similarity, we first convert the
documents to a vectors using TF-IDF.
3

Term Frequency also known as TF, measures the number of times a term (word)
occurs in a document, normalized by document length.
T F(t, i) = Number of times term t appears in document of movie i
Total number of terms in document of movie i

Inverse Document Frequency also known as IDF, measures how important a term
is to a particular movie.
IDF(t) = loge
 Total number of movies in dataset
Number of movies containing term t


Now let the set of all unique words across all movie documents be V = {w1, . . . , w|V |}.
We define the vector for movie i as d
i = (d
i
1
, . . . , di
|V |
) where d
i
k = T F(wk, i)×IDF(wk)

The similarity between movies i and j, is now computed as:
sij = cosine-sim(d
i
, dj
)
In this MP, given an input (u, i) , you need to output the estimated rating value ˆrui.
Round ˆrui to 1 decimal point.

Input format: The input contains the (user,movie) ratings, movie metadata and the
(user,movie) pairs for which you need to estimate the ratings.

The first line of the input contains two space separated integers R M. R is the number
of lines of user-movie ratings and M is the number of movies. Next R lines contain the
ratings. Each line contains three space-separated values (user-id, movie-id, rating).

Next M
lines contain the metadata. The first entry in each line is the movie id, followed by words
describing movie metadata. The last five lines contain two space-separated integers (target
user-id, target movie-id) for which you need to estimate the ratings. You should submit your
code on compass and the code outputs in the pdf submitted to gradescope.
4