Latent Factor Models
The basic assumption is that a significant portion of the rows and columns of the data matrix are highly correlated. Highly correlated data can be well explained by a small number of columns, so a low-rank matrix is useful for estimating the matrix. The low-rank representation reduces the dimensionality by rotating the axis system so that pairwise correlations between dimensions are removed.
Low-Rank approach
✏️ R has rank $k \ll \min\{m,n\}$
U is an $m \times k$ matrix, and V is an $n \times k$ matrix.
Each column of U is viewed as one of the k basis vectors.
The jth row of V contains the corresponding coefficients to combine the basis vectors.
✏️R has a rank larger than k
The residual matrix is $(R - UV^T)$ and the error is $\|R - UV^T\|^2$, where $\|\cdot\|$ is the Frobenius norm (so the squared norm is the sum of the squares of all entries).
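As a small illustration (a NumPy sketch on a made-up fully specified matrix, not an example from the book), the squared Frobenius error of a rank-k approximation can be computed directly from the residual matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully specified ratings matrix (m x n); the values are arbitrary.
R = rng.integers(1, 6, size=(4, 5)).astype(float)

k = 2  # target rank, k << min(m, n)

# One possible rank-k factorization, built from the truncated SVD.
Q, sigma, Pt = np.linalg.svd(R, full_matrices=False)
U = Q[:, :k] * sigma[:k]          # m x k
V = Pt[:k, :].T                   # n x k

residual = R - U @ V.T            # residual matrix (R - UV^T)
error = np.linalg.norm(residual, "fro") ** 2   # sum of squared entries
print(error)
```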

In this case the genre becomes the latent vector (concept). In this chapter, we want to find the entries of the matrices U and V by solving an optimization problem.
Geometric Intuition

Let's assume the ratings of these three movies are highly positively correlated. In this case, just one latent factor is enough to explain the trend of the data. When we know just one of the ratings, we can recover the ratings of the other movies by finding the intersection point between the plane fixed by that known rating and the line defined by the latent vector.
The point is that we focus on finding a set of relevant latent vectors: the average squared distance of the data points from the hyperplane defined by these latent vectors must be as small as possible.
Basic Matrix Factorization
The ith row $\bar{u}_i$ of U is a user factor, which contains the affinity of user i towards the k concepts.
Each row $\bar{v}_i$ of V is an item factor, which contains the affinity of the ith item towards the k concepts.
In the example above, $r_{ij}$ can be interpreted as follows.
The sum of (affinity of user i to history) × (affinity of item j to history) and (affinity of user i to romance) × (affinity of item j to romance).
When U and V have negative values, it becomes hard to interpret the meaning of a latent vector. To make the model interpretable, NMF (Non-negative Matrix Factorization) is introduced in the following post.
Unconstrained Matrix Factorization
Even SVD (Singular Value Decomposition) has a constraint: its basis vectors must be mutually orthogonal. In this chapter, we deal with the unconstrained problem.
Objective
Because of missing entries, only a subset of the entries is known. $\sum_{s=1}^{k} u_{is} \cdot v_{js}$ is the predicted rating of item j by user i, aggregating the affinities over all k latent variables. To make the objective function well defined, only the set S of observed entries is considered.
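Written out explicitly (my reconstruction from the notation above; the 1/2 factor is a common convention that keeps the gradient updates clean), the objective over the observed set S is:

```latex
J = \frac{1}{2} \sum_{(i,j) \in S} e_{ij}^2
  = \frac{1}{2} \sum_{(i,j) \in S} \Big( r_{ij} - \sum_{s=1}^{k} u_{is} v_{js} \Big)^2
```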
Solution: GD


Here q denotes the index of a latent variable.
$\nabla J = \left[\frac{\partial J}{\partial u_{iq}} \,\Big|\, \frac{\partial J}{\partial v_{jq}}\right], \qquad \mathrm{VAR} = [U \mid V], \qquad \mathrm{VAR} \Leftarrow \mathrm{VAR} - \alpha \cdot \nabla J$
In matrix form, $U \Leftarrow U + \alpha E V$ and $V \Leftarrow V + \alpha E^T U$, where E is the $m \times n$ error matrix with $e_{ij} = r_{ij} - \hat{r}_{ij}$ on the observed entries and 0 elsewhere.
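A minimal NumPy sketch of this batch update (the 0/1 mask `M` marking the observed set S and the random toy data are my own):

```python
import numpy as np

def gradient_descent_step(R, M, U, V, alpha=0.005):
    """One batch GD step: U <= U + alpha*E*V,  V <= V + alpha*E^T*U.

    R: (m, n) ratings, M: (m, n) 0/1 mask of observed entries (the set S),
    U: (m, k) user factors, V: (n, k) item factors.
    """
    E = M * (R - U @ V.T)              # error matrix; unobserved entries stay 0
    return U + alpha * E @ V, V + alpha * E.T @ U

# Tiny usage with random data and a small random initialization.
rng = np.random.default_rng(0)
m, n, k = 6, 5, 2
R = rng.integers(1, 6, size=(m, n)).astype(float)
M = (rng.random((m, n)) < 0.6).astype(float)   # ~60% of entries observed
U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((n, k))
for _ in range(500):
    U, V = gradient_descent_step(R, M, U, V)
```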
Solution: SGD

Unlike the batch update above, SGD updates the entries of the factor matrices using only a part of the set S at a time.
In vector form, $\bar{u}_i \Leftarrow \bar{u}_i + \alpha e_{ij} \bar{v}_j$ and $\bar{v}_j \Leftarrow \bar{v}_j + \alpha e_{ij} \bar{u}_i$.
For one update, a single observed entry (i,j) is needed.
In practice, SGD often converges faster than GD, but GD tends to converge more smoothly. α is usually set to a small value such as 0.005. To avoid the local optimum problem, the bold driver algorithm (choosing α anew at each iteration) can also be adopted. Initialization is also an issue; one way to handle it is the SVD-based heuristic dealt with in a later section.
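Per-entry SGD as a sketch (the bold driver step-size adaptation and the SVD-based initialization are omitted; `S` is a list of observed (i, j) pairs and U, V are NumPy arrays as in the earlier GD snippet):

```python
def sgd_epoch(S, R, U, V, alpha=0.005):
    """One pass over the observed entries, one (i, j) at a time:
    u_i <= u_i + alpha*e_ij*v_j,  v_j <= v_j + alpha*e_ij*u_i (in place)."""
    for i, j in S:
        e_ij = R[i, j] - U[i] @ V[j]   # error on this single entry
        u_old = U[i].copy()            # keep the old u_i for the v_j update
        U[i] += alpha * e_ij * V[j]
        V[j] += alpha * e_ij * u_old
    return U, V
```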
Regularization
The basic idea of regularization is to discourage very large entries in U and V by adding a regularization term, $\frac{\lambda}{2}\left(\|U\|^2 + \|V\|^2\right)$, to the optimization problem.


In this case the regularization term is based on the L2 (Frobenius) norm.
✏️Gradient Descent

$U \Leftarrow U(1 - \alpha\lambda) + \alpha E V, \qquad V \Leftarrow V(1 - \alpha\lambda) + \alpha E^T U$
Here the unobserved entries of E are set to 0. The factor $(1 - \alpha\lambda)$ plays the role of shrinking the parameters at each step.
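The same batch step with the $(1 - \alpha\lambda)$ shrinkage, as a sketch (it reuses the 0/1 mask convention from the earlier GD snippet; λ = 0.1 is an arbitrary choice):

```python
def regularized_gd_step(R, M, U, V, alpha=0.005, lam=0.1):
    """U <= U*(1 - alpha*lam) + alpha*E*V,  V <= V*(1 - alpha*lam) + alpha*E^T*U."""
    E = M * (R - U @ V.T)                          # unobserved entries of E are 0
    U_new = (1 - alpha * lam) * U + alpha * E @ V  # shrink, then move along the gradient
    V_new = (1 - alpha * lam) * V + alpha * E.T @ U
    return U_new, V_new
```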
✏️Stochastic gradient descent - vectorized local updates

$\bar{u}_i \Leftarrow \bar{u}_i + \alpha(e_{ij}\bar{v}_j - \lambda\bar{u}_i), \qquad \bar{v}_j \Leftarrow \bar{v}_j + \alpha(e_{ij}\bar{u}_i - \lambda\bar{v}_j)$
We can also update the parameters in this vector form within stochastic gradient descent.
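As code, the local update differs from the plain SGD step only by the shrinkage terms (a sketch that mutates U and V in place; λ is again an arbitrary choice):

```python
def regularized_sgd_update(R, U, V, i, j, alpha=0.005, lam=0.1):
    """u_i <= u_i + alpha*(e_ij*v_j - lam*u_i),  v_j <= v_j + alpha*(e_ij*u_i - lam*v_j)."""
    e_ij = R[i, j] - U[i] @ V[j]
    u_old = U[i].copy()                          # the old u_i is used in the v_j update
    U[i] += alpha * (e_ij * V[j] - lam * U[i])
    V[j] += alpha * (e_ij * u_old - lam * V[j])
```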
✏️Stochastic gradient descent - vectorized global updates
The regularization term is divided up among the corresponding observed entries.
The book notes that the local updates apply the regularization to $u_{iq}$ and $v_{jq}$ several times (once per observed entry), while the global update applies it to these variables just once. This leads to the difference between the two methods (a trade-off between quality and efficiency).
Solution: ComponentWise-SGD
It's also possible to train the latent components incrementally. The approach repeatedly cycles through all the observed entries in S while performing these updates for q = 1 until convergence is reached, and then moves on to the next value of q.

This means that the overall rank-k factorization can be expressed as the sum of k rank-1 factorizations, $UV^T = \sum_{q=1}^{k} \bar{U}_q \bar{V}_q^T$.
$\bar{U}_q \bar{V}_q^T$ is the outer product of the qth columns $\bar{U}_q$ of U and $\bar{V}_q$ of V.
The main difference between this approach and the one above is that it cycles through q in the outer loop.
This approach leads to faster and more stable convergence.
The earlier latent components become the dominant ones, just as they do in SVD. (This approach does not ensure orthogonality between the latent vectors, but they can be made orthogonal using projected gradient descent.)
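A sketch of the component-wise scheme as I read it: each rank-1 component is trained on the residual left by the components learned so far, with a fixed number of epochs standing in for a real convergence check:

```python
import numpy as np

def componentwise_sgd(R, S, k=2, alpha=0.01, epochs=200, seed=0):
    """Train k rank-1 components one at a time (q is the outer loop); each new
    component is fit to the residual left by the previously learned ones."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    residual = R.astype(float)               # working copy of the ratings
    for q in range(k):                       # outer loop over latent components
        for _ in range(epochs):              # stand-in for "cycle until convergence"
            for i, j in S:
                e_ij = residual[i, j] - U[i, q] * V[j, q]
                u_old = U[i, q]
                U[i, q] += alpha * e_ij * V[j, q]
                V[j, q] += alpha * e_ij * u_old
        for i, j in S:                       # subtract this rank-1 component's contribution
            residual[i, j] -= U[i, q] * V[j, q]
    return U, V                              # R ~ sum_q outer(U[:, q], V[:, q]) on S
```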
Solution: Alternating Least Squares and Coordinate Descent
The stochastic gradient method is sensitive both to the initialization and the way in which the step sizes are chosen. To compensate for these shortcomings, ALS can be applied.
Algorithm for ALS
Keeping U fixed: in the optimization problem $\sum_{i:(i,j)\in S}\left(r_{ij} - \sum_{s=1}^{k} u_{is} v_{js}\right)^2$, the values $u_{i1},\ldots,u_{ik}$ are treated as constants, whereas $v_{j1},\ldots,v_{jk}$ are treated as variables. It becomes a least-squares regression problem in $\bar{v}_j$.
Keeping V fixed: $v_{j1},\ldots,v_{jk}$ are treated as constants, whereas $u_{i1},\ldots,u_{ik}$ are treated as variables, giving a least-squares problem in $\bar{u}_i$.
These two steps are iterated until convergence. The least-squares problem for each user (and each item) is independent of the others, so each step can be parallelized easily.
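A sketch of one ALS iteration using regularized least squares via the normal equations (the 0/1 mask `M` and λ are my own conventions, not from the book):

```python
import numpy as np

def als_step(R, M, U, V, lam=0.1):
    """Fix V and solve for each user row, then fix U and solve for each item row."""
    m, k = U.shape
    n = V.shape[0]
    for i in range(m):                          # user problems are mutually independent
        obs = M[i] > 0                          # items rated by user i
        A = V[obs].T @ V[obs] + lam * np.eye(k)
        U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
    for j in range(n):                          # item problems are mutually independent
        obs = M[:, j] > 0                       # users who rated item j
        A = U[obs].T @ U[obs] + lam * np.eye(k)
        V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
    return U, V
```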
A weighted version of ALS is well suited to implicit feedback settings, where the matrix is assumed to be fully specified with many zero values. However, ALS is not quite as efficient as stochastic gradient descent in large-scale settings with explicit ratings. Other methods such as coordinate descent can address this problem.
Updates for coordinate descent
$u_{iq} \Leftarrow \dfrac{\sum_{j:(i,j)\in S}\left(e_{ij} + u_{iq} v_{jq}\right) v_{jq}}{\lambda + \sum_{j:(i,j)\in S} v_{jq}^2}$

$v_{jq} \Leftarrow \dfrac{\sum_{i:(i,j)\in S}\left(e_{ij} + u_{iq} v_{jq}\right) u_{iq}}{\lambda + \sum_{i:(i,j)\in S} u_{iq}^2}$
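These updates transcribe almost directly into code (a sketch; `S` is the list of observed (i, j) pairs, and $e_{ij}$ is recomputed on the fly rather than cached):

```python
def coordinate_descent_epoch(R, S, U, V, lam=0.1):
    """One sweep over all coordinates u_iq and v_jq using the updates above."""
    k = U.shape[1]
    items_of = {}                               # observed items per user
    users_of = {}                               # observed users per item
    for i, j in S:
        items_of.setdefault(i, []).append(j)
        users_of.setdefault(j, []).append(i)
    for q in range(k):
        for i, items in items_of.items():
            num = sum((R[i, j] - U[i] @ V[j] + U[i, q] * V[j, q]) * V[j, q] for j in items)
            den = lam + sum(V[j, q] ** 2 for j in items)
            U[i, q] = num / den
        for j, users in users_of.items():
            num = sum((R[i, j] - U[i] @ V[j] + U[i, q] * V[j, q]) * U[i, q] for i in users)
            den = lam + sum(U[i, q] ** 2 for i in users)
            V[j, q] = num / den
    return U, V
```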
Singular Value Decomposition
Consider the case where the rating matrix is fully specified. One can approximate R by a truncated SVD of rank $k \ll \min\{m,n\}$.
$Q_k$ contains the k largest eigenvectors of $RR^T$.
$P_k$ contains the k largest eigenvectors of $R^TR$. $P_k$ represents the reduced row space; its k eigenvectors give the directions of item-item correlations among the ratings.
$Q_k\Sigma_k$ contains the transformed and reduced $m \times k$ representation of the original rating matrix.
$U = Q_k\Sigma_k, \quad V = P_k$

When R is incompletely specified, one can impute the missing entries using the row-wise average.
How to execute
Let $R_c$ be the row-centered matrix, and set the missing entries of $R_c$ to 0.
Decompose $R_c$ into $Q_k\Sigma_k P_k^T$.
$U = Q_k\Sigma_k, \quad V = P_k$
$\hat{r}_{ij} = \bar{u}_i \cdot \bar{v}_j + \mu_i$, where $\mu_i$ is the mean rating of user i.
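A sketch of these steps in NumPy (mean-imputation by each user's observed average; the 0/1 mask `M` and the function names are my own):

```python
import numpy as np

def svd_factors(R, M, k=2):
    """Row-center R on its observed entries, zero-fill the rest,
    take a rank-k truncated SVD, and return U = Q_k*Sigma_k, V = P_k, mu."""
    counts = np.maximum(M.sum(axis=1), 1)        # avoid division by zero for empty rows
    mu = (R * M).sum(axis=1) / counts            # mu_i: mean observed rating of user i
    Rc = (R - mu[:, None]) * M                   # centered matrix; missing entries are 0
    Q, sigma, Pt = np.linalg.svd(Rc, full_matrices=False)
    U = Q[:, :k] * sigma[:k]                     # Q_k * Sigma_k
    V = Pt[:k, :].T                              # P_k
    return U, V, mu

def predict(U, V, mu, i, j):
    """r_hat_ij = u_i . v_j + mu_i (add the user mean back)."""
    return U[i] @ V[j] + mu[i]
```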