Latent Factor Models
The basic assumption is that a significant portion of the rows and columns of the data matrix are highly correlated. Highly correlated data can be explained well by a small number of columns, so a low-rank matrix is useful for matrix estimation. It reduces the dimensionality by rotating the axis system so that the pairwise correlations between dimensions are removed.
✔️ $R$ has a rank of at most $k$
In this case $R$ can be expressed exactly as $R = UV^T$, where $U$ is an $m \times k$ matrix and $V$ is an $n \times k$ matrix.
Each column of $U$ is viewed as one of the $k$ basis vectors.
The $j$th row of $V$ contains the corresponding coefficients for combining the basis vectors into the $j$th column of $R$.
✔️ $R$ has a rank larger than $k$
In this case $R$ can only be approximated as $R \approx UV^T$. The residual matrix is $E = R - UV^T$ and the error is $\|E\|^2$. Here $\|\cdot\|$ is the Frobenius norm, so $\|E\|^2$ is the sum of the squares of all entries of $E$.
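To make the residual concrete, here is a minimal NumPy sketch (the small random example and variable names are mine, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2

U = rng.normal(size=(m, k))                   # m x k factor matrix
V = rng.normal(size=(n, k))                   # n x k factor matrix
R = U @ V.T + 0.1 * rng.normal(size=(m, n))   # a nearly rank-k rating matrix

E = R - U @ V.T                               # residual matrix E = R - UV^T
error = np.sum(E ** 2)                        # squared Frobenius norm of E
print(error)
```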
In the example, the genre plays the role of the latent vector (concept). In this chapter, we want to find the entries of the $U$ and $V$ matrices by solving an optimization problem.
Let's assume the ratings of these three movies are highly positively correlated. In this case, just one latent factor is enough to explain the trend of the data. When we know just one rating, we can also infer the ratings of the other movies by finding the intersection point between a plane and the latent vector.
The point is that we focus on finding a set of relevant latent vectors: the average squared distance of the data points from the hyperplane defined by these latent vectors must be as small as possible.
The $i$th row $\bar{u}_i$ of $U$ is a user factor, which contains the affinity of user $i$ towards the $k$ concepts.
The $j$th row $\bar{v}_j$ of $V$ is an item factor, which contains the affinity of the $j$th item towards the $k$ concepts.
In the example above, $r_{ij}$ has the following meaning:

$$r_{ij} \approx \bar{u}_i \cdot \bar{v}_j = (\text{affinity of user } i \text{ to history}) \times (\text{affinity of item } j \text{ to history}) + (\text{affinity of user } i \text{ to romance}) \times (\text{affinity of item } j \text{ to romance})$$
When $U$ and $V$ contain negative values, it becomes hard to interpret the meaning of a latent vector. To make the model interpretable, NMF (Non-negative Matrix Factorization) is introduced in the following post.
Even SVD (Singular Value Decomposition) has the constraint that the vectors must be mutually orthogonal. In this chapter, we deal with the unconstrained problem.
Objective
Because of missing entries, only a subset of the entries of $R$ is known. The predicted value $\hat{r}_{ij} = \sum_{s=1}^{k} u_{is} v_{js}$ captures the affinity between user $i$ and item $j$ across all latent variables. To define the objective function, we need to consider only the set of observed entries $S = \{(i, j) : r_{ij} \text{ is observed}\}$, as follows:

$$\text{Minimize } J = \frac{1}{2} \sum_{(i,j) \in S} e_{ij}^2 = \frac{1}{2} \sum_{(i,j) \in S} \left( r_{ij} - \sum_{s=1}^{k} u_{is} v_{js} \right)^2$$
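As a small sketch of how this objective can be evaluated, assuming missing ratings are stored as `np.nan` (a representation I chose for illustration, not the book's):

```python
import numpy as np

def objective(R, U, V):
    """J = (1/2) * sum over observed (i, j) in S of (r_ij - u_i . v_j)^2."""
    observed = ~np.isnan(R)                      # mask of the observed set S
    E = np.where(observed, R - U @ V.T, 0.0)     # errors only on observed entries
    return 0.5 * np.sum(E ** 2)
```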
Solution: GD
Here $q \in \{1, \ldots, k\}$ denotes the index of a latent variable, and $e_{ij} = r_{ij} - \hat{r}_{ij}$ is the error on an observed entry. The gradient-descent updates are

$$u_{iq} \Leftarrow u_{iq} + \alpha \sum_{j:(i,j) \in S} e_{ij} v_{jq}, \qquad v_{jq} \Leftarrow v_{jq} + \alpha \sum_{i:(i,j) \in S} e_{ij} u_{iq} \qquad \forall q \in \{1, \ldots, k\}$$
In matrix form, with $E = [e_{ij}]$ the $m \times n$ error matrix whose unobserved entries are set to $0$,

$$U \Leftarrow U + \alpha E V, \qquad V \Leftarrow V + \alpha E^T U$$
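A minimal NumPy sketch of these matrix-form updates, again assuming missing ratings are stored as `np.nan`; the step size and iteration count are illustrative values I chose, not prescribed ones:

```python
import numpy as np

def gradient_descent(R, k=2, alpha=0.005, iters=500, seed=0):
    """Factor R (missing entries as np.nan) into U (m x k) and V (n x k)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.normal(size=(m, k))
    V = 0.1 * rng.normal(size=(n, k))
    observed = ~np.isnan(R)
    for _ in range(iters):
        E = np.where(observed, R - U @ V.T, 0.0)        # unobserved entries of E are 0
        U, V = U + alpha * E @ V, V + alpha * E.T @ U   # simultaneous update of U and V
    return U, V
```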
Solution: SGD
Unlike the update method above, SGD updates the entries of the matrices using only one element of the set $S$ at a time.
In vector form, for a single observed entry $(i, j) \in S$,

$$\bar{u}_i \Leftarrow \bar{u}_i + \alpha e_{ij} \bar{v}_j, \qquad \bar{v}_j \Leftarrow \bar{v}_j + \alpha e_{ij} \bar{u}_i$$
For one update, a single observed entry is needed.
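A sketch of this entry-wise SGD loop, assuming the observed set `S` is stored as a list of `(i, j, r_ij)` triples (my own representation and default hyperparameters):

```python
import numpy as np

def sgd(S, m, n, k=2, alpha=0.005, epochs=30, seed=0):
    """S is a list of observed triples (i, j, r_ij)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(m, k))
    V = 0.1 * rng.normal(size=(n, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(S)):     # visit observed entries in random order
            i, j, r = S[idx]
            e = r - U[i] @ V[j]                 # error on this single entry
            U[i], V[j] = U[i] + alpha * e * V[j], V[j] + alpha * e * U[i]
    return U, V
```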
In practice, SGD can achieve faster convergence than GD, but GD has smoother convergence. The step size $\alpha$ is usually set to about 0.005. To avoid the local optimum problem, one can also adopt the bold driver algorithm (selecting $\alpha$ anew at each iteration). Initialization is also an issue, and one can handle it with the SVD-based heuristic dealt with in a later section.
Regularization
The basic idea of regularization is to discourage very large entries in $U$ and $V$ by adding a regularization term, $\frac{\lambda}{2}\left(\|U\|^2 + \|V\|^2\right)$, to the optimization problem:

$$\text{Minimize } J = \frac{1}{2} \sum_{(i,j) \in S} e_{ij}^2 + \frac{\lambda}{2} \sum_{i=1}^{m} \sum_{s=1}^{k} u_{is}^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \sum_{s=1}^{k} v_{js}^2$$

In this case the regularization term is the (squared) $L_2$ norm.
✔️ Gradient Descent
$$U \Leftarrow U(1 - \alpha\lambda) + \alpha E V, \qquad V \Leftarrow V(1 - \alpha\lambda) + \alpha E^T U$$

in which the unobserved entries of $E$ are set to $0$. The factor $(1 - \alpha\lambda)$ shows how $\lambda$ plays a role in shrinking the parameters at each step.
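The same step in a short NumPy sketch; `lam` is my illustrative name for the regularization weight $\lambda$, and the defaults are placeholders:

```python
import numpy as np

def regularized_gd_step(R, U, V, alpha=0.005, lam=0.1):
    """One regularized gradient-descent step on the full matrices."""
    observed = ~np.isnan(R)
    E = np.where(observed, R - U @ V.T, 0.0)            # unobserved entries of E are 0
    U_new = (1 - alpha * lam) * U + alpha * E @ V       # shrink, then move along gradient
    V_new = (1 - alpha * lam) * V + alpha * E.T @ U
    return U_new, V_new
```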
✔️ Stochastic gradient descent - vectorized local updates
We can also update our parameters in vector form in stochastic gradient descent. For a single observed entry $(i, j) \in S$,

$$\bar{u}_i \Leftarrow \bar{u}_i(1 - \alpha\lambda) + \alpha e_{ij} \bar{v}_j, \qquad \bar{v}_j \Leftarrow \bar{v}_j(1 - \alpha\lambda) + \alpha e_{ij} \bar{u}_i$$
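A minimal sketch of this local update for a single observed entry (function name and defaults are mine):

```python
import numpy as np

def local_update(U, V, i, j, r, alpha=0.005, lam=0.1):
    """Regularized local update from one observed entry (i, j, r_ij)."""
    e = r - U[i] @ V[j]                                  # error on this entry
    U[i], V[j] = ((1 - alpha * lam) * U[i] + alpha * e * V[j],
                  (1 - alpha * lam) * V[j] + alpha * e * U[i])
    return U, V
```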
✔️ Stochastic gradient descent - vectorized global updates
In the global updates, the regularization term is divided by the number of observed entries of the corresponding user or item.
The book says that the local updates use $\bar{u}_i$ and $\bar{v}_j$ several times within one cycle, while the global updates use these variables just once. This leads to the difference between the two methods (a trade-off between quality and efficiency).
Solution: ComponentWise-SGD
It is also possible to train the latent components incrementally. The approach repeatedly cycles through all the observed entries in $S$ while performing these updates for $q = 1$ until convergence is reached, and then repeats the process for each subsequent component.
The result means that the overall rank-$k$ factorization can be expressed as the sum of $k$ rank-1 factorizations:

$$R \approx UV^T = \sum_{q=1}^{k} \bar{U}_q \bar{V}_q^T$$

Here $\bar{U}_q \bar{V}_q^T$ is the outer product of $\bar{U}_q$ and $\bar{V}_q$, the $q$th columns of $U$ and $V$.
The main difference between this approach and the approach above is that it cycles through $q \in \{1, \ldots, k\}$ in the outer loop.
This approach leads to faster and more stable convergence.
The earlier latent components become the dominant ones, just as they do in SVD. (This approach does not ensure orthogonality between the latent vectors, but they can be made orthogonal using projected gradient descent.)
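A sketch of the component-wise scheme under the same `(i, j, r_ij)` triple representation; treating each component as a rank-1 fit on the current residual is my reading of the incremental approach, and the names and epoch counts are illustrative:

```python
import numpy as np

def componentwise_sgd(S, m, n, k=2, alpha=0.005, epochs=50, seed=0):
    """Train one latent component at a time on the residual ratings."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(m, k))
    V = 0.1 * rng.normal(size=(n, k))
    residual = {(i, j): r for i, j, r in S}              # start from the raw ratings
    for q in range(k):                                   # outer loop over components
        for _ in range(epochs):                          # cycle until (approximate) convergence
            for (i, j), r in residual.items():
                e = r - U[i, q] * V[j, q]                # error of the current rank-1 fit
                U[i, q], V[j, q] = (U[i, q] + alpha * e * V[j, q],
                                    V[j, q] + alpha * e * U[i, q])
        for (i, j) in residual:                          # remove this component's contribution
            residual[(i, j)] -= U[i, q] * V[j, q]
    return U, V
```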
Solution: Alternating Least Squares and Coordinate Descent
The stochastic gradient method is sensitive both to the initialization and to the way in which the step sizes are chosen. To compensate for these shortcomings, ALS can be applied: it fixes one of $U$ and $V$, solves a least-squares problem for the other, and alternates until convergence.
A weighted version of ALS is well suited to implicit feedback settings, where the matrix is assumed to be fully specified with many zero values. However, ALS is not as efficient as stochastic gradient descent in large-scale settings with explicit ratings. Other methods, such as coordinate descent, can address this problem.
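A compact ALS sketch under the `np.nan` convention used above: with $V$ fixed, each user row has a closed-form regularized least-squares solution over the items that user rated, and vice versa (names and hyperparameters are illustrative):

```python
import numpy as np

def als(R, k=2, lam=0.1, iters=20, seed=0):
    """Alternate closed-form least-squares solves for the rows of U and V."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.normal(size=(m, k))
    V = 0.1 * rng.normal(size=(n, k))
    observed = ~np.isnan(R)
    for _ in range(iters):
        for i in range(m):                               # solve for user row u_i with V fixed
            cols = observed[i]
            if cols.any():
                A = V[cols].T @ V[cols] + lam * np.eye(k)
                U[i] = np.linalg.solve(A, V[cols].T @ R[i, cols])
        for j in range(n):                               # solve for item row v_j with U fixed
            rows = observed[:, j]
            if rows.any():
                A = U[rows].T @ U[rows] + lam * np.eye(k)
                V[j] = np.linalg.solve(A, U[rows].T @ R[rows, j])
    return U, V
```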
Consider the case where the rating matrix $R$ is fully specified. One can approximate $R$ by using a truncated SVD of rank $k$:

$$R \approx Q_k \Sigma_k P_k^T$$
$Q_k$ contains the $k$ largest eigenvectors of $RR^T$.
$P_k$ contains the $k$ largest eigenvectors of $R^T R$. $P_k$ represents the reduced row space, and its eigenvectors imply the directions of the item-item correlations among the ratings.
$Q_k \Sigma_k$ contains the transformed and reduced $m \times k$ representation of the original rating matrix, so one can set $U = Q_k \Sigma_k$ and $V = P_k$.
When $R$ is incompletely specified, one can impute the missing entries, for example by using the row-wise average.
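A sketch of this SVD-based initialization with plain NumPy; the function name is mine, and the factors are read off as $U = Q_k \Sigma_k$ and $V = P_k$ following the notation above:

```python
import numpy as np

def svd_initialization(R, k=2):
    """Impute missing entries with the row-wise average, then take a rank-k SVD."""
    row_mean = np.nanmean(R, axis=1, keepdims=True)      # per-user average rating
    filled = np.where(np.isnan(R), row_mean, R)          # impute missing entries
    Q, sigma, Pt = np.linalg.svd(filled, full_matrices=False)
    Qk, Sk, Pk = Q[:, :k], np.diag(sigma[:k]), Pt[:k].T  # truncate to rank k
    U = Qk @ Sk                                          # reduced m x k representation
    V = Pk                                               # n x k item factors
    return U, V
```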