A brief review of traditional graph embedding methods

In my view, the traditional graph embedding problem can be summarized as follows:

Construct a proximity graph G on the data points, then embed the nodes (data points) into some K-dimensional space such that the reconstructed graph G’ is close to the original G in some sense.

Before going into details, let’s first justify why such an idea is useful. Given data points in some high-dimensional feature space, it is pretty difficult to separate them directly (assuming you have some ML experience). However, in much of the real data we care about, although the feature dimension is high, most data actually lie on some low-dimensional manifold. The figure below is an example: although the data spread over 2-d space, they are mainly organized in two 1-d non-linear subspaces (circle-shaped manifolds). Building a proximity graph G over the data can help us find such low-dimensional non-linear manifolds. For example, the two clusters found by KMeans in the figure are pretty messed up (which means the original 2-d representation of the data doesn’t preserve the manifolds directly), but if we use the 1-NN of each data point, we find that points on the inner circle and points on the outer circle are completely disconnected, i.e. they have zero similarity, so we have a better chance of separating them.

Now that we have constructed a proximity graph that captures the intrinsic manifolds in the data, can we just use it as features for all kinds of tasks, such as clustering? No. This graph G is of the size of the data set, which is usually large, and it is also sparse, so directly using the neighbors of a node as its features may not be a good idea. A practical idea is to embed the nodes (data points) into some lower-dimensional space such that the proximity in the original graph G is preserved, which will be measured by the “closeness” of the original proximity graph G and the reconstructed graph G’.
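As a sanity check of the 1-NN claim above, here is a minimal NumPy sketch; the two-circles data (radii, point counts, and noise level) are made up for illustration. No nearest-neighbor edge crosses between the two circles, so the two manifolds are disconnected in the proximity graph:

```python
import numpy as np

rng = np.random.default_rng(0)

def circle(radius, n):
    # n points evenly spaced on a circle of the given radius, plus small noise
    t = np.linspace(0, 2 * np.pi, n, endpoint=False)
    pts = radius * np.c_[np.cos(t), np.sin(t)]
    return pts + rng.normal(scale=0.02, size=pts.shape)

X = np.vstack([circle(1.0, 100), circle(0.5, 100)])
y = np.array([0] * 100 + [1] * 100)   # which circle each point lies on

# 1-NN proximity graph: connect each point to its single nearest neighbor.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nn = d.argmin(axis=1)

# No 1-NN edge crosses between the circles: the two manifolds end up
# completely disconnected (zero similarity), unlike raw KMeans clusters.
print(np.all(y[nn] == y))   # True
```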

 

Now that we have justified why the graph embedding task is meaningful, there are several essential design choices to make before we can implement it, and different choices lead to different concrete methods. The questions are:

1. How to construct the proximity graph G?

If we simply construct it from the original feature space, e.g. using Euclidean distances between data points, we end up with MDS (Multi-Dimensional Scaling), which is not desirable since it cannot directly preserve manifolds. We can use KNN plus shortest paths as in Isomap, which can easily be hindered by noisy data points (the short-circuit issue). Another good choice is the Gaussian kernel, in which the similarity decays exponentially with distance; this practice is adopted by many methods, including Laplacian Eigenmaps, t-SNE, etc. (I am just too lazy to add any references…)
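For concreteness, a minimal sketch of the Gaussian-kernel proximity graph mentioned above; the bandwidth sigma is a hyperparameter you would normally tune, and the test points are made up:

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    # W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with no self-loops
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
W = gaussian_affinity(X, sigma=1.0)
# Nearby points get similarity near 1; far-away points get similarity near 0,
# which is the exponential decay discussed above.
```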

What if we are given a graph (adjacency matrix) instead of data points in a high-dimensional space? There is something called “structure preserving embedding”, which aims to learn a kernel that best “aligns” with the given graph and treats it as the proximity graph.

Anyway, there are some more considerations in this part which I will not discuss here.

2. How to “reconstruct” G?

Once we have the proximity graph G, we turn to embedding its nodes in a way that preserves the graph structure. We achieve this by “reconstructing” the proximity graph G, directly or indirectly, from the embedding vectors. In MDS, Isomap, Laplacian Eigenmaps, and so on, we “reconstruct” by simply taking the Euclidean distance between two vectors in the embedding space. In MDS and Isomap, the distance is forced to match the weight between a pair of nodes, while in Laplacian Eigenmaps, pairs with larger weights are pushed to have smaller distances. Another choice, less often seen in the traditional graph embedding community but popular in many representation learning methods, is to use the inner product of the two vectors.

3. How to measure the “closeness” between G and G’?

Now that we have a way to “reconstruct” the original graph G, we need to measure the quality of the reconstruction so that it provides a means for learning the embedding vectors. This can be seen as the choice of loss function. L2 loss is commonly used, as in MDS and Isomap. Laplacian Eigenmaps uses the weighted sum of squared embedding distances as the loss. Other losses such as KL-divergence, hinge loss, log loss, etc., can also be adopted in appropriate contexts.
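Putting choices 1-3 together, here is a toy sketch with Laplacian Eigenmaps as the embedding step, assuming a symmetric affinity matrix W is already given; the two-cluster example data is made up:

```python
import numpy as np

def laplacian_eigenmaps(W, dim=2):
    # Minimize sum_ij W_ij ||y_i - y_j||^2 (subject to a scale constraint),
    # solved by the smallest non-trivial eigenvectors of L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]        # skip the trivial constant eigenvector

# Two tight 3-node cliques joined by one weak edge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1
Y = laplacian_eigenmaps(W, dim=1)
# The 1-d embedding places the two clusters on opposite sides,
# since strongly connected pairs are pulled close together.
```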

Other extensions

As I stated at the beginning of the post, the idea is to construct a proximity graph and then embed the data points so as to preserve the structure of that graph. However, we may also learn meaningful low-dimensional representations directly from the data in the original feature space; autoencoders are an example. Incorporating supervision is also a difficult but meaningful extension. And even within the “data -> graph -> embedding” paradigm, we still have other choices for constructing the graph, embedding the nodes, and defining the loss, depending on the real problem.

Notes on Neural Networks Part I – Backpropagation

As far as I know, the story behind deep learning goes like this. Originally there were neural networks with only one hidden layer, which were not powerful enough. To increase their power, one may add more hidden layers; however, with the standard gradient-based learning algorithm, i.e. backpropagation, the gradients vanish quickly through the hidden layers when there is more than one, and due to the non-convexity of the objective, deep neural networks suffer from poor local optima and overfitting. These issues kept neural networks in a downturn for a while. Neural networks seem to have been born with the concept of “feature/representation learning”, but it acted purely in a supervised fashion until 2006, when Hinton et al. proposed unsupervised feature/representation learning to address the critical issues in training deep neural nets. This is called unsupervised pre-training, and it aims to initialize a deep neural network with much better weights than randomly assigned ones. After pre-training, the subsequent supervised learning (called fine-tuning) works much better than supervised learning from scratch. There are mainly two ways to do the pre-training: one uses Restricted Boltzmann Machines, the other stacked autoencoders (however, some recent studies suggest that with rectifier units, deep neural nets can be trained well even without unsupervised pre-training).

Okay, back to the basics, let me scribe the classic Backpropagation learning algorithm for neural networks in here:

Notations:

a_i^k : scalar output of node i at the k-th layer; a^k is an n^k \times 1 vector. Note that a^1 at the 1st layer, i.e. the input layer, is the input vector x_j of the j-th instance, and a^K at the output layer is the prediction, to be compared against the target t.

\delta_i^k : “pseudo-error” for node i at the k-th layer, used for calculating gradients; \delta^k is an n^k \times 1 vector.

W_{ij}^k : weight of the link from node i at the k-th layer to node j at the (k+1)-th layer; W^k is an n^k \times n^{k+1} matrix.

b_j^k : bias from the k-th layer to node j at the (k+1)-th layer; b^k is an n^{k+1} \times 1 vector.

h(x) : element-wise activation function applied to x; usually sigmoid, tanh, or a rectifier.

\eta : learning rate.

x \circ y : element-wise product of two matrices.

Backpropagation with stochastic gradient descent, on the squared error loss and the sigmoid activation function h(x):


Initialization

For each training instance, repeat the following until convergence or a maximum number of iterations is reached:

% forward propagation

For each layer k > 1:

(1)   \begin{equation*} a^k = h( (W^{k-1})^T a^{k-1} + b^{k-1}) \end{equation*}

% Back propagation of “pseudo errors”

For top layer, say the K-th layer,

(2)   \begin{equation*} \delta^K = a^K \circ (\mathbf{1} - a^K) \circ (a^K - t) \end{equation*}

For each k-th hidden layer,

(3)   \begin{equation*} \delta^k = a^k \circ (\mathbf{1} - a^k) \circ ( W^k \delta^{k+1}) \end{equation*}

% Calculate and apply gradients

For the k-th layer (excluding the output layer),

(4)   \begin{equation*} \nabla_{W^k} = a^k (\delta^{k+1})^T \end{equation*}

(5)   \begin{equation*} \nabla_{b^k} = \delta^{k+1} \end{equation*}

(6)   \begin{equation*} W_{new} = W_{old} - \eta \nabla_{W} \end{equation*}

(7)   \begin{equation*} b_{new} = b_{old} - \eta \nabla_{b} \end{equation*}
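The procedure above can be sketched in NumPy as follows. The layer sizes, input, and target are made up for illustration, and the loss is E = 0.5 ||a^K - t||^2 so that equation (2) holds; note that in equation (3) the matrix W^k multiplies \delta^{k+1} directly, mapping it from n^{k+1} back to n^k dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Layer sizes n^1..n^K; W^k is n^k x n^{k+1} and b^k is n^{k+1} x 1, as above.
sizes = [2, 3, 1]
W = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes, sizes[1:])]
b = [np.zeros((n, 1)) for n in sizes[1:]]

def forward(x):
    a = [x]                                   # a^1 is the input
    for Wk, bk in zip(W, b):
        a.append(sigmoid(Wk.T @ a[-1] + bk))  # eq (1)
    return a

def gradients(x, t):
    """Backprop for E = 0.5 * ||a^K - t||^2 with sigmoid units."""
    a = forward(x)
    delta = a[-1] * (1 - a[-1]) * (a[-1] - t)            # eq (2)
    dW = [None] * len(W)
    db = [None] * len(b)
    for k in reversed(range(len(W))):
        dW[k] = a[k] @ delta.T                           # eq (4)
        db[k] = delta                                    # eq (5)
        if k > 0:
            # W^k maps delta^{k+1} (n^{k+1} x 1) back to an n^k x 1 vector
            delta = a[k] * (1 - a[k]) * (W[k] @ delta)
    return dW, db

x = np.array([[0.3], [0.7]])
t = np.array([[1.0]])
dW, db = gradients(x, t)
eta = 0.5
W = [Wk - eta * g for Wk, g in zip(W, dW)]               # eq (6)
b = [bk - eta * g for bk, g in zip(b, db)]               # eq (7)
```

A handy sanity check for any backprop implementation is to compare the analytic gradient against a finite-difference estimate of the loss.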


For general loss functions and activation functions, we only need to replace equations 2 and 3 with:

(8)   \begin{equation*} \delta^K = h'(a^K) \circ \frac{\partial E}{\partial a^K} \end{equation*}

And,

(9)   \begin{equation*} \delta^k = h'(a^k) \circ ( W^k \delta^{k+1}) \end{equation*}

Where h'(a^k) denotes the element-wise derivative of the activation function h, evaluated at the pre-activation that produced a^k (for the sigmoid this is a^k \circ (\mathbf{1} - a^k)), and \frac{\partial E}{\partial a^K} is the element-wise gradient of the error/loss function with respect to the predicted output a^K.

A Brief Discussion of Recommendation Models

(A warning: this is a study post, i.e. notes, not the experience of someone who has done recommendation for years, so please do not take the claims in it on faith.)

What is recommendation? It is estimating how much a user likes an item. In the simplest case, we are given a sparse rating matrix representing users’ scores on items; since the matrix is sparse, the direct goal of recommendation is to fill in its missing values. On top of this simplest setting there are many extensions, e.g. user attributes, item attributes, timestamps, external context, users’ social relations, and other behaviors. Accordingly, a recommendation algorithm/system should be judged not only on speed and accuracy but also on its extensibility: whether it can incorporate different attributes and adapt to new users, new items, and so on.

Recommendation methods generally fall into content-based filtering and collaborative filtering. Content-based filtering starts from the things themselves: if a user sleeps badly, recommend him a sleep aid; very direct and easy to understand. Used on its own it is not very effective, because once there are many items and needs get complicated, such direct correspondences are hard to find. Collaborative filtering is based on user behavior: if user A’s taste in movies is similar to user B’s, then B will probably also love the movies A loves. As things stand, collaborative filtering is more effective and more popular, and content-based filtering is often folded into collaborative filtering anyway.

Collaborative filtering (CF) further divides into memory-based and model-based methods. The former directly uses historical user-behavior data and comes in user-based and item-based variants; the latter puts model assumptions on user behavior, the most common being matrix factorization (MF) and probabilistic matrix factorization (PMF). Each is briefly discussed below.

Memory-Based Model

User-based Recommendation

To predict a user’s rating of an item, the basic idea of this method is: first compute similarities between users (cosine distance or Pearson-r correlation; SVD can be used first for dimensionality reduction) and use them as per-user weights; the prediction is then a weighted average of the other users’ ratings, so that more similar users have more influence. The formula is:

    \[ R_{u,i} = \bar{r_i} + \frac{\sum_{v \in U}{sim(u,v)(R_{v,i}-\bar{r_i})}}{\sum_{v \in U}{|sim(u,v)|}} \]

where \bar{r_i} is the average rating of item i and U is the set of all users.

Item-based Recommendation

Again to predict a user’s rating of an item, the idea is: first compute similarities between items and use them as item weights; the user’s rating of an item is then aggregated from the items the user has rated before, and again more similar items have more influence. The formula is:

    \[ R_{u,i} = \frac{\sum_{j \in R(u)}{sim(i,j)R_{u,j}}}{\sum_{j \in R(u)} |sim(i,j)|} \]

where R(u) is the set of items that user u has rated.

Items are relatively “stable” in a recommender system and change less noticeably, so we can pre-compute the item-item similarities offline. Item-based recommendation is therefore computationally cheaper than user-based recommendation (which needs similarities between users), and the paper “Item-Based Collaborative Filtering Recommendation Algorithms” (Badrul Sarwar et al.) also reports that item-based methods outperform user-based methods for online use.
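A minimal sketch of item-based CF along the lines of the formula above, using cosine similarity between item columns; the toy rating matrix is made up, and missing ratings are simply treated as 0:

```python
import numpy as np

# Toy rating matrix (rows = users, cols = items); 0 marks a missing rating.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_cosine_sim(R):
    """Cosine similarity between item columns (missing ratings count as 0)."""
    norms = np.linalg.norm(R, axis=0)
    S = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)  # an item should not vote for itself
    return S

def predict(R, S, u, i):
    """Similarity-weighted average over the items user u has already rated."""
    rated = R[u] > 0
    w = np.abs(S[i, rated]).sum()
    return (S[i, rated] @ R[u, rated]) / w if w > 0 else 0.0

S = item_cosine_sim(R)
print(round(predict(R, S, 0, 2), 2))  # user 0's predicted rating of item 2: 2.09
```

The item-item similarity matrix S is exactly the piece that can be pre-computed offline.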

Latent Factor Model — Matrix Factorization

The essence of latent factor models is to assume that every user and every item has some latent factors (or topics). For example, take user U_1 = (0.5, 0.1, 0.2) and movie I_1 = (0.5, 0.2, 0.3), where the three dimensions represent responses to action movies, comedies, and historical movies respectively. User U_1 responds strongly to action movies (he loves watching them), and movie I_1 has a high value on the action dimension (the movie contains many action elements). The dot product U_1^T I_1 then gives the expected rating of user U_1 for movie I_1. This linear dot product matches intuition well, and the latent-factor idea is a good reflection of how people actually make decisions, so this is currently the most popular, and a fairly accurate, model for recommendation. For the simplest introduction to latent factor models via matrix factorization, see this article (it explains the idea, modeling, objective, and optimization algorithm of basic SVD in detail).

Incorporating user/item features and social relations

If we only use the rating matrix for modeling, it feels a bit unsatisfying: there is a pile of other information, why not use it? Let’s see: users have attributes such as gender, age, and tags; items have attributes such as categories and tags; users have social relations (following, being followed); and there is implicit feedback, e.g. movies a user watched but did not rate, or the user’s click history. How can we use all of this in a single unified matrix factorization latent factor model? Tianqi Chen of Shanghai Jiao Tong University formalized a feature-based matrix factorization framework (similar things existed before, but apparently not such a general formalization). He models a user’s rating as:

    \[ R_{u,i} = \mu+\left(\sum_{g \in G} \gamma_{g} b_g + \sum_{m \in M} \alpha_m b^u_m + \sum_{n \in N} \beta_n b^i_n\right) +\left(\sum_{m \in M} \alpha_m \mathbf{p}_m\right)^T\left( \sum_{n \in N} \beta_n \mathbf{q}_n\right) \]

For a detailed explanation of the formula, see SVDfeature, or the paper: Feature-Based Matrix Factorization. The idea is fairly straightforward; my understanding is as follows. The non-latent part: features themselves have some direct effect on the rating score, e.g. a user’s average rating over all items, or the average rating an item receives; these give the first two terms of the formula. The latent part: assume a latent factor space of some fixed dimension, shared between users and items (so \mathbf{p} and \mathbf{q} have the same dimensionality), in which every feature can be decomposed. For instance, different users have different “response” distributions over the latent factors, and so do different items; other features, such as a tag or category like “action movie”, are decomposed in the same latent space (the learned responses should be especially large on the one or few factors related to “action”). Then, for a user-item pair, the product of the latent decompositions of its features (the third term of the formula) gives the expected rating from the latent factors. Social relations can simply be treated as another kind of feature and folded into the same formula. There are also more sophisticated methods (Google “recommendation + social relation” yourself), but in essence they do not differ much.
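To make the formula concrete, here is a small sketch that just evaluates the feature-based MF prediction for one user-item pair. All parameters here are random placeholders rather than learned values, and the feature layout (one global feature plus user-id and item-id indicators) is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                       # shared latent dimension
n_g, n_u, n_i = 2, 4, 5     # numbers of global, user-side, item-side features

# Hypothetical "learned" parameters, random here just to exercise the formula.
mu = 3.5
b_g = rng.standard_normal(n_g)           # biases b_g for global features
b_u = rng.standard_normal(n_u)           # biases b^u_m for user-side features
b_i = rng.standard_normal(n_i)           # biases b^i_n for item-side features
P = 0.1 * rng.standard_normal((n_u, k))  # latent vectors p_m, user side
Q = 0.1 * rng.standard_normal((n_i, k))  # latent vectors q_n, item side

def predict(gamma, alpha, beta):
    """R_{u,i} per the formula: mu + bias terms + latent interaction."""
    bias = mu + gamma @ b_g + alpha @ b_u + beta @ b_i
    return bias + (alpha @ P) @ (beta @ Q)

# Example: user 0 rating item 2, with the first global feature active.
gamma = np.array([1.0, 0.0])
alpha = np.zeros(n_u); alpha[0] = 1.0    # user-id indicator feature
beta = np.zeros(n_i); beta[2] = 1.0      # item-id indicator feature
r_hat = predict(gamma, alpha, beta)
```

With pure indicator features this reduces exactly to biased matrix factorization; richer alpha/beta vectors (tags, implicit feedback, social relations) reuse the same machinery.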

Latent factor models based on matrix factorization almost all use a single layer of latent factors, and most can be expressed by modifying the model of R_{u,i} above; the resulting formulas all look similar. The neural-network-based latent factor model mentioned below is expressed quite differently. That said, I have not seen models with multiple layers of latent factors either.

Optimization objective

Usually an L2 objective is used:

    \[ Loss = \sum_{u,i}(r_{u,i} - \hat{r}_{u,i})^2 + regularization \]

Note that the objective is non-convex jointly in the user and item factors (though convex in each with the other fixed); taking derivatives and running gradient descent or a quasi-Newton method still optimizes it well in practice.
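A minimal sketch of basic matrix factorization trained with SGD on the L2 objective above; the ratings, latent dimension, and hyperparameters are made up, and the loop runs over rating instances rather than the full matrix, in line with the tips further down:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ratings as (user, item, rating) instances -- we loop over these,
# never over the full (and mostly missing) rating matrix.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 2, 5), (2, 3, 4), (3, 1, 1), (3, 3, 5)]
n_users, n_items, k = 4, 4, 2

P = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
lr, reg = 0.05, 0.02

for epoch in range(300):
    for u, i, r in ratings:
        pu, qi = P[u].copy(), Q[i].copy()
        err = r - pu @ qi
        P[u] += lr * (err * qi - reg * pu)    # SGD step on the L2 loss
        Q[i] += lr * (err * pu - reg * qi)

rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]))
```

The per-instance updates are the plain gradients of (r_{u,i} - p_u^T q_i)^2 plus the L2 regularizer; caching pu/qi keeps the two updates consistent within one step.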

References

There are many papers on latent factor models, with all sorts of variants. A few that are worth reading:

  • Y. Koren, R. M. Bell, C. Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer.
  • Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. KDD 2008.

In addition, the technical reports of the Netflix Prize winners and of the KDD Cup 2011 and KDD Cup 2012 winners are worth reading.

Latent Factor Model — Neural Network

The main representative work is a two-layer neural network (one layer of latent factors) built with RBMs; see the paper: Restricted Boltzmann Machines for Collaborative Filtering, which is also fairly straightforward.

Interestingly, at the end of the paper the authors write that they would try deep networks for recommendation; however, the paper was published years ago and no deep-network follow-up has appeared, so presumably the results were not particularly good. Intuitively, though, latent factors themselves have structure. Among action movies there are Jackie Chan action movies and Bruce Lee action movies; such hierarchical structure, in this and other genres, seems like something a deep neural network could capture well, so this blank spot genuinely surprises me.

Tips for computational efficiency

  • The rating matrix is usually sparse and large, which can be very costly in time and space (just looping over and assigning every entry of a 10,000 x 10,000 matrix takes close to a minute on my 2.4 GHz laptop; memory-wise, with 8-byte ints on a 64-bit system, that is about 800 MB). One optimization is therefore to store it as a sparse matrix.
  • For large matrices, do not casually use dot products, full traversals, and similar operations; the complexity is too high. In particular, during training and inference do not make the cost MN-dependent (M and N being the numbers of users and items); make it instance-dependent (dependent on the number of training/test instances, i.e. the number of ratings). Concretely, do not write for-loops over the rating matrix in your code; loop over the instances.
  • Use vectorized programming to reduce for-loops; it is worth looking into.
  • If the training data is large, use SGD when computing gradients: split the data into batches and update the parameters after each batch as one iteration, instead of accumulating gradients over all training samples before updating.
  • Quasi-Newton methods such as L-BFGS can also replace plain gradient descent and converge faster.
  • On the choice of language: I picked Python for convenience, and later found that unoptimized Python is really unsuited to this kind of compute-intensive work. On plain for-loops alone, unoptimized Python is tens of times slower than C++, and in matrix operations and a few other places it can be hundreds of times slower. What does that mean in practice? A program quickly written in Python takes two or three hours per run, whereas the C/C++ version may take longer to write but runs in a dozen minutes or less. If you need to tune parameters and try different models, that gap is terrifying.

Software

  • SVDfeature. Extremely fast, not only because it is written in C++ but also because the author optimized it carefully. The author has used this tool to score well in several competitions.
  • PREA. Written in Java and implements many algorithms, though likewise slower than C/C++. Provides a benchmark. The code structure is simple, which makes it good for studying the algorithms.
  • MyMediaLite. Written in C# and also implements quite a few algorithms; provides benchmarks on several datasets.

(Note that this post mostly evaluates recommendation algorithms by RMSE minimization, whereas in practice evaluation may be rank-based or involve other metrics. So do not simply conclude that the algorithms with the prettiest RMSE numbers are the most practical ones; a concrete problem calls for its own evaluation metrics and algorithm tuning.)