In my point of view, the traditional graph embedding problem can be summarized as follows:
Construct a proximity graph G on data points, then embed nodes (data points) into some K-dimensional space, such that the reconstructed G’ is close to original G in some sense.
Before going into some details, let’s first justify why such an idea is useful. Given data points in some high dimensional feature space, it is pretty difficult to separate them directly (assuming you have some ML experience). However, we may observe that in many real data that we care about, although the feature dimension is high, most data actually lay on some low dimensional manifold. The following figure is an example, although data are spreading over 2-d space, but they are mainly organized in two 1-d non-linear subspaces (circle-shaped manifolds). Building the proximity graphs over G can help us find such low-dimensional non-linear manifolds. For example, the two clustered found by KMeans in the figure are pretty messed-up (which means original 2-d space representation of data doesn’t preserve the manifold directly), but if we use 1NN of each data points, we may find that data points in the inner circle and data in the outer circle are completely disconnected, i.e. they have zero similarity, thus we have a better chance to separate them. Now that we have constructed a proximity graph which captures the intrinsic manifolds in data, so we may use them as features for all kinds of tasks, such as clustering, right? No. Because this graph G is of the size of data points, which is usually large and also sparse, directly use neighbors of a node as its features may not be a good idea. A practical idea is to embed nodes (data points) into some lower-dimensional space, such that the proximity in original graph G is preserved, which will be measured by the “closeness” of original proximity graph G and reconstructed graph G’.
Given that we have justified why the graph embedding task is meaningful, there are several essential design choices that one has to make before we can implement it, which lead to different concrete methods. Those questions are:
1. How to construct the proximity graph G?
If we simply construct it using original feature space, such as Euclidean distance of data points, we end up with MDS (Multi-Dimensional Scaling), which is not desirable since this cannot directly preserve manifolds. We can use KNN plus shortest path as in Isomap, which may be easily hindered by noisy data points (short circuited issue). Another good choice is Gaussian kernel, in which the similarity diverges exponentially, this practice is adopted by many methods, including Laplacian Eigenmaps, t-SNE, etc.. — I am just too lazy to add any references…
What if we are given a graph (adjacency matrix) instead of data points in high dimensional space? There is something called “structure preserving embedding”, which aims to learn a kernel that best “align” with given graph, and treat it as proximity graph.
Anyway, there are some more considerations in this part which I will not discuss here.
2. How to “reconstruct” G?
Once we have the proximity graph G, we now turn to embed nodes in the graph that preserve the graph structure. We achieve this by “reconstruct” proximity graph G directly or indirectly using embedding vectors. In MDS, Isomap, Laplacian Eigenmap, and so on, we “reconstruct” by simply adopting Euclidean distance between two vectors in embedding space. In MDS and Isomap, the distance is forced to match the weight between a pair of nodes, but in Laplacian Eigenmap, the distances are forced to align with weights using inner product. A less often seen in the community of traditional graph embedding, but popular in many representation learning methods choice is to use inner product of two vectors.
3. How to measure the “closeness” between G and G’?
Now that we have a way to “reconstruct” the original graph G, we need to measure the quality of such reconstruction such that it can provide a mean for learning those embedding vectors. This can be seen as the choice of loss function. Well, L2 loss are commonly seen, such as in MDS, Isomap. As for Laplacian Eigenmap, the alignment loss by inner product is used. Other loss such as KL-divergence, hinge-loss, log-loss, etc., can also be adopted in appropriate contexts.
As I stated in the beginning of the post, the idea is to construct a proximity graph and embedding data points that preserve structures of the proximity graph. However, we may also learn meaningful low dimensional representations directly from data in original feature space. Auto-encoder is an example. Incorporating supervision is also a difficult but meaningful extension. And even with the “data->graph->embedding” idea, we still have other choices of constructing the graph, embedding nodes, and defining the loss based on real problems.