My attempt at building the gears of a graph convolutional network (GCN) from the ground up.
Through the process, Jonathan Hui's blog post and Inneke Mayachita's explainer were the gold-standards from which I learned.
This is not meant to be a zero-to-one primer. It's more my notes on GCNs, condensed into an explainer for my own comprehension (Feynman Technique).
Just like their namesake, convolutional networks, GCNs are built from a few core classes and operations.
Graph networks require a graph structure as input. Adjacency matrices are a convenient way to represent this: a cell with a '1' means the two nodes are connected, a '0' means they aren't. In the image, Node 0 shares an edge with Nodes 1, 2, and 3.
Edge indices are another convenient representation of a graph: a list of node-node connections. For example, [[1,0], [2,0], [2,3]] describes a graph where Node 0 shares an edge with Nodes 1 and 2, and Nodes 2 and 3 also share an edge.
Edge indices are often more efficient than a full adjacency matrix, which grows as N^2 (where N is the number of nodes).
However, in this case I chose the matrix representation because the graph has only 34 nodes, and it simplified many of the calculations.
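As a quick sketch of how the two representations relate (using NumPy and a made-up 4-node graph, not the actual Karate Club data):

```python
import numpy as np

# A hypothetical 4-node graph: Node 0 shares an edge with Nodes 1, 2, and 3,
# and no other edges exist.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])

# The equivalent edge index: one [source, target] pair per nonzero cell.
edge_index = np.argwhere(A == 1)
print(edge_index)  # [[0 1] [0 2] [0 3] [1 0] [2 0] [3 0]]
```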
Just as NLP involves tokens with word embeddings, each node in a graph network has its own feature vector. (So does each edge, but we'll ignore edges for now).
The feature matrix is simply a matrix stacking every node's feature vector, so it is an N x feature_dim matrix, where N is the number of nodes and feature_dim is the number of features in your embedding vector.
This carries information about each node. More on this in a bit.
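Sticking with the toy 4-node graph from above and some made-up 2-dimensional features:

```python
import numpy as np

# Hypothetical feature vectors for the toy graph, stacked into an
# N x feature_dim matrix (here 4 x 2).
X = np.array([
    [1.0, 0.0],  # Node 0
    [0.0, 1.0],  # Node 1
    [2.0, 2.0],  # Node 2
    [3.0, 1.0],  # Node 3
])
print(X.shape)  # (4, 2)
```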
Naively, you might multiply the adjacency matrix (A) and feature matrix (X) to somehow combine the information in the graph connections with the information present in each node.
What happens when you perform the matmul? Well...
AX turns out to sum all the neighboring feature vectors for each node! e.g. AX[0] = x1 + x2 + x3, the sum of the features of Nodes 1, 2, and 3. This is pretty good if our goal is to aggregate information from the graph. But there's a problem: if you go back to the adjacency matrix, you'll notice that all the diagonals are 0s.
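A minimal sketch of that aggregation, reusing the toy graph and made-up features from the snippets above:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])

AX = A @ X
print(AX[0])    # [5. 4.] == X[1] + X[2] + X[3], the sum of Node 0's neighbors
print(A[0, 0])  # 0 -> Node 0's own features never enter the sum
```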
For each node, AX doesn't take into account the features of the node itself. We can resolve this with self-loops, essentially making each node connect to itself.
In the adjacency matrix, that just means adding 1s along the diagonal: A' = A + I. The resulting matrix and product are:
You can see that A'X now takes each node's own features into account when performing the sum.
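Continuing the same toy sketch:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])

# Self-loops: add the identity so every node also aggregates its own features.
A_self = A + np.eye(4)
print((A_self @ X)[0])  # [6. 4.] == X[0] + X[1] + X[2] + X[3]
```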
There's another, more subtle problem with the current implementation. If you look at the product AX, the rows aren't normalized. What does this mean? Well, if we keep applying the same operation to the graph, the feature magnitudes keep growing or shrinking and we'll encounter vanishing/exploding gradients. As with regular neural networks, we need to normalize to prevent numerical instabilities.
We can normalize our GCN by finding the Degree Matrix (D), a diagonal matrix whose entry D[i][i] is the number of edges attached to node i, and multiplying the inverse of D with our adjacency matrix (A). This turns each neighbor sum into an average.
Comparing against the same mat mul as before, you can see that now the product is normalized.
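On the toy graph (with self-loops included in the degree count), that looks like:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
A_self = A + np.eye(4)

# Degree matrix: D[i, i] counts the edges attached to node i (self-loop included).
D = np.diag(A_self.sum(axis=1))

# D^-1 A_self averages neighbor features instead of summing them.
print((np.linalg.inv(D) @ A_self @ X)[0])  # [1.5 1.] == mean of X[0], X[1], X[2], X[3]
```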
Without diving into too much of the math (you can do that here), Kipf and Welling showed that the ideal normalization actually takes a symmetric form: D^(-1/2) A D^(-1/2).
Now the output is:
That's all for normalization.
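For completeness, the symmetric version on the same toy graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
A_self = A + np.eye(4)

# Symmetric normalization: D^(-1/2) A_self D^(-1/2) weights each edge by
# 1 / sqrt(deg(i) * deg(j)) instead of 1 / deg(i).
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_self.sum(axis=1)))
A_hat = D_inv_sqrt @ A_self @ D_inv_sqrt
print(A_hat @ X)
```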
Putting everything together, a single layer of convolution across the entire graph takes the form:
Where D is the degree matrix, A is the adjacency matrix with self-loops, H^l is the feature matrix at layer l (H^0 is just X, the initial feature matrix), and W^l is the weight matrix of the neural network layer that processes the result of D^(-1/2) A D^(-1/2) H^l.
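A minimal sketch of one such layer (the function name, the ReLU choice, and the layer stacking below are my assumptions, not necessarily how the original code is organized):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^(-1/2) (A + I) D^(-1/2) H W)."""
    A_self = A + np.eye(A.shape[0])                           # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_self.sum(axis=1)))   # D^(-1/2)
    A_hat = D_inv_sqrt @ A_self @ D_inv_sqrt                  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0)                       # ReLU activation

# Stacking layers, e.g. for an N-node graph:
# H1 = gcn_layer(A, X, W0)    # (N, feature_dim) -> (N, hidden_dim)
# H2 = gcn_layer(A, H1, W1)   # (N, hidden_dim)  -> (N, num_classes)
```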
I trained a model on Zachary's Karate Club, a classic graph of 34 people involved in a karate club. The model was evaluated on a binary classification task: determining whether a club member would join Mr. Hi's club (red) or the Administrator's (blue). Node 0 represents Mr. Hi, Node 33 represents the Administrator, and a given club member's connections to other people are what determine their loyalty.
The loss was cross entropy between the ground-truth labels and the predicted probabilities for the binary classification.
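Purely as a sketch of that setup, here's what a two-layer GCN trained this way might look like in PyTorch (the placeholder data, hidden size, optimizer, and epoch count are my guesses, not the actual training code):

```python
import torch
import torch.nn.functional as F

N = 34
X = torch.eye(N)                            # hypothetical one-hot node features
A_hat = torch.eye(N)                        # placeholder: use the normalized Karate Club adjacency
labels = torch.zeros(N, dtype=torch.long)   # placeholder: real faction labels come from the dataset

W0 = torch.randn(N, 16, requires_grad=True)
W1 = torch.randn(16, 2, requires_grad=True)
optimizer = torch.optim.Adam([W0, W1], lr=0.01)

for epoch in range(200):
    H1 = torch.relu(A_hat @ X @ W0)         # first graph convolution
    logits = A_hat @ H1 @ W1                # second graph convolution -> 2 class logits
    loss = F.cross_entropy(logits, labels)  # cross entropy against the faction labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```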
Before training:
After training:
As the model trained, the nodes' learned embeddings formed clear clusters, which shows that the model effectively learned this binary classification task.
That's all for now. Thanks for reading if you made it this far! If you want to learn more, I highly recommend this Graph Neural Network Primer in addition to both posts I mentioned above (Jonathan, Inneke).