This is code for computing the matching index much faster. This is of particular relevance for people using generative network models.
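The matching index, as typically used, quantifies the similarity of two nodes' connectivity profiles, and is usually understood as a normalised measure of the overlap in two nodes' neighbourhoods* (note that everything discussed here applies only to undirected, unweighted networks). The matching index is defined as

$$m_{ij} = \frac{|N_{i \setminus j} \cap N_{j \setminus i}|}{|N_{i \setminus j} \cup N_{j \setminus i}|} \qquad (1)$$

where $N_{i \setminus j}$ is the set of neighbours of node $i$, excluding node $j$.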
* See the section below discussing how the matching index may not measure precisely what we usually think it does...
The matching index has often been found to give the best performance among topological generative network models, meaning that it is the model people are most interested in running.
HOWEVER
It takes agesssssssssss to run (in my experience I could run every single other topological model in the time it takes to run the matching model).
This likely makes you sad 😢
But what if I told you there was a way to make it better...
To understand why it can be made faster, we need to talk more maths (sorry).
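We can also calculate the matching index as

$$m_{ij} = \frac{2\,|N_{i \setminus j} \cap N_{j \setminus i}|}{k_i + k_j - 2A_{ij}} \qquad (2)$$

which is just the number of nodes that both $i$ and $j$ are connected to (the shared neighbours, $|N_{i \setminus j} \cap N_{j \setminus i}|$) multiplied by two, divided by the sum of their degrees $k_i$ and $k_j$, ignoring any connection between the two nodes themselves ($A_{ij}$ is the corresponding entry of the adjacency matrix).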
When written this way it is trivial* to see how the calculation can be done programmatically. If you look at the original code provided in the BCT, you'll notice it actually calculates the index as per Equation 2 and not Equation 1. However, it loops over all the nodes when performing its calculation. We can forgo loops entirely when calculating the matching index, which gives a considerable speed-up.
* I've always wanted to say this haha. It is trivial because we can take advantage of matrix operations to compute this value: the number of shared neighbours for every pair of nodes can be computed by taking the square of the adjacency matrix, and the degree of every node can be obtained by just summing its rows. Stack this vector
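To make the matrix trick concrete, here is a minimal sketch (not the code shipped in this repository) on a made-up five-node network. It assumes the index takes the form 2 × shared neighbours / (k_i + k_j − 2A_ij), and it checks the loop-free result against a plain double loop that I wrote purely for comparison. All variable names are illustrative.

```matlab
% Toy undirected, unweighted network (made up for illustration)
A = [0 1 1 0 0;
     1 0 1 1 0;
     1 1 0 1 0;
     0 1 1 0 1;
     0 0 0 1 0];
n = size(A,1);

% Loop-free version
shared = A*A;                                   % shared(i,j) = number of neighbours i and j have in common
deg = sum(A,2);                                 % degree of every node (column vector)
degsum = repmat(deg,1,n) + repmat(deg',n,1);    % degsum(i,j) = k_i + k_j
denom = degsum - 2*A;                           % ignore any direct i-j connection
M_fast = 2*shared./denom;
M_fast(denom == 0) = 0;                         % pairs with no other connections get 0 instead of NaN
M_fast(1:n+1:end) = 0;                          % a node is not compared with itself

% Plain double loop over node pairs, for comparison only
M_loop = zeros(n);
for i = 1:n
    for j = 1:n
        if i ~= j
            others = true(1,n);
            others([i j]) = false;              % exclude i and j themselves
            nShared = sum(A(i,others) & A(j,others));
            nTotal = sum(A(i,others)) + sum(A(j,others));
            if nTotal > 0
                M_loop(i,j) = 2*nShared/nTotal;
            end
        end
    end
end

maxDiff = max(abs(M_fast(:) - M_loop(:)));      % should be zero up to rounding
```

In this toy network, nodes 1 and 4 share two neighbours and have five connections between them (none to each other), so `M_fast(1,4)` is 2 × 2 / 5 = 0.8.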
First, let's compare calculating the matching index on networks with different numbers of nodes:
You can see that the old way takes significantly longer to compute the index than the new way, and the gap only grows as the network gets larger! So the benefits of using the new code increase considerably as a network gains more nodes.
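If you want a quick feel for this kind of comparison without running the full scripts, the sketch below times a plain double loop against the matrix version on random networks. The network sizes, the density, and the loop itself are my own arbitrary choices for illustration (this is not the BCT implementation or the benchmark behind the figure), so the timings will not match the ones reported here.

```matlab
sizes = [50 100 200];      % numbers of nodes to try (arbitrary)
density = 0.1;             % connection probability (arbitrary)
tLoop = zeros(size(sizes));
tFast = zeros(size(sizes));
maxDiff = 0;

for s = 1:numel(sizes)
    n = sizes(s);
    % Random undirected, unweighted network without self-connections
    A = triu(double(rand(n) < density), 1);
    A = A + A';

    % Plain double loop over node pairs
    t = tic;
    M_loop = zeros(n);
    for i = 1:n-1
        for j = i+1:n
            others = true(1,n);
            others([i j]) = false;
            nShared = sum(A(i,others) & A(j,others));
            nTotal = sum(A(i,others)) + sum(A(j,others));
            if nTotal > 0
                M_loop(i,j) = 2*nShared/nTotal;
            end
        end
    end
    M_loop = M_loop + M_loop';
    tLoop(s) = toc(t);

    % Matrix version
    t = tic;
    deg = sum(A,2);
    denom = repmat(deg,1,n) + repmat(deg',n,1) - 2*A;
    M_fast = 2*(A*A)./denom;
    M_fast(denom == 0) = 0;
    M_fast(1:n+1:end) = 0;
    tFast(s) = toc(t);

    % Both versions should agree up to rounding
    maxDiff = max(maxDiff, max(abs(M_loop(:) - M_fast(:))));
    fprintf('n = %d: loop %.4f s, matrix %.4f s\n', n, tLoop(s), tFast(s));
end
```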
An important thing to note about generative network models is that they add edges iteratively. As I alluded to before, the new code benefits largely because it can calculate everything in one hit instead of needing to loop over all the nodes. In the old generative model code, at each iteration the loop that calculates the matching index only runs over the nodes that will be affected by the newly added edge, i.e., the loop very likely doesn't need to run over all nodes. Because of this, we won't see the same order-of-magnitude improvement. But what improvement do we see? First, let's just run the matching index model on a network I used in my paper. I generated 100 different models with the old and new code and compared the time it takes to compute them (and also whether they return a similar result):
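To illustrate why only some nodes are affected: adding an edge between nodes u and v changes only the degrees of u and v and the neighbour sets of u and v, so only rows (and columns) u and v of the matching matrix can change. The sketch below is my own illustration of that idea, not the update used in the BCT or in this repository. It updates just those two rows after adding an edge and checks the result against a full recalculation, assuming the 2 × shared / (k_i + k_j − 2A_ij) form of the index.

```matlab
% Toy undirected, unweighted network (made up for illustration)
A = [0 1 1 0 0;
     1 0 1 1 0;
     1 1 0 1 0;
     0 1 1 0 1;
     0 0 0 1 0];
n = size(A,1);

% Matching index before the new edge (full matrix calculation)
deg = sum(A,2);
denom = repmat(deg,1,n) + repmat(deg',n,1) - 2*A;
M = 2*(A*A)./denom;
M(denom == 0) = 0;
M(1:n+1:end) = 0;

% Add an edge between nodes u and v
u = 1; v = 5;
A(u,v) = 1; A(v,u) = 1;
deg = sum(A,2);

% Update only the rows/columns belonging to u and v
for w = [u v]
    sharedW = A(w,:)*A;                      % neighbours w shares with every other node
    denomW = deg(w) + deg' - 2*A(w,:);       % k_w + k_j - 2*A(w,j) for every j
    newRow = 2*sharedW./denomW;
    newRow(denomW == 0) = 0;
    newRow(w) = 0;                           % no self-matching
    M(w,:) = newRow;
    M(:,w) = newRow';
end

% Full recalculation on the updated network, for checking
denom = repmat(deg,1,n) + repmat(deg',n,1) - 2*A;
M_full = 2*(A*A)./denom;
M_full(denom == 0) = 0;
M_full(1:n+1:end) = 0;

maxDiff = max(abs(M(:) - M_full(:)));        % should be zero up to rounding
```

After the new edge, nodes 1 and 4 both connect to nodes 2, 3 and 5, so `M(1,4)` becomes 1.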
A fourfold speed-up is pretty good! You can also see that the result (as determined by model fit AKA the energy function) is the same.
Ok, how does this change as a function of the size of the network and the number of edges being requested?
To see how the speed of the code changes under different node counts and edge counts, we can exploit the fact that if you run a generative model for X edges, you will also have generated networks of 1 to X-1 edges, as each iteration is technically creating a new network using the network from the previous iteration as its seed. If we record the time each iteration takes, we can see how the improvement varies:
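As a sketch of how per-iteration timing can be recorded, the toy model below adds one edge per iteration and stores the time each iteration takes. It is deliberately simplified and is not the generative model code in this repository: there is no distance term, the exponent `eta` and the small constant `epsilon` are hypothetical values I picked, and the network size and edge count are arbitrary.

```matlab
n = 40;            % number of nodes (arbitrary)
m = 60;            % number of edges to generate (arbitrary)
eta = 2;           % hypothetical exponent on the matching index
epsilon = 1e-5;    % small constant so pairs with zero matching can still be chosen

A = zeros(n);
[r, c] = find(triu(ones(n), 1));     % every candidate node pair (upper triangle)
idx = sub2ind([n n], r, c);
iterTime = zeros(m, 1);

for e = 1:m
    t = tic;
    % Matching index of the current network, computed without loops
    deg = sum(A, 2);
    denom = repmat(deg, 1, n) + repmat(deg', n, 1) - 2*A;
    M = 2*(A*A)./denom;
    M(denom == 0) = 0;

    % Unnormalised connection probabilities for every candidate pair
    P = (M(idx) + epsilon).^eta;
    P(A(idx) == 1) = 0;              % existing edges cannot be added again

    % Draw one pair with probability proportional to P and connect it
    pick = find(rand*sum(P) <= cumsum(P), 1);
    A(r(pick), c(pick)) = 1;
    A(c(pick), r(pick)) = 1;
    iterTime(e) = toc(t);            % time taken to go from e-1 to e edges
end

cumTime = cumsum(iterTime);          % cumTime(e) = time to generate a network of e edges
```

Plotting `cumTime` against `1:m` then shows how long a network of each edge count would have taken to generate, from a single run.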
Here we can clearly see that as more edges need to be made, the code slows down, but depending on the number of nodes it doesn't slow down at the same rate. The speed-up also appears to be maximal for a network of 200-250 nodes:
I thought this might be a function of network density, so I went to two extremes. First, I generated all 4950 possible edges for a network of size 100:
Then I generated all 124750 edges for a network of size 500:
The speed relative to the old code seems to vary approximately with the requested density (the advantage shrinks as a denser network is requested) more so than with the raw number of edges requested (although that still seems to have an effect). I am not completely sure why the improvement lessens over time, but I think it might have something to do with having to index more and more nodes on later iterations. If anyone has any ideas, I would be interested to know! But putting this curious coding quirk aside, the new version is faster, particularly at the network scale generative network models tend to be used at.
For the additive model, the speed-up is not quite as drastic. This is because each step involves an additional normalisation which slows things down.
Easy! Just run the scripts I wrote to demonstrate this in MATLAB:
matchingSpeedTest.m
additiveSpeedTest.m
It takes well over 4 hours to run everything on an i7 6700k FYI
The matching index is commonly considered a normalised measure of the overlap of two nodes' neighbourhoods. Conceptually, we would understand this as meaning the number of shared neighbours divided by the total unique neighbours, as the original mathematical definition shown in Equation 1 suggests.
Consider the network below:
Node
We would technically be incorrect however, or rather we would have a different answer to what the code (both old and new) provides.
As mentioned above, the matching index is the similarity in the connectivity profiles of two nodes. This means the matching index is actually calculated as the number of connections a pair of nodes have to the same neighbours, over the total number of connections those nodes have. So in the example above, nodes
I would say that this definition isn't exactly consistent with what we would expect from Equation 1 (in my opinion), but it is exactly how Equation 2 is done (and how it is done in both the new and old code by default). I would argue that this definition/conceptualisation (which I shall call the connectivity profiles definition) isn't the most intuitive. We can change Equation 2 to be more consistent with the intuitive conceptualisation (which I shall call the normalised overlapping neighbourhood definition) by doing the following
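As a sketch of what this change looks like, assume Equation 2 has the form $2c_{ij}/(k_i + k_j - 2A_{ij})$, where $c_{ij}$ (shorthand introduced here) is the number of neighbours nodes $i$ and $j$ share, $k_i$ is the degree of node $i$, and $A$ is the adjacency matrix. The number of unique neighbours is the sum of the two neighbourhood sizes minus the neighbours that were counted twice, so the shared neighbours are subtracted from the denominator and the factor of two is dropped from the numerator:

$$m_{ij}^{\mathrm{overlap}} = \frac{c_{ij}}{k_i + k_j - 2A_{ij} - c_{ij}}$$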
* It might make more sense to think of this in terms of a connectivity matrix. Each row/column corresponds to a node, and forms a vector indicating which other nodes it is connected to. If you compare any two rows/columns, every position where they both have a one indicates a shared neighbour. This measure is also very similar to the Jaccard index.
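To see the two definitions side by side, here is a minimal sketch on a made-up five-node network. It is not the `matching.m` function from this repository. It assumes the connectivity profiles definition is 2 × shared / (k_i + k_j − 2A_ij) and the normalised overlapping neighbourhood definition is shared / (k_i + k_j − 2A_ij − shared), and it also computes the Jaccard index of two rows directly for comparison.

```matlab
% Toy undirected, unweighted network (made up for illustration)
A = [0 1 1 0 0;
     1 0 1 1 0;
     1 1 0 1 0;
     0 1 1 0 1;
     0 0 0 1 0];
n = size(A,1);

shared = A*A;                                          % shared neighbours for every pair
deg = sum(A,2);
total = repmat(deg,1,n) + repmat(deg',n,1) - 2*A;      % connections of i and j, ignoring i-j itself

% Connectivity profiles definition
M_profile = 2*shared./total;
M_profile(total == 0) = 0;
M_profile(1:n+1:end) = 0;

% Normalised overlapping neighbourhood definition
M_overlap = shared./(total - shared);
M_overlap(total - shared == 0) = 0;
M_overlap(1:n+1:end) = 0;

% Jaccard index of the connectivity profiles of nodes 1 and 4,
% computed directly from their rows (excluding nodes 1 and 4 themselves)
i = 1; j = 4;
others = true(1,n);
others([i j]) = false;
jaccard = sum(A(i,others) & A(j,others)) / sum(A(i,others) | A(j,others));
```

Nodes 1 and 4 share two neighbours (2 and 3) out of three unique ones (2, 3 and 5), so `M_overlap(1,4)` and `jaccard` are both 2/3, whereas `M_profile(1,4)` is 4/5 = 0.8.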
No. They will give different results, as I showed above, but so long as the same calculation is used throughout the analysis, it should be ok (and to clarify, if you have been using generative models to date, you have almost certainly been implementing the connectivity profiles definition). The two measures are almost perfectly correlated and show a clear monotonic (though not perfectly linear) relationship:
As you might expect, the two measures are mathematically related. The connectivity profiles definition i.e., Equation 2, can be multiplied by
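Working through the algebra as a sketch, let $c_{ij}$ be the number of shared neighbours and $s_{ij} = k_i + k_j - 2A_{ij}$ (both shorthand introduced here), and assume the connectivity profiles definition is $m^{\mathrm{profile}}_{ij} = 2c_{ij}/s_{ij}$ while the normalised overlapping neighbourhood definition is $m^{\mathrm{overlap}}_{ij} = c_{ij}/(s_{ij} - c_{ij})$. Then

$$m^{\mathrm{overlap}}_{ij} = \frac{c_{ij}}{s_{ij} - c_{ij}} = \frac{2c_{ij}/s_{ij}}{2 - 2c_{ij}/s_{ij}} = \frac{m^{\mathrm{profile}}_{ij}}{2 - m^{\mathrm{profile}}_{ij}}$$

So, under these assumptions, multiplying the connectivity profiles value by $1/(2 - m^{\mathrm{profile}}_{ij})$ gives the normalised overlapping neighbourhood value. This mapping takes 0 to 0 and 1 to 1 and is strictly increasing in between, which is why the two measures are monotonically (but not linearly) related.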
The different definitions may affect how you discuss this measure though. The code I provided uses the connectivity profiles definition by default, but also allows the normalised overlapping neighbourhood definition to be used (note this is only available in the matching.m function; at present, all the generative modelling functions can only use the connectivity profiles formulation).
I would like to incorporate these new ways of computing the matching index into my own code, is there an easy way to do this?
Good news! I have written code which allows you to do this! The inputs and outputs should be very similar to what the BCT/Betzel implementation used (and is in a similar format to the code I wrote for my paper). They aren't completely plug-and-play from what is provided in the BCT, but they should be easy to adapt.
I have it for the multiplicative and additive formulation of the generative network model. See the scripts matchingSpeedTest.m and additiveSpeedTest.m for examples of its use.
Thank you, I appreciate it.
This code is built off of Betzel 2016 and my own paper; you can (and probably should) also reference this GitHub repository.
You can email me at stuart.oldham@mcri.edu.au
Ha no. I leave it as an exercise for the reader to figure that one out.
Image for the social preview is by starline on Freepik