SimRank: Similarity Analysis Explanation and Python Implementation from Scratch | by Chonyy | Jan, 2021
Similarity Measurement of Websites
Measuring similarity is a problem needed in all kinds of fields. And SimRank is an intuitive and general approach in the similarity measure. It is applicable in any domain with object-to-object relationships, measuring the similarity of an object based on the relationship with other objects.
The key of SimRank is
Two objects are considered to be similar if they are referenced by similar objects.
We will briefly introduce the algorithm and walkthrough the Python implementation from scratch.
Feel free to check out the well-commented source code. It could really help to understand the whole algorithm.
The algorithm steps are listed below
- Initialize the SimRank of every pair of the nodes following
if(node1 == node2):
SimRank(node1, node2) = 1
SimRank(node1, node2) = 0
- For each iteration, update the SimRank of every pair of nodes in the graph
- If both nodes are the same, SimRank(a, b) = 1
- If one of the nodes has no in-neighbors, SimRank(a,b) = 0
- Else, the new SimRank follows the equation
- We calculate the new SimRank based on the SimRank from the previous iteration (Defined recursively but computed iteratively)
We initialize the SimRank in the Similarity class constructor. Please note that the initial values are stored in old_sim. We retain the new_sim to save the updated SimRank from the next iteration.
The initializing rule is just like what we mentioned above. SimRank = 1 if both are the same, else SimRank = 0.
SimRank one iteration
This is the SimRank main function. We iterate every pair of nodes in the graph and update the SimRank. After getting all the new SimRank values, we replace the old values with the values from the current iteration.
Calculate new SimRank
- If both nodes are the same, value = 1
- If one of the nodes has no in-neighbor, value = 0
- SimRank_sum = the sum of SimRank value of all in-neighbor pairs (SimRank value is from the previous iteration)
- Calculate the scale with decay factor
- Calculate the new SimRank with SimRank_sum and scale
Take the calculated new SimRank value and assign it to new_sim.
Replace the value from the previous iteration with the values from the current iteration.
Let’s test our implementation on the dataset in the repo. We set decay_factor = 0.9 in all the results.
The result follows the order of node value, which is
1, 2, 3, 4, 5, 6 on the row and
1, 2, 3, 4, 5, 6 on the column. From the matrix, we could see that the diagonal is always full of 1s. The SimRank of two same nodes is always 1.
Recall the SimRank equation explained above, having a common parent or not matters a lot in the calculation. From the graph, we could see that there’s no pair of nodes have a common parent. Thus, all the SimRank is 0.
Similarly, none of the nodes has a common parent. So the SimRank are all 0.
From the result matrix, there’s one interesting observation.
SimRank = SimRank
Let’s take a look at the pairs that SimRank != 0. (node1, node3), (node2, node4) all have a common parent, and that makes their SimRank not equals to zero.
Let’s look at pair (node6, node4), (node7, node4). The common node in these two pairs is node4. Node4 has two parents, node1 and node5.
- Node1 is the only in-neighbor of node7
- Node5 is the only in-neighbor of node6
As you can see, this kind of relationship makes them got the same SimRank = 0.695.
The result follows the node value order, which is
2076, 2564, 4785, 5016, 5793, 6338, 6395, 9484, 9994 in row and column respectively.
You might be wondering why (node4785, node5016) has a single and common parent, but the SimRank is not 1. This is why we need the decay factor, to separate the difference between extremely high similarity and totally the same.
Now we all knew that after enough iterations, SimRank will always converge to a specific value. Why don’t we plot it out to check how fast it’s converging?
Testing the convergence on graph_4.txt
There are two important observations here. The first one is that just like how we expected, the value started at 0 and gradually increased to a certain value and stop changing. The second one is that the curve is smoother than the plot we got from HITS or PageRank algorithm.
Please note that it may not always take only this few iterations to complete the calculation. For example, if we test this SimRank algorithm on graph_6 in the repo, which has 1228 nodes and 5220 edges, even 500 iteration is not enough for the SimRank to converge. And the computation takes forever long due to a large number of edges.
Number of Edges
We run 100 iterations with a different number of total edges in order to spot the relation between total edges and computation time. As you can see, the inference of edges number on the computation time is increasing faster than linear. So the calculation will be extremely slow if there are too many edges.
Please note that the reason it’s not a perfect curve is the way the edges link to each other will also affect the computation time a little.
Read More …