A couple weeks ago, Facebook launched a link prediction contest on Kaggle, with the goal of recommending missing edges in a social graph. I love investigating social networks, so I dug around a little, and since I did well enough to score one of the coveted prizes, I’ll share my approach here.
(For some background, the contest provided a training dataset of edges, a test set of nodes, and contestants were asked to predict missing outbound edges on the test set, using mean average precision as the evaluation metric.)
Exploration
What does the network look like? I wanted to play around with the data a bit first just to get a rough feel, so I made an app to interact with the network around each node.
Here’s a sample:
" alt="1 Untrimmed Network" />
(Go ahead, click on the picture to play with the app yourself. It’s pretty fun.)
The node in black is a selected node from the training set, and we perform a breadthfirst walk of the graph out to a maximum distance of 3 to uncover the local network. Nodes are sized according to their distance from the center, and colored according to a chosen metric (a personalized PageRank in this case; more on this later).
We can see that the central node is friends with three other users (in red), two of whom have fairly large, disjoint networks.
There are quite a few dangling nodes (nodes at distance 3 with only one connection to the rest of the local network), though, so let’s remove these to reveal the core structure:
" alt="1 Untrimmed Network" />
And here’s an embedded version you can manipulate inline:
Since the default view doesn’t encode the distinction between following and follower relationships, we can mouse over each node to see who it follows and who it’s followed by. Here, for example, is the following/follower network of one of the central node’s friends:
[/img]">[/img]" alt="1  Friend1" />
The moused over node is highlighted in black, its friends (users who both follow the node and are followed back in turn) are colored in purple, its followees are teal, and its followers in orange. We can also see that the node shares a friend with the central user (triadic closure, holla!).
Here’s another network, this time of the friend at the bottom:
[/img]">[/img]" alt="1  Friend2" />
Interestingly, while the first friend had several onlyfollowers (in orange), the second friend has none. (which suggests, perhaps, a nodelevel feature that measures how followhungry a user is…)
And here’s one more node, a little further out (maybe a celebrity, given it has nothing but followers?):
[/img]">[/img]" alt="1  Celebrity" />
The Quiet One
Let’s take a look at another graph, one whose local network is a little smaller:
[/img]">[/img]" alt="4 Network" />
A Social Butterfly
And one more, whose local network is a little larger:
[/img]">[/img]" alt="2 Network" />
[/img]">[/img]" alt="2 Network  Friend" />
Again, I encourage everyone to play around with the app here, and I’ll come back to the question of coloring each node later.
Distributions
Next, let’s take a more quantitative look at the graph.
Here’s the distribution of the number of followers of each node in the training set (cut off at 50 followers for a better fit – the maximum number of followers is 552), as well as the number of users each node is following (again, cut off at 50 – the maximum here is 1566)
[/img]">[/img]" alt="Training Followers" />
[/img]">[/img]" alt="Training Followees" />
Nothing terribly surprising, but that alone is good to verify. (For people tempted to mutter about power laws, I’ll hold you off with the bitter coldness of baby Gauss’s tears.)
Similarly, here are the same two graphs, but limited to the nodes in the test set alone:
[/img]">[/img]" alt="Test Followers" />
[/img]">[/img]" alt="Test Followees" />
Notice that there are relatively more test set users with 0 followees than in the full training set, and relatively fewer test set users with 0 followers. This information could be used to better simulate a validation set for model selection, though I didn’t end up doing this myself.
Preliminary Probes
Finally, let’s move on to the models themselves.
In order to quickly get up and running on a couple prediction algorithms, I started with some unsupervised approaches. For example, after building a new validation set* to test performance offline, I tried:
 Recommending users who follow you (but you don’t follow in return)
 Recommending users similar to you (when representing users as sets of their followers, and using cosine similarity and Jaccard similarity as the similarity metric)
 Recommending users based on a personalized PageRank score
 Recommending users that the people you follow also follow
And so on, combining the votes of these algorithms in a fairly adhoc way (e.g., by taking the majority vote or by ordering by the number of followers).
This worked quite well actually, but I’d been planning to move on to a more machine learned modelbased approach from the beginning, so I did that next.
*My validation set was formed by deleting random edges from the full training set. A slightly better approach, as mentioned above, might have been to more accurately simulate the distribution of the official test set, but I didn’t end up trying this out myself.
Candidate Selection
In order to run a machine learning algorithm to recommend edges (which would take two nodes, a source and a candidate destination, and generate a score measuring the likelihood that the source would follow the destination), it’s necessary to prune the set of candidates to run the algorithm on.
I used two approaches for this filtering step, both based on random walks on the graph.
Personalized PageRank
The first approach was to calculate a personalized PageRank around each source node.
Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).
That is, the random surfer in the personalized PageRank model works as follows:
 He starts at the source node $X$ that we want to calculate a personalized PageRank around.
 At step $i$: with probability $p$, the surfer moves to a neighboring node chosen uniformly at random; with probability $1p$, the surfer instead teleports back to the original source node $X$.
 The limiting probability that the surfer is at node $N$ is then the personalized PageRank score of node $N$ around $X$.
Here’s some Scala code that computes approximate personalized PageRank scores and takes the highestscoring nodes as the candidates to feed into the machine learning model:
Personalized PageRank1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67  /** * Calculate a personalized PageRank around the given user, and return * a list of the nodes with the highest personalized PageRank scores. * * @return A list of (node, probability of landing at this node after * running a personalized PageRank for K iterations) pairs. */ def pageRank(user: Int): List[(Int, Double)] = { // This map holds the probability of landing at each node, up to the // current iteration. val probs = Map[Int, Double]() probs(user) = 1 // We start at this user. val pageRankProbs = pageRankHelper(start, probs, NumPagerankIterations) pageRankProbs.toList .sortBy { _._2 } .filter { case (node, score) => !getFollowings(user).contains(node) && node != user } .take(MaxNodesToKeep) } /** * Simulates running a personalized PageRank for one iteration. * * Parameters: * start  the start node to calculate the personalized PageRank around * probs  a map from nodes to the probability of being at that node at * the start of the current iteration * numIterations  the number of iterations remaining * alpha  with probability alpha, we follow a neighbor; with probability * 1  alpha, we teleport back to the start node * * @return A map of node > probability of landing at that node after the * specified number of iterations. */ def pageRankHelper(start: Int, probs: Map[Int, Double], numIterations: Int, alpha: Double = 0.5): Map[Int, Double] = { if (numIterations <= 0) { probs } else { // Holds the updated set of probabilities, after this iteration. val probsPropagated = Map[Int, Double]() // With probability 1  alpha, we teleport back to the start node. probsPropagated(start) = 1  alpha // Propagate the previous probabilities... probs.foreach { case (node, prob) => val forwards = getFollowings(node) val backwards = getFollowers(node) // With probability alpha, we move to a follower... // And each node distributes its current probability equally to // its neighbors. val probToPropagate = alpha * prob / (forwards.size + backwards.size) (forwards.toList ++ backwards.toList).foreach { neighbor => if (!probsPropagated.contains(neighbor)) { probsPropagated(neighbor) = 0 } probsPropagated(neighbor) += probToPropagate } } pageRankHelper(start, probsPropagated, numIterations  1, alpha) } }

Propagation Score
Another approach I used, based on a proposal by another contestant on the Kaggle forums, works as follows:
 Start at a specified user node and give it some score.
 In the first iteration, this user propagates its score equally to its neighbors.
 In the second iteration, each user duplicates and keeps half of its score S. It then propagates S equally to its neighbors.
 In subsequent iterations, the process is repeated, except that neighbors reached via a backwards link don’t duplicate and keep half of their score. (The idea is that we want the score to reach followees and not followers.)
Here’s some Scala code to calculate these propagation scores:
Propagation Score1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61  /** * Calculate propagation scores around the current user. * * In the first propagation round, we * *  Give the starting node N an initial score S. *  Propagate the score equally to each of N's neighbors (followers * and followings). *  Each firstlevel neighbor then duplicates and keeps half of its score * and then propagates the original again to its neighbors. * * In further rounds, neighbors then repeat the process, except that neighbors * traveled to via a backwards/follower link don't keep half of their score. * * @return a sorted list of (node, propagation score) pairs. */ def propagate(user: Int): List[(Int, Double)] = { val scores = Map[Int, Double]() // We propagate the score equally to all neighbors. val scoreToPropagate = 1.0 / (getFollowings(user).size + getFollowers(user).size) (getFollowings(user).toList ++ getFollowers(user).toList).foreach { x => // Propagate the score... continuePropagation(scores, x, scoreToPropagate, 1) // ...and make sure it keeps half of it for itself. scores(x) = scores.getOrElse(x, 0: Double) + scoreToPropagate / 2 } scores.toList.sortBy { _._2 } .filter { nodeAndScore => val node = nodeAndScore._1 !getFollowings(user).contains(node) && node != user } .take(MaxNodesToKeep) } /** * In further rounds, neighbors repeat the process above, except that neighbors * traveled to via a backwards/follower link don't keep half of their score. */ def continuePropagation(scores: Map[Int, Double], user: Int, score: Double, currIteration: Int): Unit = { if (currIteration < NumIterations && score > 0) { val scoreToPropagate = score / (getFollowings(user).size + getFollowers(user).size) getFollowings(user).foreach { x => // Propagate the score... continuePropagation(scores, x, scoreToPropagate, currIteration + 1) // ...and make sure it keeps half of it for itself. scores(x) = scores.getOrElse(x, 0: Double) + scoreToPropagate / 2 } getFollowers(user).foreach { x => // Propagate the score... continuePropagation(scores, x, scoreToPropagate, currIteration + 1) // ...but backward links (except for the starting node's immediate // neighbors) don't keep any score for themselves. } } }

I played around with tweaking some parameters in both approaches (e.g., weighting followers and followees differently), but the natural defaults (as used in the code above) ended up performing the best.
Features
After pruning the set of candidate destination nodes to a more feasible level, I fed pairs of (source, destination) nodes into a machine learning model. From each pair, I extracted around 30 features in total.
As mentioned above, one feature that worked quite well on its own was whether the destination node already follows the source.
I also used a wide set of similaritybased features, for example, the Jaccard similarity between the source and destination when both are represented as sets of their followers, when both are represented as sets of their followees, or when one is represented as a set of followers while the other is represented as a set of followees.
Similarity Metrics1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49  abstract class SimilarityMetric[T] { def apply(set1: Set[T], set2: Set[T]): Double; } object JaccardSimilarity extends SimilarityMetric[Int] { /** * Returns the Jaccard similarity between two sets, 0 if both are empty. */ def apply(set1: Set[Int], set2: Set[Int]): Double = { val union = (set1.union(set2)).size if (union == 0) { 0 } else { (set1 & set2).size.toFloat / union } } } object CosineSimilarity extends SimilarityMetric[Int] { /** * Returns the cosine similarity between two sets, 0 if both are empty. */ def apply(set1: Set[Int], set2: Set[Int]): Double = { if (set1.size == 0 && set2.size == 0) { 0 } else { (set1 & set2).size.toFloat / (math.sqrt(set1.size * set2.size)) } } } // ************ // * FEATURES * // ************ /** * Returns the similarity between user1 and user2 when both are represented as * sets of followers. */ def similarityByFollowers(user1: Int, user2: Int) (implicit similarity: SimilarityMetric[Int]): Double = { similarity.apply(getFollowersWithout(user1, user2), getFollowersWithout(user2, user1)) } // etc.

Along the same lines, I also computed a similarity score between the destination node and the source node’s followees, and several variations thereof.
Extended Similarity Scores1 2 3 4 5 6 7 8 9 10 11  /** * Iterate over each of user1's followings, compute their similarity with * user2 when both are represented as sets of followers, and return the * sum of these similarities. */ def followerBasedSimilarityToFollowing(user1: Int, user2: Int) (implicit similarity: SimilarityMetric[Int]): Double = { getFollowingsWithout(user1, user2) .map { similarityByFollowers(_, user2)(similarity) } .sum }

Other features included the number of followers and followees of each node, the ratio of these, the personalized PageRank and propagation scores themselves, the number of followers in common, and triangle/closuretype features (e.g., whether the source node is friends with a node X who in turn is a friend of the destination node).
If I had had more time, I would probably have tried weighted and more regularized versions of some of these features as well (e.g., downweighting nodes with large numbers of followers when computing cosine similarity scores based on followees, or shrinking the scores of nodes we have little information about).
Feature Understanding
But what are these features actually doing? Let’s use the same app I built before to take a look.
Here’s the local network of node 317 (different from the node above), where each node is colored by its personalized PageRank (higher scores are in darker red):
[/img]">[/img]" alt="317  Personalized PageRank" />
If we look at the following vs. follower relationships of the central node (recall that purple is friends, teal is followings, orange is followers):
[/img]">[/img]" alt="317  Personalized PageRank" />
…we can see that, as expected (because edges that represented both following and follower were doubleweighted in my PageRank calculation), the darkest red nodes are those that are friends with the central node, while those in a followingonly or followeronly relationship have a lower score.
How does the propagation score compare to personalized PageRank? Here, I colored each node according to the log ratio of its propagation score and personalized PageRank:
[/img]">[/img]" alt="317  Log Ratio" />
Comparing this coloring with the local follow/follower network:
[/img]">[/img]" alt="317  Local Network of Node" />
…we can see that followed nodes (in teal) receive a higher propagation weight than friend nodes (in purple), while follower nodes (in orange) receive almost no propagation score at all.
Going back to node 1, let’s look at a different metric. Here, each node is colored according to its Jaccard similarity with the source, when nodes are represented by the set of their followers:
[/img]">[/img]" alt="1  Similarity by Followers" />
We can see that, while the PageRank and propagation metrics tended to favor nodes close to the central node, the Jaccard similarity feature helps us explore nodes that are further out.
However, if we look the highscoring nodes more closely, we see that they often have only a single connection to the rest of the network:
[/img]">[/img]" alt="1  Single Connection" />
In other words, their high Jaccard similarity is due to the fact that they don’t have many connections to begin with. This suggests that some regularization or shrinking is in order.
So here’s a regularized version of Jaccard similarity, where we downweight nodes with few connections:
[/img]">[/img]" alt="1  Regularized" />
We can see that the outlier nodes are much more muted this time around.
For a starker difference, compare the following two graphs of the Jaccard similarity metric around node 317 (the first graph is an unregularized version, the second is regularized):
[/img]">[/img]" alt="317  Unregularized" />
[/img]">[/img]" alt="317  Regularized" />
Notice, in particular, how the popular node in the top left and the popular nodes at the bottom have a much higher score when we regularize.
And again, there are other networks and features I haven’t mentioned here, so play around and discover them on the app itself.
Models
For the machine learning algorithms on top of my features, I experimented with two types of models: logistic regression (using both L1 and L2 regularization) and random forests. (If I had more time, I would probably have done some more parameter tuning and maybe tried gradient boosted trees as well.)
So what is a random forest? I wrote an old (layman’s) post on it here, but since nobody ever clicks on these links, let’s copy it over:
Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questionslike game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you *really really* loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you t