Author Topic: Edge Prediction in a Social Graph: My Solution to Facebook's User Recommendation Contest on Kaggle  (Read 94 times)

0 Members and 1 Guest are viewing this topic.

Offline Flavio58

Advertisement
Edge Prediction in a Social Graph: My Solution to Facebook's User Recommendation Contest on Kaggle

[html]

A couple weeks ago, Facebook launched a link prediction contest on Kaggle, with the goal of recommending missing edges in a social graph. I love investigating social networks, so I dug around a little, and since I did well enough to score one of the coveted prizes, I’ll share my approach here.



 

(For some background, the contest provided a training dataset of edges, a test set of nodes, and contestants were asked to predict missing outbound edges on the test set, using mean average precision as the evaluation metric.)



 

Exploration



 

What does the network look like? I wanted to play around with the data a bit first just to get a rough feel, so I made an app to interact with the network around each node.



 

Here’s a sample:



 

" alt="1 Untrimmed Network" />



 

(Go ahead, click on the picture to play with the app yourself. It’s pretty fun.)



 

The node in black is a selected node from the training set, and we perform a breadth-first walk of the graph out to a maximum distance of 3 to uncover the local network. Nodes are sized according to their distance from the center, and colored according to a chosen metric (a personalized PageRank in this case; more on this later).



 

We can see that the central node is friends with three other users (in red), two of whom have fairly large, disjoint networks.



 

There are quite a few dangling nodes (nodes at distance 3 with only one connection to the rest of the local network), though, so let’s remove these to reveal the core structure:



 

" alt="1 Untrimmed Network" />



 

And here’s an embedded version you can manipulate inline:



 


 

Since the default view doesn’t encode the distinction between following and follower relationships, we can mouse over each node to see who it follows and who it’s followed by. Here, for example, is the following/follower network of one of the central node’s friends:



 

[/img]">[/img]" alt="1 - Friend1" />



 

The moused over node is highlighted in black, its friends (users who both follow the node and are followed back in turn) are colored in purple, its followees are teal, and its followers in orange. We can also see that the node shares a friend with the central user (triadic closure, holla!).



 

Here’s another network, this time of the friend at the bottom:



 

[/img]">[/img]" alt="1 - Friend2" />



 

Interestingly, while the first friend had several only-followers (in orange), the second friend has none. (which suggests, perhaps, a node-level feature that measures how follow-hungry a user is…)



 

And here’s one more node, a little further out (maybe a celebrity, given it has nothing but followers?):



 

[/img]">[/img]" alt="1 - Celebrity" />



 

The Quiet One



 

Let’s take a look at another graph, one whose local network is a little smaller:



 

[/img]">[/img]" alt="4 Network" />



 

A Social Butterfly



 

And one more, whose local network is a little larger:



 

[/img]">[/img]" alt="2 Network" />



 

[/img]">[/img]" alt="2 Network - Friend" />



 

Again, I encourage everyone to play around with the app here, and I’ll come back to the question of coloring each node later.



 

Distributions



 

Next, let’s take a more quantitative look at the graph.



 

Here’s the distribution of the number of followers of each node in the training set (cut off at 50 followers for a better fit – the maximum number of followers is 552), as well as the number of users each node is following (again, cut off at 50 – the maximum here is 1566)



 

[/img]">[/img]" alt="Training Followers" />



 

[/img]">[/img]" alt="Training Followees" />



 

Nothing terribly surprising, but that alone is good to verify. (For people tempted to mutter about power laws, I’ll hold you off with the bitter coldness of baby Gauss’s tears.)



 

Similarly, here are the same two graphs, but limited to the nodes in the test set alone:



 

[/img]">[/img]" alt="Test Followers" />



 

[/img]">[/img]" alt="Test Followees" />



 

Notice that there are relatively more test set users with 0 followees than in the full training set, and relatively fewer test set users with 0 followers. This information could be used to better simulate a validation set for model selection, though I didn’t end up doing this myself.



 

Preliminary Probes



 

Finally, let’s move on to the models themselves.



 

In order to quickly get up and running on a couple prediction algorithms, I started with some unsupervised approaches. For example, after building a new validation set* to test performance offline, I tried:



 

     
  • Recommending users who follow you (but you don’t follow in return)

  •  
  • Recommending users similar to you (when representing users as sets of their followers, and using cosine similarity and Jaccard similarity as the similarity metric)

  •  
  • Recommending users based on a personalized PageRank score

  •  
  • Recommending users that the people you follow also follow

  •  



 

And so on, combining the votes of these algorithms in a fairly ad-hoc way (e.g., by taking the majority vote or by ordering by the number of followers).



 

This worked quite well actually, but I’d been planning to move on to a more machine learned model-based approach from the beginning, so I did that next.



 

*My validation set was formed by deleting random edges from the full training set. A slightly better approach, as mentioned above, might have been to more accurately simulate the distribution of the official test set, but I didn’t end up trying this out myself.



 

Candidate Selection



 

In order to run a machine learning algorithm to recommend edges (which would take two nodes, a source and a candidate destination, and generate a score measuring the likelihood that the source would follow the destination), it’s necessary to prune the set of candidates to run the algorithm on.



 

I used two approaches for this filtering step, both based on random walks on the graph.



 

Personalized PageRank



 

The first approach was to calculate a personalized PageRank around each source node.



 

Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).



 

That is, the random surfer in the personalized PageRank model works as follows:



 

     
  • He starts at the source node $X$ that we want to calculate a personalized PageRank around.

  •  
  • At step $i$: with probability $p$, the surfer moves to a neighboring node chosen uniformly at random; with probability $1-p$, the surfer instead teleports back to the original source node $X$.

  •  
  • The limiting probability that the surfer is at node $N$ is then the personalized PageRank score of node $N$ around $X$.

  •  



 

Here’s some Scala code that computes approximate personalized PageRank scores and takes the highest-scoring nodes as the candidates to feed into the machine learning model:



 
Personalized PageRank
1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
  62
  63
  64
  65
  66
  67
 
/**
 
* Calculate a personalized PageRank around the given user, and return
 
* a list of the nodes with the highest personalized PageRank scores.
 
*
 
* @return A list of (node, probability of landing at this node after
 
*         running a personalized PageRank for K iterations) pairs.
 
*/
 
def pageRank(user: Int): List[(Int, Double)] = {
 
  // This map holds the probability of landing at each node, up to the
 
  // current iteration.
 
  val probs = Map[Int, Double]()
 
  probs(user) = 1 // We start at this user.
 

 
  val pageRankProbs = pageRankHelper(start, probs, NumPagerankIterations)
 
  pageRankProbs.toList
 
               .sortBy { -_._2 }
 
               .filter { case (node, score) =>
 
                  !getFollowings(user).contains(node) && node != user
 
                }
 
                .take(MaxNodesToKeep)
 
}
 

 
/**
 
* Simulates running a personalized PageRank for one iteration.
 
*
 
* Parameters:
 
* start - the start node to calculate the personalized PageRank around
 
* probs - a map from nodes to the probability of being at that node at
 
*         the start of the current iteration
 
* numIterations - the number of iterations remaining
 
* alpha - with probability alpha, we follow a neighbor; with probability
 
*         1 - alpha, we teleport back to the start node
 
*
 
* @return A map of node -> probability of landing at that node after the
 
*         specified number of iterations.
 
*/
 
def pageRankHelper(start: Int, probs: Map[Int, Double], numIterations: Int,
 
                   alpha: Double = 0.5): Map[Int, Double] = {
 
  if (numIterations <= 0) {
 
    probs
 
  } else {
 
    // Holds the updated set of probabilities, after this iteration.
 
    val probsPropagated = Map[Int, Double]()
 

 
    // With probability 1 - alpha, we teleport back to the start node.
 
    probsPropagated(start) = 1 - alpha
 

 
    // Propagate the previous probabilities...
 
    probs.foreach { case (node, prob) =>
 
      val forwards = getFollowings(node)
 
      val backwards = getFollowers(node)
 

 
      // With probability alpha, we move to a follower...
 
      // And each node distributes its current probability equally to
 
      // its neighbors.
 
      val probToPropagate = alpha * prob / (forwards.size + backwards.size)
 
      (forwards.toList ++ backwards.toList).foreach { neighbor =>
 
        if (!probsPropagated.contains(neighbor)) {
 
          probsPropagated(neighbor) = 0
 
        }
 
        probsPropagated(neighbor) += probToPropagate
 
      }
 
    }
 

 
    pageRankHelper(start, probsPropagated, numIterations - 1, alpha)
 
  }
 
}
 



 

Propagation Score



 

Another approach I used, based on a proposal by another contestant on the Kaggle forums, works as follows:



 

     
  • Start at a specified user node and give it some score.

  •  
  • In the first iteration, this user propagates its score equally to its neighbors.

  •  
  • In the second iteration, each user duplicates and keeps half of its score S. It then propagates S equally to its neighbors.

  •  
  • In subsequent iterations, the process is repeated, except that neighbors reached via a backwards link don’t duplicate and keep half of their score. (The idea is that we want the score to reach followees and not followers.)

  •  



 

Here’s some Scala code to calculate these propagation scores:



 
Propagation Score
1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
 
/**
 
* Calculate propagation scores around the current user.
 
*
 
* In the first propagation round, we
 
*
 
* - Give the starting node N an initial score S.
 
* - Propagate the score equally to each of N's neighbors (followers
 
*   and followings).
 
* - Each first-level neighbor then duplicates and keeps half of its score
 
*   and then propagates the original again to its neighbors.
 
*
 
* In further rounds, neighbors then repeat the process, except that neighbors
 
* traveled to via a backwards/follower link don't keep half of their score.
 
*
 
* @return a sorted list of (node, propagation score) pairs.
 
*/
 
def propagate(user: Int): List[(Int, Double)] = {
 
  val scores = Map[Int, Double]()
 

 
  // We propagate the score equally to all neighbors.
 
  val scoreToPropagate = 1.0 / (getFollowings(user).size + getFollowers(user).size)
 

 
  (getFollowings(user).toList ++ getFollowers(user).toList).foreach { x =>
 
    // Propagate the score...
 
    continuePropagation(scores, x, scoreToPropagate, 1)
 
    // ...and make sure it keeps half of it for itself.
 
    scores(x) = scores.getOrElse(x, 0: Double) + scoreToPropagate / 2
 
  }
 

 
  scores.toList.sortBy { -_._2 }
 
               .filter { nodeAndScore =>
 
                 val node = nodeAndScore._1
 
                 !getFollowings(user).contains(node) && node != user
 
                }
 
                .take(MaxNodesToKeep)
 
}
 

 
/**
 
* In further rounds, neighbors repeat the process above, except that neighbors
 
* traveled to via a backwards/follower link don't keep half of their score.
 
*/
 
def continuePropagation(scores: Map[Int, Double], user: Int, score: Double,
 
                        currIteration: Int): Unit = {
 
  if (currIteration < NumIterations && score > 0) {
 
    val scoreToPropagate = score / (getFollowings(user).size + getFollowers(user).size)
 

 
    getFollowings(user).foreach { x =>
 
      // Propagate the score...       
 
      continuePropagation(scores, x, scoreToPropagate, currIteration + 1)
 
      // ...and make sure it keeps half of it for itself.       
 
      scores(x) = scores.getOrElse(x, 0: Double) + scoreToPropagate / 2
 
    }
 

 
    getFollowers(user).foreach { x =>
 
      // Propagate the score...
 
      continuePropagation(scores, x, scoreToPropagate, currIteration + 1)
 
      // ...but backward links (except for the starting node's immediate
 
      // neighbors) don't keep any score for themselves.
 
    }
 
  }
 
}
 



 

I played around with tweaking some parameters in both approaches (e.g., weighting followers and followees differently), but the natural defaults (as used in the code above) ended up performing the best.



 

Features



 

After pruning the set of candidate destination nodes to a more feasible level, I fed pairs of (source, destination) nodes into a machine learning model. From each pair, I extracted around 30 features in total.



 

As mentioned above, one feature that worked quite well on its own was whether the destination node already follows the source.



 

I also used a wide set of similarity-based features, for example, the Jaccard similarity between the source and destination when both are represented as sets of their followers, when both are represented as sets of their followees, or when one is represented as a set of followers while the other is represented as a set of followees.



 
Similarity Metrics
1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
 
abstract class SimilarityMetric[T] {
 
  def apply(set1: Set[T], set2: Set[T]): Double;
 
}
 

 
object JaccardSimilarity extends SimilarityMetric[Int] {
 
  /**
 
   * Returns the Jaccard similarity between two sets, 0 if both are empty.
 
   */
 
  def apply(set1: Set[Int], set2: Set[Int]): Double = {
 
    val union = (set1.union(set2)).size
 

 
    if (union == 0) {
 
      0
 
    } else {
 
      (set1 & set2).size.toFloat / union
 
    }
 
  }
 

 
}
 

 
object CosineSimilarity extends SimilarityMetric[Int] {
 
  /**
 
   * Returns the cosine similarity between two sets, 0 if both are empty.
 
   */
 
  def apply(set1: Set[Int], set2: Set[Int]): Double = {
 
    if (set1.size == 0 && set2.size == 0) {
 
      0
 
    } else {
 
      (set1 & set2).size.toFloat / (math.sqrt(set1.size * set2.size))
 
    }
 
  }
 

 
}
 

 
// ************
 
// * FEATURES *
 
// ************
 

 
/**
 
* Returns the similarity between user1 and user2 when both are represented as
 
* sets of followers.
 
*/
 
def similarityByFollowers(user1: Int, user2: Int)
 
                         (implicit similarity: SimilarityMetric[Int]): Double = {
 
  similarity.apply(getFollowersWithout(user1, user2),
 
                   getFollowersWithout(user2, user1))
 
}
 

 
// etc.
 



 

Along the same lines, I also computed a similarity score between the destination node and the source node’s followees, and several variations thereof.



 
Extended Similarity Scores
1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  11
 
/**
 
* Iterate over each of user1's followings, compute their similarity with
 
* user2 when both are represented as sets of followers, and return the
 
* sum of these similarities.
 
*/
 
def followerBasedSimilarityToFollowing(user1: Int, user2: Int)
 
  (implicit similarity: SimilarityMetric[Int]): Double = {
 
    getFollowingsWithout(user1, user2)
 
                        .map { similarityByFollowers(_, user2)(similarity) }
 
                        .sum
 
}
 



 

Other features included the number of followers and followees of each node, the ratio of these, the personalized PageRank and propagation scores themselves, the number of followers in common, and triangle/closure-type features (e.g., whether the source node is friends with a node X who in turn is a friend of the destination node).



 

If I had had more time, I would probably have tried weighted and more regularized versions of some of these features as well (e.g., downweighting nodes with large numbers of followers when computing cosine similarity scores based on followees, or shrinking the scores of nodes we have little information about).



 

Feature Understanding



 

But what are these features actually doing? Let’s use the same app I built before to take a look.



 

Here’s the local network of node 317 (different from the node above), where each node is colored by its personalized PageRank (higher scores are in darker red):



 

[/img]">[/img]" alt="317 - Personalized PageRank" />



 

If we look at the following vs. follower relationships of the central node (recall that purple is friends, teal is followings, orange is followers):



 

[/img]">[/img]" alt="317 - Personalized PageRank" />



 

…we can see that, as expected (because edges that represented both following and follower were double-weighted in my PageRank calculation), the darkest red nodes are those that are friends with the central node, while those in a following-only or follower-only relationship have a lower score.



 

How does the propagation score compare to personalized PageRank? Here, I colored each node according to the log ratio of its propagation score and personalized PageRank:



 

[/img]">[/img]" alt="317 - Log Ratio" />



 

Comparing this coloring with the local follow/follower network:



 

[/img]">[/img]" alt="317 - Local Network of Node" />



 

…we can see that followed nodes (in teal) receive a higher propagation weight than friend nodes (in purple), while follower nodes (in orange) receive almost no propagation score at all.



 

Going back to node 1, let’s look at a different metric. Here, each node is colored according to its Jaccard similarity with the source, when nodes are represented by the set of their followers:



 

[/img]">[/img]" alt="1 - Similarity by Followers" />



 

We can see that, while the PageRank and propagation metrics tended to favor nodes close to the central node, the Jaccard similarity feature helps us explore nodes that are further out.



 

However, if we look the high-scoring nodes more closely, we see that they often have only a single connection to the rest of the network:



 

[/img]">[/img]" alt="1 - Single Connection" />



 

In other words, their high Jaccard similarity is due to the fact that they don’t have many connections to begin with. This suggests that some regularization or shrinking is in order.



 

So here’s a regularized version of Jaccard similarity, where we downweight nodes with few connections:



 

[/img]">[/img]" alt="1 - Regularized" />



 

We can see that the outlier nodes are much more muted this time around.



 

For a starker difference, compare the following two graphs of the Jaccard similarity metric around node 317 (the first graph is an unregularized version, the second is regularized):



 

[/img]">[/img]" alt="317 - Unregularized" />



 

[/img]">[/img]" alt="317 - Regularized" />



 

Notice, in particular, how the popular node in the top left and the popular nodes at the bottom have a much higher score when we regularize.



 

And again, there are other networks and features I haven’t mentioned here, so play around and discover them on the app itself.



 

Models



 

For the machine learning algorithms on top of my features, I experimented with two types of models: logistic regression (using both L1 and L2 regularization) and random forests. (If I had more time, I would probably have done some more parameter tuning and maybe tried gradient boosted trees as well.)



 

So what is a random forest? I wrote an old (layman’s) post on it here, but since nobody ever clicks on these links, let’s copy it over:



 

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.

    Thus, Willow is a decision tree for your movie preferences.

    But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).

    Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you *really really* loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you t



Consulente in Informatica dal 1984

Software automazione, progettazione elettronica, computer vision, intelligenza artificiale, IoT, sicurezza informatica, tecnologie di sicurezza militare, SIGINT. 

Facebook:https://www.facebook.com/flaviobernardotti58
Twitter : https://www.twitter.com/Flavio58

Cell:  +39 366 3416556

f.bernardotti@deeplearningitalia.eu

#deeplearning #computervision #embeddedboard #iot #ai

 

Related Topics

  Subject / Started by Replies Last post
0 Replies
76 Views
Last post June 01, 2018, 07:01:34 PM
by Ruggero Respigo
0 Replies
65 Views
Last post June 08, 2018, 01:09:06 AM
by Flavio58
0 Replies
135 Views
Last post June 21, 2018, 02:08:42 PM
by Flavio58
0 Replies
83 Views
Last post July 07, 2018, 10:03:41 PM
by Flavio58
0 Replies
78 Views
Last post September 28, 2018, 10:06:10 PM
by Ruggero Respigo

Sitemap 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326