Skip to content

Implementing personalized PageRank for graph node scoring. #493

@maniospas

Description

@maniospas

Before filing

  • I searched the existing issues and did not find a duplicate.
  • This is a concrete proposal, not an open-ended design question.

Kind of addition

  • New algorithm
  • New data structure
  • New property map or utility
  • New I/O format reader or writer
  • Extension to an existing component
  • Performance improvement
  • Other (specify below)

Motivation

Following preliminary discussion in #492 , this proposal aims to bring the popular personalized node scoring/ranking algorithm to BGL. Personalized node scoring can be used to score nodes with respect to their structural proximity to a (weighted) set of seed nodes, so it can be used for overlapping community member scoring and ranking. It can also be used as a building block for more complex community detection schemes that satisfy theoretical properties like finding structural communities of low conductance given that the seed nodes lie closely together. Finally, node scores can be used for link prediction by starting from one node and finding structural close alternatives.

The proposal below presents some basic information for bringing these kinds of methods, and can be used as future basis for also implementing more spectral graph filters (not to be confused with full-blown spectral processing in that spectral filters act on eigenvalue-based properties by actually working only on the vertex domain for near-linear instead of quadratic time complexity).

Proposed addition

In linear algebra terms, the algorithm implements the iterative scheme:

$p = (1-dmp)Ap+dmp p_0$

where A is broadly a normalized graph adjacency matrix, dmp is the equivalent of the dampening factor (equal to 1-restart probability of equivalent random walk with restart schemes), and $p_0$ a starting personalization vector. As an API, this would mainly follow the existing page_rank interface:

template <typename Graph, typename RankMap, typename Done>
void personalized_page_rank(const Graph& g, RankMap rank_map, Done done,
               typename property_traits<RankMap>::value_type damping,
               typename graph_traits<Graph>::vertices_size_type n,
               Normalization normalization,
               Optimizer optimizer,
               RankMap1 rank_map1,
               RankMap2 rank_map2);

Simplifications analogous to page_rank's will be added, for example so that done uses some common default (see below for planned options for convergence tracking; it is not as simple a problem as it appears).

Important differences:

A. rank_map will hold both input personalization $p_0$ and the eventually outputted $p$, in alignment with page_rank's interface. This can change but I think it's elegant. Two more storage containers for node values are inserted, as we need to keep track the previous and next node values, as well s the personalization. All vertex value containers would be ReadWritePropertyMaps.

B. I am proposing that the default for done should not be a fixed number of iterations, as this creates a high computational overhead or may not be enough for robustness in scoring or rank order. Since I have worked on research determining robust stopping orders [2], I am proposing implementation of three different schemes, of which the first one would be the default, the second an alternative when correct ranking order does not matter as much as correct scoring, and the third is a practically unknown but a more theoretically rigorous version of the first:

  • Run for max(10, ceil(1/dmp)) iterations to account for most random walks of the equivalent random walk with restart scheme per [2]
  • Use mean(abs(p-p_previous))<tol as the stopping criterion with tol=10E-9 as the default (the popular Python library networkx does this, but its default tol=10E-6 is not sufficient for most graphs).
  • [Optional implementation - may consume a PR by itself] Use the adaptive criteria of [2] as a more advanced version of the previous bullet point, where a confidence score is provided. The exact criterion is a bit complicated to write as markdown (see the paper) but would require an interface akin to the above bullet point.

This means that Done would in general require as inputs const RankMap& rank_map1 and const RankMap& rank_map2 to compare, and contain early stopping parameter as well as a user-provided upper bound to the number of iterations that can be 100 by default. The last one is needed because, depending on what input are provided, the computations may never converge and it would be nice for applications to stop. I see that other components throw boost::throw_exception and we could also do it here.

C. Normalization means that each edge is weighted by either 1/out_degree(u) or, more common for results compatible with spectral graph theory, by 1/sqrt(out_degree(u)*in_degree(v)) (would collapse to 1/sqrt(out_degree(u)*out_degree(v)) for edge (u,v) of an IncidenceGraph). I have a sense that this may be more computationally intensive than needed and that there should be a mechanism to cache edge weights. So, for this, I have a question if there is some appropriate structure to use.

In order to not overcomplicate things, a first implementation would just run the computations eagerly, making normalization above would be an enum enum class Normalization { Markovian, Symmetric, Laplacian }, corresponding to divisin with outdegree, the geometric means of degree, and the last case while using (I-A) instead of A to obtain a high-pass instead of low-pass filter.

Extensibility considerations: Future versions could accept as argument a storage container for edge weights (like WeightMap for betweeness centrality) as well as means of precomputing and sharing those, but that would be too much potentially interface-risky work for my first PR given that I also need to practically familiarize myself with the codebase's abstractions. Such an extension would be backwards compatible with the above interface.

D. The above design has delegated an optimizer parameter, which will account for future schemas for speeding up convergence speed. There are many means of performing such optimizations, some of which have grown a bit dim in my mind and I would need to re-read material to implement them. However, the most important of those to be included as a first version, and aside from not making any "optimization", is to again normalize vectors after each iteration.

This renormalization has the advantage of performing numerically stable computations by keeping track of the L2 norm of the input/personalization, normalizing the latter, normalizing p after each iteration, and then multiplying the results with the tracked norm. This also speeds up convergence a lot (and even makes it possible if node scores would otherwise be numerically divergent), but has the issue of losing some nuance in the resulting rank magnitude and should therefore be an opt-in (or opt-out) scheme.

Since I am not sure what extensions would be made in the future, I am proposing the optimizer to be of enum class Optimizer { None, Renormalize } (if needed, it could be combined with Normalization as a bitfield-based approach but would prefer to keep type-based explicitness).

Relation to existing Boost.Graph components

This is similar to the page_rank algorithm.

Prior art

  • Reference paper or textbook (cite below)
  • Implementation in another library (link below)
  • None / original idea

References:

Introductory paper is [1], but the proposal aims to follow the implementation of [3], which different from the original paper by adding options for more recent esearch results (see above). The stopping criteria would be determined per [2].

  1. https://api.semanticscholar.org/CorpusID:1508503
  2. 10.1109/ASONAM49781.2020.9381435
  3. https://github.com/MKLab-ITI/pygrank/blob/master/pygrank/algorithms/filters/adhoc/pagerank/pagerank.py

Are you willing to contribute?

  • I'd like to submit the implementation as a pull request.
  • I can help with review or testing, but not the implementation.
  • I'm only proposing the idea.

Edit: typo fixing

Metadata

Metadata

Assignees

Labels

algorithmtype of issue related to algorithms
No fields configured for Feature.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions