

Computing Wikipedia’s internal PageRanks

[Screenshot: Java program run for Wikipedia PageRank]

Introduction

Wikipedia is the world’s largest online encyclopedia, comprising millions of pages and links between pages. It also gives away all of this data in an easy-to-process format, via its periodic database dumps. What would happen if we took this set of data and ran the classic PageRank algorithm on every single page?

PageRank is a method for determining the importance of each document in a web of linked documents. Given a set of documents and the set of directed links between them, we can calculate the PageRank of each document using some simple arithmetic (or linear algebra).
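
To make the arithmetic concrete, the standard textbook formulation with a damping factor d (commonly taken to be 0.85) is shown below; this is the general definition, not necessarily the exact variant my program implements:

    \[ \mathrm{PR}(p) = \frac{1 - d}{N} + d \sum_{q \in M(p)} \frac{\mathrm{PR}(q)}{L(q)} \]

Here N is the total number of pages, M(p) is the set of pages that link to p, and L(q) is the number of outgoing links on page q; pages with no outgoing links need extra handling so that the probabilities keep summing to 1.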

I was motivated to calculate the PageRank of each page on Wikipedia for several reasons, the original one being to sort a page’s “What links here” list by importance rather than by MediaWiki’s default order (more on this below).

(While this idea of calculating Wikipedia PageRanks is not new and other people’s published work can be readily found on the web, the specific approach and details are interesting enough that I’d like to share how I did it.)

Results

Top pages

The PageRank of a document is the probability that a visitor will end up at that document after uniform random browsing. As such, the sum of all the PageRanks in the set of documents must equal 1. Because probabilities can get very small, the base-10 logarithm of the PageRank will be shown instead of the raw probability.

A list of the top 10 000 pages on English Wikipedia, along with the log10 PageRanks in descending order: wikipedia-top-pageranks.txt

A treemap of the (linear) probabilities of the top 1000 pages (these account for 7.3% of the total probability):

[Figure: Treemap of Wikipedia’s top 1000 pages]

Sorting “What links here”

As mentioned in the introduction, my original motivation for computing PageRanks was to sort the “What links here” list of pages by PageRank. For the arbitrarily chosen page Telescope (as of 2014-02-08), here are the links in MediaWiki’s order (with various data cleanup) versus in descending PageRank order.

We can see that the original order starts off alphabetically, then becomes a mess of obscure people, places, and concepts with random clusters. By contrast, the PageRanked list shows familiar general concepts for the first hundred results, just as we would expect. This ordered list is useful for playing the game Six Degrees of Wikipedia from human memory, because we can learn to better predict where a page’s incoming links come from.
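
As a rough illustration (not the code I actually used), re-sorting one article’s incoming links once the PageRanks are available can be as simple as the sketch below; the names incomingIds and pageranks are hypothetical, with pageranks[id] holding the computed probability of page ID id:

    import java.util.Arrays;
    import java.util.Comparator;

    // Illustrative only: sort one article's incoming-link page IDs by
    // descending PageRank.
    public final class SortWhatLinksHere {

        public static Integer[] byDescendingPageRank(int[] incomingIds, final double[] pageranks) {
            Integer[] sorted = new Integer[incomingIds.length];
            for (int i = 0; i < incomingIds.length; i++)
                sorted[i] = incomingIds[i];
            Arrays.sort(sorted, new Comparator<Integer>() {
                public int compare(Integer a, Integer b) {
                    return Double.compare(pageranks[b], pageranks[a]);  // Higher rank first
                }
            });
            return sorted;
        }
    }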

Computation statistics
  • Number of pages (filtered): 10 703 129
  • Number of links (filtered): 320 320 061
  • Time per PageRank iteration: 5.2 seconds
    (run on Intel Core i5-2450M 2.50 GHz, Windows 7 Pro SP1, Java 64-bit 1.7.0_25)
  • Iterations to convergence: Approximately 250
  • Total CPU time: About 30 minutes (mostly spent on parsing SQL data values, less spent on computing PageRank)

(All my results are computed from the 2014-01-02 dump.)

Program implementation

Language choice

Handling the Wikipedia page link data set would involve storing and manipulating tens or hundreds of millions of items in memory, so this consideration heavily influenced my choice of programming language; weighing Python, C, and C++ against Java, I ultimately chose Java.

After choosing Java and writing the program, I noticed some drawbacks of this approach. Storing the mapping of page titles (strings) to IDs (ints) was very memory-heavy, weighing in at roughly 500 bytes per entry (for 10 million entries). During the data loading phase, there were many spikes of high CPU usage and moments of frozen program execution because of garbage collection. Normally the Java GC fades into the background, but when dealing with millions of objects stored in gigabytes of memory, it does start to show its ugly side. However, after the SQL text has been parsed and the data has been preprocessed, the rest of the program executes smoothly. Although the high memory usage and GC pauses were undesirable, they were not sufficient reasons to prefer Python, C, or C++ over Java for implementing this PageRank application.

Program architecture

Here is a high-level view of how my program works:

  1. Read the file enwiki-yyyymmdd-page.sql.gz, decompress gzip on the fly, parse the SQL text to extract the tuples of (string page title, int page ID), filter out entries that are not in namespace #0, and store them in memory in a Map<String,Integer>.

  2. Read the file enwiki-yyyymmdd-pagelinks.sql.gz, decompress gzip on the fly, parse the SQL text to extract the tuples, filter out tuples whose target is not a known page or not in namespace #0, represent each link as (int source page ID, int target page ID), and store them in memory in a long[].

  3. Sort the list of page links so that memory will be accessed in a more cache-friendly order, and repack them into a more compact format. First sort by link target ID, then group the entries together using a kind of run-length encoding.

    For example, if we have the list of page links [(5,0), (1,0), (3,2), (4,0)], then sorting it by target page ID (breaking ties by source ID) gives [(1,0), (4,0), (5,0), (3,2)]. Grouping the links by target ID into the format (target page ID, number of incoming links, source page IDs...) gives [(0,3,1,4,5), (2,1,3)] (i.e. page 0 has 3 incoming links {1,4,5}, and page 2 has 1 incoming link {3}).

  4. Do 1000 iterations of the PageRank computation. At each iteration, print a handful of top-ranking pages, along with the amount of change since the previous iteration, so that the user can manually monitor progress and judge whether the numbers have converged. The arithmetic formula for calculating PageRank is well-known and explained better by Wikipedia and other sources, so I won’t describe it on this page; the sketch after this list illustrates the packed format from step 3 together with one such iteration.

  5. Write out all the PageRanks to a file (starting at page ID 0, then 1, 2, 3, etc.). This allows the data from this expensive computation to be loaded for later analysis.
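
To make steps 3 and 4 more concrete, here is a minimal, self-contained sketch of the packed link format and of one PageRank iteration over it. This is an illustration written for this description, not the program’s actual code; in particular, the damping factor of 0.85, the dense page-ID numbering (0 to n−1), and the uniform redistribution of probability from pages with no outgoing links are assumptions on my part.

    import java.util.Arrays;

    // Sketch only: pack (source, target) links into the run-length format
    // [target, numIncoming, source IDs...] and run one PageRank iteration.
    // Assumes page IDs are dense integers 0 .. n-1.
    public final class PageRankSketch {

        // Each link is encoded as (target << 32) | source, so sorting the
        // array orders links by target ID, breaking ties by source ID.
        public static int[] packLinks(long[] links) {
            long[] sorted = links.clone();
            Arrays.sort(sorted);
            int[] packed = new int[sorted.length * 3];  // Worst case: one link per group
            int outLen = 0;
            int i = 0;
            while (i < sorted.length) {
                int target = (int)(sorted[i] >>> 32);
                int groupStart = i;
                while (i < sorted.length && (int)(sorted[i] >>> 32) == target)
                    i++;
                packed[outLen++] = target;              // Target page ID
                packed[outLen++] = i - groupStart;      // Number of incoming links
                for (int j = groupStart; j < i; j++)
                    packed[outLen++] = (int)sorted[j];  // Source page IDs
            }
            return Arrays.copyOf(packed, outLen);
        }

        // One power-iteration step. prevRanks[id] is the current probability
        // of page id; numOutLinks[id] is its number of outgoing links.
        public static double[] iterate(int[] packed, double[] prevRanks, int[] numOutLinks) {
            final double DAMPING = 0.85;  // Assumed value, commonly used
            int n = prevRanks.length;

            // Probability from pages with no outgoing links, spread uniformly
            // (one common convention) so that the ranks keep summing to 1.
            double danglingSum = 0;
            for (int id = 0; id < n; id++) {
                if (numOutLinks[id] == 0)
                    danglingSum += prevRanks[id];
            }
            double baseline = (1 - DAMPING) / n + DAMPING * danglingSum / n;

            double[] nextRanks = new double[n];
            Arrays.fill(nextRanks, baseline);
            for (int i = 0; i < packed.length; ) {
                int target = packed[i++];
                int numIncoming = packed[i++];
                double sum = 0;
                for (int j = 0; j < numIncoming; j++) {
                    int source = packed[i++];
                    sum += prevRanks[source] / numOutLinks[source];
                }
                nextRanks[target] += DAMPING * sum;
            }
            return nextRanks;
        }
    }

In terms of the steps above, packLinks() would be called once after loading the data, and iterate() would be called repeatedly, feeding each result back in as prevRanks, until the numbers stop changing appreciably.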

Notes:

Room for improvement

This program meets my needs for the problem at hand, and it handles the data and most edge cases correctly. However, there’s still room for improvement when it comes to flexibility and time/space efficiency.

Source code and instructions

Java source code:

English Wikipedia data dumps: https://dumps.wikimedia.org/enwiki/
From a snapshot of your choice, download the two files enwiki-yyyymmdd-page.sql.gz (~1 GB) and enwiki-yyyymmdd-pagelinks.sql.gz (~5 GB). Update the file names at the top of wikipediapagerank.java and compile.

To run this program on the Wikipedia data set, you must use a 64-bit JVM and have about 12 GiB of free RAM on your computer. Put the Java class files and the .sql.gz files in the same directory. Run with the command java -mx12G wikipediapagerank. If the program runs out of memory, increase the “12G” value further.

When each .sql.gz file is loaded for the first time, the data is parsed and filtered, and then the program caches the data by writing a temporary .raw file. In subsequent runs, the cached version loads much faster than the original .sql.gz file, because the data (strings and int32) is loaded directly – without needing to decompress gzip, parse the SQL text (the slowest part), or filter out unneeded data. Note that wikipedia-pagerank-page-id-title.raw will be ~0.3 GB and wikipedia-pagerank-page-links.raw will be ~1.5 GB, so make sure you have enough storage space available for these temporary files.
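
For illustration only, a cache along these lines can be written with DataOutputStream/DataInputStream; the record layout below (a count followed by title/ID pairs) is a guess made for the sake of the example, not necessarily the actual format of the .raw files:

    import java.io.*;
    import java.util.*;

    // Illustrative sketch of a binary cache for the (page title -> page ID) map.
    public final class RawCacheSketch {

        public static void write(File file, Map<String,Integer> titleToId) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file)))) {
                out.writeInt(titleToId.size());
                for (Map.Entry<String,Integer> e : titleToId.entrySet()) {
                    out.writeUTF(e.getKey());    // Page title
                    out.writeInt(e.getValue());  // Page ID
                }
            }
        }

        public static Map<String,Integer> read(File file) throws IOException {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(file)))) {
                int n = in.readInt();
                Map<String,Integer> map = new HashMap<>(n * 2);
                for (int i = 0; i < n; i++) {
                    String title = in.readUTF();
                    map.put(title, in.readInt());
                }
                return map;
            }
        }
    }

The page-link cache could be handled the same way, writing the packed int values with writeInt()/readInt() instead of writeUTF().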

More info