hunter paulson

how to build the fastest wikipedia race solver on the internet
blog art projects github linkedin

how to build the fastest wikipedia race solver on the internet

imagine you are reading the Wikipedia page for Matcha, and your trying to figure out how society got from this to Labubu… but you can only navigate using links on the current page. no search bar. no back button. no CTRL+F. just blue links.

Screenshot of the Matcha Wikipedia page Screenshot of the Labubu Wikipedia page

How long and how many clicks do you think it would take?

I wanted to answer both questions so I built a solver that can find the shortest path between any two english Wikipedia pages in milliseconds.

wikipedia is a graph

Every1 wikipedia page has links to other wikipedia pages and by clicking on one of those links we end up on that page.

Because of this Wikipedia can be thought of as a graph where each page is a node and each link is an edge2.

Dune (novel)
David Lynch
Frank Herbert
Star Wars
Zendaya
Denis Villeneuve
Ecology
Hugo Award
Science fiction

Eight of ’Dune (novel)’s 723 outgoing links

We traverse the graph by clicking on a link to another page. If we click on the right sequence of links we can get from one page to almost any other page. - footnote: wikipedia is not a fully connected graph so not every page is reachable from every other page.

Sushi
Matcha
Green tea
Starbucks
Matcha Latte
TikTok
Clairo
Performative Male
Generation Z
Labubu

You can get from Matcha to Labubu in 3 clicks

There is actually a semi-official game called Wikiracing where “players compete to navigate from one Wikipedia page to another (chosen manually or randomly) using only the internal links of the page”.

There are two variants of Wikiracing. One tries to reach the goal in the fewest number of clicks and the the other tries to get there in the shortest amount of time. So I challenged myself to build the fastest Wikipedia solver on the internet3 at both…

You can play with it here.

For the rest of this blog I am going to try my best to walk through how it works in a way show shows how you could have arrived at building it too.

brute force

In computer science class I was taught to always think about the brute force solution to your problem first. computers are pretty fast so this is usually a great way to get a feel for its complexity.

We havent even defined our problem yet so lets do that first:

Given a pair of pages (S, T). Get from the start page (S) to the target page (T) using only links on the current page at each step.

So imagine we start out on a random page that we know is part of a larger graph. Our goal is to find a specific page but we have no idea where it is on the graph. The only information we have is the pages that our current page links to, but unfortunately none of them our our target page. So somehow we have to use those links to find our target page.

What should we do?

idk, ¯\_(ツ)_/¯, lets try brute force… imagine we had infinite time and infinite browser tabs. what could we do?

… we could click every link on the start page.

And if none of them linked to the target page we could click every link4 on every one of those pages.

If none of them linked to the target page we could do it again…

and then again…

and again…

until we found the target page.

S
T
Begin knowing the start and target page.
S
T
1
1
1
1
1
1
1
1

Get all the pages that the start page links to.
S
T
2
2
2
2
2
2
2
2
2
2
2

Get all the pages that the start page can reach in TWO links.
S
T
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3

Get all the pages that the start page can reach in THREE links.
S
T
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4

Get all the pages that the start page can reach in FOUR links.
S
4
1
1
1
2
2
3
3
3

Drop every page that is not on a shortest path.
S
4
1
1
1
3
3
3
2
2

It takes at least 4 clicks to reach the target page.

We just performed what is called breadth-first search.

It was a lot of work, but some how we found the our way to the target page.

But notice, we didn’t just find any way from the start to the target, we actually all shortest paths between the two pages.

We should take a second to appreciate this. By expanding our web of pages one ring at a time we actually are guaranteed to find all shortest paths. If a shorter path was possible we would have found it one ring earlier.

As you can see we had to check a lot of pages that were not part of any shortest paths. This wasted computation is unavoidable since we have no way of knowing ex ante which pages will be on the shortest paths.

can we do better?

If our initial approach is already guarantted to find all shortest paths as soon as possible, how can we make it more efficient and therefore faster?

The issue with our approach is that the amount of work we have to do for each successive ring grows exponentially. Similar to how the area of a circle grows proportionally to the square of its radius, but much worse. The amount of work we have to do on the start page approximately the same as the amount of work we have do on every page that it links to in the first ring. Roughly speaking it takes more work to explore the next ring than it took to explore all previous rings combined.

So to minimize total work we need to minimize the number of rings we have to explore…

But aren’t we already exploring the minimum number of rings?

Yes… if we explore only from the start page.

Remember, initially we know about one other page: the target page.

But, careful, we cant naively apply this same process to the target page because the links on the target page link away from the target page. We need the the pages that link towards the target page.

These are called the target page’s ‘backlinks’. For example, if there is a link from page A to page B, then we can also think of it as a backlink from page B to page A.

So if we have a way to explore all the ‘backlinks’ into a page then we can run this process in reverse as well starting from the target page. I will explain how we build this later but for now lets see it in action.

the algorithm: bidirectional breadth-first search

I like to visualize this process kinda as two expanding solar systems centered on the start and target page that grow until they overlap. Every step of the algorithm we add an new further out orbit to one of the solar systems. As soon as the two outermost orbits intersect we know that we have found the shortest path(s).

Our previous approach was called breadth-first search, so it won’t surprise you that this new approach is called bidirectional breadth-first search.

S
T
Begin with the start and target page.
S
T

Get all the pages that link to the target page.
S
T

Get all the pages that the start page links to.
S
T

Get all the pages that can reach the target page in 2 links.
S
T

Get all the pages that the start page can reach in 2 links.
S
4
1
1
1
3
3
3
2
2

Find all the connections to the nodes where they intersect.
S
4
1
1
1
3
3
3
2
2

We have found the shortest paths from start to target!

so its just brute force in both directions?

Pretty much, yes. The additional complexity is that we now have to decide which direction to expand at each step.

Here are the steps to the algorithm:

  1. Initially we only know about the start and target page and we have no idea how far apart they are.

  2. We begin by checking either all the outgoing links for the start page or all the incoming links for the target page. In order to guarantee we find the shortest path we have no choice but to check every outgoing and incoming link for every page we encounter from here on out.

  3. Since we got all the incoming links to the target page in the previous step we now need to get all the outgoing links from the start page. Both of our solar systems have their first orbital ring. We call the outer ring of pages that are all exactly K links away from the center node the ‘frontier’.

  4. Now we need to decide which frontier to explore next. Our goal is to minimize the number of pages we have to check so we explore the frontier with the fewest pages5. Since our target frontier has 6 pages and our start frontier has 8 we explore the target frontier. Notice that this is exactly what we just did in step 3 when start frontier had one page.

  5. Repeat the previous step until our frontiers intersect. We iteratively continue this process until we encounter a page that is part of the other solar system.

  6. Once our frontiers intersect at at least one page we know that we have found the shortest path between the start and target page.

pretty cool right?

I first learned about this approach from Six Degrees of Wikipedia. They were an inspiration and baseline for this work. One of the reasons I built this was to measure if the six degrees of separation hypothesis is empirically true.

Unfortunately it turns out that, at least for Wikipedia, it is empirically false. The diameter of english Wikipedia is 416.

making it fast

Now that we intuitively know how the algorithm works we have to teach a computer how to do it since they are much much faster than me clicking links.

representing a graph on a computer

Computers operate on sequences of bits. So how do we turn circles with arrows between them into a continuous sequence of 1s and 0s?

Lets start by looking at this from the perspective of a single node on the graph and its links. We can then apply the same logic to any graph. Here is a toy graph of a Bittern and its neighbors.

Bittern
Avocet
Curlew
Finch
Egret
Heron
Ibis
Lark

Bittern has a link to Egret, Heron, Lark, and Ibis. While Eagle, Finch, and Raven each link to Bittern.

There are two ways to represent a graph like this on a computer.

1. an adjacency list

We can start by just listing the pages that each page links to.

page     links to
───────  ───────────────────────────
Avocet   Bittern
Bittern  Egret, Heron, Ibis, Lark
Curlew   Bittern
Finch    Bittern
Bittern graph as an adjacency list

Notice that links are directional so we only have a row for each page that has an outgoing link.

2. an adjacency matrix

You may have noticed that the above table kinda looks like a matrix. So lets convert it to a matrix where the rows are the source pages and the columns are the destination pages. We will write ‘link’ if there is a link from source to destination.

from \ to  Bittern  Egret   Heron   Ibis    Lark
─────────  ───────  ──────  ──────  ──────  ──────
Avocet     link     .       .       .       .
Bittern    .        link    link    link    link
Curlew     link     .       .       .       .
Finch      link     .       .       .       .

If we replace the ‘link’ with a 1 and the ‘.’ with a 0 we get the following matrix:

from \ to  Bittern  Egret   Heron   Ibis    Lark
─────────  ───────  ──────  ──────  ──────  ──────
Avocet     1        0       0       0       0
Bittern    0        1       1       1       1
Curlew     1        0       0       0       0
Finch      1        0       0       0       0
Bittern graph as an adjacency matrix of outgoing links

FUN FACT: The matrix of incoming links is just the transpose of the matrix of outgoing links.

turning wikipedia into a sequence of numbers

Computers love to store matricies. Since you can just append each row to the previous and store them as a continuous sequence without any padding.

Our matrix:

1  0  0  0  0
0  1  1  1  1
1  0  0  0  0
1  0  0  0  0
becomes
[0  1  1  1  1][1  0  0  0  0][1  0  0  0  0][1  0  0  0  0]

So can we just store wikipedia this way?

Not quite.

Every page in our graph is a row and column of our adjacency matrix. There are >7 Million wikipedia pages so our matrix would take 7M * 7M = 49 Trillion bits (5.125 Terrabytes) to store.

However you probably noticed that because every page does not link to every other page there is alot of empty space (zeros) in this matrix. Matrices like this, with lots of zeros, are called sparse matrices.

We can actually represent the same data with a lot less bytes by using a compressed sparse row (CSR).

Bittern
Egret
Heron
Ibis
Lark

Bittern has outgoing links to [Egret, Heron, Ibis, Lark]
out_offsets
page       ...   Avocet  Bittern Curlew   ...
        ┌───────┬───────┬───────┬───────┬───────┐
offset  │  ...  │   0   │   2   │   6   │  ...  │
        └───────┴───────┼───────┼───────┴───────┘
                        │       │
                        │       │
                        │       └───────────────────────┐
out_neighbors           │                               │
index    0       1      │2       3       4       5      │6       7
        ┌───────┬───────┼───────┬───────┬───────┬───────┼───────┬───────┐
title   │Bittern│ Duck  │ Egret │ Heron │ Ibis  │ Lark  │Bittern│ Wren  │
        └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
                        └─── Bittern links to [2..6) ───┘
How we store that Bittern links to [Egret, Heron, Ibis, Lark]

Each page’s ID tells us where it is in the sequence of out_offsets. Then we use that to find the pages it links to in the continuous sequence of out_neighbors.

This saves us a lot of space. The tradeoff is that we have to store both the out and in rows. Because we can no longer transpose our metrix to get the backlinks.

Bittern
Avocet
Curlew
Finch

[Avocet, Curlew, Finch] have a link to Bittern
in_offsets
page       ...   Avocet  Bittern Curlew    ...
        ┌───────┬───────┬───────┬───────┬───────┐
offset  │  ...  │   0   │   2   │   5   │  ...  │
        └───────┴───────┼───────┼───────┴───────┘
                        │       │
                        │       │
                        │       └───────────────┐
in_neighbors            │                       │
index    0       1      │2       3       4      │5       6
        ┌───────┬───────┼───────┬───────┬───────┼───────┬───────┐
title   │ Crane │ Crow  │Avocet │Curlew │ Finch │ Swan  │ Tern  │
        └───────┴───────┴───────┴───────┴───────┴───────┴───────┘
                        └[2..5) links to Bittern┘
How we store that [Avocet, Curlew, Finch] have a link to Bittern

We store all of these continuously in memory as one continuous string of bytes.

In total our entire graph takes up 4.1 Billion bytes or ~4.1 GB. Making it small enough to fit entirely into RAM on most modern machines. This extremely important because >15x faster to read from RAM than from an SSD.

preprocessing the graph

coming soon…