Wednesday, April 17, 2013

Graphing the Links between Articles on Wikipedia

One of my summer projects last year was to graph the links between Wikipedia articles. I was inspired by the Wiki Wars concept and wanted to see if I could programmatically beat the game. I used C# and the Mono .NET runtime to build a Wikipedia pathfinder.

Here are a couple of graphical representations of the results. They make great wallpapers!

My Current Wallpaper
A Zoomed Out Link Graph


Obtaining the Data

My first task was to download the Wikipedia dataset in its' entirety. I chose the XML format which compressed is approximately 10GB. The uncompressed file is nearly 40GB. I naively attempted to load the dataset using the .NET runtime XML parsers. Unfortunately they attempt to inflate the data before it can be queried. My server has 4GB of RAM and I quickly learned that this simply wouldn't work.

I decided to work with a subset of the data to ease the load on my server. This worked out well.

Assembling a Graph

The links on Wikipedia can be modeled using the abstract graph data type.

I have a type that consists of the page name and a list of all other pages it links to. To determine a path, I recursively search for the destination page. Once I find a path to the destination I record the hops necessary to get there. I try to find an even shorter path until I can no longer find a path that has not already been found.

Graphing the Results!

Once I realized that I could definitely beat this game with some simple programming, I quickly lost interest in the project. I decided to try graphing the output. Here is one of the first graphs that I created. This one originates at Dinosaur and performs 6 hops. There is lots of overlapping text, but it looks good!

An early prototype!
I was motivated to press on. I decided to have all links originate from the center of the viewport. I distribute outbound links evenly around the source article. As the number of outbound links increases, the distance between the source and destination article must increase to avoid overlapping text. Here is a screenshot of this idea in its' infancy.

Linux is the Source
In this screenshot you can see the idea coming together. Linux is at the center of the screen and outbound links are distributed in a circle around it. Unfortunately there are still some bugs at this point and things don't look quite right yet. I tweak the calculations involved in the locations of destination articles to yield a better result.

Starting to look Good
You can see the arrangement is starting to come together but there are still some rather large gaps between articles. It is really quite interesting to see what articles are linked to one another without the rest of the text "in the way". Apple Inc. was chosen as the test case for the following graph.

Apple Inc.
All of these graphs are generated as SVG files which render nicely using Inkscape providing that they aren't too large. Here is a screenshot of Inkscape showing the chaos involved in one of these images. You can see how a huge amount of outbound links actually exceed the bounds of the viewport.

Inkscape, most bugs are fixed at this point
Releasing the SVGs

If anyone wishes to see the original SVG files, I will be happy to share them. They are rather large so I won't bother hosting them until I have some serious interest.

Releasing the Code

I have decided not to release this code. There is nothing proprietary about what I have done and I have given you more than enough information to create your own implementation. This was a weekend experiment and as such is riddled with dependencies on my server and specific files that existed on my hard drive at the time. It is untested and is definitely not production code.

This became more of an art project than anything and the results are fantastic. If you decide to implement your own, I would love to see it!

No comments :

Post a Comment

Note: Only a member of this blog may post a comment.