Applying the Algorithm to Real Data
Let's use our Python implementation of the PageRank algorithm to some larger-scale data. We will use a dataset shared by J. Kleinberg at Cornell by crawling the web to find web pages containing the word California
. It is a text file in the following form:
Type Source Destination n 0 http://www.berkeley.edu/ n 1 http://www.caltech.edu/ … n 9663 http://www.cs.ucl.ac.uk/external/P.Dourish/hotlist.html e 0 449 e 0 450 … e 9663 7907
The first part contains 9,663 web pages that have the word California
, and the rest is an adjacency list for the graph representing the "internet" of these 9,663 web pages. For example, take the following line:
e 0 499
This means web page 0
has a link to web page 499
. In order to implement PageRank on this dataset, we need to create an adjacency matrix.
Let's use some Python code to read this data file into a pandas
DataFrame and display it:
# import the pandas library import...