Implementing the PageRank algorithm in Python
In this section, we take what we learned about the PageRank algorithm in the previous sections and use it to write an effective Python implementation.
As we saw previously, the PageRank algorithm repeatedly applies the same update to the PageRank vector until it settles into a steady-state vector. So far, though, we simply ran the update 15 times, looked at the numbers, and stopped once the changes became too small to matter.
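As a reminder of what that manual approach looks like, here is a minimal sketch. The 3-page link graph, the matrix name `M`, and the starting vector are assumptions made for illustration, not the example from the previous sections, and the update is assumed to be a plain matrix-vector product.

```python
import numpy as np

# Hypothetical 3-page link graph (an assumption for illustration).
# M[i, j] is the probability of moving from page j to page i,
# so every column sums to 1.
M = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

# Initial PageRank vector, typed in by hand.
r = np.array([1 / 3, 1 / 3, 1 / 3])

# Apply the same update a fixed 15 times and keep every intermediate vector.
history = [r]
for _ in range(15):
    r = M @ r
    history.append(r)

# Inspect the last few vectors by eye, as we did before.
for step, vec in enumerate(history[-3:], start=len(history) - 3):
    print(step, vec)

# Total absolute change made by the final update.
last_change = np.abs(history[-1] - history[-2]).sum()
print("change in last update:", last_change)
```

With only three pages it is easy to read the printed vectors and judge that they have settled. The fixed count of 15 and the visual check are exactly the parts that do not scale.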
However, there are a few obstacles to implementing this on a real, large-scale problem:
- If the "internet" of web pages is large, as the real internet is, we cannot inspect millions or billions of PageRanks after each update to see when they have stopped changing.
- We cannot know in advance how many iterations we need to run for the PageRanks to converge to a steady state.
- We manually defined the initial state of the PageRank vector...
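One common way around these obstacles can be sketched as follows. This is an illustration under stated assumptions, not necessarily the implementation this section builds: the function name `pagerank`, the parameters `tol` and `max_iter`, the uniform starting vector, and the 3-page matrix `M` are all hypothetical. The idea is to let the code measure the size of each update with a single number and stop when that number drops below a tolerance, with a cap on the number of iterations as a safety net.

```python
import numpy as np


def pagerank(M, tol=1e-8, max_iter=1000):
    """Iterate r <- M @ r until the update is smaller than tol.

    M is assumed to be a column-stochastic transition matrix.
    Returns the final vector and the number of iterations used.
    """
    n = M.shape[0]
    # Build the starting vector from the matrix size (uniform, by assumption)
    # instead of typing it in by hand.
    r = np.full(n, 1.0 / n)
    for iteration in range(1, max_iter + 1):
        r_next = M @ r
        # One number summarises the whole update: the total absolute change.
        if np.abs(r_next - r).sum() < tol:
            return r_next, iteration
        r = r_next
    # Tolerance not reached: return the last vector and the cap.
    return r, max_iter


# Hypothetical 3-page link graph; each column sums to 1.
M = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

ranks, iterations = pagerank(M)
print(ranks, iterations)
```

The stopping rule replaces the visual check, the loop decides for itself how many iterations are needed, and the starting vector no longer depends on knowing the number of pages ahead of time.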