Building our Wikipedia crawler - take two
Our code runs as expected, refactored and neatly packed into a module. However, there's one more thing I'd like us to refactor before moving on. I'm not especially fond of our extractlinks function.
First of all, it naively iterates over all the HTML elements. Say that we also want to extract the title of the page: every time we want to process something that's not a link, we'll have to iterate over the whole document again. That's going to be resource-hungry and slow to run.
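To make the cost concrete, here is a minimal sketch of the problem. It does not use our real parser: the Node type, the collect_tag function, and the tiny document tree are all hypothetical stand-ins, made up purely to count how many elements get visited when we run one full traversal per kind of element.

```julia
# Toy stand-in for a parsed HTML element (hypothetical; not the parser's real types).
struct Node
    tag::String
    children::Vector{Node}
end
Node(tag::String) = Node(tag, Node[])

# Running total of nodes visited by all traversals below.
const visits = Ref(0)

# Naive extraction: walk the whole tree and collect every node whose tag matches.
function collect_tag(node::Node, tag::String, found::Vector{Node} = Node[])
    visits[] += 1
    node.tag == tag && push!(found, node)
    for child in node.children
        collect_tag(child, tag, found)
    end
    return found
end

# A tiny document with eight elements in total.
doc = Node("html", [
    Node("head", [Node("title")]),
    Node("body", [Node("p", [Node("a"), Node("a")]), Node("a")]),
])

links  = collect_tag(doc, "a")      # first full walk over the tree
titles = collect_tag(doc, "title")  # second full walk over the same tree

println(length(links), " links, ", length(titles), " title, ", visits[], " node visits")
```

The tree has eight elements, and asking for two kinds of element visits sixteen nodes. Each extra query of this sort adds another complete pass over the document.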
Secondly, we're reinventing the wheel. In Chapter 3, Setting Up the Wiki Game, we said that CSS selectors are the lingua franca of DOM parsing. We'd benefit massively from using the concise syntax of CSS selectors with the underlying optimizations provided by specialized libraries.
Fortunately, we don't need to look too far for this kind of functionality. Julia's Pkg system provides access to Cascadia, a native CSS selector library. And, the great thing about it is...