Building a deep web scanner
Sometimes you need to scan a website, but go one level deeper. For example, you want to build a web tree diagram of a website. This can be accomplished by looking for all `<a>` tags and following their `href` attributes to the next web page. Once you have acquired the child pages, you can continue scanning in order to complete the tree.
How to do it...
- A core component of a deep web scanner is a basic `Hoover` class, as described previously. The basic procedure presented in this recipe is to scan the target website and hoover up all the `href` attributes. For this purpose, we define an `Application\Web\Deep` class. We add a property that represents the DNS domain:

```php
namespace Application\Web;

class Deep
{
    protected $domain;
```

- Next, we define a method that will hoover the tags for each website represented in the scan list. In order to prevent the scanner from trawling the entire World Wide Web (WWW), we've limited the scan to the target domain. `yield from` is used because we need to yield the entire array produced by `Hoover::getTags()`; the `yield from` syntax allows us to treat the array as a sub-generator (see the sketch after this list):

```php
public function scan($url, $tag)
{
    $vac  = new Hoover();
    $scan = $vac->getAttribute($url, 'href',
        $this->getDomain($url));
    foreach ($scan as $subSite) {
        yield from $vac->getTags($subSite, $tag);
    }
    return count($scan);
}
```
Note

The use of `yield from` turns the `scan()` method into a PHP 7 delegating generator. Normally, you would be inclined to store the results of the scan in an array. The problem, in this case, is that the amount of information retrieved could be massive. Thus, it's better to yield the results immediately, both to conserve memory and to produce immediate results. Otherwise, there would be a lengthy wait, probably followed by an out-of-memory error.

- In order to stay within the same domain, we need a method that will return the domain portion of the URL. We use the convenient `parse_url()` function for this purpose (illustrated after this list):

```php
public function getDomain($url)
{
    if (!$this->domain) {
        $this->domain = parse_url($url, PHP_URL_HOST);
    }
    return $this->domain;
}
```
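To make the delegation concrete, here is a minimal, runnable sketch of how `yield from` treats an array as a sub-generator. The `fetchTags()` function below is a hypothetical stand-in for `Hoover::getTags()`, which is assumed to return an array:

```php
<?php
// Hypothetical stand-in for Hoover::getTags(): returns an array of results
function fetchTags($url)
{
    return [$url . '/image1.png', $url . '/image2.jpg'];
}

// Delegating generator: yield from flattens each array returned by
// fetchTags() into the outer generator, one element at a time
function scanAll(array $urls)
{
    foreach ($urls as $url) {
        yield from fetchTags($url);
    }
}

foreach (scanAll(['http://site1', 'http://site2']) as $item) {
    echo $item, PHP_EOL;
}
// Prints the four items in order without ever building one big array
```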
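One detail worth noting about `getDomain()`: `parse_url()` only recognizes the host when the URL carries a scheme; without one, the entire string is parsed as a path and `PHP_URL_HOST` yields `NULL`. A quick demonstration:

```php
<?php
var_dump(parse_url('http://unlikelysource.com/about', PHP_URL_HOST));
// string(18) "unlikelysource.com"
var_dump(parse_url('unlikelysource.com/about', PHP_URL_HOST));
// NULL: no scheme, so the whole string is treated as a path
```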
How it works...
First of all, go ahead and define the `Application\Web\Deep` class shown previously, as well as the `Application\Web\Hoover` class from the previous recipe.

Next, define a block of code in `chap_01_deep_scan_website.php` that sets up autoloading (as described earlier in this chapter). Note that `DEFAULT_URL` includes the scheme, which `getDomain()` needs in order to extract the host:

```php
<?php
// modify as needed
define('DEFAULT_URL', 'http://unlikelysource.com');
define('DEFAULT_TAG', 'img');

require __DIR__ . '/../../Application/Autoload/Loader.php';
Application\Autoload\Loader::init(__DIR__ . '/../..');
```
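The Loader class itself is defined earlier in the chapter; purely for reference, a minimal sketch of an autoloader with this `init()` signature might look like the following. This is an illustrative assumption, not the chapter's actual implementation:

```php
<?php
namespace Application\Autoload;

// Minimal sketch: maps Fully\Qualified\ClassName to
// {base dir}/Fully/Qualified/ClassName.php
class Loader
{
    protected static $baseDir;

    public static function init($baseDir)
    {
        self::$baseDir = rtrim($baseDir, '/');
        spl_autoload_register(function ($class) {
            $file = self::$baseDir . '/'
                  . str_replace('\\', '/', $class) . '.php';
            if (file_exists($file)) {
                require $file;
            }
        });
    }
}
```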
Next, get an instance of our new class:
```php
$deep = new Application\Web\Deep();
```
At this point, you can retrieve URL and tag information from URL parameters. The PHP 7 null coalescing operator is useful for establishing fallback values:

```php
$url = strip_tags($_GET['url'] ?? DEFAULT_URL);
$tag = strip_tags($_GET['tag'] ?? DEFAULT_TAG);
```
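For context, `$a ?? $b` evaluates to `$a` when it is set and not NULL, and to `$b` otherwise; before PHP 7, the same fallback required an `isset()` ternary:

```php
// Pre-PHP 7 equivalent of the first line above
$url = strip_tags(isset($_GET['url']) ? $_GET['url'] : DEFAULT_URL);
```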
Some simple HTML will display the results:

```php
foreach ($deep->scan($url, $tag) as $item) {
    $src = $item['attributes']['src'] ?? NULL;
    // stripos() returns a position, so compare against FALSE explicitly:
    // a match at offset 0 would otherwise be treated as "not found"
    if ($src && (stripos($src, 'png') !== FALSE
              || stripos($src, 'jpg') !== FALSE)) {
        printf('<br><img src="%s"/>', $src);
    }
}
```
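To try it out, one option is PHP's built-in web server: run `php -S localhost:8080` from the directory containing the script and browse to `chap_01_deep_scan_website.php`, optionally appending query parameters such as `?url=http://unlikelysource.com&tag=img` (the host and port here are illustrative).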
See also
For more information on generators and `yield from`, please see the article at http://php.net/manual/en/language.generators.syntax.php.