Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
PHP 7 Programming Cookbook

You're reading from   PHP 7 Programming Cookbook Over 80 recipes that will take your PHP 7 web development skills to the next level!

Arrow left icon
Product type Paperback
Published in Aug 2016
Publisher Packt
ISBN-13 9781785883446
Length 610 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Doug Bierer Doug Bierer
Author Profile Icon Doug Bierer
Doug Bierer
Arrow right icon
View More author details
Toc

Table of Contents (16) Chapters Close

Preface 1. Building a Foundation FREE CHAPTER 2. Using PHP 7 High Performance Features 3. Working with PHP Functional Programming 4. Working with PHP Object-Oriented Programming 5. Interacting with a Database 6. Building Scalable Websites 7. Accessing Web Services 8. Working with Date/Time and International Aspects 9. Developing Middleware 10. Looking at Advanced Algorithms 11. Implementing Software Design Patterns 12. Improving Web Security 13. Best Practices, Testing, and Debugging A. Defining PSR-7 Classes Index

Building a deep web scanner

Sometimes you need to scan a website, but go one level deeper. For example, you want to build a web tree diagram of a website. This can be accomplished by looking for all <A> tags and following the HREF attributes to the next web page. Once you have acquired the child pages, you can then continue scanning in order to complete the tree.

How to do it...

  1. A core component of a deep web scanner is a basic Hoover class, as described previously. The basic procedure presented in this recipe is to scan the target website and hoover up all the HREF attributes. For this purpose, we define a Application\Web\Deep class. We add a property that represents the DNS domain:
    namespace Application\Web;
    class Deep
    {
        protected $domain;
  2. Next, we define a method that will hoover the tags for each website represented in the scan list. In order to prevent the scanner from trawling the entire World Wide Web (WWW), we've limited the scan to the target domain. The reason why yield from has been added is because we need to yield the entire array produced by Hoover::getTags(). The yield from syntax allows us to treat the array as a sub-generator:
    public function scan($url, $tag)
    {
        $vac    = new Hoover();
        $scan   = $vac->getAttribute($url, 'href', 
           $this->getDomain($url));
        $result = array();
        foreach ($scan as $subSite) {
            yield from $vac->getTags($subSite, $tag);
        }
        return count($scan);
    }

    Note

    The use of yield from turns the scan() method into a PHP 7 delegating generator. Normally, you would be inclined to store the results of the scan into an array. The problem, in this case, is that the amount of information retrieved could potentially be massive. Thus, it's better to immediately yield the results in order to conserve memory and to produce immediate results. Otherwise, there would be a lengthy wait, which would probably be followed by an out of memory error.

  3. In order to keep within the same domain, we need a method that will return the domain from the URL. We use the convenient parse_url() function for this purpose:
    public function getDomain($url)
    {
        if (!$this->domain) {
            $this->domain = parse_url($url, PHP_URL_HOST);
        }
        return $this->domain;
    }

How it works...

First of all, go ahead and define the Application\Web\Deep class defined previously, as well as the Application\Web\Hoover class defined in the previous recipe.

Next, define a block of code from chap_01_deep_scan_website.php that sets up autoloading (as described earlier in this chapter):

<?php
// modify as needed
define('DEFAULT_URL', unlikelysource.com');
define('DEFAULT_TAG', 'img');

require __DIR__ . '/../../Application/Autoload/Loader.php';
Application\Autoload\Loader::init(__DIR__ . '/../..');

Next, get an instance of our new class:

$deep = new Application\Web\Deep();

At this point, you can retrieve URL and tag information from URL parameters. The PHP 7 null coalesce operator is useful for establishing fallback values:

$url = strip_tags($_GET['url'] ?? DEFAULT_URL);
$tag = strip_tags($_GET['tag'] ?? DEFAULT_TAG);

Some simple HTML will display results:

foreach ($deep->scan($url, $tag) as $item) {
    $src = $item['attributes']['src'] ?? NULL;
    if ($src && (stripos($src, 'png') || stripos($src, 'jpg'))) {
        printf('<br><img src="%s"/>', $src);
    }
}

See also

For more information on generators and yield from, please see the article at http://php.net/manual/en/language.generators.syntax.php.

You have been reading a chapter from
PHP 7 Programming Cookbook
Published in: Aug 2016
Publisher: Packt
ISBN-13: 9781785883446
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image