Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
PHP 7 Programming Cookbook

You're reading from   PHP 7 Programming Cookbook Over 80 recipes that will take your PHP 7 web development skills to the next level!

Arrow left icon
Product type Paperback
Published in Aug 2016
Publisher Packt
ISBN-13 9781785883446
Length 610 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Doug Bierer Doug Bierer
Author Profile Icon Doug Bierer
Doug Bierer
Arrow right icon
View More author details
Toc

Table of Contents (16) Chapters Close

Preface 1. Building a Foundation FREE CHAPTER 2. Using PHP 7 High Performance Features 3. Working with PHP Functional Programming 4. Working with PHP Object-Oriented Programming 5. Interacting with a Database 6. Building Scalable Websites 7. Accessing Web Services 8. Working with Date/Time and International Aspects 9. Developing Middleware 10. Looking at Advanced Algorithms 11. Implementing Software Design Patterns 12. Improving Web Security 13. Best Practices, Testing, and Debugging A. Defining PSR-7 Classes Index

Hoovering a website

Very frequently, it is of interest to scan a website and extract information from specific tags. This basic mechanism can be used to trawl the web in search of useful bits of information. At other times you need to get a list of <IMG> tags and the SRC attribute, or <A> tags and the corresponding HREF attribute. The possibilities are endless.

How to do it...

  1. First of all, we need to grab the contents of the target website. At first glance it seems that we should make a cURL request, or simply use file_get_contents(). The problem with these approaches is that we will end up having to do a massive amount of string manipulation, most likely having to make inordinate use of the dreaded regular expression. In order to avoid all of this, we'll simply take advantage of an already existing PHP 7 class DOMDocument. So we create a DOMDocument instance, setting it to UTF-8. We don't care about whitespace, and use the handy loadHTMLFile() method to load the contents of the website into the object:
    public function getContent($url)
    {
        if (!$this->content) {
            if (stripos($url, 'http') !== 0) {
                $url = 'http://' . $url;
            }
            $this->content = new DOMDocument('1.0', 'utf-8');
            $this->content->preserveWhiteSpace = FALSE;
            // @ used to suppress warnings generated from // improperly configured web pages
            @$this->content->loadHTMLFile($url);
        }
        return $this->content;
    }

    Tip

    Note that we precede the call to the loadHTMLFile() method with an @. This is not done to obscure bad coding (!) as was often the case in PHP 5! Rather, the @ suppresses notices generated when the parser encounters poorly written HTML. Presumably, we could capture the notices and log them, possibly giving our Hoover class a diagnostic capability as well.

  2. Next, we need to extract the tags which are of interest. We use the getElementsByTagName() method for this purpose. If we wish to extract all tags, we can supply * as an argument:
    public function getTags($url, $tag)
    {
        $count    = 0;
        $result   = array();
        $elements = $this->getContent($url)
                         ->getElementsByTagName($tag);
        foreach ($elements as $node) {
            $result[$count]['value'] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
            if ($node->hasAttributes()) {
                foreach ($node->attributes as $name => $attr) 
                {
                    $result[$count]['attributes'][$name] = 
                        $attr->value;
                }
            }
            $count++;
        }
        return $result;
    }
  3. It might also be of interest to extract certain attributes rather than tags. Accordingly, we define another method for this purpose. In this case, we need to parse through all tags and use getAttribute(). You'll notice that there is a parameter for the DNS domain. We've added this in order to keep the scan within the same domain (if you're building a web tree, for example):
    public function getAttribute($url, $attr, $domain = NULL)
    {
        $result   = array();
        $elements = $this->getContent($url)
                         ->getElementsByTagName('*');
        foreach ($elements as $node) {
            if ($node->hasAttribute($attr)) {
                $value = $node->getAttribute($attr);
                if ($domain) {
                    if (stripos($value, $domain) !== FALSE) {
                        $result[] = trim($value);
                    }
                } else {
                    $result[] = trim($value);
                }
            }
        }
        return $result;
    }

How it works...

In order to use the new Hoover class, initialize the autoloader (described previously) and create an instance of the Hoover class. You can then run the Hoover::getTags() method to produce an array of tags from the URL you specify as an argument.

Here is a block of code from chap_01_vacuuming_website.php that uses the Hoover class to scan the O'Reilly website for <A> tags:

<?php
// modify as needed
define('DEFAULT_URL', 'http://oreilly.com/');
define('DEFAULT_TAG', 'a');

require __DIR__ . '/../Application/Autoload/Loader.php';
Application\Autoload\Loader::init(__DIR__ . '/..');

// get "vacuum" class
$vac = new Application\Web\Hoover();

// NOTE: the PHP 7 null coalesce operator is used
$url = strip_tags($_GET['url'] ?? DEFAULT_URL);
$tag = strip_tags($_GET['tag'] ?? DEFAULT_TAG);

echo 'Dump of Tags: ' . PHP_EOL;
var_dump($vac->getTags($url, $tag));

The output will look something like this:

How it works...

See also

For more information on DOM, see the PHP reference page at http://php.net/manual/en/class.domdocument.php.

You have been reading a chapter from
PHP 7 Programming Cookbook
Published in: Aug 2016
Publisher: Packt
ISBN-13: 9781785883446
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image