The type of technology used to build a websitewill affect how we crawl it. A useful tool to check the kind of technologies a website is built with is the module detectem, which requires Python 3.5+ and Docker. If you don't already have Docker installed, follow the instructions for your operating system at https://www.docker.com/products/overview. Once Docker is installed, you can run the following commands.
docker pull scrapinghub/splash
pip install detectem
This will pull the latest Docker image from ScrapingHub and install the package via pip. It is recommended to use a Python virtual environment (https://docs.python.org/3/library/venv.html) or a Conda environment (https://conda.io/docs/using/envs.html) and to check the project's ReadMe page (https://github.com/spectresearch/detectem) for any updates or changes.
Why use environments?
Imagine if your project was developed with an earlier version of a library such as
detectem, and then, in a later version,
detectem introduced some backwards-incompatible changes that break your project. However, different projects you are working on would like to use the newer version. If your project uses the system-installed
detectem, it is eventually going to break when libraries are updated to support other projects.
Ian Bicking's
virtualenv provides a clever hack to this problem by copying the system Python executable and its dependencies into a local directory to create an isolated Python environment. This allows a project to install specific versions of Python libraries locally and independently of the wider system. You can even utilize different versions of Python in different virtual environments. Further details are available in the documentation at
https://virtualenv.pypa.io. Conda environments offer similar functionality using the Anaconda Python path.
The detectem module uses a series of requests and responses to detect technologies used by the website, based on a series of extensible modules. It uses Splash (https://github.com/scrapinghub/splash), a scriptable browser developed by ScrapingHub (https://scrapinghub.com/). To run the module, simply use the det command:
$ det http://example.webscraping.com
[('jquery', '1.11.0')]
We can see the example website uses a common JavaScript library, so its content is likely embedded in the HTML and should be relatively straightforward to scrape.
Detectem is still fairly young and aims to eventually have Python parity to Wappalyzer (https://github.com/AliasIO/Wappalyzer), a Node.js-based project supporting parsing of many different backends as well as ad networks, JavaScript libraries, and server setups. You can also run Wappalyzer via Docker. To first download the Docker image, run:
$ docker pull wappalyzer/cli
Then, you can run the script from the Docker instance:
$ docker run wappalyzer/cli http://example.webscraping.com
The output is a bit hard to read, but if we copy and paste it into a JSON linter, we can see the many different libraries and technologies detected:
{'applications':
[{'categories': ['Javascript Frameworks'],
'confidence': '100',
'icon': 'Modernizr.png',
'name': 'Modernizr',
'version': ''},
{'categories': ['Web Servers'],
'confidence': '100',
'icon': 'Nginx.svg',
'name': 'Nginx',
'version': ''},
{'categories': ['Web Frameworks'],
'confidence': '100',
'icon': 'Twitter Bootstrap.png',
'name': 'Twitter Bootstrap',
'version': ''},
{'categories': ['Web Frameworks'],
'confidence': '100',
'icon': 'Web2py.png',
'name': 'Web2py',
'version': ''},
{'categories': ['Javascript Frameworks'],
'confidence': '100',
'icon': 'jQuery.svg',
'name': 'jQuery',
'version': ''},
{'categories': ['Javascript Frameworks'],
'confidence': '100',
'icon': 'jQuery UI.svg',
'name': 'jQuery UI',
'version': '1.10.3'},
{'categories': ['Programming Languages'],
'confidence': '100',
'icon': 'Python.png',
'name': 'Python',
'version': ''}],
'originalUrl': 'http://example.webscraping.com',
'url': 'http://example.webscraping.com'}
Here, we can see that Python and the web2py frameworks were detected with very high confidence. We can also see that the frontend CSS framework Twitter Bootstrap is used. Wappalyzer also detected Modernizer.js and the use of Nginx as the backend server. Because the site is only using JQuery and Modernizer, it is unlikely the entire page is loaded by JavaScript. If the website was instead built with AngularJS or React, then its content would likely be loaded dynamically. Or, if the website used ASP.NET, it would be necessary to use sessions and form submissions to crawl web pages. Working with these more difficult cases will be covered later in Chapter 5, Dynamic Content and Chapter 6, Interacting with Forms.