Spidering websites
Many tools can map out websites, but you are often limited in the style of output or the location in which the results are delivered. This base template for a spidering script lets you map out websites in short order and alter it as you please.
Getting ready
In order for this script to work, you'll need the BeautifulSoup library, which you can install either through apt with apt-get install python-bs4 or with pip install beautifulsoup4. It's as easy as that.
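If you want to confirm that the library is available to your interpreter, a quick one-line import check is enough (the message text here is just an example, not output from any tool):

$ python -c "from bs4 import BeautifulSoup; print 'BeautifulSoup imported OK'"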
How to do it…
This is the script that we will be using:
import urllib2
from bs4 import BeautifulSoup
import sys

urls = []
urls2 = []

tarurl = sys.argv[1]
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
    newline = line.get('href')
    try:
        if newline[:4] == "http":
            if tarurl in newline:
                urls.append(str(newline))
        elif newline[:1] == "/":
            combline = tarurl+newline
            urls.append(str(combline))
    except:
        pass

for uurl in urls:
    url = urllib2.urlopen(uurl).read()
    soup = BeautifulSoup(url)
    for line in soup.find_all('a'):
        newline = line.get('href')
        try:
            if newline[:4] == "http":
                if tarurl in newline:
                    urls2.append(str(newline))
            elif newline[:1] == "/":
                combline = tarurl+newline
                urls2.append(str(combline))
        except:
            pass

urls3 = set(urls2)
for value in urls3:
    print value
How it works…
We first import the necessary libraries and create two empty lists called urls and urls2. These will allow us to run through the spidering process twice. Next, we take the target URL as a command-line argument, so the script is run as follows:
$ python spider.py http://www.packtpub.com
We then open the target URL stored in tarurl, read its contents into the url variable, and pass it to the BeautifulSoup parser:
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
BeautifulSoup parses the content into its constituent parts and allows us to pull out only the parts we want:
for line in soup.find_all('a'):
    newline = line.get('href')
We then pull all of the content marked with an a (anchor) tag in HTML and grab the href attribute within each tag. This allows us to collect all the URLs listed on the page.
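For example, given a small HTML fragment such as the following (a hypothetical snippet, not taken from any real site), the loop yields the raw href values, relative and absolute alike:

from bs4 import BeautifulSoup

# Hypothetical fragment to illustrate what the loop sees
html = '<a href="/books">Books</a> <a href="http://example.org/about">About</a>'
soup = BeautifulSoup(html)
for line in soup.find_all('a'):
    print line.get('href')
# Prints:
# /books
# http://example.org/about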
The next section handles relative and absolute links. If a link is relative, it starts with a slash, indicating a page hosted locally on the web server. If a link is absolute, it contains the full address, including the domain. With the following code, we ensure that, as external users, we can open every link we find, and we record each one as an absolute link:
if newline[:4] == "http":
    if tarurl in newline:
        urls.append(str(newline))
elif newline[:1] == "/":
    combline = tarurl+newline
    urls.append(str(combline))
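As a possible refinement, not part of the original recipe, the standard library's urlparse.urljoin can resolve relative links against the base URL instead of manual string concatenation:

from urlparse import urljoin   # urllib.parse.urljoin in Python 3

# Sketch of an alternative to manual concatenation; tarurl and newline
# stand in for the variables used in the main script
tarurl = "http://www.packtpub.com"
newline = "/books"
print urljoin(tarurl, newline)   # http://www.packtpub.com/books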
We then repeat the process once more, iterating through each element of the urls list we gathered from the first page:
for uurl in urls:
Other than a change in the referenced lists and variables, the code remains the same.
Finally, for ease of output, we take the second-level urls2 list and turn it into a set. This removes duplicates from the list and allows us to output it neatly. We iterate through the values in the set and output them one by one.
There's more…
This tool can be tied in with any of the functionality shown earlier and later in this book. It can be tied to the Getting Screenshots of a website with QtWebKit recipe to allow you to take screenshots of every page. You can tie it to the email address finder in Chapter 2, Enumeration, to gather email addresses from every page, or you can find another use for this simple technique to map web pages.
The script can easily be changed to add levels of depth, going from the current two links deep to any value set by a command-line argument. The output can be changed to include the URLs present on each page, or turned into a CSV so that you can map vulnerabilities to pages for easy notation.
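A minimal sketch of what a depth-driven variant might look like is shown below; the second command-line argument, the crawl function name, and the recursive structure are illustrative choices of this sketch rather than part of the original script:

import sys
import urllib2
from bs4 import BeautifulSoup

tarurl = sys.argv[1]
maxdepth = int(sys.argv[2])   # e.g. python spider.py http://www.packtpub.com 3
found = set()

def crawl(url, depth):
    # Stop once we pass the requested depth or revisit a page
    if depth > maxdepth or url in found:
        return
    found.add(url)
    try:
        soup = BeautifulSoup(urllib2.urlopen(url).read())
    except Exception:
        return
    for line in soup.find_all('a'):
        newline = line.get('href')
        if not newline:
            continue
        if newline[:4] == "http" and tarurl in newline:
            crawl(str(newline), depth + 1)
        elif newline[:1] == "/":
            crawl(tarurl + str(newline), depth + 1)

crawl(tarurl, 1)
for value in found:
    print value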