Downloading a web page as plain text
Web pages are HTML documents made up of HTML tags along with other elements, such as JavaScript and CSS. Of these, the HTML tags define the content of the page, which we can parse to look for specific content, and this is something Bash scripting can help us with. When we download a web page, we receive an HTML file, which has to be opened in a web browser to see the formatted page.
In most circumstances, parsing a text document is easier than parsing HTML data because we aren't required to strip off the HTML tags. Lynx is a command-line web browser that can fetch a web page as plain text. Let us see how to do it.
How to do it...
Let's download the web page view, in ASCII character representation, to a text file by using the -dump flag with the lynx command:
$ lynx URL -dump > webpage_as_text.txt
This command will also list all the hyperlinks (<a href="link">) separately under a heading References...
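Because the link list sits under its own References heading, the dump file can be split into page text and URLs without any HTML parsing. The following is a minimal sketch, not part of the original recipe: the function names dump_body and dump_links are hypothetical, and it assumes that the heading appears as a line containing only the word References and that each link entry has the form "N. URL". If the page text itself contains such a line, the split will happen too early.

```shell
# Sketch: split a file produced by "lynx -dump" into page text and links.
# Assumption: the link list starts at a line that reads exactly "References".

# Print the page text, that is, everything before the References heading.
dump_body() {
    awk '/^References$/ { exit } { print }' "$1"
}

# Print only the URLs from the numbered entries after the References heading.
dump_links() {
    awk '/^References$/ { found = 1; next }
         found && $1 ~ /^[0-9]+\.$/ { print $2 }' "$1"
}

# Usage (needs lynx and network access; URL stands for a real address):
#   lynx -dump URL > webpage_as_text.txt
#   dump_body webpage_as_text.txt > body.txt
#   dump_links webpage_as_text.txt > links.txt
```

Matching only entries whose first field is a number followed by a dot means that any subheadings lynx prints inside the link list are skipped rather than mistaken for URLs.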