Downloading raw text or a binary file is a good starting point, but the main language of the web is HTML.
HTML is a structured language that defines the different parts of a document, such as headers and paragraphs. It is also hierarchical: elements can contain sub-elements. Parsing raw text into a structured document is what makes it possible to extract information from a web page automatically. For example, some text may be relevant only if it is enclosed in a div with a particular class, or if it follows an h3 header tag.
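As a minimal sketch of the idea, the following uses only `html.parser` from the Python standard library to pull out the text enclosed in a div of a given class. The sample page, the `summary` class name, and the `ClassDivExtractor` and `text_in_div` names are all invented for illustration; a dedicated parsing library would likely be more convenient on real pages.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <body>
    <h3>Latest news</h3>
    <p>First paragraph after the header.</p>
    <div class="summary">Text inside the summary div.</div>
    <div class="footer">Unrelated footer text.</div>
  </body>
</html>
"""


class ClassDivExtractor(HTMLParser):
    """Collect the text found inside <div> elements with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # greater than 0 while inside a matching div
        self.chunks = []  # pieces of text collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A div nested inside a matching div: track it so that the
            # closing tags are balanced correctly.
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a matching div.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def text_in_div(html, target_class):
    """Return the text enclosed in divs whose class list has target_class."""
    parser = ClassDivExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(text_in_div(SAMPLE_HTML, "summary"))
```

The parser walks the document as a stream of start tags, end tags, and text, so the hierarchy has to be tracked by hand with the `depth` counter. That bookkeeping is the part a higher-level HTML library typically handles for you.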