Encoding in Beautiful Soup
As already explained, every HTML/XML document is written in a specific character encoding, for example, UTF-8 or Latin-1. In an HTML page, the encoding is declared using the meta tag, as shown in the following example:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Beautiful Soup uses the UnicodeDammit library to automatically detect the encoding of the document. It then converts the content to Unicode while creating the soup object from the document.
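The detection step can also be tried on its own. The following is a minimal sketch, assuming Beautiful Soup 4 is installed as the bs4 package; the sample sentence and the choice of Latin-1 as the source encoding are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

# Illustrative input (an assumption): a short fragment encoded as Latin-1 bytes.
raw_bytes = "<p>La letra eñe</p>".encode("latin-1")

# UnicodeDammit accepts a list of encodings to try first.
dammit = UnicodeDammit(raw_bytes, ["latin-1"])
detected = dammit.original_encoding   # the encoding that was used to decode
decoded = dammit.unicode_markup       # the markup as a Unicode string
print(detected)
print(decoded)

# BeautifulSoup runs the same conversion when it builds the soup object.
# from_encoding tells it which encoding to try first.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="latin-1")
text = soup.p.string
print(soup.original_encoding)
print(text)
```

Without the list of encodings (or from_encoding), Beautiful Soup has to guess, and for short documents the guess may not match the real encoding, which is why a hint is supplied here.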
Note
Unicode is a character set, which is a list of characters with unique numbers (code points). For example, in the Unicode character set, the code point for the character B is U+0042, that is, 42 in hexadecimal or 66 in decimal. UTF-8 encoding is an algorithm that is used to convert these numbers into a binary representation.
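The difference between a code point and its encoded bytes can be checked with plain Python 3, without Beautiful Soup. This is only a small sketch; the variable names are chosen for illustration.

```python
# A code point is the number Unicode assigns to a character.
letter = "B"
code_point = ord(letter)              # 66 in decimal
hex_form = f"U+{code_point:04X}"      # the same number written as U+0042

# An encoding turns code points into bytes.
# For ASCII characters, UTF-8 uses a single byte.
letter_utf8 = letter.encode("utf-8")  # b'B'

# The letter eñe has code point U+00F1 and is encoded differently
# by UTF-8 (two bytes) and Latin-1 (one byte).
enye = "ñ"
enye_code_point = ord(enye)           # 241 in decimal
enye_utf8 = enye.encode("utf-8")      # b'\xc3\xb1'
enye_latin1 = enye.encode("latin-1")  # b'\xf1'

print(hex_form, letter_utf8, enye_utf8, enye_latin1)
```

The same character therefore produces different byte sequences under different encodings, which is why a parser has to know or detect the encoding before it can recover the original text.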
In the following example, Beautiful Soup converts the document to Unicode:
html_markup = """<p> The Spanish language is written using the Spanish alphabet, which is the Latin alphabet with one additional letter, eñe ⟨ñ⟩...