Chapter 6. Encoding Support in Beautiful Soup
Every web page has an encoding associated with it. Websites use different encodings, such as UTF-8 and Latin-1, although UTF-8 is now the standard encoding on the web. When scraping such pages, it is important that the scraper also understands these encodings. Otherwise, characters that display correctly in the web browser will come out of the scraper as gibberish. For example, consider a sample of web content from Wikipedia in which we can see the Spanish character ñ.
If we run the same content through a scraper that has no support for the encoding used by the website, we might end up with the following content:
The Spanish language is written using the Spanish alphabet, which is the Latin alphabet with one additional letter, e単e (単), for a total of 27 letters.
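The effect can be reproduced without a scraper, using only Python's built-in string and bytes methods. The following is a minimal sketch: the helper name `decode_with_wrong_encoding` is hypothetical, and Latin-1 is chosen as the wrong encoding only for illustration. The exact gibberish depends on which encoding is wrongly assumed, so the output here differs from the sample shown above.

```python
# Hypothetical helper (not part of Beautiful Soup): encode the text the way
# a server might send it, then decode it the way a scraper with a wrong
# assumption about the encoding would.
def decode_with_wrong_encoding(text, actual_encoding, assumed_encoding):
    raw_bytes = text.encode(actual_encoding)
    # errors="replace" avoids an exception if a byte is invalid in the
    # assumed encoding; invalid bytes become the replacement character.
    return raw_bytes.decode(assumed_encoding, errors="replace")


original = "eñe (ñ)"

# The page is UTF-8, but the scraper assumes Latin-1.
garbled = decode_with_wrong_encoding(original, "utf-8", "latin-1")

# The page is UTF-8, and the scraper correctly assumes UTF-8.
restored = decode_with_wrong_encoding(original, "utf-8", "utf-8")

print(garbled)   # eÃ±e (Ã±)
print(restored)  # eñe (ñ)
```

In UTF-8, ñ is stored as two bytes, so a decoder that treats each byte as a separate character turns the single letter into two unrelated characters. Decoding with the encoding that was actually used recovers the original text.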
We see the Spanish character &...