Summary
In this chapter, we looked at several data formats and how to work with them: CSV, PDF, Excel, plain text, and HTML. HTML documents are the cornerstone of the World Wide Web and, given the amount of data they contain, HTML is clearly an important data source.
We learned about bs4 (BeautifulSoup 4), a Python library that gives us Pythonic ways to read and query HTML documents. We used bs4 to load an HTML document and explored several different ways to navigate the loaded document.
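As a brief reminder of what loading and navigating a document looks like, here is a minimal sketch. The HTML snippet, its tag contents, and the variable names are made up for illustration and are not taken from the chapter's exercises; the parser used is Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document, used only for illustration.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main">Data formats</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""

# Load the document with the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Attribute-style navigation: reach a tag by its name.
title_text = soup.title.string

# Search by tag name plus an attribute.
heading = soup.find("h1", id="main")

# Collect every matching tag.
paragraphs = soup.find_all("p")

# Filter by CSS class (note the trailing underscore in class_).
intro = soup.find("p", class_="intro")

# Read an attribute from each matching tag.
link_targets = [a["href"] for a in soup.find_all("a")]

# Walk the direct children of <body>, skipping whitespace-only text nodes.
body_children = [child.name for child in soup.body.children if child.name is not None]

print(title_text, heading.get_text(), len(paragraphs), link_targets, body_children)
```

Attribute access such as `soup.title` returns only the first matching tag, so `find_all` is the safer choice whenever more than one match is possible.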
We also looked at how we can create a pandas DataFrame from an HTML document that contains a table. Although pandas has some built-in ways to do this job, they can fail when the target table is buried inside a complex hierarchy of elements. So, the knowledge we gathered in this topic to transform an HTML table into a pandas DataFrame in a step-by-step manner is invaluable.
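The step-by-step approach can be sketched roughly as follows. The HTML, the table `id`, and the column names are hypothetical and stand in for whatever page is being processed; this is one possible way to do the conversion, not the only one.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: the table sits inside nested <div> elements.
html = """
<div class="wrapper">
  <div class="content">
    <table id="gdp">
      <tr><th>Country</th><th>Year</th><th>Value</th></tr>
      <tr><td>A</td><td>2017</td><td>1.5</td></tr>
      <tr><td>B</td><td>2017</td><td>2.25</td></tr>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the target table, however deeply it is nested.
table = soup.find("table", id="gdp")

# Step 2: split the table into rows.
rows = table.find_all("tr")

# Step 3: read the column names from the header row.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Step 4: read the cell text of every remaining row.
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

# Step 5: build the DataFrame and convert the text columns to numbers.
df = pd.DataFrame(records, columns=headers)
df["Year"] = df["Year"].astype(int)
df["Value"] = df["Value"].astype(float)

print(df)
```

Every cell comes out of the HTML as a string, so the explicit type conversion in the last step is needed before doing any numeric work on the DataFrame.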
Finally, we looked at...