In this section, we will explore some tools and learn more about handling and managing the data that we have scraped or extracted from certain websites.
Data that's collected from websites using scraping scripts is known as raw data. This data might require some additional tasks to be performed on it before it can be processed further so that we can gain insights from it. Therefore, raw data should be verified and processed (if required), which can be done as follows:
- Cleaning: As the name suggests, this step is used to remove unwanted pieces of information, such as extra spaces, whitespace characters, and unwanted portions of text. The following code shows some relevant steps that were used in examples in previous chapters, such as Chapter 9, Using Regex to Extract Data, and Chapter 3, Using LXML, XPath, and CSS Selectors. Functions...
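
As a minimal sketch of such a cleaning step (the `clean_text` helper and the sample strings are hypothetical, used only to illustrate stripping and normalizing scraped text):

```python
import re

# Hypothetical raw strings as they might come out of a scraping script
raw_rows = ['  Price:  $1,299.00 \n', '\tAvailability:   In   stock  ']

def clean_text(value):
    """Strip and normalize a single scraped string."""
    value = value.strip()               # remove leading/trailing whitespace
    value = re.sub(r'\s+', ' ', value)  # collapse runs of spaces, tabs, newlines
    value = value.replace(',', '')      # drop thousands separators (assumption)
    return value

cleaned = [clean_text(row) for row in raw_rows]
print(cleaned)  # ['Price: $1299.00', 'Availability: In stock']
```

The same idea scales to whole columns of scraped records: apply the function to each field before the data is stored or analyzed, so downstream steps receive consistent, whitespace-free values.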