Performing exploratory data analysis
After browsing the online dataset, we observe different files corresponding to various product categories. But before we focus on any particular group, we can explore the data found in the categories.txt.gz file (https://snap.stanford.edu/data/amazon/categories.txt.gz). Its .gz extension tells us that it is a gzip-compressed file, which takes up less storage space than the plain text would. Inside, each product is identified by a unique ID and can belong to multiple categories. Figure 4.1 shows two examples from the dataset:
Figure 4.1 – Sample product IDs along with their categories
Python offers the gzip module to read data from a compressed file. So, first, we need to parse categories.txt.gz and read both the ID and the categories for each product. In the following code snippet, we define a function that does exactly that: it iterates over all lines in the file, checks whether each one refers to a product ID or a category, and stores the...
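A minimal sketch of such a parser is shown here. It rests on an assumption about the file layout that should be checked against Figure 4.1: a product ID sits on an unindented line, and the categories for that product follow on indented lines. The function name parse_categories and the returned dictionary shape are hypothetical choices made for illustration.

```python
import gzip
from collections import defaultdict


def parse_categories(path):
    """Map each product ID to the list of category lines that follow it."""
    categories = defaultdict(list)
    current_id = None
    # "rt" opens the compressed file in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for raw_line in handle:
            line = raw_line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            if line[0].isspace():
                # Assumption: an indented line is a category of the last ID seen.
                if current_id is not None:
                    categories[current_id].append(line.strip())
            else:
                # Assumption: an unindented line is a product ID.
                current_id = line.strip()
                categories[current_id]  # register the ID even without categories
    return dict(categories)
```

Calling parse_categories("categories.txt.gz") on a local copy of the file would return a dictionary keyed by product ID, with each value holding that product's category lines as plain strings. Keeping each category line unsplit avoids guessing at the separator used inside a category path.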