Text data
For text data, we’ll again use the Hugging Face Hub to obtain two kinds of data: unstructured text, as we did in Chapter 3, and structured data in the form of programming language code:
# import Hugging Face Dataset
from datasets import load_dataset

# load the dataset with text classification labels
dataset = load_dataset('imdb')
The preceding code fragment loads the dataset of movie reviews from the Internet Movie Database (IMDb). We can get an example of the data by using an interface that’s similar to what we used for images:
# show the first example
dataset['train'][0]
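We can visualize the distribution of the labels in a similar way:
# seaborn provides histplot (skip this import if sns is already available)
import seaborn as sns

# plot the distribution of the labels
sns.histplot(dataset['train']['label'], bins=2)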
The preceding code fragment creates the following diagram, which shows that the positive and negative reviews are perfectly balanced:
Figure 6.13 – Balanced classes in the...
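The balance visible in the histogram can also be checked numerically. The following is a minimal sketch rather than part of the original listing: `label_balance` is a hypothetical helper, and `toy_labels` is a made-up list standing in for `dataset['train']['label']`, assuming the labels are encoded as 0 (negative) and 1 (positive).

```python
from collections import Counter


def label_balance(labels):
    """Return {label: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Sort by label so the result is stable regardless of input order
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Toy labels standing in for dataset['train']['label'] (assumed: 0 = negative, 1 = positive)
toy_labels = [0, 1, 1, 0, 1, 0]
balance = label_balance(toy_labels)
print(balance)
```

With the real dataset loaded, the same helper could be called as `label_balance(dataset['train']['label'])`; equal fractions for both labels would confirm what the figure shows.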