Fine-tuning BERT for multi-class classification with custom datasets
In this section, we will fine-tune the Turkish BERT model, BERTurk, for a seven-class classification downstream task using a custom dataset. The dataset was compiled from Turkish newspapers and consists of seven categories. We will start by getting the dataset, which is also available in this book’s GitHub repository and at https://www.kaggle.com/savasy/ttc4900.
First, run the following command in a Python notebook to download the data:
!wget https://raw.githubusercontent.com/savasy/TurkishTextClassification/master/TTC4900.csv
Then, we load the data:
import pandas as pd
data = pd.read_csv("TTC4900.csv")
data = data.sample(frac=1.0, random_state=42)
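The call data.sample(frac=1.0, random_state=42) in the loading step shuffles the rows rather than subsampling them: frac=1.0 keeps every row, and the fixed random_state makes the order reproducible. The following is a minimal sketch of that behavior on a tiny in-memory DataFrame. The toy frame and its column names (category, text) are assumptions made for illustration only and are not read from TTC4900.csv.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the column names
# "category" and "text" are assumed here for illustration.
toy = pd.DataFrame({
    "category": ["spor", "spor", "ekonomi", "ekonomi", "saglik", "saglik"],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# frac=1.0 returns all rows in random order; random_state fixes the order.
shuffled = toy.sample(frac=1.0, random_state=42)
reshuffled = toy.sample(frac=1.0, random_state=42)

# The shuffled frame keeps the original index labels, so rows can be traced back.
print(shuffled)

# The class counts are unchanged, because no rows were dropped or duplicated.
print(shuffled["category"].value_counts())
```

Shuffling before splitting matters when a file is ordered by class, because a later positional train/validation/test split would otherwise contain unbalanced slices. If you want a clean 0..n-1 index after shuffling, you can chain .reset_index(drop=True).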
Let’s organize the IDs and labels with id2label and label2id so that the model can figure out which ID refers to which label. We will also pass the number of labels, NUM_LABELS, to the model to specify the size of a thin...