Fine-tuning BERT for multi-class classification with custom datasets
In this section, we will fine-tune the Turkish BERT model, BERTurk, on a seven-class classification downstream task using a custom dataset. This dataset has been compiled from Turkish newspapers and consists of seven categories. We will start by getting the dataset. Alternatively, you can find it in this book's GitHub repository or get it from https://www.kaggle.com/savasy/ttc4900:
- First, run the following code to download the data from within a Python notebook:
!wget https://raw.githubusercontent.com/savasy/TurkishTextClassification/master/TTC4900.csv
- Next, load the data and shuffle the rows:
import pandas as pd
data = pd.read_csv("TTC4900.csv")
data = data.sample(frac=1.0, random_state=42)
- Let's organize the IDs and labels with id2label and label2id so that the model can tell which ID refers to which label. We will also pass the number of labels, NUM_LABELS, to the model to specify the size of a thin...
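The mapping step described above can be sketched as follows. This is a minimal illustration rather than the book's exact code: the sample_labels list is a hypothetical stand-in for the label column of the loaded DataFrame (the column name, for example "category", is an assumption), and the labels are sorted only to make the ID assignment deterministic.

```python
# Hypothetical stand-in for the label column of the dataset; in practice this
# would come from the loaded DataFrame (for example data["category"], where
# the column name is an assumption).
sample_labels = ["spor", "ekonomi", "spor", "dunya", "ekonomi"]

# Sort the unique labels so that each label always receives the same ID.
labels = sorted(set(sample_labels))
NUM_LABELS = len(labels)

# Two lookup tables: integer ID -> label name, and label name -> integer ID.
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Encode the label column as integer IDs, the form a classifier trains on.
encoded = [label2id[label] for label in sample_labels]

print(NUM_LABELS)
print(id2label)
print(encoded)
```

In the Hugging Face transformers library, mappings like these are typically handed to the model at load time through the num_labels, id2label, and label2id configuration arguments, so that predictions can be reported with readable label names instead of bare integers.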