In many classification problems, the target dataset is made up of categorical labels that cannot immediately be processed by every algorithm. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 samples with 2 numeric features each and one categorical label:
import numpy as np
X = np.random.uniform(0.0, 1.0, size=(10, 2))
Y = np.random.choice(('Male', 'Female'), size=(10))
print(X[0])
[0.8236887  0.11975305]
print(Y[0])
Female
The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach: each category label is mapped to a progressive integer, which is its index in an instance array called classes_:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
yt = le.fit_transform(Y)
print(yt)
[0 0 0 1 0 1 1 0 0 1]
print(le.classes_)
['Female' 'Male']
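Because the mapping is stored in classes_, the transformation can also be reversed with inverse_transform. A minimal sketch (using a small fixed label array rather than the random one above, so the output is deterministic):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Small fixed label array for a reproducible example
Y = np.array(['Male', 'Female', 'Female', 'Male'])

le = LabelEncoder()
yt = le.fit_transform(Y)            # classes_ is sorted: Female -> 0, Male -> 1
print(yt)                           # [1 0 0 1]

decoded = le.inverse_transform(yt)  # map integer codes back to the labels
print(decoded)                      # ['Male' 'Female' 'Female' 'Male']
```

Note that the integer codes follow the alphabetical order of the unique labels, not the order in which they appear in Y.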