Data manipulation using the datasets library
A dataset usually comes as a dictionary of subsets, and the split parameter decides which subset(s), or which portion of a subset, is loaded. If split is None, which is the default, load_dataset returns a dataset dictionary containing all subsets (train, test, validation, or any other combination). If the split parameter is specified, it returns a single dataset rather than a dictionary. In the following example, we retrieve only the train split of the cola dataset:
cola_train = load_dataset('glue', 'cola', split='train')
We can get a mixture of the train and validation subsets as follows:
cola_sel = load_dataset('glue', 'cola', split='train[:300]+validation[-30:]')
This split expression takes the first 300 examples of train and the last 30 examples of validation and returns them together as cola_sel.
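To make the selection rule concrete without downloading anything, the following is a minimal sketch that imitates the behaviour described above on plain Python lists. It is not how the datasets library is implemented: toy_splits and toy_select are hypothetical names introduced only for illustration, the subset sizes (1,000 and 100) are made up, and the sketch handles only the absolute-index form of the split expression used in this section.

```python
import re

# Hypothetical stand-in data: plain lists play the role of the dataset's subsets.
# The sizes are arbitrary and do not reflect the real cola dataset.
toy_splits = {
    "train": [f"train-{i}" for i in range(1000)],
    "validation": [f"val-{i}" for i in range(100)],
}

# One term of a split expression: a subset name with an optional [start:stop] slice.
_TERM = re.compile(r"^(\w+)(?:\[(-?\d+)?:(-?\d+)?\])?$")


def toy_select(splits, split=None):
    """Imitate the described behaviour of the split parameter on plain lists."""
    if split is None:
        # No split given: return every subset, keyed by its name.
        return dict(splits)
    selected = []
    for term in split.split("+"):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"unsupported split term: {term!r}")
        name, start, stop = match.groups()
        # A missing bound becomes None, exactly as in an ordinary Python slice.
        lo = int(start) if start is not None else None
        hi = int(stop) if stop is not None else None
        selected.extend(splits[name][lo:hi])
    return selected


cola_sel = toy_select(toy_splits, "train[:300]+validation[-30:]")
print(len(cola_sel))              # 330
print(cola_sel[0], cola_sel[-1])  # train-0 val-99
```

Each term separated by + is sliced like a regular Python sequence, and the results are concatenated in the order written, so the 300 train examples come first and the 30 validation examples follow.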
We can apply different combinations, as shown in the following...