Preprocessing the data
The first task is to read the file containing the corpus metadata. Each row of the metadata.csv
file holds an audio filename and its transcription, separated by the | symbol; the file has no
header row, so the column names are supplied when reading it. The relevant code is included in
the text-clustering.ipynb notebook:
import pandas as pd

# Read the data from the reduced csv file.
data = pd.read_csv('./data/metadata.csv', usecols=range(2),
                   names=['audiofile', 'transcription'], sep="|")
data.head()

>>
    audiofile                               transcription
0  LJ001-0001  Printing, in the only sense with which ...
1  LJ001-0002              in being comparatively modern.
2  LJ001-0003  For although the Chinese took impressio...
3  LJ001-0004  produced the block books, which were th...
4  LJ001-0005  the invention of movable metal letters ...
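Before going further, it can help to sanity-check the loaded table. The following is a minimal sketch, not code from the notebook: the load_metadata helper and the sample rows are hypothetical, and passing quoting=csv.QUOTE_NONE is an assumption that the transcriptions may contain stray double quotes that should be kept as ordinary text. It reads a small in-memory sample in the same pipe-separated layout and counts missing transcriptions and duplicated filenames.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same pipe-separated layout as metadata.csv
# (these rows are made up for illustration, not taken from the corpus).
SAMPLE = (
    'ID-0001|He said "hello and left.\n'
    "ID-0002|A second line of text.\n"
    "ID-0003|\n"
)


def load_metadata(source):
    """Read a header-less, pipe-separated metadata file into a DataFrame."""
    return pd.read_csv(
        source,
        sep="|",
        header=None,
        usecols=range(2),
        names=["audiofile", "transcription"],
        quoting=csv.QUOTE_NONE,  # assumption: treat quote characters as plain text
    )


sample = load_metadata(io.StringIO(SAMPLE))

# Rows whose transcription field is empty are parsed as NaN.
n_missing = int(sample["transcription"].isna().sum())

# Each audio filename is expected to appear only once.
n_duplicates = int(sample["audiofile"].duplicated().sum())

print(sample.shape, n_missing, n_duplicates)
```

The same helper could be pointed at './data/metadata.csv' instead of the in-memory sample; rows with missing transcriptions would then be candidates for removal before any text processing.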
Unfortunately, the dataset lacks any information...