Normalizing and vectorizing data
For this section, pandas and numpy methods will be used. The first step is to load the contents of the processed files into one DataFrame:
import glob
import pandas as pd
# could use `outfiles` param as well
files = glob.glob("./ner/*.tags")
# read each file into its own frame, then stack them under a fresh 0..n-1 index
data_pd = pd.concat([pd.read_csv(f, header=None,
                                 names=["text", "label", "pos"])
                     for f in files], ignore_index=True)
This step may take a while, given that it is processing 10,000 files. Once the content is loaded, we can check the structure of the DataFrame:
data_pd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62010 entries, 0 to 62009
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    62010 non-null  object
 1   label   62010 non-null  object
 2   pos     62010 non-null  object
dtypes: object(3)
memory usage...
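To try the loading pattern without waiting on the full set of files, it can be run against a couple of throwaway files first. The following is only a minimal sketch: the file names, tokens, and label values (such as B-PER) are hypothetical, and it assumes each .tags file holds comma-separated rows in text, label, pos order, as the read_csv call above implies.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample rows; the real .tags files come from the earlier processing step.
samples = {
    "a.tags": "John,B-PER,NNP\nlives,O,VBZ\n",
    "b.tags": "in,O,IN\nParis,B-LOC,NNP\n",
}

with tempfile.TemporaryDirectory() as tmp:
    # Write the sample files into a temporary directory.
    for name, content in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(content)

    # Same pattern as above: one frame per file, concatenated under a new index.
    # sorted() only makes the row order reproducible for this demonstration.
    files = sorted(glob.glob(os.path.join(tmp, "*.tags")))
    frames = [pd.read_csv(f, header=None, names=["text", "label", "pos"])
              for f in files]
    sample_pd = pd.concat(frames, ignore_index=True)

print(sample_pd.shape)  # (4, 3)
print(sample_pd)
```

Because ignore_index is set, the combined frame gets a single RangeIndex instead of repeating each file's own 0-based row numbers, which matches the RangeIndex line reported by info().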