Cleaning the data for our knowledge graph
Knowledge graphs typically contain relationships that represent commonalities between related documents, and they are built from the text of those documents. For this reason, a large part of knowledge graph construction is cleaning and preparing that text for later graph creation.
Let’s begin by taking a look at the raw abstract data in 20k_abstracts.txt. The data is laid out in the following style:
Aspergillus fumigatus
BACKGROUND IgE sensitization to Aspergillus fumigatus and a positive sputum fungal culture result are common in patients with refractory asthma.
BACKGROUND It is not clear whether these patients would benefit from antifungal treatment.
OBJECTIVE We are seeking to determine whether a 3-month course of voriconazole improved asthma-related outcomes in patients with asthma who are IgE sensitized to Aspergillus.
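To get a feel for how lines in this style could be prepared for later cleaning, here is a minimal Python sketch that splits a line into its section label and its sentence. It rests on assumptions not stated in the text: that a label is a run of at least two capital letters at the start of the line, and that whitespace (a space or a tab) separates it from the sentence. The name split_labeled_line is our own, chosen for illustration.

```python
import re

# Assumption: a labeled line starts with an all-caps section label
# (for example BACKGROUND or OBJECTIVE), followed by whitespace and
# then the sentence itself.
LABELED_LINE = re.compile(r"^([A-Z]{2,})\s+(.+)$")


def split_labeled_line(line):
    """Return (label, sentence) for a labeled line, or None otherwise."""
    match = LABELED_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


sample_lines = [
    "Aspergillus fumigatus",
    "BACKGROUND It is not clear whether these patients would benefit "
    "from antifungal treatment.",
]

for line in sample_lines:
    # Lines without a leading all-caps label come back as None.
    print(split_labeled_line(line))
```

This heuristic is deliberately simple. A sentence that happens to begin with an all-caps word would be misread as a labeled line, so if the file uses a stricter delimiter between label and sentence, splitting on that delimiter would be the safer choice.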
We can see that each abstract is given a reference number preceded by...