Preparation
Now that we have a general plan of action, before exploring our data, we must first invest in building the reusable tools for conducting the early mundane parts of the exploration pipeline that help us validate data; then as a second step investigate GDELT's content.
Introducing mask based data profiling
A simple but effective method for quickly exploring new types of data is to make use of mask based data profiling. A mask in this context is a transformation function for a string that generalizes a data item into a feature, that, as a collection of masks, will have a lower cardinality than the original values in the field of study.
When a column of data is summarized into mask frequency counts, a process commonly called data profiling, it can offer rapid insights into the common structures and content of the strings, and hence reveal how the raw data was encoded. Consider the following mask for exploring data:
- Translate uppercase letters to A
- Translate lowercase letters...