Normalizing data
Some datasets are nice to view but complicated for further processing. Take a look at the following extract of one of the matches files we were working with:
Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
07/09/12 19:30;San Salvador;El Salvador;2:2;Guyana;
07/09/12 19:45;Toronto;Canada;1:0;Panama;
07/09/12 20:00;Guatemala City;Guatemala;3:1;Antigua and Barbuda;
07/09/12 20:05;San Jose;Costa Rica;0:2;Mexico;
...
Imagine you want to answer the following questions:
- How many teams played?
- Which team scored the most goals?
- Which team won all matches it played?
The dataset is not organized to answer these questions easily. Each row describes a whole match, with two teams and both scores packed into a single field, whereas the questions are about individual teams. To answer them in a simple way, you first have to normalize the data, that is, convert it to a more suitable format before proceeding. Let's work on it.
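To make the goal concrete, here is a minimal Python sketch of what a normalized layout could look like. It is an illustration only, not the procedure this text follows. It assumes that the first number in the Matches field belongs to the first Country column and the second number to the second one. The function name normalize and the field names team, scored, and conceded are hypothetical. Because the header contains two columns named Country, the sketch reads the fields by position rather than by name.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the extract above (the elided "..." rows are left out).
RAW = """\
Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
07/09/12 19:30;San Salvador;El Salvador;2:2;Guyana;
07/09/12 19:45;Toronto;Canada;1:0;Panama;
07/09/12 20:00;Guatemala City;Guatemala;3:1;Antigua and Barbuda;
07/09/12 20:05;San Jose;Costa Rica;0:2;Mexico;
"""


def normalize(text):
    """Split every match row into two rows, one per team (hypothetical layout)."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header; its two "Country" columns are read by position
    rows = []
    for record in reader:
        if len(record) < 5:
            continue  # ignore blank or incomplete lines
        date, venue, first, score, second = record[:5]
        # Assumption: "0:3" means 0 goals for the first team, 3 for the second.
        first_goals, second_goals = (int(g) for g in score.split(":"))
        rows.append({"date": date, "venue": venue, "team": first,
                     "scored": first_goals, "conceded": second_goals})
        rows.append({"date": date, "venue": venue, "team": second,
                     "scored": second_goals, "conceded": first_goals})
    return rows


rows = normalize(RAW)

# How many teams played?
teams = {r["team"] for r in rows}

# Which team scored the most goals? (kept as a list, because teams can tie)
goals = defaultdict(int)
for r in rows:
    goals[r["team"]] += r["scored"]
best = max(goals.values())
top_scorers = sorted(t for t, g in goals.items() if g == best)

# Which team won all matches it played?
won_all = sorted(
    t for t in teams
    if all(r["scored"] > r["conceded"] for r in rows if r["team"] == t)
)

print(len(teams), top_scorers, won_all)
```

With one row per team and match, each of the three questions reduces to a filter or an aggregation over a single team column, which is the kind of simplification normalizing is meant to achieve.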