Normalizing data
Some datasets are easy to read but awkward to process further. Take a look at the following extract from one of the matches files we were working with:
Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
07/09/12 19:30;San Salvador;El Salvador;2:2;Guyana;
07/09/12 19:45;Toronto;Canada;1:0;Panama;
07/09/12 20:00;Guatemala City;Guatemala;3:1;Antigua and Barbuda;
07/09/12 20:05;San Jose;Costa Rica;0:2;Mexico;
...
Imagine you want to answer the following questions:
How many teams played?
Which team scored the most goals?
Which team won all matches it played?
The dataset is not laid out to answer these questions easily: each row describes a whole match, with the two teams in separate columns and both scores packed into a single field. To answer them in a simple way, you first have to normalize the data, that is, convert it to a suitable format before proceeding. Let's work on it.
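To make the idea concrete before going further, here is a minimal sketch in Python of one possible normalization. It is only an illustration under stated assumptions, not the procedure used in the rest of this text. It assumes the target layout is one record per team per match. The field names team, goals_for, and goals_against and the functions normalize and summarize are hypothetical, and the input is limited to the six rows shown in the extract above.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the extract above; each data row ends with a trailing ';'.
RAW = """Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
07/09/12 19:30;San Salvador;El Salvador;2:2;Guyana;
07/09/12 19:45;Toronto;Canada;1:0;Panama;
07/09/12 20:00;Guatemala City;Guatemala;3:1;Antigua and Barbuda;
07/09/12 20:05;San Jose;Costa Rica;0:2;Mexico;
"""


def normalize(text):
    """Return one record per team per match (an assumed target layout)."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header line
    records = []
    for fields in reader:
        if len(fields) < 5:
            continue  # ignore blank or incomplete lines
        date, venue, first, score, second = fields[:5]
        first_goals, second_goals = (int(g) for g in score.split(":"))
        # Each match becomes two records, one from each team's point of view.
        records.append({"date": date, "venue": venue, "team": first,
                        "goals_for": first_goals, "goals_against": second_goals})
        records.append({"date": date, "venue": venue, "team": second,
                        "goals_for": second_goals, "goals_against": first_goals})
    return records


def summarize(records):
    """Answer the three questions from the normalized records."""
    goals = defaultdict(int)
    won_every_match = {}
    for r in records:
        team = r["team"]
        goals[team] += r["goals_for"]
        won = r["goals_for"] > r["goals_against"]
        won_every_match[team] = won_every_match.get(team, True) and won
    best = max(goals.values())
    return {
        "teams": sorted(goals),
        "top_scorers": sorted(t for t, g in goals.items() if g == best),
        "won_all": sorted(t for t, ok in won_every_match.items() if ok),
    }


records = normalize(RAW)
summary = summarize(records)
print(len(summary["teams"]), "teams played")
print("Top scorers:", summary["top_scorers"])
print("Won all matches:", summary["won_all"])
```

Once every record describes a single team in a single match, each of the three questions reduces to a simple grouping by team, which is the point of normalizing first.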