Basics of data profiling
Data profiling assesses a set of data and reports, for each column, on the values, the length of strings, the level of completeness, and the distribution patterns. For example, a profile gives the minimum, maximum, mean, and median of both the values and the string lengths, which helps to identify outliers.
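To make these measures concrete, here is a minimal sketch of a column profile in Python, using only the standard library. The function names (`summarize`, `profile_column`), the sample data, and the choice to treat `None` and empty strings as missing are all assumptions made for illustration; they are not taken from any particular profiling tool.

```python
from collections import Counter
from statistics import mean, median


def summarize(numbers):
    """Return min, max, mean, and median for a non-empty list of numbers."""
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": mean(numbers),
        "median": median(numbers),
    }


def profile_column(values):
    """Profile one column given as a list.

    Assumption for this sketch: None and "" both count as missing.
    """
    present = [v for v in values if v is not None and v != ""]
    profile = {
        # Share of rows that hold a value at all.
        "completeness": len(present) / len(values) if values else 0.0,
        # Number of different values, and the most frequent ones,
        # as a rough view of the distribution.
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }
    # Statistics on the values themselves (numeric columns only).
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        profile["value_stats"] = summarize(numeric)
    # Statistics on string lengths (text columns only).
    lengths = [len(v) for v in present if isinstance(v, str)]
    if lengths:
        profile["length_stats"] = summarize(lengths)
    return profile


# Hypothetical sample data: two columns with one missing entry each.
rows = {
    "age": [34, 29, None, 41, 29, 120],
    "city": ["Leeds", "York", "", "Leeds", "Bath", "Leeds"],
}
profiles = {name: profile_column(col) for name, col in rows.items()}
```

In this sample, the `age` column has a mean of 50.6 but a median of 34. A gap like that between mean and median is the kind of signal that points to an outlier, here the value 120.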
Most of you will have some experience in data profiling – even if you have not heard the term before. The first task that many people perform when looking at an unfamiliar set of data is to open it in a spreadsheet tool and apply a filter (the autofilter feature in Microsoft Excel, for example) to all the columns. They check the values in each column to see whether the rows share just a couple of distinct values or whether there are many. They also check whether the data is a number, a date, text, and so on. It’s quite common to look for the smallest and largest values. Even this basic action is an...