Data profiling with DataCleaner
Data profiling is often overlooked because of time or resource constraints on projects, yet it can save time by catching issues before they surface in your data integration code. For instance, data that doesn't match expected formats or fall within expected ranges, misspellings, improperly formatted dates, or strings in a field that is expected to be numeric can all break a transformation.
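To make these kinds of checks concrete, here is a minimal sketch in plain Python. It is not DataCleaner or Kettle code; the function names (`profile_numeric`, `profile_dates`), the sample values, the 0 to 120 range, and the `%Y-%m-%d` date format are all assumptions made for illustration. It simply counts how many values in a column are numeric, fall within a range, or match an expected date format.

```python
from datetime import datetime


def profile_numeric(values, low, high):
    """Count how many raw values are numeric and how many fall within [low, high]."""
    summary = {"total": 0, "non_numeric": 0, "out_of_range": 0, "in_range": 0}
    for raw in values:
        summary["total"] += 1
        try:
            number = float(raw)
        except (TypeError, ValueError):
            # For example, a string found in an expected numerical field.
            summary["non_numeric"] += 1
            continue
        if low <= number <= high:
            summary["in_range"] += 1
        else:
            summary["out_of_range"] += 1
    return summary


def profile_dates(values, date_format="%Y-%m-%d"):
    """Count how many raw values parse with the expected date format."""
    summary = {"total": 0, "valid": 0, "invalid": 0}
    for raw in values:
        summary["total"] += 1
        try:
            datetime.strptime(raw, date_format)
            summary["valid"] += 1
        except (TypeError, ValueError):
            # Wrong layout (15/01/2013) or an impossible date (2013-02-30).
            summary["invalid"] += 1
    return summary


# Hypothetical sample columns, as they might arrive from a source file.
ages = ["34", "27", "abc", "212", "45"]
dates = ["2013-01-15", "15/01/2013", "2013-02-30"]

age_profile = profile_numeric(ages, low=0, high=120)
date_profile = profile_dates(dates)
print(age_profile)
print(date_profile)
```

Counts like these, gathered before a transformation is written, show which fields need cleansing or defensive handling. A dedicated profiling tool produces the same kind of summary without hand-written code.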
DataCleaner is an open source data profiling tool that integrates with Kettle and can profile data while code is still being developed. Additionally, DataCleaner jobs can be integrated into Kettle jobs and run as part of larger processes.
Profiling data reveals meta-information about the data being processed, from how many values fall into given ranges to how many values match a given format. This can help data integration developers write more optimized processes and determine whether the quality of the source can meet the requirements of the project...