Adding checksums to verify datasets
While there are many ways to verify that your datasets are valid, a common practice is to compute a checksum from the data and compare it with the checksum of a reference dataset to determine whether the two differ. A checksum is a hash of the data passed to the algorithm that generates it, so identical data always yields the same value and different data almost always yields a different one.
Kettle provides a way to add a checksum to each record in your dataset through the Add a Checksum step.
For this recipe, we will be comparing data between the roller coaster database and a flat file that may have new roller coasters listed in it.
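To make the idea concrete before building the transformation, here is a minimal sketch in Python of what a per-record checksum comparison looks like. It is not how Kettle implements the Add a Checksum step; the function names, the field names, the sample rows, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def record_checksum(record, fields):
    """Return a SHA-256 hex digest computed over the chosen fields of one record."""
    # Join the values with a separator that should not occur in the data, so
    # that ("ab", "c") and ("a", "bc") do not produce the same input string.
    joined = "\x1f".join(str(record[name]) for name in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_new_records(reference, candidates, fields):
    """Return the candidate records whose checksum is absent from the reference set."""
    known = {record_checksum(row, fields) for row in reference}
    return [row for row in candidates if record_checksum(row, fields) not in known]


# Hypothetical sample rows standing in for the database table and the flat file.
database_rows = [
    {"name": "Thunder Run", "park": "Lakeside", "speed": 60},
    {"name": "Iron Serpent", "park": "Ridgeview", "speed": 72},
]
flat_file_rows = [
    {"name": "Thunder Run", "park": "Lakeside", "speed": 60},
    {"name": "Sky Hammer", "park": "Ridgeview", "speed": 85},
]

FIELDS = ["name", "park", "speed"]
new_rows = find_new_records(database_rows, flat_file_rows, FIELDS)
print([row["name"] for row in new_rows])  # prints ['Sky Hammer']
```

Note that a record counts as new here if any of the hashed fields differs, so a changed speed for an existing roller coaster would also be reported. Which fields go into the checksum decides what "different" means.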
Getting ready
For this recipe, you will need the files associated with it, which can be downloaded from the book's site. More details about the files can be found in the recipe Comparing two streams and generating differences. There is a SQL file that creates the parks database and a flat file whose data we will compare against it.
How to do it...
Perform the following steps:
Create a new transformation...