Implementing the solution
The first step of any implementation is understanding the source data, because all of our low-level transformation and cleansing logic depends on the structure and variety of that data. In the previous chapter, we used DataCleaner to profile the data. This time, however, we are dealing with big data and the cloud, and DataCleaner may not be an effective profiling tool once data sizes run into the terabytes. For our scenario, we will instead use AWS Glue DataBrew, an AWS cloud-based data profiling tool.
Profiling the source data
In this section, we will learn how to do data profiling and analysis to understand the incoming data (you can find the sample file for this on GitHub at https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/tree/main/Chapter05). Follow these steps:
- Create an S3 bucket called `scalabledataarch` using the AWS Management Console and upload the sample input data to the S3 bucket (a scripted alternative is sketched below):
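If you prefer to script this step instead of clicking through the console, the following minimal sketch uses the AWS SDK for Java v2 to create the bucket and upload a file. The bucket name `scalabledataarch` comes from the step above; the region, the object key, and the local file path are assumptions for illustration, so substitute the values that match your environment and the sample file you downloaded from the chapter's GitHub repository.

```java
import java.nio.file.Paths;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class UploadSourceData {

    public static void main(String[] args) {
        // The region is an assumption; use the region that hosts your resources.
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {

            // Create the bucket used throughout this chapter.
            s3.createBucket(CreateBucketRequest.builder()
                    .bucket("scalabledataarch")
                    .build());

            // Upload the sample input data. Both the object key and the
            // local path are placeholders -- point them at the sample file
            // from the chapter's GitHub repository.
            s3.putObject(PutObjectRequest.builder()
                            .bucket("scalabledataarch")
                            .key("source/sample_input.csv")
                            .build(),
                    RequestBody.fromFile(Paths.get("sample_input.csv")));
        }
    }
}
```

Note that your AWS credentials must be available through the default credential provider chain (for example, via environment variables or `~/.aws/credentials`) for this sketch to run.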