Data pipelines represent the flow of data from extraction to business reporting, including intermediate steps such as cleansing, transformation, and loading the gold data into the reporting layer. A data pipeline may be real-time, near real-time, or batch. The underlying storage may be a distributed file store, such as HDFS or S3, or a distributed high-throughput messaging queue, such as Kafka or Kinesis.
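To make the extract-cleanse-transform-load flow concrete, here is a minimal batch-pipeline sketch in plain Python. The stage names mirror the steps described above; the hard-coded sample records and the `reporting_layer` list are hypothetical stand-ins for a real source system and reporting store, not any particular cloud provider's API.

```python
def extract():
    # Pull raw records from a source system (hard-coded here for illustration).
    return [
        {"order_id": 1, "amount": "120.50", "region": "emea "},
        {"order_id": 2, "amount": None,     "region": "APAC"},
        {"order_id": 3, "amount": "75.00",  "region": "apac"},
    ]

def cleanse(records):
    # Drop incomplete records and normalize field types and casing.
    return [
        {**r, "amount": float(r["amount"]), "region": r["region"].strip().upper()}
        for r in records
        if r["amount"] is not None
    ]

def transform(records):
    # Aggregate cleansed data into "gold" reporting rows:
    # total order amount per region.
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return [{"region": k, "total_amount": v} for k, v in sorted(totals.items())]

def load(gold_rows, reporting_layer):
    # Persist gold data to the reporting layer (a list here; in practice
    # this would be a warehouse table or a BI dataset).
    reporting_layer.extend(gold_rows)

reporting_layer = []
load(transform(cleanse(extract())), reporting_layer)
print(reporting_layer)
# [{'region': 'APAC', 'total_amount': 75.0},
#  {'region': 'EMEA', 'total_amount': 120.5}]
```

In a real-time or near real-time variant, the same stages would consume from and publish to a messaging queue such as Kafka or Kinesis instead of running as a single batch function chain.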
In this section, we will discuss how cloud providers' tools can be combined to build data pipelines for customers. If you have already gone through the preceding section on the logical architecture of Hadoop, this section will be easy to follow. As we have noted throughout, every major cloud provider today offers tools that are roughly equivalent to the open source ones, and each provider claims its own rich set of features...