This figure shows how far our Data Lake has come by the end of Part 2 of this book:
Figure 01: Data Lake implemented so far in this book
HDFS | Distributed File Storage |
MapReduce | Batch Processing Engine |
YARN | Resource Negotiator |
HBase | Column-Oriented Key-Value NoSQL Database that runs on HDFS |
Hive | Query Engine that provides SQL-like access to HDFS |
Impala | Fast Query Engine for analytical queries on HDFS |
Sqoop | Data Acquisition and Ingestion |
Flume | Data Acquisition and Ingestion via streamed Flume events |
Kafka | Highly Scalable Distributed Messaging Engine |
Flink | All-Purpose Real-Time Data Processing and Ingestion with Batch Support |
Spark | All-Purpose Fast Batch Processing and Ingestion with support for real-time processing via micro-batches |
Elasticsearch | Fast Distributed Indexing Engine built on Lucene, also... |