Hadoop's basic data flow
The basic data flow of a Hadoop system can be divided into four phases:
Capture Big Data: The sources can be extensive: structured, semi-structured, and unstructured data, streaming and real-time feeds, sensors, devices, machine-generated data, and many others. For capturing and storing this data, the Hadoop ecosystem provides data integrators such as Flume, Sqoop, and Storm, depending on the type of data.
Process and Structure: The data is cleansed, filtered, and transformed using MapReduce or another framework capable of distributed processing in the Hadoop ecosystem. The frameworks currently available include MapReduce, Hive, Pig, Spark, and so on.
Distribute Results: The processed data can be consumed by BI and analytics systems, or by a big data analytics system, for analysis or visualization.
Feedback and Retain: The analyzed data can be fed back into Hadoop and retained for further improvements and later iterations.
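The Process and Structure phase above follows the map-shuffle-reduce pattern that MapReduce-based frameworks apply at cluster scale. As a rough single-process sketch (the sample records, field names, and aggregation are hypothetical, not part of any Hadoop API):

```python
# Illustrative only: a pure-Python analogue of map -> shuffle -> reduce.
# A real Hadoop job distributes these steps across many nodes.
from collections import defaultdict

def map_phase(records):
    # Cleanse and filter: skip malformed lines, emit (key, value) pairs.
    for line in records:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # filter: drop records without exactly two fields
        user, amount = parts
        try:
            yield user, float(amount)
        except ValueError:
            continue  # cleanse: drop records with non-numeric amounts

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Transform: aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["alice,10.0", "bob,5.5", "garbage", "alice,2.5", "bob,x"]
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)  # {'alice': 12.5, 'bob': 5.5}
```

The malformed records ("garbage", "bob,x") are dropped during the map phase, which is where cleansing and filtering typically happen; the reduce phase then performs the per-key aggregation.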