In this chapter, we learned about data that is categorized as unstructured. We then jumped into the practical side of importing data into a Hadoop cluster using Apache Flume. Apache Flume is designed on the principle of a source and a sink connected to each other by a channel; we have already discussed the theory behind Apache Flume in Chapter 3, Hadoop Ecosystem. Data enters Flume through a source, travels through a channel, and is delivered to a sink. In our example, we configured netcat as the Apache Flume source. netcat sends data to Flume, where it is held in memory; memory served as the channel connecting the source to the sink. The channel then passes the data on to the sink. We configured the Hadoop file system as the Apache Flume sink, so that the received data is saved into HDFS as files.
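To make the source-channel-sink wiring concrete, the following is a minimal sketch of what such a Flume agent configuration might look like. The agent name (agent1), the netcat port (44444), and the HDFS path are illustrative assumptions, not the exact values used in the chapter:

```properties
# Name the components of the agent: one source, one channel, one sink
agent1.sources = netcat-source
agent1.channels = memory-channel
agent1.sinks = hdfs-sink

# netcat source: listens on a local TCP port for incoming text lines
agent1.sources.netcat-source.type = netcat
agent1.sources.netcat-source.bind = localhost
agent1.sources.netcat-source.port = 44444
agent1.sources.netcat-source.channels = memory-channel

# Memory channel: buffers events in memory between source and sink
agent1.channels.memory-channel.type = memory
agent1.channels.memory-channel.capacity = 1000

# HDFS sink: writes the received events into HDFS as files
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/flume/events
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = memory-channel
```

An agent configured this way could then be started with flume-ng agent --conf conf --conf-file netcat-hdfs.conf --name agent1 (the configuration file name here is hypothetical), after which any text sent to the netcat port would flow through the memory channel and be written to the configured HDFS path.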
We then developed a program that converts an image into text. We...