Data data everywhere...
In discussions about integrating Hadoop with other systems, it is easy to think of the flow as a simple one-to-one pattern: data comes out of one system, gets processed in Hadoop, and is then passed on to a third.
Things may be like that on day one, but the reality is more often a series of collaborating components with data flows passing back and forth between them. How we build this complex network in a maintainable fashion is the focus of this chapter.
Types of data
For the sake of the discussion, we will divide data into two broad categories:
Network traffic, where data is generated by a system and sent across a network connection
File data, where data is generated by a system and written to files on a filesystem somewhere
We don't assume these categories differ in any way other than how the data is retrieved.
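To make the distinction concrete, here is a minimal Python sketch of the two retrieval paths. The function names (read_network_data, read_file_data, count_records) are hypothetical and chosen only for illustration; they are not part of Hadoop or of this chapter's tooling. The point is simply that once the bytes are in hand, later processing need not care which path they came from.

```python
import urllib.request
from pathlib import Path


def read_network_data(url: str, timeout: float = 10.0) -> bytes:
    """Retrieve data that a system sends across a network connection.

    Hypothetical helper: uses the standard library's urlopen to fetch a URL.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def read_file_data(path: str) -> bytes:
    """Retrieve data that a system has already written to a file.

    Hypothetical helper: reads the whole file from the local filesystem.
    """
    return Path(path).read_bytes()


def count_records(raw: bytes) -> int:
    """Count non-empty lines; works the same for either data source."""
    return sum(1 for line in raw.splitlines() if line.strip())
```

Either reader can feed count_records unchanged, which reflects the assumption above: the categories differ in how the data is obtained, not in what can be done with it afterwards.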
Getting network traffic into Hadoop
When we say network data, we mean things like information retrieved from a web server via an HTTP connection, database...