Building a distributed data pipeline
Building a distributed data pipeline is almost exactly the same as building a data pipeline to run on a single machine. NiFi will handle the logistics of passing and recombining the data. A basic data pipeline is shown in the following screenshot:
The preceding data pipeline uses the GenerateFlowFile processor to create unique flowfiles. These are passed downstream to the AttributesToJSON processor, which extracts the attributes and writes them to the flowfile content. Lastly, the file is written to disk at /home/paulcrickard/output.
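While this section builds the flow by dragging processors onto the NiFi canvas, the same three-processor flow can also be scripted against the NiFi REST API. The following is a minimal sketch using the nipyapi client library, which this section does not use; the host URL, canvas positions, and property values are assumptions you would adjust for your own instance:

```python
# A sketch of building the same flow with nipyapi (an assumption; this
# section uses the NiFi canvas UI instead). Assumes NiFi at localhost:8080.
import nipyapi

nipyapi.config.nifi_config.host = 'http://localhost:8080/nifi-api'

# Work in the root process group of the canvas.
root_pg = nipyapi.canvas.get_process_group(
    nipyapi.canvas.get_root_pg_id(), 'id')

generate = nipyapi.canvas.create_processor(
    root_pg,
    nipyapi.canvas.get_processor_type('GenerateFlowFile'),
    location=(400.0, 100.0),
    name='GenerateFlowFile')

# Destination=flowfile-content makes AttributesToJSON overwrite the
# flowfile content with the JSON, matching the flow described above.
to_json = nipyapi.canvas.create_processor(
    root_pg,
    nipyapi.canvas.get_processor_type('AttributesToJSON'),
    location=(400.0, 300.0),
    name='AttributesToJSON',
    config=nipyapi.nifi.ProcessorConfigDTO(
        properties={'Destination': 'flowfile-content'}))

put_file = nipyapi.canvas.create_processor(
    root_pg,
    nipyapi.canvas.get_processor_type('PutFile'),
    location=(400.0, 500.0),
    name='PutFile',
    config=nipyapi.nifi.ProcessorConfigDTO(
        properties={'Directory': '/home/paulcrickard/output'}))

# Wire the processors together on their success relationships.
nipyapi.canvas.create_connection(generate, to_json, ['success'])
nipyapi.canvas.create_connection(to_json, put_file, ['success'])
```

Whether you build the flow in the UI or with a script like this, you only build it once: in a NiFi cluster, changes to the flow are replicated to every node, which is part of why the distributed case looks so much like the single-machine case.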
Before running the data pipeline, you will need to make sure that you have the output directory for the PutFile processor on each node. Earlier, I said that data pipelines are no different when distributed, but there are some things you must keep in mind, one being that PutFile will write...
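Because PutFile writes to the local filesystem of whichever node a flowfile lands on, the output directory must already exist on every node. A minimal sketch for pre-creating it over SSH follows; the node hostnames are hypothetical, and it assumes you have passwordless SSH access to each node:

```python
# A sketch for creating the PutFile output directory on every cluster node.
# The hostnames below are hypothetical; replace them with your own nodes.
import subprocess

NODES = ['nifi-node-1', 'nifi-node-2', 'nifi-node-3']
OUTPUT_DIR = '/home/paulcrickard/output'

for node in NODES:
    # mkdir -p succeeds whether or not the directory already exists,
    # so this script is safe to re-run.
    subprocess.run(['ssh', node, 'mkdir', '-p', OUTPUT_DIR], check=True)
```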