Designing big data processing pipelines
A common and critical mistake in big data architecture is handling multiple stages of the data pipeline with a single tool. A fleet of servers that manages the end-to-end pipeline, from data storage and transformation to visualization, may be the most straightforward architecture, but it is also the most vulnerable to breakdowns, because a failure in one stage can stall every other stage. Such a tightly coupled architecture also rarely provides the best balance of throughput and cost for your needs. When you design a data architecture, follow the FLAIR data principles, explained below:
- F: Findability. The ability to see which data assets are available and to access their metadata, including ownership, data classification, and other attributes required for data governance and compliance.
- L: Lineage. The ability to find the origin of the data, trace it back to that origin, and understand and visualize the data as it flows from sources to consumption.
- A: Accessibility...