Once you start working on real problems and implementing Hadoop clusters, you'll have to deal with the issue of sizing. It's not just the sizing of the cluster itself that needs to be considered, but also the SLAs associated with the Hadoop runtime. A cluster can be categorized based on workloads as follows (a rough capacity-sizing sketch follows the list):
- Lightweight: This category is intended for low computation and storage requirements, and is most useful for well-defined datasets with little or no growth
- Balanced: A balanced cluster can have storage and computation requirements that grow over time
- Storage-centric: This category is focused more on storing data and less on computation; it is mostly used for archival purposes, along with minimal processing
- Compute-centric: This cluster is intended for heavy computation that requires CPU- or GPU-intensive work, such as analytics and prediction
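
To make the sizing discussion concrete, the following is a minimal sketch of a raw-storage estimate for a balanced cluster. The replication factor, temporary-space overhead, growth rate, and planning horizon used here are illustrative assumptions, not values prescribed by this section:

```python
def estimate_raw_storage_tb(initial_data_tb,
                            monthly_growth_rate=0.05,   # assumed 5% monthly data growth
                            months=12,                   # assumed planning horizon
                            replication_factor=3,        # HDFS default replication
                            temp_space_overhead=0.25):   # assumed scratch/intermediate space
    """Rough raw-capacity estimate; all defaults are illustrative assumptions."""
    # Project the dataset size at the end of the planning horizon.
    projected_tb = initial_data_tb * (1 + monthly_growth_rate) ** months
    # Account for HDFS replication and temporary space consumed by running jobs.
    return projected_tb * replication_factor * (1 + temp_space_overhead)

if __name__ == "__main__":
    # Example: 100 TB of data today, sized for a year of growth.
    print(f"Estimated raw capacity: {estimate_raw_storage_tb(100):.0f} TB")
```

A storage-centric cluster would typically use a higher temporary-space overhead of near zero and a longer horizon, while a compute-centric cluster would size primarily on CPU/GPU slots rather than on this storage estimate.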