Summary
Building a production Hadoop cluster is a complex task with many steps involved. One of the often-overlooked steps in planning the cluster is outlining what kind of workload the future cluster will handle. As you have seen in this chapter, understanding what type of cluster you are building is important for proper sizing and choosing the right hardware configuration. Hadoop was originally designed for commodity hardware, but now it is being adopted by companies whose use cases are different from web giants like Yahoo! and Facebook. Such companies have different goals and resources and should plan their Hadoop cluster accordingly. It is not uncommon to see smaller clusters with more powerful nodes being built to save real estate in the data centers, as well as to keep power consumption under control.
Hadoop is constantly evolving with new features being added all the time and new important ecosystem projects emerging. Very often, these changes affect the core Hadoop components and new versions may not always be compatible with the old ones. There are several distributions of Hadoop that an end user can choose from, all providing a good level of integration between the components and even some additional features. It is often tempting to choose the latest and the most feature-rich version of Hadoop, but from a reliability perspective, it's better to go with the version that saw some production burn-in time and is stable enough. This will save you from unpleasant surprises. In the next chapter, we will dive into details about installing and configuring core Hadoop components. Roll up your sleeves and get ready to get your hands dirty!