Choosing an OS for the Hadoop cluster
Choosing an operating system for your future Hadoop cluster is a relatively simple task. Hadoop core and its ecosystem components are all written in Java, with a few exceptions. While Java code itself is cross-platform, Hadoop currently runs only on Linux-like systems, because too many design decisions were made with Linux in mind. As a result, the code surrounding the core Hadoop components, such as the start/stop scripts and the permissions model, depends on the Linux environment.
When it comes to Linux, Hadoop is fairly indifferent to the specific distribution and runs well on Red Hat, CentOS, Debian, Ubuntu, SUSE, and Fedora; none of these impose special requirements for running Hadoop. In general, nothing prevents Hadoop from working successfully on any other POSIX-style OS, such as Solaris or BSD, as long as you make sure that all dependencies are resolved properly and all supporting shell scripts work. Still, most production installations of Hadoop run on Linux, and this is the OS we will focus on in our further discussions. Specifically, the examples in this book will focus on CentOS, since it is one of the popular choices for production systems, as is its twin, Red Hat.
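If you are unsure which distribution and release a given server is running, a quick check from the shell will tell you. The commands below are a minimal sketch for Red Hat-style systems; the release file path differs on Debian-based distributions:

# On Red Hat/CentOS, the release string lives in /etc/redhat-release:
$ cat /etc/redhat-release
# A distribution-neutral alternative, if the lsb_release tool is installed:
$ lsb_release -a
# Confirm you are running a 64-bit kernel, which you will want for Hadoop:
$ uname -m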
Apache Hadoop provides source and binary tarballs, as well as RPM and DEB packages, for stable releases; currently, this is the 1.0 branch. Building Hadoop from source, while still an option, is not recommended for most users, since it requires experience in assembling large Java-based projects and careful dependency resolution. Both the Cloudera and Hortonworks distributions provide an easy way to set up a package repository on your servers and install all the required packages from there.
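As an illustration, the following is a minimal sketch of installing Hadoop packages from a vendor yum repository on CentOS. The repository file name and the package name shown are assumptions for illustration only; the exact repository URL and package set depend on the distribution and version you choose, so consult the vendor's installation guide:

# Drop the vendor-provided repository definition into yum's config directory
# (the actual .repo file is supplied by Cloudera or Hortonworks):
$ sudo cp vendor-hadoop.repo /etc/yum.repos.d/
# Refresh metadata and confirm the new repository is visible:
$ sudo yum repolist
# Install Hadoop packages from the repository (package names vary between
# distributions; hadoop-hdfs-namenode is a CDH-style example):
$ sudo yum install hadoop-hdfs-namenode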
Tip
There is no strict requirement to run the same operating system across all Hadoop nodes, but common sense suggests that the less the node configurations deviate from each other, the easier the cluster is to administer and manage.
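One simple way to keep an eye on such drift is to poll the OS release string from every node and compare the results. The loop below is a minimal sketch; it assumes passwordless SSH access to each node and a hypothetical nodes.txt file listing the hostnames, one per line:

# Print each node's OS release so differing versions stand out at a glance:
$ for host in $(cat nodes.txt); do echo -n "$host: "; ssh "$host" 'cat /etc/redhat-release'; done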