Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Q1
5. Though some general guidelines are possible, and you may need to generalize if your cluster will be running a variety of jobs, the best fit depends on the anticipated workload.
Q2
4. Network storage comes in many flavors, but in many cases you may find a large Hadoop cluster of hundreds of hosts reliant on a single storage device (or, more usually, a pair). This adds a new failure scenario to the cluster, and one more likely to occur than many others. Where storage technology does address failure mitigation, it is usually through disk-level redundancy such as RAID; these disk arrays can be highly performant, but usually at the cost of a penalty on either reads or writes. Giving Hadoop control of its own failure handling, and allowing it full parallel access to the same number of disks, is likely to give higher overall performance.
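To see why the shared storage pair tends to become the bottleneck, here is a back-of-the-envelope sketch comparing aggregate streaming bandwidth in the two layouts. Every figure in it (host count, disks per host, per-disk throughput, link speed) is an illustrative assumption for the sake of the comparison, not a measurement from the book:

# Rough model: local JBOD disks versus the same disks behind a
# shared pair of storage devices. All numbers are assumptions.
HOSTS = 100                # hosts in the cluster
DISKS_PER_HOST = 4         # local disks per host (JBOD)
DISK_MB_S = 100            # sequential throughput of one disk, MB/s
LINK_MB_S = 1250           # ~10 GbE link into each storage device, MB/s
STORAGE_DEVICES = 2        # the single pair of shared devices

# Local disks: every disk streams in parallel, bounded only by itself.
local_aggregate = HOSTS * DISKS_PER_HOST * DISK_MB_S

# Shared storage: the disks may be just as fast, but every byte must
# cross the links into the storage pair, which caps the total.
shared_aggregate = min(HOSTS * DISKS_PER_HOST * DISK_MB_S,
                       STORAGE_DEVICES * LINK_MB_S)

print("Local JBOD aggregate:     ", local_aggregate, "MB/s")   # 40,000
print("Shared storage aggregate: ", shared_aggregate, "MB/s")  # 2,500

Under these assumed figures, the same number of spindles delivers an order of magnitude more aggregate bandwidth when Hadoop can read them all locally in parallel.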
Q3
3. Probably! We would suggest avoiding configuration A: though it has just enough raw storage and is far from underpowered, there is a good chance it will provide little room for growth. An increase in data volumes would immediately require new hosts, and greater complexity in the MapReduce jobs could demand more processor power or memory. Configurations B and C both look good, as they have surplus storage for growth and provide similar headroom in both processor and memory. B will have the higher disk I/O and C the better CPU performance. Since the primary job involves financial modeling and forecasting, we expect each task to be reasonably heavyweight in terms of CPU and memory needs. Configuration B may have higher I/O, but if the processors are running at 100 percent utilization, it is likely the extra disk throughput will not be used. So the hosts with greater processor power are likely the better fit. Configuration D is more than adequate for the task, and we don't choose it for that very reason; why buy more capacity than we know we need?
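As a rough guide to why "just enough raw storage" is a warning sign, the sketch below estimates the raw disk a cluster actually needs once replication, temporary MapReduce space, and growth room are accounted for. The replication factor of 3 is the HDFS default; the temporary-space and growth figures are assumptions chosen purely for illustration:

def raw_storage_needed(data_tb, replication=3,
                       temp_overhead=0.25, growth_factor=1.5):
    """Estimate raw TB needed for a given volume of source data."""
    replicated = data_tb * replication            # HDFS stores each block 3 times by default
    with_temp = replicated * (1 + temp_overhead)  # space for intermediate map output and the like
    return with_temp * growth_factor              # headroom before new hosts are needed

# Example: 10 TB of source data needs roughly 56 TB of raw disk.
print(round(raw_storage_needed(10)), "TB raw")

A configuration sized only for the replicated data, as A appears to be, leaves nothing for intermediate output or growth, which is exactly the trap described above.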