Cluster tuning
In addition to the previous comments specific to a cluster run on EMR, there are some general points to keep in mind when running workloads on any type of cluster. You will need to address these more explicitly when running outside of EMR, because EMR often abstracts away some of the details.
JVM considerations
You should be running the 64-bit version of a JVM in server mode. Server mode can take longer to produce optimized code, but it uses more aggressive optimization strategies and re-optimizes code over time. This makes it a much better fit for long-running services such as Hadoop processes.
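As a quick way to confirm the 64-bit and server-mode recommendation above on a given node, the following is a minimal sketch rather than a definitive check. The class name JvmCheck and its helper methods are hypothetical, and the sketch assumes a HotSpot-based JVM: the sun.arch.data.model property is HotSpot-specific, and the "Server VM" text in java.vm.name is a HotSpot naming convention that other JVMs may not follow.

```java
public class JvmCheck {

    // Returns true if the JVM appears to be 64-bit.
    // dataModel is the HotSpot-specific "sun.arch.data.model" property (may be null);
    // osArch is the standard "os.arch" property, used only as a fallback heuristic.
    static boolean is64Bit(String dataModel, String osArch) {
        if (dataModel != null) {
            return "64".equals(dataModel);
        }
        return osArch != null && osArch.contains("64");
    }

    // Returns true if the VM name follows the HotSpot convention for the server VM,
    // for example "OpenJDK 64-Bit Server VM".
    static boolean isServerVm(String vmName) {
        return vmName != null && vmName.contains("Server VM");
    }

    public static void main(String[] args) {
        String vmName = System.getProperty("java.vm.name");
        String dataModel = System.getProperty("sun.arch.data.model");
        String osArch = System.getProperty("os.arch");

        System.out.println("VM name:   " + vmName);
        System.out.println("64-bit:    " + is64Bit(dataModel, osArch));
        System.out.println("Server VM: " + isServerVm(vmName));
    }
}
```

Running this with the same java binary that launches the Hadoop daemons shows what those processes will actually get, which is useful when several JVMs are installed on a host.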
Ensure that you allocate enough memory to the JVM to prevent overly frequent garbage collection (GC) pauses. The Concurrent Mark Sweep (CMS) collector is currently the most tested and recommended for Hadoop. The Garbage First (G1) collector has become the GC option of choice in numerous other workloads since its introduction with JDK 7, so it's worth monitoring recommended best practice as it evolves. These...