Stinger initiative
Hive has been successful and capable since its earliest releases, particularly in providing SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation for being relatively slow, particularly because of lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.
These perceived limitations were due less to Hive itself than to the inefficiency built into translating SQL queries into the MapReduce model, compared with other ways of implementing a SQL query. Particularly on very large datasets, MapReduce spent a great deal of I/O (and consequently time) writing out the results of one MapReduce job only to have them read back in by another. As discussed in Chapter 3, Processing – MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule jobs on a Hadoop cluster as a graph of tasks that does...