Working with Hive
Hive is a data warehousing infrastructure built on top of Hadoop. It provides SQL-like capabilities for working with data stored in Hadoop. In its infancy, Hadoop was limited to MapReduce as its compute platform, a very engineer-centric programming paradigm. By 2008, engineers at Facebook were writing fairly complex MapReduce jobs, but realised that this approach would not scale and that it made it difficult to get the best value from the available talent. Relying on a dedicated team that could write MapReduce jobs on demand was considered a poor strategy, so the team decided to bring SQL to Hadoop (Hive) for two main reasons:
- A SQL-based declarative language that still allowed engineers to plug in their own scripts and programs when SQL did not suffice.
- Centralized metadata about all data (Hadoop-based datasets) in the organization, helping to create a data-driven organization.
Spark supports reading and writing data stored in Apache Hive. To do so, you would need to configure Spark to connect to Hive (for example, by placing Hive's configuration file, hive-site.xml, on Spark's classpath)...