Set operations in Spark
If you come from the database world and have now ventured into big data, you are probably wondering how to apply set operations to Spark datasets. You might have realized that an RDD can represent any sort of data, but it does not necessarily represent set-based data. It is also important to remember that while Spark offers ways to mimic these operations, Spark doesn't allow you to apply conditions to them, which is common in SQL. The typical set operations in the database world include the following, and we'll see how some of them apply to Spark:
- Distinct: The distinct operation returns the dataset with duplicate elements removed
- Intersection: The intersection operation returns only those elements that are present in both datasets
- Union: A union operation returns the elements from both datasets
- Subtract: A subtract operation returns the elements from one dataset...
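
To make the semantics of these operations concrete, here is a minimal plain-Python sketch that models them on local lists, with no Spark installation required. The helper functions are hypothetical and exist only for illustration; they are named after the RDD methods `distinct`, `intersection`, `union`, and `subtract`, and the sketch assumes the behavior of those methods: `union` keeps duplicates (much like SQL's `UNION ALL`), `intersection` returns de-duplicated output, and `subtract` keeps duplicates from the first dataset. Results are sorted here only to make the output predictable, since Spark does not guarantee element order.

```python
# Plain-Python model of four RDD set-style operations (no Spark needed).
# These helpers are hypothetical; they only mimic what the RDD methods
# of the same name are assumed to return.

def distinct(data):
    """Remove duplicate elements, as rdd.distinct() would."""
    return sorted(set(data))

def intersection(left, right):
    """Keep elements present in both inputs; the output has no duplicates."""
    return sorted(set(left) & set(right))

def union(left, right):
    """Concatenate both inputs; duplicates are NOT removed."""
    return sorted(list(left) + list(right))

def subtract(left, right):
    """Keep elements of left that do not appear in right."""
    exclude = set(right)
    return sorted(x for x in left if x not in exclude)

a = [1, 2, 2, 3, 4]
b = [3, 4, 4, 5]

print(distinct(a))         # [1, 2, 3, 4]
print(intersection(a, b))  # [3, 4]
print(union(a, b))         # [1, 2, 2, 3, 3, 4, 4, 4, 5]
print(subtract(a, b))      # [1, 2, 2]
```

With PySpark, the equivalent calls on two RDDs would look like `rdd_a.distinct()`, `rdd_a.intersection(rdd_b)`, `rdd_a.union(rdd_b)`, and `rdd_a.subtract(rdd_b)`, each followed by an action such as `collect()` to bring the results back to the driver. If you want a union without duplicates, one option is to chain `distinct()` after `union()`.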