Map Reduce program to find distinct values
In this recipe, we are going to learn how to write a map reduce program to find distinct values from a given set of data.
Getting ready
To perform this recipe, you should have a running Hadoop cluster as well as an eclipse that is similar to an IDE.
How to do it
Sometimes, there may be a chance that the data you have contains some duplicate values. In SQL, we have something called a distinct function, which helps us get distinct values. In this recipe, we are going to take a look at how we can get distinct values using map reduce programs.
Let's consider a use case where we have some user data with us, which contains two columns: userId
and username
. Let's assume that the data we have contains duplicate records, and for our processing needs, we only need distinct records through user IDs. Here is some sample data that we have where columns are separated by '|':
1|Tanmay 2|Ramesh 3|Ram 1|Tanmay 2|Ramesh 6|Rahul 6|Rahul 4|Sneha 4|Sneha
The idea here is to...