Reading data with Apache Spark
Spark can read data from many sources, but NoSQL datastores such as HBase, Accumulo, and Cassandra generally offer only a limited query subset, so you often need to scan all the data to read only the part you require. With Elasticsearch, you can retrieve just the subset of documents that matches your Elasticsearch query.
Getting ready
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
You also need a working installation of Apache Spark and the data indexed in the previous example.
How to do it...
To read data from Elasticsearch via Apache Spark, we will perform the following steps:
We need to start the Spark Shell:
./bin/spark-shell
We import the required classes:
import org.elasticsearch.spark._
Now we can create an RDD by reading data from Elasticsearch:
val rdd = sc.esRDD("spark/persons")
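The introduction mentions retrieving only the documents that match a query. A minimal sketch of that, assuming the same Spark Shell session and the spark/persons resource used above, is to pass a query string as the second argument to esRDD. The field name and value in the query are hypothetical and depend on the data you indexed:

```scala
// Assumes the spark-shell session from the steps above,
// where sc is the SparkContext provided by the shell.
import org.elasticsearch.spark._

// Hypothetical URI query: replace the field and value with ones from your data.
val query = "?q=name:bill"

// esRDD(resource, query) yields an RDD of (document ID, field map) pairs,
// containing only the documents that match the query.
val filtered = sc.esRDD("spark/persons", query)

// Bring a few matching documents to the driver and print them.
filtered.take(5).foreach { case (id, fields) =>
  println(s"$id -> $fields")
}
```

With this form, the filtering is done by Elasticsearch, so only the matching documents are transferred to Spark rather than the whole index.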
We can watch...