Apache Spark 2.x Cookbook

Over 70 cloud-ready recipes for distributed Big Data processing and analytics

Product type: Paperback
Published: May 2017
ISBN-13: 9781787127265
Length: 294 pages
Edition: 1st Edition
Author: Rishi Yadav
Table of Contents (13 chapters)

Preface
  1. Getting Started with Apache Spark
  2. Developing Applications with Spark
  3. Spark SQL
  4. Working with External Data Sources
  5. Spark Streaming
  6. Getting Started with Machine Learning
  7. Supervised Learning with MLlib — Regression
  8. Supervised Learning with MLlib — Classification
  9. Unsupervised Learning
  10. Recommendations Using Collaborative Filtering
  11. Graph Processing Using GraphX and GraphFrames
  12. Optimizations and Performance Tuning

Leveraging Databricks Cloud

Databricks is the company behind Spark. Its cloud platform removes the complexity of deploying Spark and provides a ready-to-go environment with notebooks for various languages. Databricks Cloud also has a community edition that provides a single-node instance with 6 GB of RAM for free. It is a great starting place for developers. The Spark cluster that is created terminates after 2 hours of sitting idle.

All the recipes in this book can be run on either the InfoObjects Sandbox or the Databricks Cloud community edition. All of the data for the recipes has also been ported to a public S3 bucket called sparkcookbook. Just put these recipes on the Databricks Cloud community edition, and they will work seamlessly.

How to do it...

  1. Go to https://community.cloud.databricks.com.
  2. Click on Sign Up.
  3. Choose COMMUNITY EDITION (or the full platform).
  4. Fill in the details, and you'll be presented with a landing page.
  5. Click on Clusters, then Create Cluster (with Community Edition shown below it).
  6. Enter the cluster name, for example, myfirstcluster, and choose Availability Zone (more about AZs in the next recipe). Then click on Create Cluster.
  7. Once the cluster is created, the blinking green signal will become solid green.
  8. Now go to Home and click on Notebook. Choose an appropriate notebook name, for example, config, and choose Scala as the language.
  9. Then set the AWS access parameters. There are two access parameters:
    • ACCESS_KEY: This is referred to as fs.s3n.awsAccessKeyId in SparkContext's Hadoop configuration.
    • SECRET_KEY: This is referred to as fs.s3n.awsSecretAccessKey in SparkContext's Hadoop configuration.
  10. Set ACCESS_KEY in the config notebook:
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<replace with your key>")
  11. Set SECRET_KEY in the config notebook:
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<replace with your secret key>")
  12. Load a folder from the sparkcookbook bucket (all of the data for the recipes in this book is available in this bucket):
        val yelpdata = spark.read.textFile("s3a://sparkcookbook/yelpdata")
  13. The problem with the previous approach is that if you were to publish your notebook, your keys would be visible. To avoid this, use the Databricks File System (DBFS).
      DBFS is Databricks Cloud's internal file system. It is a layer above S3 that mounts S3 buckets in a user's workspace and caches frequently accessed data on worker nodes. Once a bucket is mounted, other notebooks can read from the mount point without containing the keys.
  14. Set the access key in the Scala notebook:
        val accessKey = "<your access key>"
  15. Set the secret key in the Scala notebook, encoding any / characters as %2F because the key becomes part of the mount URI:
        val secretKey = "<your secret key>".replace("/", "%2F")
  16. Set the bucket name in the Scala notebook:
        val bucket = "sparkcookbook"
  17. Set the mount name in the Scala notebook:
        val mount = "cookbook"
  18. Mount the bucket:
        dbutils.fs.mount(s"s3a://$accessKey:$secretKey@$bucket", s"/mnt/$mount")
  19. Display the contents of the bucket:
        display(dbutils.fs.ls(s"/mnt/$mount"))
The rest of the recipes assume that you have set up your AWS credentials.
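
The way the mount source is assembled in the DBFS steps can be sketched in plain Scala. This is only an illustration: mountSource is a hypothetical helper name, and the key values are made-up placeholders rather than real credentials.

```scala
// Hypothetical helper: builds the source URI that is passed to dbutils.fs.mount.
def mountSource(accessKey: String, secretKey: String, bucket: String): String = {
  // Slashes in the secret key would be read as path separators inside the URI,
  // so they are percent-encoded, as in the recipe.
  val encodedSecret = secretKey.replace("/", "%2F")
  s"s3a://$accessKey:$encodedSecret@$bucket"
}

// Placeholder values for illustration only.
val source = mountSource("AKIAEXAMPLE", "abc/def/ghi", "sparkcookbook")
println(source) // prints s3a://AKIAEXAMPLE:abc%2Fdef%2Fghi@sparkcookbook

// In a Databricks notebook, where dbutils is predefined, the mount call would be:
//   dbutils.fs.mount(source, "/mnt/cookbook")
```

Keeping the encoding inside one function means the raw secret key is never interpolated into the URI by mistake.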

How it works...

Let's look at the key concepts in Databricks Cloud.

Cluster

The concept of clusters is self-evident. A cluster contains a master node and one or more worker nodes. These nodes are EC2 instances, which we are going to learn more about in the next recipe.

Notebook

Notebooks are the most powerful feature of Databricks Cloud. You can write your code in Scala, Python, or R, or use a plain SQL notebook. Notebooks cover the full workflow: you can write code like a programmer, use SQL like an analyst, or build visualizations like a Business Intelligence (BI) expert.

Table

Tables enable Spark to run SQL queries.
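
As a minimal sketch of the idea, the following registers a DataFrame under a name and queries it with SQL. It makes several assumptions: a temporary view stands in for a Databricks table, the sample rows and the names reviews, business_id, and stars are hypothetical, and Spark 2.x is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// In a Databricks notebook a session already exists and getOrCreate() returns it;
// the local master only matters when this runs outside a cluster.
val session = SparkSession.builder()
  .appName("table-sketch")
  .master("local[*]")
  .getOrCreate()
import session.implicits._

// Hypothetical sample rows, not taken from the sparkcookbook bucket.
val reviews = Seq(("b1", 5), ("b1", 3), ("b2", 4)).toDF("business_id", "stars")

// Register the DataFrame under a name that SQL statements can refer to.
reviews.createOrReplaceTempView("reviews")

// Query the view with plain SQL.
val avgStars = session.sql(
  """SELECT business_id, AVG(stars) AS avg_stars
    |FROM reviews
    |GROUP BY business_id
    |ORDER BY business_id""".stripMargin)

avgStars.show()
```

A temporary view lasts only for the session; a table created through the Databricks Tables section is the persistent counterpart.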

Library

Library is the section where you upload the libraries you would like to attach to your notebooks. You do not have to upload libraries manually; you can simply provide the Maven coordinates, and Databricks will find the library and attach it for you.
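
For readers unfamiliar with Maven parameters, the sketch below shows the common three-part form, groupId:artifactId:version. The MavenCoordinate class and parseCoordinate helper are hypothetical names used only for illustration, and the coordinate itself is a placeholder, not a real published artifact.

```scala
// A Maven coordinate in its common three-part form: groupId:artifactId:version.
final case class MavenCoordinate(groupId: String, artifactId: String, version: String)

// Hypothetical helper: splits a coordinate string into its three parts,
// returning None when the string does not have exactly three non-empty parts.
def parseCoordinate(s: String): Option[MavenCoordinate] =
  s.split(":") match {
    case Array(g, a, v) if g.nonEmpty && a.nonEmpty && v.nonEmpty =>
      Some(MavenCoordinate(g, a, v))
    case _ => None
  }

// Placeholder coordinate for illustration only.
val coord = parseCoordinate("com.example:my-library_2.11:1.0.0")
println(coord)
```

The _2.11 suffix in the artifact name follows the usual convention for marking the Scala version a library was built against.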
