
You're reading from Learning PySpark: Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

Product type Paperback

Published in Feb 2017

Publisher Packt

ISBN-13 9781786463708

Length 274 pages

Edition 1st Edition

Languages

Python

Tools

Apache Spark

Concepts

Data Processing

Authors (2):

Denny Lee

Tomasz Drabas


Table of Contents (13) Chapters

Preface

1. Understanding Spark

2. Resilient Distributed Datasets

3. DataFrames

4. Prepare Data for Modeling

5. Introducing MLlib

6. Introducing the ML Package

7. GraphFrames

8. TensorFrames

9. Polyglot Persistence with Blaze

10. Structured Streaming

11. Packaging Spark Applications

Index

Other features of PySpark ML in action

At the beginning of this chapter, we described most of the features of the PySpark ML library. In this section, we will provide examples of how to use some of the Transformers and Estimators.

Feature extraction

We have used quite a few models from this submodule of PySpark. In this section, we'll show you how to use the most useful ones (in our opinion).

NLP-related feature extractors

As described earlier, the NGram model takes a list of tokens and produces n-grams, that is, sequences of n consecutive words (pairs, by default).
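To make the idea concrete, here is a minimal plain-Python sketch of what an n-gram transformation produces. It is not the PySpark implementation; the helper name `ngrams` and the sample tokens are made up for illustration. In PySpark itself, the corresponding Transformer is `NGram` from `pyspark.ml.feature`, which is configured with `n`, `inputCol`, and `outputCol`.

```python
def ngrams(tokens, n=2):
    """Return space-joined n-grams built from a list of tokens.

    Hypothetical helper for illustration only; it mirrors the idea
    behind an n-gram transformer, not PySpark's actual code.
    """
    # A token list shorter than n yields an empty result,
    # because range() of a non-positive number is empty.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample tokens (not taken from the book's dataset).
tokens = ["machine", "learning", "with", "spark"]
bigrams = ngrams(tokens, n=2)
print(bigrams)
```

Each output element joins n neighboring tokens with a single space, so the order of words in the original text is preserved within every n-gram.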

In this example, we will take an excerpt from PySpark's documentation and show how to clean up the text before passing it to the NGram model. Here's what our dataset looks like (abbreviated for brevity):

Tip

To see the following snippet in full, please download the code from our GitHub repository: https://github.com/drabastomek/learningPySpark.

We copied these four paragraphs from the description of the DataFrame usage in Pipelines: http://spark...
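The cleanup steps themselves are not shown in this excerpt. As a rough plain-Python sketch of what such cleaning typically involves, the snippet below lowercases the text, splits it on non-word characters, and drops stop words. The choice of steps, the helper name `clean_text`, the tiny stop-word list, and the sample sentence are all assumptions made for illustration. PySpark offers `RegexTokenizer` and `StopWordsRemover` in `pyspark.ml.feature` for comparable tasks.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}


def clean_text(text):
    """Lowercase the text, split on non-word characters, drop stop words."""
    tokens = re.split(r"\W+", text.lower())
    # re.split can leave empty strings at the edges; filter them out too.
    return [t for t in tokens if t and t not in STOP_WORDS]


# Made-up sample sentence, not taken from the book's dataset.
sample = "The DataFrame is a Dataset of rows."
cleaned = clean_text(sample)
print(cleaned)
```

The resulting token list is the kind of input an n-gram step expects: a clean sequence of words with punctuation and common filler words removed.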
