Other features of PySpark ML in action
At the beginning of this chapter, we described most of the features of the PySpark ML library. In this section, we will provide examples of how to use some of the Transformers and Estimators.
Feature extraction
We have used quite a few models from this submodule of PySpark. In this section, we'll show you how to use the most useful ones (in our opinion).
NLP-related feature extractors
As described earlier, the NGram model takes a list of tokenized text and produces n-grams of words: pairs by default, or longer sequences if you ask for them.
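To make the output shape concrete, here is a minimal plain-Python sketch of what an n-gram transformation produces. It is not PySpark's implementation: the helper name ngrams and the sample tokens are made up for illustration. It assumes, as in PySpark's NGram, that each n-gram is returned as a single space-joined string and that an input with fewer than n tokens yields an empty list.

```python
def ngrams(tokens, n=2):
    """Return the space-joined n-grams of a list of tokens.

    Illustrative helper only (hypothetical name); it mimics the shape
    of the output, not the internals of PySpark's NGram.
    """
    # A sliding window of width n; with fewer than n tokens the range
    # is empty, so the result is an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Sample tokens invented for this sketch
tokens = ["machine", "learning", "can", "be", "applied"]
bigrams = ngrams(tokens, n=2)
print(bigrams)
```

In PySpark itself, the equivalent step would be an NGram transformer configured with n, inputCol, and outputCol, applied to a DataFrame column that holds arrays of tokens.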
In this example, we will take an excerpt from PySpark's documentation and show how to clean up the text before passing it to the NGram model. Here's what our dataset looks like (abbreviated for brevity):
Tip
To see the following snippet in full, please download the code from our GitHub repository: https://github.com/drabastomek/learningPySpark.
We copied these four paragraphs from the description of the DataFrame usage in Pipelines: http://spark...