Feature extractors
We have seen how feature transformers convert, modify, and standardize our documents in a preprocessing pipeline, turning raw text into a collection of tokens. Feature extractors take these tokens and generate feature vectors from them, which may then be used to train machine learning models. Two feature extractors commonly used in NLP are the bag of words and term frequency–inverse document frequency (TF–IDF) algorithms.
Bag of words
The bag of words approach simply counts the number of occurrences of each unique word in the raw or tokenized text. For example, given the text "Machine Learning with Apache Spark, Apache Spark...
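As a rough illustration of the counting idea, the following is a minimal sketch in plain Python rather than Spark. The sample sentence, the regular-expression tokenizer, and the helper names bag_of_words and to_vector are assumptions made for this example and do not come from the text above.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count the occurrences of each unique lowercased word in text."""
    # Assumed tokenizer: lowercase the text, then keep runs of
    # letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def to_vector(counts, vocabulary):
    """Lay the counts out as a fixed-length vector ordered by vocabulary."""
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical sample sentence, chosen only for illustration.
sample = "the cat sat on the mat, the dog sat too"

counts = bag_of_words(sample)
vocabulary = sorted(counts)             # one vector position per unique word
vector = to_vector(counts, vocabulary)

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the', 'too']
print(vector)      # [1, 1, 1, 1, 2, 3, 1]
```

Word order is discarded: only the count per vocabulary entry survives, which is what makes the result usable as a feature vector. In Spark, the CountVectorizer class in pyspark.ml.feature plays a comparable role on tokenized columns, although this sketch does not depend on it.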