Before ending this chapter, we can also create a custom transformer based on Word2Vec embeddings and use it in our classification pipeline instead of CountVectorizer. To use our custom transformer in the pipeline, we need to make sure it has fit, transform, and fit_transform methods.
Here is our new transformer, which we will call WordEmbeddingVectorizer:
import pandas as pd
import spacy

class WordEmbeddingVectorizer:
    def __init__(self, language_model='en_core_web_md'):
        # Load the pre-trained spaCy model that provides the word vectors
        self.nlp = spacy.load(language_model)

    def fit(self, x, y=None):
        # Nothing to learn: the embeddings are pre-trained
        return self

    def transform(self, x, y=None):
        # Map each document to its spaCy document vector (the average of its word vectors)
        return pd.Series(x).apply(
            lambda doc: self.nlp(doc).vector.tolist()
        ).values.tolist()

    def fit_transform(self, x, y=None):
        return self.transform(x)
The fit method here is a no-op: since we rely on a pre-trained model from spaCy, there is nothing to learn, so it simply returns self to stay compatible with the pipeline interface. We can use the newly created transformer...
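For illustration, here is a minimal sketch of how the transformer could be plugged into a scikit-learn pipeline. The choice of classifier (SVC), the variable names train_data and train_labels, and the sample documents are assumptions made for this example rather than taken from the text above:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: raw text documents and their labels
train_data = ["this product works great", "this product stopped working"]
train_labels = [1, 0]

# WordEmbeddingVectorizer takes the place of CountVectorizer as the vectorization step
embedding_pipeline = Pipeline([
    ('vectorizer', WordEmbeddingVectorizer()),
    ('classifier', SVC(kernel='linear')),
])

embedding_pipeline.fit(train_data, train_labels)
predictions = embedding_pipeline.predict(["another document to classify"])

Because the pipeline calls fit_transform on the vectorizer during training and transform during prediction, the three methods we defined are all that is needed for the custom transformer to work here.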