Chapter 3: Fundamentals of Natural Language Processing
Activity 3: Process a Corpus
Solution
Import the TfidfVectorizer and TruncatedSVD classes from sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
Load the corpus:
docs = []
ndocs = ["doc1", "doc2", "doc3"]
for n in ndocs:
    # Read each document as UTF-8; the with block closes the file afterwards
    with open("dataset/" + n + ".txt", "r", encoding="utf8") as aux:
        docs.append(aux.read())
With spaCy, let's add some new stop words, tokenize the corpus, and remove the stop words and punctuation. The filtered corpus will be stored in a new variable, newD:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
# Flag newlines and "the"/"The" as additional stop words
nlp.vocab["\n\n"].is_stop = True
nlp.vocab["\n"].is_stop = True
nlp.vocab["the"].is_stop = True
nlp.vocab["The"].is_stop = True
newD = []
for d in docs:
    doc = nlp(d)
    # Keep only tokens that are neither stop words nor punctuation
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    newD.append(' '.join(tokens))
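To make the filtering step concrete, here is a minimal sketch of the same idea in plain Python, without spaCy. The stop-word set and the whitespace tokenizer are assumptions made for illustration only; spaCy's tokenizer and stop-word list are far more complete, and this sketch lowercases tokens before the lookup, which the spaCy code above does not do.

```python
import string

# Hypothetical stop-word set for illustration; spaCy's real list is much larger.
STOP = {"the", "a", "is", "of"}

def remove_stop_words(text, stop_words=STOP):
    """Split on whitespace, strip surrounding punctuation, drop stop words."""
    kept = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        # Skip tokens that were pure punctuation or that are stop words
        if token and token.lower() not in stop_words:
            kept.append(token)
    return " ".join(kept)

cleaned = remove_stop_words("The cat is on the mat.")
print(cleaned)  # cat on mat
```

The result is a space-joined string, which is the same shape as each entry of newD.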
Create the TF-IDF matrix. I'm going to add some parameters to improve the results: ngram_range=(1,2) uses both single words and pairs of consecutive words as features, smooth_idf=True adds one to the document frequencies to prevent divisions by zero, and max_df=0.5 ignores terms that appear in more than half of the documents:
vectorizer = TfidfVectorizer(use_idf=True,
                             ngram_range=(1,2),
                             smooth_idf=True,
                             max_df=0.5)
X = vectorizer.fit_transform(newD)
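The following sketch recomputes a small TF-IDF matrix by hand to show what these parameters do. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, works on unigrams only, and uses a made-up three-document corpus, so it is an illustration rather than a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
toy_docs = ["apple banana apple", "banana cherry", "cherry date cherry"]
tokenized = [d.split() for d in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

# max_df=0.5 keeps only terms that appear in at most half of the documents
max_df = 0.5
vocab = sorted(t for t, count in df.items() if count <= max_df * n_docs)

def smooth_idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(tokens):
    """Raw term count times idf, then L2-normalized (zero rows stay zero)."""
    counts = Counter(tokens)
    weights = [counts[t] * smooth_idf(t) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
for row in matrix:
    print(row)
```

With only three documents, max_df=0.5 removes every term that occurs in two or more of them, so the second toy document ends up with an all-zero row. The same effect applies to a three-document corpus of any size.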
Perform the LSA algorithm:
lsa = TruncatedSVD(n_components=100,
                   algorithm='randomized',
                   n_iter=10,
                   random_state=0)
lsa.fit_transform(X)
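Each row of lsa.components_ is a right singular vector of the TF-IDF matrix, that is, one concept expressed as a weight per term. As a rough sketch of what that means, the code below approximates the leading concept of a toy matrix using plain power iteration. This is a swapped-in technique chosen for readability, not the randomized solver that TruncatedSVD uses, and the matrix values and helper names are made up for illustration.

```python
import math

# Toy document-term matrix: 3 documents (rows) x 4 terms (columns), made-up values
toy_X = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def leading_concept(M, n_iter=100):
    """Approximate the top right singular vector of M by power iteration on M^T M."""
    Mt = transpose(M)
    v = normalize([1.0] * len(M[0]))
    for _ in range(n_iter):
        v = normalize(matvec(Mt, matvec(M, v)))
    return v

concept = leading_concept(toy_X)
# The matching singular value is the length of M @ v
singular_value = math.sqrt(sum(x * x for x in matvec(toy_X, concept)))
print(concept, singular_value)
```

In this toy matrix the third term carries the largest weight, so the leading concept concentrates almost entirely on it.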
With pandas, build a DataFrame for each of the first three concepts that pairs every term with its weight, and show it sorted by weight in descending order:
import pandas as pd

# Feature names, in the same column order as lsa.components_
# (on scikit-learn versions older than 1.0, use vectorizer.get_feature_names())
terms = vectorizer.get_feature_names_out()
dic1 = {"Terms": terms, "Components": lsa.components_[0]}
dic2 = {"Terms": terms, "Components": lsa.components_[1]}
dic3 = {"Terms": terms, "Components": lsa.components_[2]}
f1 = pd.DataFrame(dic1)
f2 = pd.DataFrame(dic2)
f3 = pd.DataFrame(dic3)
# Highest-weighted terms first for each concept
print(f1.sort_values(by=['Components'], ascending=False))
print(f2.sort_values(by=['Components'], ascending=False))
print(f3.sort_values(by=['Components'], ascending=False))
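The sorting step amounts to ranking (term, weight) pairs for one concept. A minimal sketch of that ranking in plain Python follows; the term names, the weights, and the top_terms helper are hypothetical and only stand in for vectorizer output and one row of lsa.components_.

```python
# Hypothetical terms and weights standing in for one row of lsa.components_
toy_terms = ["apple", "banana", "cherry", "date"]
toy_weights = [0.10, 0.70, -0.20, 0.55]

def top_terms(terms, weights, n=3):
    """Return the n (term, weight) pairs with the largest weights."""
    ranked = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

keywords = top_terms(toy_terms, toy_weights)
print(keywords)
```

The terms at the top of each ranking are the keywords that describe the concept.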
The output is as follows:
Note
Do not worry if your keywords are not the same as the ones shown here. As long as the keywords represent a concept, the result is valid.