word_tokenizer
Word tokenization is the process of splitting a sample of text into individual words. This is a requirement in NLP tasks where each word needs to be captured and subjected to further analysis, such as classifying or scoring it for a particular sentiment.
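To see what this step does in isolation, here is a minimal sketch using NLTK's word_tokenize function; the sample sentence is made up for illustration and is separate from the Optimus workflow shown next:

# A minimal, standalone sketch of word tokenization using NLTK;
# the sample sentence is invented for illustration.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

sentence = "Transformers is an American and Japanese media franchise."
print(word_tokenize(sentence.lower()))
# ['transformers', 'is', 'an', 'american', 'and', 'japanese', 'media', 'franchise', '.']

Note that a proper tokenizer separates punctuation, such as the trailing period, into its own token, which a simple split on whitespace would not do.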
In Optimus, you just need to call word_tokenize in the cols accessor, as in the following code:
print(df.cols.word_tokenize("text", "tokens")["text"])
You will then obtain a list of words for every row, as shown here:
text (object)
['transformers', 'american', 'japanese', 'media', 'franchise', 'produced', 'american', 'toy', 'company', 'hasbro', 'japanese', 'toy', 'company', 'takara', 'tomy', 'follows', 'battles', 'sentient', 'living', 'autonomous', 'robots', 'often...
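If you want to reproduce this end to end, the following sketch assumes a pandas-backed Optimus session and invents two short sample rows; it also assumes the NLTK data that word_tokenize relies on is already available:

# A minimal end-to-end sketch, assuming the pandas engine of Optimus is
# installed; the sample rows are invented for illustration.
from optimus import Optimus

op = Optimus("pandas")

df = op.create.dataframe({
    "text": [
        "Transformers is a media franchise produced by Hasbro and Takara Tomy.",
        "The franchise follows the battles of sentient autonomous robots.",
    ]
})

# Tokenize the "text" column, writing the word lists to a new "tokens" column
print(df.cols.word_tokenize("text", "tokens")["tokens"])

In this sketch the second argument is treated as the output column, so the original text column is left intact and the word lists land in a separate tokens column that later steps can consume.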