Building a bag-of-words model
When working with text documents that contain a large number of words, we need to convert them into some form of numeric representation to make them suitable for machine learning algorithms. These algorithms need numerical data so that they can analyze it and extract meaningful information. This is where the bag-of-words approach comes in. Bag-of-words builds a text model that learns a vocabulary from all the words across the documents. It then models each document by constructing a histogram of the counts of all the words in that document.
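To make the idea concrete before diving into the recipe, here is a minimal, self-contained sketch of a bag-of-words histogram built by hand in plain Python, using made-up toy documents:

```python
from collections import Counter

# Two toy documents (illustrative only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# The vocabulary is the set of all unique words across the documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a histogram: a count for every vocabulary word
for doc in docs:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])
```

Each printed vector has one entry per vocabulary word, so every document is reduced to numbers a learning algorithm can consume.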
How to do it...
- Create a new Python file and import the following packages:
```python
import numpy as np
from nltk.corpus import brown
from chunking import splitter
```
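The `chunking` module with its `splitter` function comes from an earlier recipe on chunking. If that file is not at hand, a minimal stand-in can be inferred from how `splitter` is called later in this recipe (text in, list of fixed-size word chunks out); this sketch is an assumption, not necessarily identical to the earlier recipe:

```python
# chunking.py -- assumed behavior: split the text into chunks of
# roughly num_of_words words each and return them as strings
def splitter(content, num_of_words):
    words = content.split(' ')
    return [' '.join(words[i:i + num_of_words])
            for i in range(0, len(words), num_of_words)]
```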
- Define the `main` function and read the input data from the Brown corpus:
```python
if __name__ == '__main__':
    # Read the first 10,000 words from the Brown corpus
    content = ' '.join(brown.words()[:10000])
```
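If the Brown corpus has not been downloaded before, NLTK raises a `LookupError` at this point; it can be fetched once with:

```python
import nltk
nltk.download('brown')  # one-time download of the Brown corpus
```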
- Split the text content into chunks:
```python
    # Number of words in each chunk
    num_of_words = 2000

    num_chunks = []
    count = 0

    # Split the text into chunks of num_of_words words each
    texts_chunk = splitter(content, num_of_words)
```
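From here, each chunk can be treated as its own document and turned into the word histograms described at the start of this recipe. As one possible continuation, still inside the `main` block, here is a sketch using scikit-learn's `CountVectorizer` (an assumption; scikit-learn is not among the original imports):

```python
    from sklearn.feature_extraction.text import CountVectorizer

    # Treat every chunk as a document and learn the vocabulary
    vectorizer = CountVectorizer()
    doc_term_matrix = vectorizer.fit_transform(texts_chunk)

    # One histogram row per chunk, one column per vocabulary word
    print('Vocabulary size:', len(vectorizer.get_feature_names_out()))
    print('Document-term matrix shape:', doc_term_matrix.shape)
```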