Dividing text using chunking
Chunking refers to dividing the input text into pieces based on an arbitrary condition. This differs from tokenization in that chunking imposes no constraints and the chunks do not need to be meaningful on their own. Chunking is used very frequently in text analysis: when you deal with really large text documents, you need to divide them into chunks before further processing. In this recipe, we will divide the input text into a number of pieces, where each piece contains a fixed number of words.
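To make the idea concrete, here is a minimal illustration; the sample sentence and the chunk size of three words are invented for this example:

text = 'The quick brown fox jumps over the lazy dog'
words = text.split(' ')
# Group the words into chunks of three; the last chunk may be shorter
chunks = [' '.join(words[i:i + 3]) for i in range(0, len(words), 3)]
print(chunks)
# ['The quick brown', 'fox jumps over', 'the lazy dog']

The recipe below builds the same behavior with an explicit loop, which makes it easier to extend later on.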
How to do it…
Create a new Python file, and import the following packages:
import numpy as np
from nltk.corpus import brown
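Note that importing brown only sets up a lazy corpus reader; if the Brown corpus has not been downloaded to your machine yet, the first access to it will raise a LookupError. A one-time download fixes this:

import nltk
nltk.download('brown')  # one-time download of the Brown corpus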
Let's define a function to split the text into chunks. The first step is to split the input text on spaces:
# Split a text into chunks
def splitter(data, num_words):
    words = data.split(' ')
    output = []
Initialize a couple of required variables:
    cur_count = 0
    cur_words = []
Let's iterate through the words:
    for word in words:
        cur_words.append(word)
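The loop needs to count the words it accumulates and emit a chunk whenever the count reaches num_words. Here is a minimal, self-contained sketch of the finished function together with a small driver; the conditional flush of leftover words, the 10,000-word Brown sample, and the chunk size of 1,700 are illustrative choices for this sketch, not requirements:

from nltk.corpus import brown

# Split the input text into chunks of num_words words each
def splitter(data, num_words):
    words = data.split(' ')
    output = []
    cur_count = 0
    cur_words = []
    for word in words:
        cur_words.append(word)
        cur_count += 1
        # Emit a chunk once it reaches the desired size, then reset
        if cur_count == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
            cur_count = 0
    # Flush any leftover words into a final, shorter chunk
    if cur_words:
        output.append(' '.join(cur_words))
    return output

if __name__ == '__main__':
    # Use the first 10,000 words of the Brown corpus as sample data
    data = ' '.join(brown.words()[:10000])
    text_chunks = splitter(data, 1700)
    print('Number of text chunks =', len(text_chunks))

With these numbers, 10,000 words split into chunks of 1,700 words yield six chunks: five full ones and a final chunk of 1,500 words. Flushing the leftovers conditionally avoids appending an empty chunk when the word count divides evenly.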