We need to tokenize the text before we can build Word2Vec models. The context of a sentence (document) is determined by the words in it. Word2Vec models take words rather than whole sentences (documents) as input, so we need to break each sentence into atomic units, creating a token each time white space is encountered. DL4J provides a tokenizer factory that is responsible for creating the tokenizer: the TokenizerFactory generates a tokenizer for a given string. In this recipe, we will tokenize the text data and train the Word2Vec model on it.
Tokenizing data and training the model
How to do it...
- Create a tokenizer factory and set the token preprocessor:
TokenizerFactory tokenFactory = new DefaultTokenizerFactory()...
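To make the idea of whitespace tokenization with a token preprocessor concrete, here is a minimal plain-Java sketch. It does not use DL4J classes and is not the library's implementation: the class name WhitespaceTokenizerSketch, the PREPROCESSOR field, and the cleaning rule (lowercase, then strip every character that is not a letter or digit) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Illustrative stand-in for a tokenizer plus token preprocessor.
// All names here are hypothetical; none of this is DL4J API.
class WhitespaceTokenizerSketch {

    // Assumed preprocessing rule: lowercase the token and drop every
    // character that is not a letter or a decimal digit.
    static final UnaryOperator<String> PREPROCESSOR =
            token -> token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");

    // Splits the sentence on runs of white space, applies the preprocessor
    // to each piece, and keeps only non-empty results.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.trim().split("\\s+")) {
            String cleaned = PREPROCESSOR.apply(raw);
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [word2vec, models, require, words, not, sentences]
        System.out.println(tokenize("Word2Vec models require words, not sentences."));
    }
}
```

The resulting list of cleaned tokens is the kind of word-level input a Word2Vec model consumes. In the recipe itself, the tokenizer factory and its preprocessor play the roles that tokenize and PREPROCESSOR play in this sketch.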