Throughout this chapter, we saw how basic mathematical and information retrieval methods can help us identify how similar or dissimilar two text documents are. We also saw how to extend these methods to any probability distribution, such as the topic distributions produced by topic models – this is particularly handy when we are working with more topics than we can analyze by eye. Summarization is another useful tool we are now exposed to: since it works on the principle of identifying which keywords carry the most information in a passage, we can use this knowledge of keywords to further aid us in building natural language processing pipelines.
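As a quick illustration of comparing probability distributions, here is a minimal sketch of the Hellinger distance, one common metric for measuring how far apart two topic distributions are. The implementation is plain NumPy, and the two document-topic vectors are made-up values, not output from a real topic model:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with completely disjoint support; it is symmetric in p and q.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical topic distributions for two documents over four topics.
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.10, 0.15, 0.60, 0.15]

print(hellinger(doc_a, doc_a))  # identical documents -> 0.0
print(hellinger(doc_a, doc_b))  # dissimilar documents -> closer to 1
```

Because the distance is bounded between 0 and 1, it is easy to interpret at a glance, which is exactly what we want when eyeballing similarity across many topics is no longer practical.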
We will now move on to more advanced topics involving neural networks and deep learning for textual data. These include methods such as Word2Vec and Doc2Vec, as well as shallow and deep neural...