Understanding the main concepts and tools used in NLP
When processing documents, the first analytical step is usually to infer the document language. Most analytical engines used in NLP tasks are, in fact, trained on documents in a specific language and should only be used for that language. Cross-language models (see, for instance, multilingual embeddings such as https://fasttext.cc/docs/en/aligned-vectors.html and https://github.com/google-research/bert/blob/master/multilingual.md) have recently gained popularity, although they still represent a small portion of NLP models. Therefore, it is very common to first infer the language so that you can use the correct downstream NLP pipeline.
You can use different methods to infer the language. One very simple yet effective approach relies on looking for the most common words of a language (the so-called stopwords, such as "the", "and", "be", "to", "of", and so on) and building a score...
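The stopword idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny word lists, the function names language_scores and guess_language, and the scoring rule (the fraction of tokens that are stopwords of each language) are hypothetical choices made here, not necessarily the score the text goes on to define. Real stopword lists, such as those shipped with NLTK or spaCy, are much longer.

```python
import re

# Tiny hand-picked stopword samples (an assumption for illustration only;
# production lists contain dozens to hundreds of words per language).
STOPWORDS = {
    "english": {"the", "and", "be", "to", "of", "in", "that", "it", "is", "for"},
    "italian": {"il", "la", "di", "che", "e", "un", "per", "non", "è", "sono"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu", "mit", "von"},
}


def language_scores(text):
    """Return, for each language, the fraction of tokens that are stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {lang: 0.0 for lang in STOPWORDS}
    return {
        lang: sum(token in words for token in tokens) / len(tokens)
        for lang, words in STOPWORDS.items()
    }


def guess_language(text):
    """Return the highest-scoring language, or None if no stopword matched."""
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("The cat is on the table and it wants to be fed"))
    print(guess_language("Il gatto è sul tavolo e non vuole mangiare"))
```

Normalizing by the number of tokens keeps scores comparable across documents of different lengths. Very short texts, or languages that share many function words, can still be misclassified by a scheme this simple.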