Natural Language Processing (NLP) is a set of machine learning techniques that allow working with text documents, considering their internal structure, and the distribution of words. In this chapter, we're going to discuss all common methods to collect texts, split them into atoms, and transform them into numerical vectors. In particular, we'll compare different methods to tokenize documents (separate each word), to filter them, to apply special transformations to avoid inflected or conjugated forms, and finally to build a common vocabulary. Using the vocabulary, it will be possible to apply different vectorization approaches, to build feature vectors that can easily be used for classification or clustering purposes. To show you how to implement the whole pipeline, at the end of the chapter, we're going to set up a simple...
United States
Great Britain
India
Germany
France
Canada
Russia
Spain
Brazil
Australia
Singapore
Hungary
Ukraine
Luxembourg
Estonia
Lithuania
South Korea
Turkey
Switzerland
Colombia
Taiwan
Chile
Norway
Ecuador
Indonesia
New Zealand
Cyprus
Denmark
Finland
Poland
Malta
Czechia
Austria
Sweden
Italy
Egypt
Belgium
Portugal
Slovenia
Ireland
Romania
Greece
Argentina
Netherlands
Bulgaria
Latvia
South Africa
Malaysia
Japan
Slovakia
Philippines
Mexico
Thailand