The project in this chapter was about finding hidden similarity underneath newsgroups data, be it semantic groups, be it themes, or word clouds. We started with what unsupervised learning does and the typical types of unsupervised learning algorithms. We then introduced unsupervised learning clustering and studied a popular clustering algorithm, k-means, in detail. We also talked about tf-idf as a more efficient feature extraction tool for text data. After that, we performed k-means clustering on the newsgroups data and obtained four meaningful clusters. After examining the key terms in each resulting cluster, we went straight to extracting representative terms among original documents using topic modeling techniques. Two powerful topic modeling approaches, NMF and LDA, were discussed and implemented. Finally, we had some fun interpreting the topics we obtained from both...
United States
Great Britain
India
Germany
France
Canada
Russia
Spain
Brazil
Australia
Singapore
Hungary
Ukraine
Luxembourg
Estonia
Lithuania
South Korea
Turkey
Switzerland
Colombia
Taiwan
Chile
Norway
Ecuador
Indonesia
New Zealand
Cyprus
Denmark
Finland
Poland
Malta
Czechia
Austria
Sweden
Italy
Egypt
Belgium
Portugal
Slovenia
Ireland
Romania
Greece
Argentina
Netherlands
Bulgaria
Latvia
South Africa
Malaysia
Japan
Slovakia
Philippines
Mexico
Thailand