The 20 newsgroups dataset is composed of text taken from newsgroup posts, as its name implies. It was originally collected by Ken Lang and is now widely used for experiments in text applications of machine learning techniques, specifically natural language processing techniques.
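As a minimal sketch of how the dataset might be inspected, the snippet below assumes scikit-learn is installed and uses its `fetch_20newsgroups` loader, which downloads the data on first use. The helper names `category_counts` and `load_newsgroups` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def category_counts(targets, target_names):
    """Count documents per category name, given integer class labels."""
    counts = Counter(target_names[t] for t in targets)
    return dict(counts)


def load_newsgroups(subset="train"):
    """Fetch the dataset through scikit-learn (assumed to be installed).

    The data is downloaded and cached the first time this is called.
    """
    # Imported lazily so category_counts works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(subset=subset)
```

Calling `groups = load_newsgroups()` and then `category_counts(groups.target, groups.target_names)` should give the number of documents in each newsgroup, while `groups.data` holds the raw text of each post.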
Natural language processing (NLP) is a significant subfield of machine learning that deals with the interactions between machine (computer) and human (natural) languages. Natural languages are not limited to speech and conversation; they can also take written and signed forms. The data for NLP tasks can come in different forms, for example, text from social media posts, web pages, or even medical prescriptions, and audio from voice mail, commands to control systems, or even a favorite piece of music or movie. Nowadays, NLP is broadly involved in our daily lives: we cannot live without machine translation; weather forecast scripts are...