Analyzing the data
In this section, we briefly introduce the main techniques that lie behind the social media analysis process and bring intelligence to the data. We also show how to deal with a reasonably large amount of data in our development environment. The problem of scaling to massive datasets is addressed in Chapter 9, Social Data Analytics at Scale - Spark and Amazon Web Services.
Brief introduction to machine learning
The recent growth in the volume of data created by mobile devices and social networks has dramatically increased the need for high-performance computation and new methods of analysis. Historically, large quantities of data (big data) were analyzed with statistical approaches based on sampling and inductive reasoning to derive knowledge from data. More recent developments in artificial intelligence, and more specifically in machine learning, have made it possible not only to deal with large volumes of data, but also to bring tremendous value to businesses and consumers by extracting valuable insights and hidden patterns.
Machine learning is not new. In 1959, Arthur Samuel defined machine learning as:
Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that learn from data and improve as more data becomes available. This approach is similar to a person who increases their knowledge of a subject by reading more and more books about it. There are three main approaches in machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning assumes that we know the output (the label) of each data point. For example, we learn that a car that costs $80,000, has an electric engine, and accelerates from 0 to 100 km/h in 3 seconds is a Tesla; another car, which costs $40,000, has a diesel engine, and accelerates from 0 to 100 km/h in 9.2 seconds, is a Toyota; and so on. Then, when we look for the name of a car that costs $35,000, accelerates from 0 to 100 km/h in 9.8 seconds, and has a diesel engine, the most probable answer is Toyota and not Tesla.
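The car example can be sketched in a few lines of Python. This is only an illustration, assuming a 1-nearest-neighbour rule (one of many possible supervised methods) and a hand-picked feature scaling; the names TRAINING_CARS, to_vector, and predict_brand are hypothetical and do not come from any library.

```python
import math

# Labelled examples from the text: (price in USD, 0-100 km/h time in s, engine) -> brand.
TRAINING_CARS = [
    ((80_000, 3.0, "electric"), "Tesla"),
    ((40_000, 9.2, "diesel"), "Toyota"),
]


def to_vector(car):
    """Turn a (price, acceleration, engine) tuple into comparable numbers."""
    price, seconds, engine = car
    # Assumption: rough manual scaling so that no single feature dominates.
    return (price / 100_000, seconds / 10, 1.0 if engine == "electric" else 0.0)


def predict_brand(car, training=TRAINING_CARS):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    query = to_vector(car)
    _, label = min(training, key=lambda ex: math.dist(query, to_vector(ex[0])))
    return label


# The unlabelled car from the text: $35,000, 9.8 seconds, diesel engine.
prediction = predict_brand((35_000, 9.8, "diesel"))
print(prediction)  # Toyota
```

With only two labelled cars the model is trivial; in practice, the training set would contain many examples per brand.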
Unsupervised learning is used when we do not know the outputs. In the case of cars, we only have the technical specifications: acceleration, price, and engine type. We then group the data points into clusters of similar cars. In our case, the clusters will contain cars with similar prices and engine types. The clusters help us understand the similarities and differences between the cars.
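The clustering idea can be sketched as follows. This is a minimal k-means implementation written for illustration only: the car data is made up, the features are scaled by hand, and the centroids are initialised from the first k points to keep the run deterministic (real implementations usually use a random or k-means++ initialisation).

```python
import math

# Hypothetical, unlabelled cars: (price in USD, 0-100 km/h time in s, engine).
CARS = [
    (80_000, 3.0, "electric"),
    (75_000, 3.5, "electric"),
    (90_000, 2.8, "electric"),
    (40_000, 9.2, "diesel"),
    (35_000, 9.8, "diesel"),
    (38_000, 10.1, "diesel"),
]


def to_vector(car):
    """Scale the raw specifications to comparable numeric features."""
    price, seconds, engine = car
    return (price / 100_000, seconds / 10, 1.0 if engine == "electric" else 0.0)


def k_means(points, k, iterations=20):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # simple deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = k_means([to_vector(car) for car in CARS], k=2)
for car, label in zip(CARS, labels):
    print(label, car)
```

Note that no brand names are given to the algorithm; the two groups emerge only from the similarity of the specifications.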
The third type of machine learning is reinforcement learning, which is used more in artificial intelligence applications. It consists of devising an algorithm that learns how to behave based on a system of rewards. This kind of learning is similar to the natural human learning process. It can be used, for example, to teach an algorithm how to play chess. In the first step, we define the environment: the chess board and all possible moves. Then the algorithm starts by making random moves and earns positive or negative rewards. A positive reward means that the move was successful, and a negative reward means that such moves should be avoided in the future. After thousands of games, the algorithm has learned which sequences of moves tend to lead to a win.
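Chess is far too large for a short example, so the reward mechanism described above is sketched here on a toy environment instead: a corridor of five cells where reaching the right end earns +1 and reaching the left end earns -1. The sketch assumes tabular Q-learning with an epsilon-greedy strategy, which is one common reinforcement learning method; the environment, the function names, and the parameter values are all illustrative choices.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cells 0 and 4 end the game
ACTIONS = ("left", "right")


def step(state, action):
    """Apply a move and return (next_state, reward, game_over)."""
    next_state = state - 1 if action == "left" else state + 1
    if next_state == 0:
        return next_state, -1.0, True      # losing end of the corridor
    if next_state == N_STATES - 1:
        return next_state, 1.0, True       # winning end of the corridor
    return next_state, 0.0, False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values (Q-table) from repeated games."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state, done = rng.choice([1, 2, 3]), False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                       # explore
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])   # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[next_state].values())
            # Move the estimate towards reward + discounted future value.
            q[state][action] += alpha * (
                reward + gamma * best_next - q[state][action]
            )
            state = next_state
    return q


q_table = train()
# Greedy policy: the best known move in every non-terminal cell.
policy = {s: max(ACTIONS, key=lambda a: q_table[s][a]) for s in (1, 2, 3)}
print(policy)
```

The algorithm is never told that moving right is good; it discovers this from the rewards collected over many games.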
In real-life applications, many hybrid approaches are widely used, based on available data and the complexity of problems.
Techniques for social media analysis
Machine learning is a basic tool for adding intelligence to social media data and extracting valuable insights from it. Other widespread concepts used in social media analysis are text analytics, Natural Language Processing, and graph mining.
Text analytics makes it possible to retrieve non-trivial information from textual data, such as brand names, people's names, relationships between words, phone numbers, URLs, hashtags, and so on. Natural Language Processing is more extensive and aims at finding the meaning of a text by analyzing, among other things, its structure, semantics, and concepts.
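A simple form of this kind of extraction can be sketched with regular expressions from the Python standard library. The patterns below are deliberately simplified assumptions (real hashtag and URL rules are more involved), and the sample message and the extract_entities helper are made up for illustration.

```python
import re

# Simplified patterns; production rules for these entities are stricter.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def extract_entities(text):
    """Return the URLs, hashtags, and user mentions found in a message."""
    urls = URL.findall(text)
    # Remove URLs first so that a "#fragment" inside a link is not
    # mistaken for a hashtag.
    without_urls = URL.sub(" ", text)
    return {
        "urls": urls,
        "hashtags": HASHTAG.findall(without_urls),
        "mentions": MENTION.findall(without_urls),
    }


sample = "Loving the new #Tesla, thanks @car_fan! Details: https://example.com/cars #EV"
entities = extract_entities(sample)
print(entities)
```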
Social networks can also be represented as graph structures. Graph mining enables the structural analysis of such networks. Its methods help in discovering relationships, paths, connections, and clusters of people, brands, topics, and so on, in social networks.
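Two of these ideas, paths and clusters, can be sketched with a plain dictionary-based graph. The example assumes an undirected network of made-up users, uses breadth-first search to find the shortest connection between two people, and treats connected components as a very basic notion of cluster; all names are hypothetical.

```python
from collections import deque

# Hypothetical "is connected to" relationships between users.
CONNECTIONS = [("ana", "ben"), ("ben", "carl"), ("carl", "dina"), ("eve", "frank")]


def build_graph(edges):
    """Build an undirected adjacency structure: user -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of users on the path, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def connected_components(graph):
    """Group users so that everyone in a group can reach everyone else in it."""
    seen, components = set(), []
    for node in sorted(graph):
        if node in seen:
            continue
        component, queue = set(), deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


graph = build_graph(CONNECTIONS)
path = shortest_path(graph, "ana", "dina")
components = connected_components(graph)
print(path)
print(components)
```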
The applications of all these techniques will be presented in the following chapters.
Setting up data structure libraries
In our analysis, we will use libraries that provide flexible data structures, such as pandas and sframe. The advantage of sframe over pandas is that it can handle very large datasets that do not fit into RAM. We will also use the pymongo library to pull collected data from MongoDB. The libraries can be installed as follows:
pip3 install pandas sframe pymongo
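Once the libraries are installed, pulling collected data from MongoDB into a pandas data structure could look like the sketch below. The connection URI, the database name social, and the collection name posts are assumptions made for illustration and must be adapted to the actual setup; the two helper functions are hypothetical as well.

```python
import pandas as pd


def documents_to_frame(documents):
    """Convert MongoDB-style documents (dictionaries) into a DataFrame."""
    frame = pd.DataFrame(list(documents))
    # MongoDB adds an internal "_id" field that is rarely needed for analysis.
    return frame.drop(columns=["_id"], errors="ignore")


def load_collection(uri="mongodb://localhost:27017",
                    database="social", collection="posts", limit=1000):
    """Fetch up to `limit` documents from a (hypothetical) collection."""
    # Imported here so that documents_to_frame can be used without a driver.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        cursor = client[database][collection].find().limit(limit)
        return documents_to_frame(cursor)
    finally:
        client.close()


# Usage with in-memory documents, so no running database is required here.
sample_docs = [
    {"_id": 1, "text": "first post", "likes": 3},
    {"_id": 2, "text": "second post", "likes": 5},
]
posts = documents_to_frame(sample_docs)
print(posts)
```

With a MongoDB server running and the names adapted, calling load_collection() would return the same kind of DataFrame built from the stored documents.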
All necessary machine learning libraries will be presented in the corresponding chapters.