Challenges for social media mining
Social media mining is still in its infancy, and its practitioners are learning and developing new approaches. It draws its roots from many fields, such as statistics, machine learning, information retrieval, pattern recognition, and bioinformatics, and these parent fields are not without challenges of their own. The sheer amount of data generated daily is staggering, but current techniques, grounded in the fundamental concepts, theories, and algorithms of those fields, allow for novel data mining solutions and scalable computational models.
In social media theory, people are considered the basic building blocks of a world built on the ground that social media provides. Measuring the interactions between these building blocks and other entities such as sites, networks, and content leads to discoveries about human nature. The knowledge gained from these measurements constitutes the soul of the social worlds. Finding insights in this data, where social relationships play a critical role, can be termed the mining of social media data. This problem has to face not only the basic data mining challenges but also those that emerge from the social-relationship aspect. Some of the important challenges are listed here:
- Big Data: Should we use the taste of a friend of a friend of the person of interest, who studied at one particular college and whose hometown is one particular city, to recommend something to the person of interest? In some applications this might be overkill, and in others this information could yield a very small but differentiating performance increase. Social media data can be mined to great depth. However, using too much of this detail can lead to a problem called overfitting, which is well known in machine learning: the model fits incidental details of the data it was built on rather than patterns that generalize. Using multiple sources of data can hurt overall performance in a similar fashion.
- Sufficiency: Should we restrict ourselves to the person of interest's alma mater and hometown when recommending something, and not use the tastes of his/her friends? Common sense says this is not correct and that we may be missing out on something. This problem is commonly known as underfitting. It can also arise because most social media networks restrict the amount of information that can be accessed in a certain time frame, so sometimes the data is not sufficient to find patterns or generate recommendations.
- Noise removal error: Preprocessing steps are almost always required in any application of data mining. These steps not only make the actual application run faster on the cleaned data, but also improve overall accuracy. Because of the clutter present in most social data, a large amount of noise is always expected, but removing it effectively is tricky: you can end up discarding useful information along with the noise. What counts as noise is subjective and easily misjudged; hence, this step can end up introducing more error into pattern recognition.
- Evaluation dilemma: Because of the sheer size of social media data, it is generally not possible to obtain a properly annotated dataset to train a supervised machine-learning algorithm. Without proper ground truth data, there is no way to judge the accuracy of any off-the-shelf classification algorithm. Since no accuracy measure can be computed without ground truth, only a clustering (unsupervised machine learning) algorithm can be applied. The problem is that such algorithms rely heavily on domain expertise.
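
To make the noise removal error above concrete, here is a minimal Python sketch of a typical cleaning step for a short post. The function name `clean_post`, the `drop_hashtags` flag, and the sample post (including its URL and account name) are all hypothetical and chosen only for illustration; real pipelines will differ. The sketch shows how a rule that treats hashtags as noise can silently discard a signal that changes the meaning of the text.

```python
import re

def clean_post(text, drop_hashtags=True):
    """Tokenize a post after stripping common sources of clutter.

    Hypothetical helper for illustration only, not a standard API.
    """
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove user mentions
    if drop_hashtags:
        text = re.sub(r"#\w+", " ", text)       # treat hashtags as noise
    else:
        text = text.replace("#", "")            # keep the hashtag word itself
    # Lowercase, then replace anything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

# Made-up sample post; the trailing hashtag marks it as sarcastic.
post = "Loving the new phone!!! #not https://example.com/review @brand"

print(clean_post(post))
# ['loving', 'the', 'new', 'phone']
print(clean_post(post, drop_hashtags=False))
# ['loving', 'the', 'new', 'phone', 'not']
```

With hashtags dropped, the cleaned tokens read as plain praise; with the hashtag word kept, the sarcasm marker survives. Neither setting is correct in general, which is why the choice of what counts as noise can itself introduce error.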