Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Learning Social Media Analytics with R
Learning Social Media Analytics with R

Learning Social Media Analytics with R: Transform data from social media platforms into actionable business insights

Arrow left icon
Profile Icon Karthik Ganapathy Profile Icon Sarkar Profile Icon Sharma Profile Icon Raghav Bali
Arrow right icon
S$74.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (4 Ratings)
Paperback May 2017 394 pages 1st Edition
eBook
S$12.99 S$59.99
Paperback
S$74.99
Subscription
Free Trial
Arrow left icon
Profile Icon Karthik Ganapathy Profile Icon Sarkar Profile Icon Sharma Profile Icon Raghav Bali
Arrow right icon
S$74.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (4 Ratings)
Paperback May 2017 394 pages 1st Edition
eBook
S$12.99 S$59.99
Paperback
S$74.99
Subscription
Free Trial
eBook
S$12.99 S$59.99
Paperback
S$74.99
Subscription
Free Trial

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Learning Social Media Analytics with R

Chapter 1. Getting Started with R and Social Media Analytics

The invention of computers, digital electronics, social media, and the Internet have truly ushered us from the industrial age into the information age. The Internet, and more specifically the invention of World Wide Web in the early 1990s, helped people to build an inter-connected universal platform where information can be stored, shared and consumed by anyone with an electronic device capable of connecting to the Web. This has led to the creation of vast amounts of information, ideas and opinions which people, brands, organizations and businesses want to share with everyone around the world. So, social media was born which provides interactive platforms to post content, share ideas, messages and opinions about everything under the sun.

This book will take you on a journey to understand various popular social media, analyzing rich data generated by these media and gaining valuable insights. We will focus on social media which cater to audiences in different forms, like micro-blogging, social networking, software collaboration, news and media sharing platforms. The main objective is to use standardized data access and retrieval techniques using social media application programming interfaces (APIs) to gather data from these websites and apply different data mining, statistical and machine learning, and natural language processing techniques on the data by leveraging the R programming language. This book will provide you with the tools, techniques, and approaches which would help you achieve the same. This introductory chapter will cover several important concepts which would help you get a jumpstart on social media analytics. They are mentioned as follows:

  • Social media – significance and pitfalls
  • Social media analytics – opportunities and challenges
  • Getting started with R
  • Data analytics
  • Machine learning
  • Text analytics

We will look at social media, the various forms of social media which exist today, and how it has impacted our society. This will help us understand the entire scope pertaining to social media analytics and the opportunity presented by it which would be valuable for consumers as well as businesses and brands. Concepts related to analytics, machine learning and text analytics coupled with hands on examples depicting the various features of the R programming language will help you get a grip on essential things which are necessary for the rest of this book. Without further delay, let's get started!

Understanding social media

The Internet and the information age have been responsible for revolutionizing the way we humans interact with each other in the 21st Century. Almost everyone uses some form of electronic communication, be it a laptop, tablet, smartphone or a personal computer. Social media is built upon the concept of platforms where people use computer-mediated communication (CMC) methods to communicate with others. This can range from instant messaging, emails, and chat rooms to social forums and social networking. To understand social media, you need to understand the origins of legacy or traditional media which gradually evolved into social media. Entities like the popular television, newspapers, radio, movies, books and magazines are various ways of sharing and consuming information, ideas and opinions. It's important to remember that social media has not replaced the older legacy based media; they co-exist peacefully together as we use and consume them both in our day-to-day lives.

Legacy media typically follow a one-way communication system. For instance, I can always read a magazine or watch a show on the television or get updated about the news from newspapers, but I cannot voice my opinions or share my ideas using the same media instantly. The communication mechanism in the various forms of social media is a two-way street, where audiences can share information and ideas and others can consume them and voice their own ideas, opinions and feedback on the same, and even share their own content based on what they see. Legacy based media, like radio or television, now use social media to provide a two-way communication mechanism to support their communications, but it's much more seamless in social media where anyone and everyone can share content, communicate with others, freely voice their ideas and opinions on a huge scale.

We can now formally define social media as interactive applications or platforms based on the principles of Web 2.0 and computer-mediated communication, which enable users to be publishers as well as consumers, to create and share ideas, opinions, information, emotions and expressions in various forms. While different and diverse forms of social media exist, they have several key features in common which are mentioned briefly as follows:

  • Web 2.0 Internet based applications or platforms
  • Content is created as well as consumed by users
  • Profiles give users have their own distinct and unique identity
  • Social networks help connect different users, similarly to communities

Indeed social media give users their own unique identity and the freedom to express themselves in their own user profiles. These profiles are maintained as accounts by social media companies. Features like what you see is what you get (WYSIWYG) editors, emoticons, photos and videos help users in creating and sharing rich content. Social networking capabilities enables users to add other users to their own friend or contact lists and create groups and forums where they can share and talk about like-minded interests. The following figure shows us some of the popular social media used today across the globe:

Understanding social media

I am sure you recognize several of these popular social media from their logos, which you must have seen on your own smartphone or on the web. Social media is used in various ways and media can be grouped into distinct buckets by the nature of its usage and its features. We mention several popular social media in the following points, some of which we will be analyzing in the future chapters:

  • Micro-blogging platforms, like Twitter and Tumblr
  • Blogging platforms, like WordPress, Blogger and Medium
  • Instant messaging application, like WhatsApp and Hangouts
  • Networking platforms, like Facebook and LinkedIn
  • Software collaboration platforms, like GitHub and StackOverflow
  • Audio collaboration platforms, like SoundCloud
  • Photo sharing platforms, like Instagram and Flickr
  • Video sharing platforms, like YouTube and Vimeo

This list is not an exhaustive list of social media because there are so many applications and platforms out there. We apologize in advance if we missed out mentioning your favorite social media! The list should clarify the different forms of communication and content sharing mechanisms that are available for users, and that they can leverage any of these social media to share content and connect with other users. We will now discuss some of the key advantages and significance which social media has to offer.

Advantages and significance

Social media has gained immense popularity and importance so that today almost everyone can't stay away from it. Not only is social media a medium for people to express their views, but also a very powerful tool which can be used by businesses to target new and existing customers and increase revenue. We will discuss some of the main advantages of social media as follows:

  • Cost savings: One of the main challenges for businesses is to reach out to their customers or clients through advertising on traditional and legacy based media, which can be expensive. However, social media allows businesses to have branded pages and to post sponsored content and advertisements for a fraction of the cost, thus helping them save costs in the process of increasing visibility.
  • Networking: With social media, you can build your social as well as professional network with people across the globe. This has opened up a myriad of possibilities where people from different continents and countries work together on cutting-edge innovations, share news, talk about their personal experiences, offer advice and share interesting opportunities, which can help develop personalities, careers, and skills.
  • Ease of use: It is quite easy to get started with social media. All you need is to create an account by registering your details in the application or website and within minutes you are ready to go! Besides this, it is quite easy to navigate through any social media website or application without any sophisticated technical skills. You just need an Internet connection and an electronic device, like a smartphone or a computer. Perhaps this could be the reason that a lot of parents and grandparents are now taking to social media to share their moments and connect with their long lost friends.
  • Global audience: With social media, you can also make your content reach out to a global audience across the world. The reason is quite simple: because social media applications are available openly on the web, users all across the world use it. Businesses that engage with customers in different parts of the world have a key advantage to push their promotions and new products and services.
  • Prompt feedback: Businesses and organizations can get prompt feedback on their new product launches and services being used directly from the users. There is much less calling up people asking them about their satisfaction levels. Tweets, posts, videos, comments and many more features exist to give instant feedback to organizations by posting generally, or conversing with them directly on their official social media channels.
  • Grievance redressal: One of the great advantages of social media is that users can now express any sort of grievances or inconveniences, like electricity, water supply or security issues. Most governments and organizations, including law enforcement have public social media channels which can be used for instant notification of grievances.
  • Entertainment: This is perhaps the most popular advantage used to the maximum by most users. Social media provides an unlimited source of entertainment where you can spend your time playing interactive games, watching videos, and participating in competitions with users across the world. Indeed the possibilities of entertainment from social media are endless.
  • Visibility: Anyone can leverage their social media profile to gain visibility in the world. Professional networking platforms like LinkedIn are an excellent way for people to get noticed by recruiters and also for companies to recruit great talent. Even small startups or individuals can develop inventions, build products, or announce discoveries and leverage social media to go viral and gain the necessary visibility which can propel them to the next level.

The significance and importance of social media is quite evident from the preceding points. In today's interconnected world, social media has almost become indispensable and although it might have a lot of disadvantages, including distractions, if we use it for the right reasons, it can indeed be a very important tool or medium to help us achieve great things.

Disadvantages and pitfalls

Even though we have been blowing the trumpet about social media and its significance, I'm sure you are already thinking about pitfalls and disadvantages, which are directly or indirectly caused from social media. We want to cover all aspects of social media including the good and the bad, so let's look at some negative aspects:

  • Privacy concerns: Perhaps one of the biggest concerns with regards to using social media is the lack of privacy. All our personal data, even though often guaranteed to be secure by the social media organizations that host it, have a risk of being illegally accessed. Further, many social media platforms have been accused, time and again, of selling or using users' personal data without their consent.
  • Security issues: Often users enter personal information on their social media profiles which can be used by hackers and other harmful entities to gain insights into their personal lives and use it for their own personal gain. You will have heard, several times in the past, that social media websites have been hacked and personal information from user accounts has been leaked. There are other issues also like users' bank accounts being compromised, and even theft and other harmful actions happening as a result of sensitive information obtained from social media.
  • Addiction: This is relevant to a large percentage of people using social media. Social media addiction is indeed real and a serious concern, especially among the millennials. There are so many forms of social media and you can really get engrossed in playing games, trying to keep up with what everyone is doing, or just sharing moments from your life every other minute. A lot of us tend to check social media websites every now and then, which can be a distraction, especially if you are trying to meet deadlines. There are even a few stories of people accessing social media whilst driving, with fatal results.
  • Negativity: Social media allows you to express yourself freely and this is often misused by people, terrorist, and other extremist groups to spread hateful propaganda and negativity. People often post sarcastic and negative reactions based on their opinions and feelings, which can lead to trolling and racism. Even though there are ways to report such behavior, it is often not enough because it is impossible to monitor a vast social network all the time.
  • Risks: There are several potential risks of leveraging social media for your personal use or business promotions and campaigns. One wrong post can potentially prove to be very costly. Besides this, there is the constant risk of hackers, fraud, security attacks and unwanted spam. Continuous usage of social media and addiction to it also poses a potential health risk. Organizations should have proper social media use policies to ensure that their employees do not end up being unproductive by wasting too much time on social media, and do not leak trade secrets or confidential information on social media.

We have discussed several pitfalls attached to using social media and some of them are very serious concerns. Proper social media usage guidelines and policies should be borne in mind by everyone because social media is like a magnifying glass: anything you post can be used against you or can potentially prove harmful later. Be it extremely sensitive personal information, or confidential information, like design plans for your next product launch, always think carefully before sharing anything with the rest of the world.

However, if you know what you are doing, social media can definitely be used as a proper tool for your personal as well as professional gain.

Social media analytics

We now have a detailed overview of social media, its significance, pitfalls, and various facets. We will now discuss social media analytics and the benefits it offers for data analysts, scientists and businesses in general looking to gather useful insights from social media. Social media analytics, also known as social media mining or social media intelligence, can be defined as the process of gathering data (usually unstructured) from social media platforms and analyzing the data using diverse analytical techniques to extract vital insights, which can be used to make data-driven business decisions. There are lots of opportunities and challenges involved in social media analytics, which we will be discussing in further detail in later sections. An important thing to remember is that the processes involved in social media analytics are usually domain-agnostic and you can apply them on data belonging to any organization or business in any domain.

The most important step in going forward with any social media analytics based workflow or process is to determine the business goals or objectives and the insights that we want to gather from our analyzes. These goals are usually in the form of key performance indicators (KPIs). For instance, the total number of followers, number of likes and shares can be KPIs to measure brand engagement with customers using social media. Sometimes data is not structured and the end objectives are not very concrete. Techniques like natural language processing and text analytics can be leveraged in such cases to extract insights from noisy unstructured text data like understanding the sentiment or mood of customers for a particular service or product and trying to understand the key trends and themes based on customer tweets or posts at any point in time.

A typical social media analytics workflow

We will be analyzing data from diverse social media applications and platforms throughout the course of this book. However, it is essential to have a good grasp of the essential concepts behind any typical analytics process or workflow. While we will be expanding more on data analytics and mining processes later, let us look at a typical social media analytics workflow in the following figure:

A typical social media analytics workflow

From the preceding diagram, we can broadly classify the main steps involved in the analytics workflow as follows:

  • Data access
  • Data processing and normalization
  • Data analysis
  • Insights

We will now briefly expand upon each of these four processes since we will be using them extensively in future chapters.

Data access

For access to social media data, you can usually do it using standard data retrieval methods in two ways.

The first technique is to use official APIs provided by the social media platform or organization itself.

The second technique is to use unofficial mechanisms, like web crawling and scraping. An important point to remember is that crawling and scraping social media websites and using that data for commercial purposes, like selling the data to other organizations, is usually against their terms of service. We will therefore not be using such methods in our book. Besides this, we will be following the necessary politeness policies while accessing social media data using their APIs, so that we do not overload them with too many requests. The data we'll obtain is the raw data which can be further processed and normalized as needed.

Data processing and normalization

The raw data obtained from data retrieval using social media APIs may not be structured and clean. In fact most of the data obtained from social media is noisy, unstructured and often contains unnecessary tokens such as Hyper Text Markup Language (HTML) tags and other metadata. Usually, data streams from social media APIs have JavaScript Object Notation (JSON) response objects, which consist of key value pairs just like the example shown in the following snippet:

{
"user": {
                    "profile_sidebar_fill_color": "DDFFCC",
                    "profile_sidebar_border_color": "BDDCAD",
                    "profile_background_tile": true,
                    "name": "J'onn J'onzz",
                    "profile_image_url": "http://test.com/img/prof.jpg",
                    "created_at": "Tue Apr 07 19:05:07 +0000 2009",
                    "location": "Ox City, UK",
                    "follow_request_sent": null,
                    "profile_link_color": "0084B4",
                    "is_translator": false,
                    "id_str": "2921138"
},
"followers_count": 2452,
"statuses_count": 7311,
"friends_count": 427
}

The preceding JSON object consists of a typical response from the Twitter API showing details of a user profile. Some APIs might return data in other formats, such as Extensible Markup Language (XML) or Comma Separated Values (CSV), and each format needs to be handled properly.

Often social media data contains unstructured textual data which needs additional text pre-processing and normalization before it can be fed into any standard data mining or machine learning algorithm. Text normalization is usually done using several techniques to clean and standardize the text. Some of them are:

  • Text tokenization
  • Removing special characters and symbols
  • Spelling corrections
  • Contraction expansions
  • Stemming
  • Lemmatization

More advanced processing can insert additional metadata to describe the text better, such as adding parts of speech (POS) tags, phrase tags, named entity tags, and so on.

Data analysis

This is the core of the whole workflow, where we apply various techniques to analyze the data: this could be the raw native data itself, or the processed and curated data. Usually the techniques used in analysis can be broadly classified into three areas:

  • Data mining or analytics
  • Machine learning
  • Natural language processing and text analytics

Data mining and machine learning have several overlapping concepts, including the fact that both use statistical techniques and try to find patterns from underlying data. Data mining is more about finding key patterns or insights from data; and machine learning is more about using mathematics, statistics, and even some of these data mining algorithms, to build models to predict or forecast outcomes. While both of these techniques need structured and numeric data to work with, more complex analyzes with unstructured textual data is usually handled in the separate realm of text analytics by leveraging natural language processing which enables us to use several tools, techniques and algorithms to analyze free-flowing unstructured text. We will be using techniques, from these three areas to analyze data from various social media platforms throughout this book. We will cover important concepts from data analytics and text analytics briefly towards the end of this chapter.

Insights

The end results from our workflow are the actual insights which act as facts or concrete data points to achieve the objective of the analysis. This can be anything from a business intelligence report to visualizations such as bar graphs, histograms, or even word or phrase clouds. Insights should be crisp, clear, and actionable so that it can be easy for businesses to take valuable decisions in time by leveraging them.

Opportunities

Based on the advantages of social media, we can derive plentiful opportunities which lie within the scope of social media analytics. You can save a lot of cost involved in targeted advertising and promotions by analyzing your social media traffic patterns. You can see how users engage with your brand or business using social media, for instance, when it is the perfect time to share something interesting, such as a new service, product, or even an interesting anecdote about your company. Based on traffic from different geographies, you can analyze and understand the preferences of users from different parts of the world. Users love it if you publish promotions in their local language, and businesses are already leveraging such capabilities from social media platforms such as Facebook to target users in specific countries based on localized content.

The social media analytics landscape is still young and emerging and has a lot of untapped potential.

Let us understand the potential of social media analytics better by taking a real-world example.

Consider you are running a profitable business with active engagement on various social media channels. How can you use the data generated from social media to know how you are doing and how your competitors are doing? Live data streams from Twitter could be continuously analyzed to get real-time mood, sentiment, emotion, and reactions of people to your products and services. You could even analyze the same for your rival competitors to see when they are launching their commodities and how users are reacting to them. With Facebook, you can do the same and even push localized promotions and advertisements to see if they help in generating better revenue. News portals would give you live feeds of trending news articles and insights into the current state of the economy and current events and help you decide if these are favorable times for a thriving business or should you be preparing for some hard times. Sentiment analysis, concept mining, topic models, clustering, and inference are just a few examples of using analytics on social media. The opportunities are huge—you just need to have a clear objective in mind so that you can use analytics effectively to solve that objective.

Challenges

Before we delve into the challenges associated with social media analytics let us look at the following interesting facts:

  • There are over 300 million active Twitter users
  • Facebook has over 1.8 billion active users
  • Facebook generates 600-700+ terabytes of data daily (and it could be more now)
  • Twitter generates 8-10+ terabytes of data daily
  • Facebook generates over 4 to 5 million posts per minute
  • Instagram generates over 2 million likes per minute

These statistics give you a rough idea about the massive scale of data being generated and consumed in these social media platforms. This leads to some challenges:

  • Big data: Due to the massive amount of data produced by social media platforms, it is sometimes difficult to analyze the complete dataset using traditional analytical methods since the complete data would never fit in memory. Other approaches and tools, such as Hadoop and Spark, need to be leveraged.
  • Accessibility issues: Social media platforms generate a lot of data but getting access to them directly is not always easy. There are rate limits for their official APIs and it's rare to be able to access and store complete datasets. Besides this, each platform has its own terms and conditions, which should be adhered to when accessing their data.
  • Unstructured and noisy data: Most of the data from social media APIs are unstructured, noisy, and have a lot of junk in them. Dealing with data cleaning and processing becomes really cumbersome and often analysts and data scientists end up spending 70% of their time and effort in trying to clean and curate the data for analysis.

These are perhaps the most prevalent challenges when analyzing social media data, amongst many other challenges, that you might face in your social media analytics journey. Let's now get acquainted with the R programming language, which will be useful to us when we are performing our analyzes.

Getting started with R

This section will help you get started with setting up your analysis and development environment and also acquaint you with the syntax, data structures, constructs, and other important concepts related to the R programming language. Feel free to skim through this section if you consider yourself to be a master of R! We will be mainly focusing our attention on the following topics:

  • Environment setup
  • Data types
  • Data structures
  • Functions
  • Controlling code flow
  • Advanced operations
  • Visualizing data
  • Next steps

We will be explaining each construct or concept with hands-on examples and code so that it is easier to understand and you can also learn by doing. Before we dive into further details, let us briefly get to know more about R. R is actually a scripting language but is used extensively for statistical modeling and analysis. The roots of R lie in the S language which was a statistical programming language developed by AT&T. R is a community-driven language and has grown by leaps and bounds over the years. It now has a vast arsenal of tools, frameworks, and packages for processing, analyzing, and visualizing any type of data. Because it's open source, the community posts constant improvements to the base R language, and it introduces extremely powerful R packages capable of performing complex analyzes and visualizations.

R and Python are perhaps the two most popular languages to be used for statistical analysis; and R is often preferred by statisticians, mathematicians, and data scientists because it has more capabilities related to statistical modeling, learning, and algorithms. R is maintained by the Comprehensive R Archival Network (CRAN) and includes all the latest and past versions, binaries, and source code for R, and its packages for different operating systems. Capabilities also exist to connect and interface R with other frameworks including big data frameworks such as Hadoop and Spark, computing platforms, and languages such as Python, Matlab, SPSS, and data interfaces to any possible source such as social media platforms, news portals, the Internet of Things based device data, web traffic data, and so on.

Environment setup

We will be discussing the necessary steps for setting up a proper analysis environment by installing the necessary dependencies around the R ecosystem and also the necessary code snippets, functions, and modules which we will be using across all the chapters. You can refer to any code snippet being used across any chapter from the code files which will be provided for each chapter along with this book. Besides that, you can also access our GitHub repository https://github.com/dipanjanS/learning-social-media-analytics-with-r for necessary code modules, snippets and functions which will be used in the book and adopt them for your own analyzes!

The R language is free and open-source as we mentioned earlier, and is available for all major operating systems. At the time of writing this book, the latest version of R is 3.3.1 (code named Bug in Your Hair) and is available for downloading at https://www.r-project.org/. This link includes detailed steps, but the direct download page can be accessed at https://cloud.r-project.org/ if you are interested. Download the necessary binary distribution based on your operating system of choice and run the executable setup following the necessary instructions for the Windows platform. If you are using Unix or any *nix like environment, you can install it directly from the terminal too if needed.

Once R is installed, you can fire up the R interpreter directly. This has a graphical user interface (GUI) containing an editor where you can write your code and then execute it. We recommend using an Integrated Development Environment (IDE) instead which eases development and helps maintain code in a more structured way. Besides this you can also use it for other capabilities like generating R markdown documents, R notebooks and Shiny Web Applications. We recommend using RStudio which provides a user-friendly interface for working with R. You can download and install it from https://www.rstudio.com/products/rstudio/download3/ which contains installers for various operating systems.

Once installed, you can start RStudio and use R directly from the IDE itself. It usually contains a code editor window at the top and the R interactive interpreter in the bottom. The interactive interpreter is often called a Read-Evaluate-Print-Loop (REPL). The interpreter asks for input, evaluates and instantly returns the output if any in the interpreter window itself. The interface usually shows the > symbol when waiting for any input and often shows the + symbol in the prompt when you enter code which spans multiple lines. Anything in R is usually a vector and outputs are usually returned with square brackets preceding it, like [1] indicating the output is a vector of size one. Comments are used to describe functions or sections of code. We can specify comments by using the # symbol followed by text. A sample execution in the R interpreter is shown in the following code for convenience:

> 10 + 5
[1] 15
> c(1,2,3,4)
[1] 1 2 3 4
> 5 == 5
[1] TRUE
> fruit = 'apple'
> if (fruit == 'apple'){
+     print('Apple')
+ }else{
+     print('Orange')
+ }
[1] "Apple"

You can see various operations being performed in the R interpreter in the preceding code snippet, including some conditional evaluations and basic arithmetic. We will now delve deeper into the various constructs of R.

Data types

There are several basic data types in R for handling different types of data and values:

  • numeric: The numeric data type is used to store real or decimal vectors and is identical to the double data type.
  • double: This data type can store and represent double precision vectors.
  • integer: This data type is used for representing 32-bit integer vectors.
  • character: This data type is used to represent character vectors, where each element can be a string of type character
  • logical: The reserved words TRUE and FALSE are logical constants in the R language and T and F are global variables. All these four are logical type vectors.
  • complex: This data type is used to store and represent complex numbers
  • factor: This type is used to represent nominal or categorical variables by storing the nominal values in a vector of integers ranging from (1…n) such that n is the number of distinct values for the variable. A vector of character strings for the actual variable values is then mapped to this vector of integers
  • Miscellaneous: There are several other types including NA to denote missing values in data, NaN which denotes not a number, and ordered is used for factoring ordinal variables

Common functions for each data type include as and is, which are used for converting data types (typecasting) and checking the data type respectively.

For example, as.numeric(…) would typecast the data or vector indicated by the ellipses into numeric type and is.numeric(…) would check if the data is of numeric type.

Let us look at a few more examples for the various data types in the following code snippet to understand them better:

# typecasting and checking data types
> n <- c(3.5, 0.0, 1.7, 0.0)
> typeof(n)
[1] "double"
> is.numeric(n)
[1] TRUE
> is.double(n)
[1] TRUE
> is.integer(n)
[1] FALSE
> as.integer(n)
[1] 3 0 1 0
> as.logical(n)
[1]  TRUE FALSE  TRUE FALSE

# complex numbers
> comp <- 3 + 4i
> typeof(comp)
[1] "complex"

# factoring nominal variables
> size <- c(rep('large', 5), rep('small', 5), rep('medium', 3))
> size
 [1] "large"  "large"  "large"  "large"  "large"  "small"  "small"  "small"  "small"  "small" 
[11] "medium" "medium" "medium"
> size <- factor(size)
> size
 [1] large  large  large  large  large  small  small  small  small  small  medium medium medium
Levels: large medium small
> summary(size)
 large medium  small 
     5      3      5

The preceding examples should make the concepts clearer. Notice that non-zero numeric values are logically TRUE always, and zero values are FALSE, as we can see from typecasting our numeric vector to logical. We will now dive into the various data structures in R.

Data structures

The base R system has several important core data structures which are extensively used in handling, processing, manipulating, and analyzing data. We will be talking about five important data structures, which can be classified according to the type of data which can be stored and its dimensionality. The classification is depicted in the following table:

Content type

Dimensionality

Data structure

Homogeneous

One-dimensional

Vector

Homogeneous

N-dimensional

Array

Homogeneous

Two-dimensional

Matrix

Heterogeneous

One-dimensional

List

Heterogeneous

N-dimensional

DataFrame

The content type in the table depicts whether the data stored in the structure belongs to the same data type (homogeneous) or can contain data of different data types, (heterogeneous). The dimensionality of the data structure is pretty straightforward and is self-explanatory. We will now examine each data structure in further detail.

Vectors

The vector is the most basic data structure in R and here vectors indicate atomic vectors. They can be used to represent any data in R including input and output data. Vectors are usually created using the c(…) function, which is short for combine. Vectors can also be created in other ways such as using the : operator or the seq(…) family of functions. Vectors are homogeneous, all elements always belong to a single data type, and the vector by itself is a one-dimensional structure. The following snippet shows some vector representations:

> 1:5
[1] 1 2 3 4 5
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
> seq_len(5)
[1] 1 2 3 4 5

You can also assign vectors to variables and perform different operations on them, including data manipulation, mathematical operations, transformations, and so on. We depict a few such examples in the following snippet:

# assigning two vectors to variables
> x <- 1:5
> y <- c(6,7,8,9,10)
> x
[1] 1 2 3 4 5
> y
[1]  6  7  8  9 10

# operating on vectors
> x + y
[1]  7  9 11 13 15
> sum(x)
[1] 15
> mean(x)
[1] 3
> x * y
[1]  6 14 24 36 50
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

# indexing and slicing
> y[2:4]
[1] 7 8 9
> y[c(2,3,4)]
[1] 7 8 9

# naming vector elements
> names(x) <- c("one", "two", "three", "four", "five")
> x
  one   two three  four  five 
    1     2     3     4     5

The preceding snippet should give you a good flavor of what we can do with vectors. Try playing around with vectors and transforming and manipulating data with them!

Arrays

From the table of data structures we mentioned earlier, arrays can store homogeneous data and are N-dimensional data structures unlike vectors. Matrices are a special case of arrays with two dimensions, but more on that later. Considering arrays, it is difficult to represent data higher than two dimensions on the screen, but R can still handle it in a special way. The following example creates a three-dimensional array of 2x2x3:

# create a three-dimensional array
three.dim.array <- array(
    1:12,    # input data
    dim = c(2, 2, 3),   # dimensions
    dimnames = list(    # names of dimensions
        c("row1", "row2"),
        c("col1", "col2"),
        c("first.set", "second.set", "third.set")
    )
)

# view the array
> three.dim.array
, , first.set

     col1 col2
row1    1    3
row2    2    4

, , second.set

     col1 col2
row1    5    7
row2    6    8

, , third.set

     col1 col2
row1    9   11
row2   10   12

From the preceding output, you can see that R filled the data in the column-first order in the three-dimensional array. We will now look at matrices in the following section.

Matrices

We've briefly mentioned that matrices are a special case of arrays with two dimensions. These two dimensions are represented by properties in rows and columns. Just like we used the array(…) function in the previous section to create an array, we will be using the matrix(…) function to create matrices.

The following snippet creates a 4x3 matrix:

# create a matrix
mat <- matrix(
    1:12,   # data
    nrow = 4,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)

# view the matrix
> mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

Thus you can see from the preceding output that we have a 4x3 matrix with 4 rows and 3 columns and we filled in the data in row-wise fashion by using the by row parameter in the matrix(…) function.

The following snippet shows some mathematical operations with matrices which you should be familiar with:

# initialize matrices
m1 <- matrix(
    1:9,   # data
    nrow = 3,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)
m2 <- matrix(
    10:18,   # data
    nrow = 3,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)

# matrix addition
> m1 + m2
     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   17   19   21
[3,]   23   25   27

# matrix transpose
> t(m1)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# matrix product
> m1 %*% m2
     [,1] [,2] [,3]
[1,]   84   90   96
[2,]  201  216  231
[3,]  318  342  366

We encourage you to try out more complex operations using matrices. See if you can find the inverse of a matrix.

Lists

Lists are a special type of vector besides the atomic vector which we discussed earlier. The difference with atomic vectors is that lists are heterogeneous and can hold different types of data such that each element of a list can itself be a list, an atomic vector, an array, matrix, or even a function. The following snippet shows us how to create lists:

# create sample list
list.sample <- list(
    nums = seq.int(1,5),
    languages = c("R", "Python", "Julia", "Java"),
    sin.func = sin
)

# view the list
> list.sample
$nums
[1] 1 2 3 4 5

$languages
[1] "R"      "Python" "Julia"  "Java"  

$sin.func
function (x)  .Primitive("sin")

# accessing individual list elements
> list.sample$languages
[1] "R"      "Python" "Julia"  "Java"  
> list.sample$sin.func(1.5708)
[1] 1

You can see from the preceding snippet that lists can hold different types of elements and accessing them is really easy.

Let us look at a few more operations with lists in the following snippet:

# initializing two lists
l1 <- list(nums = 1:5)
l2 <- list(
    languages = c("R", "Python", "Julia"),
    months = c("Jan", "Feb", "Mar")
)

# check lists and their type
> l1
$nums
[1] 1 2 3 4 5

> typeof(l1)
[1] "list"
> l2
$languages
[1] "R"      "Python" "Julia" 

$months
[1] "Jan" "Feb" "Mar"

> typeof(l2)
[1] "list"

# concatenating lists
> l3 <- c(l1, l2)
> l3
$nums
[1] 1 2 3 4 5

$languages
[1] "R"      "Python" "Julia" 

$months
[1] "Jan" "Feb" "Mar"

# converting list back to a vector
> v1 <- unlist(l1)
> v1
nums1 nums2 nums3 nums4 nums5 
    1     2     3     4     5 
> typeof(v1)
[1] "integer"

Now that we know how lists work, we will be moving on to the last and perhaps most widely used data structure in data processing and analysis, the DataFrame.

DataFrames

The DataFrame is a special data structure which is used to handle heterogeneous data with N-dimensions. This structure is used to handle data tables or tabular data having several observations, samples or data points which are represented by rows, and attributes for each sample, which are represented by columns. Each column can be thought of as a dimension to the dataset or a vector. It is very popular since it can easily work with tabular data, notably spreadsheets.

The following snippet shows us how we can create DataFrames and examine their properties:

# create data frame
df <- data.frame(
  name =  c("Wade", "Steve", "Slade", "Bruce"),
  age = c(28, 85, 55, 45),
  job = c("IT", "HR", "HR", "CS")
)

# view the data frame
> df
   name age job
1  Wade  28  IT
2 Steve  85  HR
3 Slade  55  HR
4 Bruce  45  CS

# examine data frame properties
> class(df)
[1] "data.frame"
> str(df)
'data.frame':      4 obs. of  3 variables:
 $ name: Factor w/ 4 levels "Bruce","Slade",..: 4 3 2 1
 $ age : num  28 85 55 45
 $ job : Factor w/ 3 levels "CS","HR","IT": 3 2 2 1
> rownames(df)
[1] "1" "2" "3" "4"
> colnames(df)
[1] "name" "age" "job"
> dim(df)
[1] 4 3

You can see from the preceding snippet how DataFrames can represent tabular data where each attribute is a dimension or column. You can also perform multiple operations on DataFrames such as merging, concatenating, binding, sub-setting, and so on. We will depict some of these operations in the following snippet:

# initialize two data frames
emp.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    name = c("Wade", "Steve", "Slade", "Bruce"),
    age = c(28, 85, 55, 45)
)
job.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    job = c("IT", "HR", "HR", "CS")
)

# view data frames
> emp.details
  empid  name age
1  e001  Wade  28
2  e002 Steve  85
3  e003 Slade  55
4  e004 Bruce  45
> job.details
  empid job
1  e001  IT
2  e002  HR
3  e003  HR
4  e004  CS

# binding and merging data frames
> cbind(emp.details, job.details)
  empid  name age empid job
1  e001  Wade  28  e001  IT
2  e002 Steve  85  e002  HR
3  e003 Slade  55  e003  HR
4  e004 Bruce  45  e004  CS
> merge(emp.details, job.details, by='empid')
  empid  name age job
1  e001  Wade  28  IT
2  e002 Steve  85  HR
3  e003 Slade  55  HR
4  e004 Bruce  45  CS

# subsetting data frame
> subset(emp.details, age > 50)
  empid  name age
2  e002 Steve  85
3  e003 Slade  55

Now that we have a good grasp of data structures, we will look at concepts related to functions in R in the next section.

Functions

So far we have dealt with various variables and data types and structures for storing data. Functions are just another data type or object in R, albeit a special one which allows us to operate on data and perform actions on data. Functions are useful for modularizing code and separating concerns where needed by dedicating specific actions and operations to different functions and implementing the logic needed for any action inside the function. We will be talking about two types of functions in this section: the built-in functions and the user-defined functions.

Built-in functions

There are several functions which come with the base installation of R and its core packages. You can access these built-in functions directly using the function name and you can get more functions as you install newer packages. We depict operations using a few built-in functions in the following snippet:

> sqrt(7)
[1] 2.645751
> mean(1:5)
[1] 3
> sum(1:5)
[1] 15
> sqrt(1:5)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
> runif(5)
[1] 0.8880760 0.2925848 0.9240165 0.6535002 0.1891892
> rnorm(5)
[1]  1.90901035 -1.55611066 -0.40784306 -1.88185230  0.02035915

You can see from the previous examples that functions such as sqrt(…), mean(…), and sum(…) are built-in and pre-implemented. They can be used anytime in R without the need to define these functions explicitly or load other packages.

User-defined functions

While built-in functions are good, often you need to incorporate your own algorithms, logic, and processes for solving a problem. That is where you need to build you own functions. Typically, there are three main components in the function. They are mentioned as follows:

  • The environment(…) which contains the location map of the defined function and its variables
  • The formals(…) which depict the list of arguments which are used to call the function
  • The body(…) which is used to depict the code inside the function which contains the core logic of the function

Some depictions of user-defined functions are shown in the following code snippet:

# define the function
square <- function(data){
  return (data^2)
}

# inspect function components
> environment(square)
<environment: R_GlobalEnv>

> formals(square)
$data


> body(square)
{
    return(data^2)
}

# execute the function on data
> square(1:5)
[1]  1  4  9 16 25
> square(12)
[1] 144

We can see how user-defined functions can be defined using function(…) and we can also examine the various components of the function, as we discussed earlier, and also use them to operate on data.

Controlling code flow

When writing complete applications and scripts using R, the flow and execution of code is very important. The flow of code is based on statements, functions, variables, and data structures used in the code and it is all based on the algorithms, business logic, and other rules for solving the problem at hand. There are several constructs which can be used to control the flow of code and we will be discussing primarily the following two constructs:

  • Looping constructs
  • Conditional constructs

We will start with looking at various looping constructs for executing the same sections of code multiple times.

Looping constructs

Looping constructs basically involve using loops which are used to execute code blocks or sections repeatedly as needed. Usually the loop keeps executing the code block in its scope until some specific condition is met or some other conditional statements are used. There are three main types of loops in R:

  • for
  • while
  • repeat

We will explore all the three constructs with examples in the following code snippet:

# for loop
> for (i in 1:5) {
+     cat(paste(i," "))
+ }
1  2  3  4  5  

> sum <- 0
> for (i in 1:10){
+     sum <- sum + i
+ }

> sum
[1] 55

# while loop
> n <- 1
> while (n <= 5){
+     cat(paste(n, " "))
+     n <- n + 1
+ }
1  2  3  4  5  

# repeat loop
> i <- 1
> repeat{
+     cat(paste(i, " "))
+     if (i >= 5){
+         break  # break out of the infinite loop
+     }
+     i <- i + 1
+ }
1  2  3  4  5  

An important point to remember here is that, with larger amounts of data, vectorization-based constructs are more optimized than loops and we will cover some of them in the Advanced operations section later.

Conditional constructs

There are several conditional constructs which help us in executing and controlling the flow of code conditionally based on user-defined rules and conditions. This is very useful when we do not want to execute all possible code blocks in a script sequentially but we want to execute specific code blocks if and only if they meet or do not meet specific conditions.

There are mainly four constructs which are used frequently in R:

  • if or if…else
  • if…else if…else
  • ifelse(…)
  • switch(…)

The bottom two are functions compared to the other statements, which use the if, if…else, and if…else if…else syntax. We will look at them in the following code snippet with examples:

# using if
> num = 10
> if (num == 10){
+     cat('The number was 10')
+ }
The number was 10

# using if-else
> num = 5
> if (num == 10){
+     cat('The number was 10')
+ } else{
+     cat('The number was not 10')
+ }
The number was not 10

# using if-else if-else
> if (num == 10){
+     cat('The number was 10')
+ } else if (num == 5){
+     cat('The number was 5')
+ } else{
+     cat('No match found')
+ }
The number was 5

# using ifelse(...) function
> ifelse(num == 10, "Number was 10", "Number was not 10")
[1] "Number was not 10"

# using switch(...) function
> for (num in c("5","10","15")){
+     cat(
+         switch(num,
+             "5" = "five",
+             "7" = "seven",
+             "10" = "ten",
+             "No match found"
+     ), "\n")
+ }
five 
ten 
No match found

From the preceding snippet, we can see that switch(…) has a default option which can return a user-defined value when no match is found when evaluating the condition.

Advanced operations

We can perform several advanced vectorized operations in R, which is useful when dealing with large amounts of data and improves code performance with regards to time taken in executing code. Some advanced constructs in the apply family of functions will be covered in this section, as follows:

  • apply: Evaluates a function on the boundaries or margins of an array
  • lapply: Loops over a list and evaluates a function on each element
  • sapply: A more simplified version of the lapply(…) function
  • tapply: Evaluates a function over subsets of a vector
  • mapply: A multivariate version of the lapply(…) function

Let's look at how each of these functions work in further detail.

apply

As we mentioned earlier, the apply(…) function is used mainly to evaluate any defined function over the margins or boundaries of any array or matrix.

An important point to note here is that there are dedicated aggregation functions rowSums(…), rowMeans(…), colSums(…) and colMeans(…) which actually use apply internally but are more optimized and useful compared to other functions when operating on large arrays.

The following snippet depicts some aggregation functions being applied to a matrix:

# creating a 4x4 matrix
> mat <- matrix(1:16, nrow=4, ncol=4)

# view the matrix
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

# row sums
> apply(mat, 1, sum)
[1] 28 32 36 40
> rowSums(mat)
[1] 28 32 36 40

# row means
> apply(mat, 1, mean)
[1]  7  8  9 10
> rowMeans(mat)
[1]  7  8  9 10

# col sums
> apply(mat, 2, sum)
[1] 10 26 42 58
> colSums(mat)
[1] 10 26 42 58
 
# col means
> apply(mat, 2, mean)
[1]  2.5  6.5 10.5 14.5
> colMeans(mat)
[1]  2.5  6.5 10.5 14.5
 
# row quantiles
> apply(mat, 1, quantile, probs=c(0.25, 0.5, 0.75))
    [,1] [,2] [,3] [,4]
25%    4    5    6    7
50%    7    8    9   10
75%   10   11   12   13

You can see aggregations taking place without the need of extra looping constructs.

lapply

The lapply(…) function takes a list and a function as input parameters. Then it evaluates that function over each element of the list. If the input list is not a list, it is coerced to a list using the as.list(…) function before the final output is returned. All operations are vectorized and we will see an example in the following snippet:

# create and view a list of elements
> l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
> l
$nums
 [1]  1  2  3  4  5  6  7  8  9 10

$even
[1]  2  4  6  8 10

$odd
[1] 1 3 5 7 9

# use lapply on the list
> lapply(l, sum)
$nums
[1] 55

$even
[1] 30

$odd
[1] 25

sapply

The sapply(…) function is quite similar to the lapply(…) function, the only exception is that it will always try to simplify the final results of the computation. Suppose the final result is such that every element is of length 1, then sapply(…) would return a vector. If the length of every element in the result is greater than 1, then a matrix would be returned. If it is not able to simplify the results, then we end up getting the same result as lapply(…). The following example will make things clearer:

# create and view a sample list
> l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
> l
$nums
 [1]  1  2  3  4  5  6  7  8  9 10

$even
[1]  2  4  6  8 10

$odd
[1] 1 3 5 7 9

# observe differences between lapply and sapply
> lapply(l, mean)
$nums
[1] 5.5

$even
[1] 6

$odd
[1] 5
> typeof(lapply(l, mean))
[1] "list"

> sapply(l, mean)
nums even  odd 
 5.5  6.0  5.0
> typeof(sapply(l, mean))
[1] "double"

tapply

The tapply(…) function is used to evaluate a function over specific subsets of input vectors. These subsets can be defined by the users.

The following example depicts the same:

> data <- 1:30
> data
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
> groups <- gl(3, 10)
> groups
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(data, groups, sum)
  1   2   3 
 55 155 255 
> tapply(data, groups, sum, simplify = FALSE)
$'1'
[1] 55

$'2'
[1] 155

$'3'
[1] 255

mapply

The mapply(…) function is used to evaluate a function in parallel over sets of arguments. This is basically a multi-variate version of the lapply(…) function.

The following example shows how we can build a list of vectors easily with mapply(…) as compared to using the rep(…) function multiple times otherwise:

> list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Visualizing data

One of the most important aspects of data analytics is to depict meaningful insights with crisp and concise visualizations. Data visualization is one of the most important aspects of exploratory data analysis, as well as an important medium to present results from any analyzes. There are three main popular plotting systems in R:

  • The base plotting system which comes with the basic installation of R
  • The lattice plotting system which produces better looking plots than the base plotting system
  • The ggplot2 package which is based on the grammar of graphics and produces beautiful, publication quality visualizations

The following snippet depicts visualizations using all three plotting systems on the popular iris dataset:

# load the data
> data(iris)

# base plotting system
> boxplot(Sepal.Length~Species,data=iris, 
+         xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This gives us a set of boxplots using the base plotting system as depicted in the following plot:

Visualizing data
# lattice plotting system
> bwplot(Sepal.Length~Species,data=iris, xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This snippet helps us in creating a set of boxplots for the various species using the lattice plotting system:

Visualizing data
# ggplot2 plotting system
> ggplot(data=iris, aes(x=Species, y=Sepal.Length)) + geom_boxplot(aes(fill=Species)) + 
+     ylab("Sepal Length") + ggtitle("Iris Boxplot") +
+     stat_summary(fun.y=mean, geom="point", shape=5, size=4) + theme_bw()

This code snippet gives us the following boxplots using the ggplot2 plotting system:

Visualizing data

You can see from the preceding visualizations how each plotting system works and compare the plots across each system. Feel free to experiment with each system and visualize your own data!

Next steps

This quick refresher in getting started with R should gear you up for the upcoming chapters and also make you more comfortable with the R eco-system, its syntax and features. It's important to remember that you can always get help with any aspect of R in a variety of ways. Besides that, packages in R are perhaps going to be the most important tool in your arsenal for analyzing data. We will briefly list some important commands for each of these two aspects which may come in handy in the future.

Getting help

R has thousands of packages, functions, constructs and structures! Hence it is impossible for anyone to keep track and remember of them all. Luckily, help in R is readily available and you can always get detailed information and documentation with regard to R, its utilities, features and packages using the following commands:

  • help(<any_R_object>) or ?<any_R_object>: This provides help on any R object including functions, packages, data types, and so on
  • example(<function_name>): This provides a quick example for the mentioned function
  • apropos(<any_string>): This lists all functions containing the any_string term

Managing packages

R, being open-source and a community-driven language, has tons of packages for helping you with your analyzes which are free to download, install and use.

In the R community, the term library is often used interchangeably with the term package.

R provides the following utilities to manage packages:

  • install.packages(…): This installs a package from Comprehensive R Archive Network (CRAN). CRAN helps maintain and distribute the various versions and documentation of R
  • libPaths(…): This adds this library path to R
  • installed.packages(lib.loc=): This lists installed packages
  • update.packages(lib.loc=): This updates a package
  • remove.packages(…): This removes a package
  • path.package(…): This is the package loaded for the session
  • library(…): This loads a package in a script to use its functions or utilities
  • library(help=): This lists the functions in a package

This brings us to the end of our section on getting started with R and we will now look at some aspects of analytics, machine learning and text analytics.

Data analytics

Data analytics is basically a structured process of using statistical modeling, machine learning, knowledge discovery, predictive modeling and data mining to discover and interpret meaningful patterns in the data, and to communicate them effectively as actionable insights which help in driving business decisions. Data analytics, data mining and machine learning are often said to be covering similar concepts and methodologies. While machine learning, being a branch or subset of artificial intelligence, is more focused on model building, evaluation and learning patterns, the end goal of all three processes is the same: to generate meaningful insights from the data. In the next section, we will briefly discuss the industry standard process followed in analytics which is rigorously followed by organizations.

Analytics workflow

Analyzing data is a science and an art. Any analytics process usually has a defined set of steps, which are generally executed in sequence. More than often, analytics being an iterative process, it leads to several of these steps being repeated many times over if necessary. There is an industry standard that is widely followed for data analysis, known as CRISP-DM, which stands for Cross Industry Standard Process for Data Mining. This is a standard data analysis and mining process workflow that describes how to break up any particular data analysis problem into six major stages.

The main stages in the CRISP-DM model are as follows:

  • Business understanding: This is the initial stage that focuses on the business context of the problem, objective or goal which we are trying to solve. Domain and business knowledge are essential here along with valuable insights from subject matter experts in the business for planning out the key objectives and end results that are intended from the data analysis workflow.
  • Data acquisition and understanding: This stage's main focus is to acquire data of interest and understand the meaning and semantics of the various data points and attributes that are present in the data. Some initial exploration of the data may also be done at this stage using various exploratory data analysis techniques.
  • Data preparation: This stage usually involves data munging, cleaning, and transformation. Extract-Transform-Load (ETL) processes of ten come in handy at this stage. Data quality issues are also dealt with in this stage. The final dataset is usually used for analysis and modeling.
  • Modeling and analysis: This stage mainly focuses on analyzing the data and building models using specific techniques from data mining and machine learning. Often, we need to apply further data transformations that are based on different modeling algorithms.
  • Evaluation: This is perhaps one of the most crucial stages. Building models is an iterative process. In this stage, we evaluate the results that are obtained from multiple iterations of various models and techniques, and then we select the best possible method or analysis, which gives us the insights that we need based on our business requirements. Often, this stage involves reiterating through the previous two steps to reach a final agreement based on the results.
  • Deployment: This is the final stage, where decision systems that are based on analysis are deployed so that end users can start consuming the results and utilizing them. This deployed system can be as complex as a real-time prediction system or as simple as an ad-hoc report.

The following figure shows the complete flow with the various stages in the CRISP-DM model:

Analytics workflow

The CRISP-DM model. Source: www.wikipedia.org

In principle, the CRISP-DM model is very clear and concise with straightforward requirements at each step. As a result, this workflow has been widely adopted across the industry and has become an industry standard.

Machine learning

Machine learning is one of the trending buzzwords in the technology domain today, along with big data. Although a lot of it is over-hyped, machine learning (ML) has proved to be really effective in solving tough problems and has been significant in propelling the rise of artificial intelligence (AI). Machine learning is basically an intersection of elements from the fields of computer science, statistics, and mathematics. This includes a combination of concepts from knowledge mining and discovery, artificial intelligence, pattern detection, optimization, and learning theory to develop algorithms and techniques which can learn from and make predictions on data without being explicitly programmed.

The learning here refers to the ability to make computers or machines intelligent, based on the data and algorithms which we provide to them, also known as training the model, so that they start detecting patterns and insights from new data. This can be better understood from the definition of machine learning by the well-known professor Tom Mitchell who said, "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

Let's consider the task (T) of the system is to predict the sales of an organization for the next year. To perform such a task, it needs to rely upon historical sales information. We shall call this experience (E). Its performance (P) is measured on how well it predicts the sales in any given year. Thus, we can generalize that a system has successfully learned how to predict the sales (or task T) if it gets better at predicting it (or improves its performance P), utilizing the past information (or experience E).

Machine learning techniques

Machine-learning techniques are basically algorithms which can work on data and extract insights from it, which can include discovering, forecasting, or predicting trends and patterns. The idea is to build a model using a combination of data and algorithms which can then be used to work on new, previously unseen data and derive actionable insights.

Each and every technique depends on what type of data it can work on and the objective of the problem we are trying to solve. People often get tempted to learn a couple of algorithms and then try to apply them to every problem. An important point to remember here is that there is no universal machine-learning algorithm which fits all problems. The main inputs to machine-learning algorithms are features which are extracted from data using a process known as feature extraction, which is often coupled with another process called feature engineering (or building new features from existing features). Each feature can be described as an attribute of the dataset, such as your location, age, number of posts, shares and so on, if we were dealing with data related to social media user profiles. Machine-learning techniques can be classified into two major types namely supervised learning and unsupervised learning.

Supervised learning

Supervised learning techniques are a subset of the family of machine -learning algorithms which are mainly used in predictive modeling and forecasting. A predictive model is basically a model which can be constructed using a supervised learning algorithm on features or attributes from training data (available data used to train or build the model) such that we can predict using this model on newer, previously unseen data points. Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features such that we can predict the output values for new data based on those relationships which it learned from the dataset used during model training or building.

There are two main types of supervised learning techniques

  • Classification: These algorithms build predictive models from training data where the response variable to be predicted is categorical. These predictive models use the features learnt from training data on new, previously unseen data to predict their class or category labels. The output classes belong to discrete categories. Types of classification algorithms include decision trees, support vector machines, random forests and many more.
  • Regression: These algorithms are used to build predictive models on data such that the response variable to be predicted is numerical. The algorithm builds a model based on input features and output response values of the training data and this model is used to predict values for new data. The output values in this case are continuous numeric values and not discrete categories. Types of regression algorithms include linear regression, multiple regression, ridge regression and lasso regression, among many others.

Unsupervised learning

Unsupervised learning techniques are a subset of the family of machine-learning algorithms which are mainly used in pattern detection, dimension reduction and descriptive modeling. A descriptive model is basically a model constructed from an unsupervised machine learning algorithm and features from input data similar to the supervised learning process. However, the output response variables are not present in this case. These algorithms try to use techniques on the input data to mine for rules, detect patterns, and summarize and group the data points which help in deriving meaningful insights and describe the data better to the users. There is no specific concept of training or testing data here since we do not have any specific relationship mapping between input features and output response variables (which do not exist in this case). There are three main types of unsupervised learning techniques:

  • Clustering: The main objective of clustering algorithms is to cluster or group input data points into different classes or categories using features derived from the input data alone and no other external information. Unlike classification, the output response labels are not known beforehand in clustering. There are different approaches to build clustering models, such as by using centroid based approaches, hierarchical approaches, and many more. Some popular clustering algorithms include k-means, k-medoids, and hierarchical clustering.
  • Association rule mining: These algorithms are used to mine and extract rules and patterns of significance from data. These rules explain relationships between different variables and attributes, and also depict frequent item sets and patterns which occur in the data.
  • Dimensionality reduction: These algorithms help in the process of reducing the number of features or variables in the dataset by computing a set of principal representative variables. These algorithms are mainly used for feature selection.

Text analytics

Text analytics is also often called text mining. This is basically the process of extracting and deriving meaningful patterns from textual data which can in turn be translated into actionable knowledge and insights. Text analytics consist of a collection of machine learning, natural language processing, linguistic, and statistical methods that can be leveraged to analyze text data. Machine-learning algorithms are built to work on numeric data in general, so extra processing and feature extraction and engineering is needed for text analytics to make regular machine learning and statistical methods work on unstructured data.

Natural language processing, popularly known as NLP, aids in doing this. NLP is defined as a specialized field in computer science and engineering and artificial intelligence which has its roots and origins in computational linguistics. Concepts and techniques from NLP are extremely useful and help in building applications and systems that enable interaction between machines and humans with the aid of natural language which is indeed a daunting task. Some of the main applications of NLP are:

  • Question-answering systems
  • Speech recognition
  • Machine translation
  • Text categorization and classification
  • Text summarization

We will be using several concepts from these when we analyze unstructured textual data from social media in the upcoming chapters.

Summary

We've covered a lot in this chapter so I would like to commend your efforts for staying with us till the very end! We kicked off this chapter with a detailed look into social media, its scope, variants, significance and pitfalls. We also covered the basics of social media analytics, as well as the opportunities and challenges involved, to whet your appetite for social media analytics and to get us geared up for the journey we'll be taking throughout the course of this book. A complete refresher of the R programming language was also covered in detail, especially with regard to setting up a proper analytics environment, core structures, constructs and features of R. Finally, we took a quick glance at the basic concepts of data analytics, the industry standard process for analytics, and covered the core essentials of machine learning, text analytics and natural language processing.

We will be looking at analyzing data from various popular social media platforms in future chapters so get ready to do some serious analysis on social media!

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • A practical guide written to help leverage the power of the R eco-system to extract, process, analyze, visualize and model social media data
  • Learn about data access, retrieval, cleaning, and curation methods for data originating from various social media platforms.
  • Visualize and analyze data from social media platforms to understand and model complex relationships using various concepts and techniques such as Sentiment Analysis, Topic Modeling, Text Summarization, Recommendation Systems, Social Network Analysis, Classification, and Clustering.

Description

The Internet has truly become humongous, especially with the rise of various forms of social media in the last decade, which give users a platform to express themselves and also communicate and collaborate with each other. This book will help the reader to understand the current social media landscape and to learn how analytics can be leveraged to derive insights from it. This data can be analyzed to gain valuable insights into the behavior and engagement of users, organizations, businesses, and brands. It will help readers frame business problems and solve them using social data. The book will also cover several practical real-world use cases on social media using R and its advanced packages to utilize data science methodologies such as sentiment analysis, topic modeling, text summarization, recommendation systems, social network analysis, classification, and clustering. This will enable readers to learn different hands-on approaches to obtain data from diverse social media sources such as Twitter and Facebook. It will also show readers how to establish detailed workflows to process, visualize, and analyze data to transform social data into actionable insights.

Who is this book for?

It is targeted at IT professionals, Data Scientists, Analysts, Developers, Machine Learning Enthusiasts, social media marketers and anyone with a keen interest in data, analytics, and generating insights from social data. Some background experience in R would be helpful, but not necessary, since this book is written keeping in mind, that readers can have varying levels of expertise.

What you will learn

  • Learn how to tap into data from diverse social media platforms using the R ecosystem
  • Use social media data to formulate and solve real-world problems
  • Analyze user social networks and communities using concepts from graph theory and network analysis
  • Learn to detect opinion and sentiment, extract themes, topics, and trends from unstructured noisy text data from diverse social media channels
  • Understand the art of representing actionable insights with effective visualizations
  • Analyze data from major social media channels such as Twitter, Facebook, Flickr, Foursquare, Github, StackExchange, and so on
  • Learn to leverage popular R packages such as ggplot2, topicmodels, caret, e1071, tm, wordcloud, twittR, Rfacebook, dplyr, reshape2, and many more
Estimated delivery fee Deliver to Singapore

Standard delivery 10 - 13 business days

S$11.95

Premium delivery 5 - 8 business days

S$54.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : May 26, 2017
Length: 394 pages
Edition : 1st
Language : English
ISBN-13 : 9781787127524
Category :
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Singapore

Standard delivery 10 - 13 business days

S$11.95

Premium delivery 5 - 8 business days

S$54.95
(Includes tracking information)

Product Details

Publication date : May 26, 2017
Length: 394 pages
Edition : 1st
Language : English
ISBN-13 : 9781787127524
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just S$6 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just S$6 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total S$ 224.97
Mastering Machine Learning with R, Second Edition
S$74.99
Machine Learning with R Cookbook, Second Edition
S$74.99
Learning Social Media Analytics with R
S$74.99
Total S$ 224.97 Stars icon
Banner background image

Table of Contents

9 Chapters
1. Getting Started with R and Social Media Analytics Chevron down icon Chevron up icon
2. Twitter – What's Happening with 140 Characters Chevron down icon Chevron up icon
3. Analyzing Social Networks and Brand Engagements with Facebook Chevron down icon Chevron up icon
4. Foursquare – Are You Checked in Yet? Chevron down icon Chevron up icon
5. Analyzing Software Collaboration Trends I – Social Coding with GitHub Chevron down icon Chevron up icon
6. Analyzing Software Collaboration Trends II - Answering Your Questions with StackExchange Chevron down icon Chevron up icon
7. Believe What You See – Flickr Data Analysis Chevron down icon Chevron up icon
8. News – The Collective Social Media! Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(4 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Amazon Customer Jun 11, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
One of the few books covering detailed use-cases and analytics on over 5-6 major social media platforms. I have read several books and blogs on social media mining but the current offerings had been lacking based on either content or problems showcased. This one definitely has a lot of examples and code which helps you get started with hands-on analysis yourself. Only con would be I wish there was an equivalent of this book in Python
Amazon Verified review Amazon
Swati Jul 28, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Well structured content with variety of examples and visualizations. Still going through the chapters. Best part: analysis for social networks is not available in other books/blogs eg stack exchange n foursquare
Amazon Verified review Amazon
Dhiraj Baurisetty Jul 19, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Really good book with a very deep insight into R.... Love to recommend it
Amazon Verified review Amazon
Amazon Customer Jun 11, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This is a good read which covers mining info from multiple social media websites and apps. Really liked the coverage from normal analysis and visualizations to advanced methods like text mining and recommendations. Overall a great book if you are into R and want to analyze data from social websites.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela