Reading Data from Various Data Formats
Data from digital systems is generated in various forms: browsing history on an e-commerce website, clickstream data, a customer's purchase history, social media interactions, footfall in a retail store, images from satellites and drones, and numerous other formats and types of data. We are living in an exciting time when technology is significantly changing lives, and enterprises are leveraging it to shape their next data strategy and make better decisions.
It is not enough to collect a huge amount of different types of data; we also need to extract value from it. CCTV footage captured throughout the day can help government law-and-order teams improve real-time surveillance of public places. The challenge lies in how we process a large volume of heterogeneous data formats within a single system.
Transaction data in a Customer Relationship Management (CRM) application is mostly tabular, whereas a social media feed is mostly text, audio, video, and images.
We can categorize data formats as structured (tabular data such as CSV files and database tables), unstructured (textual data such as tweets, Facebook posts, and Word documents), and semi-structured. Unlike unstructured text, which is hard for machines to process and understand, semi-structured data carries associated metadata, which makes it easier for computers to process. It is widely used by web applications for data exchange, and JSON is an example of a semi-structured data format.
In this section, we will see how to load, process, and transform various data formats in R. Within the scope of this book, we will work with CSV, text, and JSON data.
CSV files are the most common storage and exchange format for structured data. R provides a method called read.csv() for reading data from a CSV file. It reads the data into a data.frame (more about it in the next section). The method takes many arguments; the two most important are the path to the file and sep, which specifies the character that separates the column values (it defaults to a comma). The summary() method reports the six summary statistics: min, first quartile, median, mean, third quartile, and max.
In the following exercise, we'll read a CSV file and summarize its column.
Exercise 2: Reading a CSV File and Summarizing its Column
In this exercise, we will read the previously extracted CSV file and use the summary function to print the min, max, mean, median, first quartile, and third quartile values of the numeric variables and count the categories of the categorical variables.
Carry out these steps to read a CSV file and later summarize its columns:
First, use the read.csv method and load the bank-full.csv into a DataFrame:
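A minimal sketch, assuming bank-full.csv sits in the current working directory and is semicolon-delimited (as in the UCI Bank Marketing dataset); the object name df_bank is illustrative, so adjust the path, separator, and name to match your copy:

    # Read the semicolon-delimited file into a data.frame
    df_bank <- read.csv("bank-full.csv", sep = ";")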
Print the summary of the DataFrame:
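Continuing with the hypothetical df_bank object from the previous step:

    # Six summary statistics for numeric columns; category counts for factors
    summary(df_bank)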
The output is as follows:
JSON is the next most commonly used format for sharing and storing data. It is unlike CSV files, which only deal with rows and columns of data in which every row has the same fixed set of columns. For example, in the e-commerce data of customers, each row could represent a customer, with their information stored in separate columns. If a column has no value for a customer, the field is stored as NULL.
JSON provides the added flexibility of having a varying number of fields for each customer. This flexibility relieves the developer of the burden of maintaining a schema, as we have in traditional relational databases, where the same customer data might be spread across multiple tables to optimize storage and query time.
JSON is more of a key-value style of storage, where all we care about is the keys (such as name, age, and DOB) and their corresponding values. While this sounds flexible, proper care has to be taken, otherwise the data can become hard to manage. Fortunately, with the advent of big data technologies in recent years, many document stores (a subclass of key-value stores), popularly known as NoSQL databases, are available for storing, retrieving, and processing data in such formats.
In the following exercise, the JSON file contains district-wise data for cardamom (a spice and condiment crop) cultivation in Tamil Nadu, India, for the year 2015-16. The keys include area (in hectares), production (in quintals), and productivity (average yield per hectare).
The jsonlite package provides an implementation for reading a JSON file and converting it into a DataFrame, which makes analysis simpler. The fromJSON method reads a JSON file, and if its flatten argument is set to TRUE, it returns a DataFrame.
Exercise 3: Reading a JSON file and Storing the Data in DataFrame
In this exercise, we will read a JSON file and store the data in the DataFrame.
Perform the following steps to complete the exercise:
Download the data from https://data.gov.in/catalog/area-production-productivity-spices-condiments-district-wise-tamil-nadu-year-2015-16.
First, use the following commands to install and load the package required to read the JSON file:
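A sketch using the jsonlite package referenced above; it only needs to be installed once:

    # Install (once) and load the jsonlite package
    install.packages("jsonlite")
    library(jsonlite)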
Next, read the JSON file using the fromJSON method, as illustrated here:
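Assuming the downloaded file has been saved in the working directory under a hypothetical name such as crop_production_tn_2015_16.json:

    # Parse the JSON file; flatten = TRUE simplifies nested structures
    json_data <- fromJSON("crop_production_tn_2015_16.json", flatten = TRUE)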
The second element in the list contains the DataFrame with the crop production values. Retrieve it from json_data and store it as a DataFrame named crop_production:
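As described above, the data is assumed to sit in the second element of the parsed list:

    # Pull the second list element out as a data.frame
    crop_production <- data.frame(json_data[[2]])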
Next, use the following command to rename the columns:
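The column names below (district, area, production, and productivity) are illustrative and based on the keys described earlier; match the number and order of names to the columns actually present in your parsed data:

    # Assign readable column names
    colnames(crop_production) <- c("district", "area_hectare",
                                   "production_quintals", "productivity")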
Now, print the top six rows using the head() function:
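Continuing the sketch:

    # Show the first six rows of the DataFrame
    head(crop_production)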
The output is as follows:
Unstructured data is the language of the web. Social media posts, blogs, web pages, and many other sources of information are textual and untidy, making it hard to extract meaningful information. An increasing amount of research is coming out of the Natural Language Processing (NLP) field, in which computers are becoming better at understanding not only the meaning of a word but also the context in which it is used in a sentence. The rise of chatbots that respond to human queries is one of the most sophisticated forms of understanding textual information.
In R, we will use the tm text mining package to show how to read, process, and retrieve meaningful information from text data. We will use a small sample of the Amazon Fine Food Reviews dataset on Kaggle (https://www.kaggle.com/snap/amazon-fine-food-reviews) for the exercise in this section.
In the tm package, a collection of text documents is called a Corpus. One implementation of Corpus in the tm package is VCorpus (volatile corpus), so named because it is stored in memory for fast processing. To check the metadata of a VCorpus object, we can use the inspect() method. The following exercise uses the lapply method to loop through the first two reviews and cast the text as characters. You will learn more about the apply family of functions in The Apply Family of Functions section.
Exercise 4: Reading a CSV File with Text Column and Storing the Data in VCorpus
In this exercise, we will read a CSV file with the text column and store the data in VCorpus.
Perform the following steps to complete the exercise:
First, let's load the text mining package into the R session so that we can work with the text data:
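Assuming tm is already installed (run install.packages("tm") once if it is not):

    # Load the tm text mining package
    library(tm)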
Now, read the first 10 reviews from the file:
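A sketch assuming the Kaggle file is saved locally under the name Reviews.csv; the nrows argument limits the read to the first 10 rows:

    # Read only the first 10 reviews from the CSV file
    top10_reviews <- read.csv("Reviews.csv", nrows = 10)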
To store the text column in VCorpus, use the following command:
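Assuming the review text is in a column named Text, as in the Kaggle file, and reusing the hypothetical top10_reviews object from the previous step:

    # Wrap the text column in a volatile (in-memory) corpus
    review_corpus <- VCorpus(VectorSource(top10_reviews$Text))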
To inspect the structure of first two reviews, execute the following command:
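Continuing the sketch:

    # Print metadata and a summary of the first two documents
    inspect(review_corpus[1:2])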
The output is as follows:
Using lapply, cast the first review as character and print it:
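For example:

    # Convert the first document back to plain character text
    lapply(review_corpus[1], as.character)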
We will revisit the review_corpus object in a later section to show how to convert unstructured textual information into structured tabular data.
Apart from CSV, text, and JSON, there are numerous other data formats, depending on the source of the data and its usage. R has a rich collection of libraries that help with many formats. R can import not only standard formats (apart from the previous three) such as HTML tables and XML but also formats specific to analytical tools such as SAS and SPSS. This democratization has led to a significant migration of industry experts who were earlier working in proprietary tools (costly and often found only at large corporations) to open source analytical programming languages such as R and Python.