Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Introduction to R for Business Intelligence

You're reading from   Introduction to R for Business Intelligence Profit optimization using data mining, data analysis, and Business Intelligence

Arrow left icon
Product type Paperback
Published in Aug 2016
Publisher Packt
ISBN-13 9781785280252
Length 228 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Jay Gendron Jay Gendron
Author Profile Icon Jay Gendron
Jay Gendron
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Extract, Transform, and Load FREE CHAPTER 2. Data Cleaning 3. Exploratory Data Analysis 4. Linear Regression for Business 5. Data Mining with Cluster Analysis 6. Time Series Analysis 7. Visualizing the Datas Story 8. Web Dashboards with Shiny A. References
B. Other Helpful R Functions C. R Packages Used in the Book
D. R Code for Supporting Market Segment Business Case Calculations

Extracting data from sources

Now we get to work. The aim of this book is to design, develop, and deliver a business intelligence product-a data product. In this section, you will explore two extraction methods to import data from different types of sources:

  • Importing CSV and other file formats
  • Importing data from relational databases

Depending on your background, you may be more or less familiar with both types of extraction methods. Here, you will learn or refresh your knowledge of both in order to have a more complete working understanding of ETL.

Importing CSV and other file formats

You can load the Bike Sharing data file into the R environment by using the read.csv() function. It is an easy and commonly used way to load CSV files:

bike <- read.csv("{filename}.csv") 

Calling this function will read the file as long as it is located in your R working directory. The working directory is the space R uses, much like a home directory. There are two commands that you can use to check and change your working directory:

  • getwd(): This will return the current working directory as a path in the R console
  • setwd(<path>): This is used in the console to change the working directory to <path> you pass in the function

If you try to read a file that is not in your working directory, you will see an error message. Some analysts manage data in a separate data directory, one level under their working directory. In this case, you can add a path in front of the filename. The following example shows how the data file is located one layer down in a data directory. Adding the ./data/ string to the front of the filename will allow R to access the data:

bike <- read.csv("./data/Ch1_bike_sharing_data.csv") 
str(bike) 

The str() function is not required to import the data, but it does provide a confirmation that the data was read into the R environment. It also provides you with a quick look at the structure of the dataset, its dimensions, and the names of the variables:

'data.frame':  17379 obs. of  12 variables:
 $ datetime  : Factor w/ 17379 levels "1/1/2011 0:00",..: 1 2 13  ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...

The read.csv() function uses default parameters that are consistent with CSV files, such as using a comma (,) as a separator. The R environment has a more flexible read option for instances when you may have tab-delimited data or TXT file. This option is the read.table() function:

bike <- read.table("./data/Ch1_bike_sharing_data.csv", sep = ",", 
                   header = TRUE)

This function performs identically to the read.csv() function. In fact, read.csv() is built on read.table() and uses sep = "," and header = TRUE as default parameters.

Tip

Downloading the example code

Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look.

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Introduction-To-R-For-Business-Intelligence . We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/ . Check them out!

Importing data from relational databases

You can also use R to access the data stored in many relational databases through the Open Database Connectivity (ODBC) application programming interface. In R, you do this using the RODBC package. This package provides you with an advantage for legacy data. By establishing an R connection to relational databases at Bike Sharing, LLC, you can reuse existing SQL-type queries that may already exist in the BI group.

Imagine that the Bike Sharing dataset was not a CSV file, but it was instead stored in a relational database. You can access the data by loading the RODBC package and establishing a connection to the database using the following parameters: data source name, user ID, and password. In this notional example, the user Paul would like to access the data source ourDB using his password R4BI:

library(RODBC) 
connection <- odbcConnect(dsn = "ourDB", uid = "Paul", pwd = "R4BI") 

Note

As you follow along with this code, you will only be able to run the odbcConnect() function if you have configured a database source. Advanced users may try this using an Excel file. A tutorial for Windows-based environments is provided by Cyberfella LTD (2013) at http://www.cyberfella.co.uk/2012/06/11/windows7-odbc/ .

Having established a connection to your database, you can use R to write a SQL query, pass this query to the sqlQuery() function, and read the data into a bike data frame. You learned that there is a table in the database called marketing:

query <- "SELECT * FROM marketing" 
bike <- sqlQuery(connection, query) 
close(connection) 

The close() function closes the connection between R and the database. Doing this will free any memory consumed by the open connection. It is a good practice to follow.

Tip

R tip: One thing you can do when establishing connections is to add a log to record your ETL processes. In the event something goes wrong, the log will capture the issue and help you find the problem quickly. The log4r package is specifically written for this application. For more information, visit the repository at https://github.com/johnmyleswhite/log4r .

There are also non-relational databases in the work place. Non-relational databases store documents, key value data, and unstructured data. Unlike the relational database approach, they contain different data structures grouped into nodes, clusters, or other schemas. The R community has created various packages to connect with non-relational databases and distributive filing systems. You can connect to MongoDB using the rmongodb package and to Spark distributed computing using the SparkR package.

Tip

BI tip: By looking at the database approach used within an organization and conceptualizing the data, a business analyst can better understand business processes to improve the quality of analysis and meet organizational needs.

You have been reading a chapter from
Introduction to R for Business Intelligence
Published in: Aug 2016
Publisher: Packt
ISBN-13: 9781785280252
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image