Extracting data from sources
Now we get to work. The aim of this book is to design, develop, and deliver a business intelligence product: a data product. In this section, you will explore two extraction methods to import data from different types of sources:
- Importing CSV and other file formats
- Importing data from relational databases
Depending on your background, you may be more or less familiar with both types of extraction methods. Here, you will learn or refresh your knowledge of both in order to have a more complete working understanding of ETL.
Importing CSV and other file formats
You can load the Bike Sharing data file into the R environment by using the read.csv() function. It is an easy and commonly used way to load CSV files:
bike <- read.csv("Ch1_bike_sharing_data.csv")
Calling this function will read the file as long as it is located in your R working directory. The working directory is the space R uses, much like a home directory. There are two commands that you can use to check and change your working directory:
- getwd(): This will return the current working directory as a path in the R console
- setwd(<path>): This is used in the console to change the working directory to the <path> you pass to the function
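As a quick sketch of those two commands (the project path below is hypothetical; substitute your own folder), checking and then changing the working directory looks like this:

```r
getwd()                           # prints the current working directory, e.g. "/home/analyst"
setwd("~/projects/bike_sharing")  # hypothetical path; point this at your own project folder
getwd()                           # confirms the new location
```

Running getwd() again after setwd() is a simple way to confirm the change took effect before you try to read any files.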
If you try to read a file that is not in your working directory, you will see an error message. Some analysts manage data in a separate data directory, one level under their working directory. In this case, you can add a path in front of the filename. The following example shows how to read the data file when it is located one layer down, in a data directory. Adding the ./data/ string to the front of the filename will allow R to access the data:
bike <- read.csv("./data/Ch1_bike_sharing_data.csv")
str(bike)
The str() function is not required to import the data, but it does provide a confirmation that the data was read into the R environment. It also provides you with a quick look at the structure of the dataset, its dimensions, and the names of the variables:
'data.frame': 17379 obs. of 12 variables:
 $ datetime  : Factor w/ 17379 levels "1/1/2011 0:00",..: 1 2 13 ...
 $ season    : int 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num 9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num 14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int 81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num 0 0 0 0 0 ...
 $ casual    : int 3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int 16 40 32 13 1 1 2 3 8 14 ...
The read.csv() function uses default parameters that are consistent with CSV files, such as using a comma (,) as a separator. The R environment has a more flexible read option for instances when you may have tab-delimited data or a TXT file. This option is the read.table() function:
bike <- read.table("./data/Ch1_bike_sharing_data.csv", sep = ",", header = TRUE)
This function performs identically to the read.csv() function. In fact, read.csv() is built on read.table() and uses sep = "," and header = TRUE as default parameters.
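To sketch the more general case, the same function reads a tab-delimited text file by changing the separator. The filename below is hypothetical, standing in for a tab-delimited export of the same data:

```r
# Hypothetical tab-delimited version of the Bike Sharing data
bike_txt <- read.table("./data/bike_sharing_data.txt",
                       sep = "\t",      # tab character as the field separator
                       header = TRUE)   # first row holds the variable names
```

The only change from the CSV call is the sep argument, which is what makes read.table() the flexible, general-purpose reader.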
Tip
Downloading the example code
Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Introduction-To-R-For-Business-Intelligence. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Importing data from relational databases
You can also use R to access the data stored in many relational databases through the Open Database Connectivity (ODBC) application programming interface. In R, you do this using the RODBC package. This package provides you with an advantage for legacy data. By establishing an R connection to relational databases at Bike Sharing, LLC, you can reuse existing SQL-type queries that may already exist in the BI group.
Imagine that the Bike Sharing dataset was not a CSV file, but was instead stored in a relational database. You can access the data by loading the RODBC package and establishing a connection to the database using the following parameters: data source name, user ID, and password. In this notional example, the user Paul would like to access the data source ourDB using his password R4BI:
library(RODBC)
connection <- odbcConnect(dsn = "ourDB", uid = "Paul", pwd = "R4BI")
Note
As you follow along with this code, you will only be able to run the odbcConnect() function if you have configured a database source. Advanced users may try this using an Excel file. A tutorial for Windows-based environments is provided by Cyberfella LTD (2013) at http://www.cyberfella.co.uk/2012/06/11/windows7-odbc/.
Having established a connection to your database, you can use R to write a SQL query, pass this query to the sqlQuery() function, and read the data into a bike data frame. You learned that there is a table in the database called marketing:
query <- "SELECT * FROM marketing"
bike <- sqlQuery(connection, query)
close(connection)
The close() function closes the connection between R and the database. Doing this will free any memory consumed by the open connection. It is a good practice to follow.
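One way to guarantee the connection is closed even when the query fails is to wrap the steps above in a function and register the cleanup with on.exit(). This is a sketch reusing the notional ourDB credentials from the example; the fetch_marketing name is made up for illustration:

```r
library(RODBC)

fetch_marketing <- function() {
  connection <- odbcConnect(dsn = "ourDB", uid = "Paul", pwd = "R4BI")
  on.exit(close(connection))  # runs when the function exits, even after an error
  sqlQuery(connection, "SELECT * FROM marketing")
}

bike <- fetch_marketing()
```

Because on.exit() fires on both normal and error exits, a malformed query can no longer leave an open connection holding memory.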
Tip
R tip: One thing you can do when establishing connections is to add a log to record your ETL processes. In the event something goes wrong, the log will capture the issue and help you find the problem quickly. The log4r package is specifically written for this application. For more information, visit the repository at https://github.com/johnmyleswhite/log4r.
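A minimal logging sketch, assuming the classic create.logger() interface from that repository; the log filename and messages here are only examples:

```r
library(log4r)

# Write INFO-and-above messages to a local log file (example filename)
logger <- create.logger(logfile = "etl.log", level = "INFO")

info(logger, "Starting extraction from ourDB")
# ... run your connection and query steps here ...
error(logger, "Example failure message captured for later review")
```

Dropping a pair of calls like these around each ETL step gives you a timestamped trail to search when a nightly load breaks.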
There are also non-relational databases in the workplace. Non-relational databases store documents, key-value data, and unstructured data. Unlike the relational database approach, they contain different data structures grouped into nodes, clusters, or other schemas. The R community has created various packages to connect with non-relational databases and distributed file systems. You can connect to MongoDB using the rmongodb package and to Spark distributed computing using the SparkR package.
Tip
BI tip: By looking at the database approach used within an organization and conceptualizing the data, a business analyst can better understand business processes to improve the quality of analysis and meet organizational needs.