You're reading from R Data Mining

Product type Book

Published in Nov 2017

Publisher Packt

ISBN-13 9781787124462

Pages 442 pages

Edition 1st Edition

R's points of strength

You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular?

Looking at the root causes of R's popularity, we have to mention these three:

Open source inside
Plugin ready
Data visualization friendly

Open source inside

One of the main reasons the adoption of R is spreading is its open source nature. R source code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released under the GNU General Public License (GPL), meaning that you can take it and use it for whatever purpose; but if you distribute a derivative work, you have to release it under the GPL as well.

These attributes suit almost every target user of a statistical analysis language:

Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes
Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true
Private user: This user enjoys both of the benefits already mentioned, gaining a free instrument with which to learn and to share their own statistical analyses

Plugin ready

You could imagine the R language as an expandable board game. Games like 7 Wonders or Carcassonne come with a base set of characters and places, plus optional expansions that increase the choices at your disposal and maximize the fun.

There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to filesystem manipulation, statistical analysis, and data visualization.
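As a rough illustration of these three areas, the following sketch uses only functions that ship with a standard R installation. The file name and the figures are made up for the example and do not come from the book.

```r
# Packages with priority "base" are delivered with every R installation
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Filesystem manipulation: write a small CSV to a temporary folder and read it back
tmp_file <- file.path(tempdir(), "expenses_example.csv")  # hypothetical file name
expenses <- data.frame(month = 1:6,
                       amount = c(120, 95, 130, 110, 150, 105))  # made-up data
write.csv(expenses, tmp_file, row.names = FALSE)
expenses_read <- read.csv(tmp_file)

# Statistical analysis: basic summary statistics
avg_amount <- mean(expenses_read$amount)
sd_amount  <- sd(expenses_read$amount)

# Data visualization: a simple plot from the graphics package
plot(expenses_read$month, expenses_read$amount, type = "b",
     xlab = "Month", ylab = "Amount", main = "Monthly expenses (made-up data)")

# Clean up the temporary file
file.remove(tmp_file)
```

None of these calls requires installing anything beyond R itself.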

While this base version is regularly maintained and updated by the R core team, virtually every R user can add new functionalities to those available within the base version by developing and sharing custom packages.

This is basically how the package development and sharing flow works:

The R user develops a new package, for example a package implementing a new machine learning algorithm described in a freshly published academic paper.
The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R-related documents and packages.

Every R user can then gain access to the additional features introduced by any given package by installing and loading it into their R environment. If the package has been submitted to CRAN, installing and loading it takes just the following two lines of R code (similar commands are available for alternative repositories such as Bioconductor):

install.packages("ggplot2")
library(ggplot2)

As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide a range of functionalities user-developed packages add.
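In scripts that are shared with other people, a common pattern is to install a package only when it is missing. The sketch below assumes a helper named ensure_package, which is hypothetical and not part of R; the Bioconductor lines are left commented out because they need an internet connection, and the package name limma is just an example.

```r
# Hypothetical helper (not part of R): install a CRAN package only if it is
# missing, then report whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

ensure_package("stats")      # shipped with R, so nothing gets installed
# ensure_package("ggplot2")  # would install ggplot2 from CRAN if missing

# Bioconductor packages are installed through the BiocManager package instead:
# install.packages("BiocManager")
# BiocManager::install("limma")
```

The helper returns TRUE when the package is available after the call, so a script can stop early with a clear message when an installation fails.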

At the time of writing (2017), more than 9,000 packages were available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques that can be used to display effectively the information and messages contained within a set of data.

Since we are living in an information-heavy age, the ability to communicate complex messages effectively and concisely through data visualization is a core asset for any professional. This is exactly why R is being so well received in academic and professional fields: its data visualization capabilities place it at the cutting edge of these fields.

R has been noticed for its data visualization features right from its beginning; when some of its peers still drew x axes by stringing together + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization tool came when Auckland's Hadley Wickham developed the famous ggplot2 package, based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks.

This package alone introduced the R community to a highly flexible way of producing almost every kind of data visualization. It was also designed as an expandable tool, so that new data visualization techniques can be incorporated as soon as they emerge. Finally, ggplot2 lets you customize your plot extensively, adding every kind of graphical or textual annotation to it.
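To give a feel for this layered approach, here is a minimal sketch, assuming ggplot2 is already installed. It uses mtcars, a small dataset bundled with R, purely as example data; the title and annotation text are made up for the illustration.

```r
library(ggplot2)

# Each "+" adds a layer or a setting to the plot object
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                             # layer 1: raw points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # layer 2: linear trend
  annotate("text", x = 4.5, y = 30,
           label = "Heavier cars travel fewer miles per gallon") +  # text annotation
  labs(title = "Fuel efficiency versus weight",
       x = "Weight (1,000 lbs)", y = "Miles per gallon")

print(p)
```

Because the plot is an ordinary R object, further layers or annotations can be added to p later without rewriting the code that created it.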

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as The Economist and The New York Times to visualize their data and convey their information to their stakeholders and readers.

To sum all this up: should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.