You're reading from R Data Mining Implement data mining techniques through practical use cases and real-world datasets

Product type Paperback

Published in Nov 2017

Publisher Packt

ISBN-13 9781787124462

Length 442 pages

Edition 1st Edition

Author (1):

Andrea Cirillo

R's points of strength

You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular?

Looking at the root causes of R's popularity, we definitely have to mention these three:

Open source inside
Plugin ready
Data visualization friendly

Open source inside

One of the main reasons the adoption of R is spreading is its open source nature. R's source code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released under the GNU General Public License (GPL), meaning that you can take it and use it for whatever purpose; but if you distribute a derivative work, you have to release it under the GPL as well.

These attributes suit almost every kind of user of a statistical analysis language:

Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes
Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true
Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses

Plugin ready

You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, which come with a base set of characters and places plus optional expansions that increase the choices at your disposal and maximize the fun.

There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to filesystem manipulation, statistical analysis, and data visualization.
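To make these three areas concrete, here is a minimal sketch that uses only functions shipped with a standard R installation. The data frame and the variable names (`tmp`, `df`, `fit`) are made up for illustration.

```r
# Filesystem manipulation: write a small CSV to a temporary file and read it back.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = (1:10)^2), tmp, row.names = FALSE)
df <- read.csv(tmp)

# Statistical analysis: a summary statistic and a simple linear model.
mean(df$y)
fit <- lm(y ~ x, data = df)
coef(fit)

# Data visualization: a base-graphics scatter plot with the fitted line on top.
plot(df$x, df$y, main = "y against x")
abline(fit)

# Clean up the temporary file.
unlink(tmp)
```

None of these calls requires installing anything beyond R itself.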

While this base version is regularly maintained and updated by the R core team, virtually every R user can add new functionalities to those available within the base version by developing and sharing custom packages.

This is basically how the package development and sharing flow works:

The R user develops a new package, for example a package introducing a new machine learning algorithm exposed within a freshly published academic paper.
The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R-related documents and packages.

Every R user can gain access to the additional features introduced with any given package by installing and loading it into their R environment. If the package has been submitted to CRAN, installing and loading it takes just the following two lines of R code (similar commands are available for alternative repositories such as Bioconductor):

install.packages("ggplot2")
library(ggplot2)

As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide a range of functionalities R users have added through additional packages.
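In scripts that are run repeatedly, a common pattern is to install a package only when it is not already available. The following is a minimal sketch of that idea; `ensure_package` is a hypothetical helper written for this example and is not part of R, while `requireNamespace()` and `install.packages()` are standard R functions. It assumes a CRAN mirror is configured whenever an installation is actually needed.

```r
# Hypothetical helper (not part of R): install a CRAN package only if it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Reached only when the package is not installed yet.
    install.packages(pkg)
  }
  # Report whether the package can now be loaded.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so nothing is downloaded here.
ensure_package("stats")
```

The same call with "ggplot2" as the argument would trigger a download from CRAN the first time and do nothing on later runs.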

More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques employable to effectively display the information and messages contained within a set of data.

Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields.

R has been noted for its data visualization features right from its beginning; when some of its peers still drew x axes by stringing together + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization tool came when Auckland's Hadley Wickham developed the well-known ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks.

This package alone introduced the R community to a highly flexible way of producing almost every kind of plot. It was also designed as an expandable tool, so that new data visualization techniques can be incorporated as soon as they emerge. Finally, ggplot2 lets you customize your plot extensively, adding every kind of graphical or textual annotation to it.
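As a rough illustration of this layered approach, the sketch below builds a plot from the `mtcars` dataset bundled with R. The choice of variables, the annotation text, and its position are assumptions made for this example, and the code requires ggplot2 to be installed.

```r
library(ggplot2)

# mtcars ships with R; the variables plotted here are chosen only for illustration.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # First layer: one point per car, coloured by number of cylinders.
  geom_point(aes(colour = factor(cyl))) +
  # Second layer: a linear trend line without the confidence band.
  geom_smooth(method = "lm", se = FALSE) +
  # Third layer: a free-text annotation placed at chosen coordinates.
  annotate("text", x = 4.5, y = 30, label = "Heavier cars, lower mpg") +
  # Titles and axis labels.
  labs(title = "Fuel consumption against weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

Each `+` adds one component to the plot, which is what makes it easy to extend the chart with new layers or annotations.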

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as The Economist and The New York Times to visualize their data and convey their information to their stakeholders and readers.

To sum all this up: should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.