Introduction to data science

The term data science, as mentioned earlier, was first proposed in the 1960s and 1970s by Peter Naur. In the late 1990s, Jeff Wu, while at the University of Michigan, Ann Arbor, revisited the term in a formal paper titled Statistics = Data Science?. The paper, which Prof. Wu subsequently presented at the seventh P.C. Mahalanobis Memorial Lecture at the Indian Statistical Institute in 1998, raised some interesting questions about what an appropriate definition of statistics might be, in light of the work statisticians do beyond numerical calculation.

In the paper, Prof. Wu highlighted the concept of a statistical trilogy consisting of data collection, data modeling and analysis, and problem solving. In the sections on future directions, Dr. Wu raised the prospect of neural network models for capturing complex, non-linear relationships, the use of cross-validation to improve model performance, and the mining of large-scale data, among other topics.

Although written more than 20 years ago, the paper reflects the foresight that a few academics such as Dr. Wu had at the time; what it proposed has since been realized almost verbatim, both in thought and in practice. A copy of Dr. Wu's paper is available at https://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf.

Key components of data science

The practice of data science requires the application of three distinct disciplines to uncover insights from data. These disciplines are as follows:

  • Computer science
  • Predictive analytics
  • Domain knowledge

The following diagram shows the core components of data science:

Computer science

When performing data science on large datasets, the practitioner may spend a fair amount of time cleansing and curating the data. In fact, it is not uncommon for data scientists to spend the majority of their time preparing data for analysis: the generally accepted distribution of time for a data science project is 80% spent on data management and the remaining 20% on the actual analysis.

While this may sound overly general, the growth of big data, that is, large-scale datasets usually in the range of terabytes, means that considerable time and effort must go into extracting and preparing data before the actual analysis takes place. Real-world data is seldom perfect: issues range from missing values to incorrect entries and other deficiencies, and the sheer size of the datasets poses a formidable challenge of its own.
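
As a minimal sketch of what this cleansing can look like in base R, the following example (using a small, hypothetical data frame) standardizes inconsistent text entries and handles missing values:

    # Hypothetical messy data: inconsistent labels and missing values
    df <- data.frame(
      region = c("north", "North ", "SOUTH", NA),
      sales  = c(100, NA, 250, 300)
    )

    # Standardize the text column: trim whitespace, normalize case
    df$region <- tolower(trimws(df$region))

    # Impute missing sales with the column median; drop rows with no region
    df$sales[is.na(df$sales)] <- median(df$sales, na.rm = TRUE)
    df <- df[!is.na(df$region), ]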

Technologies such as Hadoop, Spark, and NoSQL databases have addressed the data science community's need to manage and curate terabytes, if not petabytes, of information. These tools usually constitute the first step in the overall data science process, preceding the application of algorithms to the datasets using languages such as R and Python.

Hence, as a first step, the data scientist should generally be capable of working with datasets using contemporary tools for large-scale data mining. For instance, if the data resides in a Hadoop cluster, the practitioner must be able and willing to perform the work necessary to retrieve and curate the data from the source systems.
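
To make this concrete, one common option for reaching data in a Spark cluster from R is the sparklyr package. The sketch below is illustrative only; the connection target and the prescriptions table name are assumptions, not part of any particular setup:

    library(sparklyr)
    library(dplyr)

    # Connect to a local Spark instance; in practice, point master at your cluster
    sc <- spark_connect(master = "local")

    # Reference a table registered in Spark (hypothetical name), filter it
    # remotely, and collect only the reduced result into R
    prescriptions <- tbl(sc, "prescriptions") %>%
      filter(quarter == "2018Q3") %>%
      collect()

    spark_disconnect(sc)

Filtering before collect() matters: the heavy lifting happens in the cluster, and only the rows of interest travel into the R session.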

Second, once the data has been retrieved and curated, the data scientist should be aware of the computational requirements of the algorithm and determine whether the system has the necessary resources to execute it efficiently. For instance, if an algorithm can take advantage of multi-core computing facilities, the practitioner must use the appropriate packages and functions to leverage those cores. This may mean the difference between getting results in an hour and requiring an entire day.
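
As a small sketch of what leveraging multiple cores looks like in R, the built-in parallel package can distribute independent tasks, here 1,000 bootstrap model fits, across workers; the task itself is purely illustrative:

    library(parallel)

    # An independent, repeatable task: fit a model on a bootstrap resample
    fit_once <- function(i) {
      idx <- sample(nrow(mtcars), replace = TRUE)
      coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
    }

    n_cores <- max(1, detectCores() - 1)

    # mclapply forks worker processes on Unix-alikes; on Windows, use
    # parLapply with a cluster created by makeCluster(n_cores) instead
    results <- mclapply(1:1000, fit_once, mc.cores = n_cores)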

Last but not least, the creation of machine learning models requires programming in one or more languages. This in itself demands a level of knowledge and skill in applying algorithms, using appropriate data structures, and other computer science concepts.

Predictive analytics (machine learning)

In popular media and literature, predictive analytics goes by various names, used interchangeably depending on personal preference and interpretation. The terms predictive analytics, machine learning, and statistical learning are essentially synonymous: all refer to the practice of applying algorithms to data in order to make predictions.

The algorithm could be as simple as a line of best fit, also known as linear regression, which you may have already used in Excel; or it could be a complex deep learning model with multiple hidden layers and inputs. In both cases, the mere fact that a statistical model, that is, an algorithm, was applied to generate a prediction qualifies the exercise as machine learning.
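
In R, a line of best fit is a one-liner. The sketch below fits a linear regression on the built-in mtcars dataset and predicts fuel efficiency for a hypothetical new car:

    # Fit a simple linear regression: miles per gallon as a function of weight
    fit <- lm(mpg ~ wt, data = mtcars)

    summary(fit)  # coefficients, R-squared, and so on

    # Predict mpg for a car weighing 3,000 lbs (wt is measured in 1,000s of lbs)
    predict(fit, newdata = data.frame(wt = 3.0))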

In general, creating a machine learning model involves a sequence of steps such as the following (a condensed sketch in R appears after the list):

  1. Cleanse and curate the dataset to extract the cohort on which the model will be built.
  2. Analyze the data using descriptive statistics, for example, distributions and visualizations.
  3. Perform feature engineering, preprocessing, and any other steps necessary to add or remove variables/predictors.
  4. Split the data into a training set and a test set (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
  5. Select appropriate machine learning models and create the model using cross-validation.
  6. Select the final model after assessing the performance of the candidate models on one or more cost metrics. Note that the final model could be an ensemble, that is, a combination of more than one model.
  7. Perform predictions on the test dataset.
  8. Deliver the final model.
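
As a condensed, illustrative version of steps 4 through 7, the following sketch uses the caret package on the built-in iris dataset; the choice of a random forest (backed by the randomForest package) and of 5-fold cross-validation are assumptions for the example, not prescriptions:

    library(caret)
    set.seed(42)

    # Step 4: 80/20 train/test split, stratified on the outcome
    in_train  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
    train_set <- iris[in_train, ]
    test_set  <- iris[-in_train, ]

    # Step 5: train a random forest with 5-fold cross-validation
    ctrl  <- trainControl(method = "cv", number = 5)
    model <- train(Species ~ ., data = train_set, method = "rf",
                   trControl = ctrl)

    # Steps 6 and 7: inspect cross-validated performance, then predict
    # on the held-out test set
    print(model)
    preds <- predict(model, newdata = test_set)
    confusionMatrix(preds, test_set$Species)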

The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R there are multiple packages, such as randomForest, gbm (for Gradient Boosting Machines (GBMs)), kernlab (which implements Support Vector Machines (SVMs)), and others.
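
To give a flavor of working with one of these packages directly, rather than through a wrapper such as caret, the sketch below fits an SVM with kernlab; the radial basis kernel is an illustrative choice:

    library(kernlab)

    # Fit a support vector machine with a radial basis kernel on iris
    svm_fit <- ksvm(Species ~ ., data = iris, kernel = "rbfdot")

    # In-sample predictions and a quick accuracy check
    preds <- predict(svm_fit, iris)
    mean(preds == iris$Species)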

Although Python's scikit-learn is extremely versatile and elaborate, and Python is in fact the preferred language in many production settings, the ease of use and diversity of packages in R give it an advantage in terms of early adoption and use for machine learning exercises.

The Comprehensive R Archive Network (CRAN) has a task view page titled CRAN Task View: Machine Learning & Statistical Learning (https://cran.r-project.org/web/views/MachineLearning.html) that summarizes some of the key packages in use today for machine learning using R.

Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released R packages that act as wrappers around the machine learning algorithms implemented in the respective tools.
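
For example, the xgboost package wraps the XGBoost library for R. The sketch below fits a gradient-boosted classifier on a toy binary task derived from mtcars; the feature choice and parameters are assumptions made for illustration:

    library(xgboost)

    # Toy binary task: predict engine shape (vs, coded 0/1) from three features
    x <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
    y <- mtcars$vs

    # xgboost expects a numeric matrix of features and a numeric label vector
    bst <- xgboost(data = x, label = y, nrounds = 20,
                   objective = "binary:logistic", verbose = 0)

    # Predicted probabilities for the training rows
    head(predict(bst, x))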

It is a common misconception that machine learning is just about creating models. While that is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by running a couple of lines of code. A good model has business value; a model built without the rigor of formal machine learning principles is, for all intents and purposes, unusable. A key requirement of a good machine learning model is the judicious use of domain expertise to evaluate results, identify and analyze errors, and refine the model further using the insights that subject matter experts can provide. This is where domain knowledge plays a crucial and indispensable role.

Domain knowledge

More often than data scientists would like to admit, machine learning models produce results that are obvious and intuitive. For instance, we once conducted an elaborate analysis of physicians' prescribing behavior to find the strongest predictor of how many prescriptions a physician would write in the next quarter. We used a broad set of input variables, such as the physicians' locations, their specialties, hospital affiliations, prescribing history, and other data. In the end, the best-performing model produced a result that we all knew very well: the strongest predictor of how many prescriptions a physician would write in the next quarter was the number of prescriptions the physician had written in the previous quarter! To filter out the truly meaningful variables and build a more robust model, we eventually had to engage someone with extensive experience of working in the pharma industry. Machine learning models work best when produced in a hybrid approach, one that combines domain expertise with the sophistication of the models developed.
