Chapter 1, Getting Started with Predictive Analytics, begins with a brief history of how predictive analytics developed. We then discuss the different roles of predictive analytics practitioners and describe the industries in which they work. Ways to organize predictive analytics projects on a PC are discussed next, the R language is introduced, and we end the chapter with a short example of a predictive model.
Chapter 2, The Modeling Process, discusses how the development of predictive models can be organized into a series of stages, each with different goals, such as exploration and problem definition, leading to the actual development of a predictive model. We discuss two important analytics methodologies, CRISP-DM and SEMMA. Code examples are sprinkled throughout the chapter to demonstrate some of the ideas central to the methodologies, so you will, hopefully, never be bored.
Chapter 3, Inputting and Exploring Data, introduces various ways that you can bring your own input data into R. We also discuss various data preparation techniques using standard SQL functions as well as analogous methods using the R dplyr package. Have no data to input? No problem. We will show you how to generate your own human-like data using the R package wakefield.
Chapter 4, Introduction to Regression Algorithms, begins with a discussion of supervised versus unsupervised algorithms. The rest of the chapter concentrates on regression algorithms, which represent the supervised algorithm category. You will learn about interpreting regression output such as model coefficients and residual plots. There is even an interactive game that tests whether you can determine if a series of residuals is random or not.
Chapter 5, Introduction to Decision Trees, Clustering, and SVM, concentrates on three other core predictive algorithms that are in widespread use and that, along with regression, can be used to solve many, if not most, of your predictive analytics problems. The last algorithm discussed, the Support Vector Machine (SVM), is often used with high-dimensional data, such as unstructured text, so we will accompany that discussion with some text mining techniques applied to customer complaint comments.
Chapter 6, Using Survival Analysis to Predict and Analyze Customer Churn, discusses a specific modeling technique known as survival analysis and follows a hypothetical customer marketing satisfaction and retention example. We will also delve more deeply into simulating customer choice using some sampling functions available in R.
Chapter 7, Using Market Basket Analysis as a Recommender Engine, introduces the concept of association rules and market basket analysis, and walks you through techniques that can predict future purchases based upon various combinations of previous purchases from an online retail store. It also introduces text analytics techniques, coupled with cluster analysis, that place customers into different segments. You will learn some additional data cleaning techniques and how to generate some interesting association plots.
Chapter 8, Exploring Health Care Enrollment Data as a Time Series, introduces time series analytics. Healthcare enrollment data from the CMS website is explored first. Then we move on to defining some basic time series concepts, such as simple and exponential moving averages. Finally, we work with the R forecast package, which, as its name implies, helps you to perform time series forecasting.
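As a small preview of the two smoothing concepts named above, here is a minimal sketch in base R. The enrollment values are made up for illustration (they are not CMS data), and the helper names sma and ema, the window length of 3, and the smoothing factor of 0.5 are all assumptions chosen for this example rather than anything taken from the chapter.

```r
# Hypothetical monthly enrollment counts (made-up values, not CMS data)
enrollment <- c(100, 104, 103, 108, 112, 115, 113, 120)

# Simple moving average: unweighted mean of the current and previous n - 1 values.
# The first n - 1 results are NA because a full window is not yet available.
sma <- function(x, n) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Exponential moving average: each result blends the new observation with the
# previous average, weighted by the smoothing factor alpha (0 < alpha <= 1).
# The series is seeded with the first observation.
ema <- function(x, alpha) {
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

sma3 <- sma(enrollment, 3)
ema_half <- ema(enrollment, 0.5)

print(round(sma3, 2))
print(round(ema_half, 2))
```

A larger window (or a smaller alpha) gives a smoother line that reacts more slowly to recent changes. The forecast package offers its own smoothing and forecasting functions, which this sketch does not use.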
Chapter 9, Introduction to Spark Using R, introduces SparkR, which is an environment for accessing large Spark clusters using R. No local version of R needs to be installed. It also introduces Databricks, which is a cloud-based environment for running R (as well as Python, SQL, and other languages) against Spark-based big data. This chapter also demonstrates techniques for transforming small datasets into larger Spark clusters, using the Pima Indians Diabetes database as a reference.
Chapter 10, Exploring Large Datasets Using Spark, shows how to perform exploratory data analysis using a combination of SparkR and Spark SQL on the Pima Indians Diabetes data loaded into Spark. We will learn the basics of exploring Spark data using Spark-specific commands that allow us to filter, group, summarize, and visualize our Spark data.
Chapter 11, Spark Machine Learning – Regression and Cluster Models, covers machine learning by first illustrating a logistic regression model that has been built using a Spark cluster. We will learn how to split Spark data into training and test sets, run a logistic regression model, and then evaluate its performance.
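The split, fit, and evaluate workflow described above can be sketched locally in base R before you ever touch a cluster. This is only an illustration under stated assumptions: it runs on a simulated data frame rather than on Spark, and the column names glucose and diabetes, the 80/20 split, and the 0.5 classification cutoff are hypothetical choices for this example, not the chapter's own code.

```r
set.seed(42)  # make the simulated data and the split reproducible

# Simulated data (hypothetical): one predictor and a binary outcome
n <- 500
glucose <- rnorm(n, mean = 120, sd = 30)
diabetes <- rbinom(n, size = 1, prob = plogis(-6 + 0.05 * glucose))
df <- data.frame(glucose = glucose, diabetes = diabetes)

# Step 1: split the rows into 80% training and 20% test data
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Step 2: fit a logistic regression model on the training data only
fit <- glm(diabetes ~ glucose, data = train, family = binomial)

# Step 3: score the held-out test data and evaluate performance
pred_prob <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy <- mean(pred_class == test$diabetes)

print(table(predicted = pred_class, actual = test$diabetes))
print(accuracy)
```

The important design point is that the model never sees the test rows during fitting, so the accuracy figure is an estimate of how the model might behave on new data. The same three steps apply in Spark, although the functions used there differ from the base R calls shown here.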
Chapter 12, Spark Models - Rules-Based Learning, teaches you how to run decision tree models in Spark using the Stop and Frisk dataset. You will learn how to overcome some of the algorithmic limitations of the Spark MLlib environment by extracting some cluster samples to your local machine and then running some non-Spark algorithms that you are already familiar with. This chapter will also introduce you to a new rule-based algorithm, OneR, and will demonstrate how you can mix different languages in Spark, such as R, SQL, and even Python code, in the same notebook using the %magic directive.