Learning Predictive Analytics with Python

Learning Predictive Analytics with Python: Gain practical insights into predictive modelling by implementing Predictive Analytics algorithms on public datasets with Python

Kumar, Gary Dougan
Paperback, Feb 2016, 354 pages, 1st Edition

Chapter 1. Getting Started with Predictive Modelling

Predictive modelling is both an art and a science: it unearths the story hidden in silos of data. This chapter introduces the scope and applications of predictive modelling and gives a glimpse of what can be achieved with it through some real-life examples.

In this chapter, we will cover the following topics in detail:

  • Introducing predictive modelling
  • Applications and examples of predictive modelling
  • Installing and downloading Python and its packages
  • Working with different IDEs for Python

Introducing predictive modelling

Did you know that Facebook users around the world share 2,460,000 pieces of content every minute of the day? Did you know that 72 hours' worth of new video content is uploaded to YouTube in the same time? And, brace yourself, did you know that around 2.5 exabytes (1 exabyte = 10^18 bytes) of data are created by us humans every day? To put that in perspective, you would need about 2.5 million 1 TB (1000 GB) hard disk drives every day to store that much data. Within a year, the number of drives would exceed the US population and be well north of five times the UK population. This estimate assumes that the rate of data generation stays the same, which in all likelihood will not be the case.
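The storage comparison above is simple arithmetic and can be checked in a few lines. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 exabyte = 10^18 bytes) and takes the 2.5 exabytes per day figure from the text at face value.

```python
# Back-of-the-envelope check of the storage comparison (decimal units assumed).
EXABYTE = 10**18   # bytes
TERABYTE = 10**12  # bytes

daily_data_bytes = 2.5 * EXABYTE                  # figure quoted in the text
drives_per_day = daily_data_bytes / TERABYTE      # number of 1 TB drives filled per day
drives_per_year = drives_per_day * 365            # assumes a constant rate of data generation

print(f"1 TB drives needed per day:  {drives_per_day:,.0f}")
print(f"1 TB drives needed per year: {drives_per_year:,.0f}")
```

Under these assumptions, the daily figure comes to 2.5 million drives and the yearly figure to roughly 900 million.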

The breakneck speed at which social media and the Internet of Things have grown is reflected in the huge silos of data that humans generate: data about where we live, where we come from, what we like, what we buy, how much money we spend, where we travel, and so on. Whenever we interact with a social media website or an Internet of Things service, we leave a trail, which these services gleefully log as their data. Every time you buy a book on Amazon, receive a payment through PayPal, write a review on Yelp, post a photo on Instagram, or check in on Facebook, apart from generating business for these websites, you are creating data for them.

Harvard Business Review (HBR) says "Data is the new oil" and that "Data Scientist is the sexiest job of the 21st century". So, why is data so important, and how can we realize its full potential? Broadly, data is used in two ways:

  • Retrospective analytics: This approach helps us analyze history and glean insights from the data. It allows us to learn from mistakes and adopt best practices. These insights and learnings become the torchbearers for devising better strategy. Not surprisingly, many experts have been claiming that data is the new middle manager.
  • Predictive analytics: This approach unleashes the might of data. In short, it allows us to predict the future. Data science algorithms take historical data and produce a statistical model, which can predict who will buy, cheat, lie, or die in the future.

Let us evaluate the comparisons made with oil in detail:

  • Data is as abundant as oil used to be, once upon a time, but in contrast to oil, data is a non-depleting resource. In fact, one can argue that it is reusable, in the sense that each dataset can be used in more than one way and multiple times.
  • It doesn't take years to create data, as it does for oil.
  • Oil in its crude form is worth nothing. It needs to be refined through a comprehensive process to make it usable. There are various grades of this process to suit various needs; it's the same with data. The data sitting in silos is worthless; it needs to be cleaned, manipulated, and modelled before it can be put to use. Just as we need refineries and people who can operate those refineries, we need tools that can handle data and people who can operate those tools. Some of the tools for these tasks are Python, R, SAS, and so on, and the people who operate these tools are called data scientists.

A more detailed comparison of oil and data is provided in the following table:

| Data | Oil |
| --- | --- |
| It is a non-depleting resource and is also reusable. | It is a depleting resource and is non-reusable. |
| Data collection requires some infrastructure or system to be in place. Once the system is in place, data generation happens seamlessly. | Drilling for oil requires a lot of infrastructure. Once the infrastructure is in place, one can keep drawing oil until the stock dries up. |
| It needs to be cleaned and modelled. | It needs to be cleaned and processed. |
| The time taken to generate data varies from fractions of a second to months and years. | It takes decades to generate. |
| The worth and marketability of different kinds of data differ. | The worth of crude oil is the same everywhere. However, the price and marketability of the different end products of refinement differ. |
| The time horizon for monetizing data, once it has been collected, is shorter. | The time horizon for monetizing oil is longer than that for data. |

Scope of predictive modelling

Predictive modelling is an ensemble of statistical algorithms coded in a statistical tool, which, when applied to historical data, outputs a mathematical function (or equation). This function can in turn be used to predict outcomes from future inputs (on which the model operates), to drive a goal in a business context or to enable better decision making in general.

To understand what predictive modelling entails, let us look at the key phrases in this definition one by one.

Ensemble of statistical algorithms

Statistics is important for understanding data. It tells us volumes about the data. How is the data distributed? Is it centered with little variance, or does it vary widely? Are two of the variables dependent on or independent of each other? Statistics helps us answer these questions. This book expects a basic understanding of statistical terms such as mean, variance, covariance, and correlation. Advanced terms, such as hypothesis testing, Chi-Square tests, p-values, and so on, will be explained as and when required. Statistics is the cog in the wheel called the model.

Algorithms, on the other hand, are the blueprints of a model. They are responsible for creating mathematical equations from the historical data. They analyze the data, quantify the relationship between the variables, and convert it into a mathematical equation. There is a variety of them: Linear Regression, Logistic Regression, Clustering, Decision Trees, Time-Series Modelling, Naïve Bayes Classifiers, Natural Language Processing, and so on. These algorithms can be grouped into two classes:

  • Supervised algorithms: These are the algorithms for which the historical data has an output variable in addition to the input variables. The model makes use of the output variable from the historical data, apart from the input variables. Examples of such algorithms include Linear Regression, Logistic Regression, Decision Trees, and so on.
  • Unsupervised algorithms: These algorithms work without an output variable in the historical data. Clustering is an example of such an algorithm.

The selection of a particular algorithm for a model depends largely on the kind of data available. The focus of this book is to explain methods of handling various kinds of data and to illustrate the implementation of some of these models.

Statistical tools

There are many statistical tools available today, which come with inbuilt methods to run basic statistical chores. The arrival of robust open-source tools like R and Python has made them extremely popular in industry and academia alike. Apart from that, Python's packages are well documented; hence, debugging is easier.

Python has a number of libraries, especially for running statistical, cleaning, and modelling chores. It has emerged as the first among equals when it comes to choosing a tool for implementing predictive modelling. As the title suggests, Python is the tool of choice for this book as well.

Historical data

Our machinery (the model) is built and operated on this oil called data. In general, a model is built on historical data and works on future data. Additionally, a predictive model can be used to fill in missing values in historical data by interpolating the model over sparse historical data. In many cases, future data is not available during the modelling stage. Hence, it is common practice to divide the historical data, through sampling, into a training set (to act as historical data) and a testing set (to act as future data).
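As a rough illustration of this sampling step, the following sketch splits a list of records at random into training and testing subsets using only the standard library. The function name `split_train_test`, the 80/20 ratio, and the seed are assumptions made for the example, not something prescribed by the text.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into (training, testing) subsets."""
    rng = random.Random(seed)          # seeded so that the split is reproducible
    shuffled = rows[:]                 # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

historical = list(range(100))          # stand-in for 100 historical records
train, test = split_train_test(historical, test_fraction=0.2)
print(len(train), len(test))
```

The model would then be fitted on `train` only and evaluated on `test`, which plays the role of the unseen future data.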

As discussed earlier, the data might or might not have an output variable. However, one thing that it promises to be is messy. It needs to undergo a lot of cleaning and manipulation before it can become of any use for a modelling process.

Mathematical function

Most of the data science algorithms have underlying mathematics behind them. In many of the algorithms, such as regression, a mathematical equation (of a certain type) is assumed and the parameters of the equations are derived by fitting the data to the equation.

For example, the goal of linear regression is to fit a linear model to a dataset and find the parameters of the following equation:

$$Y = a_0 + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$$

The purpose of modelling is to find the best values for the coefficients $a_0, a_1, \dots, a_n$. Once these values are known, the equation can be used to predict the output. This equation, which can also be thought of as a linear function of the $X_i$'s (the input variables), is the linear regression model.
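To make "finding the best values for the coefficients" concrete, here is a minimal sketch for the simplest case of a single input variable, where the model is y = intercept + slope * x. It uses the ordinary least-squares closed form; the function name and the toy data are illustrative assumptions, and a real project would more likely rely on a library such as scikit-learn.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data generated from y = 1 + 2x, so the fit should recover these values.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(intercept, slope)

# Once the coefficients are known, prediction is just evaluating the equation.
predicted = intercept + slope * 6
```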

Another example is logistic regression. Here too we have a mathematical equation, or a function of the input variables, with some differences. The defining equation for logistic regression is as follows:

$$P = \frac{1}{1 + e^{-(a + bX)}}$$

Here, $P$ is the probability of the outcome for an input $X$, and the goal is to estimate the values of a and b by fitting the data to this equation. Any supervised algorithm will have an equation or function similar to that of the model above. For unsupervised algorithms, an underlying mathematical function or criterion (which can be formulated as a function or equation) serves the purpose. The mathematical equation or function is the backbone of a model.
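The logistic function itself is easy to evaluate once a and b are known. The sketch below assumes the common single-input form P = 1 / (1 + e^-(a + bX)); the values of a and b are hypothetical and chosen only to show how the output moves between 0 and 1.

```python
import math

def logistic(x, a, b):
    """Probability of the positive outcome for input x, given parameters a and b."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Hypothetical parameters: with a = -4 and b = 1, the probability crosses 0.5 at x = 4.
a, b = -4.0, 1.0
for x in (0, 4, 8):
    print(x, round(logistic(x, a, b), 3))
```

In practice, a and b are not picked by hand but estimated from the training data, typically by maximum likelihood.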

Business context

All the effort that goes into predictive analytics, and all the worth that accrues to data, is justified because it solves a business problem. A business problem can be anything, as the following examples show:

  • Tricking the users of a product/service into buying more from you by increasing the click-through rates of online ads
  • Predicting the probable crime scenes in order to prevent them
  • Aggregating an invincible lineup for a sports league
  • Predicting the failure rates and associated costs of machinery components
  • Managing the churn rate of customers

Predictive analytics is being used in an array of industries to solve business problems. Some of these industries are as follows:

  • Banking
  • Social media
  • Retail
  • Transport
  • Healthcare
  • Policing
  • Education
  • Travel and logistics
  • E-commerce
  • Human resources

All that matters is by how much the proposed solution made life better for the business. That is why predictive analytics is becoming an indispensable practice in management consulting.

In short, predictive analytics sits at the sweet spot where statistics, algorithms, technology, and business sense intersect. Think of a mathematician, a programmer, and a business person rolled into one.

Knowledge matrix for predictive modelling

As discussed earlier, predictive modelling is an interdisciplinary field that requires knowledge of four disciplines: Statistics, Algorithms, Tools and Techniques, and Business Sense. Each of these disciplines is equally indispensable for carrying out a predictive modelling task successfully.

These four disciplines carry equal weight and can be represented as a knowledge matrix: a symmetric 2 x 2 matrix containing four equal-sized squares, each representing a discipline.

Fig. 1.1: Knowledge matrix: four disciplines of predictive modelling

Task matrix for predictive modelling

The tasks involved in predictive modelling follow the Pareto principle. Around 80% of the effort in the modelling process goes into data cleaning and wrangling, while only 20% of the time and effort goes into implementing the model and getting the predictions. However, the meaty part of the modelling, which yields almost 80% of the results and insights, is undoubtedly the implementation of the model. This information can be represented as a task matrix, which looks similar to the following figure:

Fig. 1.2: Task matrix: split of time spent on data cleaning and modelling and their final contribution to the model

Many of the data cleaning and exploration chores can be automated because they are alike most of the time, irrespective of the data. The part that needs a lot of human thinking is the implementation of a model, which makes up the bulk of this book.

Applications and examples of predictive modelling

In the introductory section, data was compared with oil. Oil has been the primary source of energy for the last couple of centuries, and the legends of OPEC, petrodollars, and the Gulf Wars have established it as a fiercely contested resource. The might of data needs to be demonstrated too, to set the premise for the comparison. Let us glance through some examples of predictive analytics to marvel at the might of data.

LinkedIn's "People also viewed" feature

If you are a frequent LinkedIn user, you might be familiar with LinkedIn's "People also viewed" feature.

What does it do?

Let's say you have searched for a person who works at a particular organization and LinkedIn returns a list of search results. You click on one of them and land on their profile. In the middle-right section of the screen, you will find a panel titled "People Also Viewed"; it is essentially a list of people who either work at the same organization as the person whose profile you are currently viewing or have the same designation and belong to the same industry.

Isn't it cool? You might have searched for these people separately if not for this feature. This feature increases the efficacy of your search results and saves you time.

How is it done?

Are you wondering how LinkedIn does it? The rough blueprint is as follows:

  • LinkedIn leverages search history data to do this. The model underneath this feature plunges into a treasure trove of search history data and looks at what people searched for next after finding the person they were originally looking for.
  • The event of searching for a particular second person after searching for a particular first person has some probability. This probability is calculated using all the data for such searches. The profiles with the highest probability of being searched next (based on the historical data) are shown in the "People Also Viewed" section.
  • This probability comes under the ambit of a broad set of rules called Association Rules. These are very widely used in Retail Analytics, where we are interested in knowing which groups of products will sell together. In other words, what is the probability of buying a particular second product given that the consumer has already bought the first product?
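The conditional probability described in the points above can be estimated by simple counting. The following sketch is only an illustration of that idea: the session data and profile names are made up, and the function name is hypothetical. It is not a description of how LinkedIn actually implements the feature.

```python
from collections import Counter

# Hypothetical search sessions: (first profile viewed, profile viewed next).
sessions = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("alice", "bob"), ("dave", "carol"),
]

def next_view_probabilities(sessions, first_profile):
    """Estimate P(next profile | first profile) from the observed pairs."""
    followers = Counter(nxt for first, nxt in sessions if first == first_profile)
    total = sum(followers.values())
    # most_common() orders the candidates from most to least likely.
    return {profile: count / total for profile, count in followers.most_common()}

probabilities = next_view_probabilities(sessions, "alice")
print(probabilities)
```

The profiles at the top of this ranking are the natural candidates for a "People Also Viewed" panel. In association-rule terminology, this conditional probability is usually called the confidence of the rule.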

Correct targeting of online ads

If you browse the Internet, which I am sure you do frequently, you must have encountered online ads, both on websites and in smartphone apps. Just like ads in newspapers or on TV, online ads have a publisher and an advertiser. The publisher in this case is the website or the app where the ad will be shown, while the advertiser is the company/organization that is posting the ad.

The ultimate goal of an online ad is to be clicked on. Each instance of an ad display is called an impression. The number of clicks per impression is called the Click Through Rate and is the single most important metric that advertisers are interested in. The problem statement is to determine the list of publishers where the advertiser should publish its ads so that the Click Through Rate is maximized.

How is it done?

  • The historical data in this case consists of information about people who visited a certain website/app and whether or not they clicked the published ad. One classification model, or a combination of models such as Decision Trees and Support Vector Machines, is used in such cases to determine whether a visitor will click on the ad, given the visitor's profile information.
  • One problem with standard classification algorithms in such cases is that Click Through Rates are very small numbers, typically less than 1%. The resulting dataset used for classification has very few positive outcomes. The non-click records need to be downsampled before modelling, so that the share of positive outcomes in the data increases.
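The two ideas above, computing the Click Through Rate and downsampling the non-click records, can be sketched as follows. The impression log is synthetic, the function names are hypothetical, and the 5% keep fraction is an arbitrary assumption for illustration; real data would also carry visitor features alongside each label.

```python
import random

def click_through_rate(labels):
    """Clicks per impression, where each label is 1 (clicked) or 0 (not clicked)."""
    return sum(labels) / len(labels)

def downsample_negatives(labels, keep_fraction, seed=0):
    """Keep every positive (1) and only a random fraction of the negatives (0)."""
    rng = random.Random(seed)
    return [y for y in labels if y == 1 or rng.random() < keep_fraction]

# Synthetic impression log: 50 clicks out of 10,000 impressions.
impressions = [1] * 50 + [0] * 9950
enriched = downsample_negatives(impressions, keep_fraction=0.05)

print(click_through_rate(impressions))   # share of positives before downsampling
print(click_through_rate(enriched))      # share of positives after downsampling
```

Note that a model trained on the downsampled data sees an inflated click rate, so its predicted probabilities generally need to be corrected before they are read as real Click Through Rates.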

Logistic regression is one of the most standard classifiers for situations with binary outcomes. In banking, whether or not a person will default on a loan can be predicted using logistic regression, given the person's credit history.

Santa Cruz predictive policing

Based on the historical data consisting of the area and time window of the occurrence of a crime, a model was developed to predict the place and time where the next crime might take place.

How is it done?

  • A decision tree model was created using the historical data. The prediction of the model will foretell whether a crime will occur in an area on a given date and time in the future.
  • The model is consistently recalibrated every day to include the crimes that happened during that day.

The good news is that the police are using such techniques to predict crime scenes in advance so that they can prevent crimes from happening. The bad news is that certain terrorist organizations are using such techniques to target the locations that will cause the maximum damage with minimal effort on their side. The good news, again, is that this strategic behavior of terrorists has been studied in detail and is being used to form counter-terrorism policies.

Determining the activity of a smartphone user using accelerometer data

The accelerometer in a smartphone measures acceleration over a period of time as the user engages in various activities. The acceleration is measured along three axes: X, Y, and Z. This acceleration data can then be used to determine whether the user is sleeping, walking, running, jogging, and so on.

How is it done?

  • The acceleration data is clustered based on the acceleration values in the three directions. The values for similar activities cluster together.
  • The clustering performs well in such cases if the columns that contribute the most to the separation of activities are included while calculating the distance matrix for clustering. Such columns can be identified using a technique called Singular Value Decomposition.
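The text does not say which clustering algorithm was used, so the following sketch assumes plain k-means with Euclidean distance purely for illustration. The accelerometer readings are invented, the starting centroids are picked by hand, and the Singular Value Decomposition step for selecting columns is left out.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on (x, y, z) points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Invented readings: three "resting" samples and three "running" samples.
readings = [
    (0.0, 0.1, 1.0), (0.1, 0.0, 0.9), (0.0, 0.0, 1.1),
    (1.5, 2.0, 2.5), (1.7, 1.8, 2.4), (1.6, 2.2, 2.6),
]
initial_centroids = [readings[0], readings[3]]
centroids, clusters = kmeans(readings, initial_centroids)
print([len(cluster) for cluster in clusters])
```

Each resulting cluster groups readings with similar acceleration values, which is the sense in which "similar activities cluster together". Assigning an activity name to a cluster still requires some labelled examples or domain knowledge.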

Sport and fantasy leagues

Moneyball, anyone? Yes, the movie. The movie where a statistician turns around the fortunes of a poorly performing baseball team, the Oakland A's, by developing an algorithm to select players who were cheap to buy but had a lot of latent potential to perform.

How was it done?

  • Bill James, using historical data, concluded that the older metrics used to rate a player, such as stolen bases, runs batted in, and batting average, were not very useful indicators of a player's performance in a given match. He relied instead on metrics such as on-base percentage and slugging percentage as better predictors of a player's performance.
  • The chief statistician behind the algorithms, Bill James, compiled the performance data for all the baseball league players and sorted them by these metrics. Surprisingly, the players who had high values for these statistics also came at cheaper prices.

This way, they gathered an unbeatable team that didn't have individual stars who came at hefty prices but, as a team, was an indomitable force. Since then, these algorithms and their variations have been used in a variety of real and fantasy leagues to select players. Variants of these algorithms are also being used by venture capitalists to optimize and automate their due diligence when selecting prospective start-ups to fund.
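For readers unfamiliar with the two metrics mentioned above, the sketch below computes them from their standard baseball definitions. The season line is hypothetical and not taken from any real player.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sacrifice_flies
    )

def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

# Hypothetical season line for one player (150 hits = 100 + 30 + 5 + 15).
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
slg = slugging_percentage(singles=100, doubles=30, triples=5,
                          home_runs=15, at_bats=500)
print(f"OBP = {obp:.3f}, SLG = {slg:.3f}")  # OBP = 0.377, SLG = 0.470
```

Sorting players by such metrics relative to their price is the kind of ranking the paragraph above describes.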

Python and its packages – download and installation

There are various ways in which one can access and install Python and its packages. Here we will discuss a couple of them.

Anaconda

Anaconda is a popular Python distribution consisting of more than 195 popular Python packages. Installing Anaconda automatically installs many of the packages discussed later in this chapter. It also installs an IDE called Spyder (more on this later in this chapter), from which these packages can be used, as well as IPython Notebook. When you click on the IPython Notebook icon, it opens a browser tab and a Command Prompt.

Note

Anaconda can be downloaded and installed from the following web address: http://continuum.io/downloads

Download the suitable installer and double-click on the .exe file to install Anaconda. Two of the features that you must check after the installation are:

  • IPython Notebook
  • Spyder IDE

If they don't appear in the list of programs and files by default, search for them using the "Start" menu's search. We will be using IPython Notebook extensively, and the code in this book will work best when run in IPython Notebook.

IPython Notebook can be opened by clicking on its icon. Alternatively, you can open it from the Command Prompt: navigate to the directory where you have installed Anaconda and type ipython notebook, as shown in the following screenshot:

Fig. 1.3: Opening IPython Notebook

Note

On the system used for this book, Anaconda was installed in the C:\Users\ashish directory. You can open a new Notebook in IPython by clicking on the New Notebook button on the dashboard that opens up. In this book, we have used IPython Notebook extensively.

Standalone Python

You can download a stable Python version that is compatible with the OS on your system. This book uses Python 2.7.0, so installing this version is highly recommended. You can download it from https://www.python.org/ and install it.

There are some Python packages that you need to install on your machine before you start predictive analytics and modelling. This section consists of a demo of installation of one such library and a brief description of all such libraries.

Installing a Python package

There are several ways to install a Python package. The easiest and the most effective is the one using pip. As you might be aware, pip is a package management system that is used to install and manage software packages written in Python. To be able to use it to install other packages, pip needs to be installed first.

Installing pip

The following steps demonstrate how to install pip. Follow closely!

  1. Navigate to the pip page at https://pypi.python.org/pypi/pip, shown in the following screenshot:

    Downloading pip from the Python's official website

  2. Download the pip-7.0.3.tar.gz file and extract it into the folder where Python is installed. If you have Python v2.7.0 installed, this folder should be C:\Python27:

    Extracting the .tar.gz file for pip in the correct folder

  3. Extracting the file creates a folder called pip-7.0.3. Opening that folder takes you to a screen similar to the one in the preceding screenshot.
  4. Open the Command Prompt (CMD) on your computer and change the current directory to C:\Python27\pip-7.0.3 using the following command:
    cd C:\Python27\pip-7.0.3
  5. The result of the preceding command is shown in the following screenshot:

    Navigating to the directory where pip is installed

  6. The current directory is now the one that contains the setup file for pip (setup.py). Run the following command to install pip:
    python setup.py install
  7. The result of the preceding command is shown in the following screenshot:

    Installing pip using a command line

Once pip is installed, it is very easy to install all the required Python packages to get started.

Installing Python packages with pip

The following are the steps to install Python packages using pip, which we just installed in the preceding section:

  1. Change the current directory in the command prompt to the directory where Python v2.7.0 is installed, that is, C:\Python27.
  2. Run the following command to install a package:
    pip install package-name
  3. For example, to install pandas, you can proceed as shown in the following screenshot:

    Installing a Python package using a command line and pip

  4. Finally, to confirm that the package has been installed successfully, run the following command:
    python -c "import pandas"
  5. The result of the preceding command is shown in the following screenshot:

    Checking whether the package has installed correctly or not

If this doesn't throw up an error, then the package has been installed successfully.

Python and its packages for predictive modelling

In this section, we will discuss some commonly used packages for predictive modelling.

pandas: pandas is the most important and versatile package in the data science domain. It is no wonder that you see import pandas at the beginning of almost every data science code snippet, in this book and elsewhere. Among other things, the pandas package facilitates:

  • The reading of a dataset in a usable format (data frame in case of Python)
  • Calculating basic statistics
  • Running basic operations like sub-setting a dataset, merging/concatenating two datasets, handling missing data, and so on

The various methods in pandas will be explained in this book as and when we use them.

Note

To get an overview, navigate to the official page of pandas here: http://pandas.pydata.org/index.html

NumPy: NumPy, in many ways, is a MATLAB equivalent in the Python environment. It has powerful methods to do mathematical calculations and simulations. The following are some of its features:

  • A powerful and widely used N-dimensional array object
  • An ensemble of powerful mathematical functions used in linear algebra, Fourier transforms, and random number generation
  • A combination of random number generators and N-dimensional arrays, which is used to generate dummy datasets to demonstrate various procedures, a practice we will follow extensively in this book

Note

To get an overview, navigate to the official page of NumPy at http://www.NumPy.org/

matplotlib: matplotlib is a Python library that easily generates high-quality 2-D plots. Again, it is very similar to MATLAB.

  • It can be used to draw all kinds of common plots, such as histograms, stacked and unstacked bar charts, scatterplots, heat maps, box plots, power spectra, error charts, and so on
  • It can be used to edit and manipulate all the plot properties, such as the title, axes properties, color, scale, and so on

Note

To get an overview, navigate to the official page of matplotlib at: http://matplotlib.org

IPython: IPython provides an environment for interactive computing.

It provides a browser-based notebook, an IDE-cum-development environment that supports code, rich media, inline plots, and model summaries. These notebooks and their content can be saved and used later to demonstrate the results as they are, or the code can be saved separately and executed. It has emerged as a powerful tool for web-based tutorials, as the code and the results flow smoothly one after the other in this environment. We will use this environment at many places in this book.

Note

To get an overview, navigate to the official page of IPython here http://ipython.org/

scikit-learn: scikit-learn is the mainstay of predictive modelling in Python. It is a robust collection of data science algorithms and the methods to implement them. Some of the features of scikit-learn are as follows:

  • It is built on Python packages such as NumPy, SciPy, and matplotlib
  • It is very simple and efficient to use
  • It has methods to implement most of the predictive modelling techniques, such as linear regression, logistic regression, clustering, and Decision Trees
  • It gives a very concise way to predict the outcome based on the model and to measure the accuracy of the predictions

Note

To get an overview, navigate to the official page of scikit-learn here: http://scikit-learn.org/stable/index.html

Python packages, other than these, if used in this book, will be situation based and can be installed using the method described earlier in this section.
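As a complement to checking one package at a time with python -c "import pandas", the following sketch checks several packages in one go using the standard library. The helper name `missing_packages` and the list of package names are assumptions for the example. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names of the packages introduced in this section.
required = ["pandas", "numpy", "matplotlib", "IPython", "sklearn"]
missing = missing_packages(required)

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any package reported as not installed can then be installed with pip, as described earlier in this section.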

IDEs for Python

An IDE, or Integrated Development Environment, is a piece of software that provides a source-code editor and a debugger for writing code. Using such software, one can write, test, and debug a code snippet before adding it to the production version of the code.

IDLE: IDLE is the default Integrated Development Environment for Python that comes with the default implementation of Python. It comes with the following features:

  • Multi-window text-editor with auto-completion, smart-indent, syntax, and keyword highlighting
  • Python shell with syntax highlighting

IDLE is widely popular as an IDE for beginners; it is simple to use and works well for simple tasks. Some of the issues with IDLE are bad output reporting, absence of line numbering options, and so on. As a result, advanced practitioners move on to better IDEs.

IPython Notebook: IPython Notebook is a powerful computational environment where code, execution, results, and media can co-exist in one single document. There are two components of this computing environment:

  • IPython Notebook: A web application in which code, execution output, plots, and results are stored in separate cells that can be saved and edited as and when required
  • Notebook: A plain text document meant to record and distribute the result of a computational analysis

Notebook documents are stored with the extension .ipynb, by default in the directory from which the notebook server was started.

Some of the features of IPython Notebook are as follows:

  • Inline rendering of matplotlib plots, which can be saved in multiple formats (JPEG, PNG).
  • Notebooks use standard Python syntax, and their code can be saved as a Python script.
  • The notebooks can be saved as HTML files and .ipynb files. They can be viewed in browsers, which has made the notebook a popular tool for illustrated blogging in Python. A notebook in IPython looks as shown in the following screenshot:
    [Screenshot: An IPython Notebook]
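
Because a notebook document is plain text, it can be inspected with ordinary tools. The following is a rough sketch, using only the standard library, of what a small .ipynb document looks like (JSON in the nbformat 4 layout) and of how the code cells could be collected into a Python script. The cell contents and the helper name code_cells_to_script are made up for illustration; this is not how the notebook's own export feature is implemented:

```python
import json

# A minimal notebook document in the nbformat 4 layout.
# The cell contents are made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo analysis"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["x = 2 + 3\n", "print(x)"],
        },
    ],
}

# Serialising gives the kind of plain text an .ipynb file holds on disk
ipynb_text = json.dumps(notebook, indent=1)


def code_cells_to_script(text):
    """Hypothetical helper: join the source of every code cell into one script."""
    doc = json.loads(text)
    chunks = []
    for cell in doc["cells"]:
        if cell["cell_type"] == "code":
            source = cell["source"]
            # The source may be a single string or a list of line strings
            chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks)


script = code_cells_to_script(ipynb_text)
print(script)
```

Markdown cells are skipped here, so only the executable code ends up in the script.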

Spyder: Spyder is a powerful scientific computing and development environment for Python. It has the following features:

  • Advanced editing, auto-completion, debugging, and interactive testing
  • Python kernel and code editor with line numbering on the same screen
  • Preinstalled scientific packages like NumPy, pandas, scikit-learn, matplotlib, and so on
  • In some ways, Spyder is very similar to the RStudio environment, where text editing and interactive testing go hand in hand:
    [Screenshot: The interface of Spyder IDE]

In this book, IPython Notebook and Spyder have been used extensively, and IDLE has been used from time to time. Some people use other environments, such as PyCharm. Readers of this book are free to use such editors if they are more comfortable with them. However, they should make sure that all the required packages work in those environments.
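
One way to check that the required packages are available in a given environment is to run a short script inside it. This is a minimal sketch using only the standard library; the REQUIRED list and the missing_packages helper are assumptions made for illustration and should be adjusted to the packages actually needed:

```python
import importlib.util

# Import names of the packages to check (note: scikit-learn is imported as sklearn)
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib"]


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available")
```

Running the same script from each IDE shows whether that IDE is using an interpreter that has the packages installed.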

Summary

The following are some of the takeaways from this chapter:

  • Social media and the Internet of Things have resulted in an avalanche of data.
  • Data is powerful but not in its raw form. The data needs to be processed and modelled.
  • Organizations across the world and across domains are using data to solve critical business problems. Knowledge of statistical algorithms, statistical tools, business context, and the handling of historical data is vital to solving these problems using predictive modelling.
  • Python is a robust tool to handle, process, and model data. It has an array of packages for predictive modelling and a suite of IDEs to choose from.

Let us enter the battlefield where Python is our weapon. We will start using it in the next chapter, where we will learn how to read data in various situations and do some basic processing.

Key benefits

  • A step-by-step guide to predictive modeling including lots of tips, tricks, and best practices
  • Get to grips with the basics of Predictive Analytics with Python
  • Learn how to use the popular predictive modeling algorithms such as Linear Regression, Decision Trees, Logistic Regression, and Clustering

Description

Social Media and the Internet of Things have resulted in an avalanche of data. Data is powerful but not in its raw form; it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Learning to predict who would win, lose, buy, lie, or die with Python is an indispensable skill set to have in this data age. This book is your guide to getting started with Predictive Analytics using Python. You will see how to process data and make predictive models from it. We balance both statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. You’ll start by getting an understanding of the basics of predictive modeling, then you will see how to cleanse your data of impurities and get it ready for predictive modeling. You will also learn more about the best predictive modeling algorithms such as Linear Regression, Decision Trees, and Logistic Regression. Finally, you will see the best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world.

Who is this book for?

If you wish to learn how to implement Predictive Analytics algorithms using Python libraries, then this is the book for you. If you are familiar with coding in Python (or some other programming/statistical/scripting language) but have never used or read about Predictive Analytics algorithms, this book will also help you. The book will benefit, and can be read by, any Data Science enthusiast. Some familiarity with Python will be useful to get the most out of this book, but it is certainly not a prerequisite.

What you will learn

  • Understand the statistical and mathematical concepts behind Predictive Analytics algorithms and implement Predictive Analytics algorithms using Python libraries
  • Analyze the result parameters arising from the implementation of Predictive Analytics algorithms
  • Write Python modules/functions from scratch to execute segments or the whole of these algorithms
  • Recognize and mitigate various contingencies and issues related to the implementation of Predictive Analytics algorithms
  • Get to know various methods of importing, cleaning, sub-setting, merging, joining, concatenating, exploring, grouping, and plotting data with pandas and numpy
  • Create dummy datasets and simple mathematical simulations using the Python numpy and pandas libraries
  • Understand the best practices while handling datasets in Python and creating predictive models out of them

Product Details

Publication date: Feb 15, 2016
Length: 354 pages
Edition: 1st
Language: English
ISBN-13: 9781783983261

Table of Contents

11 Chapters
1. Getting Started with Predictive Modelling
2. Data Cleaning
3. Data Wrangling
4. Statistical Concepts for Predictive Modelling
5. Linear Regression with Python
6. Logistic Regression with Python
7. Clustering with Python
8. Trees and Random Forests with Python
9. Best Practices for Predictive Modelling
A. A List of Links
Index

Customer reviews

Rating distribution
3.4 out of 5 (11 Ratings)
5 star 36.4%
4 star 9.1%
3 star 27.3%
2 star 9.1%
1 star 18.2%

Top Reviews

adnan baloch Mar 28, 2016
5 out of 5 stars
You don't have to be married to a physicist to appreciate the role of the team at CERN that confirmed the existence of the Higgs Boson. Who better to be a reviewer of this book than a member of that team? That fact itself should inspire confidence in the utility of this book. The author uses interesting analogies to explain the different aspects of predictive analytics and even goes so far as to present comparison tables, serving to drive home his points. The ease and power of the Python programming language is put to good use in explaining the process of data cleaning and wrangling. The better part of the first half of the book is dedicated to exploring the various aspects of these two critical processes with easy to follow examples and code. A whole chapter is devoted to laying out the statistical concepts that are integral to getting the most out of the remainder of the book. The latter part of the book details supervised and unsupervised predictive modelling algorithms, shows how to implement them in Python and furthermore, delves deep into the mathematics of these widely used algorithms so that readers become well equipped to tackle real world challenges of predictive analytics in ANY programming language of their choice. In my opinion, the author really succeeded in making the serious subject matter of this book sound cool and exciting.
Amazon Verified review
A. Zubarev Apr 18, 2016
5 out of 5 stars
In my view Learning Predictive Analytics with Python is one of the most successful publications on such a difficult to initially grasp subject as Machine Learning. Yes, despite the name of the book does not imply so, it is in fact a gentle submersion into the Machine Learning, a so highly praised Data Science topic. Luckily, learning it would be much easier with Learning Predictive Analytics with Python from such a talented author. It is the most exciting yet easy to follow, logical and at the same time entertaining material I ever read so far. Tasteful, relevant examples, based on free software and datasets anyone can obtain. And the book also has several gems, these are the coverage of the ID3 algorithm (based on my observation looks like totally omitted in the most modern literature, but undeservedly), building various regressions and testing your model. One small advice to the reader: get familiarized yourself with iPython, and perhaps read some theory on statistics, not really necessary, but if you are going to apply the newly acquired knowledge at work or study then it could be a great deal of steering you into the right direction.
Amazon Verified review
Julian Cook Mar 13, 2016
5 out of 5 stars
If you are familiar with Packt (the publisher), you will know that they tend to carpet bomb particular areas, with multiple overlapping titles. This makes it difficult to recommend just one title if anyone asks you, since different books have different strengths. The strength of this book is that the author really does explain how to use PANDAS (python data analysis library) and statistical analysis from the ground up. Most pandas users will be familiar with pd.read_csv, but he covered a lot of options that I had never really understood properly, because I chiefly learnt from examples that don't really give you the 'why' of things. You might say, why not read the original book by Wes McKinney? I would have to say that this is a much more interesting read and has better flow. The Wes McKinney book sometimes reads like documentation and you are not sure what to really focus on. The coverage of statistical learning is also good, for instance he does a nice explanation of logistic regression and the underlying methodology with just enough math to properly explain the distinction between linear regression and logistic regression. I think the book is thorough enough that you could actually use it as a coursebook for statistical learning w/python, which is high praise for a book with a fairly generic title.
Amazon Verified review
a reader Sep 26, 2020
5 out of 5 stars
This is a good book. I do not understand why there are bad reviews for it. I would like to thank the author for the good job! Well done! Unfortunately, the author deleted the datasets the book uses from the Google drive.
Amazon Verified review
Jeremie Oct 04, 2017
4 out of 5 stars
Book deserves three to four stars max. It is ok and interesting. It introduces a lot of concepts, but it is a shame it doesn't go a little more into detail, especially at the end of the book when talking about clustering and regression. It is one thing to talk about clustering, but there is nothing about what to do with it once it is done. There isn't much discussion about regression tree and random forest algorithms, which deserve more, such as, for example, what one can do to improve the algos if they don't work well or what other algos are available. Perhaps the book simply needs to advise on further reading.
Amazon Verified review