Data Science Projects with Python

Data Science Projects with Python: A case study approach to successful data science projects using Python, pandas, and scikit-learn


Chapter 1. Data Exploration and Cleaning

Note

Learning Objectives

By the end of this chapter, you will be able to:

  • Perform basic operations in Python

  • Describe the business context of the case study data and its suitability for the task

  • Perform data cleaning operations

  • Examine statistical summaries and visualize the case study data

  • Implement one-hot encoding on categorical variables

Note

This chapter will get you started with basic operations in Python and show you how to perform data-related operations such as data verification, data cleaning, datatype conversion, examining statistical summaries, and more.

Introduction


Most businesses possess a wealth of data on their operations and customers. Reporting on these data in the form of descriptive charts, graphs, and tables is a good way to understand the current state of the business. However, in order to provide quantitative guidance on future business strategies and operations, it is necessary to go a step further. This is where the practices of machine learning and predictive modeling become involved. In this book, we will show how to go from descriptive analyses to concrete guidance for future operations using predictive models.

To accomplish this goal, we'll introduce some of the most widely used machine learning tools via Python and many of its packages. You will also get a sense of the practical skills necessary to execute successful projects: inquisitiveness when examining data and communication with the client. Time spent looking in detail at a dataset and critically examining whether it accurately meets its intended purpose is time well spent. You will learn several techniques for assessing data quality here.

In this chapter, after getting familiar with the basic tools for data exploration, we will discuss a few typical working scenarios for how you may receive data. Then, we will begin a thorough exploration of the case study dataset and help you learn how you can uncover possible issues, so that when you are ready for modeling, you may proceed with confidence.

Python and the Anaconda Package Management System


In this book, we will use the Python programming language. Python is a top language for data science and is one of the fastest growing programming languages. A commonly cited reason for Python's popularity is that it is easy to learn. If you have Python experience, that's great; however, if you have experience with other languages, such as C, Matlab, or R, you shouldn't have much trouble using Python. You should be familiar with the general constructs of computer programming to get the most out of this book. Examples of such constructs are for loops and if statements that guide the control flow of a program. No matter what language you have used, you are likely familiar with these constructs, which you will also find in Python.

A key feature of Python that is different from some other languages is that it is zero-indexed; in other words, the first element of an ordered collection has an index of 0. Python also supports negative indexing, where index -1 refers to the last element of an ordered collection and negative indices count backwards from the end. The slice operator, :, can be used to select multiple elements of an ordered collection from within a range, starting from the beginning, or going to the end of the collection.

Indexing and the Slice Operator

Here, we demonstrate how indexing and the slice operator work. To have something to index, we will create a list, which is a mutable ordered collection that can contain any type of data, including numerical and string types. "Mutable" just means the elements of the list can be changed after they are first assigned. To create the numbers for our list, which will be consecutive integers, we'll use the built-in range() Python function. Technically, range() returns a lazy range object rather than a list, so we'll convert it to a list using the list() function, although you need not be concerned with that detail here. The following screenshot shows a basic list being printed on the console:

Figure 1.1: List creation and indexing

A few things to notice about Figure 1.1: the endpoint of an interval is open for both slice indexing and the range() function, while the starting point is closed. In other words, notice how when we specify the start and end of range(), endpoint 6 is not included in the result but starting point 1 is. Similarly, when indexing the list with the slice [:3], this includes all elements of the list with indices up to, but not including, 3.
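As a minimal sketch of the kind of session Figure 1.1 describes, the following snippet creates a list and indexes it. The variable name example_list is our own choice for illustration and may differ from the one in the figure.

```python
# Build a list of consecutive integers; range(1, 6) stops before 6.
example_list = list(range(1, 6))
print(example_list)       # [1, 2, 3, 4, 5]

# Zero-based indexing: index 0 is the first element.
print(example_list[0])    # 1

# Negative indexing: index -1 is the last element.
print(example_list[-1])   # 5

# Slicing: the start is included, the stop is excluded.
print(example_list[:3])   # [1, 2, 3]
print(example_list[1:3])  # [2, 3]
print(example_list[3:])   # [4, 5]
```

Leaving out the start or the stop of a slice means "from the beginning" or "to the end", respectively.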

We've referred to ordered collections, but Python also includes collections whose elements are looked up by key rather than by position. An important one of these is called a dictionary. A dictionary is a collection of key:value pairs. Dictionaries were traditionally unordered, although as of Python 3.7 they are guaranteed to preserve the order in which keys were inserted. Instead of looking up the values of a dictionary by integer indices, you look them up by keys, which could be numbers or strings. A dictionary can be created using curly braces {}, with the key:value pairs separated by commas. The following screenshot is an example of how we can create a dictionary with counts of fruit, examine the number of apples, then add a new type of fruit and its count:

Figure 1.2: An example dictionary
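The steps described for Figure 1.2 can be sketched as follows. The variable name fruit_counts and the counts themselves are made up for illustration.

```python
# Create a dictionary of fruit counts (illustrative values).
fruit_counts = {'apples': 5, 'oranges': 8}

# Look up a value by its key rather than by an integer position.
print(fruit_counts['apples'])   # 5

# Add a new key:value pair by assigning to a key that does not exist yet.
fruit_counts['bananas'] = 13
print(fruit_counts)             # {'apples': 5, 'oranges': 8, 'bananas': 13}
```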

There are many other distinctive features of Python and we just want to give you a flavor here, without getting into too much detail. In fact, you will probably use packages such as pandas (pandas) and NumPy (numpy) for most of your data handling in Python. NumPy provides fast numerical computation on arrays and matrices, while pandas provides a wealth of data wrangling and exploration capabilities on tables of data called DataFrames. However, it's good to be familiar with some of the basics of Python, the language that sits at the foundation of all of this. For example, zero-based indexing and slicing carry over to NumPy and pandas, although both packages add their own extensions.

One of the strengths of Python is that it is open source and has an active community of developers creating amazing tools. We will use several of these tools in this book. A potential pitfall of having open source packages from different contributors is the dependencies between various packages. For example, if you want to install pandas, it may rely on a certain version of NumPy, which you may or may not have installed. Package management systems make life easier in this respect. When you install a new package through the package management system, it will ensure that all the dependencies are met. If they aren't, you will be prompted to upgrade or install new packages as necessary.
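If you are curious which versions of these packages are present in your environment, one way to check from within Python is sketched below. It relies on the standard library's importlib.metadata module (available in Python 3.8 and later); the helper name installed_version is our own and not part of any package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of two packages used throughout the book.
for name in ['numpy', 'pandas']:
    print(name, installed_version(name))
```

A result of None simply means the package is not installed in the environment that is running the code.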

For this book, we will use the Anaconda package management system, which you should already have installed. While we will only use Python here, it is also possible to run R with Anaconda.

Note

Environments

If you previously had Anaconda installed and were using it prior to this book, you may wish to create a new Python 3.x environment for the book. Environments are like separate installations of Python, where the set of packages you have installed can be different, as well as the version of Python. Environments are useful for developing projects that need to be deployed in different versions of Python. For more information, see https://conda.io/docs/user-guide/tasks/manage-environments.html.

Exercise 1: Examining Anaconda and Getting Familiar with Python

In this exercise, you will examine the packages in your Anaconda installation and practice some basic Python control flow and data structures, including a for loop, dict, and list. This will confirm that you have completed the installation steps in the preface and show you how Python syntax and data structures may be a little different from other programming languages you may be familiar with. Perform the following steps to complete the exercise:

Note

The code file for this exercise can be found here: http://bit.ly/2Oyag1h.

  1. Open up a Terminal, if you're using macOS or Linux, or a Command Prompt window in Windows. Type conda list at the command line. You should observe an output similar to the following:

    Figure 1.3: Selection of packages from conda list

    You can see all the packages installed in your environment. Look how many packages already come with the default Anaconda installation! These include all the packages we will need for the book. However, installing new packages and upgrading existing ones is easy and straightforward with Anaconda; this is one of the main advantages of a package management system.

  2. Type python in the Terminal to open a command-line Python interpreter. You should obtain an output similar to the following:

    Figure 1.4: Command-line Python

    You should also see some information about your version of Python, as well as the Python command prompt (>>>). When you type after this prompt, you are writing Python code.

    Note

    Although we will be using the Jupyter Notebook in this book, one of the aims of this exercise is to go through the basic steps of writing and running Python programs on the command prompt.

  3. Write a for loop at the command prompt to print values from 0 to 4 using the following code:

    >>> for counter in range(5):
    ...     print(counter)
    ...

    Once you hit Enter when you see (...) on the prompt, you should obtain this output:

    Figure 1.5: Output of a for loop at the command line

    Notice that in Python, the opening of the for loop is followed by a colon, and the body of the loop requires indentation. It's typical to use four spaces to indent a code block. Here, the for loop steps through the values produced by range(), assigning each one in turn to the counter variable with the in keyword, and prints it.

    Note

    For many more details on Python code conventions, refer to the following: https://www.python.org/dev/peps/pep-0008/.

    Now, we will return to our dictionary example. The first step here is to create the dictionary.

  4. Create a dictionary of fruits (apples, oranges, and bananas) using the following code:

    example_dict = {'apples':5, 'oranges':8, 'bananas':13}

  5. Convert the dictionary to a list using the list() function, as shown in the following snippet:

    dict_to_list = list(example_dict)
    dict_to_list

    Once you run the preceding code, you should obtain the following output:

    Figure 1.6: Dictionary keys converted to a list

    Notice that when this is done and we examine the contents, only the keys of the dictionary have been captured in the list. If we wanted the values, we would have had to specify that with the .values() method of the dictionary. Also, notice that the list of dictionary keys is in the same order that we wrote them in when creating the dictionary. As of Python 3.7, dictionaries are guaranteed to preserve insertion order; in earlier versions, this order was not guaranteed.

    One convenient thing you can do with lists is to append other lists to them with the + operator. As an example, in the next step we will combine the existing list of fruit with a list that contains just one more type of fruit, overwriting the variable containing the original list.

  6. Use the + operator to combine the existing list of fruits with a new list containing only one fruit (pears):

    dict_to_list = dict_to_list + ['pears']
    dict_to_list

    Figure 1.7: Appending to a list

    What if we wanted to sort our list of fruit types?

    Python provides a built-in sorted() function that can be used for this; it will return a sorted version of the input. In our case, this means the list of fruit types will be sorted alphabetically.

  7. Sort the list of fruits in alphabetical order using the sorted() function, as shown in the following snippet:

    sorted(dict_to_list)

    Once you run the preceding code, you should see the following output:

    Figure 1.8: Sorting a list
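Since the screenshots are not reproduced here, steps 3 to 7 of the exercise can be collected into a single script, sketched below. It uses the same variable names as the steps and also shows the .values() call mentioned in step 5.

```python
# Step 3: loop over the values 0 through 4.
for counter in range(5):
    print(counter)

# Step 4: create the dictionary.
example_dict = {'apples': 5, 'oranges': 8, 'bananas': 13}

# Step 5: list() on a dictionary captures only its keys.
dict_to_list = list(example_dict)
print(dict_to_list)                 # ['apples', 'oranges', 'bananas']

# The values have to be requested explicitly.
print(list(example_dict.values()))  # [5, 8, 13]

# Step 6: concatenate with a one-element list, overwriting the variable.
dict_to_list = dict_to_list + ['pears']
print(dict_to_list)                 # ['apples', 'oranges', 'bananas', 'pears']

# Step 7: sorted() returns a new sorted list and leaves the original unchanged.
print(sorted(dict_to_list))         # ['apples', 'bananas', 'oranges', 'pears']
```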

That's enough Python for now. We will show you how to execute the code for this book, so your Python knowledge should improve along the way.

Note

As you learn more and inevitably want to try new things, you will want to consult the documentation: https://docs.python.org/3/.

Different Types of Data Science Problems


Much of your time as a data scientist is likely to be spent wrangling data: figuring out how to get it, getting it, examining it, making sure it's correct and complete, and joining it with other types of data. pandas will facilitate this process for you. However, if you aspire to be a machine learning data scientist, you will need to master the art and science of predictive modeling. This means using a mathematical model, or idealized mathematical formulation, to learn the relationships within the data, in the hope of making accurate and useful predictions when new data comes in.

For this purpose, data is typically organized in a tabular structure, with features and a response variable. For example, if you want to predict the price of a house based on some characteristics about it, such as area and number of bedrooms, these attributes would be considered the features and the price of the house would be the response variable. The response variable is sometimes called the target variable or dependent variable, while the features may also be called the independent variables.

If you have a dataset of 1,000 houses including the values of these features and the prices of the houses, you can say you have 1,000 samples of labeled data, where the labels are the known values of the response variable: the prices of different houses. Most commonly, the tabular data structure is organized so that different rows are different samples, while features and the response occupy different columns, along with other metadata such as sample IDs, as shown in Figure 1.9.

Figure 1.9: Labeled data (the house prices are the known target variable)
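A table of the kind described for Figure 1.9 can be sketched with pandas as follows. All values and column names here are made up for illustration; they are not taken from the figure.

```python
import pandas as pd

# Made-up labeled data: one row per house (sample).
houses = pd.DataFrame({
    'house_id': [1, 2, 3],                # metadata: sample ID
    'area_sq_ft': [1500, 2200, 900],      # feature
    'bedrooms': [3, 4, 2],                # feature
    'price': [300000, 450000, 180000],    # response (the label)
})

# Separate the features from the response variable.
features = houses[['area_sq_ft', 'bedrooms']]
response = houses['price']

print(features.shape)  # (3, 2)
print(response.shape)  # (3,)
```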

Regression Problem

Once you have trained a model to learn the relationship between the features and response using your labeled data, you can then use it to make predictions for houses where you don't know the price, based on the information contained in the features. The goal of predictive modeling in this case is to be able to make a prediction that is close to the true value of the house. Since we are predicting a numerical value on a continuous scale, this is called a regression problem.

Classification Problem

On the other hand, if we were trying to make a qualitative prediction about the house, to answer a yes or no question such as "will this house go on sale within the next five years?" or "will the owner default on the mortgage?", we would be solving what is known as a classification problem. Here, we would hope to answer the yes or no question correctly. The following figure is a schematic illustrating how model training works, and what the outcomes of regression or classification models might be:

Figure 1.10: Schematic of model training and prediction for regression and classification
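To make the distinction concrete, here is a minimal sketch using scikit-learn on made-up house data. The choice of LinearRegression and LogisticRegression is ours, purely for illustration, and the numbers are invented; the point is only that a regression model returns continuous values while a classification model returns class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up features: [area in thousands of square feet, number of bedrooms].
X_train = np.array([[0.9, 2], [1.2, 2], [1.5, 3], [1.8, 3], [2.2, 4], [2.6, 4]])

# Made-up responses: a continuous price and a yes/no (1/0) outcome.
price = np.array([180000, 230000, 300000, 350000, 450000, 520000])
sold_within_5_years = np.array([0, 0, 1, 0, 1, 1])

# New houses for which the response is unknown.
X_new = np.array([[1.6, 3], [2.4, 4]])

# Regression: predictions lie on a continuous numerical scale.
regressor = LinearRegression().fit(X_train, price)
price_predictions = regressor.predict(X_new)
print(price_predictions)

# Classification: predictions are one of the known class labels.
classifier = LogisticRegression(max_iter=1000).fit(X_train, sold_within_5_years)
class_predictions = classifier.predict(X_new)
print(class_predictions)
```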

Classification and regression tasks are called supervised learning, which is a class of problems that relies on labeled data. These problems can be thought of as needing "supervision" by the known values of the target variable. By contrast, there is also unsupervised learning, which relates to more open-ended questions of trying to find some sort of structure in a dataset that does not necessarily have labels. Taking a broader view, any kind of applied math problem, including fields as varied as optimization, statistical inference, and time series modeling, may potentially be considered an appropriate responsibility for a data scientist.

Loading the Case Study Data with Jupyter and pandas


Now it's time to take a first look at the data we will use in our case study. We won't do anything in this section other than ensure that we can load the data into a Jupyter Notebook correctly. Examining the data, and understanding the problem you will solve with it, will come later.

The data file is an Excel spreadsheet called default_of_credit_card_clients__courseware_version_1_21_19.xls. We recommend you first open the spreadsheet in Excel, or the spreadsheet program of your choice. Note the number of rows and columns, and look at some example values. This will help you know whether or not you have loaded it correctly in the Jupyter Notebook.

Note

The dataset can be obtained from the following link: http://bit.ly/2HIk5t3. This is a modified version of the original dataset, which has been sourced from the UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

What is a Jupyter Notebook?

Jupyter Notebooks are interactive coding environments that allow for in-line text and graphics. They are great tools for data scientists to communicate and preserve their results, since both the methods (code) and the message (text and graphics) are integrated. You can think of the environment as a kind of webpage where you can write and execute code. Jupyter Notebooks can, in fact, be rendered as web pages, and GitHub displays them this way. Here is one of our example notebooks: http://bit.ly/2OvndJg. Look it over and get a sense of what you can do. An excerpt from this notebook is displayed here, showing code, graphics, and prose, known as markdown in this context:

Figure 1.11: Example of a Jupyter Notebook showing code, graphics, and markdown text

One of the first things to learn about Jupyter Notebooks is how to navigate around and make edits. There are two modes available to you. If you select a cell and press Enter, you are in edit mode and you can edit the text in that cell. If you press Esc, you are in command mode and you can navigate around the notebook.

When you are in command mode, there are many useful hotkeys you can use. The Up and Down arrows will help you select different cells and scroll through the notebook. If you press y on a selected cell in command mode, it changes it to a code cell, in which the text is interpreted as code. Pressing m changes it to a markdown cell. Shift + Enter evaluates the cell, rendering the markdown or executing the code, as the case may be.

Our first task in our first Jupyter Notebook will be to load the case study data. To do this, we will use a tool called pandas. It is probably not a stretch to say that pandas is the pre-eminent data-wrangling tool in Python.

A DataFrame is a foundational class in pandas. We'll talk more about what a class is later, but you can think of it as a template for a data structure, where a data structure is something like the lists or dictionaries we discussed earlier. However, a DataFrame is much richer in functionality than either of these. A DataFrame is similar to spreadsheets in many ways. There are rows, which are labeled by a row index, and columns, which are usually given column header-like labels that can be thought of as a column index. Index is, in fact, a data type in pandas used to store indices for a DataFrame, and columns have their own data type called Series.
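The relationship between these types can be checked directly, as in the following sketch. The DataFrame here is a made-up toy example, and the name df_example is our own.

```python
import pandas as pd

# A small made-up DataFrame for illustration.
df_example = pd.DataFrame({'ID': ['a1', 'b2', 'c3'], 'AGE': [24, 35, 41]})

# The table as a whole is a DataFrame.
print(isinstance(df_example, pd.DataFrame))      # True

# Selecting a single column gives a Series.
print(isinstance(df_example['AGE'], pd.Series))  # True

# Both the row labels and the column labels are stored as Index objects.
print(isinstance(df_example.index, pd.Index))    # True
print(isinstance(df_example.columns, pd.Index))  # True
```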

You can do a lot of the same things with a DataFrame that you can do with Excel sheets, such as creating pivot tables and filtering rows. pandas also includes SQL-like functionality. You can join different DataFrames together, for example. Another advantage of DataFrames is that once your data is contained in one of them, you have the capabilities of a wealth of pandas functionality at your fingertips. The following figure is an example of a pandas DataFrame:

Figure 1.12: Example of a pandas DataFrame with an integer row index at the left and a column index of strings

The example in Figure 1.12 is in fact the data for the case study. As a first step with Jupyter and pandas, we will now see how to create a Jupyter Notebook and load data with pandas. There are several convenient methods and attributes you can use in pandas to explore your data, including .head() to see the first few rows of the DataFrame, .info() to see all columns with datatypes, .columns to return the column names, and others we will learn about in the following exercises.
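These exploration tools can be tried on any DataFrame. The sketch below uses a small made-up table; the column names ID, LIMIT_BAL, and AGE are borrowed from the case study, but the values are invented.

```python
import pandas as pd

# Made-up values for illustration only.
df_example = pd.DataFrame({
    'ID': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6'],
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 50000, 500000],
    'AGE': [24, 26, 34, 37, 57, 29],
})

# .head() returns the first five rows by default.
print(df_example.head())

# .info() prints the column names, non-null counts, and datatypes.
df_example.info()

# .columns is an attribute holding the column labels as an Index object.
print(df_example.columns)

# Convert to a plain Python list of strings if needed.
print(list(df_example.columns))   # ['ID', 'LIMIT_BAL', 'AGE']
```

Note that .head() and .info() are methods and need parentheses, while .columns is an attribute and does not.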

Exercise 2: Loading the Case Study Data in a Jupyter Notebook

Now that you've learned about Jupyter Notebooks, the environment in which we'll write code, and pandas, the data wrangling package, let's create our first Jupyter Notebook. We'll use pandas within this notebook to load the case study data and briefly examine it. Perform the following steps to complete the exercise:

Note

For Exercises 2–5 and Activity 1, the code and the resulting output have been loaded in a Jupyter Notebook that can be found at http://bit.ly/2W9cwPH. You can scroll to the appropriate section within the Jupyter Notebook to locate the exercise or activity of choice.

  1. Open a Terminal (macOS or Linux) or a Command Prompt window (Windows) and type jupyter notebook.

    You will be presented with the Jupyter interface in your web browser. If the browser does not open automatically, copy and paste the URL from the terminal into your browser. In this interface, you can navigate around your directories starting from the directory you were in when you launched the notebook server.

  2. Navigate to a convenient location where you will store the materials for this book, and create a new Python 3 notebook from the New menu, as shown here:

    Figure 1.13: Jupyter home screen

  3. Make your very first cell a markdown cell by typing m while in command mode (press Esc to enter command mode), then type a number sign, #, at the beginning of the first line, followed by a space, for a heading. Make a title for your notebook here. On the next few lines, place a description.

    Here is a screenshot of an example, including other kinds of markdown such as bold, italics, and the way to write code-style text in a markdown cell:

    Figure 1.14: Unrendered markdown cell

    Note that it is good practice to make a title and brief description of your notebook, to identify its purpose to readers.

  4. Press Shift + Enter to render the markdown cell.

    This should also create a new cell, which will be a code cell. You can change it to a markdown cell, as you now know how to do, by pressing m, and back to a code cell by pressing y. You will know it's a code cell because of the In [ ]: next to it.

  5. Type import pandas as pd in the new cell, as shown in the following screenshot:

    Figure 1.15: Rendered markdown cell and code cell

    After you execute this cell, the pandas library will be loaded into your computing environment. It's common to import libraries with "as" to create a short alias for the library. Now, we are going to use pandas to load the data file. It's in Microsoft Excel format, so we can use pd.read_excel.

    Note

    For more information on all the possible options for pd.read_excel, refer to the following documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html.

  6. Import the dataset, which is in the Excel format, as a DataFrame using the pd.read_excel() function, as shown in the following snippet:

    df = pd.read_excel('../Data/default_of_credit_card_clients__courseware_version_1_21_19.xls')

    Note that you need to point the Excel reader to wherever the file is located. If it's in the same directory as your notebook, you could just enter the filename. The pd.read_excel function will load the Excel file into a DataFrame, which we've called df. The power of pandas is now available to us.

    Let's do some quick checks in the next few steps. First, do the numbers of rows and columns match what we know from looking at the file in Excel?

  7. Use the .shape attribute to review the number of rows and columns, as shown in the following snippet:

    df.shape

    Once you run the cell, you will obtain the following output:

    Figure 1.16: Checking the shape of a DataFrame

    This should match your observations from the spreadsheet. If it doesn't, you would then need to look into the various options of pd.read_excel to see if you needed to adjust something.

With this exercise, we have successfully loaded our dataset into the Jupyter Notebook. You can also have a look at the .info() and .head() methods, which will tell you information about all the columns, and show you the first few rows of the DataFrame, respectively. Now you're up and running with your data in pandas.
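The loading and checking steps of this exercise can be combined into one cell, sketched below. The check for whether the file exists is our own addition rather than part of the exercise, and the path is an assumption: adjust it and the filename to match wherever you saved your copy of the spreadsheet.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the spreadsheet; change this to match your setup.
data_path = Path('../Data/default_of_credit_card_clients__courseware_version_1_21_19.xls')

if data_path.exists():
    # Load the Excel file into a DataFrame.
    df = pd.read_excel(data_path)
    # (number of rows, number of columns); compare with the spreadsheet.
    print(df.shape)
    # First few rows, as a quick visual check.
    print(df.head())
else:
    print(f'Could not find {data_path}; check the location of the file.')
```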

As a final note, while this may already be clear, observe that if you define a variable in one code cell, it is available to you in other code cells within the notebook. The code cells within a notebook share scope as long as the kernel is running, as shown in the following screenshot:

Figure 1.17: Variable in scope between cells

Getting Familiar with Data and Performing Data Cleaning

Now let's imagine we are taking our first look at this data. In your work as a data scientist, there are several possible scenarios in which you may receive such a dataset. These include the following:

  1. You created the SQL query that generated the data.

  2. A colleague wrote a SQL query for you, with your input.

  3. A colleague who knows about the data gave it to you, but without your input.

  4. You are given a dataset about which little is known.

In cases 1 and 2, your input was involved in generating/extracting the data. In these scenarios, you probably understood the business problem and then either found the data you needed with the help of a data engineer, or did your own research and designed the SQL query that generated the data. Often, especially as you gain more experience in your data science role, the first step will be to meet with the business partner to understand, and refine the mathematical definition of, the business problem. Then, you would play a key role in defining what is in the dataset.

Even if you have a relatively high level of familiarity with the data, doing data exploration and looking at summary statistics of different variables is still an important first step. This step will help you select good features, or give you ideas about how you can engineer new features. However, in the third and fourth cases, where your input was not involved or you have little knowledge about the data, data exploration is even more important.

Another important initial step in the data science process is examining the data dictionary. The data dictionary, as the term implies, is a document that explains what the data owner thinks should be in the data, such as definitions of the column labels. It is the data scientist's job to go through the data carefully to make sure that these impressions match the reality of what is in the data. In cases 1 and 2, you will probably need to create the data dictionary yourself, which should be considered essential project documentation. In cases 3 and 4, you should seek out the dictionary if at all possible.

The case study data we'll use in this book most closely resembles case 3 here.

The Business Problem

Our client is a credit card company. They have brought us a dataset that includes some demographics and recent financial data (the past six months) for a sample of 30,000 of their account holders. This data is at the credit account level; in other words, there is one row for each account (you should always clarify what the definition of a row is, in a dataset). Rows are labeled by whether, in the month following the six-month historical data period, the account owner defaulted, or in other words, failed to make the minimum payment.

Goal

Your goal is to develop a predictive model for whether an account will default next month, given demographics and historical data. Later in the book, we'll discuss the practical application of the model.

The data is already prepared, and a data dictionary is available. The dataset supplied with the book, default_of_credit_card_clients__courseware_version_1_21_19.xls, is a modified version of this dataset in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Have a look at that web page, which includes the data dictionary.

Note

The original dataset has been obtained from UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. In this book, we have modified the dataset to suit the book objectives. The modified dataset can be found here: http://bit.ly/2HIk5t3.

Data Exploration Steps

Now that we've understood the business problem and have an idea of what is supposed to be in the data, we can compare these impressions to what we actually see in the data. Your job in data exploration is to not only look through the data both directly and using numerical and graphical summaries, but also to think critically about whether the data make sense and match what you have been told about them. These are helpful steps in data exploration:

  1. How many columns are there in the data?

    These may be features, response, or metadata.

  2. How many rows (samples)?

  3. What kind of features are there? Which are categorical and which are numerical?

    Categorical features have values in discrete classes such as "Yes," "No," or "Maybe." Numerical features are typically on a continuous numerical scale, such as dollar amounts.

  4. What does the data look like in these features?

    To see this, you can examine the range of values in numeric features, or the frequency of different classes in categorical features, for example.

  5. Is there any missing data?

We have already answered questions 1 and 2 in the previous section; there are 30,000 rows and 25 columns. As we start to explore the rest of these questions in the following exercise, pandas will be our go-to tool. We begin by verifying basic data integrity in the next exercise.
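As a rough sketch of how pandas can address each of the five questions, consider the following toy example. The data is made up (including a deliberately missing value), and only the column names are borrowed from the case study.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; one LIMIT_BAL value is missing on purpose.
toy = pd.DataFrame({
    'ID': ['a1', 'b2', 'c3', 'd4'],
    'EDUCATION': [1, 2, 2, 3],                           # categorical, coded as integers
    'LIMIT_BAL': [20000.0, 120000.0, np.nan, 50000.0],   # numerical
})

# Questions 1 and 2: how many rows and columns?
n_rows, n_cols = toy.shape
print(n_rows, n_cols)                  # 4 3

# Question 3: datatypes give a first hint, but integer codes may be categorical.
print(toy.dtypes)

# Question 4: range of a numerical feature, frequencies of a categorical one.
print(toy['LIMIT_BAL'].describe())
education_counts = toy['EDUCATION'].value_counts()
print(education_counts)

# Question 5: count of missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Keep in mind that the datatype alone does not settle question 3: a feature such as EDUCATION is stored as integers but represents discrete categories.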

Note

Note that compared to the website's description of the data dictionary, X6-X11 are called PAY_1-PAY_6 in our data. Similarly, X12-X17 are BILL_AMT1-BILL_AMT6, and X18-X23 are PAY_AMT1-PAY_AMT6.

Exercise 3: Verifying Basic Data Integrity

In this exercise, we will perform a basic check on whether our dataset contains what we expect and verify whether there are the correct number of samples.

The data are supposed to have observations for 30,000 credit accounts. While there are 30,000 rows, we should also check whether there are 30,000 unique account IDs. It's possible that, if the SQL query used to generate the data was run on an unfamiliar schema, values that are supposed to be unique are in fact not unique.
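To see what such a problem looks like, here is a minimal sketch on a made-up table in which one ID is repeated. The data and the name toy are purely illustrative.

```python
import pandas as pd

# Made-up data: the ID 'b2' appears twice.
toy = pd.DataFrame({
    'ID': ['a1', 'b2', 'b2', 'c3'],
    'LIMIT_BAL': [20000, 120000, 120000, 90000],
})

# Compare the number of rows with the number of unique IDs.
n_rows = toy.shape[0]
n_unique_ids = toy['ID'].nunique()
print(n_rows, n_unique_ids)   # 4 3

# List the IDs that occur more than once.
id_counts = toy['ID'].value_counts()
duplicated_ids = id_counts[id_counts > 1].index.tolist()
print(duplicated_ids)         # ['b2']
```

When the two numbers differ, at least one ID is shared by more than one row.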

To examine this, we can check if the number of unique account IDs is the same as the number of rows. Perform the following steps to complete the exercise:

Note

The code and the resulting output graphics for this exercise have been loaded in a Jupyter Notebook that can be found here: http://bit.ly/2W9cwPH.

  1. Examine the column names by running the following command in the cell:

    df.columns

    The .columns attribute of the DataFrame is used to examine all the column names. You will obtain the following output once you run the cell:

    Figure 1.18: Columns of the dataset

    As can be observed, all column names are listed in the output. The account ID column is referenced as ID. The remaining columns appear to be our features, with the last column being the response variable. Let's quickly review the dataset information that was given to us by the client:

    LIMIT_BAL: Amount of the credit provided (in New Taiwan (NT) dollars), including individual consumer credit and the family (supplementary) credit.

    SEX: Gender (1 = male; 2 = female).

    Note

    We will not be using the gender data to decide credit-worthiness owing to ethical considerations.

    EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

    MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).

    AGE: Age (year).

    PAY_1-PAY_6: A record of past payments. Past monthly payments, recorded from April to September, are stored in these columns.

    PAY_1 represents the repayment status in September; PAY_2 = repayment status in August; and so on up to PAY_6, which represents the repayment status in April.

    The measurement scale for the repayment status is as follows: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; and so on up to 8 = payment delay for eight months; 9 = payment delay for nine months and above.

    BILL_AMT1-BILL_AMT6: Bill statement amount (in NT dollars).

    BILL_AMT1 represents the bill statement amount in September; BILL_AMT2 represents the bill statement amount in August; and so on up to BILL_AMT6, which represents the bill statement amount in April.

    PAY_AMT1-PAY_AMT6: Amount of previous payment (in NT dollars). PAY_AMT1 represents the amount paid in September; PAY_AMT2 represents the amount paid in August; and so on up to PAY_AMT6, which represents the amount paid in April.

    Let's now use the .head() method in the next step to observe the first few rows of data.

  2. Type the following command in the subsequent cell and run it using Shift + Enter:

    df.head()

    You will observe the following output:

    Figure 1.19: .head() of a DataFrame

    The ID column seems like it contains unique identifiers. Now, to verify if they are in fact unique throughout the whole dataset, we can count the number of unique values using the .nunique() method on the Series (aka column) ID. We first select the column using square brackets.

  3. Select the target column (ID) and count unique values using the following command:

    df['ID'].nunique()

    You will see in the following output that the number of unique entries is 29,687:

    Figure 1.20: Finding a data quality issue

  4. Run the following command to obtain the number of rows in the dataset:

    df.shape 

    As can be observed in the following output, the total number of rows in the dataset is 30,000:

    Figure 1.21: Dimensions of the dataset

    We see here that the number of unique IDs is less than the number of rows. This implies that the ID is not a unique identifier for the rows of the data, as we thought. So we know that there is some duplication of IDs. But how much? Is one ID duplicated many times? How many IDs are duplicated?

    We can use the .value_counts() method on the ID series to start to answer these questions. This is similar to a group by/count procedure in SQL. It will list the unique IDs and how often they occur. We will perform this operation in the next step and store the value counts in a variable id_counts.

  5. Store the value counts in a variable defined as id_counts and then display the stored values using the .head() method, as shown:

    id_counts = df['ID'].value_counts()
    id_counts.head()

    You will obtain the following output:

    Figure 1.22: Getting value counts of the account IDs

    Note that .head() returns the first five rows by default. You can specify the number of items to be displayed by passing the required number in the parentheses, ().

  6. Display the number of grouped duplicated entries by running another value count:

    id_counts.value_counts()

    You will obtain the following output:

    Figure 1.23: Getting value counts of the account IDs

    In the preceding output and from the initial value count, we can see that most IDs occur exactly once, as expected. However, 313 IDs occur twice, and no ID occurs more than twice. Armed with this information, we are ready to begin taking a closer look at this data quality issue and fixing it. We will be creating Boolean masks to further clean the data.
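The checks from this exercise can be sketched on a small made-up Series. The IDs below are hypothetical and chosen so that two of them appear twice; the calls are the same .nunique() and .value_counts() methods used above.

```python
import pandas as pd

# Hypothetical account IDs; 'b' and 'd' each appear twice
ids = pd.Series(['a', 'b', 'b', 'c', 'd', 'd', 'e'], name='ID')

n_rows = len(ids)
n_unique = ids.nunique()

id_counts = ids.value_counts()                # occurrences of each ID
counts_of_counts = id_counts.value_counts()   # how many IDs occur once, twice, ...

print(n_rows, n_unique)
print(counts_of_counts)
```

When n_unique is smaller than n_rows, the second value count shows at a glance whether the duplication is limited to pairs or whether some IDs repeat many times.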

Boolean Masks

To help clean the case study data, we introduce the concept of a logical mask, also known as a Boolean mask. A logical mask is a way to filter an array, or series, by some condition. For example, we can use the "is equal to" operator in Python, ==, to find all locations of an array that contain a certain value. Other comparisons, such as "greater than" (>), "less than" (<), "greater than or equal to" (>=), and "less than or equal to" (<=), can be used similarly. The output of such a comparison is an array or series of True/False values, also known as Boolean values. Each element of the output corresponds to an element of the input, is True if the condition is met, and is False otherwise. To illustrate how this works, we will use synthetic data. Synthetic data is data that is created to explore or illustrate a concept. First, we are going to import the NumPy package, which has many capabilities for generating random numbers, and give it the alias np:

import numpy as np

Now we use what's called a seed for the random number generator. If you set the seed, you will get the same results from the random number generator across runs. Otherwise this is not guaranteed. This can be a helpful option if you use random numbers in some way in your work and want to have consistent results every time you run a notebook:

np.random.seed(seed=24)
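A short sketch of what the seed buys you: resetting the same seed restarts the same sequence of random numbers. The use of np.random.random and the variable names here are illustrative choices, not part of the exercise.

```python
import numpy as np

np.random.seed(seed=24)
first_draw = np.random.random(size=3)

np.random.seed(seed=24)    # resetting the seed restarts the same sequence
second_draw = np.random.random(size=3)

third_draw = np.random.random(size=3)   # continues the sequence without reseeding

print(np.array_equal(first_draw, second_draw))
print(np.array_equal(first_draw, third_draw))
```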

Next, we generate 100 random integers, chosen from 1 to 4 (inclusive). For this we can use numpy.random.randint, with the appropriate arguments. Note that the high argument is exclusive, so high=5 yields values from 1 through 4.

random_integers = np.random.randint(low=1,high=5,size=100)

Let's look at the first five elements of this array, with random_integers[:5]. The output should appear as follows:

Figure 1.24: Random integers
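Because the high argument of numpy.random.randint is exclusive, it can be worth confirming which values were actually generated. The following sketch uses np.unique for this; that check is an addition for illustration and is not part of the original exercise.

```python
import numpy as np

np.random.seed(seed=24)
random_integers = np.random.randint(low=1, high=5, size=100)

# Distinct values that occur, and how often each one appears
values, counts = np.unique(random_integers, return_counts=True)

print(values)
print(counts)
```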

Suppose we wanted to know the locations of all elements of random_integers equal to 3. We could create a Boolean mask to do this.

is_equal_to_3 = random_integers == 3

From examining the first 5 elements, we know the first element is equal to 3, but none of the rest are. So in our Boolean mask, we expect True in the first position and False in the next 4 positions. Is this the case?

is_equal_to_3[:5]

The preceding code should give this output:

Figure 1.25: Boolean mask for the random integers

This is what we expected. This shows the creation of a Boolean mask. But what else can we do with them? Suppose we wanted to know how many elements were equal to 3. To know this, you can take the sum of a Boolean mask, which interprets True as 1 and False as 0:

sum(is_equal_to_3)

This should give us the following output:

Figure 1.26: Sum of the Boolean mask

This makes sense, as with a random, equally likely choice of 4 possible values (randint excludes the high endpoint of 5), we would expect each value to appear about 25% of the time. In addition to seeing how many values in the array meet the Boolean condition, we can also use the Boolean mask to select the elements of the original array that meet that condition. Boolean masks can be used directly to index arrays, as shown here:

random_integers[is_equal_to_3]

This outputs the elements of random_integers meeting the Boolean condition we specified. In this case, the 22 elements equal to 3:

Figure 1.27: Using the Boolean mask to index an array
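To make the mechanics concrete without relying on random numbers, here is a small sketch on a hand-written array. The values are made up, and np.count_nonzero is shown only as an alternative way of counting True entries.

```python
import numpy as np

values = np.array([3, 1, 4, 3, 2, 3])   # made-up values

is_equal_to_3 = values == 3
at_least_3 = values >= 3

n_equal = is_equal_to_3.sum()                  # True counts as 1, False as 0
n_equal_alt = np.count_nonzero(is_equal_to_3)  # counts the True entries directly
selected = values[at_least_3]                  # elements where the mask is True

print(n_equal, n_equal_alt)
print(selected)
```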

Now you know the basics of Boolean arrays, which are useful in many situations. In particular, you can use the .loc method of DataFrames to index the rows of the DataFrames by a Boolean mask, and the columns by label. Let's continue exploring the case study data with these skills.
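As a minimal sketch of indexing with .loc, the example below selects rows by a Boolean mask and columns by label. The DataFrame, its column names, and the threshold of 40 are hypothetical.

```python
import pandas as pd

# Hypothetical toy DataFrame
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'AGE': [25, 41, 37, 52],
    'LIMIT': [10000, 80000, 30000, 60000],
})

over_40 = toy['AGE'] > 40                     # Boolean Series, one entry per row
subset = toy.loc[over_40, ['ID', 'LIMIT']]    # rows by mask, columns by label

print(subset)
```

Passing : as the second argument instead of a list of labels would keep all columns.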

Exercise 4: Continuing Verification of Data Integrity

In this exercise, with our knowledge of Boolean arrays, we will examine some of the duplicate IDs we discovered. In Exercise 3, we learned that no ID appears more than twice. We can use this learning to locate the duplicate IDs and examine them. Then we take action to remove rows of dubious quality from the dataset. Perform the following steps to complete the exercise:

Note

The code and the output graphics for this exercise have been loaded in a Jupyter Notebook that can be found here: http://bit.ly/2W9cwPH.

  1. Continuing where we left off in Exercise 3, we want to locate the duplicates: the entries of the id_counts series where the count is 2. We assign a Boolean mask marking these entries to a variable called dupe_mask and display its first five elements using the following commands:

    dupe_mask = id_counts == 2
    dupe_mask[0:5] 

    You will obtain the following output:

    Figure 1.28: A Boolean mask to locate duplicate IDs

    Here, dupe_mask is the logical mask that we have created for storing the Boolean values.

    Note that in the preceding output, we are displaying only the first five entries of dupe_mask to illustrate the contents of this array. As always, you can edit the indices in the square brackets ([]) to change the number of entries displayed.

    Our next step is to use this logical mask to select the IDs that are duplicated. The IDs themselves are contained as the index of the id_counts series. We can access the index in order to use our logical mask for selection purposes.

  2. Access the index of id_counts and display the first five entries as context using the following command:

    id_counts.index[0:5]

    With this, you will obtain the following output:

    Figure 1.29: Duplicated IDs

  3. Select and store the duplicated IDs in a new variable called dupe_ids using the following command:

    dupe_ids = id_counts.index[dupe_mask]

  4. Convert dupe_ids to a list and then obtain the length of the list using the following commands:

    dupe_ids = list(dupe_ids)
    len(dupe_ids)

    You should obtain the following output:

    Figure 1.30: Output displaying the list length

    We changed the dupe_ids variable to a list, as we will need it in this form for future steps. The list has a length of 313, as can be seen in the preceding output, which matches our knowledge of the number of duplicate IDs from the value count.

  5. We verify the data in dupe_ids by displaying the first five entries using the following command:

    dupe_ids[0:5]

    We obtain the following output:

    Figure 1.31: Making a list of duplicate IDs

    We can observe from the preceding output that the list contains the required entries of duplicate IDs. We're now in a position to examine the data for the IDs in our list of duplicates. In particular, we'd like to look at the values of the features, to see what, if anything, might be different between these duplicate entries. We will use the .isin and .loc methods for this purpose.

    Using the first three IDs on our list of dupes, dupe_ids[0:3], we first find the rows containing these IDs. If we pass this list of IDs to the .isin method of the ID series, this will create another logical mask we can use on the larger DataFrame to display the rows that have these IDs. The .isin method is nested in a .loc statement indexing the DataFrame in order to select all rows where the Boolean mask is True. The second argument of the .loc indexing statement is :, which implies that all columns will be selected. By performing the following step, we are essentially filtering the DataFrame in order to view all the columns for the first three duplicate IDs.

  6. Run the following command in your Notebook to execute the plan we formulated in the previous step:

    df.loc[df['ID'].isin(dupe_ids[0:3]),:].head(10) 

    Figure 1.32: Examining the data for duplicate IDs

    What we observe here is that each duplicate ID appears to have one row with what seems like valid data, and one row of entirely zeros. Take a moment and think to yourself what you would do with this knowledge.

    After some reflection, it should be clear that you ought to delete the rows with all zeros. Perhaps these arose through a faulty join condition in the SQL query that generated the data? Regardless, a row of all zeros is definitely invalid data as it makes no sense for someone to have an age of 0, a credit limit of 0, and so on.

    One approach to deal with this issue would be to find rows that have all zeros, except for the first column, which has the IDs. These would be invalid data in any case, and it may be that if we get rid of all of these, we would also solve our problem of duplicate IDs. We can find the entries of the DataFrame that are equal to zero by creating a Boolean matrix that is the same size as the whole DataFrame, based on the "is equal to zero" condition.

  7. Create a Boolean matrix of the same size as the entire DataFrame using ==, as shown:

    df_zero_mask = df == 0

    In the next steps, we'll use df_zero_mask, which is another DataFrame containing Boolean values. The goal will be to create a Boolean series, feature_zero_mask, that identifies every row where all the elements starting from the second column (the features and response, but not the IDs) are 0. To do so, we first need to index df_zero_mask using the integer indexing (.iloc) method. In this method, we pass (:) to examine all rows and (1:) to examine all columns starting with the second one (index 1). Finally, we will apply the all() method along the column axis (axis=1), which will return True if and only if every column in that row is True. This is a lot to think about, but it's pretty simple to code, as will be observed in the following step.

  8. Create the Boolean series feature_zero_mask, as shown in the following:

    feature_zero_mask = df_zero_mask.iloc[:,1:].all(axis=1)

  9. Calculate the sum of the Boolean series using the following command:

    sum(feature_zero_mask)

    You should obtain the following output:

    Figure 1.33: The number of rows with all zeros except for the ID

    The preceding output tells us that 315 rows have zeros for every column but the first one. This is greater than the number of duplicate IDs (313), so if we delete all the "zero rows," we may get rid of the duplicate ID problem.

  10. Clean the DataFrame by eliminating the rows with all zeros, except for the ID, using the following code:

    df_clean_1 = df.loc[~feature_zero_mask,:].copy()

    While performing the cleaning operation in the preceding step, we return a new DataFrame called df_clean_1. Notice that here we've used the .copy() method after the .loc indexing operation to create a copy of this output, as opposed to a view on the original DataFrame. You can think of this as creating a new DataFrame, as opposed to referencing the original one. Within the .loc method, we used the logical not operator, ~, to select all the rows that don't have zeros for all the features and response, and : to select all columns. These are the valid data we wish to keep. After doing this, we now want to know if the number of remaining rows is equal to the number of unique IDs.

  11. Verify the number of rows and columns in df_clean_1 by running the following code:

    df_clean_1.shape

    You will obtain the following output:

    Figure 1.34: Dimensions of the cleaned DataFrame

  12. Obtain the number of unique IDs by running the following code:

    df_clean_1['ID'].nunique()

    Figure 1.35: Number of unique IDs in the cleaned DataFrame

From the preceding output, we can see that we have successfully eliminated duplicates, as the number of unique IDs is equal to the number of rows. Now take a breath and pat yourself on the back. That was a whirlwind introduction to quite a few pandas techniques for indexing and characterizing data. Now that we've filtered out the duplicate IDs, we're in a position to start looking at the actual data itself: the features, and eventually, the response. We'll walk you through this process.
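The whole sequence from this exercise can be sketched end to end on a tiny made-up DataFrame. The IDs and values below are hypothetical (integer IDs are used for simplicity), constructed so that two IDs each have one valid row and one all-zero row.

```python
import pandas as pd

# Hypothetical data: IDs 102 and 103 each have one valid row and one all-zero row
toy = pd.DataFrame({
    'ID':    [101, 102, 102, 103, 103, 104],
    'LIMIT': [100, 200, 0, 0, 300, 400],
    'AGE':   [30, 40, 0, 0, 50, 60],
})

id_counts = toy['ID'].value_counts()
dupe_ids = list(id_counts.index[id_counts == 2])

# Rows belonging to the duplicated IDs
dupe_rows = toy.loc[toy['ID'].isin(dupe_ids), :]

# True where every column after the first (the ID) equals zero
feature_zero_mask = (toy == 0).iloc[:, 1:].all(axis=1)

# Keep the remaining rows; .copy() makes the result independent of toy
toy_clean = toy.loc[~feature_zero_mask, :].copy()

print(len(dupe_rows), feature_zero_mask.sum())
print(toy_clean.shape, toy_clean['ID'].nunique())
```

After cleaning, the number of rows equals the number of unique IDs, which is the same check used at the end of the exercise.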

Exercise 5: Exploring and Cleaning the Data

Thus far, we have identified a data quality issue related to the metadata: we had been told that every sample from our dataset corresponded to a unique account ID, but found that this was not the case. We were able to use logical indexing and pandas to correct this issue. This was a fundamental data quality issue, having to do simply with what samples were present, based on the metadata. Aside from this, we are not really interested in the metadata column of account IDs: for the time being these will not help us develop a predictive model for credit default.

Now, we are ready to start examining the values of the features and response, the data we will use to develop our predictive model. Perform the following steps to complete this exercise:

Note

The code and the resulting output for this exercise have been loaded in a Jupyter Notebook that can be found here: http://bit.ly/2W9cwPH.

  • -2 means the account started that month with a zero balance, and never used any credit

  • -1 means the account had a balance that was paid in full

  • 0 means that at least the minimum payment was made, but the entire balance wasn't paid (that is, a positive balance was carried to the next month)

We thank our business partner since this answers our questions, for now. Maintaining a good line of communication and working relationship with the business partner is important, as you can see here, and may determine the success or failure of a project.

  1. Obtain the data type of the columns in the data by using the .info() method as shown:

    df_clean_1.info()

    You should see the following output:

    Figure 1.36: Getting column metadata

    We can see in Figure 1.36 that there are 25 columns. Each column has 29,685 non-null values, according to this summary, which is the number of rows in the DataFrame. This would indicate that there is no missing data, in the sense that each cell contains some value. However, if there is a fill value to represent missing data, that would not be evident here.

    We also see that most columns say int64 next to them, indicating they are an integer data type, that is, numbers such as ..., -2, -1, 0, 1, 2,... . The exceptions are ID and PAY_1. We are already familiar with ID; this contains strings, which are account IDs. What about PAY_1? According to the values in the data dictionary, we'd expect this to contain integers, like all the other features. Let's take a closer look at this column.

  2. Use the .head(n) pandas method to view the top n rows of the PAY_1 series:

    df_clean_1['PAY_1'].head(5)

    You should obtain the following output:

    Figure 1.37: Examine a few columns' contents

    The integers on the left of the output are the index, which are simply consecutive integers starting with 0. The data from the PAY_1 column is shown on the right. This is supposed to be the payment status of the most recent month's bill, using values –1, 1, 2, 3, and so on. However, we can see that there are values of 0 here, which are not documented in the data dictionary. According to the data dictionary, "The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above" (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). Let's take a closer look, using the value counts of this column.

  3. Obtain the value counts for the PAY_1 column by using .value_counts() method:

    df_clean_1['PAY_1'].value_counts()

    You should see the following output:

    Figure 1.38: Value counts of the PAY_1 column

    The preceding output reveals the presence of two undocumented values: 0 and –2, as well as the reason this column was imported by pandas as an object data type, instead of int64 as we would expect for integer data. There is a 'Not available' string present in this column, symbolizing missing data. Later on in the book, we'll come back to this when we consider how to deal with missing data. For now, we'll remove the rows of the dataset for which this feature has a missing value.

  4. Use a logical mask with the != operator (which means "does not equal" in Python) to find all the rows that don't have missing data for the PAY_1 feature:

    valid_pay_1_mask = df_clean_1['PAY_1'] != 'Not available'
    valid_pay_1_mask[0:5]

    By running the preceding code, you will obtain the following output:

    Figure 1.39: Creating a Boolean mask

  5. Check how many rows have no missing data by calculating the sum of the mask:

    sum(valid_pay_1_mask)

    You will obtain the following output:

    Figure 1.40: Sum of the Boolean mask for non-missing data

    We see that 26,664 rows do not have the value 'Not available' in the PAY_1 column. We saw from the value count that 3,021 rows do have this value, and 29,685 – 3,021 = 26,664, so this checks out.

  6. Clean the data by eliminating the rows with the missing values of PAY_1 as shown:

    df_clean_2 = df_clean_1.loc[valid_pay_1_mask,:].copy()

  7. Obtain the shape of the cleaned data using the following command:

    df_clean_2.shape

    You will obtain the following output:

    Figure 1.41: Shape of the cleaned data

    After removing these rows, we check that the resulting DataFrame has the expected shape. You can also check for yourself whether the value counts indicate the desired values have been removed like this: df_clean_2['PAY_1'].value_counts().

    Lastly, so that this column's data type is consistent with the others, we will cast it from the generic object type to int64 like all the other features, using the .astype method. Then we select a couple of columns, including PAY_1, to examine the data types and make sure it worked.

  8. Run the following command to convert the data type for PAY_1 from object to int64 and show the column metadata for PAY_1 and PAY_2:

    df_clean_2['PAY_1'] = df_clean_2['PAY_1'].astype('int64')
    df_clean_2[['PAY_1', 'PAY_2']].info()

    Figure 1.42: Check the data type of the cleaned column

    Congratulations, you have completed your second data cleaning operation! However, if you recall, during this process we also noticed the undocumented values of –2 and 0 in PAY_1. Now, let's imagine we got back in touch with our business partner and learned what these values mean: the definitions of –2, –1, and 0 are the ones listed in the bullet points at the start of this exercise.
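The cleaning steps in this exercise can be sketched on a small made-up DataFrame. The values are hypothetical, with a 'Not available' string standing in for missing data as in the case study. The final pd.to_numeric call is an alternative that the exercise does not use; it is shown only for comparison.

```python
import pandas as pd

# Hypothetical column mixing integers with a missing-data marker string
toy = pd.DataFrame({
    'PAY_1': [0, -1, 'Not available', 2, -2],
    'PAY_2': [0, -1, 0, 2, -2],
})

print(toy['PAY_1'].dtype)   # not an integer type, because of the string

# Keep only rows where PAY_1 is not the missing-data marker
valid_mask = toy['PAY_1'] != 'Not available'
toy_clean = toy.loc[valid_mask, :].copy()

# With the string gone, the column can be cast to integers
toy_clean['PAY_1'] = toy_clean['PAY_1'].astype('int64')
print(toy_clean['PAY_1'].dtype)

# Alternative (not from the exercise): keep every row, marking bad values as NaN
coerced = pd.to_numeric(toy['PAY_1'], errors='coerce')
print(coerced.isnull().sum())
```

Dropping rows is the simpler choice when few rows are affected. Coercing to NaN keeps the rows available for a later imputation step, at the cost of a floating-point column.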


Key benefits

  • Tackle data science problems by identifying the problem to be solved
  • Illustrate patterns in data using appropriate visualizations
  • Implement suitable machine learning algorithms to gain insights from data

Description

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools, by applying them to realistic data problems. You will learn how to use pandas and Matplotlib to critically examine datasets with summary statistics and graphs, and extract the insights you seek to derive. You will build your knowledge as you prepare data using the scikit-learn package and feed it to machine learning algorithms such as regularized logistic regression and random forest. You’ll discover how to tune algorithms to provide the most accurate predictions on new and unseen data. As you progress, you’ll gain insights into the working and output of these algorithms, building your understanding of both the predictive capabilities of the models and why they make these predictions. By the end of this book, you will have the necessary skills to confidently use machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.

Who is this book for?

If you are a data analyst, data scientist, or business analyst who wants to get started using Python and machine learning techniques to analyze data and predict outcomes, this book is for you. Basic knowledge of Python and data analytics will help you get the most from this book. Familiarity with mathematical concepts such as algebra and basic statistics will also be useful.

What you will learn

  • Install the required packages to set up a data science coding environment
  • Load data into a Jupyter notebook running Python
  • Use Matplotlib to create data visualizations
  • Fit machine learning models using scikit-learn
  • Use lasso and ridge regression to regularize your models
  • Compare performance between models to find the best outcomes
  • Use k-fold cross-validation to select model hyperparameters

Product Details

Publication date : Apr 30, 2019
Length: 374 pages
Edition : 1st
Language : English
ISBN-13 : 9781838552602


Table of Contents

6 Chapters
Data Exploration and Cleaning
Introduction to Scikit-Learn and Model Evaluation
Details of Logistic Regression and Feature Exploration
The Bias-Variance Trade-off
Decision Trees and Random Forests
Imputation of Missing Data, Financial Analysis, and Delivery to Client

Customer reviews

Rating distribution
4.3 (17 Ratings)
5 star 64.7%
4 star 17.6%
3 star 5.9%
2 star 5.9%
1 star 5.9%

Top Reviews

Honest Reviewer Jul 08, 2020
5 out of 5 stars
The book is very well written and author did a good job explaining every line of codes and concepts. Worth every penny! Thank you!
Amazon Verified review Amazon
Monsoon Feb 03, 2020
5 out of 5 stars
I liked this book better because it broke through some other books' lectures and abstracts and dove into the kind of data and scenarios that I am more likely to actually encounter in my job, rather than just memorize them. Plus I didn't have to fix or workaround outdated or outversioned python code as I have had to do with some online teaching forums. This book will move your career or business forward.
Amazon Verified review Amazon
Jonas Jun 24, 2019
5 out of 5 stars
This book teaches you the best practices of data science and machine learning based on real world case studies. I found this highly valuable because you are able to actually work on real data sets. This is also a quick way to learn industry recognized tools and mathematical concepts that are actually being used by data scientist. Another advantage of this book in my opinion is the author's approach for coding. Author writes and explains each code and outcome separately rather than giving you several paragraphs of code and explain them all at once. I strongly recommend this book if you want to learn data science and machine learning on a practical level applying code and assessing the outcome
Amazon Verified review Amazon
Richard Aug 09, 2019
5 out of 5 stars
As someone who has managed multiple data science projects in academia and the business world, I found this book to be a much-needed introduction to practical data science in the real-world. Some books thoroughly cover the mathematical complexity of machine learning models while others focus on implementing the models through coding (e.g. Scikit-Learn, Tensor flow, etc.). However, it is rare to find a book that ties the math and coding together to provide a comprehensive take on the data science process, which includes much under-appreciated topics such as data munging, exploratory data analysis, model evaluation, etc. Nevertheless, the author also does not skip out on explanation of the mathematics of the machine learning models and treat them as “black-boxes,” which can be frustrating for many readers who need more depth. This book is ideal for individuals with some familiarity with Python and limited mathematical background. It does not include the latest, cutting-edge deep learning models. However, having a robust process of understanding the data and evaluating models is more critical to the success of a data science project than applying the latest, most sophisticated models coming out of academic research. In this regards, the author does an excellent job of walking through its reader step-by-step in building a robust pipeline process using real-world data science projects as examples. Chapter 6: Imputation of Missing Data, Financial Analysis, and Delivery to Client offers a good overview of the most important step in data science in the business world. You would be hard-pressed to find information in this chapter anywhere else. For experienced data scientists, this book may be too introductory, but it can serve as a textbook or a training manual for your team if you lead a team with entry-level data scientists/analysts who recently graduated from school and still need help applying what they learned from school in the real-world. My only suggestion to the author would have been to include more materials on the next steps and provide a brief survey of the latest models in data science and resources to learn about them. All in all, it is a great book for new entrants or those hoping to join the field. It also seems ideal as a textbook for short 6-8 week data science courses.
Amazon Verified review
C. Bennett May 25, 2019
Rating: 5 out of 5 stars
As a professor at DePaul University who teaches data science and machine learning, I can say that this is a great book for introducing the fundamental concepts that lie behind using Python for data science projects. Readers will learn useful coding skills in Python and its various packages for data manipulation and visualization, such as pandas, NumPy, and Matplotlib. Furthermore, they will learn how to use scikit-learn, one of the major data science toolkits in Python, to construct machine learning models based on the same data. The book is well laid out, with each section building on the last, and reflects what actual data scientists do in the field day to day.

The book provides a great platform for anyone who is interested in learning practical "how-to" skills, and creates the foundation for those who want to move on to more advanced concepts.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, clicking on the link will download and open the PDF file directly. If you don't, save the PDF file on your machine and download Adobe Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect our rights as publishers and those of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment can be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats, such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower priced than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.