You're reading from Python Data Mining Quick Start Guide: A beginner's guide to extracting valuable insights from your data

Product type: Paperback

Published in: Apr 2019

Publisher: Packt

ISBN-13: 9781789800265

Length: 188 pages

Edition: 1st Edition

Languages: Python

Tools: Matplotlib

Concepts: Data Mining

Author (1): Nathan Greeneltch

The first three and a half chapters of the book are focused on the procedural nuts and bolts of a data mining project. This includes creating a data mining Python environment, loading data from a variety of sources, and munging the data for downstream analysis. The remaining content in the book is mostly conceptual, and delivered in a conversational style very close to how I would train a new hire at my company.

Chapter 1, Data Mining and Getting Started with Python Tools, covers setting up your software environment, including how to download and install high-speed Python and popular libraries such as pandas, scikit-learn, and seaborn. After reading this chapter and setting up your environment, you should be ready to follow along with the demonstrations throughout the rest of the book.

Chapter 2, Basic Terminology and our End-to-End Example, covers the basic statistics and data terminology required for working in data mining. The final portion of the chapter is dedicated to a full working example, which combines the types of techniques that are introduced later in the book. You will also gain a better understanding of the thought processes behind analysis and the common steps taken to address a problem statement that you may encounter in the field.

Chapter 3, Collecting, Exploring, and Visualizing Data, covers the basics of loading data from databases, disks, and web sources. It also covers basic SQL queries and pandas' access and search functions. The last sections of the chapter introduce the common types of plots using seaborn.
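As a rough illustration of the Chapter 3 workflow, the sketch below runs a basic SQL query and then filters the result with pandas. It is not taken from the book: the in-memory SQLite database, the `flowers` table, and its column names are hypothetical stand-ins chosen for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flowers (species TEXT, petal_length REAL)")
conn.executemany(
    "INSERT INTO flowers VALUES (?, ?)",
    [("setosa", 1.4), ("setosa", 1.3), ("virginica", 5.5), ("versicolor", 4.7)],
)
conn.commit()

# Basic SQL query: filter rows in the database and load them into a DataFrame.
df = pd.read_sql_query(
    "SELECT species, petal_length FROM flowers "
    "WHERE petal_length > 1.35 ORDER BY petal_length",
    conn,
)
conn.close()

# pandas access and search: boolean mask on rows, label-based column selection.
long_petals = df.loc[df["petal_length"] > 4.0, "species"]
print(df)
print(list(long_petals))
```

Filtering in SQL keeps the transfer small, while filtering in pandas is convenient for quick exploration once the data is already in memory.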

Chapter 4, Cleaning and Readying Data for Analysis, covers the basics of data cleanup and dimensionality reduction. After reading it, you will understand how to work with missing values, rescale input data, and handle categorical variables. You will also understand the troubles of high-dimensional data, and how to combat them with feature reduction techniques including filter, wrapper, and transformation methods.
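The three cleanup steps named for Chapter 4 can be sketched in a few lines. This is only one possible approach under assumed choices (median imputation, standard scaling, one-hot encoding), and the tiny `height`/`color` table is made up for illustration; feature reduction is not shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with one missing value and one categorical column.
df = pd.DataFrame(
    {
        "height": [1.0, 2.0, None, 4.0],
        "color": ["red", "blue", "red", "green"],
    }
)

# Missing values: fill with the column median (one of several strategies).
df["height"] = df["height"].fillna(df["height"].median())

# Rescaling: shift and scale to zero mean and unit variance.
df[["height"]] = StandardScaler().fit_transform(df[["height"]])

# Categorical variables: one-hot encode into indicator columns.
df = pd.get_dummies(df, columns=["color"])
print(df)
```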

Chapter 5, Grouping and Clustering Data, introduces the background and thought processes that go into designing a clustering algorithm for data mining work. It then introduces common clustering methods in the field and compares them on toy datasets. After reading this chapter, you will know the difference between algorithms that cluster based on means separation, density, and connectivity. You will also be able to look at a plot of incoming data and have some intuition on whether clustering will fit your mining project.
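To give a feel for the means-versus-density distinction, here is a minimal sketch that runs k-means and DBSCAN on a two-moons toy dataset and scores each against the known grouping. The dataset, the `eps` and `min_samples` values, and the use of the adjusted Rand index are assumptions made for this example, not the book's own comparison.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Toy dataset: two interleaved half-circles, which are not separable by means.
X, true_labels = make_moons(n_samples=200, noise=0.05, random_state=0)

# Means-based clustering: assigns each point to the nearest of two centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based clustering: grows clusters through densely packed neighbors.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
print(f"k-means ARI: {kmeans_ari:.2f}, DBSCAN ARI: {dbscan_ari:.2f}")
```

On curved shapes like these, a density-based method would be expected to follow each moon, while k-means tends to cut straight across them. Plotting the data first is what suggests which family of algorithm is worth trying.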

Chapter 6, Prediction with Regression and Classification, covers the basics behind using a computer to learn prediction models by introducing the loss function and gradient descent. It then introduces the concepts of overfitting, underfitting, and the penalty approach to regularize your model during fits. It also covers common regression and classification techniques, and the regularized versions of each of these where appropriate. The chapter finishes with a discussion of best practices for model tuning, including cross-validation and grid search.
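The ideas of a loss function, gradient descent, and a regularization penalty can be tied together in one small sketch. The code below assumes a linear model with no intercept, a mean squared error loss, and an L2 (ridge-style) penalty; the function name `fit_linear_gd`, the learning rate, and the synthetic data are all hypothetical choices for illustration.

```python
import numpy as np


def fit_linear_gd(X, y, penalty=0.0, lr=0.1, n_steps=500):
    """Minimize mean squared error + penalty * ||w||^2 by gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_steps):
        residual = X @ w - y
        # Gradient of the loss: data-fit term plus the penalty term.
        grad = 2.0 * X.T @ residual / n_samples + 2.0 * penalty * w
        w -= lr * grad  # step downhill
    return w


# Synthetic data generated from known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_plain = fit_linear_gd(X, y)               # no regularization
w_ridge = fit_linear_gd(X, y, penalty=1.0)  # penalty shrinks the weights
print("plain:", w_plain)
print("ridge:", w_ridge)
```

The penalized fit trades some accuracy on the training data for smaller weights, which is the basic mechanism behind using regularization to limit overfitting.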

Chapter 7, Advanced Topics – Building a Data Processing Pipeline and Deploying, covers a strategy for pipelining and deploying using built-in scikit-learn methods. It also introduces the pickle module for model persistence and storage, and discusses Python-specific concerns at deployment time.
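A minimal sketch of the Chapter 7 idea: chain preprocessing and a model into one scikit-learn `Pipeline`, then persist it with `pickle`. The choice of steps (a scaler followed by logistic regression) and the bundled iris dataset are assumptions for the example, not necessarily what the book uses.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# One object that applies scaling and then the classifier, in order.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=200)),
    ]
)
pipe.fit(X, y)

# Persist the fitted pipeline to bytes and restore it, as one would at deployment.
payload = pickle.dumps(pipe)
restored = pickle.loads(payload)

same = np.array_equal(pipe.predict(X), restored.predict(X))
print("restored pipeline gives identical predictions:", same)
```

Pickling the whole pipeline, rather than the model alone, keeps the preprocessing and the model in sync. Note that pickles should only be loaded from trusted sources and with library versions compatible with those used to create them.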