Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781837634743

Length 456 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Maria Zervou

Table of Contents (19) Chapters

Preface

1. Part 1: Upstream Data Ingestion and Cleaning

2. Chapter 1: Data Ingestion Techniques

3. Chapter 2: Importance of Data Quality

4. Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution

5. Chapter 4: Cleaning Messy Data and Data Manipulation

6. Chapter 5: Data Transformation – Merging and Concatenating

7. Chapter 6: Data Grouping, Aggregation, Filtering, and Applying Functions

8. Chapter 7: Data Sinks

9. Part 2: Downstream Data Cleaning – Consuming Structured Data

10. Chapter 8: Detecting and Handling Missing Values and Outliers

11. Chapter 9: Normalization and Standardization

12. Chapter 10: Handling Categorical Features

13. Chapter 11: Consuming Time Series Data

14. Part 3: Downstream Data Cleaning – Consuming Unstructured Data

15. Chapter 12: Text Preprocessing in the Era of LLMs

16. Chapter 13: Image and Audio Preprocessing with LLMs

17. Index

18. Other Books You May Enjoy

Min-max scaling

Min-max scaling, also known as normalization, rescales the values of a variable to a fixed range, typically 0 to 1. It is useful when you want all values of a variable to fall within the same range, which makes variables measured on different scales directly comparable. It is commonly used when the variable is not assumed to follow a normal distribution.

Let’s have a look at the formula for calculating min-max scaling:

$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

As you can see from the formula, min-max scaling is a linear transformation: it preserves the relative ordering of values and compresses them into the target range. One thing to note is that it is not a way to deal with outliers. Because the minimum and maximum are taken from the data itself, a single extreme value widens the denominator and squeezes all the other values into a narrow part of the range. So, it is good practice to deal with outliers first and then proceed to scaling the features.
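To make the formula and the outlier caveat concrete, here is a minimal sketch in plain Python. The function name `min_max_scale`, its `new_min`/`new_max` parameters, and the sample `ages` data are illustrative assumptions, not code from the book. Mapping a constant column to the lower bound is also just one possible convention for avoiding division by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # All values are identical, so the formula would divide by zero.
        # Mapping everything to new_min is a convention chosen for this sketch.
        return [new_min for _ in values]
    return [new_min + (v - x_min) / span * (new_max - new_min) for v in values]


# Hypothetical sample data: ages without extreme values.
ages = [20, 30, 40, 50, 60]
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# The same data with one extreme value appended.
ages_with_outlier = ages + [1000]
scaled_outlier = min_max_scale(ages_with_outlier)
print([round(s, 3) for s in scaled_outlier])  # [0.0, 0.01, 0.02, 0.031, 0.041, 1.0]
```

In the second call, the ordering of the values is unchanged, but the five ordinary ages end up below 0.05 because the single value of 1000 defines the maximum. This is the compression effect that motivates handling outliers before scaling.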

Scaling to a specific range is a suitable approach when the following...