Python Machine Learning Cookbook

Over 100 recipes to progress from smart data analytics to deep learning using real-world datasets

Product type Paperback
Published in Mar 2019
Publisher Packt
ISBN-13 9781789808452
Length 642 pages
Edition 2nd Edition
Authors (2):
Giuseppe Ciaburro
Prateek Joshi

Data scaling

The values of each feature in a dataset can span very different ranges. It is therefore often important to scale them so that all features are on a level playing field. Scaling also makes it possible to compare variables that follow different distributions or are expressed in different units.

Remember, it is good practice to rescale data before training a machine learning algorithm. With rescaling, data units are eliminated, allowing you to easily compare data from different locations.

Getting ready

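We'll use the min-max method, a common form of feature scaling, to bring all of the data into the range [0, 1]. For a feature value x whose column has minimum x_min and maximum x_max, the scaled value is given by the following formula:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$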

To scale features between a given minimum and maximum value (in our case, between 0 and 1), the preprocessing.MinMaxScaler class from scikit-learn can be used. Its feature_range parameter sets the target interval.
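As a minimal sketch of the min-max method, the rescaling can also be computed by hand with NumPy. The helper name min_max_scale and the values array are hypothetical and used only for illustration; they are not part of the recipe's dataset.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Rescale each column of x to [new_min, new_max] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    # Assumes no column is constant; a constant column would divide by zero.
    unit = (x - col_min) / (col_max - col_min)
    return unit * (new_max - new_min) + new_min

# Illustrative values only, not taken from the recipe.
values = np.array([[10.0, 200.0],
                   [20.0, 400.0],
                   [30.0, 800.0]])
scaled = min_max_scale(values)
print(scaled)
```

Each column is rescaled independently, so the smallest value in a column maps to 0 and the largest maps to 1.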

How to do it...

Let's see how to scale data in Python:

  1. Let's start by importing the preprocessing module and defining the data_scaler variable:
>> from sklearn import preprocessing
>> data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

  2. Now we will use the fit_transform() method, which fits the scaler to the data and then transforms the data (we will use the same data as in the previous recipe):
>> data_scaled = data_scaler.fit_transform(data)

A NumPy array with the same shape as the input is returned. To understand how this method has transformed the data, we display the minimum and maximum of each column in the array.

  3. First, we do this for the starting data:
>> print("Min:", data.min(axis=0))
>> print("Max:", data.max(axis=0))

The following results are returned:

Min: [ 0. -1.5 -1.9 -5.4]
Max: [3. 4. 2. 2.1]
  4. Now, let's do the same for the scaled data using the following code:
>> print("Min:", data_scaled.min(axis=0))
>> print("Max:", data_scaled.max(axis=0))

The following results are returned:

Min: [0. 0. 0. 0.]
Max: [1. 1. 1. 1.]

After scaling, all the feature values range between the specified values.

  5. To display the scaled array, we will use the following code:
>> print(data_scaled)

The output will be displayed as follows:

[[1.         0.         1.         0.        ]
 [0.         1.         0.41025641 1.        ]
 [0.33333333 0.87272727 0.         0.14666667]]

Now, all the data is included in the same interval.
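The steps above can be collected into one script, sketched here under one assumption: the recipe reuses the data array from the previous recipe, which is not shown in this section, so the values below were chosen to be consistent with the minima, maxima, and scaled output printed above. The final inverse_transform() call is an extra illustration, not one of the recipe's steps.

```python
import numpy as np
from sklearn import preprocessing

# Assumed input: consistent with the Min/Max and scaled values printed in this recipe.
data = np.array([[3.0, -1.5, 2.0, -5.4],
                 [0.0, 4.0, -0.3, 2.1],
                 [1.0, 3.3, -1.9, -4.3]])

# Fit the scaler on the data and rescale every column to [0, 1].
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)

print("Min:", data_scaled.min(axis=0))
print("Max:", data_scaled.max(axis=0))
print(data_scaled)

# inverse_transform() maps the scaled values back to the original units.
data_restored = data_scaler.inverse_transform(data_scaled)
```

Keeping the fitted data_scaler object around is what makes it possible to map scaled values back to the original units later.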

How it works...

When features have different ranges, a feature with a larger numeric range can have a greater impact on the response variable than a feature with a smaller range, which can affect prediction accuracy. Our goal is to improve predictive accuracy, so we want to prevent this from happening. Hence, we may need to scale the values of different features so that they fall within a similar range. Scaling also makes it possible to compare variables that follow different distributions or are expressed in different units.
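To illustrate why differing ranges matter, here is a small sketch with two hypothetical features, age in years and income in dollars. The values and the helper name dist are invented for illustration. Before scaling, the Euclidean distance between samples is dominated by income; after min-max scaling, both features contribute on a comparable footing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical samples: [age in years, income in dollars].
people = np.array([[25.0, 50_000.0],
                   [60.0, 51_000.0],
                   [26.0, 90_000.0]])

def dist(a, b):
    """Euclidean distance between two samples."""
    return float(np.linalg.norm(a - b))

# Raw distances from the first person: income differences swamp age differences.
raw_to_1 = dist(people[0], people[1])
raw_to_2 = dist(people[0], people[2])

# After scaling, a large age gap counts as much as a large income gap.
scaled = MinMaxScaler().fit_transform(people)
scaled_to_1 = dist(scaled[0], scaled[1])
scaled_to_2 = dist(scaled[0], scaled[2])

print("Raw distances:   ", raw_to_1, raw_to_2)
print("Scaled distances:", scaled_to_1, scaled_to_2)
```

In the raw data, the second person looks far closer to the first than the third does, purely because income is measured in larger numbers. After scaling, the two distances are nearly equal.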

There's more...

Feature scaling limits the excursion of a set of values to a predefined interval. It guarantees that all features have exactly the same scale, but it does not handle anomalous values well, because the extreme values become the endpoints of the new range. As a result, the typical values are compressed into a small part of the interval so that their distance to the anomalous values is preserved.
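The effect of an anomalous value can be sketched with a single hypothetical feature. The numbers below are invented for illustration: five typical values and one outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature: five typical values and one anomalous value.
feature = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

scaled = MinMaxScaler().fit_transform(feature)
print(scaled.ravel())

# The outlier becomes the top of the range, so the typical values
# are squeezed into a very narrow band near 0.
typical = scaled[:-1]
spread = float(typical.max() - typical.min())
print("Spread of typical values after scaling:", spread)
```

The five typical values end up occupying well under 1% of the [0, 1] interval, which is the compression described above.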

See also
