Data Science Algorithms in a Week: Top 7 algorithms for scientific computing, data analysis, and machine learning

Product type: Paperback
Published: Oct 2018
Publisher: Packt
ISBN-13: 9781789806076
Length: 214 pages
Edition: 2nd Edition
Authors (2): David Toth, David Natingga
House ownership – data rescaling

For each person, we are given their age, yearly income, and whether or not they own a house:

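| Age | Annual income in USD | House ownership status |
|-----|----------------------|------------------------|
| 23  | 50,000               | Non-owner              |
| 37  | 34,000               | Non-owner              |
| 48  | 40,000               | Owner                  |
| 52  | 30,000               | Non-owner              |
| 28  | 95,000               | Owner                  |
| 25  | 78,000               | Non-owner              |
| 35  | 130,000              | Owner                  |
| 32  | 105,000              | Owner                  |
| 20  | 100,000              | Non-owner              |
| 40  | 60,000               | Owner                  |
| 50  | 80,000               | ? (Peter)              |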

House ownership and annual income

The aim is to predict whether Peter, aged 50, with an income of $80,000 per year, owns a house and could be a potential customer for our insurance company.

Analysis

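In this case, we could try to apply the 1-NN algorithm. However, we should be careful about how we measure the distances between the data points, since the income range is much wider than the age range. Incomes of USD 115,000 and USD 116,000 differ by USD 1,000, so on the raw scale the two corresponding data points would be very far apart, even though the difference is small relative to the incomes themselves. Because we consider both measures (age and yearly income) to be about equally important, we scale both to the range from 0 to 1 according to the following formula:

$$\text{ScaledQuantity} = \frac{\text{ActualQuantity} - \text{MinQuantity}}{\text{MaxQuantity} - \text{MinQuantity}}$$

In our particular case, where the ages range from 20 to 52 and the incomes from USD 30,000 to USD 130,000, this reduces to the following:

$$\text{ScaledAge} = \frac{\text{ActualAge} - 20}{52 - 20} = \frac{\text{ActualAge} - 20}{32}$$

$$\text{ScaledIncome} = \frac{\text{ActualIncome} - 30{,}000}{130{,}000 - 30{,}000} = \frac{\text{ActualIncome} - 30{,}000}{100{,}000}$$

After scaling, we get the following data: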

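| Age | Scaled age | Annual income in USD | Scaled annual income | House ownership status |
|-----|------------|----------------------|----------------------|------------------------|
| 23  | 0.09375    | 50,000               | 0.2                  | Non-owner              |
| 37  | 0.53125    | 34,000               | 0.04                 | Non-owner              |
| 48  | 0.875      | 40,000               | 0.1                  | Owner                  |
| 52  | 1          | 30,000               | 0                    | Non-owner              |
| 28  | 0.25       | 95,000               | 0.65                 | Owner                  |
| 25  | 0.15625    | 78,000               | 0.48                 | Non-owner              |
| 35  | 0.46875    | 130,000              | 1                    | Owner                  |
| 32  | 0.375      | 105,000              | 0.75                 | Owner                  |
| 20  | 0          | 100,000              | 0.7                  | Non-owner              |
| 40  | 0.625      | 60,000               | 0.3                  | Owner                  |
| 50  | 0.9375     | 80,000               | 0.5                  | ?                      |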
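
The scaled columns in the table above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `min_max_scale` is our own, not something from the book, and it assumes the minimum and maximum are taken over all listed values. Peter's age and income fall inside the range of the other ten people, so including him does not change the bounds (20 to 52 for age, 30,000 to 130,000 for income).

```python
# Ages and annual incomes (USD) copied from the table; Peter is the last entry.
ages = [23, 37, 48, 52, 28, 25, 35, 32, 20, 40, 50]
incomes = [50_000, 34_000, 40_000, 30_000, 95_000, 78_000,
           130_000, 105_000, 100_000, 60_000, 80_000]


def min_max_scale(values):
    """Linearly map values so that the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

# Print one row per person: age, scaled age, income, scaled income.
for age, s_age, income, s_income in zip(ages, scaled_ages, incomes, scaled_incomes):
    print(f"{age:>3}  {s_age:.5f}  {income:>7,}  {s_income:.2f}")
```

The last printed row corresponds to Peter, whose age of 50 maps to 0.9375 and whose income of 80,000 maps to 0.5.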

Now, if we apply the 1-NN algorithm with the Euclidean metric to the scaled data, Peter's nearest neighbor is the 40-year-old owner with an income of USD 60,000, so the algorithm classifies Peter as a house owner. Note that, without rescaling, the algorithm would yield a different result. Refer to Exercise 1.5 for more information.
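
To see the effect of rescaling on the classification, the following sketch runs 1-NN with the Euclidean metric twice, once on the raw data and once on the scaled data. The function names `rescale` and `nearest_label` are hypothetical helpers introduced for this illustration, and the scaling bounds (20 and 52 for age, 30,000 and 130,000 for income) are read off the table rather than computed.

```python
import math

# (age, annual income in USD, status) for the ten labelled people in the table.
people = [
    (23, 50_000, "Non-owner"), (37, 34_000, "Non-owner"),
    (48, 40_000, "Owner"), (52, 30_000, "Non-owner"),
    (28, 95_000, "Owner"), (25, 78_000, "Non-owner"),
    (35, 130_000, "Owner"), (32, 105_000, "Owner"),
    (20, 100_000, "Non-owner"), (40, 60_000, "Owner"),
]
peter = (50, 80_000)


def rescale(age, income):
    """Apply min-max scaling using the bounds found in the table."""
    return (age - 20) / 32, (income - 30_000) / 100_000


def nearest_label(points, query):
    """1-NN: return the label of the point closest to query (Euclidean metric)."""
    closest = min(
        points,
        key=lambda p: math.hypot(p[0] - query[0], p[1] - query[1]),
    )
    return closest[2]


scaled_people = [(*rescale(age, income), status) for age, income, status in people]

raw_prediction = nearest_label(people, peter)
scaled_prediction = nearest_label(scaled_people, rescale(*peter))

print("Without rescaling:", raw_prediction)   # Non-owner
print("With rescaling:", scaled_prediction)   # Owner
```

On the raw data the income difference dominates the distance, so the nearest neighbor is the 25-year-old non-owner earning 78,000, despite the 25-year age gap.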
