You're reading from Principles of Data Science Understand, analyze, and predict data using Machine Learning concepts and tools

Product type Paperback

Published in Dec 2018

Publisher Packt

ISBN-13 9781789804546

Length 424 pages

Edition 2nd Edition

Authors (3):

Sunil Kakade

Sinan Ozdemir

Marco Tibaldeschi

Why Python?

We will use Python for a variety of reasons, listed as follows:

Python is an extremely simple language to read and write, even if you've never coded before, which will make the examples easy to follow now and easy to revisit after you have finished this book.
It is one of the most common languages, both in production and in academic settings (and one of the fastest growing, as a matter of fact).
The language's online community is vast and friendly. This means that a quick search for the solution to a problem should turn up many people who have faced and solved similar (if not exactly the same) situations.
Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.

The last point is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful, but also easy to pick up. By the end of the first few chapters, you will be very comfortable with these modules. Some of these modules include the following:

pandas
scikit-learn
seaborn
numpy/scipy
requests (to mine data from the web)
BeautifulSoup (for web/HTML parsing)

Python practices

Before we move on, it is important to formalize many of the requisite coding skills in Python.

In Python, we have variables that are placeholders for objects. We will focus on just a few types of basic objects at first, as shown in the following table:

Object Type	Example
`int` (an integer)	3, 6, 99, -34, 34, 11111111
`float` (a decimal)	3.14159, 2.71, -0.34567
`bool` (a Boolean, either `True` or `False`)	The statement "Sunday is a weekend" is `True`; the statement "Friday is a weekend" is `False`; the statement "pi is exactly the ratio of a circle's circumference to its diameter" is `True` (crazy, right?)
`str` (a string: text or words made up of characters)	"I love hamburgers" (by the way, who doesn't?); "Matt is awesome"; a tweet is a string
`list` (a collection of objects)	`[1, 5.4, True, "apple"]`
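
To make the table concrete, here is a minimal sketch that stores one value of each type in a variable and asks Python what type it is. The variable names are hypothetical and chosen only for illustration. Note that Python itself calls the Boolean type bool and the string type str.

```python
# One hypothetical variable for each basic object type in the table
count = 3                          # int
ratio = 3.14159                    # float
is_weekend = True                  # bool
phrase = "I love hamburgers"       # str
mixed = [1, 5.4, True, "apple"]    # list

# type(value).__name__ gives the name Python uses for each type
for value in (count, ratio, is_weekend, phrase, mixed):
    print(type(value).__name__, "->", value)
```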

We will also have to understand some basic comparison operators. For these operators, keep the Boolean datatype in mind: every comparison evaluates to either True or False. Let's take a look at the following operators:

Operators	Example
`==` (equal to)	Evaluates to `True` if both sides are equal; otherwise, it evaluates to `False`. For example: 3 + 4 == 7 (`True`); 3 - 2 == 7 (`False`)
`<` (less than)	3 < 5 (`True`); 5 < 3 (`False`)
`<=` (less than or equal to)	3 <= 3 (`True`); 5 <= 3 (`False`)
`>` (greater than)	3 > 5 (`False`); 5 > 3 (`True`)
`>=` (greater than or equal to)	3 >= 3 (`True`); 5 >= 7 (`False`)
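
The comparisons in the table can be checked directly in Python. The following is a small sketch rather than anything definitive; the dictionary name comparisons is my own and exists only to pair each expression with its result.

```python
# Each value is the Boolean result of evaluating the expression in its label
comparisons = {
    "3 + 4 == 7": 3 + 4 == 7,
    "3 - 2 == 7": 3 - 2 == 7,
    "3 < 5": 3 < 5,
    "5 <= 3": 5 <= 3,
    "5 > 3": 5 > 3,
    "5 >= 7": 5 >= 7,
}

for expression, result in comparisons.items():
    print(expression, "is", result)
```

Every result is a Boolean, which is what lets these operators drive if statements.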

When coding in Python, I will use a pound sign (#) to create a "comment," which will not be processed as code, but is merely there to communicate with the reader. Anything to the right of a # sign is a comment on the code being executed.

Example of basic Python

In Python, we use indentation (spaces or tabs) to show that lines of code belong to the line above them.

Note

A print statement indented directly under an if x + y == 15.3: line belongs to that line. This means that the print statement will be executed if, and only if, x + y equals 15.3.
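
The snippet that the note describes is not reproduced here, so the following is a minimal sketch of what such code could look like. The values of x and y are assumptions chosen so that the condition holds, and ran_indented_line is a hypothetical flag added only to make the behavior visible.

```python
x = 5.8   # assumed value, for illustration only
y = 9.5   # assumed value, for illustration only

ran_indented_line = False

if x + y == 15.3:              # the condition from the note
    ran_indented_line = True   # indented, so it belongs to the if line
    print(True)                # runs only when x + y equals 15.3

print("done")                  # not indented, so it runs either way
```

Both indented lines run only when the condition is True, while the final print runs regardless because it sits at the outer level.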

Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, a boolean, and a string (in that order):

my_list = [1, 5.7, True, "apples"]

len(my_list) == 4    # 4 objects in the list

my_list[0] == 1      # the first object

my_list[1] == 5.7    # the second object

In the preceding code, I used the len function to get the length of the list (which was 4). Also, note that Python is zero-indexed. Most programming languages start counting at zero instead of one. So if I want the first element, I call index 0, and if I want the 95th element, I call index 94.
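
Here is a short sketch of zero-indexing in practice. The numbers list is a hypothetical example that I built so that the 95th element is easy to recognize.

```python
my_list = [1, 5.7, True, "apples"]

first = my_list[0]                  # index 0 is the first element
last = my_list[len(my_list) - 1]    # the last index is always length - 1

# A hypothetical list holding the numbers 1 through 100
numbers = list(range(1, 101))
ninety_fifth = numbers[94]          # index 94 is the 95th element

print(first, last, ninety_fifth)
```

Because counting starts at zero, asking for my_list[4] in a four-element list raises an IndexError.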

Example – parsing a single tweet

Here is some more Python code. In this example, I will parse a tweet about stock prices (one of the important case studies in this book will be trying to predict market movements based on popular sentiment regarding stocks on social media):

tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ')  # list of words in tweet

for word in words_in_tweet:                 # for each word in list
    if "$" in word:                         # if word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)  # alert the user

I will point out a few things about this code snippet line by line, as follows:

First, we set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL".
The words_in_tweet variable tokenizes the tweet (separates it by word). If you were to print this variable, you would see the following:
```
['RT',
'@j_o_n_dnger:',
'$TWTR',
'now',
'top',
'holding',
'for',
'Andor,',
'unseating',
'$AAPL']
```
We iterate through this list of words with a for loop, which simply means that we go through the list one item at a time.
Here, we have another if statement. For each word in the tweet, we check whether it contains the $ character, which marks stock tickers (cashtags) on Twitter.
If the preceding if statement is True (that is, if the word contains a cashtag), we print the word to show it to the user.

The output of this code will be as follows:

THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL

We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this book, I will ensure that I am as explicit as possible about what I am doing in each line of code.
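
If we wanted to reuse this logic on more than one tweet, one option is to wrap it in a function. The following is only a sketch: the function name find_cashtags and the cashtags variable are hypothetical and do not appear in the original example.

```python
def find_cashtags(tweet):
    """Return the words in a tweet that contain the "$" character."""
    found = []
    for word in tweet.split(' '):   # tokenize the tweet by spaces
        if "$" in word:             # keep only words with a cashtag
            found.append(word)
    return found


tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
cashtags = find_cashtags(tweet)

for word in cashtags:
    print("THIS TWEET IS ABOUT", word)
```

Returning a list instead of printing inside the loop lets the caller decide what to do with the tickers, such as counting them or saving them for later analysis.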

Domain knowledge

As I mentioned earlier, domain knowledge means knowing the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. This book will attempt to show examples from several problem domains, including medicine, marketing, finance, and even UFO sightings!

Does this mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.

A big part of domain knowledge is presentation. Depending on your audience, how you present your findings can matter greatly. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.