Principles of Data Science

You're reading from Principles of Data Science: Understand, analyze, and predict data using Machine Learning concepts and tools

Product type: Paperback
Published in: Dec 2018
Publisher: Packt
ISBN-13: 9781789804546
Length: 424 pages
Edition: 2nd Edition
Authors (3): Sunil Kakade, Sinan Ozdemir, Marco Tibaldeschi

Why Python?

We will use Python for a variety of reasons, listed as follows:

  • Python is an extremely simple language to read and write, even if you've never coded before, so the examples will be easy to follow now and easy to revisit after you have finished this book.
  • It is one of the most common languages, both in production and in the academic setting (one of the fastest growing, as a matter of fact).
  • The language's online community is vast and friendly. This means that a quick search for the solution to a problem should yield many people who have faced and solved similar (if not exactly the same) situations.
  • Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.

The last point is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful, but also easy to pick up. By the end of the first few chapters, you will be very comfortable with these modules. Some of these modules include the following:

  • pandas
  • scikit-learn
  • seaborn
  • numpy/scipy
  • requests (to mine data from the web)
  • BeautifulSoup (for parsing HTML from the web)

Python practices

Before we move on, it is important to formalize many of the requisite coding skills in Python.

In Python, we have variables that are placeholders for objects. We will focus on just a few types of basic objects at first, as shown in the following table:

Object Type | Example
int (an integer) | 3, 6, 99, -34, 34, 11111111
float (a decimal) | 3.14159, 2.71, -0.34567
boolean (either True or False) | The statement "Sunday is a weekend" is True; the statement "Friday is a weekend" is False; the statement "pi is exactly the ratio of a circle's circumference to its diameter" is True (crazy, right?)
string (text or words made up of characters) | "I love hamburgers" (by the way, who doesn't?); "Matt is awesome"; a tweet is a string
list (a collection of objects) | [1, 5.4, True, "apple"]
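
The object types in the table can be checked directly in the interpreter. The following is a minimal sketch using Python's built-in type function; the variable names are assumptions made for illustration and do not come from the text.

```python
# Illustrative values; the variable names are assumptions for this sketch.
an_int = 99
a_float = 3.14159
a_boolean = True
a_string = "I love hamburgers"
a_list = [1, 5.4, True, "apple"]

# type() reports which kind of object each variable holds
for value in (an_int, a_float, a_boolean, a_string, a_list):
    print(repr(value), "is of type", type(value).__name__)
```

Note that Python itself abbreviates two of these names: a boolean is reported as bool and a string as str.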

We will also have to understand some basic comparison operators. For these operators, keep the Boolean datatype in mind. Every expression built with one of these operators will evaluate to either True or False. Let's take a look at the following operators:

Operator | Example
== (equal to) | Evaluates to True if both sides are equal; otherwise, it evaluates to False: 3 + 4 == 7 (True); 3 - 2 == 7 (False)
< (less than) | 3 < 5 (True); 5 < 3 (False)
<= (less than or equal to) | 3 <= 3 (True); 5 <= 3 (False)
> (greater than) | 3 > 5 (False); 5 > 3 (True)
>= (greater than or equal to) | 3 >= 3 (True); 5 >= 7 (False)
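
Because each comparison evaluates to a Boolean, its result can be stored in a variable like any other object. The following is a small sketch of that idea; the variable names are hypothetical and chosen only for readability.

```python
# Each comparison evaluates to a Boolean that can be stored in a variable.
# The variable names are assumptions for this sketch.
is_equal = (3 + 4 == 7)      # True
is_less = (5 < 3)            # False
is_at_most = (3 <= 3)        # True
is_greater = (5 > 3)         # True
is_at_least = (5 >= 7)       # False

print(is_equal, is_less, is_at_most, is_greater, is_at_least)
print(type(is_equal).__name__)   # the result of a comparison is a bool
```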

When coding in Python, I will use a pound sign (#) to create a "comment," which will not be processed as code, but is merely there to communicate with the reader. Anything to the right of a # sign is a comment on the code being executed.

Example of basic Python

In Python, we use indentation (spaces or tabs) to show that a line of code belongs to the line above it.

Note

The print True statement belongs to the if x + y == 15.3: line preceding it because it is tabbed right under it. This means that the print statement will be executed if, and only if, x + y equals 15.3.
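
The snippet that the note describes is not reproduced in this text. The following is a minimal sketch of what such code might look like, assuming x = 5.8 and y = 9.5 (values picked for illustration, not taken from the text) and using the print() function form. The ran_if_block variable is a hypothetical helper added only to record whether the indented lines ran.

```python
# Assumed values for illustration; they are not given in the surrounding text.
x = 5.8
y = 9.5

ran_if_block = False      # hypothetical helper that records whether the block ran

if x + y == 15.3:         # if the statement is true:
    print("True!")        # indented, so it belongs to the if line
    ran_if_block = True   # also indented, so it runs only with the print

print("done")             # not indented, so it runs whether or not the if was true
```

Comparing decimals with == happens to hold for these particular values. In general, floating-point rounding can make such a comparison fail (for example, 0.1 + 0.2 == 0.3 evaluates to False), so treat this only as an illustration of indentation.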

Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, a boolean, and a string (in that order):

my_list = [1, 5.7, True, "apples"]

len(my_list) == 4    # 4 objects in the list

my_list[0] == 1      # the first object

my_list[1] == 5.7    # the second object

In the preceding code, I used the len command to get the length of the list (which was 4). Also, note the zero-indexing of Python. Most computer languages start counting at zero instead of one. So if I want the first element, I call index 0, and if I want the 95th element, I call index 94.
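
Zero-indexing can be sketched as follows. The numbers list is an assumption made for illustration: it holds the values 1 through 100, so the 95th element is easy to recognize.

```python
my_list = [1, 5.7, True, "apples"]

first = my_list[0]                  # index 0 is the first element
last = my_list[len(my_list) - 1]    # the last element sits at index length - 1
also_last = my_list[-1]             # a negative index counts from the end

# An illustrative list of 100 numbers: the 95th element is at index 94
numbers = list(range(1, 101))       # 1, 2, ..., 100
ninety_fifth = numbers[94]

print(first, last, also_last, ninety_fifth)
```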

Example – parsing a single tweet

Here is some more Python code. In this example, I will be parsing a tweet about stock prices (one of the important case studies in this book will be trying to predict market movements based on popular sentiment regarding stocks on social media):

tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ')   # list of words in tweet

for word in words_in_tweet:                   # for each word in list
    if "$" in word:                           # if word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)    # alert the user

I will point out a few things about this code snippet line by line, as follows:

  • First, we set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL".
  • The words_in_tweet variable tokenizes the tweet (separates it by word). If you were to print this variable, you would see the following:
    ['RT',
    '@j_o_n_dnger:',
    '$TWTR',
    'now',
    'top',
    'holding',
    'for',
    'Andor,',
    'unseating',
    '$AAPL']
  • We iterate through this list of words; this is called a for loop. It just means that we go through the list one item at a time.
  • Next, we have an if statement. For each word in this tweet, we check whether the word contains the $ character, which marks stock tickers (cashtags) on Twitter.
  • If the preceding if statement is True (that is, if the word contains a cashtag), we print the word and show it to the user.

The output of this code will be as follows:

THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL

We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this book, I will ensure that I am as explicit as possible about what I am doing in each line of code.
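
The same logic can be wrapped in a reusable function that returns the cashtags instead of printing them. This is a sketch, not code from the book: the function name find_cashtags is hypothetical, and it follows the same split-and-check approach as the loop above.

```python
def find_cashtags(tweet):
    """Return the words in a tweet that contain a "$" character.

    find_cashtags is a hypothetical helper name used only for this sketch.
    """
    cashtags = []
    for word in tweet.split(' '):   # go through the words one by one
        if "$" in word:             # keep only words with a cashtag
            cashtags.append(word)
    return cashtags


tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
print(find_cashtags(tweet))
```

Returning a list rather than printing makes the result easy to reuse, for example to count how often each ticker is mentioned across many tweets.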

Domain knowledge

As I mentioned earlier, domain knowledge means knowing the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. This book will attempt to show examples from several problem domains, including medicine, marketing, finance, and even UFO sightings!

Does this mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.

A big part of domain knowledge is presentation. How you present your findings can matter greatly, depending on your audience. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.
