Why Python?
We will use Python for a variety of reasons, listed as follows:
- Python is an extremely simple language to read and write, even if you've never coded before, which will make the examples easy to follow now and easy to revisit after you have finished this book.
- It is one of the most widely used languages in both production and academic settings, and one of the fastest growing.
- The language's online community is vast and friendly. This means that a quick search for the solution to a problem should turn up many people who have faced and solved similar (if not exactly the same) situations.
- Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.
The last point is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful, but also easy to pick up. By the end of the first few chapters, you will be very comfortable with these modules. Some of these modules include the following:
- pandas
- scikit-learn
- seaborn
- numpy/scipy
- requests (to mine data from the web)
- BeautifulSoup (for parsing HTML from the web)
Python practices
Before we move on, it is important to formalize many of the requisite coding skills in Python.
In Python, we have variables that are placeholders for objects. We will focus on just a few types of basic objects at first, as shown in the following table:
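| Object Type | Example |
|---|---|
| `int` (an integer) | 3, 6, 99, -34, 34, 11111111 |
| `float` (a decimal) | 3.14159, 2.71, -0.34567 |
| `boolean` (either True or False) | True, False |
| `string` (text made up of characters) | "I love hamburgers" (by the way, who doesn't?); "Matt is awesome"; a tweet is a string |
| `list` (a collection of objects) | [1, 5.7, True, "apples"] |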
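As a quick check on these object types, the following minimal sketch asks Python to report the type of one example of each. The variable names are made up for illustration and are not used elsewhere in the book.

```python
# One example of each basic object type (variable names are illustrative)
an_int = 34
a_float = 3.14159
a_boolean = True
a_string = "I love hamburgers"
a_list = [1, 5.7, True, "apples"]

for obj in (an_int, a_float, a_boolean, a_string, a_list):
    # type(obj).__name__ is the short name Python uses for the object's type
    print(repr(obj), "is of type", type(obj).__name__)
```

Note that Python itself abbreviates two of these names: it reports a boolean as `bool` and a string as `str`.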
We will also have to understand some basic logical operators. For these operators, keep the Boolean datatype in mind: every operator will evaluate to either `True` or `False`. Let's take a look at the following operators:
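| Operators | Example |
|---|---|
| `==` (equal to) | Evaluates to `True` if both sides are equal; otherwise, it evaluates to `False`. For example, `3 + 4 == 7` is `True` and `3 - 2 == 7` is `False`. |
| `<` (less than) | `3 < 5` is `True`; `5 < 3` is `False` |
| `<=` (less than or equal to) | `3 <= 3` is `True`; `5 <= 3` is `False` |
| `>` (greater than) | `5 > 3` is `True`; `3 > 5` is `False` |
| `>=` (greater than or equal to) | `3 >= 3` is `True`; `5 >= 7` is `False` |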
When coding in Python, I will use a pound sign (`#`) to create a "comment," which will not be processed as code, but is merely there to communicate with the reader. Anything to the right of a `#` sign is a comment on the code being executed.
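To see comparison operators and comments working together, here is a small sketch. The expressions and the `results` dictionary are chosen purely for illustration; each comparison evaluates to a Boolean, and everything to the right of a `#` is ignored by Python.

```python
# Each value in this dictionary is the result of a comparison,
# so each one is either True or False
results = {
    "3 + 4 == 7": 3 + 4 == 7,   # both sides are equal
    "3 - 2 == 7": 3 - 2 == 7,   # the sides differ
    "3 < 5": 3 < 5,             # less than
    "5 <= 3": 5 <= 3,           # less than or equal to
    "5 > 3": 5 > 3,             # greater than
    "3 >= 3": 3 >= 3,           # greater than or equal to
}

for expression, value in results.items():
    print(expression, "evaluates to", value)
```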
Example of basic Python
In Python, we use indentation (spaces or tabs) to denote that a line of code belongs to the line above it.
Note
The `print(True)` statement belongs to the `if x + y == 15.3:` line preceding it because it is indented directly under it. This means that the print statement will be executed if, and only if, `x + y` equals 15.3.
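A minimal sketch of the kind of snippet this note describes is shown here. The values assigned to `x` and `y` are assumptions, picked so that the condition holds; the final `print("done")` line is added only to contrast indented and unindented code.

```python
x = 5.8   # assumed value for illustration
y = 9.5   # assumed value for illustration

if x + y == 15.3:    # if this comparison evaluates to True...
    print(True)      # ...the indented line below it is executed

print("done")        # not indented, so it runs regardless of the if
```

For these particular values the sum compares equal to 15.3. Exact equality between decimal numbers does not always behave this way in Python, so a tolerance-based check such as `math.isclose` is generally the safer choice.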
Note that the following list variable, `my_list`, can hold multiple types of objects. This one has an `int`, a `float`, a `boolean`, and a `string` (in that order):
    my_list = [1, 5.7, True, "apples"]

    len(my_list) == 4   # 4 objects in the list
    my_list[0] == 1     # the first object
    my_list[1] == 5.7   # the second object
In the preceding code, I used the `len` function to get the length of the list (which was `4`). Also, note the zero-indexing of Python. Most computer languages start counting at zero instead of one. So if I want the first element, I call index `0`, and if I want the 95th element, I call index `94`.
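The zero-indexing rule can be checked directly with a short sketch. The `numbers` list is a hypothetical helper introduced only to make the 95th-element case concrete.

```python
my_list = [1, 5.7, True, "apples"]

print(len(my_list))   # the number of objects in the list
print(my_list[0])     # index 0 is the first object
print(my_list[3])     # index 3 is the fourth (and last) object

# A longer list: the integers 1 through 100, in order
numbers = list(range(1, 101))
print(numbers[94])    # index 94 is the 95th element
```

Because `numbers` holds 1 through 100 in order, the 95th element is the number 95, which sits at index 94.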
Example – parsing a single tweet
Here is some more Python code. In this example, I will be parsing some tweets about stock prices (one of the important case studies in this book will be trying to predict market movements based on popular sentiment regarding stocks on social media):
    tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"

    words_in_tweet = tweet.split(' ')   # list of words in tweet

    for word in words_in_tweet:                 # for each word in list
        if "$" in word:                         # if word has a "cashtag"
            print("THIS TWEET IS ABOUT", word)  # alert the user
I will point out a few things about this code snippet line by line, as follows:
- First, we set a variable to hold some text (known as a string in Python). In this example, the tweet in question is `"RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"`.
- The `words_in_tweet` variable tokenizes the tweet (separates it by word). If you were to print this variable, you would see the following: `['RT', '@j_o_n_dnger:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL']`
- We iterate through this list of words; this is called a `for` loop. It just means that we go through the list one item at a time.
- Here, we have another `if` statement. For each word in this tweet, we check whether the word contains the `$` character, which is how stock tickers are marked on Twitter (a cashtag).
- If the preceding `if` statement is `True` (that is, if the word contains a cashtag), we print the word and show it to the user.
The output of this code will be as follows:
    THIS TWEET IS ABOUT $TWTR
    THIS TWEET IS ABOUT $AAPL
We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this book, I will ensure that I am as explicit as possible about what I am doing in each line of code.
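If we wanted to reuse this logic on many tweets, one option is to wrap it in a function that returns the matching words instead of printing them. This is only a sketch: `find_cashtags` is a hypothetical helper name, not something defined elsewhere in this book.

```python
def find_cashtags(tweet):
    """Return the words in a tweet that contain a "$" character."""
    cashtags = []
    for word in tweet.split(' '):   # tokenize the tweet by spaces
        if "$" in word:             # keep only words with a cashtag
            cashtags.append(word)
    return cashtags


tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
print(find_cashtags(tweet))
```

Returning a list rather than printing makes the result easy to count, store, or pass along to a later step of an analysis.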
Domain knowledge
As I mentioned earlier, domain knowledge focuses mainly on having knowledge of the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. This book will attempt to show examples from several problem domains, including medicine, marketing, finance, and even UFO sightings!
Does this mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.
A big part of domain knowledge is presentation. Depending on your audience, how you present your findings can matter greatly. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.