Congratulations on your installation! Let's now explore using Jupyter Notebooks, which is also known as IPython Notebook. These days, the more modern name is the Jupyter Notebook, but a lot of people still call it an IPython Notebook, and I consider the names interchangeable for working developers as a result. I do also find the name IPython Notebooks helps me remember the notebook file name suffix which is .ipynb as you'll get to know very well in this book!
Okay so now let's take it right from the top again - with our first exploration of the IPython/Jupyter Notebook. If you haven't yet done so, please navigate to the DataScience folder where we have downloaded all the materials for this book. For me, that's E:DataScience, and if you didn't do so during the preceding installation section, please now double-click and open up the Outliers.ipynb file.
Now what's going to happen when we double-click on this IPython .ipynb file is that first of all it's going to spark up Canopy, if it's not sparked up already, and then it's going to launch a web browser. This is how the full Outliers notebook webpage looks within my browser:
As you can see here, notebooks are structured in such a way that I can intersperse my little notes and commentary about what you're seeing here within the actual code itself, and you can actually run this code within your web browser! So, it's a very handy format for me to give you sort of a little reference that you can use later on in life to go and remind yourself how these algorithms work that we're going to talk about, and actually experiment with them and play with them yourself.
The way that the IPython/Jupyter Notebook files work is that they actually run from within your browser, like a webpage, but they're backed by the Python engine that you installed. So you should be seeing a screen similar to the one shown in the previous screenshot.
You'll notice as you scroll down the notebook in your browser, there are code blocks. They're easy to spot because they contain our actual code. Please find the code box for this code in the Outliers notebook, quite near the top:
%matplotlib inline
import numpy as np
incomes = np.random.normal(27000, 15000, 10000)
incomes = np.append(incomes, [1000000000])
import matplotlib.pyplot as plt
plt.hist(incomes, 50)
plt.show()
Let's take a quick look at this code while we're here. e are setting up a little income distribution in this code. We're simulating the distribution of income in a population of people, and to illustrate the effect that an outlier can have on that distribution, we're simulating Donald Trump entering the mix and messing up the mean value of the income distribution. By the way, I'm not making a political statement, this was all done before Trump became a politician. So you know, full disclosure there.
We can select any code block in the notebook by clicking on it. So if you now click in the code block that contains the code we just looked at above, we can then hit the run button at the top to run it. Here's the area at the top of the screen where you'll find the Run button:
Hitting the Run button with the code block selected, will cause this graph to be regenerated:
Similarly, we can click on the next code block a little further down, you'll spot the one which has the following single line of code :
incomes.mean()
If you select the code block containing this line, and hit the Run button to run the code, you'll see the output below it, which ends up being a very large value because of the effect of that outlier, something like this:
127148.50796177129
Let's keep going and have some fun. In the next code block down, you'll see the following code, which tries to detect outliers like Donald Trump and remove them from the dataset:
def reject_outliers(data):
u = np.median(data)
s = np.std(data)
filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
return filtered
filtered = reject_outliers(incomes)
plt.hist(filtered, 50)
plt.show()
So select the corresponding code block in the notebook, and press the run button again. When you do that, you'll see this graph instead:
Now we see a much better histogram that represents the more typical American - now that we've taken out our outlier that was messing things up.
So, at this point, you have everything you need to get started in this course. We have all the data you need, all the scripts, and the development environment for Python and Python notebooks. So, let's rock and roll. Up next we're going to do a little crash course on Python itself, and even if you're familiar with Python, it might be a good little refresher so you might want to watch it regardless. Let's dive in and learn Python.