File I/O

First, we will learn about file I/O with NumPy. Data is usually stored in files. You would not get far if you are not able to read from and write to files.

Time for action – reading and writing files

As an example of file I/O, we will create an identity matrix and store its contents in a file.

Identity matrix creation

Creating an identity matrix: The identity matrix is a square matrix with ones on the main diagonal and zeros everywhere else.

The identity matrix can be created with the eye function. The only argument we need to give the eye function is the number of ones, which is also the number of rows and columns. So, for instance, for a 2-by-2 matrix, write the following code:

code1

The output is:

code2
Saving data: Save the data with the savetxt function. We obviously need to specify the name of the file that we want to save the data in and the array containing the data itself:
code3

A file called eye.txt should have been created. You can check for yourself whether the contents are as expected.
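
As a minimal sketch of the two steps above (not necessarily the exact listing from the text), the following creates the matrix, saves it, and reads it back. Writing into a temporary directory and reloading with loadtxt are assumptions added here so that the example is self-contained; the name i2 is only illustrative.

```python
import os
import tempfile

import numpy as np

# Build a 2-by-2 identity matrix.
i2 = np.eye(2)
print(i2)

# Save it as plain text; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "eye.txt")
np.savetxt(path, i2)

# Read the file back to check that the contents are as expected.
restored = np.loadtxt(path)
print(restored)
```

By default, savetxt writes one row per line with the values separated by spaces, so the file can also be inspected in any text editor.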

What just happened?

Reading and writing files is a necessary skill for data analysis. We wrote to a file with savetxt. We made an identity matrix with the eye function.

CSV files

Files in the comma separated values (CSV) format are encountered quite frequently. Often, the CSV file is just a dump from a database file. Usually, each field in the CSV file corresponds to a database table column. As we all know, spreadsheet programs, such as Excel, can produce CSV files as well.

Time for action – loading from CSV files

How do we deal with CSV files? Luckily, the loadtxt function can conveniently read CSV files, split up the fields, and load the data into NumPy arrays. In the following example, we will load historical price data for Apple (the company, not the fruit). The data is in CSV format. The first column contains a symbol that identifies the stock; in our case, it is AAPL. The second column is the date in dd-mm-yyyy format. The third column is empty. Then, in order, we have the open, high, low, and close price. Last, but not least, is the volume of the day. This is what a line looks like:

code4

For now, we are only interested in the close price and the volume. Store them in two arrays, c and v, as follows:
code5

As you can see, the data is stored in the data.csv file. We have set the delimiter to , (comma), since we are dealing with a comma separated value file. The usecols parameter is set through a tuple to get the seventh and eighth fields (zero-based indices 6 and 7), which correspond to the close price and volume. The unpack parameter is set to True, which means that the data will be unpacked and assigned to the c and v variables that will hold the close price and volume, respectively.
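
The call described above might look like the following sketch. To keep it runnable without data.csv, it reads from an in-memory buffer; the symbol XYZ and every number in the rows are made up for illustration and are not real quotes.

```python
import io

import numpy as np

# Made-up rows in the layout described above:
# symbol, date, empty field, open, high, low, close, volume
sample = io.StringIO(
    "XYZ,01-02-2011, ,10.0,11.0,9.5,10.5,1000\n"
    "XYZ,02-02-2011, ,10.5,12.0,10.0,11.5,3000\n"
)

# Columns 6 and 7 (zero-based) hold the close price and the volume.
# unpack=True returns one array per selected column.
c, v = np.loadtxt(sample, delimiter=",", usecols=(6, 7), unpack=True)

print("close  =", c)
print("volume =", v)
```

Because only the selected columns are converted to numbers, the text fields (symbol and date) and the empty third field do not get in the way.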

What just happened?

CSV files are a special type of file that we have to deal with frequently. We read a CSV file containing stock quotes with the loadtxt function. We indicated to the loadtxt function that the delimiter of our file was a comma. We specified which columns we were interested in, through the usecols argument, and set the unpack parameter to True so that the data was unpacked for further use.

Volume weighted average price

Volume weighted average price (VWAP) is a very important quantity in finance. It is the average price of an asset, weighted by traded volume. The higher the volume, the more significant a price move typically is. VWAP is calculated by using volume values as weights.

Time for action – calculating volume weighted average price

These are the actions that we will take:

Read the data into arrays.
Calculate VWAP:
code6

What just happened?

That wasn't very hard, was it? We just called the average function and set its weights parameter to use the v array for weights. By the way, NumPy also has a function to calculate the arithmetic mean.
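
To make the weighting concrete, here is a small sketch with made-up prices and volumes (not the book's data). It compares the result of average with the same quantity written out by hand as the sum of price times volume divided by the total volume.

```python
import numpy as np

# Made-up close prices and volumes, for illustration only.
c = np.array([10.0, 20.0])
v = np.array([1.0, 3.0])

# Weighted average of the prices, using the volumes as weights.
vwap = np.average(c, weights=v)

# The same quantity written out: sum(price * volume) / sum(volume).
manual = np.sum(c * v) / np.sum(v)

print("VWAP   =", vwap)
print("manual =", manual)
```

The second price carries three times the volume of the first, so the result (17.5) sits closer to 20 than to 10.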

The mean function

The mean function is quite friendly and not so mean. This function calculates the arithmetic mean of an array. Let's see it in action:

code7
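
For comparison, a minimal sketch of the mean function on made-up values (assumed here purely for illustration):

```python
import numpy as np

# Made-up close prices, for illustration only.
c = np.array([10.0, 20.0, 30.0])

# Arithmetic mean: the sum of the values divided by their count.
mean = np.mean(c)
print("mean =", mean)

# average without weights gives the same result.
unweighted = np.average(c)
print("average =", unweighted)
```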

Time weighted average price

Now that we are at it, let's compute the time weighted average price too. It is just a variation on a theme really. The idea is that recent price quotes are more important, so we should give recent prices higher weights. The easiest way is to use the arange function to create an array of increasing values, starting at zero and stopping just before the number of elements in the close price array. This is not necessarily the correct way. In fact, most of the examples concerning stock price analysis in this book are only illustrative. The following is the TWAP code:

code8

It produces this output:

code9

The TWAP is even higher than the mean.
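
The weighting scheme described above can be sketched as follows, again with made-up prices rather than the book's data. Note that with arange weights starting at zero, the oldest quote receives a weight of zero.

```python
import numpy as np

# Made-up close prices, oldest first, for illustration only.
c = np.array([10.0, 20.0, 30.0])

# Weights 0, 1, 2, ...: more recent quotes count more.
t = np.arange(len(c))

twap = np.average(c, weights=t)
mean = np.mean(c)

print("twap =", twap)
print("mean =", mean)
```

In this made-up rising series the TWAP lands above the plain mean because the later, higher prices get the larger weights; with falling prices the opposite would happen.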

Pop quiz – computing the weighted average

Which function returns the weighted average of an array?
  1. Reading from a file: First, we will need to read our file again and store the values for the high and low prices into arrays:
    code10
    
    The only thing that changed is the usecols parameter, since the high and low prices are situated in different columns.
  2. Getting the range: The following code gets the price range:
    code11
    
    These are the values returned:
    
    code12
    
    Now, it's trivial to get a midpoint, so it is left as an exercise for the reader to attempt.
  3. Calculating the spread: NumPy allows us to compute the spread of an array with a function called ptp. The ptp function returns the difference between the maximum and minimum values of an array. In other words, it is equal to max(array) - min(array). Call the ptp function:
    code13
    
    You will see this:
    
    code14
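
The three steps above can be sketched end to end as follows. The rows are made up and read from an in-memory buffer instead of data.csv; the column indices 4 and 5 are an assumption based on the column order given earlier (open, high, low, close after the empty third field).

```python
import io

import numpy as np

# Made-up rows: symbol, date, empty field, open, high, low, close, volume
sample = io.StringIO(
    "XYZ,01-02-2011, ,10.0,11.0,9.5,10.5,1000\n"
    "XYZ,02-02-2011, ,10.5,12.0,10.0,11.5,3000\n"
    "XYZ,03-02-2011, ,11.0,11.5,9.0,9.5,2000\n"
)

# Assumed zero-based positions of the high and low columns.
h, l = np.loadtxt(sample, delimiter=",", usecols=(4, 5), unpack=True)

# Highest high and lowest low over the period.
highest = np.max(h)
lowest = np.min(l)
print("highest =", highest)
print("lowest =", lowest)

# ptp ("peak to peak") is max minus min of a single array.
spread_high = np.ptp(h)
spread_low = np.ptp(l)
print("spread high price =", spread_high)
print("spread low price =", spread_low)
```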
    
  1. Determine the median of the close price: Create a new Python script and call it simplestats.py. You already know how to load the data from a CSV file into an array. So, copy that line of code and make sure that it only gets the close price. The code should appear like this, by now:
    code15
    
    The function that will do the magic for us is called median. We will call it and print the result immediately. Add the following line of code:
    
    code16
    
    The program prints the following output:
    
    code17
    
    Since it is our first time using the median function, we would like to check whether this is correct. Not because we are paranoid or anything! Obviously, we could do it by just going through the file and finding the correct value, but that is no fun. Instead we will just mimic the median algorithm by sorting the close price array and printing the middle value of the sorted array. The msort function does the first part for us. We will call the function, store the sorted array, and then print it:
    
    code18
    
    This prints the following output:
    
    code19
    
    Yup, it works! Let's now get the middle value of the sorted array:
    
    code20
    
    It gives us the following output:
    
    code21
    
    Hey, that's a different value than the one the median function gave us. How come? Upon further investigation we find that the median function return value doesn't even appear in our file. That's even stranger! Before filing bugs with the NumPy team, let's have a look at the documentation. This mystery is easy to solve. It turns out that our naive algorithm only works for arrays with odd lengths. For even-length arrays, the median is calculated from the average of the two array values in the middle. Therefore, type the following code:
    
    code22
    
    This prints the following output:
    
    code23
    
    Success!
    
    Another statistical measure that we are concerned with is variance. Variance tells us how much a variable varies. In our case, it also tells us how risky an investment is, since a stock price that varies too wildly is bound to get us into trouble.
  2. Calculate the variance of the close price: With NumPy, this is just a one liner. See the following code:
    code24
    
    This gives us the following output:
    
    code25
    
    Not that we don't trust NumPy or anything, but let's double-check using the definition of variance, as found in the documentation. Mind you, this definition might be different than the one in your statistics book, but that is quite common in the field of statistics. The variance is defined as the mean of the squared deviations from the mean; that is, the sum of the squared deviations divided by the number of elements in the array. Some books tell us to divide by the number of elements in the array minus one.
    
    code26
    
    The output is as follows:
    
    code27
    
    Just as we expected!
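
Both checks above can be reproduced in a few lines. This is a sketch on made-up close prices with an even number of elements, not the contents of simplestats.py. It uses np.sort in place of msort, because msort was deprecated and then removed in NumPy 2.0; for a one-dimensional array the two give the same result.

```python
import numpy as np

# Made-up close prices (even length), for illustration only.
c = np.array([10.5, 11.5, 9.0, 12.0])

# Median check: sort, then look at the middle of the sorted array.
median = np.median(c)
sorted_c = np.sort(c)
n = len(c)
naive_middle = sorted_c[(n - 1) // 2]  # only right for odd lengths
middle_average = (sorted_c[n // 2] + sorted_c[(n - 1) // 2]) / 2

print("median =", median)
print("naive middle =", naive_middle)
print("average of the two middle values =", middle_average)

# Variance check: mean of the squared deviations from the mean.
variance = np.var(c)
manual_variance = np.mean((c - c.mean()) ** 2)

# Dividing by n - 1 instead of n, as some books do.
sample_variance = np.var(c, ddof=1)

print("variance =", variance)
print("manual variance =", manual_variance)
print("variance with ddof=1 =", sample_variance)
```

For this even-length array the naive middle value (10.5) differs from the median (11.0), while averaging the two middle values reproduces it. The ddof=1 argument switches var to the n - 1 divisor.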
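  Answer choices for the pop quiz question above (which function returns the weighted average of an array?):
  - weighted average
  - waverage
  - average
  - avg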