(For more resources on this topic, see here.)
First, we will learn about file I/O with NumPy. Data is usually stored in files. You would not get far if you are not able to read from and write to files.
As an example of file I/O, we will create an identity matrix and store its contents in a file.
The identity matrix can be created with the eye function. The only argument we need to give the eye function is the number of ones. So, for instance, for a 2-by-2 matrix, write the following code:
code1
The output is:
code2
code3
A file called eye.txt should have been created. You can check for yourself whether the contents are as expected.
Reading and writing files is a necessary skill for data analysis. We wrote to a file with savetxt. We made an identity matrix with the eye function.
Files in the comma separated values (CSV) format are encountered quite frequently. Often, the CSV file is just a dump from a database file. Usually, each field in the CSV file corresponds to a database table column. As we all know, spreadsheet programs, such as Excel, can produce CSV files as well.
How do we deal with CSV files? Luckily, the loadtxt function can conveniently read CSV files, split up the fields and load the data into NumPy arrays. In the following example, we will load historical price data for Apple (the company, not the fruit). The data is in CSV format. The first column contains a symbol that identifies the stock. In our case, it is AAPL, next in our case. Nn is the date in dd-mm-yyyy format. The third column is empty. Then, in order, we have the open, high, low, and close price. Last, but not least, is the volume of the day. This is what a line looks like:
code4
code5
As you can see, data is stored in the data.csv file. We have set the delimiter to , (comma), since we are dealing with a comma separated value file. The usecols parameter is set through a tuple to get the seventh and eighth fields, which correspond to the close price and volume. Unpack is set to True, which means that data will be unpacked and assigned to the c and v variables that will hold the close price and volume, respectively.
CSV files are a special type of file that we have to deal with frequently. We read a CSV file containing stock quotes with the loadtxt function. We indicated to the loadtxt function that the delimiter of our file was a comma. We specified which columns we were interested in, through the usecols argument, and set the unpack parameter to True so that the data was unpacked for further use.
Volume weighted average price (VWAP) is a very important quantity. The higher the volume, the more significant a price move typically is. VWAP is calculated by using volume values as weights.
These are the actions that we will take:
code6
That wasn't very hard, was it? We just called the average function and set its weights parameter to use the v array for weights. By the way, NumPy also has a function to calculate the arithmetic mean.
The mean function is quite friendly and not so mean. This function calculates the arithmetic mean of an array. Let's see it in action:
code7
Now that we are at it, let's compute the time weighted average price too. It is just a variation on a theme really. The idea is that recent price quotes are more important, so we should give recent prices higher weights. The easiest way is to create an array with the arange function of increasing values from zero to the number of elements in the close price array. This is not necessarily the correct way. In fact, most of the examples concerning stock price analysis in this book are only illustrative. The following is the TWAP code:
code8
It produces this output:
code9
The TWAP is even higher than the mean.
code10
The only thing that changed is the usecols parameter, since the high and low prices are situated in different columns.
code11
These are the values returned:
code12
Now, it's trivial to get a midpoint, so it is left as an exercise for the reader to attempt.
code13
You will see this:
code14
code15
The function that will do the magic for us is called median. We will call it and print the result immediately. Add the following line of code:
The function that will do the magic for us is called median. We will call it and print the result immediately. Add the following line of code:
code16
The program prints the following output:
code17
Since it is our first time using the median function, we would like to check whether this is correct. Not because we are paranoid or anything! Obviously, we could do it by just going through the file and finding the correct value, but that is no fun. Instead we will just mimic the median algorithm by sorting the close price array and printing the middle value of the sorted array. The msort function does the first part for us. We will call the function, store the sorted array, and then print it:
code18
This prints the following output:
code19
Yup, it works! Let's now get the middle value of the sorted array:
code20
It gives us the following output:
code21
Hey, that's a different value than the one the median function gave us. How come? Upon further investigation we find that the median function return value doesn't even appear in our file. That's even stranger! Before filing bugs with the NumPy team, let's have a look at the documentation. This mystery is easy to solve. It turns out that our naive algorithm only works for arrays with odd lengths. For even-length arrays, the median is calculated from the average of the two array values in the middle. Therefore, type the following code:
code22
This prints the following output:
code23
Success!
Another statistical measure that we are concerned with is variance. Variance tells us how much a variable varies. In our case, it also tells us how risky an investment is, since a stock price that varies too wildly is bound to get us into trouble.
code24
This gives us the following output:
code25
Not that we don't trust NumPy or anything, but let's double-check using the definition of variance, as found in the documentation. Mind you, this definition might be different than the one in your statistics book, but that is quite common in the field of statistics. The variance is defined as the mean of the square of deviations from the mean, divided by the number of elements in the array. Some books tell us to divide by the number of elements in the array minus one.
code26
The output is as follows:
code27
Just as we expected!
Try doing the same calculation using the open price. Calculate the mean for the volume and the other prices.
Usually, we don't only want to know the average or arithmetic mean of a set of values, which are sort of in the middle; we also want the extremes, the full range—the highest and lowest values. The sample data that we are using here already has those values per day—the high and low price. However, we need to know the highest value of the high price and the lowest price value of the low price. After all, how else would we know how much our Apple stocks would gain or lose.
The min and max functions are the answer to our requirement.
We defined a range of highest to lowest values for the price. The highest value was given by applying the max function to the high price array. Similarly, the lowest value was found by calling the min function to the low price array. We also calculated the peak to peak distance with the ptp function.
Stock traders are interested in the most probable close price. Common sense says that this should be close to some kind of an average. The arithmetic mean and weighted average are ways to find the center of a distribution of values. However, both are not robust and sensitive to outliers. For instance, if we had a close price value of a million dollars, this would have influenced the outcome of our calculations.
One thing that we can do is use some kind of threshold to weed out outliers, but there is a better way. It is called the median, and it basically picks the middle value of a sorted set of values. For example, if we have the values of 1, 2, 3, 4 and 5. The median would be 3, since it is in the middle. These are the steps to calculate the median:
Maybe you noticed something new. We suddenly called the mean function on the c array. Yes, this is legal, because the ndarray object has a mean method. This is for your convenience. For now, just keep in mind that this is possible.