Developing a Python package
In this section, we describe the process of developing a Python package that calculates the SAX representation of a subsequence. Apart from this being a good programming exercise, the package is going to be enriched in the chapters that follow when we create the iSAX index.
We will begin by explaining the basics of Python packages.
The basics of Python packages
I am not a Python expert, and the presented information is far from complete. However, it covers the required knowledge regarding Python packages.
Before Python 3.3, every Python package directory had to contain a file named __init__.py. Its purpose is to perform initialization actions and imports, as well as define variables. Since Python 3.3, a directory without that file can still be imported as a namespace package, but our packages will still have an __init__.py file in them. The good thing is that it is allowed to be empty if you have nothing to put into it. There is a link at the end of the chapter to the official Python documentation regarding packages, regular packages, and namespace packages, where the use of __init__.py is explained in more detail.
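To make the role of __init__.py more concrete, here is a minimal sketch that builds a throwaway package in a temporary directory and imports it. The names demo_pkg, helpers, GREETING, and double() are hypothetical and exist only for this illustration; they are not part of the sax package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Create a tiny package named demo_pkg (hypothetical) under root."""
    pkg = Path(root) / "demo_pkg"
    pkg.mkdir()
    # __init__.py runs when the package is first imported
    (pkg / "__init__.py").write_text("GREETING = 'hello from __init__'\n")
    # A regular module that lives inside the package
    (pkg / "helpers.py").write_text("def double(x):\n    return 2 * x\n")
    return pkg


with tempfile.TemporaryDirectory() as root:
    build_demo_package(root)
    sys.path.insert(0, root)          # make the package importable
    importlib.invalidate_caches()
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        helpers = importlib.import_module("demo_pkg.helpers")
        greeting = demo_pkg.GREETING  # defined in __init__.py
        doubled = helpers.double(21)  # defined in helpers.py
    finally:
        sys.path.remove(root)

print(greeting, doubled)
```

Anything defined in __init__.py becomes an attribute of the package itself, which is why GREETING is reachable as demo_pkg.GREETING without importing a submodule.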
The next subsection discusses the details of the Python package that we are going to develop.
The SAX Python package
The code of the sax Python package is included in a directory named sax. The contents of the sax directory are presented with the help of the tree(1) command, which you might need to install on your own:
$ tree sax
sax
├── SAXalphabet
├── __init__.py
├── __pycache__
│   ├── __init__.cpython-310.pyc
│   ├── sax.cpython-310.pyc
│   ├── tools.cpython-310.pyc
│   └── variables.cpython-310.pyc
├── sax.py
├── tools.py
└── variables.py

2 directories, 9 files
The __pycache__ directory is automatically generated by Python once you begin using the Python package and contains compiled Python bytecode. You can completely ignore that directory.
Let us begin by showing the contents of sax.py, which is going to be presented in multiple code chunks.
First, we have the import section and the implementation of the normalize() function, which normalizes a NumPy array:
import numpy as np
from scipy.stats import norm
from sax import tools
import sys
sys.path.insert(0,'..')

def normalize(x):
    eps = 1e-6
    mu = np.mean(x)
    std = np.std(x)
    # A (nearly) constant series has no spread to scale by
    if std < eps:
        return np.zeros(shape=x.shape)
    else:
        return (x-mu)/std
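As a quick check of what normalize() does, the following sketch repeats the function so that it runs on its own and applies it to two small arrays. The sample values are made up for illustration; they were chosen so that the mean is 5 and the standard deviation is 2.

```python
import numpy as np


def normalize(x):
    # Same logic as in sax.py: z-normalize, or return zeros
    # when the series is (almost) constant
    eps = 1e-6
    mu = np.mean(x)
    std = np.std(x)
    if std < eps:
        return np.zeros(shape=x.shape)
    return (x - mu) / std


# Mean is 5 and (population) standard deviation is 2
ts = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
ts_norm = normalize(ts)
print(ts_norm)                            # -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
print(np.mean(ts_norm), np.std(ts_norm))  # approximately 0 and 1

# A constant series would cause a division by zero, so zeros are returned
flat = normalize(np.array([3.0, 3.0, 3.0]))
print(flat)
```

The eps guard matters for flat subsequences, which would otherwise produce NaN values and break the steps that follow.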
After that, we have the implementation of the createPAA() function, which returns the SAX representation of a time series, given the cardinality and the segments:
def createPAA(ts, cardinality, segments):
    SAXword = ""
    ts_norm = normalize(ts)
    segment_size = len(ts_norm) // segments
    mValue = 0
    for i in range(segments):
        ts_segment = ts_norm[segment_size * i :(i+1) * segment_size]
        mValue = meanValue(ts_segment)
        index = getIndex(mValue, cardinality)
        # Assumption: "_" separates the per-segment values
        SAXword += str(index) + "_"
    return SAXword
Python uses the double slash (//) operator to perform floor division. The // operator divides the first number by the second number and rounds the result down to the nearest integer. Here, it is used to compute the segment_size variable.
The rest of the code is about specifying the correct index numbers when working with the given time series (or subsequence). Hence, the for loop is used to process the entire time series (or subsequence) based on the segments value.
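The index arithmetic of that loop can be sketched in isolation. The segment_bounds() helper below is hypothetical; it only reproduces the slice boundaries that createPAA() computes, without touching any data.

```python
def segment_bounds(length, segments):
    """Return the (start, end) slice bounds used by the loop in createPAA()."""
    segment_size = length // segments  # floor division
    return [(segment_size * i, (i + 1) * segment_size) for i in range(segments)]


print(7 // 2)  # 3, whereas 7 / 2 is 3.5

# The length is a multiple of the number of segments
bounds_even = segment_bounds(8, 4)
print(bounds_even)    # [(0, 2), (2, 4), (4, 6), (6, 8)]

# The length is not a multiple: segment_size is still 2
bounds_uneven = segment_bounds(10, 4)
print(bounds_uneven)  # [(0, 2), (2, 4), (4, 6), (6, 8)]
```

Note that in the second case, the points at positions 8 and 9 fall outside every slice, so it is simplest to choose a subsequence length that is a multiple of the number of segments.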
Next, we have the implementation of a function that computes the mean value of a NumPy array:
def meanValue(ts_segment):
    # total is used instead of sum to avoid shadowing the built-in
    total = 0
    for i in range(len(ts_segment)):
        total += ts_segment[i]
    mean_value = total / len(ts_segment)
    return mean_value
Finally, we have the function that returns the SAX value of a segment, given its mean value and the cardinality. Remember that we calculate the mean value of each segment separately in the createPAA() function:
def getIndex(mValue, cardinality):
    index = 0
    # A given cardinality has cardinality - 1 breakpoints
    bPoints = tools.breakpoints(cardinality-1)
    while mValue < float(bPoints[index]):
        if index == len(bPoints) - 1:
            # This means that index should be advanced
            # before breaking out of the while loop
            index += 1
            break
        else:
            index += 1

    digits = tools.power_of_two(cardinality)
    # Assumption: binary_index is the zero-padded binary form of index
    binary_index = np.binary_repr(index, digits)

    # Inverse the result
    inverse_s = ""
    for i in binary_index:
        if i == '0':
            inverse_s += '1'
        else:
            inverse_s += '0'
    return inverse_s
The previous code computes the SAX value of a segment using its mean value. It visits the breakpoints in the order that tools.breakpoints() returns them, which is the reverse of the order in which they are stored in the breakpoints file, and keeps advancing while the mean value is smaller than the current breakpoint. This way, we find the index of the mean value in the list of breakpoints.
Now, let us discuss a tricky point, which has to do with the last statements that invert the bits of the SAX value. This mainly has to do with whether we begin counting from the top or the bottom of the different areas that the breakpoints create. Both ways are equivalent; we chose this one because a previous implementation of SAX used that order, and we wanted to make sure that we created the same results for testing reasons. If you want to alter that functionality, you just have to remove the last for loop and return binary_index instead.
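The following self-contained sketch illustrates the search and the bit inversion without needing the sax package. The sax_value() helper is hypothetical, and the breakpoints are hardcoded for a cardinality of 4 under the assumption that they are the quartiles of the standard normal distribution, listed from the largest to the smallest.

```python
def sax_value(mean_value, breakpoints_desc, digits):
    """Map a segment mean to an inverted binary string (hypothetical helper)."""
    index = 0
    # Advance while the mean is below the current breakpoint
    while index < len(breakpoints_desc) and mean_value < breakpoints_desc[index]:
        index += 1
    # Zero-padded binary form of the index, for example 2 -> "10"
    binary_index = format(index, "0{}b".format(digits))
    # Flip every bit, as the last loop of getIndex() does
    return "".join("1" if bit == "0" else "0" for bit in binary_index)


# Assumed breakpoints for cardinality 4, largest first
bps = [0.67449, 0.0, -0.67449]

# One mean value from each of the four areas, from top to bottom
values = [sax_value(m, bps, 2) for m in (1.2, 0.3, -0.3, -1.2)]
print(values)  # ['11', '10', '01', '00']
```

With the inversion in place, the area above the largest breakpoint gets the highest value (11) and the area below the smallest breakpoint gets the lowest one (00). Without it, the numbering would simply run in the opposite direction.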
As you saw at the beginning of this section, the sax package is composed of three Python files, not just the one that we just presented. So, we will present the remaining two files.
First, we will present the contents of variables.py:
# This file includes all variables for the sax package
maximumCardinality = 32

# Where to find the breakpoints file
# In this case, in the current directory
breakpointsFile = "SAXalphabet"

# Sliding window size
slidingWindowSize = 16

# Segments
segments = 0

# Breakpoints in breakpointsFile
elements = ""

# Floating point precision
precision = 5
You might wonder what the main reason is for having such a file. The answer is that we need to have a place to keep our global parameters and options, and having a separate file for that is a perfect solution. This will make much more sense when the code becomes longer and more complex.
Second, we present the code in tools.py:
import os
import numpy as np
import sys
from sax import variables

breakpointsFile = variables.breakpointsFile
maxCard = variables.maximumCardinality
Here, we reference two variables from the variables.py file, which are variables.breakpointsFile and variables.maximumCardinality. The power_of_two() function comes next:
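def power_of_two(n):
    # Values below 2 are rejected up front, which also
    # avoids an endless loop when n is 0
    if n < 2:
        return -1
    power = 0
    while n > 1:
        # Not a power of 2
        if n % 2 == 1:
            return -1
        # Integer division keeps n an integer
        n = n // 2
        power += 1
    return power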
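This is a helper function that we use when we want to make sure that a value is a power of 2. It returns the exponent when that is the case and -1 otherwise. The remaining two functions load and return the breakpoints: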
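def load_sax_alphabet():
    path = os.path.dirname(__file__)
    # The with statement closes the file once it has been read
    with open(os.path.join(path, breakpointsFile)) as file_variable:
        variables.elements = file_variable.readlines()

def breakpoints(cardinality):
    # Load the breakpoints file only the first time it is needed
    if variables.elements == "":
        load_sax_alphabet()
    myLine = variables.elements[cardinality - 1].rstrip()
    # Assumption: the values on each line are separated by commas
    elements = myLine.split(',')
    elements.reverse()
    return elements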
The load_sax_alphabet() function loads the contents of the file with the definitions of breakpoints and assigns them to the variables.elements variable. The breakpoints() function returns the breakpoint values when given the cardinality.
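The contents of the SAXalphabet file are not shown here. As a rough sketch, and under the assumption that the file stores the usual SAX breakpoints (the values that split the standard normal distribution into areas of equal probability), similar values can be computed with the statistics module of the Python standard library. The gaussian_breakpoints() function is hypothetical and is not part of the sax package.

```python
from statistics import NormalDist


def gaussian_breakpoints(cardinality, precision=5):
    """Return the cardinality - 1 values that split N(0, 1) into equal areas."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = [std_normal.inv_cdf(i / cardinality) for i in range(1, cardinality)]
    return [round(p, precision) for p in points]


bp2 = gaussian_breakpoints(2)
bp4 = gaussian_breakpoints(4)
print(bp2)  # a single breakpoint at 0
print(bp4)  # approximately -0.67449, 0 and 0.67449
```

Reading precomputed values from a file, as breakpoints() does, avoids repeating this calculation and guarantees that every run uses exactly the same numbers.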
As you can see, the code of the entire package is relatively short, which is a good thing.
In this section, we developed a Python package to compute SAX representations. In the next section, we are going to begin working with the sax package.