Python tools for data science
Until now, we've been using the term data mining when referring to the problems and techniques that we're going to apply throughout this book. The title of this section, in fact, mentions the term data science. The use of this term has exploded in recent years, especially in business environments, while many academics and journalists have criticized its use as a buzzword. Meanwhile, academic institutions have started offering courses on data science, and many books and articles have been published on the subject. Rather than taking a strong position on where we should draw the border between different disciplines, we limit ourselves to observing that, nowadays, there is a general interest in multiple fields, including data science, data mining, data analysis, statistics, machine learning, artificial intelligence, data visualization, and more. The topics we're discussing are interdisciplinary by nature, and they all borrow from each other from time to time. This is certainly an amazing time to be working in any of these fields, with a lot of interest from the public and a constant buzz around new advances and interesting projects.
The purpose of this section is to introduce Python as a tool for data science, and to describe part of the Python ecosystem that we're going to use in the next chapters.
Python is one of the most interesting languages for data analytics projects. The following are some of the reasons that make it fit for purpose:
- Declarative and intuitive syntax
- Rich ecosystem for data processing
- Efficiency
Python has a shallow learning curve due to its elegant syntax. Being a dynamic and interpreted language, it facilitates rapid development and interactive exploration. The ecosystem for data processing is partially described in the following sections, which will introduce the main packages we'll use in this book.
In terms of efficiency, interpreted and high-level languages are not famous for being furiously fast. Tools such as NumPy achieve efficiency by hooking into low-level libraries under the hood and exposing a friendly Python interface. Moreover, many projects employ Cython, a superset of Python that enriches the language by allowing us, among other features, to declare static variable types and to compile into C. Many other projects in the Python world are tackling efficiency issues, with the overall goal of making pure Python implementations faster. In this book, we won't dig into Cython or any of these promising projects, but we'll make use of NumPy (especially through other libraries that employ NumPy) for data analysis.
Python development environment setup
When this book was started, Python 3.5 had just been released and received some attention for its latest features, such as improved support for asynchronous programming and the semantic definition of type hints. In terms of usage, Python 3.5 is probably not widely adopted yet, but it represents the current line of development of the language.
Note
The examples in this book are compatible with Python 3, particularly with versions 3.4 and 3.5 (and later).
In the never-ending discussion about choosing between Python 2 and Python 3, one point to keep in mind is that support for Python 2 will be discontinued in a few years (at the time of writing, the sunset date is 2020). New features are not developed for Python 2, as this branch is maintained only for bug fixes. On the other hand, many libraries are still developed for Python 2 first, with support for Python 3 added later. For this reason, from time to time, there could be a minor hiccup in terms of compatibility for some library, which is usually resolved by the community quite quickly. In general, if there is no strong reason against this choice, the preference should go to Python 3, especially for new green-field projects.
In order to keep the development environment clean, and ease the transition from prototype to production, the suggestion is to use virtualenv
to manage a virtual environment and install dependencies. virtualenv
is a tool for creating and managing isolated Python environments. By using an isolated virtual environment, developers avoid polluting the global Python environment with libraries that could be incompatible with each other. The tool allows us to maintain multiple projects that require different configurations and to easily switch from one to the other. Moreover, a virtual environment can be installed in a local folder that is accessible to users without administrative privileges.
To install virtualenv
in the global Python environment in order to make it available to all users, we can use pip
from a terminal (Linux/Unix) or command prompt (Windows):
$ [sudo] pip install virtualenv
The sudo
command might be necessary on Linux/Unix or macOS if our current user doesn't have administrator privileges on the system.
If a package is already installed, it can be upgraded to the latest version:
$ pip install --upgrade [package name]
Note
Since Python 3.4, the pip
tool is shipped with Python. Previous versions require a separate installation of pip
as explained on the project page (https://github.com/pypa/pip). The tool can also be used to upgrade itself to the latest version:
$ pip install --upgrade pip
Once virtualenv
is globally available, for each project, we can define a separate Python environment where dependencies are installed in isolation, without tampering with the global environment. In this way, tracking the required dependencies of a single project is extremely easy.
In order to set up a virtual environment, follow these steps:
$ mkdir my_new_project # create a new project folder
$ cd my_new_project # enter project folder
$ virtualenv my_env # setup custom virtual environment
This will create a my_env
subfolder, which is also the name of the virtual environment we're creating, in the current directory. Inside this subfolder, we have all the necessary tools to create the isolated Python environment, including the Python binaries and the standard library. In order to activate the environment, we can type the following command:
$ source my_env/bin/activate
Once the environment is active, the following will be visible on the prompt:
(my_env)$
Python packages can be installed for this particular environment using pip
:
(my_env)$ pip install [package-name]
All the new Python libraries installed with pip
when the environment is active will be installed into my_env/lib/python{VERSION}/site-packages
. Notice that, since this is a local folder, we don't need administrative access to run this command.
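To keep track of the exact dependencies installed in the environment, a common (and entirely optional) practice is to freeze them into a requirements file, which can later be used to recreate the same setup, for example, in a freshly created environment:
(my_env)$ pip freeze > requirements.txt
(my_env)$ pip install -r requirements.txt # e.g. in a new environment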
When we want to deactivate the virtual environment, we can simply type the following command:
$ deactivate
The process described earlier should work for the official Python distributions that are shipped (or available for download) with your operating system.
Conda, Anaconda, and Miniconda
There is also one more option to consider, called conda (http://conda.pydata.org/), which is gaining some traction in the scientific community as it makes the dependency management quite easy. Conda is an open source package manager and environment manager for installing multiple versions of software packages (and related dependencies), which makes it easy to switch from one version to the other. It supports Linux, macOS, and Windows, and while it was initially created for Python, it can be used to package and distribute any software.
There are two main distributions that ship with conda: the batteries-included version, Anaconda, which comes with approximately 100 packages for scientific computing already installed, and the lightweight version, Miniconda, which simply comes with Python and the conda installer, without external libraries.
If you're new to Python, have some time for the bigger download and disk space to spare, and don't want to install all the packages manually, you can get started with Anaconda. For Windows and macOS, Anaconda is available with either a graphical or a command-line installer. Figure 1.5 shows a screen capture of the installation procedure on macOS. For Linux, only the command-line installer is available. In all cases, it's possible to choose between Python 2 and Python 3. If you prefer to have full control over your system, Miniconda will probably be your favorite option:
Once you've installed your version of conda, in order to create a new conda environment, you can use the following command:
$ conda create --name my_env python=3.4 # or favorite version
The environment can be activated with the following command:
$ conda activate my_env
Similar to what happens with virtualenv
, the environment name will be visible in the prompt:
(my_env)$
New packages can be installed for this environment with the following command:
$ conda install [package-name]
Finally, you can deactivate an environment by typing the following command:
$ conda deactivate
Another nice feature of conda is the ability to install packages from pip as well, so if a particular library is not available via conda install
, or it's not been updated to the latest version we need, we can always fall back to the traditional Python package manager while using a conda environment.
Unless specified otherwise, conda will, by default, look for packages on
https://anaconda.org, while pip
makes use of the Python Package Index (PyPI for short, also known as the CheeseShop) at https://pypi.python.org/pypi. Both installers can also be instructed to install packages from the local filesystem or from a private repository.
The following section will use pip
to install the required packages, but you can easily switch to conda if you prefer to use this alternative.
This section introduces two of the foundational packages for scientific Python: NumPy and pandas.
NumPy (Numerical Python) offers fast and efficient processing of array-like data structures. For numerical data, storing and manipulating data with the Python built-ins (for example, lists or dictionaries) is much slower than using a NumPy array. Moreover, NumPy arrays are often used by other libraries as containers for the input and output of algorithms that require vectorized operations.
To install NumPy with pip
/virtualenv
, use the following command:
$ pip install numpy
When using the batteries-included Anaconda distribution, developers will find both NumPy and pandas preinstalled, so the preceding installation step is not necessary.
The core data structure of this library is the multi-dimensional array called ndarray
.
The following snippet, run from the interactive interpreter, showcases the creation of a simple array with NumPy:
>>> import numpy as np
>>> data = [1, 2, 3] # a list of int
>>> my_arr = np.array(data)
>>> my_arr
array([1, 2, 3])
>>> my_arr.shape
(3,)
>>> my_arr.dtype
dtype('int64')
>>> my_arr.ndim
1
The example shows that our data are represented by a one-dimensional array (the ndim
attribute) with three elements as we expect. The data type of the array is int64
as all our inputs are integers.
We can observe the speed of the NumPy array by profiling a simple operation, such as the sum of a list, using the timeit
module:
# Chap01/demo_numpy.py
from timeit import timeit
import numpy as np
if __name__ == '__main__':
    setup_sum = 'data = list(range(10000))'
    setup_np = 'import numpy as np;'
    setup_np += 'data_np = np.array(list(range(10000)))'
    run_sum = 'result = sum(data)'
    run_np = 'result = np.sum(data_np)'
    time_sum = timeit(run_sum, setup=setup_sum, number=10000)
    time_np = timeit(run_np, setup=setup_np, number=10000)
    print("Time for built-in sum(): {}".format(time_sum))
    print("Time for np.sum(): {}".format(time_np))
The timeit
module takes a piece of code as the first parameter and runs it a number of times, producing the time required for the run as output. In order to focus on the specific piece of code that we're analyzing, the initial data setup and the required imports are moved to the setup
parameter that will be run only once and will not be included in the profiling. The last parameter, number
, limits the number of iterations to 10,000 instead of the default, which is 1 million. The output you observe should look as follows:
Time for built-in sum(): 0.9970562970265746
Time for np.sum(): 0.07551316602621228
The built-in sum()
function is more than 10 times slower than the NumPy sum()
function. For more complex pieces of code, we can easily observe differences of a greater order of magnitude.
Tip
Naming conventions
The Python community has converged on some de facto standards for importing popular libraries. NumPy and pandas are two well-known examples, as they are usually imported with an alias, for example:
import numpy as np
In this way, NumPy functionalities can be accessed with np.function_name()
as illustrated in the preceding examples. Similarly, the pandas library is aliased to pd
. In principle, importing the whole library namespace with from numpy import *
is considered bad practice because it pollutes the current namespace.
Some of the characteristics of a NumPy array that we want to keep in mind are detailed as follows:
- The size of a NumPy array is fixed at creation, unlike, for example, Python lists, which can be resized dynamically; operations that change the size of the array are really creating a new one and deleting the original.
- The data type must be the same for each element of the array (with the exception of arrays of Python objects, which can hence have different memory sizes).
- NumPy promotes the use of operations with vectors, producing more compact and readable code.
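As a minimal illustration of the last point, element-wise operations can be expressed directly on whole arrays, without explicit loops. The variable names and numbers here are made up for the example, and the exact output formatting may vary slightly between NumPy versions:
>>> import numpy as np
>>> prices = np.array([1.5, 2.0, 2.5])
>>> quantities = np.array([10, 20, 30])
>>> prices * quantities # element-wise product, no explicit loop
array([15., 40., 75.])
>>> (prices * quantities).sum()
130.0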
The second library introduced in this section is pandas. It is built on top of NumPy, so it also provides fast computation, and it offers convenient data structures, called Series and DataFrame, which allow us to perform data manipulation in a flexible and concise way.
Some of the nice features of pandas include the following:
- Fast and efficient objects for data manipulation
- Tools to read and write data between different formats such as CSV, text files, MS Excel spreadsheets, or SQL data structures
- Intelligent handling of missing data and related data alignment
- Label-based slicing and dicing of large datasets
- SQL-like data aggregations and data transformation
- Support for time series functionalities
- Integrated plotting functionalities
We can install pandas from the CheeseShop with the usual procedure:
$ pip install pandas
Let's consider the following example, run from the Python interactive interpreter, using a small made-up dataset of users:
>>> import pandas as pd
>>> data = {'user_id': [1, 2, 3, 4], 'age': [25, 35, 31, 19]}
>>> frame = pd.DataFrame(data, columns=['user_id', 'age'])
>>> frame.head()
user_id age
0 1 25
1 2 35
2 3 31
3 4 19
The initial data layout is based on a dictionary, where the keys are attributes of the users (a user ID and the age expressed as a number of years). The values in the dictionary are lists, and for each user, the corresponding attributes are aligned by position. Once we create the DataFrame with these data, the alignment of the data becomes immediately clear. The head() function prints the data in tabular form, truncating to the first five rows if the data is bigger than that.
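The number of rows to display can also be passed explicitly, for example, to peek at just the first two users:
>>> frame.head(2)
   user_id  age
0        1   25
1        2   35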
We can now augment the DataFrame by adding one more column:
>>> frame['over_thirty'] = frame['age'] > 30
>>> frame.head()
user_id age over_thirty
0 1 25 False
1 2 35 True
2 3 31 True
3 4 19 False
Using the declarative syntax of pandas, we don't need to iterate through the whole column in order to access its data; instead, we can apply a SQL-like operation, as shown in the preceding example. This operation used the existing data to create a column of Booleans. We can also augment the DataFrame by adding new data:
>>> frame['likes_python'] = pd.Series([True, False, True, True], index=frame.index)
>>> frame.head()
user_id age over_thirty likes_python
0 1 25 False True
1 2 35 True False
2 3 31 True True
3 4 19 False True
We can observe some basic descriptive statistics using the describe()
method:
>>> frame.describe()
user_id age over_thirty likes_python
count 4.000000 4.0 4 4
mean 2.500000 27.5 0.5 0.75
std 1.290994 7.0 0.57735 0.5
min 1.000000 19.0 False False
25% 1.750000 23.5 0 0.75
50% 2.500000 28.0 0.5 1
75% 3.250000 32.0 1 1
max 4.000000 35.0 True True
So, for example, 50% of our users are over 30, and 75% of them like Python.
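The same declarative style can be used to slice the data. As a small illustration of the label-based selection and SQL-like filtering mentioned earlier, we can select only the users over 30 (the choice of columns here is arbitrary):
>>> frame[frame['over_thirty']]
   user_id  age over_thirty likes_python
1        2   35        True        False
2        3   31        True         True
>>> frame.loc[frame['over_thirty'], ['user_id', 'age']]
   user_id  age
1        2   35
2        3   31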
Machine learning is the discipline that studies and develops algorithms to learn from, and make predictions on, data. It is strongly related to data mining, and sometimes the names of the two fields are used interchangeably. A common distinction between the two fields is roughly the following: machine learning focuses on predictions based on known properties of the data, while data mining focuses on the discovery of unknown properties of the data. Both fields borrow algorithms and techniques from each other. One of the goals of this book is to be practical, so we acknowledge that, academically, the two fields often have distinct goals and assumptions despite the big overlap, but we won't worry too much about this distinction.
Some examples of machine learning applications include the following:
- Deciding whether an incoming e-mail is spam or not
- Choosing the topic of a news article from a list of known subjects such as sport, finance, or politics
- Analyzing bank transactions to identify fraud attempts
- Deciding, from the query apple, whether a user is interested in fruit or in computers
Some of the most popular methodologies can be categorized into supervised and unsupervised learning approaches, described in the following paragraphs. This is an oversimplification that doesn't capture the whole breadth and depth of the machine learning field, but it's a good starting point to appreciate some of its technicalities.
Supervised learning approaches can be employed to solve problems such as classification, in which the data comes with additional attributes that we want to predict, for example, the label of a class. In this case, the classifier can associate each input object with the desired output. By inferring from the features of the input objects, the classifier can then predict the desired label for new, unseen inputs. Common techniques include Naive Bayes (NB), Support Vector Machines (SVM), and models that belong to the Neural Networks (NN) family, such as perceptrons or multi-layer perceptrons.
The sample inputs used by the learning algorithm to build the mathematical model are called training data, while the unseen inputs that we want to obtain a prediction on are called test data. Inputs of a machine learning algorithm are typically in the form of a vector with each element of the vector representing a feature of the input. For supervised learning approaches, the desired output to assign to each of the unseen inputs is typically called label or target.
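To make these terms concrete, the following is only a minimal sketch of a supervised classification workflow, using scikit-learn (which is introduced a little later in this chapter); the choice of dataset, classifier, and split proportions is arbitrary and serves purely as an illustration:
# Hypothetical sketch of a supervised learning workflow
from sklearn import datasets
from sklearn.model_selection import train_test_split  # in very old versions: sklearn.cross_validation
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
# Split the labeled samples into training data and test data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)
classifier = GaussianNB()
classifier.fit(X_train, y_train)          # learn from the training data
predictions = classifier.predict(X_test)  # predict labels for unseen inputs
print(classifier.score(X_test, y_test))   # accuracy on the test data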
Unsupervised learning approaches are instead applied to problems in which the data come without a corresponding output value. A typical example of this kind of problem is clustering. In this case, an algorithm tries to find hidden structures in the data in order to group similar items into clusters. Another application consists of identifying items that don't appear to belong to a particular group (for example, outlier detection). An example of a common clustering algorithm is k-means.
The main Python package for machine learning is scikit-learn. It's an open source collection of machine learning algorithms that includes tools to access and preprocess data, evaluate the output of an algorithm, and visualize the results.
You can install scikit-learn with the common procedure via the CheeseShop:
$ pip install scikit-learn
Without digging into the details of the techniques, we will now walk through an application of scikit-learn to solve a clustering problem.
As we don't have social data yet, we can employ one of the datasets that is shipped together with scikit-learn.
The data that we're using is called Fisher's Iris dataset, also referred to as the Iris flower dataset. It was introduced in the 1930s by Ronald Fisher and it's today one of the classic datasets: given its small size, it's often used in the literature for toy examples. The dataset contains 50 samples from each of the three species of Iris, and for each sample, four features are reported: the length and width of the petals and sepals.
The dataset is commonly used as a showcase example for classification, as the data comes with the correct labels for each sample, while its application for clustering is less common, mainly because there are just two clearly visible clusters with a rather obvious separation. Given its small size and simple structure, it makes a good case for a gentle introduction to data analysis with scikit-learn. If you want to run the example, including the data visualization part, you also need to install the matplotlib library with pip install matplotlib
. More details on data visualization with Python are discussed later in this chapter.
Let's take a look at the following sample code:
# Chap01/demo_sklearn.py
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
if __name__ == '__main__':
    # Load the data
    iris = datasets.load_iris()
    X = iris.data
    petal_length = X[:, 2]
    petal_width = X[:, 3]
    true_labels = iris.target
    # Apply KMeans clustering
    estimator = KMeans(n_clusters=3)
    estimator.fit(X)
    predicted_labels = estimator.labels_
    # Color scheme definition: red, yellow and blue
    color_scheme = ['r', 'y', 'b']
    # Markers definition: circle, "x" and "plus"
    marker_list = ['o', 'x', '+']
    # Assign colors/markers to the predicted labels
    colors_predicted_labels = [color_scheme[lab] for lab in predicted_labels]
    markers_predicted = [marker_list[lab] for lab in predicted_labels]
    # Assign colors/markers to the true labels
    colors_true_labels = [color_scheme[lab] for lab in true_labels]
    markers_true = [marker_list[lab] for lab in true_labels]
    # Plot and save the two scatter plots
    for x, y, c, m in zip(petal_width,
                          petal_length,
                          colors_predicted_labels,
                          markers_predicted):
        plt.scatter(x, y, c=c, marker=m)
    plt.savefig('iris_clusters.png')
    # Clear the figure so the second plot doesn't overlap the first
    plt.clf()
    for x, y, c, m in zip(petal_width,
                          petal_length,
                          colors_true_labels,
                          markers_true):
        plt.scatter(x, y, c=c, marker=m)
    plt.savefig('iris_true_labels.png')
    print(iris.target_names)
Firstly, we will load the dataset into the iris variable, which is an object containing both the data and information about the data. In particular, iris.data contains the data itself, in the form of a NumPy array of arrays, while iris.target contains a numeric label that represents the class each sample belongs to. In each sample vector, the four values represent, respectively, the sepal length in cm, the sepal width in cm, the petal length in cm, and the petal width in cm. Using the slicing notation for NumPy arrays, we extract the third and fourth element of each sample into petal_length and petal_width, respectively. These will be used to plot the samples in a two-dimensional representation, even though the vectors have four dimensions.
The clustering process consists of two lines of code: one to create an instance of the KMeans algorithm and a second to fit() the data to the model. The simplicity of this interface is one of the characteristics of scikit-learn, which, in most cases, allows you to apply a learning algorithm with just a few lines of code. For the application of the k-means algorithm, we choose the number of clusters to be three, as this is given by the data. Keep in mind that knowing the appropriate number of clusters in advance is not something that usually happens. Determining the correct (or the most interesting) number of clusters is a challenge in itself, distinct from the application of a clustering algorithm per se. As the purpose of this example is to briefly introduce scikit-learn and the simplicity of its interface, we take this shortcut. Normally, more effort is put into preparing the data in a format that is understood by scikit-learn.
The second half of the example serves the purpose of visualizing the data using matplotlib. Firstly, we will define a color scheme to visually differentiate the three clusters, using red, yellow, and blue defined in the color_scheme
list. Secondly, we will exploit the fact that both the real labels and cluster associations for each sample are given as integers, starting from 0, so they can be used as indexes to match one of the colors.
Notice that while the numbers for the real labels are associated with a particular meaning, that is, a class name, the cluster numbers simply indicate that a given sample belongs to a cluster, without carrying any information about the meaning of the cluster. Specifically, the three classes for the real labels are setosa, versicolor, and virginica, respectively, the three species of Iris represented in the dataset.
The last lines of the example produce two scatterplots of the data, one for the real labels and another for the cluster associations, using the petal length and width as the two dimensions. The two plots are represented in Figure 1.6. The position of the items in the two plots is, of course, the same, but what we can observe is how the algorithm has split the three groups. In particular, the cluster at the bottom left is clearly separated from the other two, and the algorithm can easily identify it without doubt. The other two clusters are more difficult to distinguish, as some of the elements overlap, so the algorithm makes some mistakes in this region.
Once again, it's worth mentioning that here we can spot the mistakes because we know the real class of each sample. The algorithm has simply created an association based on the features given to it as input:
Natural language processing
Natural language processing (NLP) is the discipline related to the study of methods and techniques for automatic analysis, understanding, and generation of natural language, that is, the language as written or spoken naturally by humans.
Academically, it's been an active field of study for many decades, as its early days are generally attributed to Alan Turing, one of the fathers of computer science, who proposed a test to evaluate machine intelligence in 1950. The concept is fairly straightforward: if a human judge is having a written conversation with two agents, a human and a machine, can the machine fool the judge into thinking it's not a machine? If this happens, the machine passes the test and shows signs of intelligence.
The test is nowadays known as the Turing Test, and after being common knowledge only in computer science circles for a long time, it's recently been brought to a wider audience by pop media. The movie The Imitation Game (2014), for example, is loosely based on the biography of Alan Turing, and its title is a clear reference to the test itself. Another movie that mentions the Turing Test is Ex Machina (2015), with a stronger emphasis on the development of an artificial intelligence that can lie and fool humans for its own benefit. In this movie, the Turing Test is played between a human judge and the human-looking robot, Ava, with direct verbal interaction. Without spoiling the end of the movie for those who haven't watched it, the story develops with the artificial intelligence proving to be much smarter, in a shady and artful way, than the human. Interestingly, the futuristic robot was trained using search engine logs, to understand and mimic how humans ask questions.
This little detour between the past and a hypothetical future of artificial intelligence is meant to highlight how mastering human language will be central to the development of advanced artificial intelligence (AI). Despite all the recent improvements, we're not quite there yet, and NLP is still a hot topic at the moment.
In the context of social media, the obvious opportunity for us is that there's a huge amount of natural language waiting to be mined. The amount of textual data available via social media is continuously increasing, and for many questions, the answer has probably already been written; but transforming raw text into information is not an easy task. Conversations happen on social media all the time: users ask technical questions on web forums and find answers, and customers describe their experiences with a particular product through comments or reviews. Knowing the topics of these conversations, finding the most expert users who answer these questions, and understanding the opinion of the customers who write these reviews are all tasks that can be achieved with a fair level of accuracy by means of NLP.
Moving on to Python, one of the most popular packages for NLP is Natural Language Toolkit (NLTK). The toolkit provides a friendly interface for many of the common NLP tasks, as well as lexical resources and linguistic data.
Some of the tasks that we can easily perform with the NLTK include the following:
- Tokenization of words and sentences, that is, the process of breaking a stream of text down into individual tokens
- Tagging words for part-of-speech, that is, assigning words to categories according to their syntactic function, such as nouns, adjectives, verbs, and so on
- Identifying named entities, for example, identifying and classifying references to persons, locations, organizations, and so on
- Applying machine learning techniques (for example, classification) to text
- In general, extracting information from raw text
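For example, part-of-speech tagging boils down to a single function call, once NLTK and its linguistic data (described in the following paragraphs) are installed. The exact tags in the output depend on the tagger model, so they are only hinted at in the comment:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize('The quick brown fox jumped'))
# a list of (token, tag) pairs, for example ('fox', 'NN') for a noun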
The installation of NLTK follows the common procedure via the CheeseShop:
$ pip install nltk
A difference between NLTK and many other packages is that this framework also comes with linguistic resources (that is, data) for specific tasks. Given their size, such data is not included in the default installation, but has to be downloaded separately.
The installation procedure is fully documented at http://www.nltk.org/data.html and it's strongly recommended that you read this official guide for all the details.
In short, from a Python interpreter, you can type the following code:
>>> import nltk
>>> nltk.download()
If you are in a desktop environment, this will open a new window that allows you to browse the available data. If a desktop environment is not available, you'll see a textual interface in the terminal. You can select individual packages to download, or even download all the data (that will take approximately 2.2 GB of disk space).
The downloader will try to save the file at a central location (C:\nltk_data
on Windows and /usr/share/nltk_data
on Unix and Mac) if you are working from an administrator account, or at your home folder (for example, ~/nltk_data
) if you are a regular user. You can also choose a custom folder, but in this case, NLTK will look for the $NLTK_DATA
environment variable to know where to find its data, so you'll need to set it accordingly.
If disk space is not a problem, installing all the data is probably the most convenient option, as you do it once and you can forget about it. On the other hand, downloading everything doesn't give you a clear understanding of what resources are needed. If you prefer to have full control on what you install, you can download the packages you need one by one. In this case, from time to time during your NLTK development, you'll find a little bump on the road in the form of LookupError
, meaning that a resource you're trying to use is missing and you have to download it.
For example, after a fresh NLTK installation, if we try to tokenize some text from the Python interpreter, we can type the following code:
>>> from nltk import word_tokenize
>>> word_tokenize('Some sample text')
Traceback (most recent call last):
# some long traceback here
LookupError:
*********************************************************************
Resource 'tokenizers/punkt/PY3/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/Users/marcob/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
*********************************************************************
This error tells us that the punkt
resource, responsible for the tokenization, is not found in any of the conventional folders, so we'll have to go back to the NLTK downloader and solve the issue by getting this package.
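In this specific case, rather than re-opening the interactive downloader, the missing resource can also be fetched directly by name:
>>> import nltk
>>> nltk.download('punkt')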
Assuming that we now have a fully working NLTK installation, we can go back to the previous example and discuss tokenization in a little more detail.
In the context of NLP, tokenization (also called segmentation) is the process of breaking a piece of text into smaller units called tokens or segments. While tokens can be interpreted in many different ways, typically we are interested in words and sentences. A simple example using word_tokenize()
is as follows:
>>> from nltk import word_tokenize
>>> text = "The quick brown fox jumped over the lazy dog"
>>> words = word_tokenize(text)
>>> print(words)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
The output of word_tokenize()
is a list of strings, with each string representing a word. The word boundaries in this example are given by white spaces. In a similar fashion, sent_tokenize()
returns a list of strings, with each string representing a sentence, bounded by punctuation. The following example involves both functions:
>>> from nltk import word_tokenize, sent_tokenize
>>> text = "The quick brown fox jumped! Where? Over the lazy dog."
>>> sentences = sent_tokenize(text)
>>> print(sentences)
#['The quick brown fox jumped!', 'Where?', 'Over the lazy dog.']
>>> for sentence in sentences:
... words = word_tokenize(sentence)
... print(words)
# ['The', 'quick', 'brown', 'fox', 'jumped', '!']
# ['Where', '?']
# ['Over', 'the', 'lazy', 'dog', '.']
As you can see, the punctuation symbols are considered tokens on their own, and as such, are included in the output of word_tokenize()
. This raises a question that we haven't really asked so far: how do we define a token? The word_tokenize() function implements an algorithm designed for standard English. As the focus of this book is on data from social media, it's fair to investigate whether the rules for standard English also apply in our context. Let's consider a fictitious example of Twitter data:
>>> tweet = '@marcobonzanini: an example! :D http://example.com #NLP'
>>> print(word_tokenize(tweet))
# ['@', 'marcobonzanini', ':', 'an', 'example', '!', ':', 'D', 'http', ':', '//example.com', '#', 'NLP']
The sample tweet introduces some peculiarities that break the standard tokenization:
- Usernames are prefixed with an @ symbol, so @marcobonzanini is split into two tokens, with @ being recognized as punctuation
- Emoticons such as :D are common slang on chats, text messages, and of course, social media, but they're not officially part of standard English, hence the split
- URLs are frequently used to share articles or pictures, but once again, they're not part of standard English, so they are broken down into components
- Hash-tags such as #NLP are strings prefixed by # and are used to define the topic of the post, so that other users can easily search for a topic or follow the conversation
The previous example shows that an apparently straightforward problem such as tokenization can hide many tricky edge cases that will require something smarter than the initial intuition to be solved. Fortunately, NLTK offers the following off-the-shelf solution:
>>> from nltk.tokenize import TweetTokenizer
>>> tokenizer = TweetTokenizer()
>>> tweet = '@marcobonzanini: an example! :D http://example.com #NLP'
>>> print(tokenizer.tokenize(tweet))
# ['@marcobonzanini', ':', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']
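The TweetTokenizer also accepts a few optional arguments, for example, strip_handles to remove usernames and reduce_len to shorten exaggerated character repetitions. The sample tweet below is made up just to illustrate these options, and the output shown in the comment is only indicative:
>>> from nltk.tokenize import TweetTokenizer
>>> tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> print(tokenizer.tokenize('@marcobonzanini this is waaaaayyyy cool! :D'))
# the username is dropped and the long repetition is capped, roughly:
# ['this', 'is', 'waaayyy', 'cool', '!', ':D']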
The previous examples should provide a taste of NLTK's simple interface. We will be using this framework on different occasions throughout the book.
With increasing interest in NLP applications, the Python-for-NLP ecosystem has grown dramatically in recent years, with a whole lot of interesting projects getting more and more attention. In particular, Gensim, dubbed topic modeling for humans, is an open-source library that focuses on semantic analysis. Gensim shares with NLTK the tendency to offer an easy-to-use interface, hence the for humans part of the tagline. Another aspect that pushed its popularity is efficiency, as the library is highly optimized for speed, has options for distributed computing, and can process large datasets without the need to hold all the data in memory.
The simple installation of Gensim follows the usual procedure:
$ pip install gensim
The main dependencies are NumPy and SciPy, although if you want to take advantage of the distributed computing capabilities of Gensim, you'll also need to install Python Remote Objects (Pyro4):
$ pip install Pyro4
In order to showcase Gensim's simple interface, we can take a look at the text summarization module:
# Chap01/demo_gensim.py
from gensim.summarization import summarize
import sys
fname = sys.argv[1]
with open(fname, 'r') as f:
    content = f.read()
summary = summarize(content, split=True)
for i, sentence in enumerate(summary):
    print("%d) %s" % (i+1, sentence))
The demo_gensim.py
script takes one command-line parameter: the name of a text file to summarize. In order to test the script, I took a piece of text from the Wikipedia page about The Lord of the Rings, in particular, the paragraph describing the plot of the first volume, The Fellowship of the Ring. The script can be invoked with the following command:
$ python demo_gensim.py lord_of_the_rings.txt
This produces the following output:
1) They nearly encounter the Nazgûl while still in the Shire, but shake off pursuit by cutting through the Old Forest, where they are aided by the enigmatic Tom Bombadil, who alone is unaffected by the Ring's corrupting influence.
2) Aragorn leads the hobbits toward the Elven refuge of Rivendell, while Frodo gradually succumbs to the wound.
3) The Council of Elrond reveals much significant history about Sauron and the Ring, as well as the news that Sauron has corrupted Gandalf's fellow wizard, Saruman.
4) Frodo volunteers to take on this daunting task, and a "Fellowship of the Ring" is formed to aid him: Sam, Merry, Pippin, Aragorn, Gandalf, Gimli the Dwarf, Legolas the Elf, and the Man Boromir, son of the Ruling Steward Denethor of the realm of Gondor.
The summarize()
function in Gensim implements the classic TextRank algorithm. The algorithm ranks sentences according to their importance and selects the most representative ones to produce the output summary. It's worth noting that this approach is an extractive summarization technique, meaning that the output only contains sentences selected from the input as they are, that is, there is no text transformation, rephrasing, and so on. The output size is approximately 25% of the original text. This can be controlled with the optional ratio
argument for a proportional size, or word_count
for a fixed number of words. In both cases, the output will only contain full sentences, that is, sentences will not be broken down to respect the desired output size.
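For example, assuming the same content variable as in the script above, the size of the summary could be tuned with either of the following calls (the values are arbitrary):
summary = summarize(content, ratio=0.1, split=True)      # roughly 10% of the original text
summary = summarize(content, word_count=50, split=True)  # approximately 50 words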
Network theory, a part of graph theory, is the study of graphs as a representation of relationships between discrete objects. Its application to social media takes the form of social network analysis (SNA), a strategy to investigate social structures such as friendships or acquaintances.
NetworkX is one of the main Python libraries for the creation, manipulation, and study of complex network structures. It provides data structures for graphs, as well as many well-known standard graph algorithms.
For the installation from the CheeseShop, we follow the usual procedure:
$ pip install networkx
The following example shows how to create a simple graph with a few nodes, representing users, and a few edges between nodes, representing social relationships between users:
# Chap01/demo_networkx.py
import networkx as nx
from datetime import datetime
if __name__ == '__main__':
    g = nx.Graph()
    # Node attributes are passed as keyword arguments
    g.add_node("John", name='John', age=25)
    g.add_node("Peter", name='Peter', age=35)
    g.add_node("Mary", name='Mary', age=31)
    g.add_node("Lucy", name='Lucy', age=19)
    # Edge attributes work in the same way
    g.add_edge("John", "Mary", since=datetime.today())
    g.add_edge("John", "Peter", since=datetime(1990, 7, 30))
    g.add_edge("Mary", "Lucy", since=datetime(2010, 8, 10))
    # list() keeps the printed output consistent across NetworkX versions
    print(list(g.nodes()))
    print(list(g.edges()))
    print(g.has_edge("Lucy", "Mary"))
# ['John', 'Peter', 'Mary', 'Lucy']
# [('John', 'Peter'), ('John', 'Mary'), ('Mary', 'Lucy')]
# True
Both nodes and edges can carry additional attributes (stored internally as a Python dictionary), which help in describing the semantics of the network.
The Graph
class is used to represent an undirected graph, meaning that the direction of the edges is not considered. This is clear from the use of the has_edge()
function, which checks whether an edge between Lucy and Mary exists. The edge was inserted between Mary and Lucy, but the function shows that the direction is ignored. Further edges between the same nodes will also be ignored, that is, only one edge per node pair is considered. Self-loops are allowed by the Graph
class, although in our example, they are not needed.
Other types of graphs supported by NetworkX are DiGraph
for directed graphs (the direction of the edges matters) and their counterparts for multiple (parallel) edges between nodes, MultiGraph
and MultiDiGraph
, respectively.
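As a quick taste of the standard graph algorithms mentioned above, the following lines could be appended at the end of the demo_networkx.py script (inside the same if __name__ block), still using the toy graph g defined there; this is just a sketch, and the output format may vary slightly between NetworkX versions:
    print(g.degree("John"))                      # 2, as John is connected to Mary and Peter
    print(nx.shortest_path(g, "Peter", "Lucy"))  # ['Peter', 'John', 'Mary', 'Lucy']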
Data visualization (or data viz) is a cross-field discipline that deals with the visual representation of data. Visual representations are powerful tools that offer the opportunity to understand complex data and are efficient ways to present and communicate the results of a data analysis process in general. Through data visualization, people can see aspects of the data that are not immediately clear. After all, if a picture is worth a thousand words, a good data visualization allows the reader to absorb complex concepts with the help of a simple picture. For example, data visualization can be used by data scientists during the exploratory data analysis steps in order to understand the data. Moreover, data scientists can also use data visualization to communicate with nonexperts and explain to them what is interesting about the data.
Python offers a number of tools for data visualization, for example, the matplotlib library briefly used in the Machine learning section of this chapter. To install the library, use the following command:
$ pip install matplotlib
Matplotlib produces publication-quality figures in a variety of formats. The philosophy behind this library is that a developer should be able to create simple plots with a small number of lines of code. Matplotlib plots can be saved in different file formats, for example, Portable Network Graphics (PNG) or Portable Document Format (PDF).
Let's consider a simple example that plots some two-dimensional data:
# Chap01/demo_matplotlib.py
import matplotlib.pyplot as plt
import numpy as np
if __name__ == '__main__':
    # plot y = x^2 with red dots
    x = np.array([1, 2, 3, 4, 5])
    y = x * x
    plt.plot(x, y, 'ro')
    plt.axis([0, 6, 0, 30])
    plt.savefig('demo_plot.png')
The output of this code is shown in the following diagram:
Aliasing pyplot to plt is a common naming convention, as discussed earlier for other packages such as NumPy and pandas. The plot()
function takes two sequence-like parameters containing the coordinates for x
and y
, respectively. In this example, these coordinates are created as NumPy arrays, but they could be Python lists. The axis() function defines the visible range for the axes. As we're plotting the numbers 1 to 5 squared, our range is 0-6 for the x axis and 0-30 for the y axis. Finally, the savefig()
function produces an image file with the output visualized in Figure 1.7, guessing the image format from the file extension.
Matplotlib produces excellent images for publication, but sometimes there's a need for some interactivity, to allow the user to explore the data by zooming into the details of a visualization dynamically. This kind of interactivity is more in the realm of other programming languages, for example, JavaScript (especially through the popular D3.js library at https://d3js.org), which allows building interactive web-based data visualizations. While this is not the central topic of this book, it is worth mentioning that Python doesn't fall short in this domain, thanks to tools that translate Python objects into the Vega grammar, a declarative format based on JSON that allows the creation of interactive visualizations.
A particularly interesting situation where Python and JavaScript can cooperate well is the case of geographical data. Most social media platforms are accessible through mobile devices. This offers the opportunity to track users' locations and include a geographical aspect in the data analysis. A common data format used to encode and exchange a variety of geographical data structures (such as points or polygons) is GeoJSON (http://geojson.org). As the name suggests, this format is a JSON-based grammar.
A popular JavaScript library for plotting interactive maps is Leaflet (http://leafletjs.com). The bridge between JavaScript and Python is provided by folium, a Python library that makes it easy to visualize geographical data, handled with Python via, for example, GeoJSON, over a Leaflet.js map.
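As a minimal sketch of how folium can be used (the coordinates and file name below are made up for illustration), a map can be created and saved as a self-contained HTML page:
# Hypothetical sketch: an interactive Leaflet map built with folium
import folium

m = folium.Map(location=[51.5074, -0.1278], zoom_start=12)  # centred on London
folium.Marker([51.5074, -0.1278], popup='An example point').add_to(m)
m.save('example_map.html')  # open the HTML file in a browser to explore the map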
It's also worth mentioning that third-party services such as Plotly (https://plot.ly) offer support for the automatic generation of data visualizations, off-loading the burden of creating the interactive components onto their services. Specifically, Plotly offers ample support for creating bespoke data visualizations using their Python client (https://plot.ly/python). The graphs are hosted online by Plotly and linked to a user account (free for public hosting, while private graphs require paid plans).