Machine Learning for the Web: Gaining insight and intelligence from the internet with Python

By Isoni and Steve Essinger
Paperback Jul 2016 298 pages 1st Edition

Chapter 1. Introduction to Practical Machine Learning Using Python

In the technology industry, the skill of analyzing and mining commercial data is becoming more and more important. All the companies that are related to the online world generate data that can be exploited to improve their business, or can be sold to other companies. This huge amount of information, which can be commercially useful, needs to be restructured and analyzed using the expertise of data science (or data mining) professionals. Data science employs techniques known as machine learning algorithms to transform the data into models, which are able to predict the behavior of certain entities that are of high interest to the business. This book is about these algorithms and techniques, which are so crucial in today's technology business world, and how to efficiently deploy them in a real commercial environment. You will learn the most relevant machine-learning techniques and will have the chance to employ them in a series of exercises and applications designed to enhance commercial awareness, so that you can apply the skills learned in this book in your professional work. You are expected to already be familiar with the Python programming language, linear algebra, and statistical methods in order to fully follow the topics discussed in this book.

  • There are many tutorials and classes available online on these subjects, but we recommend you read the official Python documentation (https://docs.python.org/), the books Elementary Statistics by A. Bluman and Statistical Inference by G. Casella and R. L. Berger to understand the main statistical concepts and methods, and Linear Algebra and Its Applications by G. Strang to learn about linear algebra.

The purpose of this introductory chapter is to familiarize you with the more advanced libraries and tools used by machine-learning professionals in Python, such as NumPy, pandas, and matplotlib, which will help you to grasp the necessary technical knowledge to implement the techniques presented in the following chapters. Before continuing with the tutorials and description of the libraries used in this book, we would like to clarify the main concepts of the machine-learning field, and give a practical example of how a machine-learning algorithm can predict useful information in a real context.

General machine-learning concepts

In this book, the most relevant machine-learning algorithms are going to be discussed and used in exercises to make you familiar with them. In order to explain these algorithms and to understand the content of this book, there are a few general concepts we need to visit that are going to be described hereafter.

First of all, a good definition of machine learning is the subfield of computer science that has been developed from the fields of pattern recognition, artificial intelligence, and computational learning theory. Machine learning can also be seen as a data-mining tool, which focuses more on the data analysis aspects to understand the data provided. The purpose of this discipline is the development of programs that are able to learn from previously seen data through tunable parameters (usually arrays of double-precision values), which are designed to be adjusted automatically to improve the resulting predictions. In this way, computers can predict a behavior by generalizing the underlying structure of the data, instead of just storing (or retrieving) the values like usual database systems. For this reason, machine learning is associated with computational statistics, which also attempts to predict a behavior based on previous data. Common industrial applications of machine-learning algorithms are spam filtering, search engines, optical character recognition, and computer vision. Now that we have defined the discipline, we can describe the terminology used in each machine-learning problem in more detail.

Any learning problem starts with a data set of N samples, which are used to predict the properties of future unknown data. Each sample is typically composed of more than a single value, so it is a vector. The components of this vector are called features. For example, imagine predicting the price of a second-hand car based on its characteristics: year of manufacture, color, engine size, and so on. Each car i in the dataset will be a vector of features x(i) that corresponds to its color, engine size, and many others. In this case, there is also a target (or label) variable y(i) associated with each car i, which is the second-hand car price. A training example is formed by a pair (x(i), y(i)), and therefore the complete set of N data points used to learn is called a training dataset {(x(i), y(i)); i=1,…,N}. The symbol x will denote the space of feature (input) values, and y the space of target (output) values. The machine-learning algorithm chosen to solve the problem will be described by a mathematical model, with some parameters to tune on the training set. After the training phase is completed, the performance of the prediction is evaluated using two further sets: the validation and testing sets. The validation set is used to choose, among multiple models, the one that returns the best results, while the testing set is usually used to determine the actual performance of the chosen model. Typically the dataset is divided into 50% training set, 25% validation set, and 25% testing set.
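
The 50/25/25 division described above can be sketched with NumPy as follows. This is only an illustration on a made-up dataset: the arrays X and y, the seed, and all the variable names are assumptions, not part of the book's code.

```python
import numpy as np

# Hypothetical toy dataset: 8 samples with 3 features each, plus one target per sample.
rng = np.random.RandomState(0)
X = rng.rand(8, 3)
y = rng.rand(8)

# Shuffle the sample indices so that each subset is a random selection of rows.
indices = rng.permutation(len(X))

# 50% training, 25% validation, 25% testing.
n_train = int(len(X) * 0.50)
n_valid = int(len(X) * 0.25)
train_idx = indices[:n_train]
valid_idx = indices[n_train:n_train + n_valid]
test_idx = indices[n_train + n_valid:]

X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_valid.shape, X_test.shape)
```

Splitting the shuffled indices, rather than the data itself, keeps each feature vector paired with its own target value.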

The learning problems can be divided into two main categories (both of which are extensively covered in this book):

  • Unsupervised learning: The training dataset is given by input feature vectors x without any corresponding label values. The usual objective is to find similar examples within the data using clustering algorithms, or to project the data from a high-dimensional space down to a few dimensions (blind signal separation algorithms such as principal component analysis). Since there is usually no target value for each training example, it is not possible to evaluate the errors of the model directly from the data; you need to use a technique that evaluates how similar the elements within each cluster are to each other and how different they are from the members of the other clusters. This is one of the major differences between unsupervised learning and supervised learning.
  • Supervised learning: Each data sample is given in a pair consisting of an input feature vector and a label value. The task is to infer the parameters to predict the target values of the test data. These types of problems can be further divided into:
    • Classification: The data targets belong to two or more classes, and the goal is to learn how to predict the class of unlabeled data from the training set. Classification is a discrete (as opposed to continuous) form of supervised learning, where the label has a limited number of categories. A practical example of the classification problem is the handwritten digit recognition example, in which the objective is to match each feature vector to one of a finite number of discrete categories.
    • Regression: The label is a continuous variable. For example, the prediction of the height of a child based on his age and weight is a regression problem.

We are going to focus on unsupervised learning methods in Chapter 2, Machine Learning Techniques: Unsupervised Learning, while the most relevant supervised learning algorithms are discussed in Chapter 3, Supervised Machine Learning. Chapter 4, Web Mining Techniques will approach the field of web-mining techniques that can also be considered as both supervised and unsupervised methods. The recommendation systems, which are again part of the supervised learning category, are described in Chapter 5, Recommendation Systems. The Django web framework is then introduced in Chapter 6, Getting Started with Django, and then an example of the recommendation system (using both the Django framework and the algorithms explained in Chapter 5, Recommendation Systems) is detailed in Chapter 7, Movie Recommendation System Web Application. We finish the book with an example of a Django web-mining application, using some of the techniques learned in Chapter 4, Web Mining Techniques. By the end of the book you should be able to understand the different machine-learning methods and be able to deploy them in a real working web application using Django.

We continue the chapter with an example of how machine learning can be used in real business problems, followed by tutorials on the Python libraries (NumPy, pandas, and matplotlib) that are essential for putting the algorithms learned in each of the following chapters into practice.

Machine-learning example

To explain further what machine learning can do with real data, we consider the following example (the code is available in the author's GitHub book folder https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_1/). We have taken the Internet Advertisements Data Set from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu). Web advertisements have been collected from various web pages, and each of them has been transformed into a numeric feature vector. From the ad.names file we can see that the first three features represent the image size in the page, and the other features are related to the presence of specific words or phrases in the URL of the image or in the text (1558 features in total). The label values are either ad or nonad, depending on whether the page has an advert or not. As an example, a web page in ad.data is given by:

125, 125, ...., 1. 0, 1, 0, ad.

Based on this data, a classical machine-learning task is to find a model to predict which pages are adverts and which are not (classification). To start with, we consider the data file ad.data, which contains the full feature vectors and labels, but also has missing values indicated with a ?. We can use the pandas Python library to transform each ? into -1 (a full tutorial on the pandas library is given later in this chapter):

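import numpy as np
import pandas as pd
df = pd.read_csv('ad-dataset/ad.data', header=None)
# Missing values appear as '?' preceded by a varying number of spaces
df = df.replace({'?': np.nan})
df = df.replace({'  ?': np.nan})
df = df.replace({'   ?': np.nan})
df = df.replace({'    ?': np.nan})
df = df.replace({'     ?': np.nan})
# Replace every missing value with -1
df = df.fillna(-1)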

A DataFrame is created with the data from the ad.data file, and each ? is first replaced with the NaN value (the replace function), then with -1 (the fillna function). Now each label has to be transformed into a numerical value (and so do all the other values in the data):
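
As an alternative to the repeated replace calls, the missing-value markers can also be handled while reading the file. The following is a minimal sketch, not the book's code: it assumes a tiny made-up sample in the style of ad.data (the real file has many more columns) and relies on the na_values and skipinitialspace parameters of read_csv.

```python
import io
import pandas as pd

# Hypothetical three-column sample in the style of ad.data.
sample = io.StringIO("125,   ?,ad.\n60,468,nonad.\n")

# na_values marks '?' as missing; skipinitialspace drops the spaces that precede it.
df_sample = pd.read_csv(sample, header=None, na_values='?', skipinitialspace=True)

# Fill the missing feature values with -1, leaving the label column untouched.
features = df_sample.iloc[:, :-1].fillna(-1)
print(features)
```

With this approach the feature columns are parsed as numbers directly, so no separate conversion step is needed for them.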

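# Replace the string labels in the last column with 1 (ad.) and 0 (nonad.)
adindices = df[df.columns[-1]] == 'ad.'
df.loc[adindices, df.columns[-1]] = 1
nonadindices = df[df.columns[-1]] == 'nonad.'
df.loc[nonadindices, df.columns[-1]] = 0
df[df.columns[-1]] = df[df.columns[-1]].astype(float)
# apply returns a new DataFrame, so the result has to be assigned back
df = df.apply(lambda x: pd.to_numeric(x))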

Each ad. label has been transformed into 1 while the nonad. values have been replaced by 0. All the columns (features) need to be numeric and float types (using the astype function and the to_numeric function through a lambda function).
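
The same label conversion can be written more compactly with the map method of a pandas Series. This is a sketch on a made-up label column (raw_labels and numeric_labels are hypothetical names), not the code used in the book:

```python
import pandas as pd

# Hypothetical label column in the style of the last column of ad.data.
raw_labels = pd.Series(['ad.', 'nonad.', 'nonad.', 'ad.'])

# Map each string label to its numeric code; any unknown label would become NaN.
numeric_labels = raw_labels.map({'ad.': 1.0, 'nonad.': 0.0})
print(numeric_labels.tolist())  # [1.0, 0.0, 0.0, 1.0]
```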

We want to use the Support Vector Machine (SVM) algorithm provided by the scikit-learn library (see Chapter 3, Supervised Machine Learning) to predict 20% of the labels in the data. First, we split the data into two sets: a training set (80%) and a test set (20%):

import numpy as np
dataset = df.values[:,:]
np.random.shuffle(dataset)
data = dataset[:,:-1]
labels = dataset[:,-1].astype(float)
ntrainrows = int(len(data)*.8)
train = data[:ntrainrows,:]
trainlabels = labels[:ntrainrows]
test = data[ntrainrows:,:]
testlabels = labels[ntrainrows:]

Using the functions provided by NumPy (a tutorial is provided later in this chapter), the data is shuffled (function random.shuffle) before being split, to ensure that the rows in the two sets are randomly selected. The slice :-1 selects every column except the last one (the features), while the index -1 selects only the last column (the labels).
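
The effect of the two index expressions can be checked on a small made-up array (toy, toy_data, and toy_labels are hypothetical names used only for this illustration):

```python
import numpy as np

# Hypothetical dataset: 3 rows, 2 feature columns, and a final label column.
toy = np.array([[125., 125., 1.],
                [ 57., 468., 0.],
                [ 33., 230., 1.]])

toy_data = toy[:, :-1]    # every column except the last (the features)
toy_labels = toy[:, -1]   # only the last column (the labels)

print(toy_data.shape)     # (3, 2)
print(toy_labels)         # [1. 0. 1.]
```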

Now we train our SVM model using the training data:

from sklearn.svm import SVC
clf = SVC(gamma=0.001, C=100.)
clf.fit(train, trainlabels)

We have defined our clf variable, which declares the SVM model with the values of the parameters. Then the function fit is called to fit the model with the training data (see Chapter 3, Supervised Machine Learning for further details). The mean accuracy in predicting the 20% test cases is computed as follows, using the score function:

score = clf.score(test, testlabels)
print('score:', score)

Running the preceding code (the full code is available in the chapter_1 folder of the author's GitHub account) gives an accuracy of about 92%, which means that for about 92% of the test cases the predicted label agrees with the true label. This is the power of machine learning: from previous data, we are able to infer if a page will contain an advert or not. To achieve that, we have essentially prepared and manipulated the data using the NumPy and pandas libraries, and then applied the SVM algorithm on the cleaned data using the scikit-learn library. Since this book will largely employ the NumPy and pandas (and some matplotlib) libraries, the following sections will discuss how to install the libraries and how the data can be manipulated (or even created) using these libraries.
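
To make the meaning of the mean accuracy concrete, the following sketch computes it by hand for a handful of made-up predictions. The arrays predicted and true_labels are hypothetical and are not produced by the SVM above.

```python
import numpy as np

# Hypothetical predicted and true labels for five test cases.
predicted = np.array([1., 0., 0., 1., 0.])
true_labels = np.array([1., 0., 1., 1., 0.])

# Mean accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(predicted == true_labels)
print('accuracy:', accuracy)  # accuracy: 0.8
```

Four of the five predictions agree with the true labels, hence the value 0.8.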

Installing and importing a module (library)

Before continuing with the discussion on the libraries, we need to clarify how to install each module we want to use in Python. The usual way to install a module is through the pip command using the terminal:

$ sudo pip install modulename

The module is then usually imported into the code using the statement:

import numpy as np

Here, numpy is the library name and np is the reference name, so any function X in the library can be accessed using np.X instead of numpy.X. We are going to assume that all the libraries (scipy, scikit-learn, pandas, scrapy, nltk, and all others) have been installed and imported in this way.

Preparing, manipulating and visualizing data – NumPy, pandas and matplotlib tutorials

Most data comes in a form that is impractical for applying machine-learning algorithms. As we have seen in the preceding example, the data can have missing values or non-numeric columns, which are not ready to be fed into any machine-learning technique. Therefore, a machine-learning professional usually spends a large amount of time cleaning and preparing the data to transform it into a form suitable for further analysis or visualization. This section will teach you how to use NumPy and pandas to create, prepare, and manipulate data in Python, while the matplotlib section will provide the basics of plotting a graph in Python. The Python shell has been used for the NumPy tutorial, although versions of the code as an IPython notebook and as a plain Python script are available in the chapter_1 folder of the author's GitHub. pandas and matplotlib are discussed using the IPython notebook.

Using NumPy

Numerical Python, or NumPy, is an open source extension library for Python and a fundamental module for data analysis and high-performance scientific computing. The library adds support to Python for large, multi-dimensional arrays and matrices, and it provides precompiled functions for numerical routines. Furthermore, it provides a large library of mathematical functions to manipulate these arrays.

The library provides the following functionalities:

  • Fast multi-dimensional array for vector arithmetic operations
  • Standard mathematical functions for fast operations on entire arrays of data
  • Linear algebra
  • Sorting, unique, and set operations
  • Statistics and aggregating data

The main advantage of NumPy is the speed of the usual array operations compared to standard Python operations. For instance, a traditional summation of 10000000 elements:

>>> import time
>>> def sum_trad():
...     start = time.time()
...     X = range(10000000)
...     Y = range(10000000)
...     Z = []
...     for i in range(len(X)):
...         Z.append(X[i] + Y[i])
...     return time.time() - start
...

Compare this to the NumPy function:

>>> def sum_numpy():
...     start = time.time()
...     X = np.arange(10000000)
...     Y = np.arange(10000000)
...     Z = X + Y
...     return time.time() - start
...
>>> print('time sum:', sum_trad(), '  time sum numpy:', sum_numpy())
time sum: 2.1142539978   time sum numpy: 0.0807049274445

The time used is 2.1142539978 and 0.0807049274445 seconds respectively, so in this run the NumPy version is roughly 26 times faster.

Arrays creation

The array object is the main feature provided by the NumPy library. Arrays are the equivalent of Python lists, but each element of an array has the same numerical type (typically float or int). An array can be created from a list using the array function, which takes two arguments: the list to be converted and the type of the newly generated array:

>>> arr = np.array([2, 6, 5, 9], float)
>>> arr
array([ 2., 6., 5., 9.])
>>> type(arr)
<type 'numpy.ndarray'>

And vice versa, an array can be transformed into a list by the following code:

>>> arr = np.array([1, 2, 3], float)
>>> arr.tolist()
[1.0, 2.0, 3.0]
>>> list(arr)
[1.0, 2.0, 3.0]

Note

Assigning an array to a new variable does not create a copy in memory; it just binds the new name to the same original object.

To create a new object from an existing one, the copy function needs to be used:

>>> arr = np.array([1, 2, 3], float)
>>> arr1 = arr
>>> arr2 = arr.copy()
>>> arr[0] = 0
>>> arr
array([0., 2., 3.])
>>> arr1
array([0., 2., 3.])
>>> arr2
array([1., 2., 3.])

Alternatively an array can be filled with a single value in the following way:

>>> arr = np.array([10, 20, 33], float)
>>> arr
array([ 10., 20., 33.])
>>> arr.fill(1)
>>> arr
array([ 1., 1., 1.])

Arrays can also be created randomly using the random submodule. For example, given the length of the array as input, the permutation function returns the integers from 0 up to (but not including) that length in random order:

>>> np.random.permutation(3)
array([0, 1, 2])

Another method, normal, will draw a sequence of numbers from a normal distribution:

>>> np.random.normal(0,1,5)
array([-0.66494912,  0.7198794 , -0.29025382,  0.24577752,  0.23736908])

Here 0 is the mean of the distribution, 1 is the standard deviation, and 5 is the number of elements to draw. To sample from a uniform distribution, the random function returns numbers between 0 (included) and 1 (not included):

>>> np.random.random(5)
array([ 0.48241564,  0.24382627,  0.25457204,  0.9775729 ,  0.61793725])
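
Because these functions draw pseudo-random numbers, the values shown here will differ from run to run. When repeatable results are needed, the generator can be seeded first; the following sketch uses np.random.seed with an arbitrary seed value (42 is an assumption, any integer works):

```python
import numpy as np

# Fixing the seed makes the pseudo-random sequence repeatable.
np.random.seed(42)
first = np.random.random(5)

np.random.seed(42)
second = np.random.random(5)

print(np.array_equal(first, second))  # True
```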

NumPy also provides a number of functions for creating two-dimensional arrays (matrices). For instance, to create an identity matrix of a given dimension, the following code can be used:

>>> np.identity(5, dtype=float)
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

The eye function returns matrices with ones along the kth diagonal:

>>> np.eye(3, k=1, dtype=float)
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  0.]])

The most commonly used functions to create new arrays (1 or 2 dimensional) are zeros and ones which create new arrays of specified dimensions filled with these values. These are:

>>> np.ones((2,3), dtype=float)
array([[ 1., 1., 1.],
       [ 1., 1., 1.]])
>>> np.zeros(6, dtype=int)
array([0, 0, 0, 0, 0, 0])

The zeros_like and ones_like functions instead create a new array with the same type as an existing one, with the same dimensions:

>>> arr = np.array([[13, 32, 31], [64, 25, 76]], float)
>>> np.zeros_like(arr)
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
>>> np.ones_like(arr)
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

Another way to create two-dimensional arrays is to merge one-dimensional arrays using vstack (vertical merge):

>>> arr1 = np.array([1,3,2])
>>> arr2 = np.array([3,4,6])
>>> np.vstack([arr1,arr2])
array([[1, 3, 2],
       [3, 4, 6]])

Two-dimensional arrays can also be created from distributions using the random submodule. For example, a random 2x3 matrix drawn from a uniform distribution between 0 and 1 is created by the following command:

>>> np.random.rand(2,3)
array([[ 0.36152029,  0.10663414,  0.64622729],
       [ 0.49498724,  0.59443518,  0.31257493]])

Another often used distribution is the multivariate normal distribution:

>>> np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[5,])
array([[ 11.8696466 ,  -0.99505689],
       [ 10.50905208,   1.47187705],
       [  9.55350138,   0.48654548],
       [ 10.35759256,  -3.72591054],
       [ 11.31376171,   2.15576512]])

The list [10,0] is the mean vector, [[3, 1], [1, 4]] is the covariance matrix and 5 is the number of samples to draw.
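
One way to check the role of the two arguments is to draw a large number of samples and compare the empirical mean and covariance with the values passed in. The sketch below is illustrative only; the seed and the sample size of 20000 are arbitrary assumptions.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only to make the run repeatable
mean = [10, 0]
cov = [[3, 1], [1, 4]]

# Draw many samples so that the empirical estimates are close to the inputs.
samples = np.random.multivariate_normal(mean, cov, size=20000)

print(samples.shape)                  # (20000, 2)
print(samples.mean(axis=0))           # approximately [10, 0]
print(np.cov(samples, rowvar=False))  # approximately [[3, 1], [1, 4]]
```

The rowvar=False argument tells np.cov that each row is one sample and each column is one variable.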

Method | Description
tolist | Function to transform a NumPy array into a list
copy | Function to copy the values of a NumPy array
ones, zeros | Functions to create an array of ones or zeros
zeros_like, ones_like | Functions to create an array of zeros or ones with the same shape and type as the input array
fill | Function to replace all the entries of an array with a certain value
identity | Function to create an identity matrix
eye | Function to create a matrix with ones along the kth diagonal
vstack | Function to merge arrays into a two-dimensional array
random submodule: random, permutation, normal, rand, multivariate_normal, and others | Functions of the random submodule that create arrays by drawing samples from distributions

Array manipulations

All the usual operations to access, slice, and manipulate a Python list can be applied in the same way, or in a similar way to an array:

>>> arr = np.array([2., 6., 5., 5.])
>>> arr[:3]
array([ 2., 6., 5.])
>>> arr[3]
5.0
>>> arr[0] = 5.
>>> arr
array([ 5., 6., 5., 5.])

The unique values can also be selected using unique:

>>> np.unique(arr)
array([ 5., 6.])

The values of the array can also be sorted using sort, and the indices that would sort it are returned by argsort:

>>> arr = np.array([2., 6., 5., 5.])
>>> np.sort(arr)
array([ 2.,  5.,  5.,  6.])
>>> np.argsort(arr)
array([0, 2, 3, 1])
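
The relationship between the two functions can be verified directly: indexing an array with the result of argsort reproduces the output of sort. In this sketch, values and order are hypothetical names.

```python
import numpy as np

values = np.array([2., 6., 5., 5.])
order = np.argsort(values)

# Indexing with the argsort result reproduces the sorted array.
print(values[order])                                   # [2. 5. 5. 6.]
print(np.array_equal(values[order], np.sort(values)))  # True
```

This is useful when a second array (for example, a set of labels) has to be reordered in the same way as the sorted values.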

It is also possible to randomly rearrange the order of the array's elements using the shuffle function:

>>> np.random.shuffle(arr)
>>> arr
array([ 2.,  5.,  6.,  5.])

NumPy also has a built-in function, array_equal, to compare arrays:

>>> np.array_equal(arr,np.array([1,3,2]))
False

Multi-dimensional arrays, however, differ from lists in the way they are indexed: the indices of all the dimensions are given inside a single pair of brackets, separated by commas (instead of one pair of brackets per dimension, as for nested lists). For example, the elements of a two-dimensional array (that is, a matrix) are accessed in the following way:

>>> matrix = np.array([[ 4., 5., 6.], [2, 3, 6]], float)
>>> matrix
array([[ 4., 5., 6.],
       [ 2., 3., 6.]])
>>> matrix[0,0]
4.0
>>> matrix[0,2]
6.0

Slicing is applied on each dimension using the colon : symbol between the initial value and the end value of the slice:

>>> arr = np.array([[ 4., 5., 6.], [ 2., 3., 6.]], float)
>>> arr[1:2,2:3]
array([[ 6.]])

While a single : means all the elements along that axis are considered:

>>> arr[1,:]
array([ 2., 3., 6.])
>>> arr[:,2]
array([ 6., 6.])
>>> arr[-1:,-2:]
array([[ 3., 6.]])

One-dimensional arrays can be obtained from multi-dimensional arrays using the flatten function:

>>> arr = np.array([[10, 29, 23], [24, 25, 46]], float)
>>> arr
array([[ 10.,  29.,  23.],
       [ 24.,  25.,  46.]])
>>> arr.flatten()
array([ 10.,  29.,  23.,  24.,  25.,  46.])

It is also possible to inspect an array object to obtain information about its content. The size of an array is found using the attribute shape:

>>> arr.shape
(2, 3)

In this case, arr is a matrix of two rows and three columns. The dtype property returns the type of the values stored within the array:

>>> arr.dtype
dtype('float64')

float64 is a numeric type to store double-precision (8-byte) real numbers (similar to float type in regular Python). There are also other data types such as int64, int32, string, and an array can be converted from one type to another. For example:

>>> int_arr = matrix.astype(np.int32)
>>> int_arr.dtype
dtype('int32')

The len function returns the length of the first dimension when used on an array:

>>> arr = np.array([[ 4., 5., 6.], [ 2., 3., 6.]], float)
>>> len(arr)
2

As with Python lists, the in keyword can be used to check whether a value is contained in an array:

>>> arr = np.array([[ 4., 5., 6.], [ 2., 3., 6.]], float)
>>> 2 in arr
True
>>> 0 in arr
False

An array can be manipulated in such a way that its elements are rearranged in different dimensions using the function reshape. For example, a one-dimensional array of eight elements can be reshaped to a matrix with four rows and two columns:

>>> arr = np.array(range(8), float)
>>> arr
array([ 0., 1., 2., 3., 4., 5., 6., 7.])
>>> arr = arr.reshape((4,2))
>>> arr
array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.],
       [ 6.,  7.]])
>>> arr.shape
(4, 2)
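
When only one of the two dimensions is known, reshape accepts -1 for the other and infers it from the number of elements. A minimal sketch (flat and reshaped are hypothetical names):

```python
import numpy as np

flat = np.arange(8, dtype=float)

# -1 asks NumPy to infer that dimension from the total number of elements.
reshaped = flat.reshape((4, -1))
print(reshaped.shape)  # (4, 2)

# The element count must be compatible: 8 elements cannot fill 3 rows evenly.
try:
    flat.reshape((3, -1))
except ValueError as err:
    print('reshape failed:', err)
```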

In addition, transposed matrices can be created; that is to say, a new array with the final two dimensions switched can be obtained using the transpose function:

>>> arr = np.array(range(6), float).reshape((2, 3))
>>> arr
array([[ 0., 1., 2.],
       [ 3., 4., 5.]])
>>> arr.transpose()
array([[ 0., 3.],
       [ 1., 4.],
       [ 2., 5.]])

Arrays can also be transposed using the T attribute:

>>> matrix = np.arange(15).reshape((3, 5))
>>> matrix
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> matrix.T
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

Another way to rearrange the elements of an array is to use the newaxis constant to increase the dimensionality:

>>> arr = np.array([14, 32, 13], float)
>>> arr
array([ 14.,  32.,  13.])
>>> arr[:,np.newaxis]
array([[ 14.],
       [ 32.],
       [ 13.]])
>>> arr[:,np.newaxis].shape
(3, 1)
>>> arr[np.newaxis,:]
array([[ 14.,  32.,  13.]])
>>> arr[np.newaxis,:].shape
(1, 3)

In this example, in each case the new array has two dimensions, the one generated by newaxis has a length of one.

Joining arrays is an operation performed by the concatenate function in NumPy, and the syntax depends on the dimensionality of the array. Multiple one-dimensional arrays can be chained, specifying the arrays to be joined as a tuple:

>>> arr1 = np.array([10,22], float)
>>> arr2 = np.array([31,43,54,61], float)
>>> arr3 = np.array([71,82,29], float)
>>> np.concatenate((arr1, arr2, arr3))
array([ 10.,  22.,  31.,  43.,  54.,  61.,  71.,  82.,  29.])

Using a multi-dimensional array, the axis along which multiple arrays are concatenated needs to be specified. Otherwise, NumPy concatenates along the first dimension by default:

>>> arr1 = np.array([[11, 12], [32, 42]], float)
>>> arr2 = np.array([[54, 26], [27,28]], float)
>>> np.concatenate((arr1,arr2))
array([[ 11.,  12.],
       [ 32.,  42.],
       [ 54.,  26.],
       [ 27.,  28.]])
>>> np.concatenate((arr1,arr2), axis=0)
array([[ 11.,  12.],
       [ 32.,  42.],
       [ 54.,  26.],
       [ 27.,  28.]])
>>> np.concatenate((arr1,arr2), axis=1)
array([[ 11.,  12.,  54.,  26.],
       [ 32.,  42.,  27.,  28.]])
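
For two-dimensional arrays, the two axis choices correspond to the vstack function seen earlier and to its horizontal counterpart, hstack (a NumPy function not otherwise used in this chapter). The following sketch, with hypothetical names a and b, checks the equivalence:

```python
import numpy as np

a = np.array([[11, 12], [32, 42]], float)
b = np.array([[54, 26], [27, 28]], float)

# For two-dimensional inputs, vstack stacks rows (axis=0) and hstack stacks columns (axis=1).
print(np.array_equal(np.vstack((a, b)), np.concatenate((a, b), axis=0)))  # True
print(np.array_equal(np.hstack((a, b)), np.concatenate((a, b), axis=1)))  # True
```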

It is common to save a large amount of data in binary form instead of as text. NumPy provides a function, tostring, to convert an array to a binary string. The inverse operation, converting a binary string back to an array, is supported by the fromstring routine (newer NumPy releases deprecate both in favor of tobytes and frombuffer). For example:

>>> arr = np.array([10, 20, 30], float)
>>> str = arr.tostring()
>>> str
'\x00\x00\x00\x00\x00\x00$@\x00\x00\x00\x00\x00\x004@\x00\x00\x00\x00\x00\x00>@'
>>> np.fromstring(str)
array([ 10., 20., 30.])
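
In recent NumPy versions, the same round trip is usually written with tobytes and frombuffer. The sketch below assumes a float64 array; original, raw, and restored are hypothetical names.

```python
import numpy as np

original = np.array([10, 20, 30], float)

# tobytes returns the raw bytes of the array (8 bytes per float64 element).
raw = original.tobytes()
print(len(raw))  # 24

# frombuffer needs the dtype, because the bytes carry no type information.
restored = np.frombuffer(raw, dtype=np.float64)
print(restored)  # [10. 20. 30.]
```

Note that the array returned by frombuffer is read-only, because it shares memory with the immutable bytes object; call copy on it if it needs to be modified.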

Method | Description
unique | Function to select only the unique values from an array
random.shuffle | Function to randomly rearrange the elements of an array
sort, argsort | sort returns the array's values in increasing order, while argsort returns the indices that arrange the array in increasing order
array_equal | Compare two arrays and return True if they are equal (False otherwise)
flatten | Transform a two-dimensional array into a one-dimensional array
transpose | Calculate the transpose of a two-dimensional array
reshape | Rearrange the entries of an array into a different shape
concatenate | Concatenate arrays along a given axis (for example, two-dimensional arrays into one matrix)
fromstring, tostring | Convert an array to a binary string, and a binary string back to an array

Array operations

Common mathematical operations are obviously supported in NumPy. For example:

>>> arr1 = np.array([1,2,3], float)
>>> arr2 = np.array([1,2,3], float)
>>> arr1 + arr2
array([2.,4., 6.])
>>> arr1 - arr2
array([0., 0., 0.])
>>> arr1 * arr2
array([1., 4., 9.])
>>> arr2 / arr1
array([1., 1., 1.])
>>> arr1 % arr2
array([0., 0., 0.])
>>> arr2**arr1
array([1., 4., 9.])

Since every operation is applied element-wise, the arrays are required to have the same size. If this condition is not satisfied, an error is raised:

>>> arr1 = np.array([1,2,3], float)
>>> arr2 = np.array([1,2], float)
>>> arr1 + arr2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: shape mismatch: objects cannot be broadcast to a single shape

The error states that the objects cannot be broadcast, because the only way to perform an operation on arrays of different sizes is through a mechanism called broadcasting. In the simplest case, the arrays have a different number of dimensions and matching sizes along the shared dimensions, and the array with fewer dimensions is repeated until it matches the dimensions of the other array. Consider the following:

>>> arr1 = np.array([[1, 2], [3, 4], [5, 6]], float)
>>> arr2 = np.array([1, 2], float)
>>> arr1
array([[ 1., 2.],
       [ 3., 4.],
       [ 5., 6.]])
>>> arr2
array([1., 2.])
>>> arr1 + arr2
array([[ 2., 4.],
       [ 4., 6.],
       [ 6., 8.]])

The array arr2 was broadcast to a two-dimensional array that matched the size of arr1. In other words, arr2 was repeated for each row of arr1, equivalent to the array:

array([[1., 2.],[1., 2.],[1., 2.]])
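
The repeated array can be inspected explicitly with the NumPy function broadcast_to, which is not used elsewhere in this chapter. The following sketch (expanded is a hypothetical name) confirms that adding the expanded array gives the same result as relying on broadcasting:

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4], [5, 6]], float)
arr2 = np.array([1, 2], float)

# broadcast_to shows the array that arr2 is treated as during the addition.
expanded = np.broadcast_to(arr2, arr1.shape)
print(expanded)

# Adding the explicitly expanded array gives the same result as broadcasting.
print(np.array_equal(arr1 + arr2, arr1 + expanded))  # True
```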

If we want to make the way an array is broadcast explicit, the newaxis constant allows us to specify how we want to broadcast:

>>> arr1 = np.zeros((2,2), float)
>>> arr2 = np.array([1., 2.], float)
>>> arr1
array([[ 0., 0.],[ 0., 0.]])
>>> arr2
array([1., 2.])
>>> arr1 + arr2
array([[1., 2.],[1., 2.]])
>>> arr1 + arr2[np.newaxis,:]
array([[1., 2.],[1., 2.]])
>>> arr1 + arr2[:,np.newaxis]
array([[1.,1.],[ 2., 2.]])

Unlike Python lists, arrays can be queried using conditions. A typical example is to use Boolean arrays to filter the elements:

>>> arr = np.array([[1, 2], [5, 9]], float)
>>> arr >= 7
array([[False, False],
       [False,  True]], dtype=bool)
>>> arr[arr >= 7]
array([ 9.])

Multiple Boolean expressions can be used to subset the array:

>>> arr[np.logical_and(arr > 5, arr < 11)]
array([ 9.])
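
As a small complement, the same kind of selection can be written with the & operator instead of np.logical_and. This sketch reuses the array from the example above with different (arbitrarily chosen) thresholds:

```python
import numpy as np

arr = np.array([[1, 2], [5, 9]], float)

# Each comparison yields a Boolean array; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and <.
mask = (arr > 1) & (arr < 9)
selected = arr[mask]

# np.logical_and gives the same selection
same = np.array_equal(selected, arr[np.logical_and(arr > 1, arr < 9)])
print(selected, same)
```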

Arrays of integers can be used to specify the indices to select the elements of another array. For example:

>>> arr1 = np.array([1, 4, 5, 9], float)
>>> arr2 = np.array([0, 1, 1, 3, 1, 1, 1], int)
>>> arr1[arr2]
array([ 1., 4., 4., 9., 4., 4., 4.])

The array arr2 holds the ordered indices used to select elements from arr1: the zeroth, first, first, third, first, first, and first elements of arr1 have been selected, in that order. Lists can be used for the same purpose:

>>> arr = np.array([1, 4, 5, 9], float)
>>> arr[[0, 1, 1, 3, 1]]
array([ 1., 4., 4., 9., 4.])

In order to replicate the same operation with multi-dimensional arrays, multiple one-dimensional integer arrays have to be put into the selection bracket, one for each dimension.

The values of the first selection array are the row indices of the matrix entries, while the values of the second selection array are the column indices. The following example illustrates the idea:

>>> arr1 = np.array([[1, 2], [5, 13]], float)
>>> arr2 = np.array([1, 0, 0, 1], int)
>>> arr3 = np.array([1, 1, 0, 1], int)
>>> arr1[arr2,arr3]
array([ 13.,   2.,   1.,  13.])

The values in arr2 are the first index (row) of the arr1 entries, while the values in arr3 are the second index (column), so the first chosen entry of arr1 corresponds to row 1, column 1, which is 13.

The function take can be used to apply your selection with integer arrays, and it works in the same way as bracket selection:

>>> arr1 = np.array([7, 6, 6, 9], float)
>>> arr2 = np.array([1, 0, 1, 3, 3, 1], int)
>>> arr1.take(arr2)
array([ 6.,  7.,  6.,  9.,  9.,  6.])

Subsets of a multi-dimensional array can be selected along a given dimension specifying the axis argument on the take function:

>>> arr1 = np.array([[10, 21], [62, 33]], float)
>>> arr2 = np.array([0, 0, 1], int)
>>> arr1.take(arr2, axis=0)
array([[ 10.,  21.],
       [ 10.,  21.],
       [ 62.,  33.]])
>>> arr1.take(arr2, axis=1)
array([[ 10.,  10.,  21.],
       [ 62.,  62.,  33.]])

The put function is the opposite of the take function, and it takes values from an array and puts them at specified indices in the array that calls the put method:

>>> arr1 = np.array([2, 1, 6, 2, 1, 9], float)
>>> arr2 = np.array([3, 10, 2], float)
>>> arr1.put([1, 4], arr2)
>>> arr1
array([ 2.,  3.,  6.,  2.,  10.,  9.])

We finish this section with the note that multiplication also remains element-wise for two-dimensional arrays (and does not correspond to matrix multiplication):

>>> arr1 = np.array([[11,22], [23,14]], float)
>>> arr2 = np.array([[25,30], [13,33]], float)
>>> arr1 * arr2
array([[ 275.,  660.],
       [ 299.,  462.]])
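
To make the difference concrete, this short sketch compares the element-wise product with the matrix product computed by np.dot on the same two arrays (the variable names are ours):

```python
import numpy as np

arr1 = np.array([[11, 22], [23, 14]], float)
arr2 = np.array([[25, 30], [13, 33]], float)

elementwise = arr1 * arr2            # multiplies matching entries
matrix_product = np.dot(arr1, arr2)  # rows of arr1 times columns of arr2

print(elementwise)
print(matrix_product)
```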

Method    Description
take      Select values of an array from indices given by a second array
put       Replace the values in an array with values of another array at given positions

Linear algebra operations

One of the most common operations between matrices is the inner product of a matrix with its transpose, X^T X, which can be computed using np.dot:

>>> X = np.arange(15).reshape((3, 5))
>>> X
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> X.T
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])
>>> np.dot(X.T, X)  # X^T X
array([[125, 140, 155, 170, 185],
       [140, 158, 176, 194, 212],
       [155, 176, 197, 218, 239],
       [170, 194, 218, 242, 266],
       [185, 212, 239, 266, 293]])

There are functions to directly calculate the different types of product (inner, outer, and cross) on arrays (that is, matrices or vectors).

For one-dimensional arrays (vectors) the inner product corresponds to the dot product:

>>> arr1 = np.array([12, 43, 10], float)
>>> arr2 = np.array([21, 42, 14], float)
>>> np.outer(arr1, arr2)
array([[  252.,   504.,   168.],
       [  903.,  1806.,   602.],
       [  210.,   420.,   140.]])
>>> np.inner(arr1, arr2)
2198.0
>>> np.cross(arr1, arr2)
array([ 182.,   42., -399.])

NumPy also contains a submodule, linalg, that provides a series of functions to perform linear algebra calculations on matrices. The determinant of a matrix can be computed as:

>>> matrix = np.array([[74, 22, 10], [92, 31, 17], [21, 22, 12]], float)
>>> matrix
array([[ 74.,  22.,  10.],
       [ 92.,  31.,  17.],
       [ 21.,  22.,  12.]])
>>> np.linalg.det(matrix)
-2852.0000000000032

Also the inverse of a matrix can be generated using the function inv:

>>> inv_matrix = np.linalg.inv(matrix)
>>> inv_matrix
array([[ 0.00070126,  0.01542777, -0.02244039],
       [ 0.26192146, -0.23772791,  0.11851332],
       [-0.48141655,  0.4088359 , -0.09467041]])
>>> np.dot(inv_matrix,matrix)
array([[  1.00000000e+00,   2.22044605e-16,   4.77048956e-17],
       [ -2.22044605e-15,   1.00000000e+00,   0.00000000e+00],
       [ -3.33066907e-15,  -4.44089210e-16,   1.00000000e+00]])

It is straightforward to calculate the eigenvalues and eigenvectors of a matrix:

>>> vals, vecs = np.linalg.eig(matrix)
>>> vals
array([ 107.99587441,   11.33411853,   -2.32999294])
>>> vecs
array([[-0.57891525, -0.21517959,  0.06319955],
       [-0.75804695,  0.17632618, -0.58635713],
       [-0.30036971,  0.96052424,  0.80758352]])
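
Both results can be sanity-checked numerically. The sketch below, which reuses the matrix of the previous examples, verifies that the inverse times the matrix is the identity up to rounding error and that every eigenpair satisfies the defining relation; the names inverse_ok and eig_ok are ours:

```python
import numpy as np

matrix = np.array([[74, 22, 10], [92, 31, 17], [21, 22, 12]], float)

# inv(matrix) times matrix should equal the identity up to rounding error
inv_matrix = np.linalg.inv(matrix)
inverse_ok = np.allclose(np.dot(inv_matrix, matrix), np.eye(3))

# Column i of vecs is the eigenvector paired with vals[i]
vals, vecs = np.linalg.eig(matrix)
eig_ok = all(
    np.allclose(np.dot(matrix, vecs[:, i]), vals[i] * vecs[:, i])
    for i in range(len(vals))
)
print(inverse_ok, eig_ok)
```

The tiny off-diagonal values printed in the inverse example above are exactly this rounding error, which is why np.allclose is used instead of an exact comparison.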

Method    Description
dot       Dot product between two arrays
inner     Inner product between multi-dimensional arrays
linalg    Submodule that collects several linear algebra functions, among which are the determinant of a matrix (linalg.det), the inverse of a matrix (linalg.inv), and the eigenvalues and eigenvectors of a matrix (linalg.eig)

Statistics and mathematical functions

NumPy provides a set of functions to compute statistics of the data contained in the arrays. Aggregation operations such as sum, mean, and standard deviation are available both as array methods and as NumPy functions, while the median is available only as the function np.median. For example, after creating a random array (np.random.rand draws from a uniform distribution over [0, 1)), it is possible to calculate the mean in two ways:

>>> arr = np.random.rand(8, 4)
>>> arr.mean()
0.45808075801881332
>>> np.mean(arr)
0.45808075801881332
>>> arr.sum()
14.658584256602026

The full list of functions is shown in the table below:

Method            Description
mean              Mean of the elements. If the array is empty, the mean is set to NaN by default.
std, var          Functions to calculate the standard deviation (std) and variance (var) of the array. An optional delta degrees of freedom parameter (ddof) can be specified; the divisor is N - ddof, and by default ddof is 0, so the divisor is the length of the array.
min, max          Functions to determine the minimum (min) and maximum (max) value contained in the array.
argmin, argmax    These functions return the index of the smallest element (argmin) and largest element (argmax).
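
As an illustration of the functions in the table, this sketch uses a small hand-picked array (not data from the book) to show the effect of the ddof parameter and the index-returning functions:

```python
import numpy as np

arr = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

population_std = arr.std()       # divisor N (ddof=0, the default)
sample_std = arr.std(ddof=1)     # divisor N - 1
smallest_at = arr.argmin()       # index of the smallest element
largest_at = arr.argmax()        # index of the largest element

print(population_std, sample_std, smallest_at, largest_at)
```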

Understanding the pandas module

pandas is a powerful Python module that provides a wide range of functions for analyzing structured data. It relies on the NumPy library and is designed to make data analysis operations easy and fast. It performs considerably better than plain Python code, especially when reading or writing files or building database-like tables, which makes pandas a strong choice for data manipulation. The following paragraphs discuss the main methods to explore the information contained in the data, and how to perform manipulations on it. We start by describing how data is stored in pandas and how to load data into it.

Note

Throughout the rest of the book, we use the following import conventions for pandas:

       import pandas as pd

Therefore, whenever code contains the letters pd, it is referring to pandas.

Exploring data

Before introducing the DataFrame, the database-like structure in pandas, we need to describe the Series: a one-dimensional array-like object containing data of any NumPy data type together with an associated array of data labels called its index. A simple example is:

The obj object is composed of two parts: the index on the left and the associated values on the right. Given that the length of the data is equal to N, the default indexing goes from 0 to N-1. The array and index objects of the Series can be obtained using its values and index attributes, respectively:
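
A minimal sketch of such a Series and of its two attributes follows. The values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical values; any list of numbers would do
obj = pd.Series([3, 5, -2, 1])

print(obj)          # index on the left, values on the right
print(obj.values)   # the underlying NumPy array
print(obj.index)    # default index running from 0 to N-1
```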

The indexing is preserved by applying NumPy array operations (such as scalar multiplication, filtering with a Boolean array, or applying math functions):
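
The three operations mentioned above can be sketched as follows, again on a hypothetical Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series([3, 5, -2, 1])

doubled = obj * 2          # scalar multiplication keeps labels 0..3
filtered = obj[obj > 2]    # Boolean filtering keeps the original labels 0 and 1
exponential = np.exp(obj)  # a NumPy math function returns a Series with the same index

print(filtered)
```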

A Python dictionary can be transformed into a Series, in which case the index corresponds to the dictionary keys:

It is possible to specify a separate list as an index:

In this case, the last index value, g, has no associated value, so by default a Not a Number (NaN) is inserted.
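
Both cases can be sketched with a hypothetical dictionary whose keys do not include g:

```python
import pandas as pd

# Hypothetical dictionary used only for illustration
data_dict = {'a': 3, 'b': 5, 'c': -2}

from_dict = pd.Series(data_dict)                              # index: a, b, c
with_index = pd.Series(data_dict, index=['a', 'b', 'c', 'g'])  # g has no value, so NaN

print(with_index)
```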

The terms missing or NA will be used to refer to missing data. To find the missing data, the isnull and notnull functions can be used in pandas:
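
A short sketch of the two functions on a hypothetical Series with one missing entry:

```python
import pandas as pd

# The label g has no value in the dictionary, so it becomes NaN
obj = pd.Series({'a': 3, 'b': 5}, index=['a', 'b', 'g'])

missing = pd.isnull(obj)    # True where the value is missing
present = pd.notnull(obj)   # True where the value is present

print(missing)
```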

We can now start loading a CSV file into a DataFrame structure. A DataFrame represents a data structure containing an ordered set of columns, each of which can be a different value type (numeric, string, Boolean, and others). The DataFrame has two indices (a row and column index) and it can be thought of as a dictionary of Series that share the same index (column). For the purpose of this tutorial, we are using the data contained in the ad.data file stored in the http://archive.ics.uci.edu website (at http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements) as already explained in the preceding machine-learning example.

The data is loaded in the following way using the terminal (in this case the path is data_example/ad-dataset/ad-data):
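
The loading step might look like the sketch below. So that it runs without the file, an in-memory three-row stand-in replaces the real dataset; its values are hypothetical and the real file is much wider. With the real data, the file path would be passed to read_csv instead of the StringIO object:

```python
import io
import pandas as pd

# Hypothetical stand-in for the dataset file: no header row, label in the last column
raw = io.StringIO(
    "125,125,1.0,1,0,0,ad.\n"
    "57,468,8.2105,1,0,0,ad.\n"
    "?,?,?,1,0,1,nonad.\n"
)

# header=None tells pandas that the first line is data, not column names
data = pd.read_csv(raw, header=None)
print(data)
```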

This file does not have a header (so header is set to None), which means the column names are numbers. We can get a summary of the DataFrame by using the describe function on the object data:

This summarizes quantitative information. We can see that there are 1554 numeric columns (indicated by numbers since there is no header) and 3279 rows (called count for each column). Each of the columns has a list of statistical parameters (mean, standard deviation, min, max, and percentiles) that helps to obtain an initial estimate of the quantitative information contained in the data.
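
The shape of that summary can be seen on a small hypothetical DataFrame; note that, by default, describe reports only the numeric columns:

```python
import pandas as pd

# Hypothetical data: three numeric columns and a text label
data = pd.DataFrame([[125, 1, 0, 'ad.'], [57, 0, 0, 'ad.'], [33, 1, 1, 'nonad.']])

summary = data.describe()
print(summary)  # rows: count, mean, std, min, 25%, 50%, 75%, max
```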

It is possible to obtain the column names using the columns property:

So all the column names are of type int64, and the following command returns the actual types of all the columns:

The first four columns and the label (last column) are of the type object, while the others are of the type int64. Columns can be accessed in two ways. The first method is by specifying the column name like the key in a dictionary:

Multiple columns can be obtained by specifying a list of them with the column names:
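
The column names, the column types, and both forms of dictionary-style access can be sketched on a hypothetical headerless DataFrame:

```python
import pandas as pd

# Hypothetical data with numeric column names, as when a file has no header
data = pd.DataFrame([['125', 1, 0, 'ad.'], ['?', 0, 1, 'nonad.']])

names = data.columns     # the column names (0, 1, 2, 3)
types = data.dtypes      # one dtype per column
single = data[3]         # one column name returns a Series
several = data[[1, 2]]   # a list of names returns a DataFrame

print(types)
```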

The other way to access columns is the dot syntax, but it only works if the column name is a valid Python variable name (that is, it contains no spaces), does not clash with a DataFrame property or function name (such as count or sum), and is of the string type (not int64 as in this example).

To briefly gain an insight into the content of a DataFrame, the function head() can be used. The first five items in a column (or the first five rows in the DataFrame) are returned by default:

The opposite method is tail(), which returns the last five items or rows by default. Passing a number n to the head() or tail() function returns the first or last n items of the chosen column, respectively:
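
A minimal sketch of head() and tail(), using a hypothetical ten-row DataFrame with a single column named value:

```python
import pandas as pd

data = pd.DataFrame({'value': range(10)})

first_five = data.head()              # first five rows by default
last_five = data.tail()               # last five rows by default
first_two = data.head(2)              # first two rows
last_three = data['value'].tail(3)    # last three items of one column

print(last_three)
```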

It is also possible to use Python's regular slicing syntax to obtain a certain number of rows of the DataFrame:

This example shows only rows from 1 to 3.

Manipulate data

It is possible to select row(s) in different ways, such as specifying the index or the condition as follows:

Or specifying multiple conditions:

The data returned are web pages with feature 1 greater than 0 and containing an advert.
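
Selection by one or several conditions can be sketched as follows. The DataFrame is a hypothetical stand-in in which column 1 plays the role of feature 1 and column 2 holds the label:

```python
import pandas as pd

data = pd.DataFrame([[125, 0, 'ad.'], [57, 468, 'ad.'],
                     [33, 230, 'nonad.'], [60, 0, 'nonad.']])

# One condition: rows where feature 1 is greater than 0
one_condition = data[data[1] > 0]

# Two conditions combined with &; each needs its own parentheses
two_conditions = data[(data[1] > 0) & (data[2] == 'ad.')]

print(two_conditions)
```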

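The ix method allows us to select rows by specifying the desired index (note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so recent versions require loc or iloc instead):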

Alternatively the function iloc can be used:

The difference is that ix works on labels in the index column and iloc works on the positions in the index (so it only takes integers). Therefore, in this example, ix finds all the rows from 0 until the label 3 appears, while the iloc function returns the rows in the first 3 positions in the data frame. There is a third function to access data in a DataFrame, loc. This function looks at the index names associated with the rows and it returns their values. For example:

Note that this function behaves differently with respect to the normal slicing in Python because both starting and ending rows are included in the result (the row with index 3 is included in the output).
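
The difference between position-based and label-based selection can be sketched with iloc and loc on a hypothetical DataFrame (ix is left out because it no longer exists in recent pandas versions):

```python
import pandas as pd

data = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

by_position = data.iloc[:3]  # positions 0, 1, 2; the end position is excluded
by_label = data.loc[:3]      # labels 0 through 3; the end label is included

print(len(by_position), len(by_label))
```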

It is possible to set an entire column to a value:

A specific cell can also be set to a desired value:

Or the entire row to a set of values (random values between 0 and 1 and ad. label in this example):
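
The three assignments (whole column, single cell, whole row) can be sketched on a small hypothetical DataFrame; the column and row labels used here are assumptions made for the illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[125.0, 0.5, 'ad.'],
                     [57.0, 468.0, 'nonad.'],
                     [33.0, 230.0, 'nonad.']])

data[1] = 0.0                # set the entire column 1 to one value
data.loc[0, 0] = 999.0       # set the cell at row label 0, column label 0
# set the entire row 2: random values between 0 and 1, then the ad. label
data.loc[2] = list(np.random.rand(2)) + ['ad.']

print(data)
```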

After transforming an array of values into a Series object, it is possible to append a row at the end of the DataFrame:

Alternatively, the loc function can be used to add a row after the last line:
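
Both ways of adding a row can be sketched as follows on hypothetical data. Note that DataFrame.append, which older pandas releases offered, was removed in pandas 2.0, so this sketch swaps in pd.concat for the first approach:

```python
import pandas as pd

data = pd.DataFrame([[125.0, 'ad.'], [57.0, 'nonad.']])

# Approach 1: turn the values into a Series, then into a one-row DataFrame
new_row = pd.Series([0.5, 'ad.'], index=data.columns)
data_concat = pd.concat([data, new_row.to_frame().T], ignore_index=True)

# Approach 2: assign to the label one past the last row, enlarging data in place
data.loc[len(data)] = [0.25, 'nonad.']

print(len(data_concat), len(data))
```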

It is easy to add a column in the DataFrame by simply assigning the new column name to a value:

In this case, the new column has all the entries assigned to test value. Similarly, the column can be deleted using the drop function:
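
Adding and deleting a column can be sketched like this; the column name newcolumn is a hypothetical choice:

```python
import pandas as pd

data = pd.DataFrame([[125, 'ad.'], [57, 'nonad.']])

# Assigning a scalar to a new column name fills every row with that value
data['newcolumn'] = 'test value'

# drop returns a new DataFrame; axis=1 means a column is dropped, not a row
without_column = data.drop('newcolumn', axis=1)

print(without_column.columns)
```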

A dataset may contain duplicates for various reasons, so pandas provides the method duplicated to indicate whether each row is a repetition or not:

More usefully, though, the drop_duplicates function returns a DataFrame with only the unique values. For example, for the label the unique values are:

It is possible to transform the result into a list:
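
The duplicate-handling steps can be sketched on a hypothetical DataFrame whose third row repeats the first; the column names size and label are ours:

```python
import pandas as pd

data = pd.DataFrame({'size': [125, 57, 125],
                     'label': ['ad.', 'nonad.', 'ad.']})

is_repeat = data.duplicated()                    # True for rows repeating an earlier row
unique_labels = data['label'].drop_duplicates()  # keeps the first occurrence of each label
label_list = unique_labels.values.tolist()       # plain Python list

print(label_list)
```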

As we did in the machine-learning example, these labels can be transformed into numeric values using the methods explained in the preceding example:

The label column is still the object type:

So the column now can be converted into the float type:
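
One possible way to carry out this conversion is sketched below. It is not necessarily the method used in the earlier example: here we assume an encoding of 1 for ad. and 0 for nonad. and use map, which already returns an integer column (unlike the object column described above), before the cast to float:

```python
import pandas as pd

data = pd.DataFrame({'label': ['ad.', 'nonad.', 'ad.']})

# Assumed encoding: 1 for 'ad.' and 0 for 'nonad.'
data['label'] = data['label'].map({'ad.': 1, 'nonad.': 0})

# astype converts the column to the float type
data['label'] = data['label'].astype(float)

print(data['label'].dtype)
```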

The first four columns contain mixed values (strings, ?, and float numbers), so we need to remove the string values to convert the columns into a numeric type. We can use the function replace to substitute all the instances of ? (missing values) with NaN:

Now we can handle these rows with missing data in two ways. The first method is just to remove the lines with missing values using dropna:

Instead of removing the rows with missing data (which may lead to deleting important information), the empty entries can be filled. For most purposes, a constant value can be inserted in the empty cells with the fillna method:
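
The replace, dropna, and fillna steps can be sketched on a hypothetical DataFrame with mixed columns; the fill value -1 is an arbitrary choice for the illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({0: [125.0, '?', 57.0],
                     1: [125.0, 468.0, '?'],
                     2: [1, 0, 1]})

cleaned = data.replace('?', np.nan)  # mark the ? entries as missing
dropped = cleaned.dropna()           # option 1: remove rows with missing values
filled = cleaned.fillna(-1)          # option 2: insert a constant in the empty cells

print(len(dropped), filled.loc[1, 0])
```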

Now that all the values are numeric the columns can be set to type float, applying the astype function. Alternatively, we can apply a lambda function to convert each column in the DataFrame to a numeric type:

Each x instance is a column and the to_numeric function converts it to the closest numeric type (float in this case).
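
The lambda-based conversion can be sketched as follows on hypothetical string columns. In this small example the second column becomes an integer type, since to_numeric picks the closest numeric type for each column:

```python
import pandas as pd

data = pd.DataFrame({0: ['125', '57.5'], 1: ['1', '0']})

# apply passes each column (x) to the lambda in turn
data = data.apply(lambda x: pd.to_numeric(x))

print(data.dtypes)
```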

For the sake of completeness of this tutorial, we want to show how two DataFrames can be concatenated since this operation can be useful in real applications. Let's create another small DataFrame with random values:

This new table with two rows can be merged with the original DataFrame using the concat function placing the rows of data1 at the bottom of the data:

The number of rows of datatot is now increased by two rows with respect to data (note that the number of rows is different from the beginning because we dropped the rows with NaN).
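
The concatenation can be sketched as follows. The names data1 and datatot follow the text, while the small random data frame standing in for data is hypothetical:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(3, 4))                          # stand-in for data
data1 = pd.DataFrame(np.random.rand(2, 4), columns=data.columns)   # two new rows

# concat stacks the rows of data1 below those of data;
# ignore_index=True renumbers the rows from 0
datatot = pd.concat([data, data1], ignore_index=True)

print(len(data), len(datatot))
```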

Matplotlib tutorial

matplotlib.pyplot is a library that collects a series of methods to plot data similar to MATLAB. Since the following chapters will employ this library to visualize some results, a simple example here will explain all the matplotlib code you will see as you continue in this book:

After importing the library (as plt), the figure object is initialized (fig) and an axis object is added (ax). Each line plotted into the ax object through the command ax.plot() is called a handle. All the following instructions are then recorded by matplotlib.pyplot and plotted in the figure object. In this case, the line in green has been shown from the terminal and saved as a figure.png file, using the commands plt.show() and fig.savefig() respectively. The result is equal to:

Example of simple plot
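
A plot of this kind could be produced by a sketch like the following. The data points and axis labels are hypothetical, the Agg backend is selected so that the code runs without a display, and the figure is written to an in-memory buffer (calling fig.savefig with a file name such as figure.png would write a file instead):

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt

fig = plt.figure()            # initialize the figure object
ax = fig.add_subplot(111)     # add an axis object to it

# ax.plot returns the list of handles (one per plotted line)
handles = ax.plot([1, 2, 3, 4], [1, 4, 9, 16], color="green")
ax.set_xlabel("x")
ax.set_ylabel("y")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # plt.show() would display it interactively
```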

The next example illustrates a plot of several lines with different format styles in one command using NumPy arrays:

Example of plot with multiple lines

Note that the function get_legend_handles_labels() returns the list of handles and labels stored in the object ax, and they are passed to the function legend to be plotted. The format strings 'r--', 'bs', and 'g^' set the color and style of each line (a red dashed line, blue squares, and green triangles respectively). The linewidth parameter sets the thickness of the line while markersize sets the size of the markers.
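
The elements described above can be sketched as follows. For clarity this sketch uses one ax.plot call per line rather than a single command, and the plotted functions and labels are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0., 5., 0.2)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(t, t, 'r--', label='linear', linewidth=2)      # red dashed line
ax.plot(t, t**2, 'bs', label='square', markersize=6)   # blue squares
ax.plot(t, t**3, 'g^', label='cubic', markersize=6)    # green triangles

# Collect the handles and labels stored in ax and pass them to legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
```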

Another useful plot to visualize the results is the scatter plot in which values for typically two variables of a set of data (data generated using NumPy random submodule) are displayed:

The s option represents the size of the points and colors are the colors that correspond to each set of points and the handles are passed directly into the legend function (p1, p2, p3):

Scatter plot of randomly distributed points
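
A scatter plot of this kind could be produced by a sketch like the following. The number of points, the colors, the point size, and the legend labels are all hypothetical choices:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt
import numpy as np

colors = ['b', 'c', 'y']

fig = plt.figure()
ax = fig.add_subplot(111)

# s sets the size of the points; each call returns the handle of one set
p1 = ax.scatter(np.random.rand(20), np.random.rand(20), s=40, color=colors[0])
p2 = ax.scatter(np.random.rand(20), np.random.rand(20), s=40, color=colors[1])
p3 = ax.scatter(np.random.rand(20), np.random.rand(20), s=40, color=colors[2])

# The handles are passed directly to legend together with their labels
legend = ax.legend((p1, p2, p3), ('set 1', 'set 2', 'set 3'))
```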

For further details on how to use matplotlib we advise the reader to read online material and tutorials such as http://matplotlib.org/users/pyplot_tutorial.html.

Scientific libraries used in the book

Throughout this book, certain libraries are necessary to implement the machine-learning techniques discussed in each chapter. We are going to briefly describe the most relevant libraries employed hereafter:

  • SciPy is a collection of mathematical methods built on NumPy array objects. It is an open source project, so it benefits from methods continuously contributed by developers around the world. Python software that uses SciPy routines can serve advanced projects or applications comparable to those built with frameworks such as MATLAB, Octave, or RLab. A wide range of methods is available, from data manipulation and visualization functions to parallel computing routines, which enhance the versatility and potential of the Python language.
  • scikit-learn (sklearn) is an open source machine learning module for the Python programming language. It implements various algorithms such as clustering, classification, and regression including support vector machines, Naive Bayes, Decision Trees, Random Forests, k-means, and Density Based Spatial Clustering of Applications with Noise (DBSCAN), and it interacts natively with numerical Python libraries such as NumPy and SciPy. Although most of the routines are written in Python, some functions are implemented in Cython to achieve better performance. For instance, support vector machines and logistic regression are written in Cython wrapping other external libraries (LIBSVM, LIBLINEAR).
  • The Natural Language Toolkit (NLTK) is a collection of libraries and functions for Natural Language Processing (NLP) in Python. NLTK is designed to support research and teaching on NLP and related topics including artificial intelligence, cognitive science, information retrieval, linguistics, and machine learning. It also features a series of text processing routines for tokenization, stemming, tagging, parsing, semantic reasoning, and classification. NLTK includes sample code, sample data, and interfaces to more than 50 corpora and lexical databases.
  • Scrapy is an open source web crawling framework for the Python programming language. Originally designed for scraping websites, and as a general purpose crawler, it is also suitable for extracting data through APIs. The Scrapy project is written around spiders that act by providing a set of instructions. It also features a web crawling shell that allows developers to test their concepts before actually implementing them. Scrapy is currently maintained by Scrapinghub Ltd., a web scraping development and services company.
  • Django is a free and open source web application framework implemented in Python following the model-view-controller architectural pattern. Django is designed for the creation of complex, database-oriented websites. It also allows us to manage the application through an administrative interface, which can create, read, delete, or update data used in the application. There are a series of established websites that currently use Django, such as Pinterest, Instagram, Mozilla, The Washington Times, and Bitbucket.

When to use machine learning

Machine learning is not magic and it may not be beneficial to all data-related problems. It is important at the end of this introduction to clarify when machine-learning techniques are extremely useful:

  • It is not possible to code the rules: a series of human tasks (to determine if an e-mail is spam or not, for example) cannot be solved effectively using simple rules methods. In fact, multiple factors can affect the solution and if rules depend on a large number of factors it becomes hard for humans to manually implement these rules.
  • A solution is not scalable: whenever it is time consuming to manually take decisions on certain data, the machine-learning techniques can scale adequately. For example, a machine-learning algorithm can efficiently go through millions of e-mails and determine if they are spam or not.

However, if it is possible to find a good target prediction, by simply using mathematical rules, computations, or predetermined schemas that can be implemented without needing any data-driven learning, these advanced machine-learning techniques are not necessary (and you should not use them).

Summary

In this chapter we introduced the basic machine-learning concepts and terminology that will be used in the rest of the book. Tutorials of the most relevant libraries (NumPy, pandas, and matplotlib) used by machine-learning professionals to prepare, manipulate, and visualize data have also been presented. A general introduction to all the other Python libraries that will be used in the following chapters has also been provided.

You should have a general knowledge of what the machine-learning field can practically do, and you should now be familiar with the methods employed to transform the data into a usable format, so that a machine-learning algorithm can be applied. In the next chapter we will explain the main unsupervised learning algorithms and how to implement them using the sklearn library.

Key benefits

  • Targets two big and prominent markets where sophisticated web apps are of need and importance.
  • Practical examples of building machine learning web application, which are easy to follow and replicate.
  • A comprehensive tutorial on Python libraries and frameworks to get you up and started.

Description

Python is a general purpose and also a comparatively easy to learn programming language. Hence it is the language of choice for data scientists to prototype, visualize, and run data analyses on small and medium-sized data sets. This is a unique book that helps bridge the gap between machine learning and web development. It focuses on the difficulties of implementing predictive analytics in web applications. We focus on the Python language, frameworks, tools, and libraries, showing you how to build a machine learning system. You will explore the core machine learning concepts and then develop and deploy the data into a web application using the Django framework. You will also learn to carry out web, document, and server mining tasks, and build recommendation engines. Later, you will explore Python’s impressive Django framework and will find out how to build a modern simple web app with machine learning features.

Who is this book for?

The book is aimed at upcoming and new data scientists who have little experience with machine learning or users who are interested in and are working on developing smart (predictive) web applications. Knowledge of Django would be beneficial. The reader is expected to have a background in Python programming and good knowledge of statistics.

What you will learn

  • Get familiar with the fundamental concepts and some of the jargon used in the machine learning community
  • Use tools and techniques to mine data from websites
  • Grasp the core concepts of Django framework
  • Get to know the most useful clustering and classification techniques and implement them in Python
  • Acquire all the necessary knowledge to build a web application with Django
  • Successfully build and deploy a movie recommendation system application using the Django framework in Python

Product Details

Publication date : Jul 29, 2016
Length: 298 pages
Edition : 1st
Language : English
ISBN-13 : 9781785886607

Table of Contents

9 Chapters
1. Introduction to Practical Machine Learning Using Python
2. Unsupervised Machine Learning
3. Supervised Machine Learning
4. Web Mining Techniques
5. Recommendation Systems
6. Getting Started with Django
7. Movie Recommendation System Web Application
8. Sentiment Analyser Application for Movie Reviews
Index

Customer reviews

Rating distribution
4.5 out of 5 (27 Ratings)
5 star 74.1%
4 star 14.8%
3 star 0%
2 star 7.4%
1 star 3.7%

Top Reviews

Dario Fadda Aug 20, 2016
Rating: 5 out of 5
This book explains the most important machine learning algorithms most commonly used in the commercial real world of web development. Significantly increasing their knowledge in this field, the reader will be perfectly able to understand, in a first time, the mathematical functions behind this technology, and then with practical methods in a real web application based on these algorithms. Its a food for thought useful to learn to use machine learning "secrets" in the most common everyday applications. A very recommended reading!
Amazon Verified review
Kunhee Lee Oct 29, 2018
Rating: 5 out of 5
Doesn't just describe the theories but shares practical advices through the author's experience.
Amazon Verified review
Paolo Scattone Sep 01, 2016
Rating: 5 out of 5
Machine Learning for the Web, first publication by Andrea Isoni, PhD and Data Scientist, explores the main applications of machine learning, a programming area based on Python language. The work can be considered divided into two connected section and it is analyzed in eight chapters. The first part illustrates the key concepts of machine learning, the use of libraries for the management and analysis of data extracted from the web and provides a broad overview about the most common systems used in commercial and financial area. The second one introduces the reader to the main features of, Django, web framework for developing applications, and concludes with practical examples of the knowledges acquired. The whole discussion is made in a particularly effective educational approach: each chapter includes theoretical information and is followed by numerous application formulas explained in a comprehensive and detailed way. Machine Learning for the web is an interesting work, very well structured and recommended to all those who are interested in developing skills in a professional environment that in the near future will have a sure positive impact in companies activities in the commercial and financial sectors.
Amazon Verified review
yogafreak Aug 22, 2016
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I'm a newcomer to the ML world. This book was a fun read, with thorough explanations of the most common techniques. Jupyter notebooks with exercises are provided too through the author's Github page, and they help a lot getting a feeling of the performance of each method.With a standard Physics or Engineering-level mathematical background, some additional vocabulary needs to be learned to follow some of the most technical sections. In any case, the author often refers to additional manuals and resources when needed, without interrupting the train of thought.Python 2.7 is the language of choice, but all exercises I tried to reproduce worked almost flawlessly on Python 3.4 (besides some easy changes like print-> print()).Overall, the structure of the book makes perfect sense, guiding the reader through a lot of examples and real-life situations. I could take a few examples and apply them to my own (Astrophysical) research almost straight away.Bottom line, nice book that I recommend for people entering the ML world, provided that they have some mathematical background. But it shouldn't be a problem, you don't want to learn ML without knowing some math, don't you ;)?
Amazon Verified review
VP Aug 18, 2016
5 out of 5 stars
Wow... this book is simply fantastic! It has the math theory and a lot of examples with source code. If someone needs to code a predictive Python web application, this is the right book to start with and to understand how to build it. There are a lot of examples for Python libraries and, obviously, for Django. I think these examples are, in some cases, general purpose and not only for the purposes of the book. Everything is documented and explained, also for beginners. Highly recommended!
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription, go to the account page, found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription. From there you will see the ‘cancel subscription’ button in the grey box that contains your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle, which is a month starting from the day of the subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage, subscription.packtpub.com, by clicking on the ‘my library’ dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can make it at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and the more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult: new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready. We created Early Access as a means of giving you the information you need as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.