Linear models

We’ve already introduced, at a high level, the idea of OLS regression for a linear model. This particular combination of squared loss for measuring the risk and a linear model for $\hat{y}$ has some very convenient, simple-to-use properties. That simplicity makes OLS regression one of the most widely used and studied data science modeling techniques, which is why we are going to look in detail at fitting linear models to data using OLS regression.

To start with, we’ll revisit the squared-loss empirical risk function in Eq. 10 and look at what happens to it when we have a linear model $\hat{y}$. To recap, the squared-loss empirical risk is given by the following:

$$\mathrm{Risk} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Eq. 13

Now, for a linear model with $d$ features, $x_1, x_2, \ldots, x_d$, we can write the model as follows:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d$$

Eq. 14

The vector of model parameters is $\underline{\beta}^{\top} = \left(\beta_0, \beta_1, \ldots, \beta_d\right)$. We can write the features in vector form as well; we’ll write them as a row vector, $\underline{x} = \left(1, x_1, x_2, \ldots, x_d\right)$. Doing so allows us to write Eq. 14 in the following form:

$$\hat{y} = \underline{x}\,\underline{\beta}$$

Eq. 15

We can think of the extra 1 in the feature vector $\underline{x} = \left(1, x_1, x_2, \ldots, x_d\right)$ as a feature value $x_0$ that multiplies the intercept $\beta_0$ in the linear model in Eq. 14. For the $i^{\text{th}}$ datapoint, the feature values can be written in vector form as $\underline{x}_i = \left(x_{i0}, x_{i1}, x_{i2}, \ldots, x_{id}\right)$, where, by construction, $x_{i0} = 1$ for all $i$. We can combine the feature vectors $\underline{x}_i$ from all the datapoints into a data matrix:

$$\underline{\underline{X}} = \begin{pmatrix} x_{10} & x_{11} & x_{12} & \cdots & x_{1d} \\ x_{20} & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{N0} & x_{N1} & x_{N2} & \cdots & x_{Nd} \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \end{pmatrix}$$

Eq. 16

If we also put all the observed values, $y_i$, into a vector $\underline{y}^{\top} = \left(y_1, y_2, \ldots, y_N\right)$, then we can write the risk function in Eq. 13 in a very succinct form as follows:

$$\mathrm{Risk} = \frac{1}{N}\left(\underline{y} - \underline{\underline{X}}\,\underline{\beta}\right)^{\top}\left(\underline{y} - \underline{\underline{X}}\,\underline{\beta}\right)$$

Eq. 17
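In code, Eq. 17 is a one-line computation. Here is a minimal numpy sketch, using made-up numbers purely for illustration (we’ll work with a real dataset later in this section):

import numpy as np
# Made-up example: N=4 datapoints, d=2 features, plus the column of
# ones that multiplies the intercept
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.0],
              [1.0, 2.5, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
beta = np.array([0.5, 1.0, -0.2])  # hypothetical parameter values
# Vector of errors y_i - x_i . beta, then Eq. 17: (1/N) * r^T r
r = y - X @ beta
risk = (r @ r) / len(y)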

The data matrix $\underline{\underline{X}}$ is also called the design matrix. This terminology originates in statistics, where the datapoints, and hence the feature values, were often part of a scientific experiment to quantify the various influences on the response variable $y$. Being part of a scientific experiment, the feature values $x_{ij}$ were planned in advance; that is, designed.

The optimal values of the model parameters $\underline{\beta}$ are obtained by minimizing the right-hand side of Eq. 17 with respect to $\underline{\beta}$. We’ll denote the optimal values of $\underline{\beta}$ by the symbol $\underline{\hat{\beta}}$. We can use the differential calculus we recapped in Chapter 1 to do the minimization. Differentiating the right-hand side of Eq. 17 with respect to $\underline{\beta}$ and setting the derivatives to zero gives us the following:

$$\left.\frac{\partial\,\mathrm{Risk}}{\partial\underline{\beta}}\right|_{\underline{\beta}=\underline{\hat{\beta}}} = -\frac{2}{N}\,\underline{\underline{X}}^{\top}\left(\underline{y} - \underline{\underline{X}}\,\underline{\hat{\beta}}\right) = \underline{0}$$

Eq. 18
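To see where Eq. 18 comes from, we can expand the quadratic form in Eq. 17 and differentiate term by term, using the matrix-calculus identities from Chapter 1. A sketch of the intermediate algebra:

$$\mathrm{Risk} = \frac{1}{N}\left(\underline{y}^{\top}\underline{y} - 2\,\underline{\beta}^{\top}\underline{\underline{X}}^{\top}\underline{y} + \underline{\beta}^{\top}\underline{\underline{X}}^{\top}\underline{\underline{X}}\,\underline{\beta}\right) \quad\Rightarrow\quad \frac{\partial\,\mathrm{Risk}}{\partial\underline{\beta}} = \frac{2}{N}\left(\underline{\underline{X}}^{\top}\underline{\underline{X}}\,\underline{\beta} - \underline{\underline{X}}^{\top}\underline{y}\right) = -\frac{2}{N}\,\underline{\underline{X}}^{\top}\left(\underline{y} - \underline{\underline{X}}\,\underline{\beta}\right)$$

Setting this derivative to zero at $\underline{\beta} = \underline{\hat{\beta}}$ gives Eq. 18.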

Re-arranging Eq. 18, we get the following:

$$\underline{\underline{X}}^{\top}\underline{\underline{X}}\,\underline{\hat{\beta}} = \underline{\underline{X}}^{\top}\underline{y}$$

Eq. 19

We can solve Eq. 19 by multiplying both the left- and right-hand sides by $\left(\underline{\underline{X}}^{\top}\underline{\underline{X}}\right)^{-1}$ (assuming it exists) to get the following:

$$\underline{\hat{\beta}} = \left(\underline{\underline{X}}^{\top}\underline{\underline{X}}\right)^{-1}\underline{\underline{X}}^{\top}\underline{y}$$

Eq. 20

This solution is very efficient. It is in closed form, meaning we have an equation with the quantity we want, $\underline{\hat{\beta}}$, on its own on the left-hand side, and an expression that doesn’t involve $\underline{\hat{\beta}}$ on the right-hand side. No iterative algorithm is required: we just perform a couple of matrix calculations, and we have our optimal parameter estimates $\underline{\hat{\beta}}$. That we can obtain a closed-form expression for the parameter estimates is one of the most attractive aspects of OLS regression and part of the reason it is so widely used. We’ll walk through some code examples in a moment to illustrate how easy it is to perform OLS regression.

Practical issues

That doesn’t mean the closed-form expression in Eq. 20 is problem-free. Firstly, you’ll recall from Chapter 3 on linear algebra that some square matrices do not have an inverse. It is quite possible that $\left(\underline{\underline{X}}^{\top}\underline{\underline{X}}\right)^{-1}$ does not exist. This happens when there are linear dependencies between the columns of the design matrix $\underline{\underline{X}}$; for example, when one feature is simply a scaled version of another, or when a combination of several features gives the same numerical values as another feature. In these circumstances, one or more of the features is redundant, since it adds no new information.

Secondly, in a modern-day data science setting, where we might have many thousands of features in a model, the $d \times d$ matrix $\underline{\underline{X}}^{\top}\underline{\underline{X}}$ can be unwieldy to work with if $d$ is of the order of several thousand.

How to deal with these computational issues is beyond the scope of the book, but they are something you should be aware of in case they crop up in a problem you are trying to solve.
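To make the first issue concrete, here is a small sketch (with a deliberately redundant, made-up feature) showing how an exact linear dependency breaks the explicit inverse, while numpy’s SVD-based least-squares solver still returns a usable answer:

import numpy as np

rng = np.random.default_rng(0)
# Design matrix with an intercept column and one genuine feature
X = np.c_[np.ones(100), rng.normal(size=100)]
# Add a feature that is exactly twice the first feature, creating
# a linear dependency between the columns of the design matrix
X = np.c_[X, 2.0 * X[:, 1]]
y = X[:, 1] + rng.normal(size=100)
try:
    # The explicit inverse in Eq. 20 fails because X^T X is singular
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError:
    print("X^T X is singular - no inverse exists")
# np.linalg.lstsq minimizes the same sum of squared residuals via an
# SVD and returns one of the (non-unique) minimum-risk solutions
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

Library routines typically rely on this kind of numerically stable decomposition internally, rather than forming the inverse explicitly.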

The model residuals

Once we have obtained an estimate $\underline{\hat{\beta}}$ for the model parameters using Eq. 20, we can calculate the residuals. If we denote the $i^{\text{th}}$ residual by $r_i$, then we have the following:

$$r_i = y_i - \hat{y}_i = y_i - \underline{x}_i\,\underline{\hat{\beta}} = y_i - \left(\underline{\underline{X}}\,\underline{\hat{\beta}}\right)_i$$

Eq. 21

What happens if we sum all the residuals? To answer this question, we make use of Eq. 19 and recall that the first row of $\underline{\underline{X}}^{\top}$ is all ones if our model has an intercept. So, Eq. 19 tells us the following:

$$\sum_{i=1}^{N}\left(\underline{\underline{X}}\,\underline{\hat{\beta}}\right)_i = \sum_{i=1}^{N} y_i \quad\Rightarrow\quad \sum_{i=1}^{N}\left[y_i - \left(\underline{\underline{X}}\,\underline{\hat{\beta}}\right)_i\right] = 0$$

Eq. 22

So, the sum of all the residuals is zero if our model has an intercept.

Let’s see these ideas in action with a code example.

OLS regression code example

The Data/power_plant_output.csv file in the GitHub repository contains measurements of the power output from electricity generation plants. The power output (PE) is generated from a combination of gas turbines, steam turbines, and heat recovery steam generators, and so is affected by the environmental conditions in which the turbines operate, such as the ambient temperature (AT) and the steam turbine exhaust vacuum level (V). The dataset consists of 9,568 observations of the PE, AT, and V values. The data is a subset of the publicly available dataset held in the UCI Machine Learning Repository (https://archive.ics.uci.edu/datasets). The original data can be found at https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant.

We’ll use the data to build a linear model of the power output PE as a function of the AT and V values. We will build the linear model in two ways: i) using the Python statsmodels package, and ii) explicitly calculating the closed-form formula in Eq. 20. The following code example can be found in the Code_Examples_Chap4.ipynb notebook in the GitHub repository. To begin, we need to read in the data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# Read in the raw data
df = pd.read_csv("../Data/power_plant_output.csv")

We’ll do a quick inspection of the data. First, we’ll compute some summary statistics of the data:

# Use df.describe() to get the summary statistics of the data
df.describe()

            AT            V             PE
Count       9568.000000   9568.000000   9568.000000
Mean        19.651231     54.305804    454.365009
Std         7.452473      12.707893     17.066995
Min         1.810000      25.360000    420.260000
25%         13.510000     41.740000    439.750000
50%         20.345000     52.080000    451.550000
75%         25.720000     66.540000    468.430000
Max         37.110000     81.560000    495.760000

Table 4.1: Summary statistics for the power-plant dataset

Next, we’ll visualize the relationship between the response variable (the target variable) and the features. We’ll start with the relationship between power output (PE) and ambient temperature (AT):

# Scatterplot between the response variable PE and the AT feature.
# The linear relationship is clear.
plt.scatter(df.AT, df.PE)
plt.title('PE vs AT', fontsize=24)
plt.xlabel('AT', fontsize=20)
plt.ylabel('PE', fontsize=20)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()
Figure 4.5: Plot of power output (PE) versus ambient temperature (AT)

Now, let’s look at the relationship between power and vacuum (V):

# Scatterplot between the response variable PE and the V feature.
# The linear relationship is clear, but not as strong as the 
# relationship with the AT feature.
plt.scatter(df.V, df.PE)
plt.title('PE vs V', fontsize=24)
plt.xlabel('V', fontsize=20)
plt.ylabel('PE', fontsize=20)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()
Figure 4.6: Plot of power output (PE) versus vacuum level (V)

Now, we’ll fit a linear model using the statsmodels package. The linear model formula is specified in statistical notation as PE ~ AT + V. You can think of it as the statistical-formula equivalent of the mathematical formula $\mathrm{PE} = \beta_0 + \beta_{AT} x_{AT} + \beta_V x_V$. We do the fitting using the following code:

# First we specify the model using statsmodels.formula.api.ols
model = smf.ols(formula='PE ~ AT + V', data=df)
# Now we fit the model to the data, i.e. we minimize the sum-of-
# squared residuals with respect to the model parameters
model_result = model.fit()
# Now we'll look at a summary of the fitted OLS model
model_result.summary()

This gives the following parameter estimates for our linear model:

OLS Regression Results

             coef       std err    t           P>|t|    [0.025     0.975]
Intercept    505.4774   0.240      2101.855    0.000    505.006    505.949
AT           -1.7043    0.013      -134.429    0.000    -1.729     -1.679
V            -0.3245    0.007      -43.644     0.000    -0.339     -0.310

Table 4.2: OLS regression parameter estimates for our power-plant linear model
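Beyond the printed summary, the fitted coefficients are also available programmatically; we’ll make use of this below when comparing against the explicit calculation:

# The fitted coefficients as a pandas Series indexed by term name
print(model_result.params)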

We can see from Table 4.2 that, as expected, we get negative estimates for the parameters corresponding to the AT and V features. Now, we’ll repeat the calculation explicitly using the formula in Eq. 20. We’ll use the linear algebra functions available in numpy. First, we need to extract the data from the pandas DataFrame into appropriate numpy arrays:

# We extract the design matrix as a 2D numpy array.
# This initially corresponds to the feature columns of the dataframe.
# In this case it is all but the last column
X = df.iloc[:, 0:(df.shape[1]-1)].to_numpy()
# Now we'll add a column of ones to the design matrix.
# This is the feature that corresponds to the intercept parameter
# in the model
X = np.c_[np.ones(X.shape[0]), X]
# For convenience, we'll create and store the transpose of the
# design matrix
xT = np.transpose(X)
# Now we'll extract the response vector to a numpy array
y = df.iloc[:, df.shape[1]-1].to_numpy()
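As a quick sanity check, we can confirm that the arrays have the dimensions the math expects: the design matrix should be $N \times (d+1)$ once the intercept column is included, and the response should be a length-$N$ vector:

# Expect (9568, 3) for X (intercept, AT, V) and (9568,) for y
print(X.shape, y.shape)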

Now, we can calculate the OLS parameter estimates using the formula $\underline{\hat{\beta}} = \left(\underline{\underline{X}}^{\top}\underline{\underline{X}}\right)^{-1}\underline{\underline{X}}^{\top}\underline{y}$:

# Calculate the inverse of xTx using the numpy linear algebra 
# functions
xTx_inv = np.linalg.inv(np.matmul(xT, X))
# Finally calculate the OLS model parameter estimates using the 
# formula (xTx_inv)*(xT*y).
# Again, we use the numpy linear algebra functions to do this
ols_params = np.matmul(xTx_inv, np.matmul(xT, y))
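Before comparing the two sets of estimates, we can also verify the residual property in Eq. 22 numerically, reusing the X, y, and ols_params arrays defined above:

# Residuals of the fitted linear model, r_i = y_i - (X beta_hat)_i
residuals = y - np.matmul(X, ols_params)
# With an intercept in the model, the residuals sum to zero
# (up to floating-point rounding error)
print(residuals.sum())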

We can compare the OLS parameter estimates obtained from statsmodels with those obtained from the explicit calculation:

# Now compare the parameter estimates from the explicit calculation 
# with those obtained from the statsmodels fit
df_compare = pd.DataFrame({'statsmodels': model_result.params, 
                           'explicit_ols':ols_params})
df_compare

             statsmodels   explicit_ols
Intercept    505.477434    505.477434
AT           -1.704266     -1.704266
V            -0.324487     -0.324487

Table 4.3: A comparison of the parameter estimates from the statsmodels package and the explicit calculation using the OLS formula

The parameter estimates from the two different OLS regression calculations are identical to more than six decimal places.

This walkthrough of a real example highlights the power of the closed-form OLS regression formula in Eq. 20. The closed form arises from the linear (in $\underline{\beta}$) nature of the optimality criterion in Eq. 18, which itself arises from the quadratic nature of the risk function in Eq. 17, which ultimately is a consequence of the quadratic form of the squared-loss function in Eq. 8.

But what if we don’t want to use a linear model or a squared-loss function? Firstly, we can’t use OLS regression! Secondly, a different choice of loss function, such as the absolute loss or the pseudo-Huber robust loss function in Eq. 12, will not lead to a closed-form solution for $\underline{\hat{\beta}}$ when we minimize the empirical risk in Eq. 3. So, how do we minimize the empirical risk to obtain optimal model parameter estimates in these situations? We’ll learn how to address this question in the next section, but for now, let’s review what we have learned in this section.

What we learned

In this section, we have learned the following:

  • How to write the empirical risk for OLS regression in matrix notation
  • How to derive a closed-form expression for OLS model parameter estimates
  • Some of the properties and practical limitations of OLS regression
  • How to perform OLS regression using available Python packages such as statsmodels
  • How to perform OLS regression by explicitly calculating the closed-form formula for OLS model parameter estimates

Having learned how to perform OLS regression, we’ll now learn how to perform least squares regression in more general settings by using gradient descent techniques to minimize the empirical risk.
