Statistical Application Development with R and Python: Develop applications using data processing, statistical models, and CART

Product type Paperback

Published in Aug 2017

ISBN-13 9781788621199

Length 432 pages

Edition 2nd Edition

Author (1):

Prabhanjan Narayanachar Tattar

Continuous distributions

The numeric variables in the survey, Age, Mileage, and Odometer, can take any value over a continuous interval and are examples of continuous RVs. In the previous section, we dealt with RVs that had discrete output; in this section, we deal with RVs that have continuous output. One distinction from the previous section needs to be pointed out explicitly.

In the case of a discrete RV, the pmf assigns a positive probability to each value that the RV can take. In the continuous case, an RV necessarily assumes any specific value with zero probability. The technical reasons for this are beyond the scope of this book. In the discrete case, the probabilities of individual values are specified by the pmf; in the continuous case, probabilities are assigned to intervals and are determined by the probability density function, abbreviated as pdf.

Suppose that we have a continuous RV X with the pdf f(x) defined over the possible x values; that is, we assume that the pdf f(x) is well defined over the range of the RV X, denoted by $R_X$. The integral of f(x) over the range must equal 1; that is, $\int_{R_X} f(x)\,dx = 1$. The probability that the RV X takes a value in an interval [a, b] is defined by:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

In general, we are interested in the cumulative probabilities of a continuous RV, that is, the probability of the event $X \le x$. In terms of the previous equations, this is obtained as:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(s)\,ds$$

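This function is called the cumulative distribution function (cdf). The mean and variance of a continuous RV are then defined by:

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad \operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx$$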
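These definitions can be checked numerically for any concrete density. The following is a minimal Python sketch, not code from the book: it assumes a hypothetical density f(x) = 2x on [0, 1] and approximates each integral with a midpoint rule. The helper name `integrate` is introduced here only for illustration.

```python
# Illustrative density (an assumption, not from the text): f(x) = 2x on [0, 1].
def f(x):
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(g, lo, hi, n=100_000):
    """Approximate the integral of g over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

total = integrate(f, 0.0, 1.0)                    # pdf integrates to 1
prob = integrate(f, 0.2, 0.5)                     # P(0.2 <= X <= 0.5) = 0.25 - 0.04
cdf_at_half = integrate(f, 0.0, 0.5)              # F(0.5) = 0.25
mean = integrate(lambda x: x * f(x), 0.0, 1.0)    # E(X) = 2/3
var = integrate(lambda x: (x - mean) ** 2 * f(x), 0.0, 1.0)  # Var(X) = 1/18

print(total, prob, cdf_at_half, mean, var)
```

The lower limit of the cdf integral is 0 rather than minus infinity only because this particular density vanishes outside [0, 1].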

As in the previous section, we begin with the simplest case, the uniform distribution.

Uniform distribution

An RV is said to have a uniform distribution over the interval $[0, \theta]$, $\theta > 0$, if its probability density function is given by:

$$f(x; \theta) = \frac{1}{\theta}, \quad 0 \le x \le \theta$$

In fact, it is not necessary to restrict our focus to the positive real line. For any two real numbers a and b with b > a, the uniform RV can be defined by:

$$f(x; a, b) = \frac{1}{b - a}, \quad a \le x \le b$$

The uniform distribution has a very important role to play in simulation, as will be seen in Chapter 6, Simulation. As with the discrete counterpart, in the continuous case any two intervals of the same length within [a, b] have an equal probability of occurring. The mean and variance of a uniform RV over the interval [a, b] are respectively given by:

$$E(X) = \frac{a + b}{2}, \qquad \operatorname{Var}(X) = \frac{(b - a)^2}{12}$$
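As a quick check of the equal-length property and of the mean and variance, here is a minimal Python sketch using only the standard library. The function names `unif_cdf`, `unif_mean`, and `unif_var` are hypothetical helpers introduced for illustration, and the interval [2, 10] is an arbitrary choice.

```python
def unif_cdf(x, a=0.0, b=1.0):
    """CDF of the uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def unif_mean(a, b):
    """Mean of the uniform distribution on [a, b]."""
    return (a + b) / 2.0

def unif_var(a, b):
    """Variance of the uniform distribution on [a, b]."""
    return (b - a) ** 2 / 12.0

a, b = 2.0, 10.0  # arbitrary interval, chosen only for illustration

# Two intervals of the same length (1.5) inside [a, b] get the same probability.
p1 = unif_cdf(4.5, a, b) - unif_cdf(3.0, a, b)
p2 = unif_cdf(9.0, a, b) - unif_cdf(7.5, a, b)

print(p1, p2)
print(unif_mean(a, b), unif_var(a, b))
```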

Example 1.4.1. Horgan’s (2008), Example 15.3: The International Journal of Circuit Theory and Applications reported in 1990 that researchers at the University of California, Berkeley, had designed a switched capacitor circuit for generating random signals whose trajectory is uniformly distributed over the unit interval [0, 1]. Suppose that we are interested in calculating the probability that the trajectory falls in the interval [0.35, 0.58]. Though the answer is straightforward, we will obtain it using the punif function:

> punif(0.58)-punif(0.35)
[1] 0.23

Of course, we don’t need software for such a simple integral: the density equals 1 on [0, 1], so the probability is just the length of the interval, 0.58 - 0.35 = 0.23.
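The same probability can also be approximated by simulation, which hints at the role the uniform distribution plays there. This is a minimal Python sketch, assuming the standard library's `random.random()` as the uniform generator on [0, 1); the seed and the sample size are arbitrary choices made for illustration.

```python
import random

random.seed(2017)   # arbitrary seed, used only for reproducibility
n = 200_000         # arbitrary sample size

# Exact answer: the interval length, since the density is 1 on [0, 1].
exact = 0.58 - 0.35

# Monte Carlo estimate: fraction of uniform draws landing in [0.35, 0.58].
hits = sum(1 for _ in range(n) if 0.35 <= random.random() <= 0.58)
estimate = hits / n

print(f"exact = {exact:.2f}, simulated = {estimate:.4f}")
```

The simulated value varies with the seed and sample size, but it should settle near 0.23 as n grows.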

Exponential distribution

The exponential distribution is probably one of the most important probability distributions in statistics, and more so for computer scientists. The time between arrivals in a queuing system, the time between two incoming calls on a mobile, the lifetime of a laptop, and so on, are some of the important applications where this distribution has lasting utility. The pdf of an exponential RV is specified by:

$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0, \ \lambda > 0$$

The parameter $\lambda$ is sometimes referred to as the failure rate. The exponential RV enjoys a special property called the memoryless property, which states that:

$$P(X \ge t + s \mid X \ge s) = P(X \ge t), \quad s, t \ge 0$$

The mathematical statement translates into the property that if X is an exponential RV, then its future failure depends only on the present, and the past (age) of the RV does not matter. In simple words, the failure rate is constant in time and does not depend on the age of the system. Let us obtain the plots of a few exponential distributions:

> par(mfrow=c(1,2))
> # Left panel: rates between 0.2 and 1
> curve(dexp(x,1),0,10,ylab="f(x)",xlab="x",cex.axis=1.25)
> curve(dexp(x,0.2),add=TRUE,col=2)
> curve(dexp(x,0.5),add=TRUE,col=3)
> curve(dexp(x,0.7),add=TRUE,col=4)
> curve(dexp(x,0.85),add=TRUE,col=5)
> # lty=1 draws a line sample for each curve in the legend
> legend(6,1,paste("Rate = ",c(1,0.2,0.5,0.7,0.85)),col=1:5,lty=1)
> # Right panel: rates between 10 and 50
> curve(dexp(x,50),0,0.5,ylab="f(x)",xlab="x")
> curve(dexp(x,10),add=TRUE,col=2)
> curve(dexp(x,20),add=TRUE,col=3)
> curve(dexp(x,30),add=TRUE,col=4)
> curve(dexp(x,40),add=TRUE,col=5)
> # Legend labels match the rates plotted in this panel
> legend(0.3,50,paste("Rate = ",c(50,10,20,30,40)),col=1:5,lty=1)

The exponential densities

The mean and variance of this exponential distribution are as follows:

$$E(X) = \frac{1}{\lambda}, \qquad \operatorname{Var}(X) = \frac{1}{\lambda^2}$$
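The memoryless property and the mean and variance of the exponential distribution can be checked numerically. The following is a minimal Python sketch using only the standard library, not the book's own listing. The rate `lam = 0.5`, the times `s` and `t`, the seed, and the helper name `surv` are all assumptions made for illustration.

```python
import math
import random

lam = 0.5   # illustrative rate parameter (an assumption)

def surv(x, lam):
    """Survival function P(X >= x) = exp(-lam * x) for x >= 0."""
    return math.exp(-lam * x)

# Memoryless property: P(X >= t + s | X >= s) equals P(X >= t).
s, t = 3.0, 2.0
conditional = surv(t + s, lam) / surv(s, lam)
unconditional = surv(t, lam)

# Simulated mean and variance, to compare with 1/lam and 1/lam^2.
random.seed(1)   # arbitrary seed, for reproducibility only
sample = [random.expovariate(lam) for _ in range(200_000)]
sim_mean = sum(sample) / len(sample)
sim_var = sum((x - sim_mean) ** 2 for x in sample) / len(sample)

print(conditional, unconditional)
print(sim_mean, 1 / lam)
print(sim_var, 1 / lam ** 2)
```

Note that `random.expovariate` takes the rate, as does R's `dexp`, so a rate of 0.5 corresponds to a mean of 2.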

The complete Python code block is given next:

Normal distribution

The normal distribution is in some sense an all-pervasive distribution that arises sooner or later in almost any statistical discussion. In fact, it is very likely that the reader is already familiar with certain aspects of the normal distribution; for example, the normal distribution curve is bell-shaped. Its mathematical appeal is perhaps reflected in the fact that, although its expression is simple, its density function includes the three most famous irrational numbers: $\pi$, $e$, and $\sqrt{2}$.

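Suppose that X is normally distributed with the mean $\mu$ and the variance $\sigma^2$. Then, the probability density function of the normal RV is given by:

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}, \quad -\infty < x < \infty$$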

If the mean is zero and the variance is 1, the normal RV is referred to as the standard normal RV, conventionally denoted by Z.
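To make the density concrete, the following minimal Python sketch writes the normal pdf out from its formula and compares it with `statistics.NormalDist` from the standard library (available from Python 3.8). The helper name `normal_pdf` and the values of `mu` and `sigma` are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density written out from its formula; sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

mu, sigma = 10.0, 2.0   # illustrative mean and standard deviation (assumptions)
for x in (6.0, 10.0, 13.5):
    # NormalDist also takes the standard deviation, not the variance.
    print(x, normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))

# For the standard normal Z, the density peaks at z = 0 with height 1/sqrt(2*pi).
peak = normal_pdf(0.0)
print(peak)
```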

Example 1.4.2. Shady Normal Curves: We again consider a standard normal random variable, denoted by Z. Some of the most frequently needed probabilities are P(Z > 0), P(-1.96 < Z < 1.96), and P(-2.58 < Z < 2.58). These probabilities are now shaded:

> par(mfrow=c(3,1))
> # Probability Z Greater than 0
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z=seq(0,4,0.02)
> lines(z,dnorm(z),type="h",col="grey")
> # 95% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z=seq(-1.96,1.96,0.001)
> lines(z,dnorm(z),type="h",col="grey")
> # 99% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z=seq(-2.58,2.58,0.001)
> lines(z,dnorm(z),type="h",col="grey")
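The shaded areas can also be computed rather than drawn. This is a minimal Python sketch using `statistics.NormalDist` from the standard library; in R, the same differences could be obtained with `pnorm`. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

p_upper = 1.0 - Z.cdf(0.0)            # P(Z > 0)
p_95 = Z.cdf(1.96) - Z.cdf(-1.96)     # P(-1.96 < Z < 1.96), about 0.95
p_99 = Z.cdf(2.58) - Z.cdf(-2.58)     # P(-2.58 < Z < 2.58), about 0.99

print(round(p_upper, 4), round(p_95, 4), round(p_99, 4))
```

The three values correspond, in order, to the three shaded panels.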