Building Statistical Models in Python: Develop useful models for regression, classification, time series, and survival analysis

By Huy Hoang Nguyen, Paul N Adams, and Stuart J Miller
eBook Aug 2023 420 pages 1st Edition

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want
  • AI Assistant (beta) to help accelerate your learning

Sampling and Generalization

In this chapter, we will describe the concept of populations and sampling from populations, including some common strategies for sampling. The discussion of sampling will lead to a section that will describe generalization. Generalization will be discussed as it relates to using samples to make conclusions about their respective populations. When modeling for statistical inference, it is necessary to ensure that samples can be generalized to populations. We will provide an in-depth overview of this bridge through the subjects in this chapter.

We will cover the following main topics:

  • Software and environment setup
  • Population versus sample
  • Population inference from samples
  • Sampling strategies – random, systematic, and stratified

Software and environment setup

Python is one of the most popular programming languages for data science and machine learning, thanks to the large open source community that has driven the development of its libraries. Python’s ease of use and flexible nature have made it a prime candidate in the data science world, where experimentation and iteration are key features of the development cycle. While there are newer languages in development for data science applications, such as Julia, Python currently remains the key language for data science due to its wide breadth of open source projects, which support applications from statistical modeling to deep learning. We have chosen to use Python in this book due to its positioning as an important language for data science and its demand in the job market.

Python is available for all major operating systems: Microsoft Windows, macOS, and Linux. Additionally, the installer and documentation can be found at the official website: https://www.python.org/.

This book is written for Python version 3.8 (or higher). We recommend using the most recent version of Python available to you. The code found in this book is unlikely to be compatible with Python 2.7, and most actively maintained libraries have been dropping support for Python 2.7 since its official support ended in 2020.

The libraries used in this book can be installed with the Python package manager, pip, which is bundled with contemporary versions of Python. More information about pip can be found here: https://docs.python.org/3/installing/index.html. Once pip is available, packages can be installed from the command line. Here is basic usage at a glance:

Install a new package using the latest version:

pip install SomePackage

Install the package with a specific version, version 2.1 in this example:

pip install SomePackage==2.1

A package that is already installed can be upgraded with the --upgrade flag:

pip install SomePackage --upgrade

In general, it is recommended to use a separate Python virtual environment for each project, which keeps project dependencies separate from one another and from system directories. Python provides a virtual environment utility, venv, which is part of the standard library in contemporary versions of Python. Virtual environments allow you to create isolated Python environments, where each environment has its own Python executable (a copy of, or link to, the base interpreter) and its own set of installed dependencies. Using virtual environments can prevent package version issues and conflicts when working on multiple Python projects. Details on setting up and using virtual environments can be found here: https://docs.python.org/3/library/venv.html.
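
As a minimal sketch, a virtual environment can also be created from Python itself with the standard library's venv module. The directory name .venv-demo is an arbitrary choice for illustration, and with_pip=False is used here only to keep the sketch quick. The more common route is to run python -m venv followed by a directory name on the command line.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(parent_dir, name=".venv-demo"):
    """Create a virtual environment under parent_dir and return its path.

    The default name ".venv-demo" is a hypothetical choice for this sketch.
    """
    env_dir = Path(parent_dir) / name
    # with_pip=False skips bootstrapping pip, which keeps this sketch quick;
    # pass with_pip=True to have pip installed into the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_path = create_environment(tmp)
        # Each virtual environment contains a pyvenv.cfg file that records
        # which base interpreter the environment was created from.
        print((env_path / "pyvenv.cfg").read_text())
```

Because each environment lives in its own directory, removing a project's environment is a matter of deleting that directory; other projects are unaffected.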

While we recommend the use of Python and Python’s virtual environments for environment setups, a highly recommended alternative is Anaconda. Anaconda is a free (enterprise-ready) analytics-focused distribution of Python by Anaconda Inc. (previously Continuum Analytics). Anaconda distributions come with many of the core data science packages, common IDEs (such as Jupyter and Visual Studio Code), and a graphical user interface for managing environments. Anaconda can be installed using the installer found at the Anaconda website here: https://www.anaconda.com/products/distribution.

Anaconda comes with its own package manager, conda, which can be used to install new packages similarly to pip.

Install a new package using the latest version:

conda install SomePackage

Upgrade a package that is already installed:

conda upgrade SomePackage

Throughout this book, we will make use of several core libraries in the Python data science ecosystem, such as NumPy for array manipulations, pandas for higher-level data manipulations, and matplotlib for data visualization. The package versions used for this book are contained in the following list. Please ensure that the versions installed in your environment are equal to or greater than the versions listed. This will help ensure that the code examples run correctly:

  • statsmodels 0.13.2
  • Matplotlib 3.5.2
  • NumPy 1.23.0
  • SciPy 1.8.1
  • scikit-learn 1.1.1
  • pandas 1.4.3

The packages used for the code in this book are shown here in Figure 1.1. The __version__ attribute of each package can be used to print the package version in code.

Figure 1.1 – Package versions used in this book
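
Since the figure is not reproduced here, the following is a rough sketch of how the installed versions could be printed. It is not the authors' code: the helper get_versions and the "not installed" and "unknown" placeholders are assumptions made for this example. Note that scikit-learn is imported under the name sklearn.

```python
import importlib

# Import names for the packages listed above (scikit-learn imports as "sklearn").
PACKAGES = ["statsmodels", "matplotlib", "numpy", "scipy", "sklearn", "pandas"]


def get_versions(package_names):
    """Return a dict mapping each import name to a version string."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is missing from the current environment.
            versions[name] = "not installed"
            continue
        # __version__ is a widely followed convention rather than a guarantee,
        # so fall back to a placeholder if a package does not define it.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in get_versions(PACKAGES).items():
        print(f"{name}: {version}")
```

The printed versions can then be compared by eye against the minimum versions in the list above.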

Having set up the technical environment for the book, let’s get into the statistics. In the next sections, we will discuss the concepts of population and sampling. We will demonstrate sampling strategies with code implementations.

Population versus sample

In general, the goal of statistical modeling is to answer a question about a group by making an inference about that group. The group we are making an inference on could be machines in a production factory, people voting in an election, or plants on different plots of land. The entire group, every individual item or entity, is referred to as the population. In most cases, the population of interest is so large that it is not practical or even possible to collect data on every entity in the population. For instance, using the voting example, it would probably not be possible to poll every person who voted in an election. Even if it were possible to reach all the voters for the election of interest, many voters may not consent to polling, which would prevent collection on the entire population. An additional consideration would be the expense of polling such a large group. These factors make it practically impossible to collect population statistics in our example of vote polling. These types of prohibitive factors exist in many cases where we may want to assess a population-level attribute.

Fortunately, we do not need to collect data on the entire population of interest. Inferences about a population can be made using a subset of the population. This subset of the population is called a sample. This is the main idea of statistical modeling: a model is created using a sample, and inferences are made about the population.
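
To make the voting example concrete, here is a small sketch using only the Python standard library. The population size, the 52% support figure, and the sample size of 1,000 are all made-up values for illustration. In practice the population proportion is unknown, which is exactly why a sample is needed.

```python
import random

# Hypothetical population of voters: 1 = supports candidate A, 0 = does not.
POPULATION_SIZE = 100_000
SUPPORTERS = 52_000  # exactly 52% of the population (made-up figure)
population = [1] * SUPPORTERS + [0] * (POPULATION_SIZE - SUPPORTERS)


def proportion(values):
    """Fraction of entries equal to 1."""
    return sum(values) / len(values)


def poll(population, sample_size, seed=None):
    """Estimate the population proportion from a simple random sample."""
    rng = random.Random(seed)
    # rng.sample draws without replacement, so no voter is polled twice.
    sample = rng.sample(population, sample_size)
    return proportion(sample)


if __name__ == "__main__":
    print(f"Population proportion: {proportion(population):.3f}")
    print(f"Estimate from 1,000 voters: {poll(population, 1_000, seed=1):.3f}")
```

The estimate will usually differ slightly from the population value; how large that difference tends to be depends on the sample size.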

In order to make valid inferences about the population of interest using a sample, the sample must be representative of the population of interest, meaning that the sample should contain the variation found in the population. For example, if we were interested in making an inference about plants in a field, it is unlikely that samples from one corner of the field would be sufficient for inferences about the larger population. There would likely be variations in plant characteristics over the entire field. We could think of various reasons why there might be variation. For this example, we will consider some examples from Figure 1.2.

Figure 1.2 – Field of plants

The figure shows that Sample A is near a forest. This sample area may be affected by the presence of the forest; for example, some of the plants in that sample may receive less sunlight than plants in the other samples. Sample B is shown to be in between the main irrigation lines. It’s conceivable that this sample receives more water on average than the other two samples, which may have an effect on the plants in this sample. The final sample, Sample C, is near a road. This sample may see other effects that are not seen in Sample A or B.

If samples were only taken from one of those sections, the resulting sample would be biased and would not support valid inferences about the population. Thus, samples would need to be taken from across the entire field to create a sample that is more likely to be representative of the population of plants. When taking samples from populations, it is critical to ensure the sampling method is robust to possible issues, such as the influence of irrigation and shade in the previous example. Whenever taking a sample from a population, it’s important to identify and mitigate possible sources of bias because biases in data will affect your model and skew your conclusions.
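
The effect of sampling from a single section can be illustrated with a short simulation. This is a sketch under invented assumptions: the three section means (40, 60, and 50 cm), the spread of 2 cm, and the sample size of 90 plants are hypothetical and do not come from the book.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical mean plant heights (cm): A is shaded by the forest, B sits
# between the irrigation lines, and C is near the road.
SECTION_MEANS = {"A": 40.0, "B": 60.0, "C": 50.0}
PLANTS_PER_SECTION = 1_000

# Simulate the height of every plant in each section of the field.
field = {
    section: [rng.gauss(mean, 2.0) for _ in range(PLANTS_PER_SECTION)]
    for section, mean in SECTION_MEANS.items()
}
all_plants = [height for heights in field.values() for height in heights]
population_mean = statistics.mean(all_plants)

# Biased sample: 90 plants taken from Section A only.
section_a_sample = rng.sample(field["A"], 90)
# Sample of the same size drawn at random from across the entire field.
field_sample = rng.sample(all_plants, 90)

section_a_error = abs(statistics.mean(section_a_sample) - population_mean)
field_error = abs(statistics.mean(field_sample) - population_mean)

print(f"Population mean height: {population_mean:.1f} cm")
print(f"Error using Section A only: {section_a_error:.1f} cm")
print(f"Error sampling across the field: {field_error:.1f} cm")
```

Both samples contain the same number of plants, so the difference in error comes from where the plants were taken, not from how many were measured.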

In the next section, various methods for sampling from a dataset will be discussed. An additional consideration is the sample size. The sample size impacts the type of statistical tools we can use, the distributional assumptions that can be made about the sample, and the confidence of inferences and predictions. The impact of sample size will be explored in depth in Chapter 2, Distributions of Data and Chapter 3, Hypothesis Testing.
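
As a rough preview of why sample size matters, the following sketch repeatedly draws samples of different sizes from a simulated population and measures how much the sample mean varies from one draw to the next. The population parameters (mean 50, standard deviation 10) and the helper name spread_of_sample_means are assumptions made for this illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 50,000 measurements.
population = [rng.gauss(50.0, 10.0) for _ in range(50_000)]


def spread_of_sample_means(population, sample_size, repeats=200):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (10, 100, 1_000):
        spread = spread_of_sample_means(population, n)
        print(f"n = {n:>5}: spread of sample means = {spread:.2f}")
```

Larger samples give sample means that cluster more tightly around the population mean, which is one reason sample size affects the confidence of inferences.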


Key benefits

  • Gain expertise in identifying and modeling patterns that generate success
  • Explore the concepts with Python using important libraries such as statsmodels
  • Learn how to build models on real-world data sets and find solutions to practical challenges

Description

The ability to proficiently perform statistical modeling is a fundamental skill for data scientists and essential for businesses reliant on data insights. Building Statistical Models in Python is a comprehensive guide that will empower you to leverage mathematical and statistical principles in data assessment, understanding, and inference generation. This book not only equips you with skills to navigate the complexities of statistical modeling, but also provides practical guidance for immediate implementation through illustrative examples. Through emphasis on application and code examples, you’ll understand the concepts while gaining hands-on experience. With the help of Python and its essential libraries, you’ll explore key statistical techniques, including hypothesis testing, regression, time series analysis, classification, and more. By the end of this book, you’ll gain fluency in statistical modeling while harnessing the full potential of Python's rich ecosystem for data analysis.

Who is this book for?

If you are looking to get started with building statistical models for your data sets, this book is for you! Building Statistical Models in Python bridges the gap between statistical theory and practical application of Python. Since you’ll take a comprehensive journey through theory and application, no previous knowledge of statistics is required, but some experience with Python will be useful.

What you will learn

  • Explore the use of statistics to make decisions under uncertainty
  • Answer questions about data using hypothesis tests
  • Understand the difference between regression and classification models
  • Build models with statsmodels in Python
  • Analyze time series data and provide forecasts
  • Discover Survival Analysis and the problems it can solve

Product Details

Publication date: Aug 31, 2023
Length: 420 pages
Edition: 1st
Language: English
ISBN-13: 9781804612156

Table of Contents

Part 1: Introduction to Statistics
Chapter 1: Sampling and Generalization
Chapter 2: Distributions of Data
Chapter 3: Hypothesis Testing
Chapter 4: Parametric Tests
Chapter 5: Non-Parametric Tests
Part 2: Regression Models
Chapter 6: Simple Linear Regression
Chapter 7: Multiple Linear Regression
Part 3: Classification Models
Chapter 8: Discrete Models
Chapter 9: Discriminant Analysis
Part 4: Time Series Models
Chapter 10: Introduction to Time Series
Chapter 11: ARIMA Models
Chapter 12: Multivariate Time Series
Part 5: Survival Analysis
Chapter 13: Time-to-Event Variables – An Introduction
Chapter 14: Survival Models
Index
Other Books You May Enjoy

Customer reviews

Rating: 4.9 out of 5 (11 ratings)
5 star 90.9%
4 star 9.1%
3 star 0%
2 star 0%
1 star 0%

Dror, Oct 01, 2023, 5 out of 5 stars
Statistics is a fundamental discipline concerned with the collection, organization, analysis, interpretation, and presentation of data. While Python, an extremely popular general-purpose programming language, has become the programming language of choice for computation in most science and engineering disciplines, most (software-oriented) statistics books still teach statistics using the more special-purpose R language. This unique and highly practical book provides a gentle introduction to statistics and to using the Python programming language for building statistical models. It begins with a clear and useful introduction to statistics, including sampling, data distributions, hypothesis testing, and parametric and non-parametric statistical tests. It then progresses to describe in detail how to build statistical models using Python for a variety of problems, including for regression, classification, time-series, and survival analysis. The descriptions are clear and concise, and gradually present additional common and helpful Python packages for performing statistical analysis. The accompanying GitHub repository includes practical and detailed code examples, and is very helpful in reinforcing the materials and concepts presented in the book. I highly recommend this book to anyone interested in learning statistics and how to use Python for building statistical models. It requires no more than basic knowledge of the Python programming language, and will be ideal for data scientists, analysts, and industry professionals who are taking their first steps in the world of statistics or want to expand their knowledge in this area. Highly recommended!
(Amazon verified review)
JRVV, Oct 19, 2023, 5 out of 5 stars
The book provides a broad primer on statistical modeling using Python. This book can also serve as a starting point to those who eventually want to go into machine learning. Recommended.
(Amazon verified review)
Amazon Customer, Oct 02, 2023, 5 out of 5 stars
This book is exceptionally crafted, serving as a comprehensive review of fundamental statistical knowledge, complemented with practical Python codes. Unlike other market options, which either focus solely on theory or coding, lacking depth in theoretical insight, this book seamlessly bridges theory to application. While many statistical texts predominantly utilize the R language, this book's emphasis on Python is a refreshing change. It not only rejuvenates and reinforces my existing knowledge but also significantly advances my understanding of Statistics and Machine Learning. It stands out as a balanced and insightful resource for both theoretical comprehension and practical application in the field.
(Amazon verified review)
Steven Fernandes, Oct 09, 2023, 5 out of 5 stars
The authors offer a compelling dive into making informed decisions under uncertainty, equipping readers with practical skills, such as hypothesis testing and data analysis. They thoughtfully elucidate the distinctions between regression and classification models and provide a hands-on approach to building models using Python's statsmodels. The text also insightfully explores time-series data analysis, forecasting, and survival analysis, adeptly linking theory with real-world applications. This book emerges as an invaluable guide for both beginners and seasoned practitioners, intertwining robust theoretical constructs with practical applicability in data analysis and model-building, rendering it a must-read in the field of data science.
(Amazon verified review)
Ratan, Nov 23, 2023, 5 out of 5 stars
The authors have done a good job in maintaining the comprehensiveness of the book. They have included an adequate amount of the mathematics that is needed. I particularly loved the way they have presented hypothesis testing for models, which is often missing in many places. They have nicely covered both parametric and non-parametric testing. The other part I liked was the somewhat less visited topic of survival analysis. Overall, I found this book an excellent read! Definitely recommend it.
(Amazon verified review)

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing: When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect our rights as publishers and those of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else

How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).

Where can I access support around an eBook?

  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us

What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?

  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space

What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.