Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
PostgreSQL 12 High Availability Cookbook

You're reading from   PostgreSQL 12 High Availability Cookbook Over 100 recipes to design a highly available server with the advanced features of PostgreSQL 12

Arrow left icon
Product type Paperback
Published in Feb 2020
Publisher Packt
ISBN-13 9781838984854
Length 734 pages
Edition 3rd Edition
Languages
Concepts
Arrow right icon
Author (1):
Arrow left icon
Shaun Thomas Shaun Thomas
Author Profile Icon Shaun Thomas
Shaun Thomas
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Architectural Considerations 2. Hardware Planning FREE CHAPTER 3. Minimizing Downtime 4. Proxy and Pooling Resources 5. Troubleshooting 6. Monitoring 7. PostgreSQL Replication 8. Backup Management 9. High Availability with repmgr 10. High Availability with Patroni 11. Low-Level Server Mirroring 12. High Availability via Pacemaker 13. High Availability with Multi-Master Replication 14. Data Distribution 15. Zero-downtime Upgrades 16. Other Books You May Enjoy

Defining timetables through RTO

Like RPO, RTO refers to a common business continuity term known as Recovery Time Objective. In practice, this is the amount of time an outage of the database layer may last. Often, it is incorporated into a Service Level Agreement (SLA) contract presented to clients or assumed as a metric within the application stack. Like RPO, this is a contractual-level element that can determine the number of required nodes at steadily increasing expense as the amount of tolerable downtime decreases.

In this recipe, we will examine the necessary steps to defining a realistic RTO, and what that could mean given known industry standards.

Getting ready

As with RPO, our goal in determining a functional RTO is to set expectations regarding inherent architecture limitations. The primary difference here is that RTO is more easily quantifiable. Fire up your favorite spreadsheet program, such as OpenOffice, Microsoft Excel, or Google Sheets; we'll be using it to keep track of how much time each layer of the application, including the database layer contributes to a potential outage scenario.

How to do it...

We simply need to produce a spreadsheet to track all of the elements of known RTO that depend on the database. We can do this with the following steps:

  1. Locate an already-defined RTO SLA for each portion of the application dependent on PostgreSQL if possible.
  2. If this does not exist, seek the input of major decision makers:
  • VP and C-level executives involved with technology
  • Product manager
  • Application designers and architects
  • Infrastructure team lead
  1. Find an amount of time that will satisfy most or all of the above.
  2. Create a new spreadsheet for RTO.
  3. Create a heading row with the following columns:
  • Activity
  • Time (seconds)
  • Count
  • Total (seconds)
  1. In the Total column, create the following formula:
=B2*C2
  1. Create one row for each type of the following Activity categories:
  • Minor Upgrade
  • Major Upgrade
  • Reboot
  • Switchover
  • Failover
  • OS Upgrade
  • Etc.
  1. Copy and paste the formula into the Total column for all the rows we created.
  1. At the bottom of the Total column, after all relevant rows (row 21, for example), create the following formula:
=SUM(D2:D20) 
  1. Ensure that the end result looks something like the following screenshot:
  2. Follow the rest of the advice in this chapter to find a suitable architecture.
  3. Try to determine a rough cost for this and the closest alternative(s).
  4. Present the design and cost estimates to decision makers.
  5. Document this final RTO decision and architecture as reference material.

How it works...

In order to see where our PostgreSQL cluster fits company expectations, we need to know whether the company and each individual part of the existing application stack has an overall target RTO. If it doesn't, it's our job to approximate one. This means contacting any decision-makers, product owners, architects, and so on, to know what RTO target we're trying to attain and how other resources may contribute. These will act as a type of maximum value we can't exceed.

Keep in mind that RTO values tend to be amplified between layers. If our RTO is higher than some portion of the application stack, that will necessarily raise the RTO of that layer as well, which may increase the RTO of each subsequent layer. This is the exact scenario we're trying to avoid.

Once we have an RTO expectation, we need to examine how possible it is to fall under that target. The easiest way to accomplish this is to build a spreadsheet that essentially consists of a list of dependencies, maintenance tasks, or other occurrences related to PostgreSQL.

The rows we used for Activity are mainly suggestions, and producing an exhaustive list is generally dependent on the architecture to a certain extent. However, all software requires upgrades, machines need to be rebooted, switchover tests to prove high availability functionality may be required, past experience with the full application stack and hardware may imply two unexpected outages per year, and so on. Each of these will contribute to the cumulative RTO for PostgreSQL which we can use as a reference value.

The number we use for the Count column should be the number of times the Activity happens on a yearly basis. As an example, PostgreSQL has a quarterly release schedule for non-critical bug and security enhancements. If you want to follow along with these, it could make sense to set the Count column of Minor Upgrade to 4.

A number of architectural examples that we'll discuss later in this chapter will make it possible to set the Time column to 0 for some actions, or at least to a much lower value. We'll discuss these where relevant. This is also one of the reasons we'll need to execute this recipe multiple times when deciding on an appropriate architecture.

Once we have accounted for as many foreseeable Action components that may be necessary over the course of a year, we'll have a cumulative total that may represent the RTO that PostgreSQL can achieve for a given architecture. As a sanity check, we should compare that value to the lowest RTO for any parts of the application stack that depend on PostgreSQL. It's important we don't exceed this target.

Then, as with RPO, we need to present the possible RTO to decision-makers so that it can be integrated into the overall company RTO. To do that, we must continue with the rest of the chapter to find one or two architectures with either higher or lower expected RTO, estimate the cost of each, and work on a suitable compromise.

Deriving an appropriate RTO may require multiple iterations of this recipe, from estimation, architecture selection, presenting it to appropriate parties, and so on. This isn't a fast or simple process, and it pays to get it right early. We need to know how many PostgreSQL nodes to purchase, where each will reside, how we switch to alternatives, how much time each step may take, and so on.

There's more...

Besides what we discussed in the main recipe, there are other RTO concepts we would like to explore.

This may seem familiar

Believe it or not, it's very likely you've encountered this concept without even realizing it. Internet service providers or application hosts often advertise how many 9s of availability their platform can maintain. It's often presented as a chart like this:

Uptime (%)

Daily

Weekly

Monthly

Yearly

99

14m 24s

1h 40m 48s

7h 18m 18s

3d 15h 39m 30s

99.9

1m 26s

10m 5s

43m 50s

8h 45m 57s

99.99

8.6s

1m 1s

4m 23s

52m 36s

99.999

0.9s

6s

26.3s

5m 16s

As you can imagine, it's generally more desirable to stay toward the higher end of 9s to minimize downtime. On the other hand, this is highly restrictive, as Five 9s only allows just over five minutes of downtime over the course of an entire year. This doesn't leave much room for database maintenance tasks or unexpected outages at any other layer of the stack.

Node counts

Generally, the more nodes we have, the lower our RTO will be. It may make sense to start with an initial estimate spreadsheet, and then create another for each architecture or variant that seems applicable. This will make it easier to rank the monetary cost and associated RTO for each. This may influence the final decision, and hence make it easier to track what options we may have.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime