Now that we understand the types of monitoring and how they can be applied in different scenarios, the next step is to look at the components that make monitoring possible. Every monitoring architecture or setup is built on a set of base components, regardless of which of the types of monitoring from the previous section is being implemented. They are as follows:
- Alerts/notifications
- Events
- Logging
- Metrics
- System availability
- Dashboards
- Incidents
Alerts/notifications
An alert/notification is an event that is triggered to inform the system administrator or site reliability engineer about a potential issue, or one that has already happened. Every alert is configured with a specific metric in mind: the metric is evaluated against a condition, and when that condition is met, the alert is triggered to send a notification.
Notifications can be sent using different media. Alerts can be sent using SMS (Short Message Service), email, mobile app push notifications, HTTP push events, and more. The message sent via these media contains information about the incident that has occurred or the metric condition that has been met. Alerts can be used both for proactive monitoring, to warn sysadmins about, say, high network I/O, and for reactive monitoring, to notify a Site Reliability Engineer (SRE) that an API endpoint is down. The AWS service that specializes in this area is Amazon SNS, and the notifications we configure in this book will use Amazon SNS.
Important Note
Amazon SNS is a fully managed messaging service used for sending notifications via SMS, push notifications, HTTP calls, and email. SNS does not require you to set up any servers or manage any SMTP or SMPP infrastructure for sending emails or SMS. AWS manages all of that for you and provides an interactive UI, CLI, and API to manage the SNS service and use any of these media to send notifications.
Most users rely on the SNS email medium simply to get notified when something in a system goes down, goes wrong, or deserves a warning. An SNS HTTP topic, however, can be used to trigger another service, such as an event bus, to start a background process, for example, cleaning temporary files on the server or creating a backup in response to the warning signal that has been received. An SRE can also tie automated runbooks to HTTP endpoints, which an SNS topic can trigger as a notification.
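To make this concrete, here is a minimal sketch of creating a topic, subscribing an email address, and publishing an alert with boto3, the AWS SDK for Python. The topic name, email address, and message are placeholders, and the snippet assumes AWS credentials are already configured:

```python
import boto3

sns = boto3.client("sns")

# Create (or fetch, if it already exists) a topic for system alerts
topic = sns.create_topic(Name="system-alerts")
topic_arn = topic["TopicArn"]

# Subscribe an email address; the recipient must confirm the subscription
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="email",
    Endpoint="sre-team@example.com",
)

# Publish a notification when an alert condition is met
sns.publish(
    TopicArn=topic_arn,
    Subject="High network I/O warning",
    Message="NetworkIn exceeded the configured threshold on web-server-1.",
)
```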
Important Note
An SRE is someone in charge of making sure that applications and services maintain the highest possible uptime. Uptime, usually measured as a percentage, indicates how rarely the system goes down or is unavailable for customers to use. A good uptime for a website is 99.9%. A good tool for working with uptime figures is https://uptime.is.
Events
Any action, activity, or series of activities that occurs in a system is called an event. In computer systems, various events and activities go on in the background to keep the computer running. A very simple example of an event is the clock: a background process ensures the clock continues to tick so that time is kept, and each tick of the clock can be called an event. The hardware that makes up a PC includes components such as the CPU, memory, and hard disk, each of which performs its own series of events from time to time. The disk is the computer's persistent data storage, and it usually performs two basic operations, either reading data that has been written to it or writing in new data. We can also call these operations events of the disk.
In software programs, every function or method call can be considered an event. Software programs are made up of hundreds to thousands of methods or functions, each performing a unique operation to solve a specific problem. The ability to track each of these events is very important in monitoring software systems and applications.
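As a simple illustration of treating function calls as events, here is a small Python sketch; the decorator and function names are purely illustrative:

```python
import functools
import time

def track_event(func):
    """Record every call to the decorated function as an event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        # Each call becomes an event: a name, a timestamp, and a duration
        print(f"event={func.__name__} ts={start:.3f} "
              f"duration={time.time() - start:.6f}s")
        return result
    return wrapper

@track_event
def write_to_disk(data):
    pass  # placeholder for a real disk write
```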
Logging
A log is a historical record of an event. A log contains not only the event and its details but also the time the event occurred, and a series of such records forms logs. Nearly every program, whatever language it is written in, generates logs, and it is through logs that developers are able to spot bugs in code. When a log is produced by the runtime or interpreter, reading and interpreting it informs the developer about what the bug could be and what needs to be tweaked to fix the particular bug that has been identified.
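As a minimal illustration of events being recorded together with their timestamps, here is a sketch using Python's standard logging module; the logger name, file name, and messages are placeholders:

```python
import logging

# Each log record pairs an event with the time it occurred
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("payments")

logger.info("Payment request received")           # a routine event
logger.warning("Retrying gateway call (1 of 3)")  # a suspicious event
logger.error("Payment gateway timed out")         # an event worth alerting on
```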
We also showed the Microsoft Event Viewer in the previous section, which contains a list of events. Such a list of events eventually forms what is called logs: events that have taken place, together with a description of each event, its status, and the date and time it occurred.
The following screenshot shows an example of a list of events that forms logs:
Figure 1.4 – List of events forms logs
Logs are the heart of monitoring because they provide the raw data that can be analyzed to draw insights into the behavior of the system. In many organizations, logs are kept for a specific period of time for system audits, security analysis, and compliance inspections. In some cases, logs contain sensitive information about an organization, which is a potential vulnerability that attackers can use to exploit the system.
Logs are usually stored on the filesystem of the machine where the application is running, but they can grow so large that filesystem storage becomes inefficient. There are other places to store logs that scale almost without limit, and we will cover these as we go deeper into this book:
Figure 1.5 – A sample of an nginx log
Figure 1.5 is another example of events that form a log, taken from the access log file of an nginx server.
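As an illustration of how raw access-log events can be summarized, here is a small Python sketch that counts HTTP status codes, assuming nginx's default combined log format; the log file path is a placeholder:

```python
import re

# Matches the start of an nginx "combined" format access log line
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

status_counts = {}
with open("/var/log/nginx/access.log") as f:
    for line in f:
        match = LINE.match(line)
        if match:
            status = match.group("status")
            status_counts[status] = status_counts.get(status, 0) + 1

print(status_counts)  # e.g. {'200': 941, '404': 17, '500': 3}
```

This kind of summarization, turning thousands of raw events into one number per status code, is exactly what leads to the next component: metrics.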
Metrics
A metric is the smallest unit of insight obtained from a log. Metrics give meaning to the logs collected from a system, and they indicate a standard of measurement for different system components. In a huge collection of logs, what is usually needed is a single figure that summarizes all of the information captured, such as the estimated disk space left or the percentage of memory being consumed. This single piece of information helps the SRE or sysadmin decide how to react. In some cases, the metric is fed into a more automated system that responds according to the data received.
A simple example is the auto-scaling feature in AWS, which can spin up a new server when something goes wrong with an existing one. One metric that can trigger this is the CPU consumption of the currently running server. If CPU consumption rises above 90%, the server may become unreachable or unavailable within minutes or hours, so a remedy needs to be applied before that happens. That metric can be used to create a new server that either replaces the existing one or is added behind the load balancer to ensure that the application or service does not suffer downtime.
The following diagram illustrates how auto-scaling works:
Figure 1.6 – Autoscaling based on instance metrics
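As a sketch of how such a trigger could be configured with boto3, the following creates a CloudWatch alarm on average CPU utilization; the Auto Scaling group name and scaling policy ARN are placeholders for resources that would already exist in your account:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-web-fleet",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-fleet"}],
    Statistic="Average",
    Period=300,               # evaluate five-minute averages
    EvaluationPeriods=2,      # two consecutive breaches before alarming
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    # Placeholder ARN of an existing scale-out policy
    AlarmActions=["arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:..."],
)
```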
Another use of a metric is in detecting malicious network activity. When the network activity of your cloud resources is closely monitored, an anomaly may show up in a metric such as NetworkIn (which measures the number of bytes of data received by a resource on the network). An anomaly such as unusually high traffic at a particular time could mean that the resources on that network are being hit by a DDoS attack, which can lead to a number of scenarios that are damaging to the application.
Metrics are key to summarizing what is going on, attaching a label to the huge volume of events and logs received from various systems, and taking action based on that intelligence.
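As an illustration, here is a boto3 sketch that pulls hourly NetworkIn totals for one instance so that unusual spikes can be spotted or fed into an anomaly detector; the instance ID is a placeholder:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkIn",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,          # one datapoint per hour
    Statistics=["Sum"],   # total bytes received per hour
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```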
System availability
Availability, in a simple context, means that something or someone is accessible. System availability, in that same context, means that a system is available for use by the users or customers who require it. For software systems, the availability of your website, application, or API means that it is accessible to whoever needs it, whenever they need it. It could be the shopping website a customer needs to access to purchase those Nike sneakers, or a payment service API a developer needs to integrate into their system to enable users to make payments. If the customer or developer cannot access it whenever they need it, the service is termed not highly available.
To understand the availability of any system, monitoring plays a very key role. Knowing when the system is up and when it is down can be aggregated to get the system availability within a period of time. This is generally called system uptime or just uptime. The uptime of any system can be calculated as follows:
Figure 1.7 – Formula for calculating availability
In the preceding formula, we have the following:
- Total Uptime: How long the system has been available to the user or customer, in hours.
- Total Downtime: How long the system has been unavailable to the user or customer, in hours.
- Availability: The final system availability as a decimal fraction, which is then multiplied by 100 to get the percentage availability.
Another scenario to apply this: say we want to calculate the availability of an API serving third-party customers who integrate with it for Forex indices. Within a month, the API was available for a total of 300 hours. In that same month, a huge surge of traffic on the API, caused by an announcement in the news, made the API unavailable for about 3 hours. The development team then had to release an update that changed the API's functionality to cope with the surge of users, and this release cost another 4 hours of downtime, bringing the total downtime so far to 7 hours. Finally, the security team needed to inspect the logs during the monthly system maintenance window, which led to another 1 hour 30 minutes of downtime.
We can calculate the availability of this system as follows (the short sketch after this list verifies the arithmetic):
- Total Uptime = 300 hours
- Downtime1 = 3 hours
- Downtime2 = 4 hours
- Downtime3 = 1 hour 30 mins = 1.5 hours
- Total Downtime = 3 + 4 + 1.5 = 8.5 hours
- Total Uptime + Total Downtime = 300 + 8.5 = 308.5 hours
- Availability = 300 / 308.5 = 0.9724
- Availability as a percentage = 97.24%
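The arithmetic can be verified with a few lines of Python:

```python
# A quick check of the preceding calculation
total_uptime = 300.0          # hours
total_downtime = 3 + 4 + 1.5  # hours

availability = total_uptime / (total_uptime + total_downtime)
print(f"{availability:.4f}")         # 0.9724
print(f"{availability * 100:.2f}%")  # 97.24%
```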
But it is not just about this number; how do we actually interpret 97.24% availability? We might say it is a good number because it is quite close to 100%, right? There is actually more to it, and the following chart will help us understand what this number means:
Table 1.1 – Uptime chart
If we approximate the uptime of our system based on the calculation, it rounds down to 97%. Looking this value up in the preceding chart, we can see that it means the following (the short sketch after this list shows how these figures are derived):
- 10.96 days of downtime in a year
- 21.92 hours of downtime in a month
- 5.04 hours of downtime in a week
- 43.2 minutes of downtime in a day
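These figures all follow from the same downtime fraction, as this short sketch shows:

```python
# How the values in the uptime chart can be derived: the downtime
# fraction multiplied by the length of each period
availability_pct = 97.0
fraction_down = 1 - availability_pct / 100  # 0.03

hours_per_year = 24 * 365.25
print(f"{fraction_down * hours_per_year / 24:.2f} days per year")    # 10.96
print(f"{fraction_down * hours_per_year / 12:.2f} hours per month")  # 21.92
print(f"{fraction_down * 24 * 7:.2f} hours per week")                # 5.04
print(f"{fraction_down * 24 * 60:.1f} minutes per day")              # 43.2
```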
Monitoring helping us understand the availability of our system is one step. But the system being monitored is used by our customers, and customers expect it to be up and running 24/7. They are hardly interested in whatever excuses you might give for downtime, and in some cases downtime can mean losing them to your competition. Organizations therefore do well to communicate a promised level of system availability to their customers. This gives customers a level of expectation of the Quality of Service (QoS) they will receive from the organization. It also helps to boost customer confidence and gives the business a benchmark to meet.
This indicator or metric is called an SLA, an acronym for Service Level Agreement. According to Wikipedia, an SLA is a commitment between a service provider and a client. In simple terms, an SLA is the percentage of uptime a service provider promises the customer; if availability falls below that number, the customer is entitled to lay claims and receive compensation. The onus is on the service provider to ensure they do not go below the SLA that has been communicated to the customer.
Dashboards
For every event, log, or metric that is measured or collected, there is always a better way to represent the data. Dashboards present log and metric data in a visually appealing manner. A dashboard is a combination of different graphical representations of data, which could be in the form of line graphs, bar charts, histograms, scatter plots, or pie charts. These representations give the user a summarized view of the logs, which makes it easy to spot things such as trends in a graph.
When there is a rise in disk I/O, in the number of bytes of data written per second, one of the fastest ways to represent this is a line graph, whose upward slope shows the gradual rise in traffic from one point to another. If, during the night, there was a sudden spike in the memory consumption of one of the servers due to high customer usage of the service, a line graph makes it easy to spot the time of day the spike happened and when it came back down.
These, and many more, are the benefits of having a dashboard that graphically represents the data collected from the logs of a system. Metrics are also rendered as graphs for much easier interpretation. Amazon CloudWatch has a built-in dashboard feature where different types of graphs can be created and added to a dashboard based on certain specifications, or used to group related data, making it easier to understand the logs and derive meaning from the log data collected:
Figure 1.8 – Sample CloudWatch dashboard
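As a sketch of how a dashboard like this could be created programmatically, the following boto3 snippet defines a single line-graph widget; the dashboard name, region, and instance ID are placeholders:

```python
import boto3
import json

cloudwatch = boto3.client("cloudwatch")

# One time-series widget plotting average CPU for one instance
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Web server CPU",
                "metrics": [
                    ["AWS/EC2", "CPUUtilization",
                     "InstanceId", "i-0123456789abcdef0"]
                ],
                "stat": "Average",
                "period": 300,
                "region": "us-east-1",
                "view": "timeSeries",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="web-overview",
    DashboardBody=json.dumps(dashboard_body),
)
```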
Next, we will understand what an incident is.
Incidents
An incident is an event, condition, or situation that causes a disruption to the normal operation of a system or an organization. Incidents are negative for a system and are associated with reactive monitoring; they make a website, API, or application slow or unavailable to users. Different things can trigger or cause an incident, ranging from a bug that makes the application totally unavailable to a security breach in which an attacker collects sensitive user or customer information. All of these are termed incidents. Some of them can be captured by the monitoring tool, showing when the incident occurred, and can form part of the incident report documented in your organization.
It is advisable for every organization to have an incident management framework, which defines how failures, or any form of reported incident, are managed by the SRE/sysadmin team. Incidents are usually captured by the monitoring tool. When an attacker performs a brute-force attack on a Linux server and gains access, monitoring tools can pick up this activity and send an alert to the team, helping the security team investigate the issue and ensure it never occurs again. The incident framework guides every team in the organization on how to react in the event of an incident. Incidents are usually labeled according to their level of severity, in most cases SEV1, SEV2, or SEV3, meaning severity 1, severity 2, and severity 3, respectively, where the number indicates the priority or intensity of the severity.
Phewww! That was quite a lot of information, but these components are at the heart of monitoring architecture and infrastructure. We have seen how dashboards help with proactive monitoring and with understanding the infrastructure even before disaster strikes. The next thing is to look at the importance of monitoring, and how these components contribute to the different aspects of that importance.
When it comes to the value monitoring gives, there are major reasons why it sits at the very core of every system that is designed and implemented. As long as a system has been designed and is bound to face unforeseen circumstances, the importance of monitoring it can never be overstated; disregarding monitoring means that the lifetime of the system has not been put into consideration. We can also ask: fine, we want to monitor this system or that system, but what are the core values that can be derived from monitoring activities, whether we take a proactive or a reactive approach? There are key reasons to monitor a system:
- Realizing when things go south
- Ability to debug
- Gaining insights
- Sending data/notifications to other systems
- Controlling Capital Expenditure (CapEx) to run cloud infrastructure
Having listed some reasons to make monitoring part of your infrastructure and application deployments, let's expatiate on each of them with examples, to drive home the essence of every point mentioned in the preceding list.
Realizing when things go south
Organizations that do not have any kind of monitoring service suffer the embarrassment of customers being the ones to inform them of downtime. The ability to know before customers raise it is very important; it paints a bad picture when users go to social media to share negative opinions about a downtime before the company even finds out there was one. Reactive monitoring is the technique that helps here. Simple endpoint monitoring, which pings your endpoints and services from time to time and reports back, is invaluable: there are times an application might be running yet customers cannot reach it for various reasons. Endpoint monitoring can send email alerts or SMS notifications to the SRE team about a downtime before the customer makes any kind of complaint, so the issue can quickly be resolved, improving overall service availability and MTTR.
Important Note
MTTR is an acronym for Mean Time to Recover. It is the measure of how quickly a system recovers from failure.
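To illustrate the kind of endpoint monitoring described above, here is a minimal sketch using the third-party requests library and SNS for the alert; the URL and topic ARN are placeholders, and in practice such a check would run on a schedule:

```python
import boto3
import requests

ENDPOINT = "https://api.example.com/health"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:system-alerts"

sns = boto3.client("sns")

try:
    # Treat timeouts, connection errors, and 5xx responses as unhealthy
    response = requests.get(ENDPOINT, timeout=5)
    healthy = response.status_code < 500
except requests.RequestException:
    healthy = False

if not healthy:
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="Endpoint down",
        Message=f"{ENDPOINT} failed its health check.",
    )
```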
Ability to debug
When an application has a bug, be it functional or non-functional, the developer needs logs, or some way to trace a request's round trip through the application, to understand where in the flow the bottleneck lies. Without logs or a bird's-eye view of the application's behavior, it is almost impossible to debug the application and, once the problem is understood, come up with a solution. Reactive monitoring is the technique applied here: logs from the application server or web server will lead you to the bug in the system.
Gaining insights
Insight into the behavior of your system is critical to the progress, or retrogression, of your application. The insights that can be gained from an application are quite broad, ranging from the internal server components to the behavior of the network over time. These insights might not be used to fix an immediate problem, but rather to understand the state of the system from time to time. Spotting trends in the system, observing intermittent behavior, planning infrastructure capacity, and improving cost optimization are some of the activities that these insights make possible. For example, with the environment being monitored, a rogue NAT gateway that an architecture does not need can be deleted, which could save huge costs, considering what Amazon VPC NAT gateways cost, especially when they are not actively in use.
Sending data/notifications to other systems
Monitoring is not just about watching events and generating logs and traces of those events. It also involves taking action based on the logs and metrics that have been generated. Using the data or notification produced by a monitoring system, an automated recovery operation can be tied to a metric, recovering the system without any manual intervention from the SREs or sysadmins. Amazon EventBridge is a service that can also send events to third-party SaaS solutions, and CloudWatch can be configured to send triggers to Amazon EventBridge to carry out operations on systems that are not within the AWS infrastructure.
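As a sketch of this pattern, the following boto3 snippet creates an EventBridge rule that matches CloudWatch alarm state changes and routes them to an automation target; the rule name and Lambda function ARN are placeholders:

```python
import boto3
import json

events = boto3.client("events")

# Fire whenever any CloudWatch alarm transitions into the ALARM state
events.put_rule(
    Name="alarm-to-automation",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"value": ["ALARM"]}},
    }),
)

# Route matching events to an automated recovery function
events.put_targets(
    Rule="alarm-to-automation",
    Targets=[{
        "Id": "recovery-runbook",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:auto-recover",
    }],
)
```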
Controlling CapEx to run cloud infrastructure
CapEx can be managed when things are monitored. When using different AWS services, there is always the possibility of losing track of the resources being provisioned and overspending. The capital expense is what it costs to run a particular cloud service. Monitoring with a budget and a billing alarm can be a lifesaver, alerting you when you are spending above the budget set for a particular month: the bill is being monitored, and when the running services go over budget, an email alert notifies you. There are also alarms that fire at the beginning of every month to give you the spending forecast for that month.
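As a sketch of such a billing alarm with boto3: billing metrics are only published in the us-east-1 Region and must first be enabled in the account, and the 100 USD threshold and SNS topic ARN below are placeholders:

```python
import boto3

# Billing metrics live in us-east-1 regardless of where workloads run
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-budget",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,            # billing data updates a few times a day
    EvaluationPeriods=1,
    Threshold=100.0,         # placeholder monthly budget in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
)
```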
We have understood the meaning of monitoring and its historical background, from the days of the Microsoft Windows Event Viewer to the new tools that have evolved from that basic foundation. Then, we discussed the types of monitoring that can be employed and the strategies behind them. We also identified the major components that must be considered when setting up or configuring any monitoring infrastructure. Finally, we have understood the importance of monitoring, drawing on the types, components, and strategies we learned and the value each of them brings to monitoring as a whole. The next step is to introduce the monitoring service this book is based on: Amazon CloudWatch.