Logging and monitoring are essential to any IT infrastructure: they give us granular details about every event in the infrastructure, at each level. In the cloud, logging and monitoring are a bit more complex. We cannot always filter logs by IP address, because IPs are allocated dynamically: an IP that earlier represented virtual machine x may now represent virtual machine y.
Apart from this, the cloud comprises many different services, and we must ensure that activity logging is enabled for each of them.
In AWS, we can use CloudTrail to log the API activity for each service, and we can either store these logs in an S3 bucket or forward them to CloudWatch Logs.
Recently, access logs enabled on a load balancer helped us identify illegitimate traffic. Suppose we are running a financial application in an HA, autoscaled environment. Over the last few days, we have seen a peak in resource utilization. Because autoscaling was configured, this did not affect the application's performance. But when we investigated the spike, we found an attacker probing our application:
2017-10-23T00:12:54.164535Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.2.240:80 0.000038 0.001246 0.000057 404 404 0 0 "HEAD http://X.X.X.X:80/mysql/admin/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -
2017-10-23T00:12:54.294395Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.1.37:80 0.000069 0.000936 0.000051 404 404 0 0 "HEAD http://X.X.X.X:80/mysql/dbadmin/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -
2017-10-23T00:12:54.423798Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.2.240:80 0.000051 0.001275 0.000052 404 404 0 0 "HEAD http://X.X.X.X:80/mysql/sqlmanager/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -
2017-10-23T00:12:54.553557Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.1.37:80 0.000047 0.000982 0.000062 404 404 0 0 "HEAD http://X.X.X.X:80/mysql/mysqlmanager/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -
2017-10-23T00:12:54.682829Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.2.240:80 0.000076 0.00103 0.000065 404 404 0 0 "HEAD http://X.X.X.X:80/phpmyadmin/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -
In the preceding logs, you can see the attacker operating from IP 90.63.223.128, trying to break into the application by requesting different URLs and passing different headers.
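Logs like these can be mined programmatically. The following is a minimal sketch (not tied to any particular tool) that parses classic ELB access-log lines and tallies requests per client IP, so a scanner like the one above stands out immediately; two of the sample lines from the text are used as input:

```python
from collections import Counter

# Two of the sample ELB access-log lines shown above (truncated to two
# for brevity); the third space-separated field is client:port.
LOG_LINES = [
    '2017-10-23T00:12:54.164535Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.2.240:80 0.000038 0.001246 0.000057 404 404 0 0 "HEAD http://X.X.X.X:80/mysql/admin/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -',
    '2017-10-23T00:12:54.294395Z ASP-SaaS-Prod-ELB 90.63.223.128:46838 172.31.1.37:80 0.000069 0.000936 0.000051 404 404 0 0 "HEAD http://X.X.X.X:80/mysql/dbadmin/ HTTP/1.1" "Mozilla/5.0 Jorgee" - -',
]

def client_ip(line: str) -> str:
    # Take the client:port field and strip the port.
    return line.split()[2].rsplit(":", 1)[0]

# Requests per client IP; a scanner shows up as an outlier here.
hits = Counter(client_ip(line) for line in LOG_LINES)
```

In practice you would stream the log objects from the S3 bucket where the load balancer delivers them, but the parsing logic stays the same.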
To prevent this, we enabled WAF and blocked the malicious traffic from the outside world. You can also make WAF learn about this malicious traffic, so that whenever such a request comes in, WAF rejects the packet rather than letting it pass through.
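As a sketch of what that block looks like in practice, the snippet below builds the `Updates` entry that AWS WAF Classic's `update_ip_set` API expects in order to block a single source IP. It only constructs the request payload; in a real account you would pass it to boto3's WAF client along with an `IPSetId` and a change token obtained from the live API:

```python
def block_ip_update(ip: str) -> dict:
    """Build one Updates entry for WAF Classic update_ip_set().

    A /32 CIDR blocks exactly this source address. The IPSetId and
    ChangeToken needed for the real call come from the live API.
    """
    return {
        "Action": "INSERT",
        "IPSetDescriptor": {"Type": "IPV4", "Value": f"{ip}/32"},
    }

# Block the scanner seen in the access logs above.
update = block_ip_update("90.63.223.128")
```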
For monitoring, you must define metrics and alarms. This helps us take preventive action: if anything deviates from your expectations, you get an alarm and can take appropriate action to mitigate the risk:
Alarm Details:
- Name: awsrds-dspdb-CPU-Utilization
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint [51.605 (24/10/17 07:02:00)] was greater than or equal to the threshold (50.0).
- Timestamp: Tuesday 24 October, 2017 07:07:55 UTC
- AWS Account: XXXXXXX
Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 50.0 for 300 seconds.
Monitored Metric:
- MetricNamespace: AWS/RDS
- MetricName: CPUUtilization
- Dimensions: [DBInstanceIdentifier = aspdb]
- Period: 300 seconds
- Statistic: Average
- Unit: not specified
In the preceding example, we defined an alarm on CPU utilization at the RDS level. We receive this alert when CPU utilization is above 50% but still below 70%. As soon as we got the alert, we started investigating what had caused the high CPU utilization.
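The evaluation the alarm performs can be sketched in a few lines: over one 300-second period, CloudWatch computes the chosen statistic (here, `Average`) and applies the comparison operator against the threshold. This is an illustration of that logic, not the CloudWatch implementation:

```python
import operator

# Comparison operators as CloudWatch names them.
COMPARATORS = {
    "GreaterThanOrEqualToThreshold": operator.ge,
    "GreaterThanThreshold": operator.gt,
    "LessThanOrEqualToThreshold": operator.le,
    "LessThanThreshold": operator.lt,
}

def alarm_state(samples, threshold=50.0,
                comparison="GreaterThanOrEqualToThreshold"):
    """Evaluate one period: average the samples, compare to threshold."""
    average = sum(samples) / len(samples)
    return "ALARM" if COMPARATORS[comparison](average, threshold) else "OK"
```

With the datapoint from the alert above, `alarm_state([51.605])` evaluates to `ALARM`, which matches the state change in the notification.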
Now, let's look at the summarized security risks and preventive actions at the different levels of the cloud:
- Hypervisor level: In the cloud, our VMs run on shared resources. There could be a host running both VM x and VM y; if VM x is compromised or hacked, there is a risk of VM y being compromised as well. Fortunately, resource isolation makes this unlikely, but what if the attacker gains access to the host itself? So, we must apply the required security patches to the hypervisor and ensure that all security parameters are configured at the VM level. Most compromises of this kind happen when the underlying security parameters have been disabled, which occurs mostly in private clouds. At the hypervisor level, we also segregate traffic at the vSwitch level, where at least the management, guest, and storage traffic must run on separate VLANs.
- Network level: The network is the backbone of the cloud. If the network is compromised, it can completely break down the cloud. The most common attacks on the network are DDoS, network eavesdropping, illegal invasion, and so on. To secure the network, we must define the following:
- Isolation of traffic (management, storage, and guest)
- ACL for network traffic
- Ingress and egress rules must be clearly defined
- IDS and IPS must be enabled to control the intrusion
- Antivirus and antispam engines should be enabled to scan the packets
- Network monitoring must be configured to track the traffic
- Storage level: Storage is also a critical component of the cloud, as it is where we keep our critical data. Here, we face the risks of data loss, data tampering, and data theft. At the storage level, we must ensure the following to maintain the security and integrity of data:
- All the data at rest must be encrypted
- Backup must be provisioned
- If possible, enable data replication to mitigate the risk of hardware failure
- User roles and data access policy must be defined
- A DLP mechanism should be enabled
- All the data transaction should happen using encrypted channels
- Access logs should be enabled
- VM level: At the VM level, we can have the risk of password compromise, virus infection, and exploited vulnerabilities. To mitigate this, we must ensure the following:
- OS-level security patches must be deployed from time to time
- Compromised VMs must be stopped instantly
- Backup should be provisioned using continuous data protection (CDP) or using a snapshot
- Antivirus and antispam agents should be installed
- User access should be clearly defined
- If possible, define key-based authentication instead of passwords
- The OS must be hardened and the OS-level firewall and security rule should also be enabled
- Logs management and monitoring must be enabled
- User level: User identity and access are critical for every cloud. We must clearly define users, groups, roles, and access policies; this is the basis of cloud security, because it is where we authorize users to work with the infrastructure and services. If identity and access are not clearly defined, it can lead to disaster at any time. To ensure security, we must define the following:
- Users, groups, roles, and access policies
- Enable MFA for user authentication
- The password policy and access key must be defined
- Make sure that the users are not accessing the cloud using the root account
- Logs must be enabled for audit purposes
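Two of the points above, MFA and access policies, meet in a common IAM pattern: a policy that denies every action unless the request was authenticated with MFA. The sketch below builds such a policy document in Python; the statement ID is illustrative, while the `aws:MultiFactorAuthPresent` condition key and the `BoolIfExists` operator are standard IAM constructs:

```python
import json

# Deny everything when the request was not MFA-authenticated.
# BoolIfExists also catches requests where the key is absent entirely
# (for example, some long-term access-key requests).
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllWithoutMFA",  # illustrative statement ID
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

# IAM accepts the policy as a JSON document.
policy_json = json.dumps(deny_without_mfa, indent=2)
```

Attached to a group, this forces every member to sign in with MFA before any other permission they hold takes effect.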
- Application level: Once your application is hosted and open to public access, the real challenge is maintaining the availability and accessibility of the service. Here, you will face DDoS, SQL injection, man-in-the-middle attacks, cross-site scripting, and so on. To prevent such attacks, we must use the following:
- Scalable DNS
- Load balancer
- Provision autoscaling
- SSL
- WAF
- User IAM policies and roles
- Compliance: If you have to meet a compliance standard, such as ISO 27001, PCI DSS, or HIPAA, then you must follow its guidelines and design your solutions accordingly. We will read about compliance in the last chapter and learn how to meet these requirements.
While designing a solution, always assume you are designing for failure: identify all single points of failure and find appropriate solutions for them. Also, while designing for the cloud, always consider security, reliability, performance, and cost efficiency, as these factors have a huge impact on your solution as well as your organization.