Exploring the elements of the WAF
Cost optimization, operational excellence, performance efficiency, reliability, and security are the five pillars of the WAF. The elements of the WAF are distinct from these pillars: if we place the WAF at the center, there are six supporting elements around it. These elements support the pillars with the principles and datasets required for the assessment.
As you know, the WAF is a set of best practices developed by Microsoft; these best practices are categorized into five interconnected pillars. The question is: where exactly are these best practices recorded? After all, the practices have to exist somewhere before they can be categorized into pillars. This is where the elements come into the picture: they act as the support structure for the pillars.
As per Microsoft’s documentation, the supporting elements for the WAF are the following:
- Azure Well-Architected Review
- Azure Advisor
- Documentation
- Partners, support, and service offers
- Reference architecture
- Design principles
Let's look at each of these elements in turn, starting with the Azure Well-Architected Review.
Azure Well-Architected Review
An assessment of the workload is required to create a remediation plan; there is no way around it. The Well-Architected Review is a set of questions prepared by Microsoft to understand the processes and practices in your environment, with a separate questionnaire for each pillar of the WAF. For example, the questionnaire for cost optimization contains questions related to Azure Reserved Instances, tagging, Azure Hybrid Benefit, and so on, while the operational excellence questionnaire has questions related to DevOps practices and approaches. Each question has several possible answers, ranging from recommended to non-recommended methods. Customers answer based on their environment, and the system generates a plan with recommendations that can be implemented to align their environment with the WAF.
The review can be taken by anyone from the Microsoft Assessments portal (https://docs.microsoft.com/en-us/assessments/?mode=home). In the portal, you must select Azure Well-Architected Review, as shown in the following screenshot:
Figure 1.2 – Accessing Microsoft Assessments
Once you select Azure Well-Architected Review, you will be presented with a popup asking whether you want to create a new assessment or create a milestone. To start fresh, choose New Assessment; to continue an existing assessment, choose Create a milestone. We won't conduct an assessment at this point; each pillar of the WAF has its own dedicated chapter, and we will perform the assessment there.
With that, we will move on to the next element of the framework, which is Azure Advisor.
Azure Advisor
If you have worked on Microsoft Azure, you will know that Azure Advisor is the personalized cloud consultant developed by Microsoft for you. Azure Advisor can generate recommendations for you, and you can leverage this tool to improve the quality of workloads. Looking at Figure 1.3, we can see that the recommendations are categorized into different groups, and the group names are the same as the pillars of the WAF:
Figure 1.3 – Azure Advisor
With the help of Azure Advisor, you can do the following:
- Get best practices and recommendations aligned to the pillars of the WAF
- Enhance the cost optimization, performance, reliability, and operational excellence of workloads using actionable recommendations, thus improving the quality of the workloads
- Postpone recommendations if you don’t want to act immediately
Advisor has a score based on the number of actionable recommendations; this score is called Advisor Score. If the score is lower than 100%, that means there are recommendations, and we need to remediate them to improve the score. As you can see in Figure 1.3, the Advisor Score total for the environment is 81%, and the Score by category values are on the right side.
The good thing about Azure Advisor is that recommendations will be generated as soon as you start using the subscription. You don’t have to deploy any agents, make any additional configurations, or pay to use the Advisor service. The recommendations are generated with the help of machine learning (ML) algorithms based on usage, and they will also be refreshed periodically. Advisor can be accessed from the Azure portal, and it has a rich REST API if you prefer to retrieve the recommendations programmatically and build your own dashboard.
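If you prefer the programmatic route mentioned above, the following is a minimal sketch of pulling recommendations from the Advisor REST API with Python. The azure-identity package and the api-version value are assumptions; check the Azure Advisor REST API reference for the current version.

```python
# A minimal sketch of listing Advisor recommendations via the REST API.
# Assumed: azure-identity and requests are installed; api-version may differ.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    "/providers/Microsoft.Advisor/recommendations"
)
response = requests.get(
    url,
    headers={"Authorization": f"Bearer {token.token}"},
    params={"api-version": "2020-01-01"},
)
response.raise_for_status()

# Group recommendations by their WAF-aligned category (Cost, Security, and so on).
for item in response.json().get("value", []):
    props = item.get("properties", {})
    print(props.get("category"), "-", props.get("shortDescription", {}).get("problem"))
```

The same data is what drives the Advisor Score and the portal blades, so a sketch like this is mainly useful for building your own dashboards or feeding recommendations into other tooling.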
In the coming chapters, we will be relying a lot on Azure Advisor for collecting recommendations for each of the pillars.
Now that we have covered the second element of the WAF, let’s move on to the next one.
Documentation
Microsoft's documentation does an excellent job of helping people who are new to Azure. All content related to the WAF is available at https://docs.microsoft.com/en-us/azure/architecture/framework/. As a matter of fact, this book is a demystified version of this documentation with additional examples and real-world scenarios.
The WAF documentation is comprehensive and well maintained, but for a beginner, the amount of information can be overwhelming. This book distills the key insights and essentials from the documentation, providing you with everything you need to get started. The following screenshot shows the documentation for the framework:
Figure 1.4 – WAF documentation
As you can see in the preceding screenshot, the contents are organized according to the pillars, and the documentation concludes with steps to implement the recommendations. You could call this the Holy Bible of the WAF: everything related to the framework is found here, and we strongly recommend bookmarking the link to stay updated.
All documentation for Azure is available at https://docs.microsoft.com/en-us/azure/?product=popular. The documentation covers how to get started, the CAF, and the WAF, and includes learning modules and product manuals for every Azure service. Apart from the documentation, this site offers sample code, tutorials, and more. Regardless of the language you write your code in, Azure documentation provides SDK guides for Python, .NET, JavaScript, Java, and Go. On top of that, documentation is also available for scripting languages such as PowerShell, the Azure CLI, and infrastructure as code (IaC) solutions such as Bicep, ARM templates, and Terraform.
Partners, support, and service offers
Deploying complex solutions by adhering to the best practices can be challenging for new customers. This is where we can rely on Microsoft partners. The Microsoft Partner Network (MPN) is massive, and you can leverage Azure partners for technical assistance and support to empower your organization. You can find Azure partners and Azure Expert Managed Service Providers (MSPs) at https://azure.microsoft.com/en-us/partners/. MSPs can aid with automation, cloud operations, and service optimization. You can also seek assistance for migration, deployment, and consultation. Based on the service you are working with and the region you belong to, you can find a partner with the required skills close to you.
Once the partner deploys the solution, there may be break-fix issues that you need assistance with. Microsoft Support can help you with any break-fix scenarios. For example, if one of your VMs is unavailable or a storage account is inaccessible, you can open a support request. Billing and subscription support is free of cost and does not require you to purchase any support plans. However, for technical assistance, you need to purchase a support plan. A quick comparison of these plans is shown in the following table:
| | Basic | Developer | Standard | ProDirect |
| --- | --- | --- | --- | --- |
| Price | Free | $29/month | $100/month | $1,000/month |
| Scope | All Azure customers | Trial and non-production environments | Production workloads | Mission-critical workloads |
| Billing support | Yes | Yes | Yes | Yes |
| Number of support requests | Unlimited | Unlimited | Unlimited | Unlimited |
| Technical support | No | Yes | Yes | Yes |
| 24/7 support | N/A | During business hours via email only | Yes (email/phone) | Yes (email/phone) |
Table 1.1 – Comparison of Azure support plans
A full comparison is available at https://azure.microsoft.com/en-us/support/plans/. The Developer plan can only open Severity C cases with Microsoft Support; to open Severity B or Severity A cases, you must have a Standard or ProDirect plan. Severity C has a response SLA of 8 business hours and is meant for issues with minimal business impact, while Severity B is for moderate impact with an SLA of 4 hours. If the case opened is a Severity A case, then the SLA is 1 hour; Severity A is reserved for critical business impact issues where production is down. Having a ProDirect plan offers extra perks to customers, such as training, a dedicated ProDirect manager, and operations support. The ProDirect plan also has a Support API that customers can use to create support cases programmatically. For example, if a VM is down, by combining the power of Azure alerts and action groups, we can make a call to the Support API to create a request automatically.
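To illustrate that automation scenario, here is a hedged sketch of creating a support ticket through the Microsoft.Support REST API from Python, for example from a function triggered by an action group. The api-version, payload fields, and placeholder IDs are assumptions; verify them against the Azure Support REST API reference before relying on this.

```python
# A hedged sketch of creating a support ticket programmatically.
# Assumed: azure-identity and requests installed; api-version and payload fields
# are illustrative and must be checked against the Microsoft.Support API docs.
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"
ticket_name = "vm-down-auto-ticket-001"
token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/providers/Microsoft.Support/supportTickets/{ticket_name}"
)
payload = {
    "properties": {
        "severity": "moderate",
        "title": "Virtual machine unavailable",
        "description": "VM web-01 stopped responding to health probes.",
        # serviceId and problemClassificationId come from the Support API's
        # services/problemClassifications lists (placeholders here).
        "serviceId": "<service-id>",
        "problemClassificationId": "<problem-classification-id>",
        "contactDetails": {
            "firstName": "Ops",
            "lastName": "Team",
            "primaryEmailAddress": "ops@example.com",
            "preferredContactMethod": "email",
            "preferredTimeZone": "UTC",
            "country": "USA",
            "preferredSupportLanguage": "en-US",
        },
    }
}
response = requests.put(
    url,
    headers={"Authorization": f"Bearer {token.token}"},
    params={"api-version": "2020-04-01"},
    json=payload,
)
print(response.status_code)
```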
In addition to these plans, there is a Unified/Premier contract that sits above the ProDirect plan and is ideal for customers who want to cover Azure, Microsoft 365, and Dynamics 365. Microsoft Support is available in English, Spanish, French, German, Italian, Portuguese, Traditional Chinese, Korean, and Japanese to support global customers. Keep in mind that the plans cannot be transferred from one customer to another. Based on your requirements, you can purchase a plan, and you will be billed monthly.
Service offers deal with the different subscription types available to customers. There are different types of Azure subscriptions with different billing models, and a complete list of available offers is published at https://azure.microsoft.com/en-in/support/legal/offer-details/. When it comes to organizations, the most common options are Enterprise Agreement (EA), Cloud Solution Provider (CSP), and Pay-As-You-Go; these are commercial subscriptions. Organizations deploy their workloads in these subscriptions and are charged based on consumption, and how they get charged depends solely on the offer type. For example, EA customers make an upfront payment and utilize the credits for Azure; any charges above the credit limit are invoiced as an overage. Both Pay-As-You-Go and CSP customers get monthly invoices; in CSP, the invoice is generated by the partner, whereas in Pay-As-You-Go, the invoice comes directly from Microsoft.
There are other types of subscriptions used for development, testing, and learning purposes, such as Visual Studio subscriptions, Azure Pass, Azure for Students, the Free Trial, and so on. However, these are credit-based subscriptions and are not backed by SLAs, so they should not be used for hosting production workloads.
The next element we are going to cover is reference architecture.
Reference architecture
If you write code, you have probably come across a scenario where you cannot resolve an error yourself and you find the solution on Stack Overflow or another forum. Reference architectures serve a similar purpose: Microsoft provides guidance on how the architecture should be implemented. With the help of a reference architecture, we can design scalable, secure, reliable, and optimized applications by following a defined methodology.
Reference architectures are part of the application architecture fundamentals, which comprise a series of steps in which we decide on the architecture style, the technology choices, the application architecture, and, finally, alignment with the WAF. These decisions drive the architecture, design, and implementation. The following diagram shows the series of steps:
Figure 1.5 – Application architecture fundamentals
In the preceding diagram, you can see that the first choice is the architectural style, and this is the most fundamental thing we must decide on. For example, we could take a three-tier application approach or go for a microservices architecture.
Once that’s decided, then the next decision is about the services involved. Let’s say your application is a three-tier application and has a web frontend. This frontend can be deployed in Azure Virtual Machines, Azure App Service, Azure Container Instances, or even Azure Kubernetes Service (AKS). Similarly, for the data store, we can decide whether we need to go for a relational or non-relational database. Based on your requirements, you can select from a variety of database services offered by Microsoft Azure. Likewise, we can also choose the service that will host the mid-tier.
After selecting the technology, we need to define the application architecture. This is the stage at which we shape the overall design, building on the architectural style and services chosen in the previous stages. Microsoft has several design principles and reference architectures that can be leveraged at this stage. We will cover the design principles in the next section.
The reference architectures can be accessed from https://docs.microsoft.com/en-us/azure/architecture/browse/?filter=reference-architecture, and this is a good starting point for the architecture of your solution. You might find an exact match for your requirements; if not, these architectures can be tweaked as needed. Since these architectures are developed by Microsoft with the WAF pillars in mind, you can deploy with confidence, as these solutions are scalable, secure, and reliable. The following screenshot shows the portal for viewing reference architectures:
Figure 1.6 – Browsing reference architectures
The portal offers filtering by product and category. From hundreds of reference diagrams, you can filter and find the one that matches your requirements. For example, a simple search for 3d video rendering returns two reference architectures, as shown in the following screenshot:
Figure 1.7 – Filtering reference architectures
Clicking on a reference architecture takes you to a complete explanation of the architecture components, data flow, potential use cases, considerations, and best practices aligned with the WAF. The best part is the Deploy to Azure button, which lets you deploy the solution to Azure directly. The advantage is that the architecture is already aligned with the WAF, so you don't have to spend time assessing the solution again.
With that, let’s move on to the last element of the WAF—design principles.
Design principles
In Figure 1.5, we saw that reference diagrams and design principles are part of the third stage of application architecture fundamentals. In the previous section, we saw how we can use the reference architecture, and now we will see how to leverage the design principles. There are 11 design principles you should incorporate into your design discussions. Let’s understand each of the design principles.
Design for self-healing
Failures can happen in the cloud, just as they do on-premises. We need to acknowledge this fact; the cloud is not a silver bullet for every issue you faced on-premises, although it does offer massive advantages over on-premises infrastructure. The bottom line is that failures happen: hardware can fail and network outages can occur. While designing mission-critical workloads, we need to anticipate these failures and design for self-healing. We can take a three-branched approach to tackling failure:
- Track and detect failures
- Respond to failures gracefully
- Log and monitor failures to build insights and telemetry
The way you respond to failures will depend entirely on your services and availability requirements. For example, suppose you have a database and would like to fail over to a secondary region during a primary region outage. Setting up replication will sync your data to the secondary region and fail over whenever the primary region fails to serve the application. Keep in mind that replicating data to another region can be more expensive than keeping the database in a single region.
Regional outages are generally uncommon, but you should still consider this scenario while designing for self-healing. Your main focus should be on handling hardware failures, network outages, and similar issues because they are far more common and can affect the uptime of your application. Microsoft provides recommendations on how to design for self-healing in the form of design patterns. The recommended patterns are presented here:
- Circuit breaker
- Bulkhead
- Load leveling
- Failover
- Retry
As mentioned at the beginning of this chapter, design patterns are not within the scope of this book; fortunately, all of the patterns are documented at https://docs.microsoft.com/en-us/azure/architecture/patterns/. Let's move on to the next design principle.
Make all things redundant
Single points of failure (SPOFs) in an architecture can be eliminated by adding redundancy. Earlier, we discussed RAID storage in the Reliability subsection of the What are the pillars of the WAF? section, where multiple disks are used to improve data redundancy. Azure has different redundancy options based on the service that you are using. Here are some of the recommendations:
- Understand the business requirements: Redundancy is directly proportional to complexity and cost, and not every solution requires you to set up redundancy. If your business demands a higher level of redundancy, be prepared for the cost implications and complexity, and the demand should be justifiable. If not, you will end up with a higher cost than you budgeted for.
- Use a load balancer: A single VM is a SPOF and is not recommended for hosting mission-critical workloads. Instead, deploy multiple VMs and place them behind a load balancer. On top of that, you can consider deploying the VMs across multiple availability zones for improved SLAs and availability. Once the VMs are behind the load balancer, health probes verify whether a VM is available before routing user requests to it (a minimal probe endpoint sketch follows this list).
- Database replication: PaaS solutions such as Azure SQL Database and Cosmos DB have out-of-the-box replication within the same region. In addition to that, you can replicate the data to another region with the help of the geo-replication feature. If the primary region goes down, the database can fail over to the secondary region for any read or write requests.
- Database partitioning: With the help of database partitioning, we can improve the scalability as well as the availability of the data. If one shard goes down, only a subset of total transactions will be affected; meanwhile, other shards are still reachable.
- Multi-region deployment: Regional outages are uncommon; however, we need to account for regional failure as well, based on the application requirements. Deploying the infrastructure to multiple regions can help improve application availability during regional outages. With the help of Azure Traffic Manager and its priority routing, we can fail over to the secondary region if the health probe fails.
- Coordinate failover: As we discussed in the previous point, we can fail over the frontend using Azure Traffic Manager; however, we also need to ensure that database transactions are synchronized to the secondary region and ready to fail over. When the frontend fails over to the secondary region, the database failover must be coordinated with it. Depending on the data store that you are using, the failover process may vary.
- Plan for manual failback: With the help of Traffic Manager, we can perform automatic failover using health probes, but don't opt for automatic failback. When the primary region recovers from an outage, not all services may be back up and running. For example, let's say the frontend service in the primary region is back online, but the database is still recovering. Automatic failback would see that the frontend is up and start the failback even though the database has not recovered yet. Hence, it's recommended to go with manual failback so that you can verify that all services are back online and that the data is consistent, resolving any database conflicts.
- Plan redundancy for Traffic Manager: We rely on Azure Traffic Manager for routing traffic in case of regional failure; having said that, the Traffic Manager service can also face downtime. Make sure that you review the SLA of the Traffic Manager service, and if you require more redundancy, consider adding other traffic management solutions as a contingency plan. In case of Traffic Manager failure, we can route the request to the other traffic management solution by repointing our DNS records.
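As a companion to the load balancer recommendation above, the following is a minimal sketch of a health probe endpoint that a backend VM could expose; the port, path, and dependency checks are illustrative assumptions.

```python
# A minimal health-probe endpoint sketch using only the Python standard library.
# A load balancer's health probe would call GET /health on each backend VM;
# a non-200 response takes that instance out of rotation.
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_healthy() -> bool:
    # Placeholder: check database connectivity, disk space, downstream APIs, etc.
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and dependencies_healthy():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(503)
            self.end_headers()
            self.wfile.write(b"UNHEALTHY")


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```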
With that, let’s learn about the next design principle—minimize coordination.
Minimize coordination
This principle applies to services such as Storage, SQL Database, and Cosmos DB, where we reduce the coordination between application services to achieve scalability. The key concepts of this design principle are mostly aligned with data concepts that are outside the scope of this book. The following are the recommendations provided by Microsoft for this design principle:
- Consider using the Compensating Transaction pattern
- Use domain events to synchronize state
- Use Command and Query Responsibility Segregation (CQRS) and event-sourcing patterns
- Partition data
- Design idempotent operations
- Consider using async parallel processing
- Use parallel distributed algorithms
- Improve coordination using leader election
An in-depth explanation of these recommendations is available at https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/minimize-coordination.
Design to scale out
On-premises, one of the main issues is capacity constraints. Traditional data centers have capacity limits, whereas the advantage of the cloud is that it offers elastic scaling. In simpler terms, we can provision workloads as required without the need to pre-provision or buy capacity. There are two types of scaling, as follows:
- Vertical scaling: Changing the CPU, memory, and other specifications of the resource; this is more of a resizing operation. This type of scaling causes the service to reboot. Increasing the size is called scaling up, and reducing the size is called scaling down.
- Horizontal scaling: This is where autoscaling comes into context. In horizontal scaling, the number of instances is increased or decreased based on the demand. As there is no change to the initial instance, rebooting is not required, and this process of increasing or decreasing can be automated. Increasing the number of instances is called scaling out, and decreasing the number of instances is called scaling in.
Now that we know the types of scaling, as the name suggests, we need to design for scaling out so that the instances are automatically increased based on the demand. The following recommendations are provided for this design principle:
- Disable session affinity: Load balancers have a feature where we can enable session stickiness or session affinity. If we enable this feature, requests from the same client are routed to the same backend server. If there is heavy traffic from a user, the load will not be distributed due to the stickiness, and a single server needs to handle that. Hence, consider avoiding session affinity.
- Find performance bottlenecks: Scaling out is not a silver bullet for all performance issues; sometimes, performance bottlenecks are due to the application code itself. Adding more servers won’t solve these problems, so you should consider debugging or optimizing the code. Secondly, if there is a database performance issue, adding more frontend servers won’t help. You need to troubleshoot the database and understand the issue before choosing to scale out.
- Identify scaling requirements: As mentioned in the previous point, different parts or tiers of your application have different scaling requirements. For example, the way the frontend scales is not the same as the way a database scales. Identify the requirements and set up scaling as required for each application component.
- Offload heavy tasks: Consider moving tasks that require a lot of CPU or I/O to background jobs where possible. By doing this, the servers that are taking care of user requests will not be overwhelmed.
- Use native scaling features: Autoscaling is supported by most Azure compute resources. Scaling can be triggered by metrics or run on a schedule. It's recommended that you set up autoscaling using metrics (CPU, memory, network, and so on) if the load is unpredictable. On the other hand, if the load is predictable, you can set up scaling based on a schedule. A simplified scaling-decision sketch follows this list.
- Scale aggressively for mission-critical workloads: Set up autoscaling aggressively for mission-critical workloads, as we need to add more instances quickly when demand increases. It's recommended that you start scaling a bit earlier than the tipping point to stay ahead of the demand.
- Design for scaling in: Just as we scale out, we should design for scaling in. While scaling out, we are increasing the number of instances based on demand; once the demand is gone, we need to deallocate the extra instances that are added during the scaling event. If we don’t set up scale-in, the additional instances will keep on running and will incur additional charges.
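To make the metric-based scaling recommendation more concrete, here is a simplified sketch of the kind of scale-out/scale-in decision the platform's autoscale engine makes for you; the thresholds and instance limits are illustrative assumptions.

```python
# A simplified sketch of metric-based scale-out/scale-in logic, similar in spirit
# to what a managed autoscale engine does. Thresholds and limits are illustrative.
from dataclasses import dataclass


@dataclass
class AutoscalePolicy:
    scale_out_cpu: float = 70.0   # scale out above this average CPU %
    scale_in_cpu: float = 30.0    # scale in below this average CPU %
    min_instances: int = 2        # keep redundancy even at low load
    max_instances: int = 10       # respect cost and service limits


def desired_instance_count(current: int, avg_cpu: float, policy: AutoscalePolicy) -> int:
    if avg_cpu > policy.scale_out_cpu:
        return min(current + 1, policy.max_instances)
    if avg_cpu < policy.scale_in_cpu:
        return max(current - 1, policy.min_instances)
    return current


if __name__ == "__main__":
    policy = AutoscalePolicy()
    print(desired_instance_count(current=3, avg_cpu=85.0, policy=policy))  # 4 (scale out)
    print(desired_instance_count(current=3, avg_cpu=20.0, policy=policy))  # 2 (scale in)
```

Note how the floor of two instances preserves redundancy and the ceiling keeps you within cost and service limits, mirroring the scale-out and scale-in recommendations above.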
Now that you are familiar with the scale-out design, let’s shift the focus to the next item on the list.
Partition around limits
In Azure, we have limits for each resource. Some of the limits are hard limits, while others are soft limits. If the limit is a soft limit, we can reach out to Microsoft Support and increase the limit as required. When it comes to scaling, there is also a limit imposed by Microsoft for every resource. If your system is growing tremendously, you will eventually reach the upper limit of the resource. These limits include the number of compute cores, database size, storage throughput, query throughput, network throughput, and so on. In order to efficiently overcome the limits, we need to use partitioning. Earlier, we discussed how we can use data partitioning to improve the scalability and availability of data. Similarly, we can use partitioning to work around resource limits.
There are numerous reasons a system can be partitioned to avoid limits, such as the following:
- To avoid limits on database size, number of concurrent sessions, or data I/O of databases
- To avoid limits on the number of messages or the number of concurrent connections of a storage queue or message bus
- To avoid limits on the number of instances supported on an App Service plan
In the case of databases, we can partition vertically, horizontally, or functionally. Just to give you an idea, let’s have a closer look at this:
- In vertical partitioning, frequently accessed fields are stored in one partition, while less frequent ones are in a different partition. For example, customer names are stored in one partition that is frequently accessed by the application while their emails are stored in a different partition as they are not frequently accessed.
- Horizontal partitioning is basically sharding, where each partition holds a subset of the total data (see the sketch after this list). For example, the names of all cities starting with A-N are stored in one partition, while those starting with O-Z are stored in another partition.
- As the name suggests, functional partitioning is where the data is partitioned based on the context or type of data. For example, one partition stores the stock-keeping unit (SKU) of the products while the other one stores customer information.
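The following is a minimal sketch of the horizontal partitioning (sharding) example above: a shard key routes each record to one of two partitions, so no single store has to hold or serve the full dataset. The shard names and A-N/O-Z split are illustrative.

```python
# A minimal sharding sketch: map a shard key (the city name) to a partition.
def shard_for_city(city_name: str) -> str:
    first_letter = city_name.strip().upper()[0]
    return "shard-a-n" if "A" <= first_letter <= "N" else "shard-o-z"


cities = ["Amsterdam", "Oslo", "Bengaluru", "Zurich"]
for city in cities:
    print(city, "->", shard_for_city(city))
# Amsterdam -> shard-a-n, Oslo -> shard-o-z, Bengaluru -> shard-a-n, Zurich -> shard-o-z
```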
A full list of recommendations is available here: https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/partition. The next design principle we are going to cover is design for operations.
Design for operations
With the cloud transformation, the regular IT chores of managing hardware and data centers are long gone. The IT team is no longer responsible for data center management, as it is handled by the cloud provider. Having said that, the IT team or the operations team is still responsible for deploying, managing, and administering the resources deployed in the cloud. Some key areas that the operations team should handle include the following:
- Deployment: The provisioning of resources is considered deployment, and this is one of the key responsibilities of the operations team. It's recommended that you use an IaC solution to deploy services. Using these tools helps reduce human error and makes replicating the environment easy, as templates are reusable and repeatable (a minimal deployment sketch follows this list).
- Monitoring: Once the solution is deployed, it's very important that the operations team monitors the solution for failures, performance bottlenecks, and availability. A monitoring system can detect anomalies and notify administrators before issues turn into bigger problems. The operations team needs to set up log collection from all services, and the collected logs need to be stored for insights and analysis.
- Incident response: As mentioned earlier, we need to acknowledge the fact that failures can happen in the cloud, and if it’s a platform issue, the operations team needs to raise a ticket with Microsoft Support. Internally, the operations team can use an IT service management (ITSM) solution to create incidents and assign them to different teams for resolution or investigation.
- Escalation: If the initial analysis is not yielding any results, there should be processes in place to escalate the issue to the stakeholders and find a resolution. The operations team can have different tiers within the organization that handle different issues; further, they can collaborate with Microsoft Support for issues that require engineering intervention and bug fixes.
- Security auditing: Auditing is very important to make sure that the environment is secure. With the help of security information and event management (SIEM) solutions, we can collect data from different sources and analyze it. The operations team can collaborate with external auditors if they lack the necessary skills to perform security auditing. For example, consider using Microsoft Defender for Cloud and acting on its recommendations. In addition to that, we can use Microsoft Sentinel to collect data from different sources for analysis and investigation.
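As a companion to the deployment recommendation above, here is a hedged sketch of an IaC-style deployment using the Azure SDK for Python; the package names (azure-identity, azure-mgmt-resource), resource group, and inline template are assumptions to adapt to your environment.

```python
# A hedged sketch of a repeatable template deployment with the Azure SDK for Python.
# Assumed packages: azure-identity, azure-mgmt-resource; names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# An inline ARM template describing a storage account (illustrative only;
# storage account names must be globally unique).
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Storage/storageAccounts",
            "apiVersion": "2021-09-01",
            "name": "examplestorage001",
            "location": "eastus",
            "sku": {"name": "Standard_LRS"},
            "kind": "StorageV2",
        }
    ],
}

# Repeatable, incremental deployment into an existing resource group.
poller = client.deployments.begin_create_or_update(
    "example-rg",
    "example-deployment",
    {"properties": {"mode": "Incremental", "template": template}},
)
print(poller.result().properties.provisioning_state)
```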
A list of recommendations shared by Microsoft can be reviewed at https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/design-for-operations. With that, we will move on to the next design principle.
Use PaaS services
Unlike on-premises, the cloud offers different service models such as IaaS, PaaS, and Software-as-a-Service (SaaS). Here, we will discuss IaaS and PaaS, as SaaS is a complete solution managed by the cloud provider, where the end customer doesn't manage the code.
In IaaS, the cloud provider takes care of the infrastructure (physical servers, network, storage, hypervisor, and so on), and the customer can create a VM on top of this hardware. Microsoft is not responsible for maintaining the VM OS; it is the customer's duty to update, patch, and maintain the OS and the application code. In contrast, in PaaS, the cloud provider offers a hosting environment where the infrastructure, OS, and framework are managed by Microsoft. The only thing that the customer needs to do is push their code to the PaaS service, and it's up and running. Developers can be more productive and write their code without having to worry about the underlying hardware or its maintenance.
The design principle recommends using PaaS services instead of IaaS whenever possible. IaaS is only recommended if you require more control over the infrastructure, but if you simply require a reliable environment and ease of management, then PaaS is right for you. Table 1.2 shows some of the IaaS replacements for popular caches, queues, databases, and web solutions in Azure:
| Instead of running (IaaS) | Consider deploying (PaaS) |
| --- | --- |
| Active Directory | Azure AD |
| RabbitMQ | Azure Service Bus |
| SQL Server | SQL Database |
| Hadoop | Azure HDInsight |
| PostgreSQL/MySQL | Azure Database for PostgreSQL/Azure Database for MySQL |
| IIS/Apache/NGINX | Azure App Service |
| MongoDB/Cassandra/Gremlin | Cosmos DB |
| Redis | Azure Cache for Redis |
| File Share | Azure File Share/Azure NetApp Files |
| Elasticsearch | Azure Cognitive Search |
Table 1.2 – IaaS-to-PaaS considerations
This is not a complete list; there are different ways by which you can replace VMs (IaaS) with platform-managed services. Speaking of services, let’s discuss identity services, which are the subject of the next design principle.
Use a platform-managed identity solution
This is often considered a subsection of the previous design principle; however, there are some additional key points that we need to cover as part of the identity solution. Every cloud application needs user identities. For this reason, Microsoft recommends using an Identity-as-a-Service (IDaaS) solution rather than developing your own identity solution. In Azure, we can use Azure AD or Azure AD B2C as an identity solution for managing users, groups, and authentication.
The following recommendations are shared by Microsoft for this design principle:
- If you are planning to use your own identity solution, you must have a database to store the credentials. While storing the credentials, you need to make sure that they are not stored in clear text; even storing them encrypted is not ideal on its own. A better option is to apply salted cryptographic hashing before persisting the data in the database (a minimal salted-hashing sketch follows this list). The advantage is that even if the database is compromised, the data is not easily retrievable. In the past few years, databases storing credentials have been targets for attack, and no matter how strong your hashing algorithm is, maintaining your own database is always a liability. To mitigate this, you can use an IDaaS, where credential management is done by the provider in a secure manner. In other words, it's the responsibility of the IDaaS provider to maintain and secure the database. You might be wondering how safe it is to outsource credentials to another provider. The short answer is that they have invested the time and resources to build the IDaaS platform, and if something happens, they are responsible for it.
- Use modern authentication and authorization protocols. When designing applications, use OAuth2, SAML, OpenID Connect (OIDC), and so on. Don't go for legacy methods, which are prone to attacks such as SQL injection. Modern IDaaS systems such as Azure AD use these modern protocols for authentication and authorization.
- IDaaS offers a plethora of additional security features compared with traditional home-grown identity systems. For example, Azure AD offers passwordless login, single sign-on (SSO), multi-factor authentication (MFA), conditional access (CA), just-in-time (JIT) access, privileged identity management (PIM), identity governance, access reviews, and so on. It’s going to be a very complex, time-consuming, and resource-consuming task if you are planning to include these features in your own identity system. Above all, the maintenance required for these add-ons is going to be high. If we are using an IDaaS solution, these are provided out of the box.
- The reliability and performance of the identity solution are also a challenge when opting for your own identity solution. What if the infrastructure hosting your identity solution goes down? How many concurrent sign-ins and token issuances can it handle? These questions need to be addressed, as they point to the reliability and performance of the identity solution. Azure AD offers SLAs for the Basic and Premium tiers, which cover both sign-in and token issuance. Microsoft will make sure that uptime is maintained, but in the case of home-grown identity solutions, you must set up redundant infrastructure to keep uptime high. Setting up redundant infrastructure is expensive and hard to maintain. Speaking of performance, Azure AD can handle millions of authentication requests without fail. Unlike your own identity solution, IDaaS is designed to withstand enormous volumes of traffic.
- Attacks are evolving and getting more sophisticated, so you need to ensure that your identity solution is also evolving and can resist these attacks. Periodic penetration testing, vetting of employees and vendors with access to the system, and tight controls need to be implemented. This process is expensive and time-consuming. In the case of Azure AD, Microsoft conducts periodic penetration testing with both internal and external security professionals, and these reports are available publicly. If required, you can raise a request to perform penetration testing on your own Azure AD tenant.
- Make complete use of features offered by the identity provider (IdP). These features are designed to protect your identities and applications. Instead of developing your own features, rely on native features, which are easy to set up and configure.
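To illustrate the hashing and salting point from the first recommendation, here is a minimal sketch using only the Python standard library; the iteration count is an illustrative assumption, and in practice an IDaaS should handle this for you.

```python
# A minimal sketch of salted password hashing with the standard library (PBKDF2).
# This only illustrates "hash and salt before persisting"; prefer an IDaaS such as
# Azure AD over storing credentials yourself.
import hashlib
import hmac
import os

ITERATIONS = 310_000  # illustrative work factor


def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)  # unique random salt per credential
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


salt, stored = hash_password("S3cure-Passw0rd!")
print(verify_password("S3cure-Passw0rd!", salt, stored))  # True
print(verify_password("wrong-guess", salt, stored))       # False
```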
With that, we will discuss the next design principle.
Use the best data store for your application
Most organizations use relational SQL databases for persisting application data. These databases are good for transactions that contain relational data. Keep the following considerations in mind if your preferred option is a relational database:
- Expensive joins are required for queries
- Data normalization and restructuring are required for schema on write
- Performance can be affected due to lock contention
The recommendation is not to use a relational database for every scenario. There are other alternatives, such as the following:
- Key/value stores
- Document databases
- Search engine databases
- Time-series databases
- Column-family databases
- Graph databases
Choose one based on the type of data that your application handles. For example, if your application handles rain-sensor data, which is basically a time series, then you should go for a time-series database rather than using a relational database. Similarly, if you want to have a product catalog for your e-commerce application, each product will have its own specification. The specifications of a smartphone include brand, processor, memory, and storage, while the specifications of a hair dryer are completely different. Here, we need to store the details of each product as a document, and these will be retrieved when the user clicks on the item. For these kinds of scenarios, you should use a document database. In Azure, this type of product catalog can be stored in Azure Cosmos DB.
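A small sketch of the product catalog example: each product is a self-describing document, so items with completely different attributes can sit in the same container, as they would in Azure Cosmos DB. The field names are illustrative.

```python
# Two products with very different attributes, modeled as schema-free documents.
smartphone = {
    "id": "sku-1001",
    "category": "smartphone",
    "brand": "Contoso",
    "processor": "Octa-core",
    "memoryGb": 8,
    "storageGb": 256,
}

hair_dryer = {
    "id": "sku-2002",
    "category": "hair-dryer",
    "brand": "Fabrikam",
    "wattage": 1800,
    "heatSettings": 3,
}

catalog = [smartphone, hair_dryer]
for product in catalog:
    print(product["id"], "-", product["category"])
```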
To conclude, a relational database is not meant for every scenario; consider using alternatives depending on the data that your application wants to store.
We have two more design principles to be covered before we wrap up, so let’s move on to the next one.
Design for evolution
According to Charles Darwin's theory of evolution, species change over time, give rise to new species, and share a common ancestor. The theory also covers natural selection, which causes a population to adapt to its environment. Keeping this theory in mind, when you design applications, design for evolution. This design principle talks about the transformation from a monolithic to a microservices architecture. This transformation is an evolution away from tight coupling between application components, which makes a system inflexible and fragile.
A microservices architecture decouples the application components so that they are loosely coupled. If components are tightly coupled, a change in one component creates repercussions in another, which makes it very difficult to release new changes into the system. To avoid this, we can consider a microservices architecture, where we can release changes to one service without affecting the others.
A list of recommendations for this design principle is available at https://docs.microsoft.com/en-us/azure/architecture/guide/design-principles/design-for-evolution.
Now, we are going to discuss the last design principle. Let’s dive right in!
Build for business needs
All the principles we have discussed so far are driven by a common factor: business requirements. For example, when we discussed the Make all things redundant design principle, we explored different recommendations for setting up redundant infrastructure. But what if the workload is a proof of concept (POC) or a development workload? Do I need redundant VMs for a development workload? As you can imagine, development workloads don't require redundant VMs unless this is demanded by the key factor, business requirements. It might seem obvious, but everything boils down to business requirements.
Leverage the following recommendations to build solutions to meet business needs:
- Define business objectives that include certain metrics to reflect the characteristics of your architecture. These numbers include recovery time objective (RTO), recovery point objective (RPO), and maximum tolerable outage (MTO). For instance, a low-RTO business requirement calls for quick, automatic failover to the DR region. On the other hand, you don't have to set up high redundancy if the business can tolerate a higher RTO.
- Define SLAs and service-level objectives (SLOs) for your application; this will help in choosing the right architecture. For example, if the SLA requirement is 99.9%, a single VM may be enough; however, for a higher requirement, you will need to deploy multiple VMs in an availability set or across availability zones. A small composite-SLA calculation follows this list.
- Leverage domain-driven design (DDD), whereby we model the application based on the use cases.
- Differentiate workloads based on the requirements for scalability, availability, data consistency, and DR. This will help you plan the strategy for each workload efficiently.
- Plan for growth; as your business grows, your user base and traffic will grow. You need to make sure your application also evolves to handle the new users and traffic. As we discussed in the Design for evolution section, think about decoupling your application components so that your application changes can be easily introduced without disrupting other dependencies.
- On-premises, the cost of hardware is paid upfront as a capital expenditure. The cloud, on the other hand, is an operational expenditure, which means you pay for the resources that you consume. Here, we need a shift in mindset: on-premises, even if you let a VM run for 60 days, there is no additional cost because the hardware was paid for upfront; the only costs are electricity and maintenance. In the cloud, you will be paying for the entire 60 days that the VM was running. To conclude, delete resources you no longer need to avoid incurring more costs than expected.
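As a back-of-the-envelope companion to the SLA recommendation above, the following sketch shows how availability figures combine: serially dependent services multiply, while redundant instances combine as 1 - (1 - A)^n. The figures are illustrative, not contractual Azure SLAs.

```python
# Composite availability: serial dependencies multiply (availability drops),
# while redundant instances raise the theoretical availability.
def composite_serial(*availabilities: float) -> float:
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def composite_redundant(availability: float, instances: int) -> float:
    return 1 - (1 - availability) ** instances


web, database = 0.9995, 0.9999
print(f"web + db in series : {composite_serial(web, database):.6f}")  # ~0.999400
print(f"two redundant webs : {composite_redundant(web, 2):.8f}")      # ~0.99999975
```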
That was the last design principle, and that wraps things up. We have now covered all the elements of the WAF.