Architecting for operations
API platforms on the cloud behave like distributed computing environments: they are relatively complex, with many moving parts. Transient failures of cloud resources are also common, so applications must be designed for resiliency. For both reasons, modern applications should be architected and designed with a production-first mindset.
The objective should be to bake in as much telemetry as possible, so that the operations team can monitor the site for error conditions, perform proper root cause analysis, and remediate live site issues. Two of the most important practices in this regard are the following:
- Logging, monitoring, and alerts
- Feature flags
Let's understand these in the next sections.
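Before moving on, the resiliency point made earlier can be made concrete. One common way to tolerate transient failures is to retry the failed call with exponential backoff and jitter. The following is a minimal sketch of that idea, not a prescribed implementation: the names TransientError and call_with_retry, and the default attempt count and delays, are hypothetical and chosen only for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    The sleep function is injectable so that the behavior can be tested
    without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The base delay doubles on each attempt; random jitter spreads
            # out retries from clients that failed at the same moment.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

A caller would wrap a remote call, for example call_with_retry(lambda: fetch_order(order_id)), where fetch_order is a hypothetical function that raises TransientError on timeouts or throttling responses. Only failures known to be transient should be retried; errors such as invalid input should fail immediately.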
Logging, monitoring, and alerts
Logging and monitoring play a crucial role in the timely detection of issues and subsequent remedial action by the operations team. All API...