Summary
Running workloads in the cloud has its advantages, but managing a cloud infrastructure platform brings its own challenges and requires a comprehensive set of cloud-specific services. A central, integrated operations model can provide a single source of truth that decreases risk and cost while streamlining platform operations. Such a model combines proper tooling, an integrated control tower, automated processes and frameworks, insights collected from continuous monitoring, and the right skills.
This chapter discussed how SRE principles, combined with deep AI insights and platform operations, speed up incident management, problem management, FinOps, and MLOps while reducing manual intervention. Organizations should establish a comprehensive operations model with the proper skills, processes, practices, technology, toolchains, and overall governance. Because the chapter focused mainly on the operation and maintenance of cloud platforms, we discussed different operational activities, processes, practices, and reference...