Netflix is often the first company that is brought up when people talk about a visionary cloud native company, but why? This section will break up the journey that Netflix has undertaken to get to the point it is at today. Using the CNMM, each of the axis will be discussed and key points taken into account to demonstrate their maturity along the cloud native journey.
Cloud native architecture case study – Netflix
The journey
As with all major migrations to the cloud, Netflix's journey was not something that happened overnight. As early as May 2010, Netflix had been publicly touting AWS as its chosen cloud computing partner. The following quote has been extracted from the press release that both companies published at that time (http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-newsArticle&ID=1423977):
"Amazon Web Services today announced that Netflix, Inc., has chosen AWS to run a variety of mission-critical, customer-facing and backend applications. While letting Amazon Web Services do the worrying about the technology infrastructure, Netflix continues to improve its members' overall experience of instantly watching TV episodes and movies on the TV and computer, and receiving DVDs by mail."
That press release goes on to say that Netflix had actually been using AWS for the experimentation of workload development for over a year, meaning that since 2009, Netflix has been on their cloud native journey. Since AWS released its first service in 2006, it is evident that Netflix saw the benefits from the very beginning and aggressively moved to take advantage of the new style of computing.
They phased the migration of components over time to reduce risk, gain experience, and leverage the newest innovations that AWS was delivering. Here's a quick timeline of their migration [2009 - 2010] http://www.sfisaca.org/images/FC12Presentations/D1_2.pdf, [2011 - 2013] https://www.slideshare.net/AmazonWebServices/ent209-netflix-cloud-migration-devops-and-distributed-systems-aws-reinvent-2014 (Slide 11), and [2016] https://medium.com/netflix-techblog/netflix-billing-migration-to-aws-451fba085a4:
- 2009: Migrating video master content system logs into AWS S3
- 2010: DRM, CDN Routing, Web Signup, Search, Moving Choosing, Metadata, Device Management, and more were migrated into AWS
- 2011: Customer Service, International Lookup, Call Logs, and Customer Service analytics
- 2012: Search Pages, E-C, and Your Account
- 2013: Big Data and Analytics
- 2016: Billing and Payments
You can read more about this at https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration. This seven-year journey enabled Netflix to completely shut down their own data centers in January 2016, and so they are now a completely cloud native company. Admittedly, this journey for Netflix was not easy, and a lot of tough decisions and trade-offs had to be made along the way, which will be true for any cloud native journey; however, the long-term benefits of re-engineering a system with a cloud native architecture, instead of just moving the current state to the cloud, means that all of the technical debt and other limitations are left behind. Therefore, in the words of Yury Izrailevsky (Vice President, Cloud and Platform Engineering at Netflix):
This amazing journey for Netflix continues to this day. Since the cloud native maturity model doesn't have an ending point, as cloud native architectures mature, so too will the CNMM and those companies that are pushing the boundaries of how to develop these architectures.
The benefits
The journey to becoming a cloud native company at Netflix was impressive, and continues to yield benefits for Netflix and its customers. The growth Netflix was enjoying around 2010 and beyond made it difficult for them to logistically keep up with the demand for additional hardware and the capacity to run and scale their systems. They quickly realized that they were an entertainment creation and distribution company, and not a data center operations company. Knowing that managing an ever-growing number of data centers around the world would continue to cause huge capital outflows and require a focus that was not core to their customers, they made their cloud-first decision.
Elasticity of the cloud is possibly the key benefit for Netflix, as it allows them to add thousands of instances and petabytes of storage, on demand, as their customer base grows and usage increases. This reliance on the cloud's ability to provide resources as required also includes their big data and analytics processing engines, video transcoding, billing, payments, and many other services that make their business run. In addition to the scale and elasticity that the cloud brings, Netflix also cites the cloud as away to significantly increase their services availability. They were able to use the cloud to distribute their workloads across zones and geographies that use fundamentally unreliable but redundant components to achieve their desired 99.99% availability of services.
Finally, while cost was not a key driver for their decision to move to the cloud, their costs per streaming start ended up being a fraction of what it was when they managed their own data centers. This was a very beneficial side effect of the scale they were able to achieve, and the benefit was only possible due to the elasticity of the cloud. Specifically, this enabled them to continuously optimize instance type mix and to grow and shrink our footprint near-instantaneously without the need to maintain large capacity buffers. We can also benefit from the economies of scale that are only possible in a large cloud ecosystem. These benefits have enabled Netflix to have a laser focus on their customer and business requirements, and not spend resources on areas that do not directly impact that business mission.
CNMM
Now that we understand what the Netflix journey was about and how they benefited from that journey, this section will use the CNMM to evaluate how that journey unfolded and where they stand on the maturity model. Since they have been most vocal about the work they did to migrate their billing and payment system to AWS, that is the workload that will be used for this evaluation. That system consisted of batch jobs, billing APIs, and integrations, with other services in their composite application stack, including an on-premises data center at the time. Full details about that migration can be found at their blog, https://medium.com/netflix-techblog/netflix-billing-migration-to-aws-451fba085a4, on this topic.
Cloud native services axis
The focus of the cloud native services adoption spectrum is to demonstrate the amount of cloud vendor services that are in use for the architecture. While the full extent of the services that Netflix uses is unknown, they have publicly disclosed numerous AWS services that help them achieve their architecture. Referring to the mature cloud vendor services diagram in the beginning of this chapter, they certainly use most of the foundational services that fall into the infrastructure: networking, compute, storage, and database tiers. They also use most of the services from the security and application services tier. Finally, they have discussed their usage of lots of services in the management tools, analytics, dev tools, and artificial intelligence tiers. This amount of services usage would classify Netflix as a very mature user of cloud native services, and therefore they have a high maturity on the cloud native services axis.
It is also important to note that Netflix also uses services that are not in the cloud. They are very vocal that their usage of content delivery networks (CDNs) are considered a core competency for their business to be successful, and therefore they set up and manage their own global content network. This point is made in a blog post at https://media.netflix.com/en/company-blog/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience by the company in 2016, where they articulated their usage of AWS and CDNs and why they made their decisions:
In addition, there are cases where they choose to use open source tools running on cloud building blocks, like Cassandra for their NoSQL database, or Kafka for their event streams. These architecture decisions are the trade-offs they made to ensure that they are using the best tools for their individual needs, not just what a cloud vendor offers.
Application centric design axis
Designing an application for the cloud is arguably the most complicated part of the journey, and having a high level of maturity on the application centric design axis will require specific approaches. Netflix faced some big challenges during its billing and payment system migration to the cloud; specifically, they wanted near-zero downtime, massive scalability, SOX compliance, and global rollout. At the point of time where they begin this project, they already had many other systems running in the cloud as decoupled services. Therefore, they used the same decoupling approach by designing microservices for their billing and payment systems.
To quote their blog on this topic:
The other major redesign during this process was the move from an Oracle database's heavy relational design to a more flexible and scalable NoSQL data structure for subscription processing, and a regionally distributed MySQL relational database for user-transactional processing. These changes required other Netflix services to modify their design to take advantage of the decoupling of data storage and retry the ability for data input to their NoSQL database solution. This enabled Netflix to migrate millions of rows from their on-premises Oracle database to Cassandra in AWS without any obvious user impact.
During this billing and payment system migration to the cloud, Netflix made many significant decisions that would impact its architecture. These decisions were made with long-term impact in mind, which caused a longer migration time, but ensured a future-proofed architecture that could scale as they grew internationally. The cleaning up of code to remove technical debt is a prime example of this, and allowed them to ensure the new code base was designed using microservices, and had other cloud native design principles included. Netflix has demonstrated a high level of maturity on the application-centric design axis.
Automation axis
The automation axis demonstrates a company's ability to manage, operate, optimize, secure, and predict how their systems are behaving to ensure a positive customer experience. Netflix understood early on in their cloud journey that they had to develop new ways to verify that their systems were operating at the highest level of performance, and almost more importantly, that their systems were resilient to service faults of all kinds. They created a suite of tools called the Simian Army (https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116), which includes all kinds of automation that is used to identify bottlenecks, break points, and many other types of issues that would disrupt their operations for customers. One of the original tools and the inspiration for the entire Simian Army suite is their Chaos Monkey; in their words:
Having systems that can survive randomly shutting down critical services is the definition of high levels of automation. This means that the entire landscape of systems must follow strict automated processes, including environment management, configuration, deployments, monitoring, compliance, optimization, and even predictive analytics. Chaos Monkey went on to inspire many other tools that all fall into the Simian Army toolset. The full suite of the Simian Army is:
- Latency Monkey: Induces artificial delays into RESTful calls to similar service degradation
- Conformity Monkey: Finds instances that do not adhere to predefined best practices and shuts them down
- Doctor Monkey: Taps into health checks that run on an instance to monitor external signs of health
- Janitor Monkey: Searches for unused resources and disposes of them
- Security Monkey: Finds security violations and vulnerabilities and terminates offending instances
- 10-18 Monkey: Detects configuration and runtime problems in specific geographic regions
- Chaos Gorilla: Similar to Chaos Monkey, but simulates an entire outage of AWS available zones
However, they didn't stop there. They also created a cloud wide telemetry and monitoring platform known as Atlas (https://medium.com/netflix-techblog/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a), which is responsible of capturing all-time series data. The primary goal for Atlas is to support queries over dimensional time series data so they can drill down into problems as quickly as possible. This tool satisfies the logging aspect of the twelve-factor app design, and allows them to have enormous amounts of data and events to analyze and take action on before they become customer-impacting. In addition to Atlas, in 2015 Netflix released a tool called Spinnaker (https://www.spinnaker.io/), which is an open source, multi-cloud, continuous delivery platform for releasing software changes with high velocity and confidence. Netflix is constantly updating and releasing additional automation tools that help them manage, deploy, and monitor all their services, using globally distributed AWS regions and, in some cases, using other cloud vendor services.
Netflix has been automating everything in their environment for almost as long as they have been migrating workloads to the cloud. Today, they rely on those tools to ensure their global network is functioning properly and serving their customers. Therefore, they would fall on the highly mature level of the automation axis.