Netflix (Nasdaq: NFLX), the largest online streaming service in the world, streams its movies and shows to users in 190 countries. It has a large collection of original movies and shows for global audiences, and it also licenses content from others. So far, Netflix has spent most of its revenue on creating original content. The company has experienced sharp growth in its subscriber base, which is now approaching 200 million, and its profitability has grown along with its membership. From India to France and in many other markets, Netflix offers localized content. Most of the company’s popularity is due to its focus on content quality, but its focus on innovation is also a key reason for its superior growth in the global market. Apart from the content, the user-friendly interface and the streaming quality also affect user engagement and loyalty.
Ever wondered what technologies are behind the platform that streams thousands of movies and shows to millions of users worldwide? There are two main ones: AWS and Open Connect. Netflix uses AWS for all its computing and storage-related needs, and Open Connect to deliver content to viewers.
Why did Netflix move to the cloud?
Netflix began moving to the cloud in 2008, when it faced a critical question: should it do the heavy lifting itself or rely on an external provider? In August 2008, Netflix experienced a database corruption and could not ship DVDs to its members for three days. That was when Netflix realized it needed to move away from vertically scaled single points of failure, such as the relational databases in its data centers, towards highly reliable, horizontally scalable cloud-based systems. The main driver was not financial; it was the need for greater flexibility and scalability. The company knew that cloud technology was the future, and that the earlier it started the migration, the sooner it would reap the benefits. By early 2016, Netflix had completed the migration and shut down its last remaining data center. In the eight years since the migration began, the number of streaming members had grown eightfold. Growth of this kind would have been impossible with Netflix’s own data centers: it could not have racked servers fast enough, whereas the elasticity of the cloud lets it add thousands of virtual servers and petabytes of storage within minutes. By leveraging cloud technology, Netflix became a truly global Internet TV network.
Today, Netflix leverages multiple AWS cloud regions to dynamically shift and expand its global infrastructure capacity, giving subscribers anywhere in the world a better viewing experience. Netflix’s service availability increased significantly in the cloud. The company was able to reach its goal of 99.99% uptime, which had not been possible with data centers that experienced significant outages. Another major benefit was that the cost of operating in the cloud was only a fraction of the cost of operating data centers. In the early stages of its migration, Netflix hit some rough patches, but found that failures are survivable in the cloud. The migration took around seven years because it was a complex task. Forklifting everything from the data centers and dropping it into AWS would have been an easier approach, but it would also have brought along the problems and limitations of the data centers. Instead, Netflix took the cloud-native approach, which required changing everything from its technological infrastructure, rebuilt from the ground up, to how the company operated. The benefits have proved to be bigger than the company might have anticipated when it began the migration. Turning Netflix into a cloud-native platform was a challenging task: it meant moving from a monolithic app to hundreds of microservices and denormalizing the data model using NoSQL databases. This required building new systems and learning new skills, which is why the company had to invest so much time and effort in its migration to the cloud.
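To put the 99.99% uptime goal mentioned above in perspective, the short calculation below converts an availability target into a yearly downtime budget. It is illustrative arithmetic only; the function name is made up for this example and nothing here comes from Netflix.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99%, the budget works out to roughly 52.6 minutes of downtime for the whole year, which shows why a single multi-day outage like the one in 2008 is incompatible with that goal.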
Netflix wrote in a 2010 blog post,
“We could have chosen to build out new data centers, build our own redundancy and failover, data synchronization systems, etc. Or, we could opt to write a check to someone else to do that instead.”
Four Reasons We Choose Amazon’s Cloud as Our Computing Platform, The Netflix TechBlog on Medium.
Continuing with its own data centers would have proved very costly for Netflix, and achieving the same scalability and efficiency would have been difficult. The biggest advantage of the cloud, however, was that the engineers no longer had to invest their time in data centers and infrastructure and could instead spend it improving the business and creating innovations that improve the customer experience.
Netflix and AWS
Netflix depends on AWS exclusively for all its storage and computing-related needs. The company uses more than 100,000 server instances on AWS for databases, analytics, recommendation engines, video transcoding, and many other functions. From Route 53 to Lambda and EC2, it uses many AWS services for resiliency and efficiency. Running that many instances results in a highly complex and dynamic networking environment, with many applications constantly communicating with each other across AWS and the internet. Netflix constantly monitors and optimizes its network to offer a superior customer experience, improve efficiency, and reduce costs.
The Netflix network generates several terabytes of data daily in the form of Virtual Private Cloud (VPC) Flow Logs. The platform needed a solution for ingesting, augmenting, and analyzing this large volume of data, with a lot of flexibility in how the data is processed. Netflix experimented with several approaches and combinations of AWS products before arriving at the final solution. Finally, it deployed Dredge, which centralizes flow logs using Amazon Kinesis Data Streams (KDS). Amazon KDS is a highly scalable and durable real-time data streaming service that can capture gigabytes of data per second from hundreds of thousands of sources, such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events. The data KDS collects is available in milliseconds for real-time analytics.
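As a rough illustration of what "augmenting" a flow log can mean, the sketch below parses one record in the default VPC Flow Log format and attaches application names to the source and destination addresses. The article does not describe how Dredge does this internally, so the lookup table, the `EnrichedFlow` type, and the `enrich` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Field order of the default (version 2) VPC Flow Log record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

# Hypothetical lookup table; a real system would query an inventory service.
APP_BY_IP: Dict[str, str] = {
    "10.0.1.15": "playback-api",
    "10.0.2.40": "recommendations",
}


@dataclass
class EnrichedFlow:
    src_app: str
    dst_app: str
    bytes_sent: int
    action: str


def enrich(line: str) -> Optional[EnrichedFlow]:
    """Parse one space-separated flow log line and attach application names."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None  # malformed or custom-format record
    rec = dict(zip(FIELDS, parts))
    if rec["log_status"] != "OK":
        return None  # NODATA / SKIPDATA records carry no traffic fields
    return EnrichedFlow(
        src_app=APP_BY_IP.get(rec["srcaddr"], "unknown"),
        dst_app=APP_BY_IP.get(rec["dstaddr"], "unknown"),
        bytes_sent=int(rec["bytes"]),
        action=rec["action"],
    )


sample = ("2 123456789010 eni-0a1b2c3d 10.0.1.15 10.0.2.40 443 49152 6 "
          "20 4249 1418530010 1418530070 ACCEPT OK")
print(enrich(sample))
```

Once raw IP addresses are mapped to application names, traffic can be aggregated per application pair rather than per address, which is the kind of view that makes network optimization decisions practical.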
Netflix’s Amazon Kinesis Data Streams-based solution has proven to be highly scalable, processing billions of traffic flows each day. Typically, about 1,000 Amazon Kinesis shards work in parallel to process the data stream. John Bennet, senior software engineer at Netflix, says,
“Amazon Kinesis Data Streams processes multiple terabytes of log data each day, yet events show up in our analytics in seconds. We can discover and respond to issues in real time, ensuring high availability and a great customer experience.”
The insight gained through KDS allows Netflix to optimize its applications, for example by moving an application from one region to another or by switching a specific type of traffic from one network protocol to another. According to John Bennet, the KDS-based solution enabled Netflix to reduce costs and improve operational efficiency and resiliency.
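The idea of many shards working in parallel can be sketched as follows. Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The example assumes the ranges are split evenly across 1,000 shards and uses made-up partition keys; it does not reflect how Netflix actually keys its records.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 1000     # roughly the shard count quoted above
HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


# Hypothetical partition keys: one per network interface.
keys = [f"eni-{i:08x}" for i in range(100_000)]
load = Counter(shard_for(k) for k in keys)
print("shards used:", len(load))
print("busiest shard:", max(load.values()), "records")
```

Because the hash spreads keys almost uniformly, each shard receives a similar share of the stream, so consumers reading different shards can process the flow logs in parallel.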
Open Connect: Netflix’s customized CDN solution
AWS serves Netflix’s computing and storage needs, so everything Netflix has is stored in the cloud. However, AWS is not responsible for delivering content to individual users; it handles the events that take place before the user hits play. Once a user clicks on a show and starts streaming, Open Connect takes over. Open Connect is Netflix’s own customized CDN (content delivery network) that brings the content people watch on Netflix closer to their location, which is the job of any CDN. If a client in Asia, say in India, tries to stream content from the US, the speed can be slow. If the same content can be served from Singapore or Mumbai, the speed will be much better and the viewer will have a superior experience. Open Connect resembles a conventional CDN service in many respects, but there are several major differences as well.
Netflix released Open Connect in 2011. Its traffic was rising steadily, and the company needed a way to maintain streaming quality as traffic grew. By the end of 2011, Netflix had around 21.5 million US subscribers, and the number grew to 25 million by the third quarter of 2012. Netflix accounted for a significant portion of all traffic on ISP networks, so it needed to work with those ISPs more collaboratively and directly. It also needed a customized CDN that would let it design a proactive caching solution, one that places content on servers before viewers request it and is therefore more efficient than the standard demand-driven CDN approach. The goal was to reduce the overall demand on upstream network capacity by multiple orders of magnitude.
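The difference between proactive and demand-driven caching can be illustrated with a small sketch. A demand-driven cache fetches a file the first time a viewer asks for it; a proactive cache decides ahead of time what to store. The greedy planner below is a simplified stand-in and not Netflix's actual algorithm; the `Title` type, the popularity forecasts, and the sizes are invented for the example.

```python
from typing import List, NamedTuple


class Title(NamedTuple):
    name: str
    size_gb: float
    predicted_plays: int  # hypothetical forecast for the next day


def plan_fill(catalog: List[Title], capacity_gb: float) -> List[str]:
    """Greedily pick titles with the most predicted plays per GB until full."""
    ranked = sorted(catalog, key=lambda t: t.predicted_plays / t.size_gb, reverse=True)
    chosen: List[str] = []
    used = 0.0
    for title in ranked:
        if used + title.size_gb <= capacity_gb:
            chosen.append(title.name)
            used += title.size_gb
    return chosen


catalog = [
    Title("Title A", 40, 9000),
    Title("Title B", 10, 500),
    Title("Title C", 25, 4000),
    Title("Title D", 30, 300),
]
print(plan_fill(catalog, capacity_gb=80))
```

Since the plan is computed in advance, the chosen files can be copied to the cache during off-peak hours, and playback requests for them never have to travel upstream.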
Netflix has continued to improve Open Connect to offer its viewers a superior experience. Open Connect enables ISPs to provide the customers they share with Netflix a better video experience. Initially, Netflix’s HD content was available only through ISPs that had signed up for the Open Connect initiative.
Open Connect has continued to evolve, and Netflix has been working to optimize it for higher efficiency and resiliency. It developed a new algorithm, Heterogeneous Cluster Allocation (HCA), that makes intelligent distribution of content possible. Netflix continues to make improvements to its CDN to enhance the viewing experience for its subscribers globally.
Open Connect Appliances
Open Connect Appliances (OCAs) are the building blocks of Open Connect. OCAs are Netflix’s suite of purpose-built server appliances. They store encoded video and image files for delivery over HTTP/HTTPS to client devices, including mobile devices, set-top boxes, and smart TVs. Their only responsibility is to deliver the bits of those video and image files to clients as fast as possible. In 2016, Netflix had Open Connect Appliances in 1,000 different locations, from large cities like New York, Paris, and London to remote locations like Greenland and the Amazon rainforest. This footprint ensures that most Netflix members receive their audio and video from a server located inside, or directly connected to, their ISP’s network within their local region. Streaming quality has continued to improve across the Middle East, Africa, India, and Asia with the growing footprint of Open Connect Appliances. ISPs participating in the Netflix Open Connect Program continue to achieve cost savings.
Netflix has deployed thousands of OCAs in two ways:
- Netflix installs OCAs in significant markets in the world within internet exchange points (IXs or IXPs). These OCAs are interconnected with mutually present ISPs through settlement-free public or private peering.
- Netflix provides OCAs free of charge to the qualifying ISPs. These OCAs have the same capabilities as the OCAs within the IXPs. They are deployed directly inside the ISP networks. Netflix provides the server hardware and the ISPs provide the power, space, and connectivity. The ISPs have direct control over which of their customers are routed to their embedded OCAs.
How the OCAs, client devices, and AWS interact
The OCAs do not store client data; they only serve the content that a client device requests. They also report their status to the Open Connect control plane services on AWS. The reports include health metrics, the BGP routes learned from the BGP peer they have a configured session with, and the files they have stored on disk. The control plane services use the reported data to steer client devices, via URL, to the optimal OCAs based on their availability, health, and network proximity to the client device. The control plane services also control fill behavior (adding new files to OCAs nightly), compute optimal behavior for file storage and handling, and handle the storage and interpretation of relevant telemetry about the playback experience. Open Connect actively partners with Netflix client teams to ensure that each device can dynamically optimize how it receives content from OCAs based on its specific needs and current network conditions.
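One way to picture "network proximity" is to compare the client's IP address against the routes each OCA has reported. The sketch below picks the healthy OCA whose reported route matches the client with the longest prefix. This is a deliberate simplification that considers prefix length only; the OCA names and routes are hypothetical, and the addresses are reserved documentation ranges.

```python
import ipaddress
from typing import Dict, List, Optional, Set

# Hypothetical routes that each OCA has reported as learned over BGP.
ROUTES_BY_OCA: Dict[str, List[str]] = {
    "oca-embedded": ["198.51.100.0/25"],
    "oca-ixp": ["198.51.100.0/24", "203.0.113.0/24"],
}


def best_oca(client_ip: str,
             routes_by_oca: Dict[str, List[str]],
             healthy: Set[str]) -> Optional[str]:
    """Return the healthy OCA with the most specific route covering the client."""
    addr = ipaddress.ip_address(client_ip)
    best_name: Optional[str] = None
    best_len = -1
    for name, prefixes in routes_by_oca.items():
        if name not in healthy:
            continue  # skip OCAs that reported poor health
        for prefix in prefixes:
            net = ipaddress.ip_network(prefix)
            if addr in net and net.prefixlen > best_len:
                best_name, best_len = name, net.prefixlen
    return best_name


all_healthy = {"oca-embedded", "oca-ixp"}
print(best_oca("198.51.100.10", ROUTES_BY_OCA, all_healthy))
print(best_oca("198.51.100.10", ROUTES_BY_OCA, {"oca-ixp"}))
```

With both appliances healthy, the client is steered to the embedded OCA because its /25 route is more specific; if that appliance drops out, the same client falls back to the OCA at the exchange point.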
How the playback process works
- OCAs periodically report health metrics, routes learned, and content availability to the cache-control services in AWS.
- A client device sends a playback request for a title to AWS.
- After checking user authorization and licensing, the playback application services on AWS determine the specific files required to handle the playback request based on individual client characteristics and current network conditions.
- Based on the information stored in the cache-control services, the steering service in AWS picks the OCAs to serve the requested files from, generates the URLs for them, and hands the URLs over to the playback application services.
- The playback application services send the URLs of the appropriate OCAs to the client device, the OCAs start serving the files, and playback begins.
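The steps above can be sketched as a small in-memory model. Every class, function, file name, and host name below is hypothetical and chosen only to mirror the five steps; the real services are far more involved, and the real steering logic also weighs network proximity.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class OcaReport:
    healthy: bool
    files: Set[str]


class CacheControl:
    """Step 1: stores the latest report received from each OCA."""

    def __init__(self) -> None:
        self.reports: Dict[str, OcaReport] = {}

    def report(self, oca_host: str, report: OcaReport) -> None:
        self.reports[oca_host] = report


def pick_files(title_id: str, device: str) -> List[str]:
    """Step 3: choose files for the device (hypothetical naming scheme)."""
    quality = "1080p" if device == "smart-tv" else "480p"
    return [f"{title_id}_{quality}.video", f"{title_id}.audio"]


def steer(cache: CacheControl, files: List[str], max_urls: int = 2) -> Dict[str, List[str]]:
    """Step 4: for each file, build URLs on healthy OCAs that hold it."""
    urls: Dict[str, List[str]] = {}
    for name in files:
        hosts = sorted(h for h, r in cache.reports.items()
                       if r.healthy and name in r.files)
        urls[name] = [f"https://{h}/{name}" for h in hosts[:max_urls]]
    return urls


def request_playback(cache: CacheControl, authorized: Set[str],
                     user: str, title_id: str, device: str) -> Dict[str, List[str]]:
    """Steps 2 to 5: authorize, pick files, steer, and return URLs to the client."""
    if user not in authorized:
        raise PermissionError(f"{user} may not play {title_id}")
    return steer(cache, pick_files(title_id, device))


cache = CacheControl()
cache.report("oca-a.example.net", OcaReport(True, {"t1_1080p.video", "t1.audio"}))
cache.report("oca-b.example.net", OcaReport(False, {"t1_1080p.video", "t1.audio"}))
cache.report("oca-c.example.net", OcaReport(True, {"t1.audio"}))
print(request_playback(cache, {"alice"}, "alice", "t1", "smart-tv"))
```

In this run the unhealthy appliance is never offered to the client, and the audio file gets two candidate URLs because two healthy appliances hold it, which gives the device a fallback if the first one becomes slow.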
The Netflix Open Connect Operations team constantly monitors, maintains, and updates the OCA deployments to ensure higher reliability and efficiency.