Netflix started with home delivery of DVDs in the late 90s, and moved to video-on-demand Internet streaming in 2007. The company has since moved away from simply distributing content to creating it.

House of Cards is the first major TV show to bypass the more traditional distribution channel of TV networks and cable operators, and premier directly to viewers online. In 2013, the show won three Primetime Emmy Awards and was nominated for six other categories. Netflix initially committed $100 million to produce the first two seasons, and has since announced that the show will be renewed for a third season.

What is Netflix’s secret in producing the hit show? Big data.

Data collection

At the end of 2013, Netflix had 33 million US subscribers. From these subscribers, Netflix gathered every event (every time the user rewinds, fast forwards or pauses), every rating, every search. This data is supplement with geo-location data, device information, time of day and week, as well as social media data. A further layer of enrichment comes from third-party providers such as Nielsen.

The data is used to improve Netflix’s personalization algorithm, which recommends other shows to watch based on what the user has seen. Netflix discovered how viewers who enjoyed the original House of Cards miniseries, produced by the BBC in the 1990s, also enjoyed movies starring Kevin Spacey or directed by David Fincher. While creating original content may be risky, it appears that Netflix managed to hedge its bets, and leverage its powerful distribution channel, in creating House of Cards.

Data architecture

Netflix started with a more traditional MySQL database for data warehousing, storing more than 10 years of customer data and billions of ratings. However, the growth of data collected by Netflix started to increase exponentially as the service started to shift towards Internet streaming. Netflix ran into scalability problems when the growth of the customer base is supplemented by data collected from existing users watching more and more shows.

The immediate impact of the massive growth in data collected is the increase in data storage costs. As storing the data in the cloud was about 100 times cheaper than in traditional data warehouses, it made sense for Netflix to start migrating its data to the cloud. While this solves the problem of cost and scalability, the geographical diversity of the data could pose difficulties for real-time queries.

Enter Cassandra. Cassandra is an open source NoSQL database, created by Facebook and now under the open source Apache licence. It scales wonderfully across multiple data centers all over the world. It is optimised for random write/reads, which makes real-time queries possible.

Engineers at Netflix then created Aegisthus to create a ‘bulk data pipeline’ out of Cassandra, as Cassandra is more ideal for individual queries. The output from Aegisthus is stored on Amazon S3 once a day, which is what Netflix is now using as its main data warehouse. The data stored in S3 is analyzed with Hadoop. Netflix uses S3 as the backend instead of HDFS for its extremely high durability and versioning capabilities, even though doing so adds a little latency.

Netflix’s use case of big data and cloud infrastructure highlights how different tools are used for different tasks. More importantly, it emphasizes the power of big data in extracting insights like never before but still being cost effective.

 

This week, we feature the best articles on limitations of Big Data. In particular, Tim Harford makes an excellent case on why correlation does not equal causation.

Image credit: Flickr

 


Interested in more content like this? Sign up to our newsletter, and you wont miss a thing!

Previous post

Big Data Compounds Privacy Problem

Next post

Democratizing Data Assets: Learning From Data, Big and Small