Using Data Analysis to Discover Top Stories
We start building the Top Stories feeds by analyzing anonymous data on what stories are currently being read and shared the most through Pulse. Since the volume of data coming from Pulse is always growing, we’ve built several highly scalable systems for collection and analysis. Most of these systems are currently built on Amazon Web Services (AWS).
The first system is built on Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3). It uses Tornado and Scribe to efficiently log data from HTTP requests to disk. This data is then moved to S3 on a regular basis for permanent storage and analysis. All data is stored in two separate S3 buckets for additional reliability. To make the system horizontally scalable, we replicate the Tornado/Scribe stack on multiple EC2 instances and use an Elastic Load Balancer (ELB) to distribute our request load across these instances. The ELB also provides a heartbeat service which automatically redirects requests away from any malfunctioning instances.
Analyzing with Hadoop
Once we have our data on S3, it is important for us to be able to extract breaking stories quickly. To achieve this on our rapidly growing dataset, we built our analysis engine on top of Elastic Map Reduce (EMR) and Hadoop. Using a set of relatively simple mappers and reducers based on the mrjob python framework from Yelp, we are able to quickly scan through hours or days worth of data to surface top read and shared stories. We regularly run these Hadoop jobs directly against our data on S3. To improve performance, this data could be staged on Elastic Block Storage (EBS) first, but in practice we haven’t found this necessary. Since our MapReduce operations are very parallelizable, we can get faster results simply by spinning up a few extra EMR instances. The Amazon EMR interface makes this process extremely simple and cost effective in comparison to running a dedicated Hadoop cluster on EC2 or our own hardware.
In contrast to our data collection and analysis infrastructure, most of our feed serving and client api request handling is built on Google AppEngine (GAE). When our EMR jobs finish, they push the final top stories for each of our categories to the datastore on GAE. Here we achieve horizontal scalability via GAE’s automatic web servers provisioning based on the current request load, as well as it’s solid non-relational datastore and memcache.
This data is then normalized against the current ratio of reading to sharing in the given category and used to create a ranked feed of the current top stories that quickly surfaces breaking news. The feeds for each category are then memcached and efficiently served to users via the Pulse application on each mobile platform.
Check out our post on the Pulse Blog to follow our Top Stories on Twitter!