Scaling to 10M on AWS
As Pulse crosses the 10M user mark (up 10x since last year), we’d like to share a bit more about how we’ve built and scaled Pulse’s backend systems. In this article we will discuss the important role AWS plays in our infrastructure.
Today there are more infrastructure choices than ever. They include running your own hardware, leasing virtual machines, subscribing to higher level platforms and software services, and often a combination of all of the above. It is important to consider the trade-offs and choose the right tool for the job. In our experience, AWS provides an exceptional capability to build systems as close to the metal as you like, while still avoiding the burden and inelasticity of owning your own hardware. It also provides some useful abstraction layers and services above the machine level.
Amazon Elastic Compute Cloud (Amazon EC2) instances make it easy to run low-level processes that write directly to disk, and Amazon Simple Storage Service (Amazon S3) provides great long-term file storage. This combination is an excellent choice for most flat-file logging systems. At Pulse, we’ve built a simple logging system that is blazingly fast on one machine and easy to scale horizontally. Using Tornado to handle HTTP requests and Scribe to buffer and write files, we are able to store logs at near-disk speeds (more than 50 MB/s per instance). Once the logs have been written to disk, we regularly move them to Amazon S3 for reliable long-term storage and easy access. Amazon S3’s low cost and scalable nature allow us to save all of our data without worrying too much about size.
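To make the buffer-then-write idea concrete, here is a minimal sketch in plain Python. It is not Scribe and not our production code: the class name RotatingLogBuffer, the file naming scheme, and the thresholds are all hypothetical, chosen only to illustrate batching lines in memory, appending them to a flat file, and closing a file once it is large enough to ship to long-term storage.

```python
import os
import tempfile


class RotatingLogBuffer:
    """Buffers log lines in memory and appends them to flat files in batches.

    Hypothetical stand-in for the buffering role described above; this is
    not Scribe's API.
    """

    def __init__(self, directory, max_buffered=1000, max_file_bytes=64 * 1024 * 1024):
        self.directory = directory
        self.max_buffered = max_buffered        # Lines held in memory before a write.
        self.max_file_bytes = max_file_bytes    # Size at which a file is closed.
        self.closed_files = []                  # Finished files, ready to be uploaded.
        self._buffer = []
        self._sequence = 0

    def _current_path(self):
        return os.path.join(self.directory, "events-%06d.log" % self._sequence)

    def append(self, line):
        """Queue one log line; write to disk once the buffer is full."""
        self._buffer.append(line.rstrip("\n") + "\n")
        if len(self._buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        """Append all buffered lines to the current file, rotating if it is full."""
        if not self._buffer:
            return
        path = self._current_path()
        with open(path, "a", encoding="utf-8", newline="\n") as handle:
            handle.writelines(self._buffer)
        self._buffer = []
        if os.path.getsize(path) >= self.max_file_bytes:
            self.closed_files.append(path)
            self._sequence += 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = RotatingLogBuffer(tmp, max_buffered=2, max_file_bytes=20)
        for i in range(5):
            log.append("event %d" % i)
        log.flush()
        print("files ready to upload:", log.closed_files)
```

In a setup like this, a separate periodic task would upload everything in closed_files to Amazon S3 and then delete the local copies; that step is omitted here because it depends on the S3 client in use.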
By provisioning an Elastic Load Balancer (ELB) in front of these servers, we are able to easily divide our load over as many logging servers as necessary and automatically direct traffic away from failing machines. Provisioning these machines in multiple AWS availability zones also makes it easy to achieve fault tolerance.
Pulse’s implementation easily handles millions of events per hour and has been running continuously for over a year without any downtime.
Another major reason we decided to build our event logging system on Amazon S3 was to leverage Amazon Elastic MapReduce and Apache Hive. Now that our data is getting bigger, it is much more efficient to query with a cluster of machines. AWS allows us to quickly spin up a cluster of tens to hundreds of machines without configuring and maintaining our own Hadoop cluster or moving our data out of Amazon S3.
With a large cluster, we are able to query a significant portion of our data in minutes instead of hours or days. Because the AWS cluster can simply be turned off when we are done, the cost to run big queries is usually quite reasonable. Consider a cluster of 100 m1.large machines. A set of queries that takes 45 minutes to run on this cluster would cost us $11 – $34 (depending on whether we bid on spot instances or use regular on-demand instances). Assuming you’re not running jobs all the time, this is preferable to the cost of buying and continuously maintaining your own cluster.
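The arithmetic behind that range can be sketched as follows. The two hourly rates are assumptions back-derived from the $11 and $34 totals above rather than quoted AWS prices, and the sketch assumes that each partial instance-hour is billed as a full hour, which is why a 45-minute run is charged as 60 minutes.

```python
import math

# Assumed per-instance hourly rates in USD, inferred from the totals in the
# text; they are illustrative, not an AWS price list.
ASSUMED_SPOT_RATE = 0.11
ASSUMED_ON_DEMAND_RATE = 0.34


def cluster_cost(instances, runtime_minutes, hourly_rate_usd):
    """Cost of a transient cluster, rounding each instance up to whole hours."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return instances * billed_hours * hourly_rate_usd


if __name__ == "__main__":
    spot = cluster_cost(100, 45, ASSUMED_SPOT_RATE)
    on_demand = cluster_cost(100, 45, ASSUMED_ON_DEMAND_RATE)
    print("spot: $%.2f, on-demand: $%.2f" % (spot, on_demand))
```

Under the same assumptions, a job that ran for 61 minutes would be billed for two hours per instance, so it pays to keep query batches just under an hour boundary.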
Apache Hive makes this process even easier by taking simple SQL queries and converting them into what would often be relatively complex, multi-step Amazon Elastic MapReduce jobs. These SQL queries can be run directly by our business team, avoiding the need for engineering support.
For batch jobs, such as regularly extracting the top read and shared stories, the Pulse backend team likes to use mrjob, an open-source framework developed at Yelp. The mrjob framework allows us to write mappers and reducers in Python (instead of Java) and integrates seamlessly with Amazon Elastic MapReduce. Python is our language of choice because it is more consistent with our codebase and it provides a simple representation for common MapReduce data structures such as tuples and dictionaries. Because our jobs are usually I/O-bound, the interpreted runtime doesn’t slow things down much.
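As an illustration of how small such a job can be in Python, here is a sketch of a top-stories count written as a plain mapper and reducer. It runs in-process using only the standard library rather than through mrjob, and the JSON log format with "action" and "story_id" fields is a hypothetical one invented for the example. In mrjob, the mapper and reducer would instead be methods on an MRJob subclass, and the framework would handle the shuffle that run_job imitates here.

```python
import json
from collections import defaultdict


def mapper(line):
    """Emit (story_id, 1) for every read or share event in one log line."""
    event = json.loads(line)
    if event.get("action") in ("read", "share"):
        yield event["story_id"], 1


def reducer(story_id, counts):
    """Sum the per-event counts for one story."""
    yield story_id, sum(counts)


def run_job(lines):
    """In-process stand-in for the map, shuffle, and reduce phases."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    totals = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            totals[out_key] = out_value
    return totals


def top_stories(totals, n):
    """Return the n most active stories, breaking ties by story id."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical sample events; real log lines would carry more fields.
SAMPLE_LOG = [
    '{"user": "u1", "action": "read", "story_id": "s1"}',
    '{"user": "u2", "action": "read", "story_id": "s1"}',
    '{"user": "u2", "action": "share", "story_id": "s2"}',
    '{"user": "u3", "action": "open_app"}',
]

if __name__ == "__main__":
    print(top_stories(run_job(SAMPLE_LOG), 2))
```

Tuples and dictionaries carry the keys and values directly, which is the convenience the paragraph above refers to.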
Beyond curating our top story feeds, we’ve recently started developing several exciting new user-facing features using Amazon Elastic MapReduce, mrjob, and our data on Amazon S3. As part of our last major release, we announced a new feature called Smart Dock, which recommends new sources to millions of users based on their reading history. This feature makes it much easier to discover relevant content and has been extremely well received by our users. Our newest full-time backend engineer, Leonard Wei, led this project and built it almost entirely on AWS.
Our recommendations pipeline processes over 250 GB of the raw log data we have in Amazon S3. We reduce this data down to about 1 GB of relevant features via an Amazon Elastic MapReduce job. We then use an LDA-based approach to predict which sources a user is likely to add next. We run this portion of the pipeline on AWS using a single High-CPU Extra Large instance.
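One plausible shape for the reduced features is a per-user count of reads from each source, since count vectors are the usual input to topic models. The sketch below shows that reduction on a handful of events. The (user_id, source_id) event format, the function names, and the sample identifiers are all assumptions made for illustration; the actual features in our pipeline are not described in this post.

```python
from collections import Counter, defaultdict


def build_user_features(events):
    """Collapse (user_id, source_id) read events into per-user source counts."""
    features = defaultdict(Counter)
    for user_id, source_id in events:
        features[user_id][source_id] += 1
    return features


def to_count_matrix(features):
    """Turn nested counts into a dense users x sources matrix of integers."""
    users = sorted(features)
    sources = sorted({s for counts in features.values() for s in counts})
    # A Counter returns 0 for sources a user never read.
    matrix = [[features[user][source] for source in sources] for user in users]
    return users, sources, matrix


# Hypothetical events; identifiers are placeholders, not real data.
SAMPLE_EVENTS = [
    ("u1", "source_a"),
    ("u1", "source_a"),
    ("u1", "source_b"),
    ("u2", "source_b"),
]

if __name__ == "__main__":
    users, sources, matrix = to_count_matrix(build_user_features(SAMPLE_EVENTS))
    for user, row in zip(users, matrix):
        print(user, dict(zip(sources, row)))
```

A reduction of this kind is what lets the modeling step fit on a single machine: many raw events collapse into one small row per user.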
Once the model is generated for each user and some additional post-processing is complete, we upload each user’s recommended sources to our serving infrastructure on App Engine. From there, the recommendations are combined with the latest catalog data and sent to the app to be presented in the Smart Dock. One run of the whole pipeline costs us a very reasonable $20 of AWS compute time.
Beyond event logging, analytics and recommendations, we also use AWS for lots of smaller tasks that just make sense to run directly on one or more machines, rather than through a higher-level service. Some examples include parsing HTML pages with node-readability and continuously monitoring all of our systems to make sure we’re aware of any problems. Recently, we also started working on a new real-time analytics infrastructure based on Redis, which will leverage the High-Memory instances Amazon EC2 offers.
To learn more about Pulse’s infrastructure, check out some of the backend team’s other posts. Our recent article on how we scaled up for the Kindle Fire launch complements this one and talks more about our content serving, client APIs and Pulse.me web hosting.