Backend Tips – The Free CDN

New Blog Post Series

This is the first in a series of blog posts in which we will offer a peek into the some of the challenges we tackle on the Backend Team and discuss some tips and tricks we have discovered. These posts will focus on the ways in which we use GAE and AWS to build simple features that have helped us to deliver an amazing product. We plan to dive a little deeper into topics we’ve covered before, as well as highlighting some new ones. Upcoming topics will include GAE MapReduce, Redis, Google Cloud Storage, and duplicate detection via TF-IDF. Our first entry in the series discusses how to use Google’s edge cache as a free content delivery network (CDN).

The Free CDN

At the end of last year, we briefly mentioned Google’s edge cache as a useful feature as part of our guest post on the App Engine blog. Since this is one of our favorite services, I’d like to take a few minutes to explain it in more detail. It is an extremely simple feature that has the potential to significantly improve content serving latency and can be very valuable in terms of cost savings over other CDNs. Hopefully it will be clear by the end of this post why you should think about using it for your next project.

Content Delivery Networks

Content Delivery Networks (CDNs) offer several benefits that are typically desired for both web and mobile apps. They are designed to cache content on many geographically distributed servers, as close to the end user as possible, thereby minimizing latency for requests to the cached content. There are several major CDN providers, but the big ones that come to mind are Akamai and Amazon’s Cloudfront. CDNs vary in quality and price, but generally one should expect to pay a premium for this type of service.

Google’s Edge Cache (aka. CDN)

It turns out that if you’re using Google App Engine (or other Google services like the newly announced Google Cloud Storage) and you configure things correctly, you get the same service for free. By simply setting public cache control headers wherever possible, you allow Google’s edge caches to serve unchanged content directly to users. Here’s an example of a set of response headers that will activate the cache:

 Cache-Control: public, max-age=900, must-revalidate

The most important component of the header is the word ‘public’. It tells Google’s network that the content in this response is not specific to a particular user or private in any way, so it’s safe to cache it as aggressively as possible. ‘max-age’ allows you to decide how often this content will be refreshed from your servers, and ‘must-revalidate’ is just telling the server (or client cache) to strictly follow this timeout.

This technique has been mentioned in at least one Google IO talk, but for some reason hasn’t been widely publicized. Because of the scale of Google’s network, this is perhaps the best CDN available. Best of all, there is no cost for this caching. It’s actually a win-win for both you and Google, since it minimizes the traffic that has to cross their internal networks and servers.

At Pulse we use this feature very heavily. It lets us serve high quality, mobile optimized images at < 50ms latency, while also saving us lots of App Engine instance hours by preventing these requests from hitting our frontend servers. As you can see from the graph below, for this particular App Engine app, we are serving the majority of requests out of Google’s edge cache (labeled red). I encourage you to try it out. It’s almost too easy to be true! If you have questions, feel free to leave comments below or ping me @gregbayer.

Scaling to 10M on AWS

This post complements the recent article about Pulse on the Amazon Web Services Blog.

As Pulse crosses the 10M user mark (up 10x since last year), we’d like to share a bit more about how we’ve built and scaled Pulse’s backend systems. In this article we will discuss the important role AWS plays in our infrastructure.

Today there are more infrastructure choices than ever. They include running your own hardware, leasing virtual machines, subscribing to higher level platforms and software services, and often a combination of all of the above. It is important to consider the trade-offs and choose the right tool for the job. In our experience, AWS provides an exceptional capability to build systems as close to the metal as you like, while still avoiding the burden and inelasticity of owning your own hardware. It also provides some useful abstraction layers and services above the machine level.

Event Logging

Amazon’s Elastic Compute Cloud (Amazon EC2) instances make it easy to run low level processes that can write directly to disk, and its Amazon Simple Storage System (Amazon S3) provides great long-term file storage. This combination makes an excellent choice for most flat-file logging systems. At Pulse, we’ve built a simple logging system that is blazingly fast on one machine and easy to scale horizontally. Using Tornado to handle HTTP requests and Scribe to buffer and write files, we are able to store logs at near-disk speeds (more than 50 MB/s per instance). Once the logs have been written to disk, we regularly move them to Amazon S3 for reliable long-term storage and easy access. Amazon S3′s low cost and scalable nature allows us to save all of our data without worrying too much about size.

By provisioning one of Elastic Load Balancer (ELB) instances, we are able to easily divide our load over as many logging servers as necessary and automatically direct load away from failing machines. Provisioning these machines in multiple AWS availability zones also makes it easy to achieve fault tolerance.

Pulse’s implementation easily handles millions of events per hour and has been running continuously for over a year without any downtime.

Data Analytics

Another major reason we decided to build our event logging system on Amazon S3 was to leverage Amazon Elastic MapReduce  and Apache Hive. Now that our data is getting bigger, it is much more efficient to query with a cluster of machines. Without having to configure and maintain our own Hadoop cluster or having to move our data from Amazon S3, AWS allows us to quickly spin up a cluster of 10s to 100s of machines.

With a large cluster, we are able to query a significant portion of our data in minutes instead of hours or days. Because the AWS cluster can simply be turned off when we are done, the cost to run big queries is usually quite reasonable. Consider a cluster of 100 m1.large machines. A set of queries that takes 45 minutes to run on this cluster would cost us $11 – $34 (depending on whether we bid on spot instances or use regular on-demand instances). Assuming you’re not running jobs all the time, this is preferable to the cost of buying and continuously maintaining your own cluster.

Apache Hive makes this process even easier by taking simple SQL queries and converting them into what would often be relatively complex, multi-step Amazon Elastic MapReduce jobs. These SQL queries can be run directly by our business team, avoiding the need for engineering support.

For batch jobs, such as regularly extracting the top read and shared stories, the Pulse backend team likes to use mrjob, an open source framework developed at Yelp. Mrjob allows us to write mappers and reducers in Python (instead of Java) and integrates seamlessly with Amazon Elastic MapReduce. Python is our language of choice because it is more consistent with our codebase and it provides a simple representation for common MapReduce data structures such as tuples and dictionaries. Because our jobs are usually IO-bound, the interpreted runtime doesn’t slow things down much.

Recommendations

Beyond curating our top story feeds, we’ve recently started developing several exciting new user-facing features using Amazon Elastic MapReduce, mrjob, and our data on Amazon S3. As part of our last major release, we announced a new feature called Smart Dock, which recommends new sources to millions of users based on their reading history. This feature makes it much easier to discover relevant content and has been extremely well received by our users. Our newest full-time backend engineer, Leonard Wei, led this project and built it almost entirely on AWS.

Our recommendations pipeline processes over 250GB of the raw log data we have in Amazon S3. We reduce this data down to about 1GB of relevant features via an Amazon Elastic MapReduce job. We then use an LDA-based approach to predict which sources a user is likely to add next. We run this portion of the pipeline on AWS using a single High-CPU Extra Large instance.

Once the model is generated for each user and some additional post-processing is complete, we upload each user’s recommended sources to our serving infrastructure on App Engine. From there, the recommendations are combined with the latest catalog data and sent to the app to be presented in the Smart Dock. One run of the whole pipeline costs us a very reasonable $20 of AWS compute time.

Other Tasks

Beyond event logging, analytics and recommendations, we also use AWS for lots of smaller tasks that just make sense to run directly on one or more machines, rather than through a higher level service. Some examples include parsing html pages with node-readability and continuously monitoring all of our systems to make sure we’re aware of any problems. Recently, we also started working on a new real-time analytics infrastructure based on Redis, which will leverage the High-Memory instances Amazon EC2 offers.

To learn more about Pulse’s infrastructure check out some of the backend team’s other posts. Our recent article on how we scaled up for the Kindle Fire launch compliments this one and talks more about our content serving, client APIs and Pulse.me web hosting.

 

Scaling with the Kindle Fire

This post was originally published as a guest post on the Google App Engine blog.

As part of the much anticipated Kindle Fire launch, Pulse was announced as one of the only preloaded apps. When you first unbox the Fire, Pulse will be there waiting for you on the home row, next to Facebook and IMDB!

Scale

The Kindle Fire is projected to sell over five million units this quarter alone. This means that those of us who work on backend infrastructure at Pulse have had to prepare for nearly doubling our user-base in a very short period. We also need to be ready for spikes in load due to press events and the holiday season.

Architecture

As I’ve discussed previously on the Pulse Engineering Blog, Pulse’s infrastructure has been designed with scalability in mind from the beginning. We’ve built our web site and client APIs on top of Google App Engine, which has allowed us to grow steadily from 10s to many 1000s of requests per second, without needing to re-architect our systems.

While restrictive in some ways, we’ve found App Engine’s frontend serving instances (running Python in our case) to be extremely scalable, with minimal operational support from our team. We’ve also found the datastore, memcache, and task queue facilities to be equally scalable.

Pulse’s backend infrastructure provides many critical services to our native applications and web site. For example, we cache and serve optimized feed and image data for each source in our catalog. This allows us to minimize latency and data transfer and is especially important to providing an exceptional user experience on limited mobile connections. Providing this service for millions of users requires us to serve 100Ms of requests per day. As with any well designed App Engine app, the vast majority of these requests are served out of memcache and never hit the datastore. Another useful technique we use is to set public cache control headers wherever possible, to allow Google’s edge cache (shown as cached requests on the graph below) and ISP / mobile carrier caches to serve unchanged content directly to users.

Costs

Based on App Engine’s projected billing statements leading up to the recent pricing changes, we were concerned that our costs might increase significantly. To prepare for these changes and the expected additional load from Kindle Fire users, we invested some time in diagnosing and reducing these costs. In most cases, the increases turned out to be an indicator of inefficiencies in our code and/or in the App Engine scheduler. With a little optimization, we have reduced these costs dramatically.

The new tuning sliders for the scheduler make it possible to rein in overly aggressive instance allocation. In the old pricing structure, idle instance time wasn’t charged for at all, so these inefficiencies were usually ignored. Now App Engine charges for all instance time by default. However, any time App Engine runs more idle instances than you’ve allowed, those hours are free. This acts as a hint to the scheduler, helping it reduce unneeded idle instances. By doing some testing to find the optimal cost vs spike latency tolerance and setting the sliders to those levels, we were able to reduce our frontend instance costs to near original levels. Our heavy usage of memcache (which is still free!) also helps keep our instance hours down.

Since datastore operations used to be charged under the umbrella of CPU hours, it was difficult to know the cost of these operations under the old pricing structure. This meant it was easy to miss application inefficiencies, especially for write-heavy workloads where additional indexes can have a multiplicative effect on costs. In our case, the new datastore write operations metric led us to notice some inefficiencies in our design and a tendency to overuse indexes. We are now working to minimize the number of indexes our queries rely on, and this has started to reduce our write costs.

Preparing for the Kindle Fire Launch

We took a few additional steps to prepare for the expected load increase and spikes associated with the Fire’s launch. First, we contacted App Engine’s support team to warn them of the expected increase. This is recommended for any app at or near 10,000 requests per second (to make sure your application is correctly provisioned). We also signed up for a Premier account which gets us additional support and simpler billing.

Architecturally, we decided to split our load across three primary applications, each serving different use cases. While this makes it harder to access data across these applications, those same boundaries serve to isolate potential load-related problems and make tuning simpler. In our case, we were able to divide certain parts of our infrastructure, where cross application data access was less important and load would be significant. Until App Engine provides more visibility into and control of memcache eviction policies, this approach also helps prevent lower priority data from evicting critical data.

I’m hopeful that in the near future such division of services will not be required. Individually tunable load isolation zones and memcache controls would certainly make it a lot more appealing to have everything in a single application. Until then, this technique works quite well, and helps to simplify how we think about scaling.

Introducing Livecount

Analytics is ideally a combination of real-time and batch processing. Batch processing, with something like Hadoop, is great for digging into large amounts of past data and asking questions that cannot be anticipated. However, when it is known ahead of time that certain aggregates will be required, the best solution is often to count each event as it happens. Most analytics dashboards are backed by this kind of real-time data.

Nine months ago, Pulse was just starting to experiment with real-time event counting. We didn’t have much server infrastructure yet and were using Google AppEngine to host a few simple APIs. We started reading about the various ways to implement counters on appengine and came across two frequently recommended solutions.

Existing Solutions

Sharded counters was the first approach we tried. To compensate for our write-heavy counter workload, we split each counter into several datastore entities. This allowed writes to be parallelized and avoided the single entity write limits of the AppEngine datastore. To read a single counter value, we queried all shards and summed up the values. This worked well, but required hitting the datastore on every request. Our tests still showed unacceptably high latencies. More shards sped things up a bit, but performance was always bottlenecked on the datastore.

To avoid datastore latencies, counting in memory seemed like an obvious solution. Given a distributed key-value store like Memcached, counting in memory should be quite scalable, while improving both read and write performance. Of course, memcache data is vulnerable to loss when a server goes down, as well as being subject to eviction if available memory runs short. Unfortunately, the eviction problem is amplified in shared environments where it is always possible to be evicted by memory pressure from another app. While we were willing to accept some risk of data loss, our tests showed the probability was too high on AppEngine’s memcache.

Implementing Write-behind Counters

Livecount was developed to leverage the performance of AppEngine’s memcache, while making an effort to maintain the durability of counts. AppEngine’s task queues turned out to be perfect for the job. Each time a count is updated (or optionally when a multiple is reached), Livecount creates a worker task to write that count from the memcache to the datastore in the background. If the count is ever evicted, it is reloaded from the datastore on the next read or write.

Performance

Since counter updates are usually written back to the datastore within seconds, the risk of loss is minimal. Write performance is excellent, since only the memcache must be updated before completing a request. Most reads can also be served from the memcache. Load on the datastore is further reduced by storing a dirty flag along with each memcached count. If more increment events come in than can be written back in real time, only one write is needed to update the datastore with the latest count. After a successful write, the dirty bit is cleared and the other backlogged write tasks for that counter are skipped.

Using Livecount

This simple solution has allowed Pulse’s backend to easily scale to counting hundreds of events per second, with minimal cost and complexity. Livecount’s API requires nothing more than a simple string counter name.

from livecount.counter import load_and_increment_counter

load_and_increment_counter(name=url)

For more advanced use-cases, namespaces are supported for keep counters organized and easy to query. Recently, we also added support for time period fields to help support hourly/daily/weekly/monthly aggregates. Here’s a more advanced example.

from livecount.counter import PeriodType, load_and_increment_counter

load_and_increment_counter(name=url, period=datetime.now(), \
period_types=[PeriodType.DAY, PeriodType.WEEK], namespace="starred", delta=1)

Livecount is open-source and easily deployable on Google AppEngine. If you have something to count, give Livecount a try. We’d love to hear your thoughts or suggestions for improvement!

Using Data Analysis to Discover Top Stories

For the last six months, Pulse has curated Top Stories feeds to help users discover interesting content when it is most relevant. Top Stories has been one of our most popular features, receiving millions of update requests per day. To help start off our new engineering blog, I’ll share some of the technical details behind how we build these feeds. Yesterday we also announced the ability to follow our Top Stories by category via Twitter. 

We start building the Top Stories feeds by analyzing anonymous data on what stories are currently being read and shared the most through Pulse. Since the volume of data coming from Pulse is always growing, we’ve built several highly scalable systems for collection and analysis.  Most of these systems are currently built on Amazon Web Services (AWS).

Collecting Data

The first system is built on Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3).  It uses Tornado and Scribe to efficiently log data from HTTP requests to disk.  This data is then moved to S3 on a regular basis for permanent storage and analysis.  All data is stored in two separate S3 buckets for additional reliability. To make the system horizontally scalable, we replicate the Tornado/Scribe stack on multiple EC2 instances and use an Elastic Load Balancer (ELB) to distribute our request load across these instances. The ELB also provides a heartbeat service which automatically redirects requests away from any malfunctioning instances.

Analyzing with Hadoop

Once we have our data on S3, it is important for us to be able to extract breaking stories quickly.  To achieve this on our rapidly growing dataset, we built our analysis engine on top of Elastic Map Reduce (EMR) and Hadoop. Using a set of relatively simple mappers and reducers based on the mrjob python framework from Yelp, we are able to quickly scan through hours or days worth of data to surface top read and shared stories. We regularly run these Hadoop jobs directly against our data on S3. To improve performance, this data could be staged on Elastic Block Storage (EBS) first, but in practice we haven’t found this necessary. Since our MapReduce operations are very parallelizable, we can get faster results simply by spinning up a few extra EMR instances. The Amazon EMR interface makes this process extremely simple and cost effective in comparison to running a dedicated Hadoop cluster on EC2 or our own hardware.

Serving Feeds

In contrast to our data collection and analysis infrastructure, most of our feed serving and client api request handling is built on Google AppEngine (GAE).  When our EMR jobs finish, they push the final top stories for each of our categories to the datastore on GAE. Here we achieve horizontal scalability via GAE’s automatic web servers provisioning based on the current request load, as well as it’s solid non-relational datastore and memcache.

This data is then normalized against the current ratio of reading to sharing in the given category and used to create a ranked feed of the current top stories that quickly surfaces breaking news.  The feeds for each category are then memcached and efficiently served to users via the Pulse application on each mobile platform.

Check out our post on the Pulse Blog to follow our Top Stories on Twitter!