Backend Tips – Conquering Big Tables with MapReduce

mapreduceAs some of our readers already know, Pulse uses Google App Engine (GAE) to serve content from thousands of publishers to millions of users. We have been very happy with the minimal operational overhead App Engine requires and were thrilled to see App Engine scale without hiccups when we were preloaded on the Kindle Fire.

As a backend engineer, it is inevitable that some engineering tasks involve heavy data processing. In our case, this often happens on data in the App Engine datastore. We have always relied on the very flexible and easy remote shell to do this type of work. However, this approach is too slow for many use cases, especially those touching millions of records.

For larger tasks, App Engine’s built-in MapReduce is often the right tool. It allows us to quickly operate on millions of datastore entities in a very short amount time. To give a few examples, we use MapReduce: to quickly migrate existing data from legacy datastore models to new models due to architectural changes, to perform load testing on our system with hundreds of shards simulating millions of users, and to inform our users of Pulse’s latest updates by sending out millions of emails or push notifications.

Data Migration

When making product changes, we sometimes move large amounts of data away from a legacy django-nonrel model. The speed of MapReduce ensures that minimal transition time is required and that the transition is painless enough that it is preferred over simply living with the wrong data model.

Load Testing

We use MapReduce to simulate load tests that would otherwise be unrealistic if we only used a few physical machines. A simple load test might use MapReduce to make thousands of requests within a very short period. These requests can simulate millions of users using Pulse throughout a day.

Lessons Learned

You should plan and test any large Map Reduce task that will consume quota-limited resources before running the full job. It’s a good idea to estimate the amount of datastore reads/writes, url fetch calls, and other API requests beforehand. In some cases, it may be necessary to contact App Engine support to ask for increased quotas (for those that cannot be increased in the admin console).

For those using a framework on top of App Engine, make sure you initialize at the top of your handler file (see below). In some cases, you may also need to add the initialization code to the mapreduce module (at the top of mapreduce/main.py). In Django-nonrel, the init line you’ll need looks like this.

from djangoappengine import main

Getting Started

For those of you new to Map Reduce on App Engine, here’s how to create jobs of your own. The App Engine team has made it pretty easy.

Download the mapreduce library via svn and add it to your app:

 svn checkout http://appengine-mapreduce.googlecode.com/svn/trunk/python/src/mapreduce

Register the MapReduce handler in your app.yaml:

handlers:
- url: /mapreduce(/.*)?
  script: mapreduce/main.py
  login: admin

url – The MapReduce endpoints.
script – The handler file containing the task you want to perform.
login – Restricts access to app admins only.

Create the handler file you specified in the previous step (mr_email_users.py) and pass in the model you want to map over:

def run(user_entity):
    send_email(user_entity.email)

Note: See the official Map Reduce guide below for more advanced options & examples.

Register and configure the MapReduce job in mapreduce.yaml:

mapreduce:
- name: MapReduce Email Users Job
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: mr_email_users.run
    params:
    - name: entity_kind
      default: user
    - name: shard_count
      default: 50
    - name: processing_rate
      default: 1000

input_reader – The input reader for this job; you can find other types here.
handler – The entry point to this MapReduce job.
entity_kind – The datastore model being mapped over.
shard_count – The number of concurrent mapper workers to run at once.
processing_rate – The aggregated maximum number of inputs processed per second by all mappers. Can be used to avoid using up all quota, interfering with online users.

Access the MapReduce admin console panel to view and launch jobs:

http://(your app name).appspot.com/mapreduce/status

More Info

You may be interested in the official MapReduce Get Started Guide for Python or Java. In addition, this 2011 Google IO talk includes many new useful MapReduce tips. Please leave any questions and comments below, and we will be happy to answer / discuss!

Backend Tips – Google Cloud Storage

Google App Engine’s datastore meets most of our backend storage needs, but we sometimes find ourselves limited by the maximum entity size of one megabyte. One option for storing larger files is to build a separate system on top of Amazon S3. A downside of this approach, however, is that we cannot take advantage of Google’s edge cache, which acts as a free CDN.

A second option is the new Google Cloud Storage service. Google Cloud Storage is the unofficial successor to the Google App Engine Blobstore, and both services are built on the same underlying infrastructure. Yet unlike the Blobstore, which is bundled with App Engine, Google Cloud Storage is a standalone service for storing and managing data. As such, Cloud Storage is Google’s attempt to roll out an Infrastructure as a Service (IaaS) offering that can compete with Amazon S3.

Getting Started

In order to use Google Cloud Storage with App Engine, the first step is to grant your application access to your storage bucket. The documentation instructs you to add the application’s service account name (application-id@appspot.gserviceaccount.com) as a team member to your Google APIs console project.

However, since we created our project with a Google Apps account, this takes bit more effort.  Only users from our domain (xxx@yourdomain.com) could be added to the team via the console. The solution is to use the GSUtil command line tool to edit the storage bucket’s Access Control List (ACL).

Run the following command to retrieve your bucket’s current ACL: gsutil getacl gs://bucketname > acl.txt. Then add an entry that looks like this:

<Entry>
<Scope type="UserByEmail">
<EmailAddress>application-id@appspot.gserviceaccount.com</EmailAddress>
<Name>Service Account</Name>
</Scope>
<Permission>FULL_CONTROL</Permission>
</Entry>

Finally, run this command to set the new ACL: gsutil setacl acl.txt gs://bucketname.

Storing Data

Google provides an experimental API to integrate Cloud Storage with App Engine. This API allows for reading and writing of files to a storage bucket. While testing, I had already preloaded some test files into our bucket using the (barebones, but functional) Cloud Storage Manager web application. I could also have used the GSUtil tool.

Moving forward, we wanted to start loading files programmatically from within App Engine. The API documentation clearly explains how to create, write to, save, and read from Cloud Storage objects. Note that the function provided by the API to create a Google Cloud Storage object —files.gs.create() — takes a number of useful parameters. For instance, this is where you can specify the ACL and Cache-Control header for the object.

The documentation does not address the case in which the object you wish to save is a user upload. Storing uploaded files in a bucket can be accomplished using the Blobstore, as suggested by this StackOverflow answer. The blobstore_helper module is useful for adapting this code for Django.  Simply replace self.get_uploads('file') with blobstore_helper.get_uploads(request, 'file') in order to retrieve the uploaded files.

Serving Content

The Cloud Storage API does not offer a way to serve files directly from a storage bucket. Instead, you can use the Blobstore API to create a url that points at your file.

First, generate a blob key for the Cloud Storage object using the Blobstore API’s create_gs_key() function. Then serve the object as you would a traditional blobstore object. The example given for the Blobstore Python API assumes use of Google’s webapp framework, which provides helper functions (such as self.send_blob()) that obscure the underlying implementation. This makes it a little tricky to understand how to port the code to a different framework, but once again the blobstore_helper module offers some insight. The module defines its own send_blob function, in which the key line of code is response[blobstore.BLOB_KEY_HEADER] = str(blob_key). Essentially, if you put a special header in the response containing the blob key, then App Engine will automatically fill the body of the response with the content of the blob.

To properly serve the blob, it is also necessary to set a correct Content-Type header for the response. Although the Cloud Storage REST API does support retrieving an object’s metadata, it seems that the API for App Engine does not. Currently, we rely on Python’s mimetypes module, which can guess content type from a filename: response['Content-Type'] = mimetypes.guess_type(filename)[0].

An alternative approach to serving files from Cloud Storage, which applies to images only, is to use App Engine’s Image API. As of App Engine version 1.7.0, it is possible to use the get_serving_url() function with Cloud Storage objects. Simply generate the blob key as before, and plug into this function to generate a url for the image. One benefit of using this approach is that the serving url supports cropping and resizing on the fly by supplying optional parameters.

We will continue to investigate the best practices for using Google Cloud Storage with App Engine as a service for storing and serving large files. For others who might be interested, there was a helpful session at Google IO, entitled Storing Your Application’s Data in the Google Cloud, that covers the basics of this new service. Of course, there are other options to consider as well, such as the Blobstore or Amazon S3. It remains to be seen which service will best meet our needs, but we’re glad that there is now a strong option on the Google side.

Backend Tips – The Free CDN

New Blog Post Series

This is the first in a series of blog posts in which we will offer a peek into the some of the challenges we tackle on the Backend Team and discuss some tips and tricks we have discovered. These posts will focus on the ways in which we use GAE and AWS to build simple features that have helped us to deliver an amazing product. We plan to dive a little deeper into topics we’ve covered before, as well as highlighting some new ones. Upcoming topics will include GAE MapReduce, Redis, Google Cloud Storage, and duplicate detection via TF-IDF. Our first entry in the series discusses how to use Google’s edge cache as a free content delivery network (CDN).

The Free CDN

At the end of last year, we briefly mentioned Google’s edge cache as a useful feature as part of our guest post on the App Engine blog. Since this is one of our favorite services, I’d like to take a few minutes to explain it in more detail. It is an extremely simple feature that has the potential to significantly improve content serving latency and can be very valuable in terms of cost savings over other CDNs. Hopefully it will be clear by the end of this post why you should think about using it for your next project.

Content Delivery Networks

Content Delivery Networks (CDNs) offer several benefits that are typically desired for both web and mobile apps. They are designed to cache content on many geographically distributed servers, as close to the end user as possible, thereby minimizing latency for requests to the cached content. There are several major CDN providers, but the big ones that come to mind are Akamai and Amazon’s Cloudfront. CDNs vary in quality and price, but generally one should expect to pay a premium for this type of service.

Google’s Edge Cache (aka. CDN)

It turns out that if you’re using Google App Engine (or other Google services like the newly announced Google Cloud Storage) and you configure things correctly, you get the same service for free. By simply setting public cache control headers wherever possible, you allow Google’s edge caches to serve unchanged content directly to users. Here’s an example of a set of response headers that will activate the cache:

 Cache-Control: public, max-age=900, must-revalidate

The most important component of the header is the word ‘public’. It tells Google’s network that the content in this response is not specific to a particular user or private in any way, so it’s safe to cache it as aggressively as possible. ‘max-age’ allows you to decide how often this content will be refreshed from your servers, and ‘must-revalidate’ is just telling the server (or client cache) to strictly follow this timeout.

This technique has been mentioned in at least one Google IO talk, but for some reason hasn’t been widely publicized. Because of the scale of Google’s network, this is perhaps the best CDN available. Best of all, there is no cost for this caching. It’s actually a win-win for both you and Google, since it minimizes the traffic that has to cross their internal networks and servers.

At Pulse we use this feature very heavily. It lets us serve high quality, mobile optimized images at < 50ms latency, while also saving us lots of App Engine instance hours by preventing these requests from hitting our frontend servers. As you can see from the graph below, for this particular App Engine app, we are serving the majority of requests out of Google’s edge cache (labeled red). I encourage you to try it out. It’s almost too easy to be true! If you have questions, feel free to leave comments below or ping me @gregbayer.