As some of our readers already know, Pulse uses Google App Engine (GAE) to serve content from thousands of publishers to millions of users. We have been very happy with the minimal operational overhead App Engine requires and were thrilled to see App Engine scale without hiccups when we were preloaded on the Kindle Fire.
As backend engineers, we inevitably take on tasks that involve heavy data processing. In our case, this often means operating on data in the App Engine datastore. We have always relied on the flexible and convenient remote API shell for this type of work. However, that approach is too slow for many use cases, especially those touching millions of records.
For larger tasks, App Engine’s built-in MapReduce is often the right tool, letting us operate on millions of datastore entities in a short amount of time. To give a few examples, we use MapReduce: to migrate existing data from legacy datastore models to new models after architectural changes, to load test our system with hundreds of shards simulating millions of users, and to tell our users about Pulse’s latest updates by sending out millions of emails or push notifications.
When making product changes, we sometimes need to move large amounts of data out of a legacy django-nonrel model. MapReduce’s speed keeps the transition window minimal and makes the migration painless enough that it beats simply living with the wrong data model.
We also use MapReduce for load tests that would be unrealistic to run from only a few physical machines. A simple load test might use MapReduce to issue thousands of requests within a short period, simulating a full day’s worth of traffic from millions of Pulse users.
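As a rough illustration (the function, model, and URL below are hypothetical, not our actual code), a load-testing mapper might issue one URL Fetch call per mapped entity:

```python
# Hypothetical load-test mapper sketch: each shard maps over user entities
# and replays a representative request via App Engine's URL Fetch service.
from google.appengine.api import urlfetch

def simulate_user_request(user):
    # Called once per mapped user entity; the endpoint below is illustrative.
    urlfetch.fetch('http://your-app.appspot.com/api/feed?user=%d'
                   % user.key().id(), deadline=10)
```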
You should plan and test any large MapReduce task that will consume quota-limited resources before running the full job. It’s a good idea to estimate the number of datastore reads/writes, URL Fetch calls, and other API requests beforehand. In some cases, it may be necessary to contact App Engine support to ask for increased quotas (for those that cannot be increased in the admin console).
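For example (every number below is made up purely for illustration), a quick back-of-the-envelope estimate might look like:

```python
# Back-of-the-envelope quota estimate; all numbers here are illustrative.
entities = 5 * 1000 * 1000        # datastore entities the job will map over
reads_per_entity = 1              # one get per mapped entity
writes_per_entity = 1             # one put if the mapper rewrites the entity

datastore_reads = entities * reads_per_entity    # 5,000,000 reads
datastore_writes = entities * writes_per_entity  # 5,000,000 writes

# If processing_rate is capped at 1,000 entities/sec across all shards,
# the job needs at least entities / rate seconds of wall-clock time.
min_runtime_hours = entities / 1000.0 / 3600     # roughly 1.4 hours
```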
If you use a framework on top of App Engine, make sure you initialize it at the top of your handler file (see below). In some cases, you may also need to add the same initialization code at the top of the mapreduce module’s handler (mapreduce/main.py). In Django-nonrel, the init line you’ll need looks like this.
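If memory serves, that initialization is djangoappengine’s bootstrap call; treat this as a sketch and verify against your framework’s version:

```python
# Bootstrap the Django environment before any models are imported.
# (setup_env is djangoappengine's bootstrapping call; check your version.)
from djangoappengine.boot import setup_env
setup_env()
```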
For those of you new to MapReduce on App Engine, here’s how to create jobs of your own. The App Engine team has made it pretty easy.
Download the mapreduce library via svn and add it to your app:
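At the time of writing, the library lived in the appengine-mapreduce project on Google Code; the checkout looked roughly like this (paths may differ for your SDK version):

```shell
# Check out just the Python mapreduce package into your app's root directory.
svn checkout http://appengine-mapreduce.googlecode.com/svn/trunk/python/src/mapreduce mapreduce
```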
Register the MapReduce handler in your app.yaml:
```yaml
- url: /mapreduce(/.*)?
  script: mapreduce/main.py
  login: admin
```
url – The MapReduce endpoints.
script – The mapreduce library’s bundled handler script (mapreduce/main.py), which serves the job endpoints and admin console.
login – Restricts access to app admins only.
Create your handler file (mr_email_users.py) and define the mapper function that will be called for each entity of the model you map over:
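A minimal version of that handler might look like the following sketch (the function name, sender address, and message are placeholders, not our actual code):

```python
# mr_email_users.py -- mapper sketch; names and addresses are illustrative.
from google.appengine.api import mail

def email_user(user):
    """Called once for each mapped user entity."""
    mail.send_mail(sender='updates@your-app.appspotmail.com',
                   to=user.email,
                   subject="What's new in Pulse",
                   body='...')
```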
Note: See the official MapReduce guide below for more advanced options & examples.
Register and configure the MapReduce job in mapreduce.yaml:
```yaml
mapreduce:
- name: MapReduce Email Users Job
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: mr_email_users.email_user  # mapper function name is illustrative
    params:
    - name: entity_kind
      default: models.UserAccount      # model path is illustrative
    - name: shard_count
      default: 32
    - name: processing_rate
      default: 100
```
input_reader – The input reader for this job; the library ships with several other reader types.
handler – The entry point to this MapReduce job.
entity_kind – The datastore model being mapped over.
shard_count – The number of mapper shards to run concurrently.
processing_rate – The maximum aggregate number of inputs processed per second across all mappers. Use it to keep a job from exhausting quota or interfering with online users.
Access the MapReduce admin console panel to view and launch jobs:
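With the handler registered in app.yaml above, the console is typically served from the library’s status page (substitute your own app id):

```
http://<your-app-id>.appspot.com/mapreduce/status
```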
You may be interested in the official MapReduce Getting Started Guide for Python or Java. In addition, the 2011 Google I/O talk on App Engine MapReduce includes many useful tips. Please leave any questions and comments below, and we will be happy to answer them!