App Engine, meet Redis on AWS

Since snappy performance is critical to providing a good user experience, we try to keep the latency of all common Pulse backend API requests under 500ms. Most of the time we achieve this by using Google App Engine’s memcache to cache all data which might be reused by many requests. Less commonly requested data is pulled from the datastore, resulting in such requests taking a bit longer than we like.

When these slower requests are rare, we accept them. However, for features that access a broad range of data, the likelihood of missing the cache increases. Some data required for a request may be cached, but some will almost always not be, resulting in high latency for most requests.

To implement these types of features efficiently, one option is to dramatically increase the size of our memcache. This would allow us to keep all required data in cache. However, it would be expensive and is somewhat at odds with the LRU cache policy we like to use for other features. This approach is also currently unsupported on Google App Engine (since memcache capacity is not directly tunable).

We investigated several other options and finally settled on using Redis as a persistent, in-memory datastore. Redis strikes a great balance between simplicity, powerful primitives, and proven stability. Instead of increasing our memcache or switching entirely to a larger in-memory store, we created a second, Redis-based system on AWS. This system is specifically designed to hold data that must be available at in-memory speeds (with no expected misses). Achieving this is more expensive than providing a similar LRU cache (which could be smaller), so we reserve it for features that require such guarantees.

Architecture

We wanted to use Redis, but also to make sure that our implementation was both scalable and easily recoverable in the case of failure. From here on out, we will discuss the infrastructure and tools we use to build this system. Here’s a visual overview of the system:

 

Amazon Elastic Load Balancer

This is a really nice utility that AWS gives us. We set up an ELB that points to as many EC2 machines as we need (we’ll call them redis frontends). The ELB balances requests across those machines round-robin, detects failing machines, warns us, and shifts the load to the machines that are still running. Some important dos:

  1. The load balancer can handle HTTPS requests, so use them! Some security is always better than none.
  2. Make sure that the machines you provide to the load balancer are distributed among the different availability zones that AWS offers within your region.
  3. You can also scale dynamically by putting instances into an Auto Scaling group and giving the group to the load balancer.


HA Proxy

Our redis frontend machines use Tornado as the webserver. Tornado is fast (great!) and single threaded. A single-threaded server prevents many headaches, scales predictably, and has minimal overhead, but it doesn’t benefit from multiple cores on a machine. The larger Amazon machines have multiple cores, so we really want to use that to our advantage. Enter HA Proxy, a nice utility that allows you to build a reverse proxy. Here’s a barebone version of the configuration we use:

global
    maxconn 1024
    daemon
    log 127.0.0.1 local0

frontend load_balancer
    # We process all requests hitting port 8080
    bind *:8080
    # We will point them to the backend we describe later
    default_backend tornado_servers
    mode http
    option httplog
    option dontlognull
    # Client inactivity timeout in ms (older HAProxy releases spell this "clitimeout")
    timeout client 20000

backend tornado_servers
    # The balancing strategy
    balance roundrobin
    # The tornado servers, one per core; in this case, the machine has 4 cores
    server tornado_1 127.0.0.1:13371 check rise 2 fall 5
    server tornado_2 127.0.0.1:13372 check rise 2 fall 5
    server tornado_3 127.0.0.1:13373 check rise 2 fall 5
    server tornado_4 127.0.0.1:13374 check rise 2 fall 5
    retries 1
    mode http
    # Connect and server timeouts in ms ("contimeout" and "srvtimeout" in older releases)
    timeout connect 5000
    timeout server 20000
    # We also get stats from HA Proxy about our tornado servers
    stats enable
    stats uri /lb?stats
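
The backend needs one server line per Tornado process, each with its own name and port, so it can be handy to generate those lines from the core count. The following is only a sketch, not part of our setup: the helper name haproxy_server_lines is made up for illustration, and the base port of 13371 is simply taken from the configuration above.

```python
import os


def haproxy_server_lines(base_port=13371, cores=None):
    """Return one HAProxy `server` line per core, each with a unique name and port."""
    if cores is None:
        # Fall back to a single process if the core count cannot be determined
        cores = os.cpu_count() or 1
    return [
        "server tornado_%d 127.0.0.1:%d check rise 2 fall 5" % (i + 1, base_port + i)
        for i in range(cores)
    ]


if __name__ == "__main__":
    # Print the lines for a 4-core machine, ready to paste into the backend section
    print("\n".join(haproxy_server_lines(cores=4)))
```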

Tornado Frontends

Each of these Tornado instances provides a thin Python API layer. The implementation is both simple and very specific to our own use cases. I won’t go into the specific details, but the frontend takes care of all of the security and implements the internal API we provide to our client teams. It also handles general tasks such as deserialization, error handling, and batching requests before they hit the backend. We run one instance per core on the machine, and they all rely on the sharded redis interface to actually access the data.

Sharded Redis Interface

This is based heavily off of redis-py by Andy McCurdy, so many thanks to him. You can take a look at https://github.com/andymccurdy/redis-py/

The thing we needed to add was the ability to split our data amongst several different machines. Andy is working on a general solution for this called cluster redis, but we opted to go with something simpler in the meantime.

The first thing was to implement the actual sharding, something like:

def find_shard(key):
    hash_value = some_consistent_hash_function(key)
    return hash_value % num_machines

With that little snippet, it was pretty easy to send operations to a wrapper class of StrictRedis (look at redis-py), and just have all the tornado frontends behave as if there was a single machine serving the data. This works as long as you don’t want to use pipelines.
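
To make the idea concrete, here is a minimal sketch of such a wrapper. It is not our production code. The names ShardedClient and FakeRedis are made up for illustration, zlib.crc32 stands in for "some consistent hash function" (an assumption, chosen because Python's built-in hash() is not stable across processes), and FakeRedis is an in-memory stand-in so the example runs without a server. In practice each entry in the list would be a StrictRedis instance pointing at one machine.

```python
import zlib


def find_shard(key, num_machines):
    # crc32 gives the same value in every process, unlike the built-in hash()
    return zlib.crc32(key.encode("utf-8")) % num_machines


class ShardedClient:
    """Forwards single-key commands to the client that owns the key's shard."""

    def __init__(self, clients):
        self.clients = clients  # one client per machine

    def __getattr__(self, command):
        # Any unknown attribute is treated as a command whose first argument is the key
        def call(key, *args, **kwargs):
            client = self.clients[find_shard(key, len(self.clients))]
            return getattr(client, command)(key, *args, **kwargs)
        return call


class FakeRedis:
    """In-memory stand-in for a Redis client, supporting only set and get."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True

    def get(self, key):
        return self.data.get(key)


if __name__ == "__main__":
    db = ShardedClient([FakeRedis(), FakeRedis()])
    db.set("user:1", "alice")
    print(db.get("user:1"))  # prints alice
```

Callers only ever see db.set and db.get; which machine holds the key is decided inside the wrapper.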

However, it turns out that you really do want to use pipelines. Whenever you have multiple requests that you can send out at the same time, a pipeline saves you the roundtrip time of sending each request individually. Without pipelines, it doesn’t matter how blazingly fast redis is: you are stuck on network i/o latency.

Getting pipelines to work is a little bit more involved. When a command is added to a sharded pipeline, we record its position in arrival order and queue the command on the pipeline of the machine that owns its key. An example with two machines:

    command1 key1 value1 (key1 -> machine 1)
    command2 key2 value2 (key2 -> machine 2)
    command3 key3 value3 (key3 -> machine 1)
    command4 key4 value4 (key4 -> machine 1)

We will remember it like this:

    Pipeline index for machine 1:
        [1, 3, 4]
    Pipeline for machine 1 will contain:
        command1 key1 value1
        command3 key3 value3
        command4 key4 value4
    Pipeline index for machine 2:
        [2]
    Pipeline for machine 2 will contain:
        command2 key2 value2

When we execute all the pipelines, we use these indexes to put the return values back in the order the commands came in to the sharded_redis interface. With solutions to both the sharding and pipelines, we now have an interface that hides the fact that we actually need multiple machines to serve all the data. Note that each tornado frontend uses the interface independently, so whenever the shard layout changes we need to update all the frontends at the same time; otherwise they would disagree about which machine holds a key.
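
The bookkeeping described above can be sketched roughly as follows. This is an illustration rather than our actual implementation. ShardedPipeline and FakePipeline are hypothetical names, zlib.crc32 is an assumed choice of hash, and the indexes are 0-based here rather than 1-based as in the example. FakePipeline mimics the two redis-py pipeline behaviors the sketch relies on: commands are queued until execute() is called, and execute() returns the results as a list in queue order.

```python
import zlib


def find_shard(key, num_machines):
    return zlib.crc32(key.encode("utf-8")) % num_machines


class ShardedPipeline:
    def __init__(self, pipelines):
        self.pipelines = pipelines                # one pipeline per machine
        self.indexes = [[] for _ in pipelines]    # arrival positions per machine
        self.count = 0                            # commands queued so far

    def add(self, command, key, *args):
        shard = find_shard(key, len(self.pipelines))
        getattr(self.pipelines[shard], command)(key, *args)  # queue on that machine
        self.indexes[shard].append(self.count)               # remember arrival position
        self.count += 1
        return self

    def execute(self):
        results = [None] * self.count
        for pipeline, index in zip(self.pipelines, self.indexes):
            # Each machine answers in its own queue order; map back to arrival order
            for position, value in zip(index, pipeline.execute()):
                results[position] = value
        self.indexes = [[] for _ in self.pipelines]
        self.count = 0
        return results


class FakePipeline:
    """In-memory stand-in for a pipeline: queues set/get, runs them on execute()."""

    def __init__(self):
        self.data = {}
        self.queued = []

    def set(self, key, value):
        self.queued.append(("set", key, value))
        return self

    def get(self, key):
        self.queued.append(("get", key, None))
        return self

    def execute(self):
        out = []
        for op, key, value in self.queued:
            if op == "set":
                self.data[key] = value
                out.append(True)
            else:
                out.append(self.data.get(key))
        self.queued = []
        return out


if __name__ == "__main__":
    pipe = ShardedPipeline([FakePipeline(), FakePipeline()])
    pipe.add("set", "key1", "a").add("set", "key2", "b")
    pipe.add("get", "key1").add("get", "key2")
    print(pipe.execute())  # prints [True, True, 'a', 'b']
```

The results come back in arrival order no matter which machine each key maps to, because every result is written to the position recorded when its command was added.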

Redis Backend

Here are a few tips for setting up redis:

  1. Use a password, and make it a long password
  2. Set a memory limit and a reasonable policy to deal with exceeding max memory
  3. Change your machine overcommit_memory setting to 1
    sysctl -w vm.overcommit_memory=1
  4. Don’t run anything except redis on this machine
  5. If you are using AOF files and backup machines (recommended), don’t bother with persistence on the master! Instead, make sure you have an aggressive fsync policy (everysec works) for the slave.

For those who want the “why” behind each of the tips:
  1. From Redis Documentation:

    The password is set by the system administrator in clear text inside the redis.conf file. It should be long enough to prevent brute force attacks for two reasons:

    • Redis is very fast at serving queries. Many passwords per second can be tested by an external client.
    • The Redis password is stored inside the redis.conf file and inside the client configuration, so it does not need to be remembered by the system administrator, and thus it can be very long.

    The goal of the authentication layer is to optionally provide a layer of redundancy. If firewalling or any other system implemented to protect Redis from external attackers fail, an external client will still not be able to access the Redis instance without knowledge of the authentication password.

    Note: The AUTH command, like every other Redis command, is sent unencrypted, so it does not protect against an attacker that has enough access to the network to perform eavesdropping.

  2. We actually monitor the machine memory usage as well as the redis memory usage so that we can shard our redis backend further as needed. Even so, it’s safer to set a reasonable limit on the memory that redis may use, so that we don’t end up in a scenario where redis uses all available memory on a machine and then crashes.
  3. From Redis Documentation:

    Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can’t tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.

    Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.

  4. Because of the large memory footprint we expect redis to use and the fact that we have to use an optimistic memory allocation setting, running anything else that might use up a lot of memory on the same machine can lead to failures.
  5. This is an optimization to make sure the master Redis instance does not bottleneck because of disk writes. The work associated with persistence is offloaded as much as possible to a backup machine. That being said, it’s important that the slave/backup machine is robust.
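
As a rough illustration of tips 1, 2 and 5, the helper below renders the relevant redis.conf lines for a serving machine and for its backup. It is a sketch under stated assumptions, not our actual configuration: the function name render_redis_conf is made up, the noeviction default is only a guess at a "reasonable policy" for data that must never miss, and the address and memory limit in the example are placeholders. The directives themselves (requirepass, maxmemory, maxmemory-policy, save, appendonly, appendfsync, slaveof, masterauth) are standard redis.conf settings; newer Redis releases also accept replicaof in place of slaveof.

```python
def render_redis_conf(password, maxmemory_bytes, policy="noeviction", master=None):
    """Return redis.conf text for the serving machine, or for its backup.

    `master` is an (ip, port) tuple; passing it produces the backup (slave) config.
    """
    lines = [
        "requirepass %s" % password,          # tip 1: long password
        "maxmemory %d" % maxmemory_bytes,     # tip 2: memory limit
        "maxmemory-policy %s" % policy,       # tip 2: what to do at the limit
    ]
    if master is None:
        # Tip 5, serving machine: no RDB snapshots and no AOF
        lines += ['save ""', "appendonly no"]
    else:
        ip, port = master
        # Tip 5, backup: replicate from the master and persist with fsync every second
        lines += [
            "slaveof %s %d" % (ip, port),
            "masterauth %s" % password,
            "appendonly yes",
            "appendfsync everysec",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    secret = "x" * 64  # placeholder; use a long random password
    print(render_redis_conf(secret, 2 * 1024 ** 3))
    print()
    print(render_redis_conf(secret, 2 * 1024 ** 3, master=("10.0.0.5", 6379)))
```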

Backup

This is simply a second machine running Redis that is set as a slave to the master Redis instance. In AWS, remember to use internal IP addresses when setting this up, since it saves you money. Backups are a must when you are running redis in production, for several reasons:

  1. It’s a backup! If your machine in front goes down, you fail over to the backup while you try to fix the first machine. More often than not, when you are running on AWS you can actually just promote the backup and set up a new backup.
  2. If you ever need to expand the number of machines used for serving, you can just promote your backup to a serving machine and set up new backups for both machines. I would be remiss not to mention that you then have to go through both machines and delete the keys that no longer belong to each one, or else you won’t really have expanded your memory limit.
  3. You can run data analytics on the backup without affecting the all-important performance of the actual serving machine.
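
The cleanup mentioned in point 2 can be sketched as below. This assumes the modulo-based find_shard from the sharding section (with zlib.crc32 as a stand-in hash) and assumes that the promoted machine starts out with a full copy of its master's keys. The helper name extra_keys is hypothetical. Against a real instance you would iterate over the keys incrementally (for example with redis-py's scan_iter) and delete in batches instead of building one big list.

```python
import zlib


def find_shard(key, num_machines):
    return zlib.crc32(key.encode("utf-8")) % num_machines


def extra_keys(keys, my_shard, num_machines):
    """Keys stored on this machine that belong to a different shard in the new layout."""
    return [k for k in keys if find_shard(k, num_machines) != my_shard]


if __name__ == "__main__":
    # One serving machine becomes two: both start with the same full key set
    all_keys = ["user:%d" % i for i in range(10)]
    for shard in range(2):
        doomed = extra_keys(all_keys, shard, 2)
        print("machine %d deletes %d keys" % (shard, len(doomed)))
```

After both machines have deleted their extra keys, every key lives on exactly one of them.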

Tips for improving performance of your iOS application

Any iOS application worthy of a spot on its users’ home screens is made of three key ingredients: a great idea, stunning design, and smooth performance. In a previous post, we shared a few guidelines to make your app look pretty. Today, we have some simple tips on how to improve the performance of your iOS application. At Pulse, we obsess over every small hiccup in the application and spend countless nights staring at Instruments at the end of our release cycles. Here are some of our insights that might help you in your development process.

Downsize your image assets

Apps with good visual design always delight users. To achieve pixel perfect graphics, every iOS application ships with several image assets. It is crucial that these images are as small in size as possible. Let me elaborate with an example.

It is common practice to add a button to a nib file and set its background to point to an image. When the nib file is read from disk, iOS instantiates all the individual objects in the file, including that button. When it notices that the button’s background points to an image, it reads the image from disk, inflates it in memory and renders it as the background. The bigger the image, the longer it takes to read from disk. Since all this happens synchronously on the main thread, it slows down the app. Tip #1: Once you are satisfied with an asset, remember to always compress it to the smallest size possible, without any loss in quality, before adding it to the bundle. As a rule of thumb, I have always been able to compress icons down to at most 4 KB on disk. Check out Core Animation in Practice, Part 2 from WWDC 2010 for more info on optimizing graphics on screen.

Defer main thread operations

It goes without saying that any task that doesn’t need to be executed on the main thread should be shipped to a background thread. NSOperationQueue and Grand Central Dispatch are two great tools for such tasks. With tasks that do run on the main thread, you need to be very careful that they don’t interfere with a user’s touches. Such tasks can be roughly classified into two groups:

  • View Updates: Any changes to your views need to happen on the main thread. iOS makes it very easy to defer these changes with a simple “don’t call us, we’ll call you” rule: never call drawRect: yourself. Just call setNeedsDisplay and iOS will re-render your view during its next drawing cycle.
  • Processing: There are some critical processing tasks that cannot be performed on a background thread, like saving a Core Data database, changing in-memory state, etc. Tip #2: Group such tasks into independent chunks and execute them in the Default run loop mode. For example:
    [self performSelectorOnMainThread:@selector(processDataOnMainThread:)
                           withObject:dictionaryOfParameters
                        waitUntilDone:NO
                                modes:[NSArray arrayWithObject:NSDefaultRunLoopMode]];

While the user is scrolling a scroll view or a table view, the main run loop runs in the tracking mode (UITrackingRunLoopMode). When the user stops scrolling, it returns to the Default mode. Thus, if you make the vanilla [self processDataOnMainThread:dictionaryOfParameters] call, the method starts executing immediately, whether or not the user is scrolling. With the API call above, iOS waits for the user to stop scrolling before executing your method.

Avoid Memory Spikes

Every iOS developer dreads the ominous “Low Memory Warning”. In addition to being delivered if the app uses a lot of memory, Low Memory Warnings can also arise if the application’s memory suddenly spikes, even though the overall memory usage is quite small. If your application’s memory doesn’t go down after repeated memory warnings, iOS will kill your app! Tip #3: Always strive to keep your memory profile smooth. Some typical hot spots for memory spikes are:

  • App Launch: Load as few objects as you need. This will speed up launch and prevent memory warnings!
  • View Controller Initialization: New view controller objects are instantiated when they are pushed on the navigation stack or presented modally. Try to use as few views as possible. Or instantiate some views lazily, if you can.
  • UIWebView: UIWebView is notorious for using up a lot of memory very quickly, especially when loading HTML content with heavy images/videos. It’s hard to completely control the memory profile with a UIWebView in your application, but loading data lazily is always a good rule of thumb.

Remember, if you keep your application’s memory profile steady and consistent, it will lead a long and healthy life! Check out Advanced Memory Analysis with Instruments for more info.

Avoid unnecessary caching of images

Throughout an iOS application, we need to refer to images in the bundle. More often than not, imageNamed: is an extremely simple and efficient way to do so. But, you should be aware that imageNamed: also caches any image it imports from the bundle. Thus, it is highly efficient for images that need to be reused throughout your application (like icons, background images for buttons etc.). But it can be an unnecessary memory hog for images that are used sparingly. Tip #4: For loading such images, we should instead read them directly from disk and release the memory when we are done using the image.

NSString *path = [[NSBundle mainBundle] pathForResource:fileName ofType:fileType];
UIImage *image = [[UIImage alloc] initWithContentsOfFile:path];
// ... use the image ...
[image release]; // manual reference counting; omit this line under ARC

As a rule of thumb, use imageNamed: with images that are used in UI elements and initWithContentsOfFile: for everything else. Here is a handy category we wrote on UIImage that automatically chooses the right image for retina display screens and reads them from disk.

UImage+ImageNamedFromDisk.h
UImage+ImageNamedFromDisk.m

I hope you find these tips useful in your own development. Please share your own insights into optimizing iOS applications by leaving comments below!