Since snappy performance is critical to providing a good user experience, we try to keep the latency of all common Pulse backend API requests under 500ms. Most of the time we achieve this by using Google App Engine’s memcache to cache all data which might be reused by many requests. Less commonly requested data is pulled from the datastore, resulting in such requests taking a bit longer than we like.
When these slower requests are rare, we accept them. However, for features that access a broad range of data, the likelihood of missing the cache increases. Some data required for a request may be cached, but some will almost always not be, resulting in high latency for most requests.
To implement these types of features efficiently, one option is to dramatically increase the size of our memcache. This would allow us to keep all required data in cache. However, it would be expensive and is somewhat at odds with the LRU cache policy we like to use for other features. This approach is also currently unsupported on Google App Engine (since memcache capacity is not directly tunable).
We investigated several other options and finally settled on using Redis as a persistent, in-memory datastore. Redis strikes a great balance between simplicity, powerful primitives, and proven stability. Instead of increasing our memcache or switching entirely to a larger in-memory store, we created a second Redis-based system on AWS. This system is specifically designed to hold data that must be available at in-memory speeds (with no expected misses). Achieving this is more expensive than providing a similar LRU cache (which could be smaller), so we reserve it specifically for features that require such guarantees.
We wanted to use Redis, but also to make sure that our implementation was both scalable and easily recoverable in the case of failure. From here on out, we will discuss the infrastructure and tools we use to build this system. Here’s a visual overview of the system:
Amazon Elastic Load Balancer
This is a really nice utility that AWS gives us. We set up an ELB that points to as many EC2 machines as we need (we’ll call them redis frontends). The ELB gives us automatic round-robin balancing across those machines; it also detects failing machines, gives us a warning, and transfers the load to the machines that are still running. Some important dos:
- The load balancer can deal with HTTPS requests, so use them! Some security is always better than none.
- You should make sure that the machines you provide to the load balancer are distributed among the different availability zones that AWS offers within your region.
- You can also scale dynamically by putting instances into an Auto Scaling group and attaching the group to the load balancer.
Our redis frontend machines use Tornado as the webserver. Tornado is fast (great!) and single threaded. Being single threaded prevents many headaches, scales predictably, and has minimal overhead, but it doesn’t benefit from multiple cores on a machine. The larger Amazon machines have multiple cores, so we really want to use that to our advantage. Enter HAProxy, a nice utility that allows you to build a reverse proxy. Here’s a bare-bones version of the configuration we use:
log 127.0.0.1 local0
# We process all requests hitting port 8080
# We will point them to the backend we describe later
# The balancing strategy
# The tornado servers, in this case, the machine has 4 cores
server tornado_1 127.0.0.1:13371 check rise 2 fall 5
server tornado_2 127.0.0.1:13372 check rise 2 fall 5
server tornado_3 127.0.0.1:13373 check rise 2 fall 5
server tornado_4 127.0.0.1:13374 check rise 2 fall 5
# We also get stats from HA Proxy about our tornado servers
stats uri /lb?stats
Each of these Tornado instances provides a thin Python API layer. The implementation is both simple and very specific to our own use cases. I won’t go into the specific details, but the frontend takes care of all of the security and implements the internal API we provide to our client teams. It also handles general tasks such as deserialization, error handling, and batching requests before they hit the backend. We run enough instances to match the number of cores on the machine, and they all rely on the sharded redis interface to actually access the data.
Sharded Redis Interface
This is based heavily off of redis-py by Andy McCurdy, so many thanks to him. You can take a look at https://github.com/andymccurdy/redis-py/
The thing we needed to add was the ability to split our data amongst several different machines. Andy is working on a general solution for this called cluster redis, but we opted to go with something simpler in the meantime.
The first thing was to implement the actual sharding, something like:
def get_machine_index(key, num_machines):
    hash_value = some_consistent_hash_function(key)
    return hash_value % num_machines
With that little snippet, it was pretty easy to send operations to a wrapper class of StrictRedis (look at redis-py), and just have all the tornado frontends behave as if there was a single machine serving the data. This works as long as you don’t want to use pipelines.
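As a rough illustration of the idea, here is a minimal sketch of a sharding wrapper. It is not our production code: the names `shard_for` and `ShardedClient` are made up for this example, and `zlib.crc32` is just one possible stand-in for the hash function. The hash has to return the same value on every frontend, which rules out Python's built-in `hash()` for strings, since that is randomized per process. Each entry in `clients` is assumed to be an object with `get` and `set` methods, such as a redis-py `StrictRedis` connection.

```python
import zlib


def shard_for(key, num_machines):
    """Map a key to a machine index in the range [0, num_machines)."""
    # crc32 is stable across processes and machines (an assumption of this
    # sketch; any hash that every frontend computes identically would do).
    hash_value = zlib.crc32(key.encode("utf-8"))
    return hash_value % num_machines


class ShardedClient:
    """Hypothetical wrapper that routes each command to the machine owning the key."""

    def __init__(self, clients):
        # One client per backend machine, in a fixed order shared by all frontends.
        self.clients = clients

    def client_for(self, key):
        return self.clients[shard_for(key, len(self.clients))]

    def get(self, key):
        return self.client_for(key).get(key)

    def set(self, key, value):
        return self.client_for(key).set(key, value)
```

Because the mapping depends on the number of machines, every frontend has to be configured with the same list of machines in the same order.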
However, it turns out that you really do want to use pipelines. Whenever you have multiple requests that you can send out at the same time, a pipeline will save you the roundtrip time of the individual requests. Without pipelines, it doesn’t matter how blazingly fast redis is: you are stuck on network I/O latency.
Getting pipelines to work is a little bit more involved. When a command comes in on a pipeline, we record its position in the order it arrived and store that position alongside the pipeline for the machine that owns its key. An example with two machines:
command1 key1 value1 (key1 -> machine 1)
command2 key2 value2 (key2 -> machine 2)
command3 key3 value3 (key3 -> machine 1)
command4 key4 value4 (key4 -> machine 1)
We will remember it like this:
Pipeline index for machine 1:
[1, 3, 4]
Pipeline for machine 1 will contain:
command1 key1 value1
command3 key3 value3
command4 key4 value4
Pipeline index for machine 2:
[2]
Pipeline for machine 2 will contain:
command2 key2 value2
Now when we execute all the pipelines, we can use the stored indexes to reconstitute the return values in the order the commands came in to the sharded_redis interface. With solutions to both the sharding and pipelines, we now have an interface that hides the fact that we actually need multiple machines to serve all the data. Note that each Tornado frontend uses the interface independently, so when we make changes to it we need to update all of the frontends at the same time!
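The bookkeeping described above can be sketched as follows. This is an illustration under assumptions, not our actual implementation: `ShardedPipeline` and its `queue` method are hypothetical names, `zlib.crc32` stands in for the hash function, and `FakePipeline` is an in-memory stand-in so the example runs without a Redis server. With redis-py, each per-machine pipeline would instead come from a client's `pipeline()` call, whose `execute()` likewise returns one result per queued command, in order.

```python
import zlib


class FakePipeline:
    """In-memory stand-in for one machine's pipeline (illustration only)."""

    def __init__(self, store):
        self.store = store
        self.queued = []

    def set(self, key, value):
        self.queued.append(("set", key, value))

    def get(self, key):
        self.queued.append(("get", key))

    def execute(self):
        # Run the queued commands in order and return one result per command.
        results = []
        for command, *args in self.queued:
            if command == "set":
                self.store[args[0]] = args[1]
                results.append(True)
            else:
                results.append(self.store.get(args[0]))
        self.queued = []
        return results


class ShardedPipeline:
    """Hypothetical pipeline that fans commands out to per-machine pipelines."""

    def __init__(self, pipelines):
        self.pipelines = pipelines
        self.indexes = [[] for _ in pipelines]  # arrival positions per machine
        self.count = 0  # total number of commands queued so far

    def _shard_for(self, key):
        return zlib.crc32(key.encode("utf-8")) % len(self.pipelines)

    def queue(self, command, key, *args):
        shard = self._shard_for(key)
        # Remember where this command sits in the overall arrival order.
        self.indexes[shard].append(self.count)
        self.count += 1
        getattr(self.pipelines[shard], command)(key, *args)

    def execute(self):
        results = [None] * self.count
        for pipeline, index in zip(self.pipelines, self.indexes):
            # Put each machine's results back at their original positions.
            for position, value in zip(index, pipeline.execute()):
                results[position] = value
        self.indexes = [[] for _ in self.pipelines]
        self.count = 0
        return results
```

The caller gets back a single list of results in the order the commands were queued, regardless of which machine each key landed on.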
Here are a few tips for setting up redis:
- Use a password, and make it a long password
- Set a memory limit and a reasonable policy to deal with exceeding max memory
- Change your machine overcommit_memory setting to 1
sysctl -w vm.overcommit_memory=1
- Don’t run anything except redis on this machine
- If you are using AOF files and backup machines (recommended), don’t bother with persistence on the master! Instead, make sure you have an aggressive fsync policy (everysec works) on the slave.
- On passwords, from the Redis documentation:
The password is set by the system administrator in clear text inside the redis.conf file. It should be long enough to prevent brute force attacks for two reasons:
- Redis is very fast at serving queries. Many passwords per second can be tested by an external client.
- The Redis password is stored inside the redis.conf file and inside the client configuration, so it does not need to be remembered by the system administrator, and thus it can be very long.
The goal of the authentication layer is to optionally provide a layer of redundancy. If firewalling or any other system implemented to protect Redis from external attackers fail, an external client will still not be able to access the Redis instance without knowledge of the authentication password.
Note: The AUTH command, like every other Redis command, is sent unencrypted, so it does not protect against an attacker that has enough access to the network to perform eavesdropping.
- On memory limits: we monitor the machine’s memory usage as well as redis’s own memory usage so that we can shard our redis backend further as needed. Even so, it’s safer to set a reasonable limit on the memory that redis can use, so that we don’t end up in a scenario where redis uses all available memory on a machine and then crashes.
- On overcommit_memory, from the Redis documentation:
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can’t tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
- Because of the large memory footprint we expect redis to use and the fact that we have to use an optimistic memory allocation setting, running anything else that might use up a lot of memory on the same machine can lead to failures.
- Skipping persistence on the master is an optimization to make sure the master Redis instance does not bottleneck because of disk writes. The work associated with persistence is offloaded as much as possible to a backup machine. That being said, it’s important that the slave/backup machine is robust.
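The memory monitoring mentioned in the tips above could look something like the following sketch. The function name and the 0.8 threshold are hypothetical, chosen only for illustration. `info` is assumed to be the dictionary returned by redis-py's `info()` call, whose `used_memory` field reports the number of bytes Redis has allocated.

```python
def needs_more_shards(info, max_memory_bytes, threshold=0.8):
    """Return True once Redis memory use reaches a fraction of the configured limit.

    The default threshold of 0.8 is an arbitrary value for this sketch.
    """
    return info["used_memory"] >= threshold * max_memory_bytes
```

A check like this only tells you when to start planning for more shards; the memory limit configured in Redis itself remains the safety net.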
This is simply a second machine running Redis that is set as a slave to the master Redis instance. In AWS, remember to use internal IP addresses when setting this up, since it saves you money. Backups are a must when you are running redis in production for several reasons:
- It’s a backup! If your machine in front goes down, you fail over to the backup while you try to fix the first machine. More often than not, when you are running on AWS you can just promote the backup and set up a new backup.
- If you ever need to expand the number of machines used for serving, you can just promote your backup to a serving machine and set up new backups for both machines. I would be remiss not to mention that you do have to then go through both machines to delete the extra keys later, or else you really won’t have expanded your memory limit.
- You can run data analytics on the backup without affecting the all-important performance of the actual serving machine.
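To make the key cleanup after promoting a backup more concrete, here is a minimal sketch. It assumes the simple modulo sharding described earlier, with `zlib.crc32` as a stand-in hash; `shard_for` and `keys_to_delete` are hypothetical names. After the promotion, the old serving machine and the promoted backup both hold the full key set, and each one should drop the keys that now hash to the other.

```python
import zlib


def shard_for(key, num_machines):
    """Map a key to a machine index using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_machines


def keys_to_delete(keys, machine_index, num_machines):
    """Return the keys that no longer belong on the given machine."""
    return [key for key in keys if shard_for(key, num_machines) != machine_index]
```

When one machine is split into two, every key ends up in exactly one of the two delete lists, so between them the machines keep a single copy of each key. The actual deletion would then be issued against each machine separately.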