Analytics is ideally a combination of real-time and batch processing. Batch processing, with something like Hadoop, is great for digging into large amounts of past data and asking questions that cannot be anticipated. However, when it is known ahead of time that certain aggregates will be required, the best solution is often to count each event as it happens. Most analytics dashboards are backed by this kind of real-time data.
Nine months ago, Pulse was just starting to experiment with real-time event counting. We didn’t have much server infrastructure yet and were using Google AppEngine to host a few simple APIs. We started reading about the various ways to implement counters on appengine and came across two frequently recommended solutions.
Sharded counters was the first approach we tried. To compensate for our write-heavy counter workload, we split each counter into several datastore entities. This allowed writes to be parallelized and avoided the single entity write limits of the AppEngine datastore. To read a single counter value, we queried all shards and summed up the values. This worked well, but required hitting the datastore on every request. Our tests still showed unacceptably high latencies. More shards sped things up a bit, but performance was always bottlenecked on the datastore.
To avoid datastore latencies, counting in memory seemed like an obvious solution. Given a distributed key-value store like Memcached, counting in memory should be quite scalable, while improving both read and write performance. Of course, memcache data is vulnerable to loss when a server goes down, as well as being subject to eviction if available memory runs short. Unfortunately, the eviction problem is amplified in shared environments where it is always possible to be evicted by memory pressure from another app. While we were willing to accept some risk of data loss, our tests showed the probability was too high on AppEngine’s memcache.
Implementing Write-behind Counters
Livecount was developed to leverage the performance of AppEngine’s memcache, while making an effort to maintain the durability of counts. AppEngine’s task queues turned out to be perfect for the job. Each time a count is updated (or optionally when a multiple is reached), Livecount creates a worker task to write that count from the memcache to the datastore in the background. If the count is ever evicted, it is reloaded from the datastore on the next read or write.
Since counter updates are usually written back to the datastore within seconds, the risk of loss is minimal. Write performance is excellent, since only the memcache must be updated before completing a request. Most reads can also be served from the memcache. Load on the datastore is further reduced by storing a dirty flag along with each memcached count. If more increment events come in than can be written back in real time, only one write is needed to update the datastore with the latest count. After a successful write, the dirty bit is cleared and the other backlogged write tasks for that counter are skipped.
This simple solution has allowed Pulse’s backend to easily scale to counting hundreds of events per second, with minimal cost and complexity. Livecount’s API requires nothing more than a simple string counter name.
For more advanced use-cases, namespaces are supported for keep counters organized and easy to query. Recently, we also added support for time period fields to help support hourly/daily/weekly/monthly aggregates. Here’s a more advanced example.
load_and_increment_counter(name=url, period=datetime.now(), \
period_types=[PeriodType.DAY, PeriodType.WEEK], namespace="starred", delta=1)
Livecount is open-source and easily deployable on Google AppEngine. If you have something to count, give Livecount a try. We’d love to hear your thoughts or suggestions for improvement!