---
title: Perl, Redis and AnyEvent at Craigslist
layout: post
category: perl
---
Last night I went to the
[SF.pm](http://www.meetup.com/San-Francisco-Perl-Mongers/) meetup,
hosted by Craigslist (thanks for the food!), where
[Jeremy Zawodny](https://twitter.com/jzawodn) talked about
[Redis](http://redis.io) and his module
[AnyEvent::Redis::Federated](https://metacpan.org/module/AnyEvent::Redis::Federated).
There were about 30 mongers.
I was eating at the same table as Craigslist's CTO, and he went through
some details of their infrastructure. I was surprised by the number
of places where they use Perl, and the amount of traffic they deal with.
## Redis
Jeremy started his talk by explaining their current problem: they
have hundreds of hosts in multiple data centers, and they continuously
collect dozens of metrics. They looked at MySQL to store them,
but it was too slow to keep up with their write rate. Another
important point for them is that mostly only the recent data matters.
They want to know what's going on *now*, they don't really care about
the past.
So their goal is simple: they need something fast, *really* fast, and
simple.
That's where Redis enters the game. They need data replication, but
Redis doesn't have the feature they want: there's only a master/slave
replication mechanism (so, one way), and they need a multi-master
solution, where a node becoming master does not drop data.
Because Redis is single-threaded, and their servers have multiple
cores, they start 8 processes on each node to take advantage of them.
To me, the main benefit of Redis over Memcached is that you can use
it as a data structure server. The structure they use the most is the
[*ZSET*](http://redis.io/commands#sorted_set) (sorted set). The format to store a metric is:

* key: `$time_period:$host:$metric` (where `$time_period` is
  usually a day)
* score: `$timestamp`
* value: `$timestamp:$value`
In addition to storing those metrics in the nodes, they also keep a
journal of what has changed. The journal entries look like this:

* score: `$timestamp` of the last time something changed
* value: the `$key` that changed
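
To make those two formats concrete, here is a minimal sketch of what
a write could look like, using the plain
[Redis](https://metacpan.org/module/Redis) client. The host, metric
and `journal` key names are my assumptions, not necessarily theirs:

```perl
use strict;
use warnings;
use Redis;

my $redis = Redis->new(server => 'localhost:6379');

# Hypothetical values; the real code gets these from the collectors.
my ($time_period, $host, $metric) = ('20120516', 'web42', 'cpu_load');
my ($timestamp,   $value)         = (time(), 0.73);

my $key = "$time_period:$host:$metric";

# Store the metric itself: the score is the timestamp, and the member
# embeds the timestamp so equal values at different times stay distinct.
$redis->zadd($key, $timestamp, "$timestamp:$value");

# Record the change in the journal so a syncer can find it later.
$redis->zadd('journal', $timestamp, $key);
```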
The journal is just one big structure, and it's used by their syncer
(more about that in a moment). The benefit of using ZSETs is that
they can delete old data easily by key (they don't have enough memory
to store more than a couple of days, so they need to be able to
delete by day quickly).
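
Since the day is part of the key, expiring a whole day is just
deleting its keys, while the journal itself is trimmed by score.
Continuing the sketch above (the two-day retention window is my
assumption):

```perl
# Drop an old day's data, key by key.
my $old_day      = '20120514';
my @metric_names = ('cpu_load', 'mem_used');   # hypothetical list
$redis->del("$old_day:$host:$_") for @metric_names;

# Trim journal entries older than the retention window by score.
my $cutoff = time() - 2 * 24 * 3600;
$redis->zremrangebyscore('journal', '-inf', $cutoff);
```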
The journal is used for replication. Each process has a syncer that
tracks all its peers, pulls the data from those nodes and merges it
with the local data. Earlier Jeremy mentioned that they have 8
instances on each node, so the syncer of process 1 on node A will
only check process 1 on node B.
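
I don't know what their syncer actually looks like, but from the
description its core loop could be something like this sketch (all
the names and the polling interval are mine):

```perl
use strict;
use warnings;
use Redis;

# Process 1's syncer on node A talks to process 1 on node B.
my $local = Redis->new(server => 'localhost:6379');
my $peer  = Redis->new(server => 'nodeb:6379');

my $last_sync = 0;   # in practice this would be persisted

while (1) {
    my $now = time();

    # Keys that changed on the peer since our last pass
    # ("(" makes the lower bound exclusive).
    my @changed = $peer->zrangebyscore('journal', "($last_sync", '+inf');

    for my $key (@changed) {
        # Pull the peer's entries and merge them locally; re-adding an
        # existing member is harmless since members embed the timestamp.
        my @entries = $peer->zrange($key, 0, -1, 'WITHSCORES');
        while (my ($member, $score) = splice(@entries, 0, 2)) {
            $local->zadd($key, $score, $member);
        }
    }

    $last_sync = $now;
    sleep 5;
}
```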
He also mentioned a memory optimization done by Redis (you can read
more about that [here](http://redis.io/topics/memory-optimization)).
## AnyEvent::Redis::Federated
Now, it's time to see the Perl code. `AnyEvent::Redis::Federated` is
a layer on top of `AnyEvent::Redis` that implements consistent
hashing. I guess by now everybody has given up hope of ever seeing
[redis cluster](http://redis.io/topics/cluster-spec) (and I'm more and
more convinced that it should never be implemented, and that clients
should implement their own solutions for hashing / replication).
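
If consistent hashing is new to you, here is a toy ring in Perl to
illustrate the idea; this is only a sketch of the concept, not the
module's actual implementation:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my @nodes = qw(redis_1 redis_2 redis_3);

# Place each node at many points on a 32-bit ring to smooth the
# distribution; a key belongs to the first node point at or after it.
my %ring;
for my $node (@nodes) {
    $ring{ unpack('N', md5("$node-$_")) } = $node for 1 .. 100;
}
my @points = sort { $a <=> $b } keys %ring;

sub node_for {
    my ($key) = @_;
    my $h = unpack('N', md5($key));
    for my $p (@points) { return $ring{$p} if $p >= $h }
    return $ring{ $points[0] };   # wrap around the ring
}

print node_for('20120516:web42:cpu_load'), "\n";   # e.g. "redis_2"
```

The nice property is that adding or removing a node only remaps the
keys closest to its points on the ring, instead of reshuffling
everything.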
Some of the nice features of the module:

* call chaining
* [you can get a singleton object for the connection](https://metacpan.org/module/AnyEvent::Redis::Federated#SHARED-CONNECTIONS)
* you can also use it in blocking mode
* query all nodes (you send the same command to all the nodes; can be
  useful to do sanity checks on the data)
* the client writes to one node, and lets the syncer do the job
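
Putting it together, basic usage looks roughly like this (the node
addresses are made up, and you should check the module's POD for the
exact configuration keys):

```perl
use strict;
use warnings;
use AnyEvent::Redis::Federated;

# Two hypothetical nodes; the consistent hash decides where each key goes.
my $redis = AnyEvent::Redis::Federated->new(
    nodes => {
        redis_1 => { address => 'nodea:6379' },
        redis_2 => { address => 'nodeb:6379' },
    },
);

# Writes go to whichever node owns the key.
$redis->set("metric:$_" => "value$_") for 1 .. 10;

# Reads take a callback, AnyEvent style.
$redis->get('metric:1', sub {
    my ($value) = @_;
    print "got: $value\n";
});

# Block until all outstanding requests have completed.
$redis->poll;
```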
He then showed us some code, with one really gross bit: `new
AnyEvent::Redis::Federated`. I know at least
[one person](http://search.cpan.org/perldoc?indirect) who would
probably have said something :)
The module is not yet used in production, but they've tested it
heavily, under a lot of conditions. They intend to use it soon with
some home-made dashboards to display the metrics.