As I've already mentioned in a [[/carbons-manhole/][previous post]], at
[[http://saymedia.com][$work]] we are currently deploying Graphite and
the usual suspects.

Finding articles on how to install all these tools is easy; there are
plenty of them. But what's /really/ hard to find are stories on /how to
use them/: what's collected, how, why, how do you organize your metrics,
do you rewrite them, etc.

What I want with this post is to start a discussion about usages,
patterns, and good practices. I'm going to share how /we/, at SAY, are
using them, and maybe this could be the start of a conversation.

** Graphite/collectd/statsd

We're using [[https://github.com/etsy/statsd][statsd]],
[[https://collectd.org][collectd]], and
[[https://github.com/graphite-project][Graphite]] (and soon
[[http://riemann.io][Riemann]] for alerting).

*** Retention

Our default retention policy is: =10s:1h, 1m:7d, 15m:30d, 1h:2y=. We
don't believe that Graphite should be used for alerting: it's a tool for
looking at history and trends.

360 points for the last hour is enough to refer to a graph when an
incident occurs. Most of the teams are releasing at least once a week,
so a 1-minute definition for a week is enough to compare trends between
two releases. Then we go to a month (2 sprints), and then 2 years. We
thought at first about keeping only 15 months (1 year + 1 quarter to
compare), but since we have enough disk space, we decided to keep two
years; we might change that in the future.

*** System

We don't do anything fancy here. =collectd= runs on each host, and
writes to a central =collectd= server.

*** Services

For shared services (Memcached, Varnish, Apache, etc.), we talk to
=statsd=. We have a Perl script named =sigmoid=, with the following
usage:

#+BEGIN_SRC sh
  Usage: ./sigmoid [<options>] <metricname> <value>
  Exactly one of:
  --counter (value defaults to 1)
  --aggregate
  --gauge
  --event
  --raw

  Other options:
  --disable-multiplex (for statsd only)
  --appname
  --hostname (to log on behalf of another machine)
#+END_SRC

This script is used by other scripts that monitor logs, the status of
apps, etc. This way it's very easy for a Perl, Python, or shell script
to call =sigmoid= via =system= and send the metric and its value to
=statsd=.

For some other services we might need something more specific. Let's
take a look at Apache. We have another Perl script for the *CustomLog*
setting (=CustomLog "|/usr/local/bin/apache-statsd"=). The script does
the following:

- compute the size of the HTTP request in bytes
- compute how long it took to return the response

Then it sends the following lines to =statsd= (with $base being the
vhost in our case):

- =$base.all.requests:1|c= increases the total of HTTP requests we're
  receiving
- =$base.all.bytes:$bytes|ms= sends the size, in bytes, of that request
- =$base.all.time:$msec|ms= sends the time spent returning the response

We then send the same lines two more times, with a different prefix:
=$base.method.$request_method= /and/ =$base.status.$status=.
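
To make this concrete, here is a rough sketch of what that traffic can
look like on one host. The =sigmoid= calls are purely illustrative (the
metric names, the =--appname= values and the exact flag behaviour are
assumptions based on the usage above), and the =nc= one-liner only shows
the kind of raw lines =apache-statsd= ends up sending, assuming the
collector listens on statsd's default UDP port 8125.

#+BEGIN_SRC sh
  # Hypothetical calls from a cron job or a log watcher; the value
  # defaults to 1 for --counter, as described in the usage above.
  ./sigmoid --counter --appname varnish cache.miss
  ./sigmoid --gauge --appname memcached curr_connections 1042

  # Roughly the raw statsd lines apache-statsd would emit for a single
  # request to the "www" vhost (host name and port are assumptions):
  printf 'www.all.requests:1|c\nwww.all.bytes:5123|ms\nwww.all.time:87|ms\n' \
      | nc -u -w1 statsd.example.com 8125
#+END_SRC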

*** Applications

Here, developers decide what they want to collect, and send the metrics
to =statsd=.

*** Events

And finally we have events. Every time we push an application or a
configuration change, we create a new event.

** Proxying statsd

We want metrics to be well organized, in a clear hierarchy.
[[https://github.com/obfuscurity][Jason Dixon]] wrote in a
[[http://obfuscurity.com/2012/05/Organizing-Your-Graphite-Metrics][blog
post]] that /Misaligned paths are ok/. I disagree. We're collecting more
than 100k metrics so far; if things are not well organized, it quickly
becomes very difficult to find what you're looking for.

So, here's how we organize our metrics. The first level is the
environment (PROD, CI, DEV, ...). Then we have /apps/ and /hosts/. For
the /hosts/ section, we group by cluster type (Hadoop cluster, web
servers for TypePad, etc.), and then you have the actual host, with all
the metrics collected. For /apps/, we have four main categories:
/aggregate/, /counters/, /events/ and /gauges/ (I'll come back to that
later).

Earlier I said that apps were sending metrics to =statsd=, but that's
not exactly true. We (mostly) never write directly to statsd or
Graphite.

On each host, we have a Perl script listening. This proxy rewrites all
the incoming metrics by adding the environment, the cluster, and so on
to the name. This way, when someone wants to send a key, they don't
have to care about the naming convention or use the correct prefix.

It will also multiplex the metric: we want the same key to end up under
/hosts/ and under /apps/. Let's take an example. If you're writing a web
service, you may want to send a metric for the total time taken by an
endpoint (this will be an aggregate). Our key will be something like:
*<application-name>.<endpoint-name>.<http-method>.<total-time>*. The
proxy, based on the network address, will determine that its
environment is CI, and that it's an application. But it also knows the
name of the server, and the cluster. So two keys will be created:

- *<CI>.<apps>.<aggregate>.<application-name>.<endpoint-name>.<http-method>.<total-time>*
- *<CI>.<hosts>.<cluster-name>.<host-name>.<aggregate>.<application-name>.<endpoint-name>.<http-method>.<total-time>*

This way we can find the metric aggregated by application, or, if we
think there's a problem with one machine, we can compare the same metric
host by host.

** Other problems with statsd and Graphite

I don't know if it's a problem with vocabulary, or our maths (I admit
that my maths are not good, but I trust Abe and Hachi's maths), but you
can't imagine how much time we've spent debating the words gauges,
counters and aggregates: what they mean, how they work, when to use
them. So here are my questions: are we missing something obvious? Do we
overthink it? Or is it genuinely confusing, and people are misusing
them?

Let's take *gauge* as an example. If you read
[[https://github.com/etsy/statsd/blob/master/README.md#gauges][the
documentation for gauges]], it seems very simple: you send a value, and
it will be recorded. Well, the thing is, it only records the last value
sent during the 10-second interval. This works well when you have a cron
job that looks at something every minute and reports a metric to
=statsd=, not if you're sending that 10 times a second (and yes, we will
provide a patch for the documentation soon).

Another one where we lost a good amount of time: if your smallest
retention is different from the interval used by statsd to flush the
data, the metrics will be graphed incorrectly (see this
[[https://github.com/etsy/statsd/issues/32#issuecomment-1830985][comment]]).
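
For reference, here is a minimal sketch of how the two sides might be
kept in line. The path is Graphite's default layout and the statsd
setting lives in its =config.js=; treat this as an illustration of "the
flush interval must equal the smallest retention", not as our exact
configuration.

#+BEGIN_SRC conf
  # /opt/graphite/conf/storage-schemas.conf -- carbon side
  [default]
  pattern = .*
  retentions = 10s:1h,1m:7d,15m:30d,1h:2y

  # statsd config.js -- flushInterval is in milliseconds and has to
  # match the smallest retention above (10s), otherwise the points
  # end up misaligned when graphed:
  #   { flushInterval: 10000 }
#+END_SRC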

So far, the best "documentation" for =statsd= is the discussions in the
[[https://github.com/etsy/statsd/issues][issues]].

We have some other complaints about Graphite. Even after reading the
[[http://graphite.wikidot.com/whisper#toc1][rationale]] for Whisper, I'm
not convinced it was a good idea to replace RRD with it. We also
discovered some issues with
[[http://if.andonlyif.net/blog/2013/01/graphites-derivative-function-lies.html][Graphite's
functions]].

** Meetup

We have a huge basement at work that can be used to host meetups. There
are already a few meetups in San Francisco about "devops" stuff
([[http://www.meetup.com/San-Francisco-Metrics-Meetup/events/98875712/][Metrics
Meetup]], [[http://www.meetup.com/San-Francisco-DevOps/][SF DevOps]],
etc.), but maybe there's room for another one with a different format.

What I would like is a kind of forum, where a topic is picked and people
share their /experiences/ (the bad, the good and the ugly), not how to
configure or deploy something. And there are a lot of topics where I
have questions: deployment (this will be the topic of my next entry, I
think), monitoring, alerting, post-mortems, etc. If you're interested,
send me an email or drop a comment on this post.
