path: root/posts/2013-01-28-let-s-talk-about-graphite.org
Diffstat (limited to '')
-rw-r--r-- posts/2013-01-28-let-s-talk-about-graphite.org | 184
1 files changed, 0 insertions, 184 deletions
diff --git a/posts/2013-01-28-let-s-talk-about-graphite.org b/posts/2013-01-28-let-s-talk-about-graphite.org
deleted file mode 100644
index b0e96d7..0000000
--- a/posts/2013-01-28-let-s-talk-about-graphite.org
+++ /dev/null
@@ -1,184 +0,0 @@
-As I've already mentioned in a [[/carbons-manhole/][previous post]], at
-[[http://saymedia.com][$work]] we are currently deploying Graphite and
-the usual suspects.
-
-Finding articles on how to install all these tools is easy: there are
-plenty of them. What's /really/ hard to find is stories on /how to
-use them/: what's collected, how, why, how you organize your metrics,
-whether you rewrite them, etc.
-
-What I want with this post is to start a discussion about usages,
-patterns, and good practices. I'm going to share how /we/, at SAY, are
-using them, and maybe this could be the start of a conversation.
-
-** Graphite/collectd/statsd
-
-We're using [[https://github.com/etsy/statsd][statsd]],
-[[https://collectd.org][collectd]], and
-[[https://github.com/graphite-project][Graphite]] (and soon
-[[http://riemann.io][Riemann]] for alerting).
-
-*** Retention
-
-Our default retention policy is: =10s:1h, 1m:7d, 15m:30d, 1h:2y=. We
-don't believe that Graphite should be used for alerting: it's a tool for
-looking at history and trends.
-
-360 points for the last hour is enough to refer to a graph when an
-incident occurs. Most of the teams release at least once a week,
-so 1-minute resolution for a week is enough to compare trends between
-two releases. Then we go to a month (2 sprints) and then 2 years. At
-first we thought we'd keep only 15 months (1 year + 1 quarter to compare),
-but since we have enough disk space, we decided to keep two years.
-We might change that in the future.
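
To make the retention numbers concrete, here is a small sketch that counts how many points each archive holds. The helper names are made up for this example, and it assumes a year is counted as 365 days.

```python
import re

# Seconds per unit; "y" is assumed to mean a 365-day year.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def to_seconds(spec):
    """Convert a spec such as '10s' or '2y' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdy])", spec.strip())
    if match is None:
        raise ValueError(f"unsupported time spec: {spec!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def points_per_archive(retention):
    """Return (archive, number of points) for each 'precision:duration' pair."""
    result = []
    for archive in retention.split(","):
        archive = archive.strip()
        precision, duration = archive.split(":")
        result.append((archive, to_seconds(duration) // to_seconds(precision)))
    return result


POLICY = "10s:1h, 1m:7d, 15m:30d, 1h:2y"

for archive, points in points_per_archive(POLICY):
    print(f"{archive:>8} -> {points} points")
```

The first archive is the 360 points mentioned above; the other three come out at 10080, 2880 and 17520 points.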
-
-*** System
-
-We don't do anything fancy here. =collectd= runs on each host and
-writes to a central =collectd= server.
-
-*** Services
-
-For shared services (Memcached, Varnish, Apache, etc), we talk to
-=statsd=. We have a Perl script named =sigmoid=, with the following
-usage:
-
-#+BEGIN_SRC sh
- Usage: ./sigmoid [<options>] <metricname> <value>
- Exactly one of:
- --counter (value defaults to 1)
- --aggregate
- --gauge
- --event
- --raw
-
- Other options:
- --disable-multiplex (for statsd only)
- --appname
- --hostname (to log on behalf of another machine)
-#+END_SRC
-
-This script is used by other scripts that monitor logs, the status of apps,
-etc. This way it's very easy for a Perl, Python, or shell script to just
-call =sigmoid= via =system=, which sends the metric and the value to
-=statsd=.
-
-For some other services we might need something more specific. Let's
-take a look at Apache. We have another Perl script for the *CustomLog*
-settings (=CustomLog "|/usr/local/bin/apache-statsd"=). The script is
-doing the following things:
-
-- compute the size of the HTTP request in bytes
-- compute how long it took to return the response
-
-Then, it will send the following lines to =statsd= (with $base being the
-vhost in our case):
-
-- =$base.all.requests:1|c= increments the total number of HTTP requests
- we're receiving
-- =$base.all.bytes:$bytes|ms= sends the size, in bytes, of that request
-- =$base.all.time:$msec|ms= sends the time spent producing the response
-
-We then send the same three lines two more times, each with a different prefix:
-=$base.method.$request_method= /and/ =$base.status.$status=.
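
As an illustration, the nine lines produced for a single request could be built like this. It's a sketch in Python, not our actual Perl script: the function name and the sample values are invented for the example.

```python
def apache_statsd_lines(base, request_method, status, nbytes, msec):
    """Build the statsd lines for one HTTP request (illustrative sketch).

    The same three metrics are emitted under three prefixes:
    all requests, per HTTP method, and per status code.
    """
    prefixes = [
        f"{base}.all",
        f"{base}.method.{request_method}",
        f"{base}.status.{status}",
    ]
    lines = []
    for prefix in prefixes:
        lines.append(f"{prefix}.requests:1|c")       # counter
        lines.append(f"{prefix}.bytes:{nbytes}|ms")  # size, sent as a timer
        lines.append(f"{prefix}.time:{msec}|ms")     # response time
    return lines


# Hypothetical request: GET on the "blog" vhost, 200, 5120 bytes, 42 ms.
for line in apache_statsd_lines("blog", "GET", 200, 5120, 42):
    print(line)
```

Each line would then be sent to statsd as its own UDP datagram, or several joined with newlines.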
-
-*** Applications
-
-Here, developers decide what they want to collect, and send the metric
-to =statsd=.
-
-*** Events
-
-And finally we have events. Every time we push an application or a
-configuration, we create a new event.
-
-** Proxying statsd
-
-We want metrics to be well organized, in a clear hierarchy.
-[[https://github.com/obfuscurity][Jason Dixon]] wrote in a
-[[http://obfuscurity.com/2012/05/Organizing-Your-Graphite-Metrics][blog
-post]] that /Misaligned paths are ok/. I disagree. We're collecting more
-than 100k metrics so far. If things are not well organized, it will
-quickly become very difficult to find what you're looking for.
-
-So, here's how we organize our metrics. The first level is the
-environment (PROD, CI, DEV, ...). Then we have /apps/ and /hosts/. For
-the /host/ section, we group by cluster type (Hadoop cluster, Web
-servers for TypePad, etc), and then you have the actual host, with all
-the metrics collected. For /apps/, we have four main categories:
-/aggregate/, /counters/, /events/ and /gauges/ (I'll come back to that
-later).
-
-Earlier I said that apps were sending metrics to =statsd=, but that's
-not exactly true. We (mostly) never write directly to statsd or
-Graphite.
-
-On each host, we have a Perl script listening. This proxy rewrites
-all the incoming metrics by prepending the environment, the
-cluster and so on to the name. This way, when someone wants to send a key, they don't
-have to care about conventions or about using the correct prefix.
-
-It will also multiplex the metric: we want the same key to end up
-under /host/ and under /app/. Let's take an example here. If you're
-writing a web service, you may want to send a metric for the total time
-taken by an endpoint (this will be an aggregate). Our key will be
-something like:
-*<application-name>.<endpoint-name>.<http-method>.<total-time>*. The
-proxy, based on the network address, will determine that its
-environment is CI, and that it's an application. But it also knows the
-name of the server, and the cluster. So two keys will be created:
-
-- *<CI>.<apps>.<aggregate>.<application-name>.<endpoint-name>.<http-method>.<total-time>*
-- *<CI>.<hosts>.<cluster-name>.<host-name>.<aggregate>.<application-name>.<endpoint-name>.<http-method>.<total-time>*
-
-This way we can find the metric aggregated by application, or, if we
-think there's a problem on one machine, we can compare the same
-metric per host.
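
The multiplexing step can be sketched in a few lines. This is not the real proxy (which is a Perl script and looks the environment up from the network address): the function and the sample names below are invented, and the environment, cluster and host are simply passed in.

```python
def multiplex(key, kind, environment, cluster, host):
    """Return the two Graphite keys the proxy would create for one metric.

    key  -- what the application sent, e.g. "myapp.search.GET.total-time"
    kind -- the category: "aggregate", "counters", "events" or "gauges"
    """
    app_key = ".".join([environment, "apps", kind, key])
    host_key = ".".join([environment, "hosts", cluster, host, kind, key])
    return [app_key, host_key]


# Hypothetical application, cluster and host names.
for full_key in multiplex("myapp.search.GET.total-time", "aggregate",
                          "CI", "web", "web01"):
    print(full_key)
```

The application only ever sends the short key; the prefixing lives in one place, which is what keeps the hierarchy consistent.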
-
-** Other problems with statsd and Graphite
-
-I don't know if it's a problem with vocabulary, or our maths (I admit
-that my maths are not good, but I trust Abe and Hachi's maths), but you
-can't imagine how much time we spend debating the words gauges,
-counters and aggregates: what they mean, how they work, when to use
-them. So here are my questions: are we missing something obvious? Are we
-overthinking it? Or is it genuinely confusing, and people are misusing them?
-
-Let's take *gauge* as an example. If you read
-[[https://github.com/etsy/statsd/blob/master/README.md#gauges][the
-documentation for gauges]], it seems very simple: you send a value, and
-it will be recorded. Well, the thing is that it will record only the last
-value sent during the 10-second interval. This works well when you have
-a cron job that looks at something every minute and reports a metric
-to =statsd=, but not if you're sending that metric 10 times a second (and yes, we
-will provide a patch for the documentation soon).
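
Here is a toy model of that behaviour. It is not statsd's code, only a simulation of "the last value in each flush interval wins", with a made-up function name and the 10-second interval assumed as the default.

```python
def flush_gauge(samples, flush_interval=10):
    """Simulate a gauge: keep only the last value seen in each interval.

    samples -- list of (timestamp_in_seconds, value) pairs
    Returns {interval_start: last_value}.
    """
    flushed = {}
    for timestamp, value in sorted(samples, key=lambda s: s[0]):
        bucket = int(timestamp // flush_interval) * flush_interval
        flushed[bucket] = value  # later samples overwrite earlier ones
    return flushed


# 10 samples per second for 10 seconds: 100 values sent, 1 value recorded.
samples = [(i / 10, i) for i in range(100)]
print(len(samples), "samples ->", flush_gauge(samples))
```

Of the 100 values sent, only the last one (99) survives the flush. If what you need is a distribution over the interval, a timer is probably the better fit.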
-
-Another one where we lost a good amount of time: if your smallest
-retention is different from the interval used by statsd to flush the
-data, the data will be graphed incorrectly (see this
-[[https://github.com/etsy/statsd/issues/32#issuecomment-1830985][comment]]).
-
-The best "documentation" for =statsd=, so far, is the discussions in
-the [[https://github.com/etsy/statsd/issues][issues]].
-
-We have some other complaints about Graphite. Even after reading the
-[[http://graphite.wikidot.com/whisper#toc1][rationale]] for Whisper, I'm
-not convinced it was a good idea to replace RRD with it. We also
-discovered some issues with
-[[http://if.andonlyif.net/blog/2013/01/graphites-derivative-function-lies.html][Graphite's
-functions]].
-
-** Meetup
-
-We have a huge basement at work that can be used to host meetups. There
-are already a few meetups in San Francisco about "devops" stuff
-([[http://www.meetup.com/San-Francisco-Metrics-Meetup/events/98875712/][Metrics
-Meetup]], [[http://www.meetup.com/San-Francisco-DevOps/][SF DevOps]],
-etc), but maybe there's room for another one with a different format.
-
-What I would like is a kind of forum, where a topic is picked, and
-people share their /experiences/ (the bad, the good and the ugly), not
-how to configure or deploy something. And there are a lot of topics
-I have questions about: deployment (this will be the topic of my next entry, I
-think), monitoring, alerting, post-mortems, etc. If you're interested,
-send me an email, or drop a comment on this post.