Pillars of the Rails Monitoring Stack: 2016 Edition

December 02 | By Derek | Posted in HowTo

Curious about the tools a monitoring company like us uses to monitor our own Rails apps? Here's a behind-the-scenes rundown of how we ensure our apps are in peak condition heading into 2016.

Stop searching for a single tool

There's no single, do-everything tool that completely monitors a modern-day Rails stack. If there were, it'd be the software equivalent of the Homer Simpson-designed car. There are simply too many specialized things to put into a single monitoring app.

However, there's good news: a number of specialized apps play well together to give you great monitoring coverage of your Rails apps and infrastructure.

When picking a monitoring solution, you can typically choose between two options:

  1. Open Source
  2. Paid Hosted Service

The upsides of open source: free to install and more customizable. The downsides: generally more difficult to use and fairly complex to maintain.

Most of the monitoring services we use are paid services. We typically only use open source options when the paid, hosted option is cost-prohibitive. Monitoring software is complicated, and keeping your own stack running can be a time sink. The last thing we want is unreliable software monitoring our apps.

Covering your Blind Spots

Here are the primary areas we monitor:

  • Uptime - are key controller-actions reachable from around the globe?
  • Application Monitoring - when performance goes bad, dive into the line of code causing the issue.
  • Log Monitoring - aggregate logs across app servers.
  • Server & Service Monitoring - server resource usage and ensuring services are running as expected.
  • Exception Monitoring - aggregate, view, and close exceptions.
  • Custom Metrics - track key performance indicators that are specific to your app.
  • Scheduled Job Monitoring - ensure jobs that are scheduled to run actually run.

I'll cover each area in detail below.

Uptime Monitoring

This is the basic building block of monitoring. Whether you are hosting a personal blog or are on the stability team at Facebook, you need this. Uptime monitoring tells you if your app is down, but gives no details beyond that.

Open Source

I'm not aware of a good open source option for this. It's also not an area I'd spend a lot of time investigating: running a geographically distributed network of servers to monitor uptime would be complicated. Paid options are very affordable.

Paid

We use Pingdom, the Kleenex of uptime monitoring. Pingdom starts at $15/mo for 10 checks. We've found the UI to be a bit heavyweight for our needs, but the service has been nothing but reliable over the years. While there are many other options, I've been hesitant to swap out Pingdom for anything else for this basic building block.

Implementation Notes

We check two primary controller-actions in each of our Rails apps:

  • General health-check (run a query against each of our database systems)
  • Test our metric checkin endpoint with a fake payload
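A general health check along these lines can be sketched in plain Ruby. This is a minimal illustration, not our production code: `health_report` and the check names are hypothetical, and each check is assumed to be a callable that raises when its backing system is unreachable.

```ruby
# Runs every named check and collects the ones that raise.
# Returns an HTTP-style status the uptime checker can act on:
# 200 when every check passes, 503 when at least one fails.
def health_report(checks)
  failures = {}
  checks.each do |name, check|
    begin
      check.call
    rescue StandardError => e
      # Record the failure instead of aborting, so one report
      # covers every database system.
      failures[name] = e.message
    end
  end
  { status: failures.empty? ? 200 : 503, failures: failures }
end

# Example with stand-in checks (hypothetical names).
report = health_report({
  primary_db: -> { :ok },
  metrics_db: -> { raise "connection refused" }
})
puts report.inspect
```

In a Rails app, a check could be something like `-> { ActiveRecord::Base.connection.execute("SELECT 1") }`, with the controller action rendering the returned status so the uptime service sees a non-200 response when any database is unreachable.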

Application Monitoring

When it comes to tracking down performance issues, application monitoring gives you the most value with the least effort. Finding an application monitoring tool can be confusing - lots of tools say they are application monitoring.

So, let's define application monitoring: application monitoring is the ability to point to a line of code when there is a performance problem. With that definition, there's a much narrower scope.

Open Source

We don't know of a widely-used open source solution for Rails app monitoring.

Paid

We use our own solution (Scout). The mainstay is New Relic. Scout is $59/mo/server, New Relic $199/mo/server. Scout also offers per-request pricing, starting at $20 for 1M requests.

Implementation Notes

On larger teams, this is typically most used by developers as it ties directly to code they have written. Folks on the DevOps side are more concerned with higher-level performance metrics than application code.

Choosing between Scout and New Relic? Scout has a laser focus on application code, while New Relic is more of a full-stack solution. We've chosen to focus on application code because these are the most time-consuming performance issues to track down and full-stack monitoring is becoming less relevant. More background here.

Log Monitoring

Logs are the lowest common denominator of monitoring.

In most modern setups, you are likely using multiple application servers served behind a load balancer. This means if there is an issue you need to track down, you'd need to find it on the right server. To solve this problem, send your logs to a central service.

Open Source

There are a couple of options, but we use the ELK stack (Elasticsearch, Logstash, and Kibana).

Paid

Splunk is the large incumbent, but lots of our customers are really happy with Papertrail.

Implementation Notes

Both developers and the devops team will likely use this tool, so it's important that both are comfortable with it. Our customers tell us Papertrail is the easiest option if you're getting going with log monitoring.

This is the only part of our monitoring stack that is open source, solely because our bill using a paid service would eclipse our hosting bill. A monitoring application like ours does a lot of logging.

We use the Lograge gem in our Rails apps to generate more readable, structured log files.
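Lograge is switched on with `config.lograge.enabled = true` in the Rails environment config, and it condenses each request into a single line of key=value pairs. The plain-Ruby sketch below only illustrates that output shape. It is not Lograge's implementation, and `key_value_line` is a hypothetical helper.

```ruby
# Formats a hash of request fields as one "key=value ..." line,
# the shape that makes logs easy to grep and to parse centrally.
def key_value_line(fields)
  fields.map { |key, value| "#{key}=#{value}" }.join(" ")
end

puts key_value_line({ method: "GET", path: "/servers", status: 200, duration: 58.33 })
```

One line per request means a log aggregator can index fields such as `status` or `duration` without stitching multi-line entries back together.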

Sidenote: logging is a big topic. We'll provide some tips next week to get the most out of your logs. Subscribe to our email newsletter to get these tips to your inbox.

Server & Service Monitoring

Ensure the servers hosting your app and the services running on them are behaving. There are a number of options in the server monitoring space - probably more than any other area.

Common use cases of server monitoring:

  • Correlate database disk utilization against Apache response times
  • Ensure replication is running between database servers
  • Track Apache/Nginx throughput and response times

Open Source

The standby is Nagios. However, of all the open source monitoring solutions I've mentioned in this post, I'd say Nagios is the most difficult to install, use, and maintain. If you go the open source path, I'd suggest trying Sensu.

Paid

Paid services often combine a couple of useful services. For example, when you use Scout, you also get StatsD and AWS monitoring. With many open source tools, you need to combine several unrelated pieces of software to do charts, monitoring, alerting, and custom metrics. This makes monitoring more brittle.

Besides Scout, Datadog is another common option. Pricing is around $15/server/mo.

Implementation Notes

On larger teams, this is most frequently used by devops. For smaller teams, it's important that your developers are comfortable with the server monitoring tool as well (some server monitoring tools have very poor user experiences).

Exception Monitoring

Exception Monitoring tools make it easy to track exceptions down to a line-of-code, saving you valuable development time hunting down bugs. They also aggregate similar errors together to decrease noise when things are going wrong.

Open Source

Sentry has an open source option (and a paid service - see below).

You can also just use the Exception Notification Gem to notify you of exceptions via email. This doesn't scale well: you'll be overwhelmed with emails during peak outage periods (like a database server going offline).

Paid

We use Sentry. There are several other options including Honeybadger and Rollbar. These services allow you to collect exceptions from a variety of languages and frameworks.

Pricing starts around $30/mo.

Implementation Notes

We classify exceptions into 2 areas:

  1. Bugs in code that need to be fixed (this is where Sentry, Honeybadger, etc. are most useful).

  2. Transient errors that can occur but are typically only problems if they occur at a high rate and/or over an extended period of time (ex: database timeout errors).

For (2), we use StatsD (see Custom Metrics below) and set alerting thresholds on error rates.
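One way to route transient errors to a counter is sketched below. It is an illustration under assumptions: `count_transient_errors`, the `errors.transient` metric name, and the `increment` callable (standing in for a StatsD client's counter call) are all hypothetical, and `Timeout::Error` is just one example of a transient error class.

```ruby
require "timeout"

# Error classes treated as transient (an assumed, illustrative list).
TRANSIENT_ERRORS = [Timeout::Error].freeze

# Runs the block. When a transient error is raised, bumps a counter
# and re-raises so callers still see the failure. Other errors pass
# through uncounted and can go to the exception tracker as usual.
def count_transient_errors(increment, metric: "errors.transient")
  yield
rescue *TRANSIENT_ERRORS
  increment.call(metric)
  raise
end

# Example with an in-memory counter in place of a StatsD client.
counts = Hash.new(0)
increment = ->(name) { counts[name] += 1 }
begin
  count_transient_errors(increment) { raise Timeout::Error, "database timeout" }
rescue Timeout::Error
  # The caller decides whether to retry or surface the error.
end
puts counts.inspect
```

The alert threshold itself stays in the metrics service: the app only reports counts, and the service decides when the rate is high enough, or sustained long enough, to notify someone.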

Custom Metrics

Every app has key indicators that ensure things are working. For example, with Scout, we monitor the number of active servers and watch for large drops. These can indicate network issues between a customer's datacenter and ours.

StatsD is a terrific, lightweight tool for custom metrics. If you are logging numbers, it often makes sense to put those numbers into StatsD.

Open Source

Graphite is the standard dashboard solution that can accept StatsD metrics. The downside is alerting isn't included.

Paid

Both Scout's and Datadog's monitoring agents accept StatsD metrics. This lets you view and alert on most of your metrics (server, services, and StatsD) from a single service.

Librato is a more general hosted metrics service.

Implementation Notes

StatsD is a lightweight protocol (can even report metrics via bash), so you'll have universal support for custom metrics across languages and frameworks.
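To show how small the protocol is, here is a sketch that builds StatsD datagrams by hand and sends them over UDP using only Ruby's standard library. The helper names and the metric names are hypothetical, and the host and port assume an agent listening locally on 8125, StatsD's default port.

```ruby
require "socket"

# A StatsD datagram is plain text: "<name>:<value>|<type>".
# Common types: "c" (counter), "g" (gauge), "ms" (timing).
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget send over UDP. No reply is expected, so a missing
# agent does not block the caller.
def statsd_send(line, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
ensure
  socket&.close
end

puts statsd_line("checkins.received", 1, "c")
puts statsd_line("servers.active", 42, "g")
```

In an app you would normally use a StatsD client gem rather than raw sockets. The point here is that any language able to send a UDP packet can report metrics.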

Scheduled Jobs Monitoring

Are your backups really running? Is your once-per-day billing script completing successfully? That's where monitoring scheduled jobs comes into play.

Open Source

I'm not aware of any options, but paid options are very affordable, starting at $5/mo.

Paid

We use Dead Man's Snitch, which starts at just $5/mo.

Dead Man's Snitch is easy to set up: just hit an assigned URL when a job completes.
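The check-in pattern can be sketched as a small wrapper that pings the URL only after the job finishes without raising. This is a hedged illustration: `with_check_in` and the injectable `pinger` are hypothetical names, and the URL in the example is a placeholder for the one assigned to your job.

```ruby
require "net/http"
require "uri"

# Runs the job block, then hits the check-in URL. If the block raises,
# the ping is skipped, so the monitoring service notices the missed
# check-in and alerts.
def with_check_in(check_in_url, pinger: ->(url) { Net::HTTP.get_response(URI(url)) })
  result = yield
  pinger.call(check_in_url)
  result
end

# Example with a stand-in pinger so nothing is sent over the network.
pinged = []
with_check_in("https://example.com/placeholder-check-in", pinger: ->(url) { pinged << url }) do
  # ... run the nightly backup here ...
  :backup_done
end
puts pinged.inspect
```

Pinging after the work, rather than before it, is the design choice that matters: a job that starts but crashes halfway still counts as missed.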

TL;DR

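Here's our current monitoring stack at Scout entering 2016:

  • Uptime - Pingdom
  • Application Monitoring - Scout
  • Log Monitoring - ELK stack, with Lograge
  • Server & Service Monitoring - Scout
  • Exception Monitoring - Sentry
  • Custom Metrics - StatsD
  • Scheduled Job Monitoring - Dead Man's Snitch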

Questions? Suggestions?

We're happy to share more details. Ping us at support@scoutapp.com. Share your suggestions in the comments below.
