We’ve been deleting a lot of code from Scout. We’re ripping out major infrastructure, and in doing so, pulling the plug on functionality which, just six months ago, we believed would be crucial to our business. Most importantly, we’re simplifying the most complex, error-prone, and poorly-performing parts of the application. At the same time, our revenue and sales pipeline are growing at a faster rate.
How did this happen? How did we get to a place where we can remove code and functionality and see our business grow because of it?
As they say, “mistakes were made.” You don’t get the satisfaction of throwing out a bunch of cruft and performance-degrading features without having gone through the pain of:
Building those features in the first place.
Fighting the performance problems for a few months before you realize it’s all untenable and come up with alternatives.
So yes, mistakes were made. But also, lessons were learned.
UPDATE – Cloud Image Monitoring has been replaced by Server Roles
The greatest roadblock to monitoring is monitoring itself – installing software, tweaking scripts, remembering how to reload scripts, etc. makes it a painful process. It’s even more of an issue when setting up monitoring for cloud deployments – baking a large configuration script or a lot of installed software into a saved image makes your environment very fragile, especially when deploying new servers is a frequent task.
We’re debuting a new feature in Scout that makes monitoring your cloud servers a single crontab entry – no scripts to set up, edit, reload, or coordinate across multiple instances.
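To give a feel for what “a single crontab entry” means in practice, here’s a purely illustrative sketch – the real command, schedule, and key come from your Scout account after signup; everything below is made up for the example:

```shell
# Hypothetical crontab entry – the binary path, schedule, and account key
# shown here are placeholders, not Scout's actual invocation.
* * * * * /usr/bin/scout YOUR-ACCOUNT-KEY
```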
No Rails plugins to install. No performance hit during the request cycle. Nothing to break your application code. Nothing to restart. With just the path to your production Rails log file, Scout’s new Rails monitoring plugin alerts you when your Ruby on Rails application is slowing down and provides detailed daily performance reports.
First, an open-source shoutout: thanks to Willem van Bergen and Bart ten Brinke (the Rails Doctors) for their Request Log Analyzer gem, which we built upon for this functionality.
Rails analysis made easy
Easy setup. All we need is a path to the log file of your production Rails application. That’s it. There’s nothing to configure in your Rails application. Unlike our previous Rails analyzer, you don’t have to install a Rails plugin or even redeploy your Rails application. There are zero changes to your Rails code base.
In-depth analysis. Get rendering time and database time on a per-action basis. Know your error code rates, HTTP request types, cache hit ratios, and more.
No performance impact. Since the analysis happens outside the request-response cycle, there is no performance impact on your running Rails app.
Alerts. Like all Scout plugins, you can get alerts based on the flat data the plugin produces. Get alerts on requests/minute, number of slow requests, and average request length.
How it works
The plugin performs a combination of incremental and batch processing on your application’s logfile. Every time the Scout agent runs (every 3 to 30 minutes, depending on your Scout plan), it parses the new entries in your log file since the last time it ran. This provides key metrics for near-realtime graphs and alerts.
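The incremental half of that can be sketched with a saved byte offset: read from where you left off, remember where you stopped. This is just an illustration of the technique – the method and file names here are invented, not Scout’s actual code:

```ruby
# Incremental log reading sketch: keep the last-read byte offset in a
# small state file, and only parse lines appended since the previous run.
def new_entries(log_path, offset_file = "/tmp/scout_log_offset")
  last_offset = File.exist?(offset_file) ? File.read(offset_file).to_i : 0
  entries = []
  File.open(log_path) do |f|
    # If the log shrank (rotation/truncation), start over from the top.
    last_offset = 0 if f.stat.size < last_offset
    f.seek(last_offset)
    f.each_line { |line| entries << line }
    File.write(offset_file, f.pos.to_s)  # remember where we stopped
  end
  entries
end
```

Each run only pays for the lines written since the last run, which is what keeps the near-realtime pass cheap even on large log files.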
Once a day, the Analyzer runs to crunch the numbers for more in-depth metrics. This is what provides the breakdowns among all your actions, analysis of most popular actions, most expensive actions, etc.
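The daily batch pass is essentially a group-and-rank job over the day’s requests. A minimal sketch of that kind of roll-up – the entry format and method name are invented for the example, not the Analyzer’s real internals:

```ruby
# Daily roll-up sketch: group completed requests by action, then pick the
# most popular (by count) and most expensive (by total time) actions.
def summarize(requests)
  stats = Hash.new { |h, k| h[k] = { count: 0, total_ms: 0.0 } }
  requests.each do |r|
    s = stats[r[:action]]
    s[:count] += 1
    s[:total_ms] += r[:duration_ms]
  end
  {
    most_popular:   stats.max_by { |_, s| s[:count] }&.first,
    most_expensive: stats.max_by { |_, s| s[:total_ms] }&.first
  }
end
```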
Server Monitoring and weather forecasts have a lot in common – (1) I want to know what’s happening now, (2) what the forecast looks like for the immediate future, and (3) how the long-term is shaping up.
We’re developing features that make using Scout in a cloud hosting environment super simple (Amazon EC2, Rackspace Cloud, GoGrid, etc.). We’re looking for BETA testers, so shoot us an email at firstname.lastname@example.org with your account name if you are interested in giving Scout’s cloud support a try.
If you're using Amazon EC2, you may be familiar with CloudWatch, Amazon's analytic system that provides metrics on CPU usage, Network I/O, and Disk I/O of your instances. While CloudWatch collects metrics, it doesn't provide a web interface for viewing them, nor any graphs, trending, or alerting.
Enter our Scout EC2 CloudWatch plugin. Like any other Scout plugin, you can graph the resulting metrics, set triggers, track trends, and get email alerts when the numbers go out of bounds.
What does it monitor?
The CloudWatch plugin captures the following ("measures", as EC2 calls them): NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes, CPUUtilization.
Note, this plugin does not fetch EC2 Load Balancer Metrics, only EC2 instance metrics.
Single Instance, Autoscaling Group, etc.
The EC2 CloudWatch plugin can capture metrics from a single EC2 instance, or it can aggregate metrics across a couple of dimensions. It can aggregate metrics across a given instance type, across all instances launched from a specific image (AMI), or by a specified autoscaling group. That means you can, for example, graph the performance of your application server autoscaling group as a whole, or graph just your memcached instance.
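In CloudWatch terms, each of those aggregation choices is just a different "dimension" on the query. The dimension names below (InstanceId, ImageId, InstanceType, AutoScalingGroupName) are CloudWatch's own EC2 dimensions; the helper itself is only an illustration, not the plugin's code:

```ruby
# Map a friendly scope name to the CloudWatch dimension used when
# requesting EC2 metrics. Raises KeyError for an unknown scope.
def cloudwatch_dimensions(scope, value)
  key = {
    instance:          "InstanceId",
    image:             "ImageId",
    instance_type:     "InstanceType",
    autoscaling_group: "AutoScalingGroupName"
  }.fetch(scope)
  [{ name: key, value: value }]
end
```

So graphing an autoscaling group as a whole versus a single memcached box is just a matter of which dimension the plugin queries with.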
To use this plugin, you have to enable CloudWatch for the instance(s) you want to collect metrics from. See Amazon's CloudWatch docs for details. Basically, it's just ec2-monitor-instances ##### from the command line, or passing a monitoring parameter to ec2-run-instances. It's covered nicely in Amazon's docs.
New to Scout?
If you're learning about Scout through this plugin, sign up for a trial Scout account to give this plugin a try. You can graph all kinds of metrics and measurements from all your servers. It works with cloud instances, VPS's, and dedicated hardware.
James Gray's July 19th talk at RubyKaigi 2009 focused on best practices for long-running Ruby daemon processes.
What types of questions did the audience ask? What did they seem most interested in?
In general, users always want to know about our RRD usage, extracting the daemon functionality from Scout's agent, and the agent's memory usage. It was the same at RubyKaigi. The questions reminded me of how much current Ruby RRD solutions suck and that it's time we did something about that. It also reminded me that I need to get around to extracting our daemon code, which I've always intended to do.
As FiveRuns announced on their blog, FiveRuns Manage has reached End-of-Life. We have made arrangements with FiveRuns to ease the transition for customers who still need a robust, easy-to-use monitoring solution.
For current FiveRuns customers, we are offering 50% off your first paid month with Scout. Note that this is only for current FiveRuns Manage customers, and that the offer expires in one week (August 19th). Of course, like any other Scout signup, it’s risk-free: your first month is free (and your second month is half-off) and you can cancel, upgrade, or downgrade at any time.
FiveRuns Manage customers: use your discount code on our signup page, and welcome to Scout!
Getting started with Scout is very straightforward, and the signup process guides you through all the steps. The main difference from FiveRuns Manage is that you choose the components you want to monitor by selecting plugins. You can add or remove plugins at any time, and we offer some suggestions for getting started below.
Your basic process is this:
Install the gem: sudo gem install scout_agent, and start it with the server key you’re given on signup.
Customize or add Triggers. Scout uses triggers to alert you of spikes or trends in the data being gathered – for example, “alert me when the five-minute load average exceeds 4.0.” Plugins come with default triggers, and you can customize all you need.
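Conceptually, a trigger is just a threshold comparison against the data a plugin reports. A minimal sketch of that idea – the field names and `Trigger` struct are invented for the example, not Scout’s internal model:

```ruby
# Minimal trigger sketch: fire when a reported metric exceeds a maximum.
Trigger = Struct.new(:metric, :max) do
  def firing?(report)
    value = report[metric]
    !value.nil? && value > max  # missing metrics never fire
  end
end

# e.g. "alert me when the five-minute load average exceeds 4.0"
load_trigger = Trigger.new(:load_five_minute, 4.0)
```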
You might be familiar with Linux load averages already. Load averages are the three numbers shown with the uptime and top commands - they look like this:
load average: 0.09, 0.05, 0.01
Most people have an inkling of what the load averages mean: the three numbers represent averages over progressively longer periods of time (one, five, and fifteen minute averages), and that lower numbers are better. Higher numbers represent a problem or an overloaded machine. But, what's the threshold? What constitutes "good" and "bad" load average values? When should you be concerned over a load average value, and when should you scramble to fix it ASAP?
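One common rule of thumb (a heuristic, not an official threshold) is to judge the load average relative to the number of CPU cores: below the core count means there’s idle capacity, around it means fully busy, above it means processes are queuing for CPU time. A sketch of that heuristic, with made-up cutoffs:

```ruby
# Rule-of-thumb classifier: compare load average to core count.
# The 0.7 "headroom" cutoff is a common heuristic, not a standard.
def load_status(load_avg, cores)
  per_core = load_avg / cores.to_f
  if per_core < 0.7
    :ok           # cores have spare capacity
  elsif per_core <= 1.0
    :busy         # fully utilized, no headroom
  else
    :overloaded   # work is queuing for CPU time
  end
end
```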