A big thanks to Eric Lindvall of Papertrail for adding steal time to Scout's CPU Usage Plugin and helping out on this blog post!
Netflix tracks CPU Steal Time closely. In fact, if steal time exceeds their chosen threshold, they shut down the virtual machine and restart on a different physical server.
If you deploy to a virtualized environment (for example, Amazon EC2), steal time is a metric you'll want to watch. If this number is high, performance can suffer significantly. What is steal time? What causes high steal time? When should you be worried (and what should you do)?
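To make the metric concrete, here is a minimal Ruby sketch of how steal time can be derived on Linux. It assumes the aggregate `cpu` line of `/proc/stat`, whose eighth counter is steal on reasonably modern kernels. The helper names (`parse_cpu_line`, `steal_percent`) and the sample counters are made up for illustration; this is not the code in Scout's CPU Usage Plugin.

```ruby
# Counters on the aggregate "cpu" line of /proc/stat, in order.
CPU_FIELDS = %i[user nice system idle iowait irq softirq steal].freeze

# Parse the aggregate "cpu ..." line into a hash of tick counters.
# Counters missing on older kernels are treated as 0.
def parse_cpu_line(stat_text)
  line = stat_text.lines.find { |l| l.start_with?("cpu ") }
  raise ArgumentError, "no aggregate cpu line found" unless line
  values = line.split[1, CPU_FIELDS.size]
  CPU_FIELDS.zip(values).to_h.transform_values(&:to_i)
end

# Share of elapsed CPU ticks reported as steal between two samples, in percent.
def steal_percent(before, after)
  deltas = CPU_FIELDS.map { |field| [field, after[field] - before[field]] }.to_h
  total = deltas.values.sum
  return 0.0 if total.zero?
  (deltas[:steal].to_f / total * 100).round(2)
end

# Hypothetical samples; on a real box you would read /proc/stat twice,
# a few seconds apart.
first  = parse_cpu_line("cpu  100 0 50 800 10 0 0 40 0 0\n")
second = parse_cpu_line("cpu  200 0 100 1600 20 0 0 80 0 0\n")
puts steal_percent(first, second) # prints 4.0
```

The counters are cumulative since boot, so only the difference between two samples says anything about what is happening right now.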
A couple of years ago I visited Argentina. I have trouble enough pronouncing my limited English vocabulary and I don't speak Spanish, but after a bit of time, it was pretty easy to order food, buy groceries, and use a taxi. However, the occasional hangups that happen during my regular life in the States would throw me out of sorts in Spanish: a taxi driver trying to explain he doesn't have enough change would send me off the rails.
Ruby is my English when it comes to writing software, so when I hit hangups installing something Ruby-related, I can usually work my way out of them. Our monitoring agent at Scout is a Ruby gem, and while most of our customers already have Ruby installed, for those who don't, a hangup that seems small to me can be frustrating.
Now, thanks to Omnibus, there's an easy way to distribute your Ruby gems as a standalone, full-stack program. This means folks without Ruby can have as smooth an experience with your hip new gem as a hardened Rubyist.
Here's how I've built a full-stack installer for our scout Ruby Gem.
Back in 2010, we suggested using /bin/bash -l -c to run scout via Cron when using RVM. However, this was a brute-force approach: the -l flag tells bash to behave as a login shell. As Daniel Szmulewicz eloquently stated in the comments for the original blog post, "Cron jobs are by nature non-login, non-interactive processes".
Fast-forward to today: RVM usage is continuing in production, and to make things more complicated, Cron jobs often need to account for both RVM and Bundler. So, what's our preferred approach when running Ruby executables via Cron in an RVM, RVM+Bundler, or Bundler environment? A shell script.
Cron Shell Script: RVM + Bundler
Let's say we want to run a Ruby executable (scout [KEY]) via Cron with (1) Ruby 1.9.2 and (2) my Rails App's Gem bundle:
Make the shell script executable: chmod +x FILE.sh.
Add the Cron job:
* * * * * shell_script.sh
But that's a lot of typing...
It's tempting to use /bin/bash -l -c when you are busy/lazy. To get around this, the scout install [KEY] command will detect if you are using (1) RVM and/or (2) Bundler. If so, we generate the shell script for you and make it executable.
scout install BNrIneEBMwE8h6VlhO4Bw4WmOVSLmnygSFZEPCfi
=== Scout Installation Wizard ===
It looks like you've installed Scout under RVM and/or Bundler.
We've generated a shell script for you.
Run `crontab -e`, pasting the line below into your Crontab file:
* * * * * /Users/dlite/.scout/scout_cron.sh
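For a rough idea of what generating such a script can look like, here is a minimal Ruby sketch. It is not the code behind scout install. The method names, the per-user RVM path ($HOME/.rvm/scripts/rvm), and the example arguments are all assumptions for illustration.

```ruby
require "shellwords"

# Build the text of a Cron wrapper script.
# ruby_version and app_dir are optional: pass ruby_version when RVM is in
# play and app_dir when the command should run inside a Gem bundle.
def cron_script(command:, ruby_version: nil, app_dir: nil)
  lines = ["#!/bin/bash"]
  if ruby_version
    # Assumes a per-user RVM install; adjust the path for a system-wide one.
    lines << 'source "$HOME/.rvm/scripts/rvm"'
    lines << "rvm use #{Shellwords.escape(ruby_version)}"
  end
  if app_dir
    lines << "cd #{Shellwords.escape(app_dir)}"
    lines << "bundle exec #{command}"
  else
    lines << command
  end
  lines.join("\n") + "\n"
end

# Write the script to disk and make it executable (the chmod +x step).
def write_cron_script(path, contents)
  File.write(path, contents)
  File.chmod(0755, path)
  path
end

puts cron_script(command: "scout [KEY]", ruby_version: "1.9.2", app_dir: "/path/to/app")
```

Keeping the text generation separate from the file write makes it easy to print the script for review before anything touches the filesystem.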
How do we detect RVM and Bundler? We've encapsulated it into an Environment class:
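The class itself isn't reproduced here, so what follows is only a minimal sketch of one way such detection could work, not Scout's actual Environment class. It assumes RVM exports rvm_path into shells where it is loaded and that Bundler sets BUNDLE_GEMFILE for processes started via bundle exec. The method names are hypothetical.

```ruby
# Hypothetical stand-in for an Environment class that answers
# "are we running under RVM and/or Bundler?" from environment variables.
class Environment
  # env is injectable so the checks can be exercised without touching ENV.
  def initialize(env = ENV)
    @env = env
  end

  # Assumption: RVM exports rvm_path once it has been loaded into the shell.
  def rvm?
    !@env["rvm_path"].to_s.empty?
  end

  # Assumption: Bundler sets BUNDLE_GEMFILE for commands run via `bundle exec`.
  def bundler?
    !@env["BUNDLE_GEMFILE"].to_s.empty?
  end

  # A wrapper shell script is only worth generating if either is in play.
  def wrapper_needed?
    rvm? || bundler?
  end
end

env = Environment.new("rvm_path" => "/Users/dlite/.rvm")
puts env.rvm?           # prints true
puts env.bundler?       # prints false
puts env.wrapper_needed? # prints true
```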
Scout’s realtime charts have been a big hit. Once you start using them for major deploys or performance incidents, going back to ten terminal windows running “top” feels like the dark ages.
So, how did we go about it?
To inspire hard work, some young men hang a poster on their wall that includes: (1) an exotic sports car (2) a scantily clad lady and (3) a beach house. My inspirational poster would be much less attractive: a friendly butler who offers time-honored wisdom (with an accent because people with accents are smarter) and absolutely loves running errands for me.
I don’t like running errands because I don’t like waiting in lines. My nightmare: having to pick up groceries during a busy weekend afternoon. There are 3 queues at the grocery store that can cause a delay:
- Finding a parking spot
- Getting a shopping cart
- Checking out
Modern web apps face the same queuing issues serving web requests under heavy traffic. For example, a web request served by Scout passes through several queues:
That’s Apache (for SSL processing) to HAProxy on the load balancer, then Apache to Passenger to the Rails app on a web server.
A request can get stuck in any of those five spots. The worst part about queues? Time in queue is easy to miss. Most of the time, people look at the application log when they suspect a slowdown. However, a slowdown in any of the four earlier queues won’t show up in your application log. Just looking at your application and database activity for slowdowns is like recording the time it takes to get your groceries from the time you grab the first item on the shelf till you start waiting to check out: you’re leaving out the time it takes to find a parking spot, get a cart, and check out.
Now, before you start worrying about queues, take a deep breath. First, each of these systems is super reliable. For the most part, they just work. Second, it’s much more likely your application logic is the cause of a performance issue than a queuing problem. Look there first.
Third (and most importantly), each of these systems handles queues in remarkably similar ways. Understanding some basic queuing concepts will go a long way. Let’s take a look at some basics and then specific examples for Apache, HAProxy, and Passenger.
A big part of providing good support is making it painless. At Scout, Andre and I handle all of the support requests. Once we’ve gathered the account information, it usually doesn’t take much time to help. The problem is quickly putting the account information together. We don’t want to use a dedicated support application – we usually handle just a couple of support requests per day.
Why not view all of the account information right from Gmail, where the support request originates? We’re using Rapportive with a custom Raplet to make it happen. When we receive an email from a Scout customer, we see their Scout account info.
You maintain a growing Rails application and you’re seeing something peculiar. Sometimes when you use the application, it feels like the performance deteriorates significantly. However, all of your performance data shows no issues – requests in the Rails log file look speedy, CPU utilization is fine, database performance is solid, etc.
At first, you wave it off as a fluke. But then a customer reports the same issue. Now you’re concerned.
~ or ~
Sysadmin Eye for the Dev Guy
Developers! You can churn out a Rails or Sinatra app in no time. What about putting it out there in production? Occasionally forget the syntax for crontab or logrotate? Yeah, me too.
That's why I wrote up a few essential notes for a serviceable production environment.
This article covers CentOS/Red Hat and Ubuntu, which is what I always end up on. My approach is to get some minimal configurations working quickly so I can see some results. From there, I can go back and refine the configurations.
How much memory is really available on your Linux box? Don't use /proc/meminfo to find out; use free -m instead. You may have more memory available than you thought.
Here's an example.
/proc/meminfo says about 330MB is free:
~ $ cat /proc/meminfo
MemFree: 340996 kB
free -m gives the following:
~ $ free -m
             total       used       free     shared    buffers     cached
Mem:          1024        691        332          0         86        288
-/+ buffers/cache:        316        708
Swap:         2047         68       1979
You'll see the "buffers" and "cached" columns, which tell you about the amount of memory that the kernel is using for filesystem buffers, etc.
This sort of cached data will be freed by the kernel when an application tries to allocate more than what is "free", which is why the "-/+ buffers/cache" line is really the important line to pay attention to when you're checking out the free memory on a system.
So in this example, 708MB is how much memory is technically available for allocation should an application need it. The "buffers" (86MB) and "cached" (288MB) will be released by the kernel if they are needed.
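The arithmetic is easy to script. Here is a minimal Ruby sketch that adds up the free, buffers, and cached columns of the Mem: row. It assumes the older free -m column layout shown above (newer versions of free print different columns), and the method name is made up for illustration.

```ruby
# Sum the free, buffers, and cached columns (in MB) of the "Mem:" row.
# Assumes the column order: total used free shared buffers cached.
def reclaimable_free_mb(free_output)
  row = free_output.lines.find { |line| line.start_with?("Mem:") }
  raise ArgumentError, "no Mem: row found" unless row
  _total, _used, free, _shared, buffers, cached = row.split.drop(1).map(&:to_i)
  free + buffers + cached
end

# The output from the example above, pasted in as a string.
sample = <<~OUT
  total used free shared buffers cached
  Mem: 1024 691 332 0 86 288
  -/+ buffers/cache: 316 708
  Swap: 2047 68 1979
OUT

puts reclaimable_free_mb(sample) # prints 706
```

The result lands a couple of MB short of the 708 on the "-/+ buffers/cache" line, presumably because free rounds each column to whole megabytes before printing it.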
All credit for this post goes to Eric Lindvall, who also wrote the memory profiler plugin.
Our co-author today is Jesse Newland. Jesse keeps RailsMachine customers up and running and troubleshoots their toughest problems. We’re pleased to have him share some of his expertise on Phusion Passenger.
Say your Rails application is running in production and it’s getting good traffic. Performance isn’t as good as you would like. You’ve already determined that your database is not the bottleneck. What’s your next move?
There is a good chance that Passenger’s PassengerMaxPoolSize needs to be adjusted. PassengerMaxPoolSize specifies how many instances of your application Passenger will spin up to service incoming requests. If you were running Mongrels back in the day, PassengerMaxPoolSize is equivalent to the number of mongrels you configured for your app. The value of PassengerMaxPoolSize has a major bearing on your application’s performance.