A 5-point Rails app performance audit

October 27 · By Derek

Before we talk performance, let's talk entropy. Entropy refers to the idea that everything in the universe eventually moves from order to disorder; it's the measure of that change.

Like entropy, the performance of a Rails app will trend toward disorder. An N+1 database query here, a forgotten pagination implementation there, a missing index somewhere else. This performance debt builds over time, and suddenly...we've got a slow app.
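To make that debt concrete, here's what one common piece of it - an N+1 query - looks like and how it's typically fixed. The Post/Comment models below are hypothetical, not from the app we'll audit:

```ruby
# Hypothetical models, for illustration only.

# N+1: one query to load the posts, then one COUNT query per post.
posts = Post.order(created_at: :desc).limit(20)
posts.each { |post| puts post.comments.count }

# Fix: eager-load the association so all the comments arrive in one extra
# query, and use `size` so the preloaded records are counted in memory.
posts = Post.order(created_at: :desc).limit(20).includes(:comments)
posts.each { |post| puts post.comments.size }

# A missing index is often a one-line migration away, e.g.:
#   add_index :comments, :post_id
```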


Where do you start knocking down this performance debt? Surely, not everything is slow, right? Let's perform a Rails performance audit.

In 10 minutes or less, you'll have a good idea of where your app stands and where to focus your efforts by following this 5-point performance audit. At each step of the audit, I'll work through the analysis on a real production app so you can see an audit applied.

PSA: Do not practice shotgun optimization


Performance in almost all web apps - including Rails - follows an 80/20 rule: most of your performance problems will be contained within a small amount of the application code. This is a great thing: most of the time, you don't need to litter your code base with performance hacks. You don't need to optimize everything.

It's your job - with the help of production performance monitoring tools - to identify these performance hot spots. In this post, we'll use Scout to perform the audit.

Can you get by without production monitoring? I wouldn't recommend it: the production version of your app behaves very differently from your code-reloading, trivial-database, no-traffic, single-user development app.

1. What's the general profile of the app?

You're here because your app is running slow, so let's start by looking at your response times. In Scout, this is front-and-center when viewing your app. For this audit, change the timeframe from the default to 7 days. Many apps have seasonal trends - like higher throughput during the business day - and the longer timeframe will reveal the general profile of the app:

7-day response times

Each element of the stacked bar represents a layer of your stack (ex: database, external HTTP calls, Ruby). Added together, the layers show the average response time across all requests to the app.

How's the example app look?

Pretty good!

This is likely an app used during the business day as traffic and response times peak around 12pm.

While response times increase from 40 ms to 70 ms during these peak periods, we can't yet conclude there is a scaling problem: customers may be using heavier controller-actions that aren't used during off periods. Let's note this and dig in later.

2. Where does the time go?

Typically, just a couple of layers in the stack are responsible for most of the time your app spends responding to web requests. We can narrow this down by looking at the timeseries chart - it's a hint at where we'll be focusing our time.

Are there periods where requests are backed up in the queue?

There's a special metric on the response time chart: QueueTime. This measures the time from when a request hits a load balancer until it is first processed by your application server (ex: Puma):

queue time

You'll want this value to remain under 20 ms. If this exceeds 20 ms, you have a capacity issue somewhere in your stack (most commonly at the app server or database).
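Scout reports queue time for you, but to make the metric concrete, here's a minimal sketch of where a number like this can come from, assuming your load balancer or front-end web server stamps each request with an X-Request-Start header. The header format and the QueueTimeLogger class below are assumptions for illustration, not Scout's implementation:

```ruby
# Minimal Rack middleware sketch: derive queue time from an assumed
# "X-Request-Start: t=<unix seconds>" header set by the load balancer.
class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    if (stamp = env["HTTP_X_REQUEST_START"])
      started_at = stamp.sub("t=", "").to_f            # seconds since epoch
      queue_ms   = ((Time.now.to_f - started_at) * 1000).round(1)
      # Sustained queueing over ~20 ms hints at a capacity problem.
      Rails.logger.warn("queue_time=#{queue_ms}ms") if queue_ms > 20
    end
    @app.call(env)
  end
end

# Hypothetical wiring in config/application.rb:
#   config.middleware.insert_before 0, QueueTimeLogger
```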

How's the example app look?

  • ActiveRecord (database) and Controller (Ruby) account for most of the time spent. This is great: it drastically reduces the number of possible layers that need attention.
  • Queue times are generally 2 ms or less - it's unlikely we have a capacity problem.

3. Are we dealing with a cheetah or a sloth?

We've gained a great picture of our app's performance profile. Now it's time to look at the response time numbers and determine how much trouble we're in. Beneath the timeseries chart in Scout, you'll find a series of sparklines. The numbers here represent the averages over the given time period (7 days, in our case):

response time sparkline

We'll look at the "response time - mean" sparkline first. Here are some rules of thumb on response times:

Response Time   Classification
< 50 ms         Fast
< 300 ms        Normal
> 300 ms        Slow

If you're just serving JSON from an API server, response times should be smaller...perhaps 100 ms is slow in your case.

How's the example app look?

Response times are fast...but the mean response time doesn't tell the whole story.

4. What's the spread of response times?

A single, fast, high-throughput controller action can drastically lower an app's mean response time. The mean is a great place to start, but it doesn't provide the entire picture. For a broader picture on your app's performance, you'll want the 95th percentile response time as well:

95th

The 95th percentile response time says that 95% of requests have response times at or below this number. Conversely, 5% of response times are above this threshold. You'll want the 95th percentile response time to be no greater than 4x the mean response time. If this ratio is greater than 4:1, your app may have some controller-actions triggering significantly longer response times.
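To see how the mean, the 95th percentile, and the 4:1 ratio relate, here's a tiny illustrative calculation over a made-up set of response times. Scout computes these percentiles for you; the samples below are hypothetical:

```ruby
# Hypothetical response time samples, in milliseconds.
samples_ms = [38, 42, 45, 47, 51, 55, 60, 64, 70, 162]

mean = samples_ms.sum.to_f / samples_ms.size

sorted = samples_ms.sort
p95    = sorted[(0.95 * (sorted.size - 1)).ceil]   # simple 95th percentile estimate

ratio = p95 / mean
puts format("mean=%.0f ms  p95=%.0f ms  ratio=%.1fx", mean, p95, ratio)
# => mean=63 ms  p95=162 ms  ratio=2.6x
# A ratio much above 4x suggests a handful of controller-actions are far
# slower than the rest of the app.
```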

How's the example app look?

The 95th percentile response time (162 ms) is about 3.2x the mean response time (51 ms). This falls within the 4:1 ratio, but it's close enough to the limit that there may be some slow controller-actions within the app.

5. How much traffic is the app handling?

As a general rule, the greater the throughput, the more difficult performance work becomes. As throughput grows, the underlying services are generally more complex and there are more business tradeoffs to consider when doing performance work (ex: what to do when an endpoint is slow for a single, high-paying customer?).

Generally:

Requests Per-Minute   Scale
< 50 rpm              Small
50 - 500 rpm          Average
500+ rpm              Large

How's the example app look?

Our mean throughput is 240 rpm with spikes up to 350 rpm. This is an average application in terms of throughput. There's likely a decent number of knobs we can turn.

Your 5-point Rails performance audit cheatsheet

We now have a solid 10,000-foot view of our app's performance characteristics and health. These are the questions we asked:

  1. What's the general performance profile of the app? Are there clear busy periods during a day? Does the app get dramatically slower during peak times?
  2. Where does the time go? There's likely just a couple dominant pieces (example: Ruby & ActiveRecord). Are there periods where QueueTime exceeds 20 ms, indicating a capacity problem?
  3. Is our app fast (< 50 ms response times), normal (< 300 ms response times), or slow (> 300 ms response times)?
  4. Are 95th percentile response times no greater than 4x the mean response time? If not, there may be some pokey controller-actions even if response times are generally fine.
  5. Is our app small (< 50 rpm), medium (50 - 500 rpm), or large (500+ rpm)? The larger the app, the more involved performance work becomes.

Up Next: digging into the cause

If you're a developer like me, you're eager to start digging into code and fixing these problems. In the next post in this series, we'll dive into your endpoints to identify where to focus your time. Use the form below to be notified via email:

Get notified of new posts.

Once a month, we'll deliver a finely-curated selection of optimization tips to your inbox.
