Overhead Benchmarks: New Relic vs. Scout

February 07 | By Derek | Posted in App Monitoring

High monitoring overhead is a silent killer: your app's requests take longer, throughput capacity shrinks, end users' requests start stacking up in a request queue, you react by provisioning more servers, and finally, more servers == more $$$.

So how does Scout's overhead compare with the competition? To find out, we set up a suite of benchmarks comparing Scout's overhead to New Relic.

To ensure fair results, every part of these tests is open-source - from the Rails app we're benchmarking to the Rails log files generated by the benchmarks. We encourage you to analyze the raw data, try these benchmarks on your own, and let us know if you come to a different conclusion.

Benchmarking Scenarios

App monitoring overhead varies based on (1) instruments used and (2) available resources on the application server.

With that in mind, we benchmarked agent overhead in the following scenarios:

  1. Representative Endpoint - this test hits 100 endpoints in a Rails app, with each controller-action conducting 21 database queries and rendering 20 view partials.
  2. Expensive Endpoint - we simulate an app that does a lot of work to deliver a request (1k database queries and 1k view partials).
  3. Fast Endpoint - we simulate an API-like controller-action that does very little work.

In these benchmarking tests, our metric of comparison will be response time. We're benchmarking a Rails 4.2.5 application running Ruby 2.2.3.

I've put the results below. Beneath this summary, you'll find details and analysis on each benchmark. The percentages below represent the increase in response time when each agent is installed. Lower is better:

Response Time Overhead Benchmarks

Representative Endpoint: 21 database queries and 20 view partials per controller-action.

  APM Agent    Response Time    Overhead
  None         55.6 ms
  New Relic    80.4 ms          44.5%
  Scout        56.8 ms          2.2%

Expensive Endpoint: 1k database queries and 1k view partials per controller-action.

  APM Agent    Response Time    Overhead
  None         2,811.1 ms
  New Relic    3,871.0 ms       37.7%
  Scout        2,922.8 ms       4.0%

Fast Endpoint: 1 database query and no view partials per controller-action.

  APM Agent    Response Time    Overhead
  None         2.82 ms
  New Relic    3.71 ms          32.0%
  Scout        3.06 ms          8.8%
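
The overhead column is the relative increase in mean response time over the no-agent baseline. Here is a minimal Ruby sketch of that arithmetic. The helper name `overhead_pct` is ours, purely for illustration. It works from the rounded means shown above, so its result can differ from a published figure by a few tenths of a percent.

```ruby
# Percentage increase in mean response time relative to the no-agent baseline.
# `overhead_pct` is an illustrative helper, not part of the benchmark repo.
def overhead_pct(baseline_ms, agent_ms)
  ((agent_ms - baseline_ms).to_f / baseline_ms * 100).round(1)
end

# Expensive endpoint means from the summary above.
new_relic = overhead_pct(2811.1, 3871.0)
scout     = overhead_pct(2811.1, 2922.8)
puts "New Relic: #{new_relic}%  Scout: #{scout}%"
```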

Representative Endpoint Benchmark

Every Rails app is a special snowflake, but this is a close approximation of a controller-action in a high-traffic Rails app, based on data we've collected from the apps we are monitoring.

This test hits 100 endpoints in a Rails app, with each controller-action conducting 21 database queries and rendering 20 view partials. Additionally, the test forces 0.4% of requests to take longer than 2,000 ms, since New Relic and Scout both collect extended details on slow requests. This test (and all of the others) hits 100 unique endpoints rather than a single endpoint because New Relic and Scout aggregate metrics by endpoint, and we want to test the performance of that aggregation.
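
For a sense of scale, 0.4% works out to one request in every 250. The sketch below shows one way to pick those requests. The counter-based selection and the names `SLOW_EVERY` and `slow_request?` are our assumptions for illustration; the benchmark app may select slow requests differently (for example, at random).

```ruby
# Decide whether a request should be artificially slowed so that 0.4% of
# requests (1 in 250) take longer than 2,000 ms.
# Hypothetical helper; not taken from the benchmark app.
SLOW_EVERY = 250 # 1 / 250 = 0.4%

def slow_request?(request_number)
  (request_number % SLOW_EVERY).zero?
end

slow = (1..10_000).count { |n| slow_request?(n) }
puts "#{slow} of 10,000 requests slowed"
```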

  APM Agent    Response Time    Response Time        Response Time    Overhead
               (Mean)           (95th Percentile)    (Max)
  None         55.6 ms          106.4 ms             2,174.1 ms
  New Relic    80.4 ms          149.5 ms             2,263.5 ms       44.5%
  Scout        56.8 ms          102.7 ms             2,168.7 ms       2.2%

Analysis

New Relic's overhead is roughly 20x Scout's (44.5% vs. 2.2%). I was curious if there was a specific area of New Relic's instrumentation that was responsible for the overhead. However, looking at the New Relic Stackprof output, the stack samples are spread fairly evenly: their seven most expensive method calls each account for more than 0.5% of stack samples.

Note that New Relic and Scout both do their processing and reporting work on a background thread, and that work is included in the Stackprof output.

Expensive Endpoint Benchmark

Our representative endpoint benchmark hit controller-actions that made 21 database calls and rendered a partial 20 times per request. In this test, we simulate hitting controller-actions with 1,001 database calls and 1,000 view partials (a 50x increase in instrumented calls). APM agents instrument database calls and view partial rendering time, so we want to see the overhead when they are required to do a lot of instrumentation. This pushes the average response time past 2 seconds (about 2.8 seconds with no agent installed), which is also the default threshold at which New Relic and Scout collect additional data on slow requests.

  APM Agent    Response Time    Response Time        Response Time    Overhead
               (Mean)           (95th Percentile)    (Max)
  None         2,811.1 ms       3,761.1 ms           5,922.3 ms
  New Relic    3,871.0 ms       5,090.0 ms           7,057.7 ms       37.7%
  Scout        2,922.8 ms       4,054.4 ms           6,216.4 ms       4.0%

Analysis

Both Scout and New Relic handle the increase in instrumentation gracefully: overhead as a percentage of response time doesn't grow in step with the number of instrumented method calls. With 50x more instrumented calls, New Relic's overhead goes from 44.5% to 37.7% and Scout's from 2.2% to 4.0%.

Fast Endpoint Benchmark

Basically the opposite of testing endpoints doing lots of work, this tests an endpoint doing very little work. This controller-action conducts a single database call and renders text straight from the controller (no views).

Fast controller actions are important to optimize for, since a few milliseconds of additional time can constitute a significant percentage increase.

  APM Agent    Response Time    Response Time        Response Time    Overhead
               (Mean)           (95th Percentile)    (Max)
  None         2.82 ms          8.87 ms              125.8 ms
  New Relic    3.71 ms          12.63 ms             193.5 ms         32.0%
  Scout        3.06 ms          9.92 ms              116.6 ms         8.8%

Analysis

As this is such a fast controller-action with a 95th-percentile response time under 13 ms in all of our tests, any overhead will naturally appear high. For comparison, a bare-bones StatsD instrumentation tracking throughput, response time, and response codes amounts to 0.7% of the total Stackprof samples during the benchmark. Scout amounts to 3.2% of the samples.
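
The exact StatsD setup used for that comparison isn't shown here. As a rough idea of what "bare-bones" can look like, the Rack middleware below emits the three metrics using only Ruby's standard library and the StatsD line format (`name:value|type`). The class name, metric names, host, and port are all assumptions made for this sketch.

```ruby
require "socket"

# Bare-bones Rack middleware that emits StatsD metrics over UDP.
# StatsdTiming and the "app.*" metric names are hypothetical.
class StatsdTiming
  def initialize(app, host: "127.0.0.1", port: 8125, socket: UDPSocket.new)
    @app = app
    @host = host
    @port = port
    @socket = socket
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    elapsed_ms = (elapsed * 1000).round(2)

    emit("app.requests:1|c")                   # throughput (counter)
    emit("app.response_time:#{elapsed_ms}|ms") # response time (timer)
    emit("app.status.#{status}:1|c")           # response codes (counter)
    [status, headers, body]
  end

  private

  # Metrics are best-effort: a failed send must never fail the request.
  def emit(line)
    @socket.send(line, 0, @host, @port)
  rescue SystemCallError
    nil
  end
end
```

In a Rails app, a middleware like this would typically be registered with `config.middleware.use StatsdTiming`.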

Scout and New Relic do provide knobs that can be tuned to decrease instrumentation if an app is largely composed of very fast endpoints.

Testing Methodology

To run the tests, we provisioned 3 instances on Digital Ocean:

  • Application Server (8 core, 16 GB memory)
  • Database Server (12 core, 32 GB memory)
  • Utility Server (1 core, 512 MB memory)

The utility server runs siege to benchmark the application server's performance. Siege was run against a set of 100 endpoints, with a concurrency level equal to the number of unicorn workers, for a minimum of ten minutes. For example:

siege -v -f urls.txt -c 30 -b -t 10M
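
The `-f` flag points siege at a file with one URL per line. A file like that can be generated with a few lines of Ruby. The host name and the `/endpoints/N` path pattern below are placeholders, not the benchmark app's actual routes.

```ruby
# Build the list of URLs that siege reads via `-f urls.txt`, one per line.
# `siege_urls`, the host, and the path pattern are illustrative assumptions.
def siege_urls(host, count = 100)
  (1..count).map { |n| "http://#{host}/endpoints/#{n}" }
end

urls = siege_urls("app.example.test")
puts urls.first(3)
# To write the file: File.write("urls.txt", urls.join("\n") + "\n")
```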

Eliminating database response time variance

Database query response times can vary enough to skew test results. To prevent this, database rows are cached in the Rails process's memory and fetched from there. Before recording benchmarks, each test begins with a one-minute siege to warm the cache, so the slower initial queries aren't included in the results.
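
The idea can be sketched as a small per-process cache: the first lookup runs the real query, and every later lookup is served from memory. `RowCache` is a hypothetical name, and the benchmark app's actual caching code may look different.

```ruby
# Minimal in-process row cache. The block stands in for the real database
# query and only runs on a cache miss.
class RowCache
  def initialize
    @rows = {}
  end

  # Returns the cached value for `key`, computing it with the block on a miss.
  def fetch(key)
    @rows.fetch(key) { @rows[key] = yield }
  end
end

cache = RowCache.new
queries = 0
3.times { cache.fetch(:user_1) { queries += 1; { id: 1, name: "example" } } }
puts "queries executed: #{queries}"
```

This is also why the warm-up siege matters: each worker process has its own memory, so every worker needs to see each row once before its timings are representative.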

The Rails App

We used Ruby 2.2.3 and Rails 4.2.5. The application source can be accessed on GitHub. There is a dedicated branch for each testing variation, which makes it easy to ensure changes were applied consistently. For example, here are the changes to enable Scout on the representative endpoint benchmark.

To simulate a real-world application, the application responds to 100 different endpoints. This adds diversity - APM agents aggregate metrics by endpoint, so we want to ensure we test that aggregation. Each endpoint actually does the same work - we care about diversity but want to keep tests consistent.

Agent Versions

The most recent agent versions were used at the time of the benchmarks:

  • New Relic - 3.14.2.312
  • Scout - 1.3.1

Gathering metrics

Logs of each test run were stored. We used the lograge gem to make parsing metrics from the log file easier. The numbers shared in these benchmarks were generated by parsing the log file of each test run.
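
As a sketch of that parsing step, the method below pulls the `duration=` field out of lograge-style key=value lines and computes the mean, 95th percentile, and max. The method name and the nearest-rank percentile definition are our choices for illustration; the script used for the published numbers may differ.

```ruby
# Summarize response times (in ms) from lograge-style key=value log lines.
# Lines without a `duration=` field are ignored.
def response_time_stats(lines)
  durations = lines.map { |line| line[/\bduration=(\d+(?:\.\d+)?)/, 1] }
                   .compact.map(&:to_f).sort
  return nil if durations.empty?

  rank = (95 * durations.size + 99) / 100 # nearest-rank 95th percentile
  {
    mean: (durations.inject(:+) / durations.size).round(2),
    p95:  durations[rank - 1],
    max:  durations.last
  }
end

sample = [
  "method=GET path=/a status=200 duration=58.33 view=40.43 db=15.26",
  "method=GET path=/b status=200 duration=61.10 view=42.00 db=16.01"
]
p response_time_stats(sample)
```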

Stackprof middleware, installed via the Stackprof gem, is disabled by default but can be enabled to gather CPU samples across the different agents. Stackprof was disabled during the benchmark runs to eliminate any variability caused by its overhead and by persisting data to disk.

Agent Settings

Each agent is tested against its factory-set defaults. While some agents provide configuration options to turn off specific areas of instrumentation, the reality is few developers do this or understand the impact of what the settings do.

Qualifying a Benchmark

We threw out a benchmark if it didn't meet the following criteria:

  • Consistent response times: we ensured response times didn't trend up or down during any of the benchmarks.
  • Consistent throughput: we threw out tests that exhibited erratic throughput behavior which can sometimes be attributed to networking issues between the utility server and the application server.
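
One simple way to check the first criterion is to compare the mean response time of the first half of a run with that of the second half. The method name and the 5% tolerance below are illustrative assumptions, not the thresholds used for the published benchmarks.

```ruby
# Returns true when response times don't drift between the first and second
# half of a run. `steady?` and the default tolerance are hypothetical.
def steady?(durations_ms, tolerance: 0.05)
  return false if durations_ms.size < 2

  half   = durations_ms.size / 2
  mean   = ->(xs) { xs.inject(:+).to_f / xs.size }
  first  = mean.call(durations_ms.first(half))
  second = mean.call(durations_ms.last(half))
  ((second - first).abs / first) <= tolerance
end

puts steady?([50, 52, 49, 51, 50, 50]) # flat run
puts steady?([50, 50, 50, 80, 80, 80]) # run that trends upward
```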
