Understanding CPU Steal Time - when should you be worried?

July 25 | By Derek | Posted in HowTo

A big thanks to Eric Lindvall of Papertrail for adding steal time to Scout's CPU Usage Plugin and helping out on this blog post!

Netflix tracks CPU Steal Time closely. In fact, if steal time exceeds their chosen threshold, they shut down the virtual machine and restart on a different physical server.

If you deploy to a virtualized environment (for example, Amazon EC2), steal time is a metric you'll want to watch. If this number is high, performance can suffer significantly. What is steal time? What causes high steal time? When should you be worried (and what should you do)?

CPU Steal Time Definition

From ibm.com:

Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.

Your virtual machine (VM) shares resources with other instances on a single host in a virtualized environment. One of the resources it shares is CPU cycles. If your VM is one of four equally sized VMs on a physical server, its CPU usage isn't capped at 25% of all CPU cycles: it can be allowed to use more than its proportional share (unlike memory, which does have hard limits).

Where can you see CPU Steal Time?

When you run the Linux top command, you'll see a realtime view of key performance metrics. One of the lines is for the CPU:

[Screenshot: output of top, showing the CPU line]

Two metrics you might have some experience with already are %id (percent idle) and %wa (percent I/O wait). If %id is low, the CPU is working hard and doesn't have much excess capacity. If %wa is high, the CPU is ready to run, but is waiting on I/O access to complete (like fetching rows from a database table stored on the disk).

%st, or percent steal time, is the last CPU metric displayed.
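
On Linux, top derives these percentages from the cumulative tick counters on the first line of /proc/stat, where steal is the eighth value. Here is a minimal Python sketch of the same calculation, assuming a kernel that exposes the steal field. The function names are illustrative and not part of Scout or any other tool mentioned in this post:

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Guest time is already counted inside "user", so it is left out of the total.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  1 2 3 ...' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Share of the CPU ticks elapsed between two samples that were stolen."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * deltas["steal"] / total


def read_cpu_sample(path="/proc/stat"):
    """Read one sample from a live Linux system."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

To measure a live machine, call read_cpu_sample(), wait a few seconds, call it again, and pass both samples to steal_percent(). The counters are cumulative since boot, so only the difference between two samples is meaningful.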

CPU Steal Time - the ticket booth analogy

You've purchased tickets to the latest Hollywood blockbuster. There are two lines and one ticket booth:

[Illustration: two lines of moviegoers feeding a single ticket booth]

If we applied a CPU steal time-like metric to the ticketing process, it would look like this:

  • 0% Steal Time - It's a Wednesday matinee: the ticket booth is picking a moviegoer from line 1, then line 2, then line 1, then line 2, and so on. No one is waiting.
  • 50% Steal Time - It's a Friday night: instead of being able to purchase a ticket immediately, half of the time a person in line needs to wait for the person at the booth to complete their purchase. Things are taking longer.
  • 100% Steal Time - It's a Friday night and the cash register is broken: no one is moving.

Why is high steal time particularly bad for web apps?

If you have a long-running background computational task on an underutilized physical server, it may get access to more than its share of CPU cycles for a while. Later on, the other VMs need their share of CPU cycles, so the long-running task will run slower. This might not be a deal-breaker for a long-running task: it might take a bit longer, or it might even finish faster (since it was able to use more resources earlier).

However, for web apps, this can bring things to a halt. For tasks that need to be performed in real time, like rapidly serving many web requests, a sudden slowdown (say, a 4x decrease in performance) can cause major backups in request queues, which can lead to outages.
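
A rough back-of-the-envelope model shows how quickly a queue can back up. The rates in the usage note below are made-up numbers, and the sketch assumes a constant arrival rate and ignores timeouts and retries:

```python
def backlog_after(seconds, arrivals_per_sec, capacity_per_sec, slowdown=1.0):
    """Requests left waiting after `seconds` at a constant arrival rate.

    slowdown is the factor by which steal time has cut the VM's throughput
    (4.0 means the VM serves a quarter of what it normally could).
    """
    effective_capacity = capacity_per_sec / slowdown
    # The queue only grows when requests arrive faster than they are served.
    growth_per_sec = max(0.0, arrivals_per_sec - effective_capacity)
    return growth_per_sec * seconds
```

For a hypothetical server that normally handles 200 requests per second and receives 100, backlog_after(60, 100, 200) is 0.0. With a 4x slowdown, backlog_after(60, 100, 200, slowdown=4) is 3000.0: three thousand requests are queued after a single minute.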

What if steal time is well above zero?

There are two possible causes:

  1. You need a larger VM with more CPU resources (you are the problem).
  2. The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).

The catch: you can't tell which case applies by watching only the impacted instance's CPU metrics. It is easiest to tell when you run multiple identical servers in the same role, each residing on a different host:

[Diagram: steal time scenarios across multiple servers]

  • Has %st (CPU Steal Time Percentage) increased on every virtual server? This means your virtual machines are using more CPU. You need to increase the CPU resources for your VMs.
  • Has %st (CPU Steal Time Percentage) increased dramatically on only a subset of servers? This means the physical servers may be oversold. Move the VM to another physical server.
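
The two checks above can be sketched as a small function. The 10-point threshold, the host names, and the result labels are illustrative assumptions, not values from any monitoring tool:

```python
def diagnose_fleet(steal_increase_by_vm, threshold=10.0):
    """Classify a fleet of identical VMs by how many show a rise in %st.

    steal_increase_by_vm maps a VM name to its rise in %st, in percentage
    points. threshold is a hypothetical cutoff for what counts as a rise.
    Returns a (verdict, affected_vms) tuple.
    """
    if not steal_increase_by_vm:
        raise ValueError("need at least one VM")
    affected = sorted(
        name for name, rise in steal_increase_by_vm.items() if rise >= threshold
    )
    if not affected:
        return "ok", affected
    if len(affected) == len(steal_increase_by_vm):
        # Every VM is affected: the workload has likely outgrown the VM size.
        return "undersized", affected
    # Only some VMs are affected: their physical hosts may be oversold.
    return "oversold-host", affected
```

For example, diagnose_fleet({"web1": 1.0, "web2": 25.0, "web3": 0.5}) returns ("oversold-host", ["web2"]), which points at moving web2 to another physical server. The comparison is only meaningful when the VMs are the same size and carry similar load.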

So, when should you be worried?

A general rule of thumb: if steal time stays above 10% for 20 minutes, the VM is likely running slower than it should.
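
This rule of thumb can be expressed as a check over a series of steal time samples. A minimal sketch, assuming you already collect (timestamp, %st) pairs at a regular interval; the function name and the sample format are hypothetical:

```python
def sustained_high_steal(samples, threshold=10.0, window_seconds=20 * 60):
    """Return True if steal time has been above threshold for the whole window.

    samples is a list of (timestamp_in_seconds, steal_percent) tuples,
    oldest first. Any sample at or below the threshold resets the clock.
    """
    run_start = None
    for timestamp, steal in samples:
        if steal > threshold:
            if run_start is None:
                run_start = timestamp
        else:
            run_start = None
    if run_start is None:
        return False
    # The current high-steal run ends at the most recent sample.
    return samples[-1][0] - run_start >= window_seconds
```

Requiring an unbroken run, rather than averaging over the window, keeps a short spike from triggering the check on its own.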

When this happens:

  1. Shut down the instance and move it to another physical server.
  2. If steal time remains high, increase the CPU resources.
  3. If steal time remains high, contact your hosting provider. Your host may be overselling physical servers.

Monitoring steal time with Scout

Scout's CPU Usage Plugin reports key CPU metrics, including steal time. You can create a trigger to alert you of spikes in steal time.

[Screenshot: a Scout trigger alerting on steal time]

TL;DR

In a virtual environment, CPU cycles are shared across virtual machines on the server. If your VM displays a high %st in top (steal time), this means CPU cycles are being taken away from your VM to serve other purposes. You may be using more than your share of CPU resources or the physical server may be over-sold. Move the VM to another physical server. If steal time remains high, try giving the VM more CPU resources.
