Your high-powered server is suddenly running dog slow, and you need to remember the troubleshooting steps again. Bookmark this page for a ready reminder the next time you need to diagnose a slow server.
Get on "top" of it
Linux's top command provides a wealth of troubleshooting information, but you have to know what you're looking for. Reference this diagram as you go through the steps below:
Step 1: Check I/O wait and CPU Idletime
How: use top - look for "wa" (I/O wait) and "id" (CPU idletime)
Why: checking I/O wait is the best initial step to narrow down the root cause of server slowness. If I/O wait is low, you can rule out disk access in your diagnosis.
I/O wait represents the amount of time the CPU spends waiting for disk or network I/O. Waiting is the key here - if your CPU is waiting, it's not doing useful work. It's like a chef who can't serve a meal until he gets a delivery of ingredients. Anything above 10% I/O wait should be considered high.
On the other hand, CPU idle time is a metric you WANT to be high -- the higher this is, the more headroom your server has to handle whatever else you throw at it. If your idle time is consistently above 25%, consider it "high enough".
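If you'd rather script this check than eyeball it, here is a minimal Python sketch. It assumes you have captured the CPU summary line from top (for example via top's batch mode) and that the line uses the usual "us / id / wa" labels; the function names and the sample line are illustrative, not part of any tool. The thresholds are the 10% and 25% figures from the text above.

```python
import re

# Thresholds taken from the text: I/O wait above 10% is high,
# idle time consistently above 25% is "high enough".
IOWAIT_HIGH = 10.0
IDLE_HIGH_ENOUGH = 25.0

# Matches both "14.0%wa" (older top) and "14.0 wa" (newer top).
_FIELD = re.compile(r"(\d+(?:\.\d+)?)\s*%?\s*(us|sy|ni|id|wa|hi|si|st)\b")


def parse_cpu_line(line):
    """Return a dict such as {'us': 5.0, 'id': 80.0, 'wa': 14.0, ...}."""
    return {name: float(value) for value, name in _FIELD.findall(line)}


def classify(cpu):
    """Apply the Step 1 thresholds to a parsed CPU summary."""
    return {
        "iowait_high": cpu["wa"] > IOWAIT_HIGH,
        "idle_high": cpu["id"] > IDLE_HIGH_ENOUGH,
    }


# Hypothetical sample line, in the older top layout.
sample = "Cpu(s):  5.0%us,  1.0%sy,  0.0%ni, 80.0%id, 14.0%wa,  0.0%hi,  0.0%si,  0.0%st"
print(classify(parse_cpu_line(sample)))
```

A single sample can mislead, since the text talks about idle time being "consistently" high; in practice you would look at several readings before drawing a conclusion.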
Step 2: IO Wait is low and idle time is low: check CPU user time
How: use top again -- look for the %us column (first column), then look for the process or processes doing the damage.
Why: at this point you expect the user time percentage to be high -- there's most likely a program or service you've configured on your server that's hogging CPU. Checking the % user time just confirms this. When you see that it is high, it's time to find out which executable is monopolizing the CPU.
Once you've confirmed that the % user time is high, check the process list (also provided by top). By default, top sorts the process list by %CPU, so you can just look at the top process or processes.
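top gives you this ranking interactively. If you want the same view in a script, one possible approach is to rank the rows of a process listing by CPU share. The sketch below assumes three-column text in the shape produced by something like ps -eo pid,pcpu,comm; the function name and the sample data are made up for illustration.

```python
def top_cpu_processes(ps_output, limit=3):
    """Return up to `limit` (pid, cpu_percent, command) tuples, busiest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, command = line.split(None, 2)
        rows.append((float(pcpu), int(pid), command))
    rows.sort(reverse=True)  # highest CPU share first
    return [(pid, pcpu, command) for pcpu, pid, command in rows[:limit]]


# Hypothetical sample listing.
SAMPLE = """\
  PID %CPU COMMAND
    1  0.0 systemd
 4312 97.5 ruby
  871  1.2 mysqld
"""

for pid, pcpu, command in top_cpu_processes(SAMPLE, limit=2):
    print(pid, pcpu, command)
```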
If there's a single process hogging the CPU in a way that seems abnormal, it's an anomalous situation that a service restart can fix. If there are multiple processes taking up CPU resources, or if there's one process that takes lots of resources while otherwise functioning normally, then your setup may just be underpowered. You'll need to upgrade your server (add more cores), or split services out onto other boxes. In either case, you have a resolution:
- if situation seems anomalous: kill the offending processes.
- if situation seems typical given history: upgrade server or add more servers.
This is an area where historical context can be a huge help in understanding what's going on. If you're using Scout, check out the historical charts for these metrics. A flat line for % user time followed by a huge increase in the last 10 minutes tells a much different story than a smooth, steady increase over the last 6 months.
Step 3: IO wait is low and idle time is high
Your slowness isn't due to CPU or IO problems, so it's likely an application-specific issue. It's also possible that the slowness is being caused by another server in your cluster, or by an external service you rely on.
- start by checking important applications for uncharacteristic slowness (the DB is a good place to start),
- think through which parts of your infrastructure could be slowed down externally. For example, do you use an externally hosted email service that could slow down critical parts of your application?
If you suspect another server in your cluster, strace and lsof can provide information on what the process is doing or waiting on. strace will show you which file descriptors are being read from or written to (or which reads are being attempted), and lsof can give you a mapping of those file descriptors to network connections.
Step 4: IO Wait is high: check your swap usage
How: use top or free -m.
Why: if your box is swapping out to disk a lot, the cache swaps will monopolize the disk and processes with legitimate IO needs will be starved for disk access. In other words, checking disk swap separates "real" IO wait problems from what are actually RAM problems that "look like" IO Wait problems.
An alternative to top is free -m -- this is useful if you find top's frequent in-place updates frustrating, since they leave no console log of changes.
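As a rough illustration of reading swap usage from free -m, the Python sketch below pulls the Swap: row out of captured output and reports how much of the swap space is in use. The sample output and the function name are hypothetical, and the text gives no numeric cutoff for "high" swap, so the sketch only reports a percentage and leaves the judgment to you.

```python
def swap_used_percent(free_output):
    """Return swap usage as a percentage of total swap, or 0.0 if there is no swap."""
    for line in free_output.splitlines():
        if line.startswith("Swap:"):
            total, used = (int(field) for field in line.split()[1:3])
            return 100.0 * used / total if total else 0.0
    raise ValueError("no 'Swap:' row found in free output")


# Hypothetical sample of `free -m` output (values in megabytes).
SAMPLE_FREE = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7702        281          0        310       5208
Swap:         2000        500       1500
"""

print(swap_used_percent(SAMPLE_FREE))
```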
Step 5: swap usage is high
High swap usage means that you are actually out of RAM. See step 7 below.
Step 6: swap usage is low
Low swap means you have a "real" IO wait problem. The next step is to see what's hogging your IO.
iotop is an awesome tool for identifying io offenders. Two things to note:
- unless you've already installed iotop, it's probably not already on your system. Recommendation: install it before you need it -- it's no fun trying to install a troubleshooting tool on an overloaded machine.
- iotop requires a Linux kernel of 2.6.20 or above
Step 7: Check memory usage
How: use top. Once top is running, press the M key - this will sort applications by the memory used.
Important: don't look at the "free" memory -- it's misleading. To get the memory actually in use, subtract the "cached" memory from the "used" memory; whatever is left over from the total is effectively available. This is because Linux caches things liberally, and often the memory can be freed up when it's needed. Read here (http://blog.scoutapp.com/articles/2010/10/06/determining-free-memory-on-linux) for more info.
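To make the arithmetic concrete, here is a small Python sketch that applies this adjustment to the Mem: row of free -m. It assumes the older column layout with separate "buffers" and "cached" columns (newer versions of free report these differently), and it treats buffers the same way as cache. The function names and sample values are illustrative only.

```python
def parse_mem(free_output):
    """Map the header names of `free -m` onto the values in the Mem: row."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    mem_row = next(line for line in lines if line.startswith("Mem:"))
    return dict(zip(header, (int(v) for v in mem_row.split()[1:])))


def adjust_for_cache(mem):
    """Treat buffers and cache as reclaimable rather than as used memory."""
    cache = mem.get("buffers", 0) + mem.get("cached", 0)
    return {
        "really_used": mem["used"] - cache,
        "really_free": mem["free"] + cache,
    }


# Hypothetical sample of `free -m` output (values in megabytes).
SAMPLE_MEM = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7702        281          0        310       5208
Swap:         2000        500       1500
"""

print(adjust_for_cache(parse_mem(SAMPLE_MEM)))
```

In this made-up sample the box looks nearly full at first glance (281 MB free), yet most of the "used" figure is cache that the kernel can hand back when applications need it.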
Once you've identified the offenders, the resolution will again depend on whether their memory usage seems business-as-usual or not. For example, a memory leak can be satisfactorily addressed by a one-time or periodic restart of the process.
- if memory usage seems anomalous: kill the offending processes.
- if memory usage seems business-as-usual: add RAM to the server, or split high-memory services out to other servers.
A handy flow chart to tie it all together
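The decision flow from the steps above can also be written down as a few lines of Python. This is only a sketch of the article's logic: the 10% and 25% thresholds come from Step 1, while "swap is high" is passed in as a yes/no judgment because the text does not give a number for it. The function name and return strings are invented for illustration.

```python
IOWAIT_HIGH = 10.0  # percent, from Step 1
IDLE_HIGH = 25.0    # percent, from Step 1


def diagnose(iowait, idle, swap_is_high):
    """Walk the troubleshooting steps and return the suggested next move."""
    if iowait > IOWAIT_HIGH:
        # Step 4: high I/O wait, so separate RAM problems from real I/O problems.
        if swap_is_high:
            return "out of RAM: check memory usage (Step 7)"
        return "real I/O problem: find the offender with iotop (Step 6)"
    if idle > IDLE_HIGH:
        # Step 3: neither CPU nor I/O is the bottleneck.
        return "application-specific or external issue (Step 3)"
    # Step 2: low I/O wait and low idle time.
    return "CPU-bound: check user time and the process list (Step 2)"


print(diagnose(iowait=2.0, idle=5.0, swap_is_high=False))
```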
- vmstat is also a very handy tool, because it shows past values instead of an in-place update like top. Running vmstat 1 shows concise metrics on memory, swap, io, and CPU every second.
- Track your disk IO latency and compare to IOPS (I/O operations per second). Sometimes it's not activity in your own server causing the disk IO to be slow in a cloud/virtual environment. Proving this is hard, and you really want to have graphs of historical performance to show your provider!
- Increasing IO latency can mean a failing disk or bad sectors. Keep an eye on this before it escalates to data corruption or complete failure of the disk.
- If you're a visual person, Scout's dashboards can help - your data will look like this:
Wrapping it up
Having concrete steps at your fingertips makes slow server troubleshooting a little easier. Top is a powerful tool that provides a wealth of metrics to help you narrow down the cause of server slowness. The metrics you'll be looking at are io wait, cpu idle %, user %, memory free (taking into account the file cache), and swap usage. Depending on whether conditions are a one-off or the result of growing demands on your infrastructure, you may be able to solve the slowdown by restarting services, or you may need to upgrade your servers. Historical context via Scout or a similar tool can be very useful in establishing what's normal for your machines.
We heard you loud and clear: the new dashboards need multiple metrics on the same chart. Starting today, you can add as many metrics as you want to an individual chart, just by dragging metrics from the sidebar.
To remove metrics, just go into chart settings:
Finally, we consolidated the chart settings & got rid of some of the visual noise.
What do you think?
The new UI is designed for:
- smooth ad-hoc exploration of your data
- effortless juxtaposition of different metrics
- clutter-free dashboards you'll want to show off on a dedicated display
How's it working for you? Comments welcome at email@example.com.
Thanks for your feedback since the launch of our new dashboards UI last week! We're taking your suggestions to heart - here's what we've been working on.
You can now stack metrics just like before. Choose the "stacked" display option in the chart settings.
You'll also see a total across metrics on the chart when hovering over the chart.
Name your charts
We give your charts a default name, but sometimes you have a more awesome name. Now you can rename your charts at will.
More metrics in the tooltip
We've increased the number of visible metrics in the tooltip to 15. Don't forget you can scroll through the metrics in the tooltip if you have more. Also, tooltip metrics are sorted so you'll see the highest values first.
Coming: different metrics on the same chart
We launched with the ability to create charts with a single metric. The top request we're hearing is the ability to place multiple metrics on the same chart (ex: CPU I/O Wait, CPU User %, CPU System).
We're working on this! We'll update when this is available.
We'll keep making dashboards more awesome
We won't rest until you start taking Instagram photos of your favorite Scout dashboard moments.
Send your feedback/suggestions/bug reports to firstname.lastname@example.org.
Dashboards and charts are the swiss army knife of monitoring tools. In fact, dashboards and charts account for roughly half of all page views in our UI.
A couple of months back, we decided it was time to revamp our dashboards and charts experience in Scout. Like a swiss army knife, our goal was to strike the balance between utility and ease-of-use. Today, we're excited to roll out the new experience to everyone.
The new UI delivers a buttery-smooth experience while providing what you need to monitor your growing server footprint.
Unifying dashboards and charts
Dashboards and charts used to be two separate areas of Scout: now, two have become one. It's never been easier to get all of your key metrics onto a single page.
The sidebar lists all of your metrics. As your Vim-loving colleague would tell you, the fastest way to work is via your keyboard. It's the same with dashboards: to filter metrics, just start typing. Use your up/down keys to page through metrics and hit enter to create a chart.
When you add a chart, you'll see that metric across all of your servers. Say goodbye to clicking check boxes!
What if you want to filter metrics (example: memory usage on just your application servers)? Easy stuff. Use the global filters at the top of the page to apply that filter to every chart on the page:
...or just use the chart-specific settings to apply it to just this chart:
Drag and Drop + Resize
You can drag-and-drop and resize charts at will to get your dashboard exactly how you want it.
Lightweight: embrace ad-hoc
Frequently, we'll whip up a dashboard for something we want to inspect. Dashboards don't get in the way: you don't need to name dashboards or charts just to view or share a dashboard with a colleague.
As you add/remove servers or plugins, charts will auto-update with the changes. Filters are applied via environment names, role names, and server names - not specific IDs - so they stay current as your infrastructure grows.
Sharing is caring
Want to share a dashboard with a friend who doesn't have a Scout account? No worries - just click the "Share" link:
Less noise w/range
When you have many servers, charts get very noisy with lots of lines. Our range display solves this problem by displaying a line for the average and a band for the min/max across all of your servers.
You can always toggle to the breakout display to view a line for each server.
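To illustrate the idea behind the range display (not Scout's actual implementation, which isn't described here), the Python sketch below collapses one series per server into a single min/average/max triple per point in time. It assumes every server reports a value at the same timestamps; the function name and the sample data are hypothetical.

```python
def range_band(series_by_server):
    """Collapse per-server series into one (min, avg, max) triple per time point."""
    columns = zip(*series_by_server.values())  # one tuple of readings per time point
    return [(min(col), sum(col) / len(col), max(col)) for col in columns]


# Hypothetical CPU readings for two servers at two points in time.
cpu_by_server = {
    "web1": [10, 20],
    "web2": [30, 40],
}

for low, avg, high in range_band(cpu_by_server):
    print(low, avg, high)
```

The average becomes the line and the min/max pair becomes the band, so the chart stays readable no matter how many servers feed into it.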
The charts will resize dynamically as you adjust the width of your browser window.
When inspecting a chart, you'll see a vertical line on the other charts over that same point in time. With this, correlations become clear.
Easy date navigation
Just like before, our flexible date parser makes it easy to select the end time for a chart. You can select between different durations and move forward+back in time.
What about your existing charts and dashboards?
You'll be able to access these under the "Legacy Charts" navigation area in Scout. We'll be deprecating support for old dashboards and charts in the next several months. Rest assured we'll communicate this timeline as it develops.
We've been using the new UI internally for a while and we can't imagine going back. Change is hard, but we think this will be a transition you'll be excited about.
As always, send your thoughts and bug reports to email@example.com.
Dashboards have exited beta - see our launch post for the details.
Thanks for your feedback on the first preview of our new charts UI. You spoke, we listened, we coded:
- Resize + drag-and-drop your charts. Total control over how you view your key metrics.
- Chart-specific settings to toggle either a range display or breakout display of your metrics. View min/avg/max of every metric.
- A number of subtle UI enhancements for applied chart filters.
The UI formerly known as Charts Beta is now Dashboards Beta
In our initial release of the UI, we referred to it as "Charts BETA". However, rather than just replacing charts, the new UI encompasses both charts+dashboard functionality. Going forward, we'll be referring to this as "Dashboards BETA".
URL change for Dashboards
We've changed the URLs for old-style dashboards in Scout. If you have a dashboard loaded on an external display, the URL now looks like this:
That's an "old_" prefix before dashboards.
Your existing charts and dashboards will continue to function.
Will legacy charts and dashboards be supported?
We'll support existing charts + dashboards for the time being, but we do plan on deprecating the legacy functionality sometime in the future. Rest assured we'll communicate this timeline as it develops.
We've been using the new UI internally for a while and we can't imagine going back. We think this will be a transition you'll be excited about.
Feedback / Letters to the Editor / etc.
As always, your feedback drives Scout. Send your feedback to firstname.lastname@example.org.
Dashboards have exited beta - see our launch post for the details.
We're rebuilding the charts experience in Scout from the ground-up. Want a preview of where we're heading? Click the "Charts Beta" link in your nav bar.
- There's no faster way to view your metrics. Start typing to filter metrics in the sidebar, key up/down, hit enter. You'll see that metric across all of your servers.
- Readable by default: with lots of servers, charts can become a mess. With our new charts, you'll see the range of min/avg/max values across all of your servers. Simply click a chart for details on individual servers.
- Navigate forward/back in time with the arrow buttons at the top of the page.
- One metric per-chart - multiple y-axes are confusing.
- Filter metrics by environment, role, and server name (including regex support).
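As a rough picture of what a regex filter on server names does, here is a Python sketch. How Scout itself matches patterns (anchoring, case sensitivity, regex flavor) isn't specified in this post, so the behavior below is an assumption; the function name and the server names are hypothetical.

```python
import re


def filter_servers(names, pattern):
    """Keep the server names that match the given regular expression."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical server names.
servers = ["app1.prod", "app2.prod", "db1.prod", "app1.staging"]

# Only the production application servers.
print(filter_servers(servers, r"^app\d+\.prod$"))
```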
We'll be iterating quickly on this. Send your feedback to email@example.com.
All of Scout's infrastructure has been checked, and we have confirmed that Scout is not vulnerable to CVE-2014-0160.
OpsGenie, the IT alerts company, now integrates with Scout. Email, SMS, phone calls, and mobile push - OpsGenie lets your team receive alerts exactly how and when they want.
OpsGenie has a full integration guide on their website. Here's the gist:
1. In the OpsGenie UI, click to add a service integration with Scout.
You'll see a webhook URL. Copy that to your clipboard.
2. Create a Webhook in the Scout notification area using the URL provided by OpsGenie above.
3. Add the OpsGenie webhook to a notification group.
Boom! You are ready to go with OpsGenie.
If you are using the latest Scout agent, you'll notice a beautiful new view of your server metrics:
The new server view combines several moving parts to deliver an at-a-glance, auto-refreshing view into your server with minimal effort.
Key Metrics at the top
CPU Usage, Memory Usage, Disk I/O, and Network Activity are the four key metrics for server performance. The new server view puts these at the top of the page, no scrolling required.
We're using d3 for all of the charts. We love the control (and the crisp, pixel-perfect charts). You'll also notice smooth transitions when your page refreshes every minute.
You don't need to install any plugins to collect these metrics, and Scout auto-detects new devices.
Key Metric Details
Click on any of the overview charts for details.
The new agent automatically collects process metrics. You'll see these processes listed on the page.
Want more details on a process? Make a process a key process. You'll be able to view time series charts of any key process and add triggers on process metrics.
You can toggle between viewing processes and your custom-installed plugins. We're using Isotope to provide a dense display of data - in many cases, the vertical screen height will be half the height of the old server view.
You'll also see a sparkline bar chart for every metric a plugin reports. This makes it easy to identify trends across all of your server metrics.
You can change the end time and the duration (from 30 minutes to 1 day) on the server view to easily view past data or longer-range trends.
Once you run gem install scout on your server, you'll automatically be upgraded to the new UI. No data is lost when you upgrade - all existing plugins continue to report.
We'll be taking major strides forward on the Scout experience in 2014. Next on our list? Charts and dashboards. Follow us on Twitter for sneak peeks.
Interested in the tech behind Scout Realtime, our open-source tool for realtime server metrics?
Boy, are you in luck! I'll be diving into the guts of Scout Realtime @ the Boulder Ruby Group this evening. The fun starts at 6pm with a beginner's track.