Last week, one of our application servers died. We have four app servers, so in theory, the death of one app server shouldn't bring the entire platoon down. However, real life had other plans: 95% of requests were handled fine, but around 5% were being dropped. Here's the story of how we diagnosed and fixed the issue with our realtime charts.
I’m sitting in the Denver Airport – in a couple of minutes, I’ll board the plane to RailsConf in Portland, Oregon. I’m already getting amped for Voodoo Doughnut, Stumptown Coffee, well-trimmed beards, and of course, lots of Rails-related chats.
I’m bringing a fresh load of Scout T-Shirts. These aren’t your normal heavy-weight, poor-fitting shirts. They are tastefully designed, American Apparel – Tri-Blend (otherwise known as the most comfortable shirt you’ll own). If you’re attending RailsConf, shoot us an email so we can meet up and improve your wardrobe at the same time. Or, just look for us (Andre and Derek). We don’t always rest our arms on each other, but when we do, we look like this:
It's been three weeks since the launch of the largest feature enhancement in Scout's existence: roles. Haven't heard of roles? Nutshell: roles let you monitor many servers with fewer clicks and more joy. Roles were driven by your feedback and it's showing in the fast adoption numbers below.
Time to give an awkward nerd high-five of thanks:
- Customers on our Roles BETA program - your feedback and willingness to try new things helped us iron out the edges for the public rollout.
- Contributors to our Chef recipe - we've already had six authors commit to our Chef recipe for deploying Scout. It's great to see a hardened Chef recipe based on real-world usage.
- Feedback since the launch - we built roles because of your feedback, and we've enjoyed reading your suggestions post-launch.
Haven't tried roles yet? To get started, see the "Roles" dropdown on your account, and read the FAQ on roles.
Roles are a new feature available immediately for all new and existing accounts.
You have a carefully thought out architecture. You frequently add new servers as your business grows. In fact, scaling up is part of business as usual. Monitoring should scale easily with you -- that's why we're introducing roles.
Roles make it easy to set up plugins and triggers across many servers. Instead of individually configuring servers, configure roles. Then, apply roles to your servers through our UI, the command line, or your configuration management tool.
Some examples of how roles will make your life easier:
- Updating a trigger on 50 app servers
- Adding a Memory Usage plugin on 100 memcached servers
- Updating a plugin to a new version across all 10 MongoDB servers
Roles are available now on your account. Look for the new "Roles" item on the top navigation. If you previously had servers organized by groups, your groups have been upgraded to roles. See documentation here on creating roles and organizing your existing plugins into roles.
Your account now has a single, account-wide key -- use it for any new servers you add. Your existing keys will continue to work, so you don't have to touch any servers you're currently monitoring.
Most setups have a limited number of server configurations (app, db, utility, for example), and several servers of each configuration. When you add another app server, it probably needs the same monitoring template as your existing app servers. Adding more servers using existing templates is the scenario we wanted to make dead simple in Scout.
There's no need to stick to one template at a time: servers can have any number of roles in Scout, so feel free to mix and combine roles as needed to reflect the functionality of your servers. Is one of your HAProxy boxes also running memcached? No need to create a brand new role, just apply two of the roles you already have.
Once defined, roles are "active": if you update a role (say by adding a plugin or a trigger), all the servers in that role are automatically updated to reflect the changes. It's a much easier way to manage your monitoring configuration.
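To make the "active" behavior concrete, here is a minimal Python sketch of the idea. It is not Scout's implementation: the `Role` and `Server` classes and the plugin names "HAProxy Stats" and "Memcached Stats" are hypothetical, made up for illustration. The point is only that a server refers to its roles rather than copying their contents, so a change to a role shows up on every server that has it.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """A named bundle of plugins and triggers (hypothetical model)."""
    name: str
    plugins: set[str] = field(default_factory=set)
    triggers: set[str] = field(default_factory=set)


@dataclass
class Server:
    """A server holds references to roles, not copies of their contents."""
    hostname: str
    roles: list[Role] = field(default_factory=list)

    def effective_plugins(self) -> set[str]:
        # Union of the plugins from every role applied to this server.
        return set().union(*(role.plugins for role in self.roles))


# Plugin names here are placeholders, not entries from the plugin directory.
haproxy = Role("haproxy", plugins={"HAProxy Stats"})
memcached = Role("memcached", plugins={"Memcached Stats"})

# One box doing double duty simply gets both roles.
lb1 = Server("lb1", roles=[haproxy, memcached])
cache1 = Server("cache1", roles=[memcached])

# Updating the role once changes every server that has it.
memcached.plugins.add("Memory Usage")
```

After the last line, both `lb1.effective_plugins()` and `cache1.effective_plugins()` include "Memory Usage", even though neither server was touched directly.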
Best friends with Chef
We provide an official Chef Recipe designed to work with roles.
To simplify deployment, Scout now provides a single, account-wide key you can use on all your servers.
Even if you're not using Chef, you can (optionally) specify roles directly through the Scout executable:
scout -rdb,app to assign the db and app roles, for example. This makes role assignments highly script-able, whether you're using Chef, Puppet, or Moonshine.
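If you script your deployments, a small helper along these lines could generate the command for each server. This is a sketch under assumptions: the `-r<roles>` form comes from the example above, but passing the account key as the last argument, the once-a-minute cron schedule, and the allowed characters in a role name are all guesses made for illustration. The function names are hypothetical too.

```python
import re

# Assumption: role names are simple identifiers. Adjust if yours differ.
ROLE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def scout_command(roles: list[str], account_key: str) -> str:
    """Return a scout invocation that assigns the given roles.

    The -r<roles> form is taken from the example in the text; placing the
    account key as the final argument is an assumption of this sketch.
    """
    for role in roles:
        if not ROLE_NAME.fullmatch(role):
            raise ValueError(f"unexpected role name: {role!r}")
    if not roles:
        return f"scout {account_key}"
    return f"scout -r{','.join(roles)} {account_key}"


def crontab_line(roles: list[str], account_key: str) -> str:
    # The once-a-minute schedule is an assumption; keep whatever schedule
    # your existing crontab entry already uses.
    return f"* * * * * {scout_command(roles, account_key)}"


print(crontab_line(["db", "app"], "YOUR_ACCOUNT_KEY"))
```

Validating role names before building the string keeps a typo or stray space from ending up in a crontab on every server.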
Fine-tuned for large environments
With the recent notification group changes and now roles, we're making monitoring easier for large environments. Our previous tools -- cloud keys and plugin copy-paste -- were useful, but it was easy for things to get out of sync. Roles are our answer for keeping monitoring in sync in large environments.
Roles in Summary
With Roles, we want to make deploying and scaling Scout in large environments as easy as possible:
- Roles are "active": updating a role's triggers or plugins propagates to all the role's servers.
- Better than templates: Servers can belong to more than one role.
- Account-wide keys: no need to provision keys for new servers - reuse the same 40-character account key in the crontab across all your servers
- Specify roles via the crontab: optionally, you can pass a command-line argument to the scout agent to specify the roles it should belong to.
- Friendly with Chef: we also provide an official Chef recipe for roles-enabled server configuration.
To get started, see the "Roles" dropdown on your account, and read the FAQ on roles here.
Whenever we’re asked how to make on-call notification schedules for Scout alerts, we recommend PagerDuty. PagerDuty has invested a ton of time in building a dedicated notification scheduling service, and it’s a great complement to Scout.
With our recent release of notification groups, Scout’s integration with PagerDuty got even more powerful:
- Multiple PagerDuty services: add as many PagerDuty services to Scout as necessary.
- Trigger-specific escalation policies: assign any PagerDuty escalation policy to any threshold in Scout. If you need to create multiple thresholds on a given metric with different escalation policies, it’s simple to do – just add another trigger.
- Automatic incident resolution from Scout: since all integrations are routed through PagerDuty’s API, Scout now auto-resolves any PagerDuty incidents when Scout’s trigger stops firing.
Multiple services in PagerDuty:
... and those same services integrated into Scout:
Adding PagerDuty services within Scout
You need to start in Scout to create a PagerDuty integration:
- Click on Notifications (in the top navigation bar),
- Click on “Add PagerDuty Integration.”
- You’ll be given the option to create a new PagerDuty service, or connect to an existing service within your account.
To assign a PagerDuty service to a trigger, ensure the PagerDuty integration is part of a notification group (the notification group can contain other items too, if needed), then assign that notification group to a trigger.
With the Sidekiq Monitor Plugin (by Scott Klein of StatusPage.io) and the Puppet Last Run Plugin (by Didip Kerabat of Kongregate), Scout’s plugin directory count has now passed 70 plugins!
The Sidekiq Monitor Plugin monitors key metrics for Sidekiq, a Ruby message processing library. Didip’s Puppet Last Run Plugin tracks key metrics for the most recent Puppet run on a monitored server.
Have a useful plugin sitting around? Share it! Send a pull request to our scout-plugins repository on GitHub.
It’s been a month since I started attaching torture devices disguised as boots to my feet, long wooden sticks to each torture device, and tumbling down mountains. Skiing has changed my outlook on winter. It’s a season to enjoy, not a time where I gaze wistfully out the window, hoping the short, cold days pass by as quickly as possible.
However, there’s a problem when skiing becomes a favorite hobby: not every day is a great day on the mountain. If it hasn’t snowed in a while, the surface is hard. The temperature might be in the single digits and the wind may be gusting 50 MPH+. It might dump snow in the backcountry, but the avalanche conditions may make it unsafe.
There’s something special about being able to sneak away when the conditions are the best, even if it’s during the work week. It feels a bit like being a kid again (correction: a kid with a receding hairline). It’s a fun reminder that it’s not always bad to feel redundant.
We recently decided it was time for a major update to the public side of Scout. We’d start with a more polished homepage. Since we’re both developers, the obvious next step seemed like hiring a designer. However, working with an outside designer isn’t a hire-and-forget experience:
- Good designers are difficult to find. Design doesn’t scale like a product business.
- Good designers are busy. It could take 30-60 days to start work, then another 30 days for it to come together. This means we could be looking at a 90-day timeline. We wanted to launch faster.
Instead of starting work with a designer on a blank slate, we decided to start firming up what we wanted the homepage to look like. We’d end up with one of the following outcomes:
- We’re terrible at design, but we’ve at least thought it through. Hire a designer.
- We can get 80% of the way there, but we’ll need a designer for touchups.
- If we iterate enough, we can launch something we’ll be happy with.
Startup Lessons from CCP - EVE Online Style:
Another great example was the formation of “Team Best Friends Forever” (“BFF” is an inside EVE joke). This team is a group of CCP developers whose sole mission is not to work on major features and improvements, but rather to fix all those annoying “little things” that bother their customers. Too many times, product managers and development teams are focused on the big-ticket items – and that’s fine, but TBFF is a great approach that again proves that CCP listens to their customers.
On Jan 15th, all Scout accounts will be switched over to notification groups. Notification groups are designed to make notification management easier and more flexible:
- instead of managing notifications per-plugin/per-user, you will assign users to notification groups, and apply notification groups to triggers.
- you can have multiple PagerDuty integrations and webhook endpoints and assign those to notification groups as well. PagerDuty and webhooks are now first-class notification channels; they are no longer all-or-nothing settings.
- you get much finer control over threshold-to-notification channel mappings. You can configure (for example) three triggers on one metric, with the lowest trigger sending you email, the middle trigger sending you SMS, and the high trigger alerting PagerDuty.
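The three-trigger example in the last bullet can be sketched in a few lines of Python. Everything here is hypothetical: the group names, channel strings, and threshold values are invented, and the sketch assumes each trigger fires independently, so a value above the highest threshold notifies all three groups. It models the mapping only and is not how Scout evaluates triggers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """A threshold on one metric, tied to a notification group."""
    threshold: float
    group: str  # name of a notification group


# Hypothetical notification groups, each mapped to its channels.
GROUPS = {
    "low": ["email:ops@example.com"],
    "medium": ["sms:on-call"],
    "high": ["pagerduty:primary"],
}

# Three triggers on a single metric, lowest to highest.
TRIGGERS = [
    Trigger(threshold=70.0, group="low"),
    Trigger(threshold=85.0, group="medium"),
    Trigger(threshold=95.0, group="high"),
]


def channels_for(value: float, triggers: list[Trigger]) -> list[str]:
    """Return the channels of every trigger whose threshold is met.

    Assumption: triggers fire independently, so a high value notifies the
    lower groups as well as the highest one.
    """
    notified: list[str] = []
    for trigger in sorted(triggers, key=lambda t: t.threshold):
        if value >= trigger.threshold:
            notified.extend(GROUPS[trigger.group])
    return notified
```

Because channels hang off groups rather than off plugins or users, moving the on-call SMS number means editing one group, not every trigger that uses it.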