Monitoring Advice Distilled

November 9, 2015 by James Wing

After being asked for my quick list of top monitoring advice more than once recently, I thought I would just write it down. I've written more about troubleshooting than monitoring, and it's been interesting to consider how the two are related, yet different. The really short version is:

Define the standards you are monitoring to
Focus on humans and organizational measures before worrying about tools
Prefer monitoring transactional data over logs
Prefer monitoring success rate over error rate
Make sure you can separate happy quiet from scary silence

Get Your Terms Straight

Monitoring is not the same as alerting, logging, error handling, or troubleshooting. It is related to all those things, but distinctly separate. Monitoring is about knowing the status of your system or application, right now. If the status is not good, maybe you will alert somebody or do some troubleshooting. First, you have to monitor.

Define Monitoring Targets

You need to define the standard that your system will be monitored to. Please, please, don't let some idiot slap a "99.9%" label on your monitoring process and walk away. You need operational definitions you can really work with. 99.9% of what? Measured how? By individual customer, or collectively? Does that count planned maintenance windows? This is tedious and hard, but it's really necessary.
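To make the "99.9% of what?" question concrete, here is a minimal sketch of how a single choice in the definition changes the answer. The function name, the 30-day month, and the downtime figures are all hypothetical, picked only for illustration.

```python
def availability(total_minutes: int, unplanned_down: int, planned_down: int,
                 count_planned: bool) -> float:
    """Fraction of in-scope minutes during which the service was up.

    count_planned decides whether planned maintenance counts as downtime
    (True) or is removed from the measurement window entirely (False).
    """
    if count_planned:
        in_scope = total_minutes
        down = unplanned_down + planned_down
    else:
        in_scope = total_minutes - planned_down
        down = unplanned_down
    return (in_scope - down) / in_scope


MONTH = 30 * 24 * 60  # minutes in a hypothetical 30-day month
TARGET = 0.999        # the "99.9%" label

# Made-up month: 20 minutes of unplanned outage, 60 minutes of planned maintenance.
strict = availability(MONTH, unplanned_down=20, planned_down=60, count_planned=True)
lenient = availability(MONTH, unplanned_down=20, planned_down=60, count_planned=False)

meets_strict = strict >= TARGET
meets_lenient = lenient >= TARGET
print(f"counting maintenance:  {strict:.4%} meets target: {meets_strict}")
print(f"excluding maintenance: {lenient:.4%} meets target: {meets_lenient}")
```

Same month, same outages, and the target is either missed or met depending on a definition nobody wrote down. The per-customer versus collective question deserves the same treatment.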

Definitions are hard because things that are easy to measure, especially with commercial tools, are rarely profoundly meaningful. Or meaningful at all. If your web site answers an HTTP GET request with a 200 status, does that mean your customers are happy? Dig deeper for key metrics.

Definitions are necessary because without them there is always a reason for inaction or overreaction. "I didn't see any errors" is the monitoring equivalent of "It works on my machine". But you can also freak out over a few errors that have no material impact on your customer base. Getting organizational support for this effort requires a recognition that downtime or service interruptions are bad for customers, costly for your business, and unpleasant for your team. You want to match your uptime goals with some kind of value recognition, although I wouldn't hold out for precise dollar amounts.

Look for a small basket of operationally defined metrics; three to five is always the magic range.

Clarify Roles and Responsibilities

I know everybody hates roles and responsibilities discussions, but this is really the place to start with monitoring improvements. The critical task is to clarify not just who is responsible for taking action when something is wrong, but who is responsible for noticing something is wrong in the first place. Many companies leave out the noticing part, assuming that magically automated alerts will chime in at just the right moment, so the only thing left to figure out is the schedule of who's on call and what their phone number is.

Noticing problems is the hardest part. It's also the core of "monitoring". Good organizations notice problems before their customers. You know your team is missing out when support issues routinely report problems you didn't find first.

Prefer Transactional Data Over Logs

I strongly recommend that you monitor your key metrics based on transactional or operational data stores rather than data scraped from logs. Why? Data quality. Your transactional data is more likely to be good quality data, with a carefully managed structure, controlled changes, and tested code. In my favorite ecommerce web site example, the order data captured by a retailer is carefully managed to high quality.

In contrast, logs are full of crap. Log messages are typically defined by developers, for developers, and reflect the technical solution domain rather than the customer domain. Anybody can write anything to logs, and what they write is rarely subject to critical review, testing, or data quality checks. Log entries are frequently duplicated by different functions or modules. If you adopted strong logging practices, you could counter some of these concerns, but I'm skeptical that you will.

In a positive light, I believe permissive logging will make for easier and more flexible troubleshooting, and you should let your logs breathe a bit. Just be clear about the reliability of the data for monitoring.
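As an illustration of monitoring from transactional data, the sketch below computes a few order metrics straight from an order table instead of parsing log lines. The `orders` table, its columns, the status values, and the sample rows are all hypothetical; an in-memory SQLite database stands in for whatever store actually holds your orders.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " status TEXT NOT NULL,"
    " created_at TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, status, created_at) VALUES (?, ?, ?)",
    [
        (1, "completed", "2015-11-09 10:01:00"),
        (2, "completed", "2015-11-09 10:02:00"),
        (2, "failed", "2015-11-09 10:03:00"),
        (3, "completed", "2015-11-09 10:04:00"),
    ],
)


def orders_summary(conn: sqlite3.Connection, since: str) -> dict:
    """Count attempted and completed orders, and distinct customers, since a timestamp."""
    attempted, completed, customers = conn.execute(
        "SELECT COUNT(*),"
        " COALESCE(SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END), 0),"
        " COUNT(DISTINCT customer_id)"
        " FROM orders WHERE created_at >= ?",
        (since,),
    ).fetchone()
    return {"attempted": attempted, "completed": completed, "customers": customers}


summary = orders_summary(conn, "2015-11-09 10:00:00")
print(summary)
```

Because the order table already has a managed structure, the question "how many customers placed orders in the last window?" is a single query, with no message formats to reverse-engineer.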

Also on this topic:

The Troubleshooting Chart Process - Much of the Troubleshooting Chart discussion applies to monitoring as well, especially with respect to getting customer-level impact from data.
Log or Operational Data for Troubleshooting? - A more detailed treatment of the differences, from the troubleshooting perspective.

Use Synthetic Transactions

On a more tactical level, do use synthetic transactions to baseline your monitoring apparatus. Synthetic transactions are non-customer transactions put through your production system. For example, you could submit fake orders through an ecommerce web site. The reason you would do this in production is to keep high confidence that orders can be placed. And by setting a non-zero baseline, you can monitor for an absence of transactions, and separate restful moments from terrifying silence.
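A minimal sketch of how a synthetic baseline might separate the two kinds of quiet: if the fake orders you submitted show up in the window, a lack of real orders is probably just a slow moment; if even the fake orders are missing, something is wrong. The `Order` record, the `is_synthetic` flag, and the labels are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: str
    is_synthetic: bool  # hypothetical flag marking our own fake orders


def classify_window(orders: list, expected_synthetic: int) -> str:
    """Classify one monitoring window from the orders recorded in it.

    expected_synthetic is how many fake orders we submitted during the
    window, so it is the non-zero baseline we should always see.
    """
    synthetic = sum(1 for o in orders if o.is_synthetic)
    real = len(orders) - synthetic
    if synthetic < expected_synthetic:
        return "scary silence"  # our own probes did not make it through
    if real == 0:
        return "happy quiet"    # system works, customers are just not ordering
    return "normal"


quiet = classify_window([Order("synthetic-1", True)], expected_synthetic=1)
silent = classify_window([], expected_synthetic=1)
busy = classify_window([Order("synthetic-1", True), Order("A100", False)],
                       expected_synthetic=1)
print(quiet, "|", silent, "|", busy)
```

Without the synthetic orders, the first two windows would both look like "zero orders" and be indistinguishable.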

Prefer Success Rate Over Error Rate

Try to focus your monitoring on keeping the success rate up, rather than keeping the error rate down. I know they should be complements of each other, adding up to 100%, but it has been my experience that they can and will be independent at times. Customers care about their success rate, not your error rate. And that's the way ownership is apportioned -- customers get success, you get errors.
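One way the two rates can drift apart, sketched with made-up numbers: some attempts neither succeed nor produce a recorded error (a timeout on the customer's side, an abandoned request, an error path that never reports). The counts and the function below are hypothetical, and this is only one possible explanation for the gap.

```python
def rates(attempts: int, successes: int, errors: int):
    """Return (success_rate, error_rate) as fractions of attempts, or (None, None)."""
    if attempts == 0:
        return None, None
    return successes / attempts, errors / attempts


# Made-up window: 100 attempted orders, 70 completed, 5 recorded errors.
attempts, successes, errors = 100, 70, 5
success_rate, error_rate = rates(attempts, successes, errors)
unaccounted = attempts - successes - errors  # neither succeeded nor reported an error

print(f"success rate: {success_rate:.0%}")
print(f"error rate:   {error_rate:.0%}")
print(f"unaccounted:  {unaccounted} attempts")
```

An error-rate dashboard for this window looks calm, while the success rate says that nearly a third of customers did not get what they came for.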