Where does troubleshooting data come from? By instinct, you might guess logs of various types, because anyone who has written code has written error messages to logs. You would be partly correct: logs are good and necessary. But I encourage you to also look at operational data stores as part of a troubleshooting investigation, especially when you want quantifiable metrics that can be shared with customers or non-technical team members.
Logs come in many shapes and sizes, but most applications have a small number of general-purpose logs that are a dumping ground for errors, exceptions, and diagnostic information. There are pros and cons, of course:
- Focus on Errors - Errors and exceptions can be prominently visible in logs. Sometimes you luck out and can easily and quickly find the errors that match the time and activity reported by customers. Sometimes these error messages contain helpfully specific information about what went wrong and how to fix it. I think of database errors with "permission denied on table X" as good examples.
- Flexible - Because logs are largely unstructured and not visible to customers, it is very easy for developers to add extra information or custom diagnostic data. Changes do not break the existing data store, there is no need to worry about historical data structure, and mistakes are assumed to be of low consequence.
- Suspicious Data Quality - Log data is of variable quality, in that test scenarios rarely verify the accuracy and completeness of what is logged. If any testing is done on log data at all, it is usually limited to the existence of log entries for errors. The same workflow rules that make it very easy and cheap for a developer to add or modify diagnostic information also make this data untested when it arrives in Production.
- Absence of Data is Meaningless - What does it mean if there are no errors matching the troubleshooting incident? Not much. You certainly cannot assume that zero errors means 100% good health. Zero errors might mean no logging for the feature or errors in the logging code itself. Or maybe there are errors, you just can't see how they relate to your incident yet.
- Shifting Data Patterns - Because of the flexibility of log data, there is little or no perceived need to preserve consistency over time. When you actually sit down to go through the logs during an incident, you are left wondering if the errors really started after the release, or if the errors were simply changed from an earlier representation. For text-based errors, this might be tolerable, but if you try to depend on quantified metrics in the logs, this will drive you crazy.
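The two pitfalls above can be sketched in a few lines. This is a hypothetical example, not any particular logging stack: a naive count of matching ERROR lines, where a zero result looks like health but proves nothing, and where a rewording of the error text between releases would silently break the count. The log format and the "OrderService" component name are invented for illustration.

```python
import re

# Hypothetical general-purpose log lines, as they might appear on disk.
LOG_LINES = [
    "2024-05-01T10:00:00 ERROR OrderService permission denied on table Orders",
    "2024-05-01T10:00:05 INFO  OrderService order 1234 accepted",
    "2024-05-01T10:00:09 ERROR OrderService timeout calling PaymentGateway",
]

def count_errors(lines, component):
    """Count ERROR lines for one component.

    A count of zero proves nothing: the component may not log its
    failures, or the message wording may have changed last release
    so the pattern no longer matches.
    """
    pattern = re.compile(r"\bERROR\b.*\b" + re.escape(component) + r"\b")
    return sum(1 for line in lines if pattern.search(line))

print(count_errors(LOG_LINES, "OrderService"))      # 2
print(count_errors(LOG_LINES, "InventoryService"))  # 0 -- silence, not health
```

The second call is the trap: zero matches for InventoryService tells you nothing about whether InventoryService is healthy, unlogged, or logging under a different message format.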
By "Operational Data" I mean the data stores that back the primary features of your application. Think of this as the order and product tables for your e-commerce site, the customer data points in your CRM system, things that were built to support the happy path. Some would call it "Transactional" data.
- Solid Data Quality - Because these data stores represent primary functionality, they tend to have been well thought out, well tested, and hardened from earlier incidents. Problems illustrated through this data will have credibility across your organization.
- Customer Relevance - Data in operational stores is more likely to represent things that customers complain about: orders that were incorrect or missing, actions that didn't happen, reports with incorrect numbers, etc. For troubleshooters, this helps you see and frame the problem in the customer's terms rather than the system's terms. It also makes these metrics more accessible to non-technical team members.
- Absence of Data is Significant - Missing data in operational stores has a solid meaning. No orders in the Order table? That's bad. No records in the FailedOrders table? That's good, hopefully, or at least a solid indication of an activity step that was or wasn't reached.
- Inflexible - Operational data stores are more expensive to change or customize because of the risk aversion overhead in development, testing, and deployment. You would not be likely to change the Order table in your e-commerce application to diagnose a single incident.
- Errors Not Included - It's somewhat rare that operational stores capture errors, although it is very powerful when they do. The emphasis on happy-path development can leave error flows out of operational data stores to an extent where the absence of data is the only signal available to the troubleshooter.
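To contrast with the log example, here is a minimal sketch of an operational-data check, using an in-memory SQLite database and hypothetical Orders and FailedOrders tables loosely modeled on the e-commerce example above. The point is the semantics of zero: an empty result from the Orders table for a busy hour is a strong, credible signal, not an ambiguous one.

```python
import sqlite3

# Hypothetical e-commerce schema; one order placed just before the incident.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (id INTEGER PRIMARY KEY, placed_at TEXT);
    CREATE TABLE FailedOrders (id INTEGER PRIMARY KEY, failed_at TEXT, reason TEXT);
    INSERT INTO Orders (placed_at) VALUES ('2024-05-01T09:55:00');
""")

def orders_in_window(conn, start, end):
    """Count orders placed in the incident window.

    Unlike a zero count of log entries, zero rows here means
    something concrete: no orders were recorded at all.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM Orders WHERE placed_at BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]

print(orders_in_window(conn, "2024-05-01T09:00:00", "2024-05-01T11:00:00"))  # 1
print(orders_in_window(conn, "2024-05-01T11:00:00", "2024-05-01T12:00:00"))  # 0
```

The same query shape against FailedOrders gives the mirror-image signal: zero rows there is good news, or at least tells you which step of the workflow was or wasn't reached.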
Are Those the Only Choices?
Trends in application development are moving towards a hybrid, something like semi-operational logging with a "log everything" mindset. In this model, the application logs exhaustively about all activity, rather than just focusing on errors and exceptional flows, to such an extent that operational data points might be duplicated in logs. "Big Data" style filtering tools are then used to sort through the logs efficiently and create derived data points from text (LogStash, SumoLogic, Loggly, Logentries, CloudWatch, etc.).
This can greatly improve the completeness of log data for troubleshooting, with less expense than modifying operational data stores. However, it is really improved logging, because it remains separate from the operational data stores, with many of the thrills and spills implied above.
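To make the hybrid model concrete, here is a toy version of what those filtering tools do: structured "log everything" entries (JSON here, with invented field names) reduced into a derived, quantified data point. Real tools operate at vastly larger scale, but the shape of the transformation is the same.

```python
import json
from collections import Counter

# Hypothetical structured log entries in a "log everything" style:
# routine activity is logged, not just errors.
RAW_LOGS = [
    '{"ts": "2024-05-01T10:00:00", "event": "order_placed", "order_id": 1}',
    '{"ts": "2024-05-01T10:00:03", "event": "payment_failed", "order_id": 1}',
    '{"ts": "2024-05-01T10:00:07", "event": "order_placed", "order_id": 2}',
]

def events_by_type(raw_lines):
    """Reduce the log stream to a per-event-type count -- the kind of
    derived metric a log-aggregation tool would surface."""
    return Counter(json.loads(line)["event"] for line in raw_lines)

counts = events_by_type(RAW_LOGS)
print(counts["order_placed"])    # 2
print(counts["payment_failed"])  # 1
```

Note that "order_placed" here duplicates a fact the Orders table already records, which is exactly the duplication described above: the derived metric is useful, but it lives beside the operational store rather than in it.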