Customizing Troubleshooting Data

2014-08-25

So you are doing some troubleshooting, but the data just isn't cooperating. You've looked at logs, and at operational/transactional data stores, but you still don't have metrics that match the "success rate" I described in previous posts. The whole Troubleshooting Chart seems like B.S. if you don't have data ready at hand. Yeah, I've been there, and it does seem tough.

But you can make data. In fact, you can probably make better data than you would ever find by happenstance, since you can tailor it to match customers' perceptions more closely and structure it better for analysis and presentation.

Transforming Data

The simplest method is to transform data you already have, but cannot directly use, into a usable form. The transformation takes some time and effort, but you can start right away since the data is already in hand. Some simple examples include:

Merging Data - Your transactional data set may not contain everything you need in one place, so mashing a couple of sets together lets you put outcome values (success/fail) and categorical attributes (customer, feature, etc.) side by side. This can give you more options for filtering the data, or simply a better display of results (customer name instead of just a numeric ID). Merging data sets is technology dependent, but a relational database does this easily, and Excel's VLOOKUP function is not much harder.
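If you have Python handy, pandas makes this kind of merge a one-liner. Here's a minimal sketch, assuming two hypothetical CSV files (transactions.csv and customers.csv) that share a customer_id column; your file and column names will differ:

    # Minimal sketch: join transaction outcomes to customer names with pandas.
    # File names and column names here are hypothetical placeholders.
    import pandas as pd

    transactions = pd.read_csv("transactions.csv")  # e.g. customer_id, feature, outcome
    customers = pd.read_csv("customers.csv")        # e.g. customer_id, customer_name

    # Left join keeps every transaction and adds the customer name where it matches.
    merged = transactions.merge(customers, on="customer_id", how="left")
    merged.to_csv("merged.csv", index=False)

The same thing in SQL is a plain JOIN; the point is just to get outcomes and attributes into one table you can filter and group.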

Continuous Values to Success/Failure - Suppose customers complain about slow web page performance rather than actual errors. Your web logs do contain timing data, but the duration values are continuous and it isn't obvious which values are 'good' vs. 'bad'. I would pick an arbitrary value for an acceptable page generation time, then grade all of the pages against that threshold to get success/fail scoring. You will probably have to adjust the threshold a few times to understand what customers are complaining about. You can then sum up the success/fail values to get success rates by time, customer, etc.
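Here's a rough sketch of that grading step in Python with pandas, assuming a hypothetical page_times.csv with timestamp, customer, and duration columns; the threshold is a deliberately arbitrary starting point:

    # Minimal sketch: grade page generation times against an arbitrary threshold,
    # then roll up success rates by day. Column names are hypothetical.
    import pandas as pd

    THRESHOLD_SECONDS = 2.0  # arbitrary starting point; expect to adjust this a few times

    pages = pd.read_csv("page_times.csv", parse_dates=["timestamp"])
    pages["success"] = pages["duration"] <= THRESHOLD_SECONDS

    # Success rate by day; group by "customer" or other columns the same way.
    daily = pages.groupby(pages["timestamp"].dt.date)["success"].mean()
    print(daily)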

Parsing Logs - Sometimes your data set is trying to escape from a large volume of verbose text logs. I find this to be the case when error logs contain values like dates, customer names, and error types, but it's hard to distill them into something quantitative. The output you are looking for is quantified data rows showing dates, amounts, outcomes, customers, etc. You could write custom scripts using regular expressions; your Computer Science professors would be proud of you. Or you could use any of a number of log parsing tools that support quantification as a feature (ElasticSearch and SumoLogic are examples).
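If you go the custom-script route, a few lines of Python with the re module will usually do it. This sketch assumes a made-up log format with customer= and error= fields, and writes the quantified rows out as CSV:

    # Minimal sketch: pull quantifiable rows out of a verbose text log with a
    # regular expression. The log format and field names here are made up.
    import csv
    import re

    LINE_RE = re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*"
        r"customer=(?P<customer>\S+).*error=(?P<error>\S+)"
    )

    with open("app.log") as log, open("errors.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "customer", "error"])
        for line in log:
            match = LINE_RE.search(line)
            if match:
                writer.writerow([match.group("date"), match.group("customer"), match.group("error")])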

Custom Data Collection

Nothing beats setting up a custom data collection system to get structured and relevant troubleshooting data. Of course, the cost is also higher, especially in terms of the time you have to wait until you have a worthwhile amount of data to draw conclusions from. In my experience, this takes several days -- partly to debug the data collection itself, and then to gather several days' worth of good quality data. But the results can be awesome.

FTP Problems - I worked an issue where customers reported problems transferring files to our FTP server. The reports were initially dismissed as 'crazy' because manual attempts to reproduce the issue successfully accessed the server. I set up a scripted process to send and receive files at regular intervals and capture errors. After several days, it became clear that customers were not crazy and that a significant number of attempts were failing. Once we established that the problems were real, several fixes were identified and applied to improve reliability.
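For flavor, here's roughly what such a scripted check can look like in Python using the standard ftplib module. The host, credentials, and interval below are placeholders, not the actual script from that incident:

    # Minimal sketch of a scripted FTP check: upload and download a small test
    # file at regular intervals and log each outcome. Host/credentials are fake.
    import datetime
    import ftplib
    import io
    import time

    HOST, USER, PASSWORD = "ftp.example.com", "testuser", "secret"  # hypothetical

    def check_once():
        payload = b"troubleshooting test payload"
        ftp = ftplib.FTP(HOST, USER, PASSWORD, timeout=30)
        try:
            ftp.storbinary("STOR healthcheck.txt", io.BytesIO(payload))
            received = io.BytesIO()
            ftp.retrbinary("RETR healthcheck.txt", received.write)
            return received.getvalue() == payload
        finally:
            ftp.close()

    while True:
        try:
            outcome = "success" if check_once() else "fail (content mismatch)"
        except Exception as exc:
            outcome = "fail (%s)" % exc
        with open("ftp_checks.log", "a") as log:
            log.write("%s %s\n" % (datetime.datetime.now().isoformat(), outcome))
        time.sleep(300)  # arbitrary five-minute interval

The resulting log is exactly the kind of success/fail data you can sum into rates by hour or day.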

Reports Not Accessible - In a similar issue, customers accessing a custom report solution complained that it was intermittently unavailable. Again, these complaints were regarded as 'crazy', because manual attempts to reproduce the issue successfully accessed the reports. I again set up a scripted process to check the reports from the external endpoint used by customers, and found that they were indeed intermittently unavailable. Armed with the knowledge that the problems were real, we quickly determined the differences between internal and external access that caused the problems customers were experiencing.
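A similar check for an HTTP endpoint is even simpler. This is a sketch using Python's urllib.request with a made-up report URL, not the actual script from that incident:

    # Minimal sketch: hit an external report URL at a regular interval and log
    # whether it responded. The URL and interval are hypothetical placeholders.
    import datetime
    import time
    import urllib.request

    REPORT_URL = "https://reports.example.com/daily"  # hypothetical

    while True:
        try:
            response = urllib.request.urlopen(REPORT_URL, timeout=30)
            outcome = "success" if response.getcode() == 200 else "fail (HTTP %d)" % response.getcode()
        except Exception as exc:
            outcome = "fail (%s)" % exc
        with open("report_checks.log", "a") as log:
            log.write("%s %s\n" % (datetime.datetime.now().isoformat(), outcome))
        time.sleep(300)  # arbitrary five-minute interval

The key detail is running it from the same vantage point customers use; checking from inside the data center would have hidden the problem.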

Memory Leaks - While troubleshooting an application server that seemed to have gone insane, we noticed that memory utilization was rather high. When restarting the process caused the insanity to disappear, we decided to investigate memory leaks. We were not in the habit of watching memory closely, and we did not gather statistics in a server monitoring package. So I set up a script to check the memory on all of our application servers on a regular basis. After waiting a day, we were able to see strong evidence of memory leaks.
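A memory check like that can be as simple as looping over your servers with ssh and free. Here's a rough Python sketch along those lines, with hypothetical hostnames and assuming passwordless ssh; it is not the exact script we used:

    # Minimal sketch: record memory usage on a list of application servers at a
    # regular interval. Hostnames are made up; assumes passwordless ssh access.
    import datetime
    import subprocess
    import time

    SERVERS = ["app01", "app02", "app03"]  # hypothetical hostnames

    while True:
        timestamp = datetime.datetime.now().isoformat()
        for host in SERVERS:
            try:
                # "free -m" reports memory in megabytes; keep the raw line for later analysis.
                output = subprocess.check_output(["ssh", host, "free", "-m"], timeout=30).decode()
                mem_line = [l for l in output.splitlines() if l.startswith("Mem:")][0]
            except Exception as exc:
                mem_line = "error: %s" % exc
            with open("memory_checks.log", "a") as log:
                log.write("%s %s %s\n" % (timestamp, host, mem_line))
        time.sleep(3600)  # hourly samples are plenty to see a leak over a day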

You will have to make your own judgement about the cost of custom data collection vs. the severity of your issue. Again, the waiting time to get a significant data set is the worst part. I'll follow up with another post about the hands-on mechanics of collecting data and making these custom collection scripts easier.