While profiling is invaluable for debugging performance problems that affect the common case, it is of little help in tracking down performance problems that affect only the slowest 1% of operations (i.e., long-tail latencies). For Web service providers, these long-tail latencies affect both the cost of the service and the user experience. Because interactions between operations are often responsible for long-tail latency, we must analyze fine-grained traces to investigate their cause.
Unfortunately, analyzing traces is difficult: one must reason over long chains of events, and this reasoning often requires significant domain knowledge about what the event sequences mean. This paper shows how formulas in linear temporal logic (LTL) can be used to analyze traces. Given such formulas, our system searches the traces for matches and extracts relevant information from them. We demonstrate that our system is scalable and that it enables us to investigate long-tail performance problems at Google. Copyright © 2014 John Wiley & Sons, Ltd.
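
To make the idea of matching temporal-logic formulas against a trace concrete, the following is a minimal sketch and not the system described in this paper. It assumes a finite-trace reading of the LTL operators, and every name in it (the combinators, `find_matches`, the event field `kind`, and the event kinds `start`, `done`, and `gc_pause`) is hypothetical and chosen only for illustration. The formula asks for each request start that is followed by a GC pause before the request completes.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
# A formula maps (trace, position) to True or False.
Formula = Callable[[List[Event], int], bool]


def atom(pred: Callable[[Event], bool]) -> Formula:
    """Holds at position i if the event there satisfies pred."""
    return lambda trace, i: i < len(trace) and pred(trace[i])


def not_(f: Formula) -> Formula:
    return lambda trace, i: not f(trace, i)


def and_(f: Formula, g: Formula) -> Formula:
    return lambda trace, i: f(trace, i) and g(trace, i)


def next_(f: Formula) -> Formula:
    """X f: f holds at the following position (false at the end of the trace)."""
    return lambda trace, i: i + 1 < len(trace) and f(trace, i + 1)


def until(f: Formula, g: Formula) -> Formula:
    """f U g: g eventually holds, and f holds at every position before that."""
    def check(trace: List[Event], i: int) -> bool:
        for j in range(i, len(trace)):
            if g(trace, j):
                return True
            if not f(trace, j):
                return False
        return False  # finite-trace assumption: g never occurred
    return check


def find_matches(formula: Formula, trace: List[Event]) -> List[int]:
    """Return every position in the trace at which the formula holds."""
    return [i for i in range(len(trace)) if formula(trace, i)]


def is_kind(kind: str) -> Formula:
    return atom(lambda e: e["kind"] == kind)


# Hypothetical trace: timestamps and event kinds are made up for illustration.
trace: List[Event] = [
    {"kind": "start", "t": 0},
    {"kind": "done", "t": 3},
    {"kind": "start", "t": 5},
    {"kind": "gc_pause", "t": 6},
    {"kind": "done", "t": 40},
    {"kind": "start", "t": 42},
    {"kind": "done", "t": 44},
]

# start AND X( (NOT done) U gc_pause )
formula = and_(
    is_kind("start"),
    next_(until(not_(is_kind("done")), is_kind("gc_pause"))),
)

matches = find_matches(formula, trace)
# Extract information from the matches, here the start timestamps.
start_times = [trace[i]["t"] for i in matches]
print(matches, start_times)
```

Re-evaluating the formula at every position is quadratic in the trace length in the worst case, which is acceptable for a sketch; a system intended to scale would need a more careful evaluation strategy.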