Using execution trace data to improve distributed systems



One of the most challenging problems facing today's software engineer is understanding and modifying distributed systems. One reason is that, in actual use, systems frequently behave differently from what the developer intended. To cope with this challenge, we have developed a three-step method for studying the run-time behavior of a distributed system. First, remote procedure calls (RPCs) are traced using CORBA interceptors. Next, the trace data is parsed to construct RPC call-return sequences, and summary statistics are generated. Finally, a visualization tool is used to study the statistics and look for anomalous behavior. We have been using this method on a large distributed system (more than 500,000 lines of code) with data collected during both system testing and operation at a customer's site. Although the distributed system had been in operation for over three years, the method uncovered system configuration and efficiency problems. Using these discoveries, the system support group has been able to improve product performance and its own product maintenance procedures. Copyright © 2002 John Wiley & Sons, Ltd.
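
As an illustration of the second step only, the following minimal Python sketch pairs call and return events into call-return records and derives per-operation summary statistics. It is not the tooling used in this work: the record layout (timestamp, thread identifier, event kind, operation name), the sample data, and the function names pair_calls and summarize are all assumptions made for the example, and the CORBA interceptor that would produce such records is not shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records: (timestamp_ms, thread_id, kind, operation).
# The layout and the values are invented for this sketch.
TRACE = [
    (0, "t1", "call", "Order.submit"),
    (2, "t1", "call", "Inventory.reserve"),
    (9, "t1", "return", "Inventory.reserve"),
    (15, "t1", "return", "Order.submit"),
    (20, "t2", "call", "Inventory.reserve"),
    (50, "t2", "return", "Inventory.reserve"),
]


def pair_calls(events):
    """Match each return with the most recent open call on the same thread."""
    stacks = defaultdict(list)  # thread_id -> stack of (start_ts, operation)
    pairs = []
    for ts, thread, kind, op in sorted(events):
        if kind == "call":
            stacks[thread].append((ts, op))
        elif kind == "return":
            stack = stacks[thread]
            if not stack or stack[-1][1] != op:
                continue  # unmatched return: skipped in this sketch
            start, _ = stack.pop()
            depth = len(stack)  # nesting depth of the completed call
            pairs.append((op, thread, start, ts, depth))
    return pairs


def summarize(pairs):
    """Compute count, mean, and maximum duration for each operation."""
    durations = defaultdict(list)
    for op, _thread, start, end, _depth in pairs:
        durations[op].append(end - start)
    return {
        op: {"count": len(d), "mean_ms": mean(d), "max_ms": max(d)}
        for op, d in durations.items()
    }


if __name__ == "__main__":
    for op, stats in sorted(summarize(pair_calls(TRACE)).items()):
        print(op, stats)
```

A per-thread stack is used here because nested synchronous calls return in last-in, first-out order; a system with asynchronous or one-way invocations would need a different matching key, such as a request identifier. An operation whose maximum duration is far above its mean is the kind of anomaly the third, visual step is meant to expose.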