Although we use the by-now-established phrase “software aging,” it should be clear that no deterioration of the software system per se is implied but rather the software appears to age due to the gradual depletion of resources (8). Likewise, “software rejuvenation” actually refers to rejuvenation of the environment in which the software is executing.
Standard Article
Software Aging and Rejuvenation
Published Online: 14 DEC 2007
DOI: 10.1002/9780470050118.ecse394
Copyright © 2007 by John Wiley & Sons, Inc.
Book Title
Wiley Encyclopedia of Computer Science and Engineering
Additional Information
How to Cite
Trivedi, K. S. and Vaidyanathan, K. 2007. Software Aging and Rejuvenation. Wiley Encyclopedia of Computer Science and Engineering.
1 Introduction
Several studies have shown that outages in computer systems are caused more often by software faults than by hardware faults (1, 2). Recent studies have also reported the phenomenon of “software aging” (3, 4), in which the state of the software degrades with time. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation, which may eventually lead to performance degradation of the software, crash/hang failure, or both. Some common examples of “software aging” are memory bloating and leaking, unreleased file locks, data corruption, storage space fragmentation, and accumulation of round-off errors (3). Aging has been observed not only in software used on a mass scale but also in specialized software used in high-availability and safety-critical applications (4). This type of aging in operational software systems is different from code decay in software systems caused by maintenance (5, 6). The former results in performance problems, system slowdowns, and crashes, whereas the latter results in unrunnable or invalid software and maintenance-induced bugs.
As aging leads to transient failures in software systems, environment diversity, a software fault-tolerance technique, can be employed proactively to prevent degradation or crashes. It involves occasionally stopping the running software, “cleaning” its internal state or its environment, and restarting it. Such a technique, known as “software rejuvenation,” was proposed by Huang et al. (4, 7, 8).¹ It counteracts the aging phenomenon in a proactive manner by removing the accumulated error conditions and freeing up operating system resources. Garbage collection, flushing operating system kernel tables, and reinitializing internal data structures are some examples of ways in which the internal state or the environment of the software can be cleaned.
Software rejuvenation has been implemented in the AT&T billing applications (4). An extreme example of system-level rejuvenation, proactive hardware reboot, has been implemented in the real-time system collecting billing data for most telephone exchanges in the United States (9). Occasional reboot is also performed in the AT&T telecommunications switching software (10). On reboot, called software capacity restoration, the service rate is restored to its peak value. On-board preventive maintenance in spacecraft, which maximizes the probability of successful mission completion, has been proposed and analyzed by Tai et al. (11). These operations, called operational redundancy, are invoked whether or not faults exist. Proactive fault management was also recommended for the Patriot missiles' software system (12, 13). A warning was issued saying that a very long running time could affect the targeting accuracy. This decrease in accuracy was evidently due to overflow in the counter keeping track of time, during conversion from integer to real numbers. The longer the system ran continuously, the larger the error became. The warning, however, failed to inform the troops how many hours “very long” was and that it would help if the computer system was switched off and on every eight hours. This episode exemplifies the necessity and the use of proactive fault management even in safety-critical systems.
More recently, rejuvenation has been implemented in cluster systems to improve performance and availability (14–17). Two kinds of policies have been implemented, taking advantage of the cluster failover feature. In the periodic policy, rejuvenation of the cluster nodes is done in a rolling fashion after every deterministic interval. In the prediction-based policy, the time to rejuvenate is estimated based on the collection and statistical analysis of system data. The implementation and analysis are described in detail in Refs. 13 and 15.
A software rejuvenation feature known as process recycling has been implemented in the Microsoft IIS 5.0 web server software (18). The popular web server software Apache implements a form of rejuvenation by killing and recreating processes after a certain number of requests have been served (19, 20). Software rejuvenation is also implemented in specialized transaction processing servers (21). Rejuvenation has also been proposed for cable and DSL modem gateways (22), in Motorola's Cable Modem Termination System (23), and in middleware applications (24) for failure detection and prevention. Automated rejuvenation strategies have been proposed in the context of self-healing and autonomic computing systems (25). Recently, recursive restarts and micro-reboots have been proposed to increase availability (26). Software rejuvenation (preventive maintenance) incurs an overhead (in terms of performance, cost, and downtime), which should be balanced against the loss incurred due to unexpected outage caused by a failure. Thus, an important research issue is to determine the optimal times to perform rejuvenation.
Here, we present two approaches for analyzing software aging and studying aging-related failures. The rest of this article is organized as follows. The next section describes various analytical models for software aging and for determining optimal times to perform rejuvenation. Measurement-based models are dealt with next, followed by a discussion of the implementation of a software rejuvenation agent in a major commercial server and of various approaches and methods of rejuvenation. The article concludes with pointers to future work.
2 Analytic Models for Software Rejuvenation
The aim of analytic modeling is to determine the optimal times to perform rejuvenation so as to maximize availability, minimize the probability of loss, or minimize the mean response time of a transaction (in the case of a transaction processing system). The last measure is particularly important for business-critical applications, for which adequate response time can be as important as system uptime. The analysis is done for different kinds of software systems exhibiting varied failure/aging characteristics.
The accuracy of a model-based approach is determined by the assumptions made in capturing aging. In Refs. 4, 11, and 27–29, only the failures causing unavailability of the software are considered, whereas in Ref. 30 only a gradually decreasing service rate of software that serves transactions is assumed. Garg et al. (31), however, consider both these effects of aging together in a single model. Models proposed in Refs. 4, 27, and 28 are restricted to hypoexponentially distributed time to failure. Those proposed in Refs. 11, 29, and 30 can accommodate general distributions but only for the specific aging effect they capture. Generally distributed time to failure, as well as a service rate that is an arbitrary function of time, is allowed in Ref. 31. It has been noted (2) that transient failures are partly caused by overload conditions. Only the model presented by Garg et al. (31) captures the effect of load on aging. Existing models also differ in the measures being evaluated. In Refs. 11 and 29, software with a finite mission time is considered. In Refs. 4, 27, 28, and 31, measures of interest in transaction-based software intended to run forever are evaluated.
Bobbio et al. (32) present fine-grained software degradation models, where one can identify the current degradation level based on the observation of a system parameter. Optimal rejuvenation policies based on a risk criterion and an alert threshold are then presented. Dohi et al. (33, 34) present software rejuvenation models based on semi-Markov processes. The models are analyzed for optimal rejuvenation strategies based on cost as well as steady-state availability. Given sample data of failure times, statistical nonparametric algorithms based on the total time on test transform are presented to obtain the optimal rejuvenation interval.
2.1 Basic Model for Rejuvenation
Figure 1 shows the basic software rejuvenation model proposed by Huang et al. (4). The software system is initially in a “robust” working state, 0. As time progresses, it eventually transits to a “failure-probable” state, 1. The system is still operational in this state, but can fail (move to state 2) with a nonzero rate. The system can be repaired and brought back to the initial state, 0. The software system is also rejuvenated at regular intervals from the failure-probable state 1 and brought back to the robust state 0.
Huang et al. (4) assume that the stochastic behavior of the system can be described by a simple homogeneous continuous-time Markov chain (CTMC) (35). The CTMC is then analyzed, and the expected system downtime and the expected cost per unit time in the steady state are computed. An optimal rejuvenation interval that minimizes expected downtime (or expected cost) is obtained.
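The following is a minimal sketch of how the steady-state downtime and cost of such a chain could be computed. It assumes a four-state variant in which rejuvenation is an explicit state R between states 1 and 0. The function names, the rate names (`r_aging`, `r_fail`, `r_repair`, `r_rejuv`, `r_restart`), and any numeric values passed to them are illustrative assumptions, not parameters taken from Ref. 4.

```python
def steady_state(r_aging, r_fail, r_repair, r_rejuv, r_restart):
    """Steady-state probabilities of an assumed 4-state rejuvenation CTMC.

    States: "0" robust, "1" failure-probable, "2" failed, "R" rejuvenating.
    Transitions: 0->1 (r_aging), 1->2 (r_fail), 2->0 (r_repair),
                 1->R (r_rejuv), R->0 (r_restart).
    """
    # Balance equations, written relative to the probability of state 1:
    #   p0 * r_aging   = p1 * (r_fail + r_rejuv)
    #   p2 * r_repair  = p1 * r_fail
    #   pR * r_restart = p1 * r_rejuv
    w1 = 1.0
    w0 = (r_fail + r_rejuv) / r_aging
    w2 = r_fail / r_repair
    wr = r_rejuv / r_restart
    total = w0 + w1 + w2 + wr
    return {"0": w0 / total, "1": w1 / total, "2": w2 / total, "R": wr / total}


def expected_downtime(probs, horizon_hours):
    """Expected hours spent failed or rejuvenating over the horizon."""
    return (probs["2"] + probs["R"]) * horizon_hours


def expected_cost(probs, horizon_hours, cost_fail, cost_rejuv):
    """Expected cost over the horizon, given hourly costs of each down state."""
    return (probs["2"] * cost_fail + probs["R"] * cost_rejuv) * horizon_hours
```

Evaluating `expected_downtime` or `expected_cost` over a range of `r_rejuv` values gives a simple way to compare rejuvenation schedules under these assumptions.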
It is not difficult to introduce a periodic rejuvenation schedule and to extend the CTMC model to general distributions. Dohi et al. (33, 34) developed semi-Markov models with periodic rejuvenation and general transition distribution functions. Garg et al. (27) have developed a Markov Regenerative Stochastic Petri Net (MRSPN) model in which rejuvenation is performed at deterministic intervals, assuming that the failure-probable state 1 is not observable.
2.2 Software Rejuvenation in Transaction-Based Software Systems
In Ref. 31, Garg et al. consider a transaction-based software system whose macro-state representation is presented in Fig. 2. The state in which the software is available for service (albeit with decreasing service rate) is denoted as state A. After failure, a recovery procedure is started. In state B, the software is recovering from failure and is unavailable for service. Lastly, the software occasionally undergoes rejuvenation, denoted by state C. Rejuvenation is allowed only from state A. Once recovery from failure or rejuvenation is complete, the software is reset to state A and is as good as new. From this moment, which constitutes a renewal, the whole process stochastically repeats itself.
The system consists of server-type software to which transactions arrive at a constant rate. The effect of aging in the model may be captured by using a decreasing service rate and an increasing failure rate, where the decrease or the increase, respectively, can be a function of time, instantaneous load, mean accumulated load, or a combination of these.
Two policies that can be used to determine the time to perform rejuvenation are considered. Under policy I, which is purely time-based, rejuvenation is initiated after a constant time δ has elapsed since the software was started (or restarted). Under policy II, which is based on instantaneous load and time, a constant waiting period δ must elapse before rejuvenation is attempted. After this time, rejuvenation is initiated if and only if there are no transactions in the system. Otherwise, the software waits until the queue is empty, upon which rejuvenation is initiated. The goal of the analysis is to determine optimal values of δ (rejuvenation interval under policy I and rejuvenation wait under policy II) for different objective functions such as the availability, the loss probability, and the mean response time.
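The decision rule of the two policies can be sketched as a small predicate. This is only an illustration of the triggering conditions described above. The function name and its arguments are hypothetical, and the sketch says nothing about how the optimal δ is found.

```python
def should_rejuvenate(policy, elapsed, delta, queue_length=0):
    """Decide whether rejuvenation should start now.

    policy "I":  purely time-based; start once `delta` has elapsed
                 since the software was started or restarted.
    policy "II": wait for `delta`, then start only when no
                 transactions are in the system.
    """
    if policy == "I":
        return elapsed >= delta
    if policy == "II":
        return elapsed >= delta and queue_length == 0
    raise ValueError(f"unknown policy: {policy!r}")
```

Under policy II, a caller would keep polling after δ has elapsed until the queue drains, which is why the predicate takes the current queue length rather than a fixed deadline.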
2.3 Software Rejuvenation in a Cluster System
Software rejuvenation has been applied to cluster systems (14, 16), where it significantly improves system availability and productivity. The Stochastic Reward Net (SRN) model of a cluster system employing simple time-based rejuvenation is shown in Fig. 3. The cluster consists of n nodes, which are initially in a “robust” working state, P_{up}. The aging process is modeled as a two-stage hypoexponential distribution (increasing failure rate) (35) with transitions T_{fprob} and T_{nodefail}. Place P_{fprob} represents a “failure-probable” state in which the nodes are still operational. The nodes then can eventually transit to the fail state, P_{nodefail1}. A node can be repaired through the transition T_{noderepair}, with a coverage c. In addition to individual node failures, there is also a common-mode failure (transition T_{cmode}). The system is also considered down when there are a (a ≤ n) individual node failures. The system is repaired through the transition T_{sysrepair}.
In the simple time-based policy, rejuvenation is done successively for all the operational nodes in the cluster at the end of each deterministic interval. The transition T_{rejuvinterval} fires every δ time units, depositing a token in place P_{startrejuv}. Only one node can be rejuvenated at any time (at places P_{rejuv1} or P_{rejuv2}). Weight functions are assigned such that the probability of selecting a token from P_{up} or P_{fprob} is directly proportional to the number of tokens in each. After a node has been rejuvenated, it goes back to the “robust” working state, represented by place P_{rejuved}, which is a clone place for P_{up} used to distinguish the nodes that are waiting to be rejuvenated from the nodes that have already been rejuvenated. A node, after rejuvenation, is then allowed to fail with the same rates as before rejuvenation, even when another node is being rejuvenated. Clone places for P_{up} and P_{fprob} are needed to capture this behavior. Node repair is disabled during rejuvenation. Rejuvenation is complete when the sum of nodes in places P_{rejuved}, P_{fprobrejuv}, and P_{nodefail2} is equal to the total number of nodes, n. In this case, the immediate transition T_{immd10} fires, putting back all the rejuvenated nodes in places P_{up} and P_{fprob}. Rejuvenation stops when there are a−1 tokens in place P_{nodefail2}, to prevent a system failure. The clock resets itself when rejuvenation is complete and is disabled when the system is undergoing repair. Guard functions (g1 through g7) are assigned to express complex enabling conditions textually.
For the analysis, the following values are assumed. The mean times spent in places P_{up} and P_{fprob} are 240 hours and 720 hours, respectively. The mean times to repair a node, to rejuvenate a node, and to repair the system are 30 minutes, 10 minutes, and 4 hours, respectively. In this analysis, the common-mode failure is disabled and node failure coverage is assumed to be perfect. All the models were solved using the SPNP (Stochastic Petri Net Package) tool (36). The measures computed were the expected downtime and the expected cost incurred over a fixed time interval. It is assumed that the cost incurred due to node rejuvenation is much less than the cost of a node or system failure, since rejuvenation can be done at predetermined or scheduled times. In our analysis, we fix the value of cost_{nodefail} at $5,000/hr and the value of cost_{rejuv} at $250/hr. The value of cost_{sysfail} is computed as the number of nodes, n, times cost_{nodefail}.
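As a hedged illustration of these parameter choices, the sketch below evaluates the failure rate implied by a two-stage hypoexponential time to failure with the stated stage means, and the system failure cost for a given cluster size. It treats a single node in isolation and ignores rejuvenation and repair. The function names are hypothetical, and the sketch is not a substitute for solving the SRN model.

```python
import math

MEAN_UP_HOURS = 240.0     # mean time in P_up (from the text)
MEAN_FPROB_HOURS = 720.0  # mean time in P_fprob (from the text)
COST_NODEFAIL = 5000.0    # dollars per hour (from the text)


def hypoexp_hazard(t, mean1=MEAN_UP_HOURS, mean2=MEAN_FPROB_HOURS):
    """Failure rate at time t for a two-stage hypoexponential lifetime.

    Assumes the two stage means differ, so the closed form applies.
    """
    l1, l2 = 1.0 / mean1, 1.0 / mean2
    density = l1 * l2 / (l1 - l2) * (math.exp(-l2 * t) - math.exp(-l1 * t))
    survival = (l1 * math.exp(-l2 * t) - l2 * math.exp(-l1 * t)) / (l1 - l2)
    return density / survival


def cost_sysfail(n_nodes, cost_nodefail=COST_NODEFAIL):
    """System failure cost per hour: number of nodes times the node cost."""
    return n_nodes * cost_nodefail


# Mean time to node failure without rejuvenation: sum of the stage means.
MEAN_TIME_TO_NODE_FAILURE = MEAN_UP_HOURS + MEAN_FPROB_HOURS
```

The hazard starts at zero and rises toward the rate of the slower stage, which is the increasing failure rate that makes periodic rejuvenation worthwhile in this model.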
Figure 4 shows the plots for an 8/1 configuration (8 nodes including 1 spare) system employing simple time-based rejuvenation. The upper and lower plots show the expected cost incurred and the expected downtime (in hours), respectively, in a given time interval, versus the rejuvenation interval (time between successive rejuvenations) in hours. If the rejuvenation interval is close to zero, the system is always rejuvenating and thus incurs high cost and downtime. As the rejuvenation interval increases, both expected downtime and cost incurred decrease and reach an optimum value. If the rejuvenation interval goes beyond the optimal value, the system failure has more influence on these measures than rejuvenation. The analysis was repeated for 2/1, 8/2, 16/1, and 16/2 configurations. For time-based rejuvenation, the optimal rejuvenation interval was 100 hours for the 1-spare clusters and approximately 1 hour for the 2-spare clusters.
3 Measurement-Based Models for Software Rejuvenation
Whereas all the analytical models are based on the assumption that the rate of software aging is known, the measurement-based approach monitors and collects data on the attributes responsible for determining the health of the executing software. The data are then analyzed to obtain predictions about possible impending failures due to resource exhaustion.
In this section, we describe the measurement-based approach for detection and validation of the existence of software aging, applied here to the UNIX operating system. Garg et al. (3) propose an approach for detection and estimation of aging in the UNIX operating system. An SNMP-based distributed resource monitoring tool was used to collect operating system resource usage and system activity data from nine heterogeneous UNIX workstations connected by an Ethernet LAN at the Department of Electrical and Computer Engineering at Duke University. A central monitoring station runs the manager program, which sends get requests periodically to each of the agent programs running on the monitored workstations. The agent programs, in turn, obtain data for the manager from their respective machines by executing various standard UNIX utility programs such as pstat, iostat, and vmstat. For quantifying the effect of aging in operating system resources, the metric “estimated time to exhaustion” is proposed.
In the time-based estimation method presented by Garg et al. (3), data were collected from the UNIX machines at intervals of 15 minutes for about 53 days. Time-ordered values for each monitored object are obtained, constituting a time series for that object. The objective is to detect aging or a long-term trend (increasing or decreasing) in the values. Only results for the data collected from the machine Rossby are discussed here.
First, the trends in operating system resource usage and system activity are detected by smoothing the observed data with robust locally weighted regression, proposed by Cleveland (3). This technique is used to obtain the global trend between outages by removing the local variations. Then, the slope of the trend is estimated to enable prediction. Figure 5 shows the smoothed data superimposed on the original data points from the time series of objects for Rossby. The amount of real memory free (plot 1) shows an overall decrease, whereas the file table size (plot 2) shows an increase. Plots of some other resources not discussed here also showed an increase or decrease, which corroborates the hypothesis of aging with respect to various objects.
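To make the smoothing step concrete, here is a simplified, single-pass locally weighted linear regression with tricube weights. It is a sketch only: Cleveland's robust procedure adds iterations that down-weight outliers, and those are omitted here. The function name and the `frac` parameter are assumptions chosen for illustration.

```python
def local_linear_smooth(xs, ys, frac=0.5):
    """Smooth ys over xs with locally weighted linear fits (no robustness pass).

    For each point, a weighted straight line is fitted to its nearest
    neighbours, and the fitted value at that point is returned.
    """
    n = len(xs)
    k = max(3, min(n, int(round(frac * n))))  # neighbours used in each local fit
    smoothed = []
    for x0 in xs:
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube weights: 1 at x0, falling to 0 at the farthest neighbour.
        w = [(1.0 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * (xs[i] - x0) for wi, i in zip(w, nearest)) / sw
        my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (xs[i] - x0 - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (xs[i] - x0 - mx) * (ys[i] - my)
                  for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my - slope * mx)  # value of the local line at x0
    return smoothed
```

A larger `frac` uses more neighbours per fit and therefore removes more of the local variation, at the cost of flattening genuine short-term changes.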
The seasonal Kendall test (3) was applied to each of these time series to detect the presence of any global trends at a significance level, α, of 0.05. With Z_{α} = 1.96, all values are such that the null hypothesis (H_{0}) that no trend exists is rejected for the variables considered. Given that a global trend is present and that its slope is calculated for a particular resource, the time at which the resource will be exhausted because of aging only is estimated. Table 1 refers to several objects on Rossby and lists an estimate of the slope (change per day) of the trend obtained by applying Sen's slope estimate for data with seasons (3). The values for real memory and swap space are in kilobytes.
Resource Name        Initial Value   Max Value   Sen's Slope Estimate   95% Confidence Interval   Estimated Time to Exhaustion (days)
Rossby
Real Memory Free     40814.17        84980       −252.00                −287.75 : −219.34         161.96
File Table Size      220             7110        1.33                   1.30 : 1.39               5167.50
Process Table Size   57              2058        0.43                   0.41 : 0.45               4602.30
Used Swap Space      39372           312724      267.08                 220.09 : 295.50           1023.50
Jefferson
Real Memory Free     67638.54        114608      −972.00                −1006.81 : −939.08        69.59
File Table Size      268.83          7110        1.33                   1.30 : 1.38               5144.36
Process Table Size   67.18           2058        0.30                   0.29 : 0.31               6696.41
Used Swap Space      47148.02        524156      577.44                 545.69 : 603.14           826.07
A negative slope, as in the case of real memory, indicates a decreasing trend, whereas a positive slope, as in the case of file table size, is indicative of an increasing trend. Given the slope estimate, the table lists the estimated time to failure of the machine due to aging only with respect to this particular resource. The calculation of the time to exhaustion is done by using the standard linear approximation y = mx + c.
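The trend test, slope estimate, and linear extrapolation can be sketched as follows. Note the simplifications: the functions implement the plain (non-seasonal, no-ties) Mann-Kendall statistic and the plain Sen's slope estimator, whereas the study uses the seasonal variants. All function names are illustrative assumptions.

```python
import math
from statistics import median


def mann_kendall_z(values):
    """Z statistic of the non-seasonal Mann-Kendall trend test (no tie correction)."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def sens_slope(times, values):
    """Median of all pairwise slopes (Sen's estimator without seasons)."""
    slopes = [(values[j] - values[i]) / (times[j] - times[i])
              for i in range(len(values) - 1)
              for j in range(i + 1, len(values))
              if times[j] != times[i]]
    return median(slopes)


def time_to_exhaustion(current, limit, slope):
    """Time until a straight line from `current` with `slope` reaches `limit`."""
    if slope == 0 or (limit - current) / slope < 0:
        return math.inf  # the trend never reaches the limit
    return (limit - current) / slope
```

With the Rossby values from Table 1, `time_to_exhaustion(40814.17, 0.0, -252.00)` gives about 161.96 days for real memory free, where the limit is zero because the resource is depleted as it decreases. For resources that grow toward a maximum, the limit is the Max Value column. Recomputing those rows from the rounded slopes shown in the table reproduces the listed times only approximately.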
The method discussed in Ref. 3 assumes that accumulated depletion of a resource over a time period depends only on the elapsed time. However, it is intuitive that the rate at which a resource is depleted depends on the current workload. Here, we discuss a measurement-based model to estimate the rate of exhaustion of operating system resources as a function of both time and the system workload (37, 38). The SNMP-based distributed resource monitoring tool described previously was used for collecting operating system resource usage and system activity parameters (at 10-minute intervals) for over 3 months. Only results for the data collected from the machine Rossby are discussed here. The longest stretch of sample points in which no reboots or failures occurred was used for building the model. A semi-Markov reward model (39) is constructed using the data. First, different workload states are identified using statistical cluster analysis, and a state-space model is constructed. Corresponding to each resource, a reward function based on the rate of resource exhaustion in the different states is then defined. Finally, the model is solved to obtain trends and the estimated exhaustion rates and time to exhaustion for the resources.
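One way to picture the final step is the standard steady-state calculation for a semi-Markov reward model: the long-run fraction of time in each workload state is weighted by that state's depletion rate. The sketch below assumes the embedded-chain stationary probabilities and the mean sojourn times have already been estimated from the data. The function names and any numbers passed in are hypothetical and do not come from the study.

```python
def time_fractions(embedded_probs, mean_sojourn):
    """Long-run fraction of time spent in each state of a semi-Markov process.

    embedded_probs: stationary probabilities of the embedded Markov chain.
    mean_sojourn:   mean time spent in each state per visit.
    """
    weighted = [v * h for v, h in zip(embedded_probs, mean_sojourn)]
    total = sum(weighted)
    return [w / total for w in weighted]


def expected_depletion_rate(fractions, rate_per_state):
    """Overall depletion rate: per-state rates weighted by time fractions."""
    return sum(f * r for f, r in zip(fractions, rate_per_state))
```

Dividing the remaining amount of a resource by the resulting rate gives a workload-aware counterpart of the purely time-based estimate of time to exhaustion.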
A methodology based on time-series analysis to detect and estimate resource exhaustion times due to software aging in a web server while subjecting it to an artificial workload is proposed in Ref. 19. The experiments are conducted on an Apache web server running on the Linux platform.
The analysis can be done using two different approaches: (1) building a univariate model for each of the outputs, or (2) building only one multivariate model with seven outputs. In this case, seven univariate models are built and then combined into a single multivariate model. First, the parameters are examined to determine their characteristics and to build an appropriate model with one output and four inputs for each parameter: connection rate, linear trend, periodic series with a period of one week, and periodic series with a period of one day. The autocorrelation function (ACF) and the partial autocorrelation function (PACF) for the output are computed. The ACF and the PACF help us decide the appropriate model for the data (40). For example, from the ACF and PACF of used swap space, it can be determined that an autoregressive model of order 1 [AR(1)] is suitable for this data series. Adding the inputs to the AR(1) model, we get the ARX(1) model for used swap space:
Y_{t} = a Y_{t-1} + b_{1} X_{t} + b_{2} L_{t} + b_{3} W_{t} + b_{4} D_{t}    (1)
where Y_{t} is the used swap space, X_{t} is the connection rate, L_{t} is the time step that represents the linear trend, W_{t} is the weekly periodic series, D_{t} is the daily periodic series, and a and b_{1} through b_{4} are the coefficients to be estimated. After observing the ACF and PACF of all the parameters, we find that all of the PACFs cut off at certain lags. All the multiple input single output (MISO) models are therefore of the ARX type, differing only in their orders, which makes it convenient to combine them into a multiple input multiple output (MIMO) ARX model, as described later.
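The order-selection step relies on the sample ACF and PACF. A minimal sketch of both follows, using the Durbin-Levinson recursion for the PACF. The helper `simulate_ar1` and its parameters are assumptions added only so that the behavior can be tried on synthetic data. It is not part of the study.

```python
import random


def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Sample partial autocorrelations for lags 1..max_lag (Durbin-Levinson)."""
    r = acf(series, max_lag)
    phi_prev = []  # coefficients of the order k-1 fit
    result = []
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                   for j in range(k - 1)]
        phi_new.append(phi_kk)
        phi_prev = phi_new
        result.append(phi_kk)
    return result


def simulate_ar1(phi, n, seed=0):
    """Synthetic AR(1) series with unit-variance Gaussian noise (illustration only)."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y
```

For a series generated by `simulate_ar1`, the PACF is large at lag 1 and close to zero afterward, which is the cut-off pattern that points to an AR(1) structure.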
To combine the MISO ARX models into a MIMO ARX model, we need to choose the order between different outputs, which is done by inspecting the cross-correlation function (CCF) between each pair of outputs to find the leading relationship between them. If the CCF between parameters A and B reaches its peak value at a positive lag k, we say that A leads B by k steps, and it might be possible to use A to predict B. In our analysis, there are 21 CCFs that need to be computed. To reduce the complexity, we use only the CCFs that exhibit an obvious leading relationship with lags of less than 10 steps. The next step after determination of the orders is to estimate the coefficients of the model by the least squares method. The first half of the data is used to estimate the parameters, and the rest of the data is then used to verify the model. Figure 6 shows the two-hour-ahead (24-step) predicted used swap space, which is computed using the established model and the data measured up to two hours before the predicted time point. From the plots, we can see that the predicted values are very close to the measured values.
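The estimate-then-verify procedure can be illustrated on a deliberately reduced model with one autoregressive term and a single input, y[t] = a·y[t-1] + b·x[t]. This is a simplification of the four-input model in Eq. 1, and it uses one-step-ahead rather than 24-step-ahead prediction. The function names are hypothetical.

```python
def fit_arx1(y, x):
    """Least-squares estimate of (a, b) in y[t] = a*y[t-1] + b*x[t]."""
    s_pp = s_px = s_xx = s_py = s_xy = 0.0
    for t in range(1, len(y)):
        prev, inp, out = y[t - 1], x[t], y[t]
        s_pp += prev * prev
        s_px += prev * inp
        s_xx += inp * inp
        s_py += prev * out
        s_xy += inp * out
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s_pp * s_xx - s_px * s_px
    a = (s_py * s_xx - s_xy * s_px) / det
    b = (s_xy * s_pp - s_py * s_px) / det
    return a, b


def one_step_predictions(y, x, a, b):
    """Predict y[t] from y[t-1] and x[t] for t = 1..len(y)-1."""
    return [a * y[t - 1] + b * x[t] for t in range(1, len(y))]


def split_fit_and_verify(y, x):
    """Fit on the first half, then report mean absolute error on the second half."""
    half = len(y) // 2
    a, b = fit_arx1(y[:half], x[:half])
    preds = one_step_predictions(y[half - 1:], x[half - 1:], a, b)
    actual = y[half:]
    mae = sum(abs(p - q) for p, q in zip(preds, actual)) / len(actual)
    return a, b, mae
```

Holding out the second half keeps the verification error from being flattered by data the coefficients were fitted to.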
In Ref. 8, a model is developed to account for the gradual loss of system resources, especially the memory resource. In a client-server system, for example, every client process issues memory requests at varying points in time. An amount of memory is granted to each new request (when there is enough memory available), held by the requesting process for a period of time, and presumably released back to the system resource reservoir when it is no longer in use. A memory leak occurs when the amount of allocated memory is not fully released. The available memory space is gradually reduced as such resource leaks accumulate over time. As a consequence, a resource request that would have been granted in the leak-free situation may not be granted when the system suffers from memory resource leaks. This model accommodates both the leak-free case and the leak-present case. The model relates system degradation to resource requests, releases or resource holding intervals, and memory leaks. These quantities can be monitored and modeled directly from obtainable data measurements (19).
Avritzer and Weyuker (10) monitor production traffic data of a large telecommunication system and describe a rejuvenation strategy that increases system availability and minimizes packet loss. Cassidy et al. (21) have developed an approach to rejuvenation for large online transaction processing servers. They monitor various system parameters over a period of time. Using pattern recognition methods, they come to the conclusion that 13 of those parameters deviate from normal behavior just before a crash, providing sufficient warning to initiate rejuvenation.
4 Implementation of a Software Rejuvenation Agent
The first commercial version of a software rejuvenation agent (SRA) for the IBM xSeries line of cluster servers has been implemented with our collaboration (14–16). The SRA was designed to monitor consumable resources, estimate the time to exhaustion of those resources, and generate alerts to the management infrastructure when the time to exhaustion is less than a user-defined notification horizon. For Windows operating systems, the SRA acquires data on exhaustible resources by reading the registry performance counters and collecting parameters such as available bytes, committed bytes, nonpaged pool, paged pool, handles, threads, semaphores, mutexes, and logical disk utilization. For Linux, the agent accesses the /proc directory structure and collects equivalent parameters such as memory utilization, swap space, file descriptors, and inodes. All collected parameters are logged to disk. They are also stored in memory in preparation for time-to-exhaustion analysis.
In the current version of the SRA, rejuvenation can be based on elapsed time since the last rejuvenation or on prediction of impending exhaustion. When using timed rejuvenation, a user interface is used to schedule and perform rejuvenation at a period specified by the user. It allows the user to select when to rejuvenate different nodes of the cluster and to select “blackout” times during which no rejuvenation is to be allowed. Predictive rejuvenation relies on curve-fitting analysis and projection of the use of key resources, using recently observed data. The projected data are compared with prespecified upper and lower exhaustion thresholds, within a notification time horizon. The user specifies the notification horizon and the parameters to be monitored (some parameters believed to be highly indicative are always monitored by default), and the agent periodically samples the data and performs the analysis. The prediction algorithm fits several types of curves to the data in the fitting window. These different curve types have been selected for their ability to capture different types of temporal trends. A model-selection criterion is applied to choose the “best” prediction curve, which is then extrapolated to the user-specified horizon. The several parameters that are indicative of resource exhaustion are monitored and extrapolated independently. If any monitored parameter exceeds the specified minimum or maximum value within the horizon, a request to rejuvenate is sent to the management infrastructure. In most cases, it is also possible to identify which process is consuming the preponderance of the resource being exhausted, in order to support selective rejuvenation of just the offending process or a group of processes.
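The predictive path described above (fit candidate curves, select one, extrapolate to the horizon, compare with thresholds) might look roughly like the sketch below. The text does not name the curve types or the model-selection criterion used by the SRA, so the two candidates (constant and straight line) and the residual-variance criterion are assumptions made purely for illustration, as are all function names.

```python
def fit_constant(ts, ys):
    """Candidate 1: a flat line at the sample mean."""
    mean = sum(ys) / len(ys)
    return lambda t: mean


def fit_linear(ts, ys):
    """Candidate 2: an ordinary least-squares straight line."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    stt = sum((t - mt) ** 2 for t in ts)
    slope = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / stt
    return lambda t: my + slope * (t - mt)


def select_model(ts, ys):
    """Pick the candidate with the smallest residual variance (assumed criterion)."""
    candidates = [(1, fit_constant(ts, ys)), (2, fit_linear(ts, ys))]

    def score(item):
        n_params, model = item
        rss = sum((y - model(t)) ** 2 for t, y in zip(ts, ys))
        return rss / (len(ts) - n_params)

    return min(candidates, key=score)[1]


def needs_rejuvenation(ts, ys, horizon, lower=None, upper=None, step=1.0):
    """True if the selected curve crosses a threshold within the horizon."""
    model = select_model(ts, ys)
    t = ts[-1]
    while t <= ts[-1] + horizon:
        value = model(t)
        if lower is not None and value < lower:
            return True
        if upper is not None and value > upper:
            return True
        t += step
    return False
```

Each monitored parameter would be passed through `needs_rejuvenation` independently, mirroring the independent extrapolation described in the text.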
5 Approaches and Methods of Software Rejuvenation
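Software rejuvenation can be divided broadly into two approaches as follows:
Open-loop approach: In the open-loop approach, rejuvenation is performed without feedback from the system, for example, purely on the basis of elapsed time, as in the periodic policies described earlier.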
Closed-loop approach: In the closed-loop approach, rejuvenation is performed based on information on the system “health.” The system is monitored continuously (in practice, at small deterministic intervals), and data are collected on the operating system resource usage and system activity. These data are then analyzed to estimate the time to exhaustion of a resource that may lead to degradation or a crash of a component or of the entire system. This estimation can be based purely on time, independent of workload (3, 14), or it can be based on both time and system workload (37, 38).
The closed-loop approach can be further classified based on whether the data analysis is done offline or online. Offline data analysis is done based on system data collected over a period of time (usually weeks or months). The analysis is done to estimate time to rejuvenation. This offline analysis approach is best suited for systems whose behavior is fairly deterministic (37, 38). The online closed-loop approach, on the other hand, performs online analysis of system data collected at deterministic intervals (14). Another approach to estimate the optimal time to rejuvenation could be based on system failure data (34).
This classification of approaches to rejuvenation is shown in Fig. 7.
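As an illustration of the time-based, workload-independent estimate used in the closed-loop approach, the following sketch derives a time to exhaustion from periodically sampled resource usage. The trend is estimated with the median of all pairwise slopes (Sen's slope), chosen here as one robust option rather than as the method of any particular cited study; the function names and the `capacity` parameter are hypothetical.

```python
from itertools import combinations
from statistics import median

def sen_slope(ts, ys):
    """Median of all pairwise slopes; robust to occasional outlier samples."""
    return median((y2 - y1) / (t2 - t1)
                  for (t1, y1), (t2, y2) in combinations(zip(ts, ys), 2)
                  if t2 != t1)

def time_to_exhaustion(ts, ys, capacity):
    """Estimated time until usage reaches capacity, or None if usage is not growing."""
    slope = sen_slope(ts, ys)
    if slope <= 0:
        return None
    return max(0.0, (capacity - ys[-1]) / slope)
```

Because the median ignores a minority of extreme pairwise slopes, a single spurious sample does not change the estimate much, which matters when the samples come from a live system.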
Rejuvenation is a very general proactive fault management approach and can be performed at different levels: the system level or the application level. An example of system-level rejuvenation is a hardware reboot. At the application level, rejuvenation is performed by stopping and restarting a particular offending application, process, or group of processes; this is also known as partial rejuvenation. When performed on a single node, these rejuvenation approaches can lead to undesired and often costly downtime. Rejuvenation has recently been extended to cluster systems, in which two or more nodes work together as a single system (14, 16). In this case, rejuvenation can be performed with little or no downtime by failing over applications to a spare node.
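The order of operations in cluster rejuvenation with failover can be sketched as follows. This is a toy model, not the mechanism of any product described in this article: the `placement` dictionary, the action tuples, and the policy of letting each freshly rejuvenated node become the next spare are all assumptions made for illustration.

```python
def rolling_rejuvenation(placement, spare):
    """Rejuvenate every node once, failing applications over to an idle spare first.

    placement maps a node name to the list of applications it hosts.
    Returns (actions, final_placement, final_spare).
    """
    placement = {node: list(apps) for node, apps in placement.items()}
    placement.setdefault(spare, [])
    # Visit the active nodes first and the original spare last, so that the
    # original spare (which ends up hosting applications) is rejuvenated too.
    order = [n for n in placement if n != spare] + [spare]
    actions = []
    for node in order:
        # Move the applications away so they keep running during the restart.
        placement[spare].extend(placement[node])
        placement[node] = []
        actions.append(("failover", node, spare))
        actions.append(("rejuvenate", node))
        # The rejuvenated, now idle node serves as the spare for the next step.
        spare = node
    return actions, placement, spare
```

In this model only one node is out of service at any time, and the only interruption seen by an application is the failover itself.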
6 Conclusions
In this article, various analytical models for software aging and for determining optimal times to perform rejuvenation were described. Measurement-based models built from data collected from operating systems were also discussed. The implementation of a software rejuvenation agent in a major commercial server was then briefly described. Finally, various approaches to rejuvenation and rejuvenation granularity were discussed.
In the measurement-based models presented in this article, aging has been captured only for each resource individually. In the future, the algorithm used for aging detection could be improved to consider multiple parameters simultaneously, for better prediction capability and fewer false alarms. Dependencies between the various system parameters could also be studied. The best statistical data analysis method for a given system is yet to be determined.
Bibliography
1. High-availability computer systems, IEEE Computer, 1991, pp. 39–48.
2. Software defects and their impact on system availability – A study of field failures in operating systems, Proc. 21st IEEE Int'l. Symposium on Fault-Tolerant Computing, 1991, pp. 2–9.
3. A methodology for detection and estimation of software aging, Proc. of 9th Int'l. Symposium on Software Reliability Engineering, Paderborn, Germany, 1998, pp. 282–292.
4. Software rejuvenation: Analysis, module and applications, Proc. of 25th Symposium on Fault Tolerant Computing, FTCS-25, Pasadena, California, 1995, pp. 381–390.
5. Does code decay? Assessing the evidence from change management data, IEEE Trans. Software Eng., 27(1): 1–12, 2001.
6. Software Aging, Proc. 16th Int'l. Conf. on Software Engineering, Sorrento, Italy, 1994, pp. 279–287.
7. Available: http://www.softwarerejuvenation.com.
8. A workload-based analysis of software aging and rejuvenation, IEEE Trans. Reliability, 54(3): 541–548, 2005.
9. Text of Seminar Delivered by Mr. Bernstein. University Learning Center, George Mason University, January 29, 1996.
10. Monitoring Smoothly Degrading Systems for Increased Dependability. Empirical Software Eng. J., 2(1): 59–77, 1997.
11. On-board preventive maintenance: Analysis of effectiveness and optimal duty period, 3rd Int'l. Workshop on Object Oriented Real-time Dependable Systems, Newport Beach, CA, 1997.
12. Software Rejuvenation. CrossTalk – J. Defense Software Eng., August 2004.
13. Fatal error: How Patriot overlooked a Scud, Science, 1347, 1992.
14. Proactive management of software aging, IBM J. R&D, 45(2), 2001.
15. IBM Netfinity Director Software Rejuvenation – White Paper, Research Triangle Park, NC: IBM Corp., Jan. 2001.
16. Analysis and implementation of software rejuvenation in cluster systems, Proc. of the Joint Int'l. Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, 2001.
17. Software rejuvenation policies for cluster systems under varying workload, Proc. of Tenth Int'l. Pacific Rim Dependable Computing Symp., PRDC 2004, Papeete, Tahiti, French Polynesia, 2004.
18.
19. An approach to estimation of software aging in a web server, Proc. of the Int'l. Symp. on Empirical Software Engineering, ISESE 2002, Nara, Japan, 2002.
20. Available: http://www.apache.org.
21. Advanced pattern recognition for detection of complex software aging in online transaction processing servers, Proc. of DSN 2002, Washington D.C., 2002.
22. Rejuvenation and failure detection in partitionable systems, Proc. of the Pacific Rim Int'l. Symposium on Dependable Computing, PRDC 2001, Seoul, South Korea, 2001.
23. Modeling and analysis of software rejuvenation in cable modem termination system, Proc. of the Int'l. Symp. on Software Reliability Engineering, ISSRE 2002, Annapolis, MD, 2002.
24. Preemptive module replacement using the virtualizing operating system, Proc. of the Workshop on Self-Healing, Adaptive and Self-Managed Systems, SHAMAN 2002, New York, NY, 2002.
25. Closed loop design for software rejuvenation, Proc. of the Workshop on Self-Healing, Adaptive and Self-Managed Systems, SHAMAN 2002, New York, NY, 2002.
26. Microreboot, A technique for cheap recovery, Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004.
27. Analysis of software rejuvenation using Markov regenerative stochastic Petri net, Proc. of the Sixth Int'l. Symposium on Software Reliability Engineering, Toulouse, France, 1995, pp. 180–187.
28. Time and load based software rejuvenation: Policy, evaluation and optimality, Proc. of the First Fault-Tolerant Symposium, Madras, India, 1995.
29. Minimizing completion time of a program by checkpointing and rejuvenation, Proc. 1996 ACM SIGMETRICS Conference, Philadelphia, PA, 1996, pp. 252–261.
30. Optimal rejuvenation for tolerating soft failures, Perform. Eval., 27 & 28: 491–506, 1996.
31. Analysis of preventive maintenance in transactions based software systems, IEEE Trans. Comput., 47(1): 96–107, 1998.
32. Fine grained software degradation models for optimal rejuvenation policies, Perform. Eval., 46: 45–62, 2001.
33. Analysis of software cost models with rejuvenation, Proc. of the 5th IEEE International Symposium on High Assurance Systems Engineering, HASE 2000, Albuquerque, NM, 2000.
34. Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, Proc. of the 2000 Pacific Rim International Symposium on Dependable Computing, PRDC 2000, Los Angeles, CA, 2000.
35. Probability and Statistics, with Reliability, Queuing and Computer Science Applications, 2nd ed., New York: Wiley, 2001.
36. SPNP: Stochastic Petri Net Package. Version 6.0. et al. (eds.), TOOLS 2000, Lecture Notes in Computer Science 1786, Heidelberg: Springer-Verlag, 2000, pp. 354–357.
37. A measurement-based model for estimation of resource exhaustion in operational software systems, Proc. of the Tenth IEEE Int'l. Symposium on Software Reliability Engineering, Boca Raton, Florida, 1999, pp. 84–93.
38. A comprehensive model for software rejuvenation, IEEE Trans. on Dependable and Secure Computing, Apr. 2005 (in press).
39. Composite performance and dependability analysis, Perform. Eval., 14(3–4): 197–216, 1992.
40.
Further Reading
Optimizing Preventive Service of the Software Products, IBM J. R&D, 28(1): 2–14, 1984.