Software Aging and Rejuvenation

  1. Kishor S. Trivedi (1)
  2. Kalyanaraman Vaidyanathan (2)

Published Online: 14 DEC 2007

DOI: 10.1002/9780470050118.ecse394

Wiley Encyclopedia of Computer Science and Engineering

How to Cite

Trivedi, K. S. and Vaidyanathan, K. 2007. Software Aging and Rejuvenation. Wiley Encyclopedia of Computer Science and Engineering.

Author Information

  1. Duke University, Durham, North Carolina
  2. Scalable Systems Group, Sun Microsystems, Inc., San Diego, California

1 Introduction

Several studies have shown that outages in computer systems are caused more often by software faults than by hardware faults (1, 2). Recent studies have also reported the phenomenon of “software aging” (3, 4), in which the state of the software degrades with time. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation, which may eventually lead to performance degradation of the software, crash/hang failure, or both. Some common examples of “software aging” are memory bloating and leaking, unreleased file locks, data corruption, storage space fragmentation, and accumulation of round-off errors (3). Aging has been observed not only in software used on a mass scale but also in specialized software used in high-availability and safety-critical applications (4). This type of aging in operational software systems is different from code decay caused by maintenance (5, 6). The former results in performance problems, system slowdowns, and crashes, whereas the latter results in unrunnable or invalid software and maintenance-induced bugs.

Because aging leads to transient failures in software systems, environment diversity, a software fault-tolerance technique, can be employed proactively to prevent degradation or crashes. This involves occasionally stopping the running software, “cleaning” its internal state or its environment, and restarting it. Such a technique, known as “software rejuvenation,” was proposed by Huang et al. (4, 7, 8); it counteracts the aging phenomenon in a proactive manner by removing the accumulated error conditions and freeing up operating system resources. Garbage collection, flushing operating system kernel tables, and reinitializing internal data structures are some examples of ways in which the internal state or the environment of the software can be cleaned.

Software rejuvenation has been implemented in the AT&T billing applications (4). An extreme example of system-level rejuvenation, proactive hardware reboot, has been implemented in the real-time system collecting billing data for most telephone exchanges in the United States (9). Occasional reboot is also performed in the AT&T telecommunications switching software (10). On reboot, called software capacity restoration, the service rate is restored to its peak value. On-board preventive maintenance in spacecraft, which maximizes the probability of successful mission completion, has been proposed and analyzed by Tai et al. (11). These operations, called operational redundancy, are invoked whether or not faults exist. Proactive fault management was also recommended for the Patriot missile software system (12, 13). A warning was issued saying that a very long running time could affect the targeting accuracy. This decrease in accuracy was evidently due to overflow in the counter keeping track of time, during conversion from integer to real numbers. The longer the system ran continuously, the larger the error became. The warning, however, failed to inform the troops how many hours “very long” was and that it would help to switch the computer system off and on every eight hours. This example illustrates the necessity and the use of proactive fault management even in safety-critical systems. More recently, rejuvenation has been implemented in cluster systems to improve performance and availability (14-17). Two kinds of policies have been implemented, both taking advantage of the cluster failover feature. In the periodic policy, rejuvenation of the cluster nodes is done in a rolling fashion after every deterministic interval. In the prediction-based policy, the time to rejuvenate is estimated based on the collection and statistical analysis of system data. The implementation and analysis are described in detail in Refs. 13 and 15.

A software rejuvenation feature known as process recycling has been implemented in the Microsoft IIS 5.0 web server software (18). The popular web server software Apache implements a form of rejuvenation by killing and recreating processes after a certain number of requests have been served (19, 20). Software rejuvenation is also implemented in specialized transaction processing servers (21). Rejuvenation has also been proposed for cable and DSL modem gateways (22), in Motorola's Cable Modem Termination System (23), and in middleware applications (24) for failure detection and prevention. Automated rejuvenation strategies have been proposed in the context of self-healing and autonomic computing systems (25). Recently, recursive restarts and micro-reboots have been proposed to increase availability (26). Software rejuvenation (preventive maintenance) incurs an overhead (in terms of performance, cost, and downtime), which should be balanced against the loss incurred due to an unexpected outage caused by a failure. Thus, an important research issue is to determine the optimal times to perform rejuvenation.

Here, we present two approaches for analyzing software aging and studying aging-related failures. The rest of this article is organized as follows. The next section describes various analytical models for software aging and for determining optimal times to perform rejuvenation. Measurement-based models are dealt with next, followed by a discussion of the implementation of a software rejuvenation agent in a major commercial server and of various approaches and methods of rejuvenation. The article concludes with pointers to future work.

2 Analytic Models for Software Rejuvenation

The aim of analytic modeling is to determine the optimal times to perform rejuvenation, that is, the times that maximize availability, minimize the probability of loss, or minimize the mean response time of a transaction (in the case of a transaction processing system). The last measure is particularly important for business-critical applications, for which adequate response time can be as important as system uptime. The analysis is done for different kinds of software systems exhibiting varied failure/aging characteristics.

The accuracy of a model-based approach is determined by the assumptions made in capturing aging. In Refs. 4, 11, 27-29, only the failures causing unavailability of the software are considered, whereas in Ref. 30 only a gradually decreasing service rate of software that serves transactions is assumed. Garg et al. (31), however, consider both these effects of aging together in a single model. Models proposed in Refs. 4, 27, 28 are restricted to hypo-exponentially distributed time to failure. Those proposed in Refs. 11, 29, 30 can accommodate general distributions but only for the specific aging effect they capture. A generally distributed time to failure, as well as a service rate that is an arbitrary function of time, is allowed in Ref. 31. It has been noted (2) that transient failures are partly caused by overload conditions. Only the model presented by Garg et al. (31) captures the effect of load on aging. Existing models also differ in the measures being evaluated. In Refs. 11 and 29, software with a finite mission time is considered. In Refs. 4, 27, 28, 31, measures of interest in transaction-based software intended to run forever are evaluated.

Bobbio et al. (32) present fine-grained software degradation models, where one can identify the current degradation level based on the observation of a system parameter. Optimal rejuvenation policies based on a risk criterion and an alert threshold are then presented. Dohi et al. (33, 34) present software rejuvenation models based on semi-Markov processes. The models are analyzed for optimal rejuvenation strategies based on cost as well as steady-state availability. Given sample failure-time data, statistical non-parametric algorithms based on the total time on test transform are presented to obtain the optimal rejuvenation interval.

2.1 Basic Model for Rejuvenation

Figure 1 shows the basic software rejuvenation model proposed by Huang et al. (4). The software system is initially in a “robust” working state, 0. As time progresses, it eventually transits to a “failure-probable” state, 1. The system is still operational in this state, but can fail (move to state 2) with a non-zero rate. The system can be repaired and brought back to the initial state, 0. The software system is also rejuvenated at regular intervals from the failure-probable state 1 and brought back to the robust state 0.

Figure 1. State transition diagram for rejuvenation.

Huang et al. (4) assume that the stochastic behavior of the system can be described by a simple homogeneous continuous-time Markov chain (CTMC) (35). The CTMC is then analyzed and the expected system downtime and the expected cost per unit time in the steady state are computed. An optimal rejuvenation interval that minimizes expected downtime (or expected cost) is obtained.
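As a rough illustration of this kind of steady-state analysis, the sketch below solves the balance equations of a small CTMC in closed form. It assumes that rejuvenation is modeled as a separate fourth state (state 3) entered from state 1; that assumption, the rate parameter names, and the cost parameters are hypothetical choices made for this example and are not taken from Ref. 4.

```python
def steady_state(r_age, r_fail, r_repair, r_rejuv_trigger, r_rejuv_done):
    """Steady-state probabilities (p0, p1, p2, p3) of an assumed 4-state CTMC.

    States: 0 robust, 1 failure-probable, 2 failed, 3 rejuvenating.
    State 3 and all parameter names are assumptions made for this sketch.
    All rates are per unit time and must be positive, except
    r_rejuv_trigger, which may be 0 (no rejuvenation).
    """
    # Balance equations, each written relative to p0:
    #   p1 * (r_fail + r_rejuv_trigger) = p0 * r_age
    #   p2 * r_repair                   = p1 * r_fail
    #   p3 * r_rejuv_done               = p1 * r_rejuv_trigger
    w1 = r_age / (r_fail + r_rejuv_trigger)
    w2 = w1 * r_fail / r_repair
    w3 = w1 * r_rejuv_trigger / r_rejuv_done
    total = 1.0 + w1 + w2 + w3  # normalization so probabilities sum to 1
    return (1.0 / total, w1 / total, w2 / total, w3 / total)


def expected_downtime(probs, interval):
    """Expected time spent failed or rejuvenating during `interval`."""
    return (probs[2] + probs[3]) * interval


def expected_cost(probs, interval, cost_fail, cost_rejuv):
    """Expected cost over `interval`, given cost per unit time in states 2 and 3."""
    return (probs[2] * cost_fail + probs[3] * cost_rejuv) * interval
```

Evaluating `expected_downtime` or `expected_cost` over a range of `r_rejuv_trigger` values is one simple way to search numerically for a rejuvenation rate that minimizes the chosen measure under these assumptions.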

The CTMC model can readily be extended to incorporate a periodic rejuvenation schedule and general (non-exponential) transition distributions. Dohi et al. (33, 34) developed semi-Markov models with periodic rejuvenation and general transition distribution functions. Garg et al. (27) developed a Markov Regenerative Stochastic Petri Net (MRSPN) model in which rejuvenation is performed at deterministic intervals, assuming that the failure-probable state 1 is not observable.

2.2 Software Rejuvenation in Transactions-Based Software Systems

In Ref. 31, Garg et al. consider a transaction-based software system whose macro-states representation is presented in Fig. 2. The state in which the software is available for service (albeit with decreasing service rate) is denoted as state A. After failure, a recovery procedure is started. In state B, the software is recovering from failure and is unavailable for service. Lastly, the software occasionally undergoes rejuvenation, denoted by state C. Rejuvenation is allowed only from state A. Once recovery from failure or rejuvenation is complete, the software is reset to state A and is as good as new. From this moment, which constitutes a renewal, the whole process stochastically repeats itself.

Figure 2. Macro-states representation of the software behavior.

The system consists of a server-type software to which transactions arrive at a constant rate. The effect of aging in the model may be captured by using decreasing service rate and increasing failure rate, where the decrease or the increase respectively can be a function of time, instantaneous load, mean accumulated load, or a combination of the above.

Two policies that can be used to determine the time to perform rejuvenation are considered. Under policy I, which is purely time-based, rejuvenation is initiated after a constant time δ has elapsed since the software was started (or restarted). Under policy II, which is based on instantaneous load and time, a constant waiting period δ must elapse before rejuvenation is attempted. After this time, rejuvenation is initiated if and only if there are no transactions in the system. Otherwise, the software waits until the queue is empty, upon which rejuvenation is initiated. The goal of the analysis is to determine optimal values of δ (the rejuvenation interval under policy I and the rejuvenation wait under policy II) for different objective functions such as the availability, the loss probability, and the mean response time.
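The two triggering rules can be written as a small decision function. This is only a sketch of the rules as stated in the text; the function name, the policy labels "I" and "II" as strings, and the parameter names are hypothetical.

```python
def should_rejuvenate(policy, elapsed, delta, queue_length):
    """Decide whether rejuvenation should start now.

    policy       -- "I" (purely time-based) or "II" (time and instantaneous load)
    elapsed      -- time since the software was last started or restarted
    delta        -- rejuvenation interval (policy I) or rejuvenation wait (policy II)
    queue_length -- number of transactions currently in the system
    """
    if policy == "I":
        # Policy I: rejuvenate as soon as delta has elapsed, regardless of load.
        return elapsed >= delta
    if policy == "II":
        # Policy II: after delta has elapsed, wait until the queue is empty.
        return elapsed >= delta and queue_length == 0
    raise ValueError("policy must be 'I' or 'II'")
```

Under policy II, a caller would simply re-evaluate the function whenever a transaction completes, so that rejuvenation starts at the first moment the queue is empty after the wait δ.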

2.3 Software Rejuvenation in a Cluster System

Software rejuvenation has been applied to cluster systems (14, 16), where it significantly improves cluster system availability and productivity. The Stochastic Reward Net (SRN) model of a cluster system employing simple time-based rejuvenation is shown in Fig. 3. The cluster consists of n nodes, which are initially in a “robust” working state, Pup. The aging process is modeled as a two-stage hypo-exponential distribution (increasing failure rate) (35) with transitions Tfprob and Tnoderepair. Place Pfprob represents a “failure-probable” state in which the nodes are still operational. The nodes can then eventually transit to the fail state, Pnodefail1. A node can be repaired through the transition Tnoderepair, with a coverage c. In addition to individual node failures, there is also a common-mode failure (transition Tcmode). The system is also considered down when there are a (a ≤ n) individual node failures. The system is repaired through the transition Tsysrepair.

Figure 3. SRN model of a cluster system employing simple time-based rejuvenation.

In the simple time-based policy, rejuvenation is done successively for all the operational nodes in the cluster, at the end of each deterministic interval. The transition Trejuvinterval fires every δ time units, depositing a token in place Pstartrejuv. Only one node can be rejuvenated at any time (at places Prejuv1 or Prejuv2). Weight functions are assigned such that the probability of selecting a token from Pup or Pfprob is directly proportional to the number of tokens in each. After a node has been rejuvenated, it goes back to the “robust” working state, represented by place Prejuved, which is a clone place for Pup that distinguishes the nodes that are waiting to be rejuvenated from the nodes that have already been rejuvenated. A node, after rejuvenation, is then allowed to fail with the same rates as before rejuvenation, even when another node is being rejuvenated. Clone places for Pup and Pfprob are needed to capture this behavior. Node repair is disabled during rejuvenation. Rejuvenation is complete when the sum of nodes in places Prejuved, Pfprobrejuv, and Pnodefail2 is equal to the total number of nodes, n. In this case, the immediate transition Timmd10 fires, putting back all the rejuvenated nodes in places Pup and Pfprob. Rejuvenation stops when there are a−1 tokens in place Pnodefail2, to prevent a system failure. The clock resets itself when rejuvenation is complete and is disabled when the system is undergoing repair. Guard functions (g1 through g7) are assigned to express complex enabling conditions textually.

For the analysis, the following values are assumed. The mean times spent in places Pup and Pfprob are 240 hrs and 720 hrs, respectively. The mean times to repair a node, to rejuvenate a node, and to repair the system are 30 mins, 10 mins, and 4 hrs, respectively. In this analysis, the common-mode failure is disabled and node failure coverage is assumed to be perfect. All the models were solved using the SPNP (Stochastic Petri Net Package) tool (36). The measures computed were expected downtime and the expected cost incurred over a fixed time interval. It is assumed that the cost incurred due to node rejuvenation is much less than the cost of a node or system failure since rejuvenation can be done at predetermined or scheduled times. In our analysis, we fix the value for costnodefail at $5,000/hr and the costrejuv at $250/hr. The value of costsysfail is computed as the number of nodes, n, times costnodefail.

Figure 4 shows the plots for an 8/1 configuration (8 nodes including 1 spare) system employing simple time-based rejuvenation. The upper and lower plots show the expected cost incurred and the expected downtime (in hours), respectively, in a given time interval, versus the rejuvenation interval (time between successive rejuvenations) in hours. If the rejuvenation interval is close to zero, the system is always rejuvenating and thus incurs high cost and downtime. As the rejuvenation interval increases, both expected downtime and cost incurred decrease and reach an optimum value. If the rejuvenation interval goes beyond the optimal value, system failure has more influence on these measures than rejuvenation. The analysis was repeated for 2/1, 8/2, 16/1, and 16/2 configurations. For time-based rejuvenation, the optimal rejuvenation interval was 100 hours for the 1-spare clusters, and approximately 1 hour for the 2-spare clusters.

Figure 4. Results for an 8/1 cluster system employing time-based rejuvenation.

3 Measurement-Based Models for Software Rejuvenation

Whereas all the analytical models are based on the assumption that the rate of software aging is known, in the measurement-based approach, the basic idea is to monitor and collect data on the attributes responsible for determining the health of the executing software. The data is then analyzed to obtain predictions about possible impending failures due to resource exhaustion.

In this section, we describe the measurement-based approach for detection and validation of the existence of software aging. The basic idea is to periodically monitor and collect data on the attributes responsible for determining the health of the executing software, in this case the UNIX operating system. Garg et al. (3) propose an approach for detection and estimation of aging in the UNIX operating system. An SNMP-based distributed resource monitoring tool was used to collect operating system resource usage and system activity data from nine heterogeneous UNIX workstations connected by an Ethernet LAN at the Department of Electrical and Computer Engineering at Duke University. A central monitoring station runs the manager program, which sends get requests periodically to each of the agent programs running on the monitored workstations. The agent programs, in turn, obtain data for the manager from their respective machines by executing various standard UNIX utility programs like pstat, iostat, and vmstat. For quantifying the effect of aging in operating system resources, the metric Estimated time to exhaustion is proposed.

In the time-based estimation method presented by Garg et al. (3), data was collected from the UNIX machines at intervals of 15 minutes for about 53 days. Time-ordered values for each monitored object are obtained, constituting a time series for that object. The objective is to detect aging or a long-term trend (increasing or decreasing) in the values. Only results for the data collected from the machine Rossby are discussed here.

First, the trends in operating system resource usage and system activity are detected using smoothing of observed data by robust locally weighted regression, proposed by Cleveland (3). This technique is used to get the global trend between outages by removing the local variations. Then, the slope of the trend is estimated in order to do prediction. Figure 5 shows the smoothed data superimposed on the original data points from the time series of objects for Rossby. Amount of real memory free (plot 1) shows an overall decrease, whereas file table size (plot 2) shows an increase. Plots of some other resources not discussed here also showed an increase or decrease, which corroborates the hypothesis of aging with respect to various objects.

Figure 5. Non-parametric regression smoothing for Rossby objects.

The seasonal Kendall test (3) was applied to each of these time series to detect the presence of any global trends at a significance level, α, of 0.05. With Zα = 1.96, all values are such that the null hypothesis (H0) that no trend exists is rejected for the variables considered. Given that a global trend is present and that its slope is calculated for a particular resource, the time at which the resource will be exhausted because of aging alone is estimated. Table 1 refers to several objects on Rossby and lists an estimate of the slope (change per day) of the trend obtained by applying Sen's slope estimate for data with seasons (3). The values for real memory and swap space are in kilobytes.

Table 1. Estimated slope and time to exhaustion for Rossby, Velum, and Jefferson objects

Resource Name        Initial Value   Max Value   Sen's Slope Estimation   95% Confidence Interval   Estimated Time to Exh. (days)
Rossby
Real Memory Free     40814.17        84980       −252.00                  −287.75 : −219.34         161.96
File Table Size      220             7110        1.33                     1.30 : 1.39               5167.50
Process Table Size   57              2058        0.43                     0.41 : 0.45               4602.30
Used Swap Space      39372           312724      267.08                   220.09 : 295.50           1023.50
Jefferson
Real Memory Free     67638.54        114608      −972.00                  −1006.81 : −939.08        69.59
File Table Size      268.83          7110        1.33                     1.30 : 1.38               5144.36
Process Table Size   67.18           2058        0.30                     0.29 : 0.31               6696.41
Used Swap Space      47148.02        524156      577.44                   545.69 : 603.14           826.07
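The trend test and slope estimate can be illustrated with a simplified sketch. Note that the code below implements the ordinary (non-seasonal) Mann-Kendall test and the ordinary Sen's slope, without tie correction, rather than the seasonal variants used in Ref. 3; the function names are hypothetical.

```python
import math
from statistics import median


def mann_kendall_z(values):
    """Z statistic of the (non-seasonal) Mann-Kendall trend test.

    Simplification for illustration: no seasons and no tie correction.
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def sen_slope(values):
    """Sen's slope: median of all pairwise slopes (per sampling step)."""
    n = len(values)
    slopes = [(values[j] - values[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)


def has_trend(values, z_critical=1.96):
    """Reject the no-trend hypothesis when |Z| exceeds the critical value."""
    return abs(mann_kendall_z(values)) > z_critical
```

With equally spaced samples, the slope returned by `sen_slope` is expressed per sampling step and would need to be rescaled to obtain a change per day as in Table 1.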

A negative slope, as in the case of real memory, indicates a decreasing trend, whereas a positive slope, as in the case of file table size, is indicative of an increasing trend. Given the slope estimate, the table lists the estimated time to failure of the machine due to aging only with respect to this particular resource. The calculation of the time to exhaustion is done by using the standard linear approximation y = mx + c.
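As a worked instance of this linear approximation, the sketch below computes the time at which a linear trend starting from the initial value reaches a given limit. The function name is hypothetical, and the choice of limit (zero for free memory, the maximum value for used swap space) is an assumption made here for illustration.

```python
def time_to_exhaustion(initial, slope, limit):
    """Time until the line y = slope * x + initial reaches `limit`.

    Returns infinity when the trend never reaches the limit
    (zero slope, or a slope pointing away from the limit).
    """
    if slope == 0:
        return float("inf")
    t = (limit - initial) / slope
    return t if t >= 0 else float("inf")


# Rossby, real memory free (KB): decreasing trend toward 0.
rossby_memory_days = time_to_exhaustion(40814.17, -252.00, 0.0)

# Rossby, used swap space (KB): increasing trend toward the maximum value.
rossby_swap_days = time_to_exhaustion(39372.0, 267.08, 312724.0)
```

With the Table 1 values for Rossby, these two calls give approximately 161.96 and 1023.5 days, in line with the tabulated estimates for real memory and swap space.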

The method discussed in Ref. 3 assumes that accumulated depletion of a resource over a time period depends only on the elapsed time. However, it is intuitive that the rate at which a resource is depleted is dependent on the current workload. In this subsection, we discuss a measurement-based model to estimate the rate of exhaustion of operating system resources as a function of both time and the system workload (37, 38). The SNMP-based distributed resource monitoring tool described previously was used for collecting operating system resource usage and system activity parameters (at 10 min intervals) for over 3 months. Only results for the data collected from the machine Rossby are discussed here. The longest stretch of sample points in which no reboots or failures occurred was used for building the model. A semi-Markov reward model (39) is constructed using the data. First, different workload states are identified using statistical cluster analysis and a state-space model is constructed. Corresponding to each resource, a reward function based on the rate of resource exhaustion in the different states is then defined. Finally, the model is solved to obtain trends and the estimated exhaustion rates and time to exhaustion for the resources.
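One way to picture the final solution step is the standard long-run reward rate of a semi-Markov reward process: the stationary distribution of the embedded chain is weighted by the mean sojourn time and the reward rate of each state. The sketch below illustrates that calculation only; it is not the procedure of Refs. 37 and 38, and the function names and any numbers passed to them are hypothetical.

```python
def embedded_stationary(P, iterations=2000):
    """Stationary distribution of an irreducible embedded Markov chain.

    Uses a damped ("lazy") power iteration, pi <- (pi + pi P) / 2, which has
    the same fixed point as pi = pi P but also converges for periodic chains.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        step = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        pi = [0.5 * (a + b) for a, b in zip(pi, step)]
    return pi


def long_run_reward_rate(P, sojourn, reward):
    """Time-averaged reward rate of a semi-Markov reward process.

    P       -- transition matrix of the embedded chain between workload states
    sojourn -- mean time spent in each state per visit
    reward  -- resource depletion rate while in each state
    """
    pi = embedded_stationary(P)
    weighted_time = [p * h for p, h in zip(pi, sojourn)]
    total_time = sum(weighted_time)
    return sum(w * r for w, r in zip(weighted_time, reward)) / total_time
```

Dividing the remaining amount of a resource by the resulting rate gives a workload-aware estimate of the time to exhaustion, in the same spirit as the purely time-based estimate.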

A methodology based on time-series analysis to detect and estimate resource exhaustion times due to software aging in a web server while subjecting it to an artificial workload is proposed in Ref. 19. The experiments are conducted on an Apache web server running on the Linux platform.

The analysis can be done using two different approaches: (1) building a univariate model for each of the outputs, or (2) building a single multivariate model with seven outputs. In this case, seven univariate models are built and then combined into a single multivariate model. First, the parameters are analyzed to determine their characteristics, and an appropriate model with one output and four inputs is built for each parameter. The four inputs are the connection rate, a linear trend, a periodic series with a period of one week, and a periodic series with a period of one day. The autocorrelation function (ACF) and the partial autocorrelation function (PACF) for the output are computed. The ACF and the PACF help us decide the appropriate model for the data (40). For example, from the ACF and PACF of used swap space, it can be determined that an autoregressive model of order 1 [AR(1)] is suitable for this data series. Adding the inputs to the AR(1) model, we get the ARX(1) model for used swap space:

  $$Y_t = a\,Y_{t-1} + b_1 X_t + b_2 L_t + b_3 W_t + b_4 D_t \qquad (1)$$

where Y_t is the used swap space, X_t is the connection rate, L_t is the time step that represents the linear trend, W_t is the weekly periodic series, D_t is the daily periodic series, and a and b_1 through b_4 are the model coefficients. After observing the ACF and PACF of all the parameters, we find that all of the PACFs cut off at certain lags. All the multiple input single output (MISO) models are therefore of the ARX type and differ only in their orders, which makes it convenient to combine them into a multiple input multiple output (MIMO) ARX model, as described later.
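To make the ARX(1) structure concrete, the sketch below iterates a first-order recursion in which the next output is a coefficient times the previous output plus a weighted sum of the four inputs. The coefficient names `a` and `b`, the function names, and the assumption that future input values are available (the trend and periodic series are deterministic, whereas the connection rate would have to be assumed or forecast) are all illustrative choices, not details taken from Ref. 19.

```python
def arx1_step(y_prev, inputs, a, b):
    """One step of an ARX(1) model: y_t = a * y_{t-1} + sum_k b[k] * inputs[k].

    inputs -- (connection rate, linear trend, weekly series, daily series) at time t
    a, b   -- hypothetical coefficients, e.g. obtained by least squares
    """
    return a * y_prev + sum(bk * uk for bk, uk in zip(b, inputs))


def arx1_forecast(y_last, future_inputs, a, b):
    """Multi-step-ahead forecast obtained by iterating the one-step model.

    future_inputs -- one input tuple per future time step
    Returns the list of predicted outputs, one per step.
    """
    predictions = []
    y = y_last
    for u in future_inputs:
        y = arx1_step(y, u, a, b)  # feed each prediction back as y_{t-1}
        predictions.append(y)
    return predictions
```

A k-step-ahead prediction is simply the last element returned by `arx1_forecast` when it is given k future input tuples.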

In order to combine the MISO ARX models into a MIMO ARX model, we need to choose the order between different outputs. This is done by inspecting the cross-correlation function (CCF) between each pair of outputs to find the leading relationship between them. If the CCF between parameters A and B reaches its peak value at a positive lag k, we say that A leads B by k steps, and it might be possible to use A to predict B. In our analysis, 21 CCFs need to be computed, one for each pair of the seven outputs. To reduce the complexity, we use only the CCFs that exhibit an obvious leading relationship with lags of less than 10 steps. The next step after determination of the orders is to estimate the coefficients of the model by the least squares method. The first half of the data is used to estimate the parameters, and the rest of the data is then used to verify the model. Figure 6 shows the two-hour-ahead (24-step) predicted used swap space, which is computed using the established model and the data measured up to two hours before the predicted time point. From the plots, we can see that the predicted values are very close to the measured values.
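The lead/lag inspection can be sketched as follows: compute the sample cross-correlation between two series at each lag in a small window and report the lag at which it peaks. The function names and the default window of 9 steps (lags of less than 10) are assumptions made for this illustration.

```python
import math


def cross_correlation(a, b, lag):
    """Sample cross-correlation between a[t] and b[t + lag] (equal-length series)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    denom = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                      sum((y - mean_b) ** 2 for y in b))
    if lag >= 0:
        pairs = zip(a[:n - lag], b[lag:])
    else:
        pairs = zip(a[-lag:], b[:n + lag])
    return sum((x - mean_a) * (y - mean_b) for x, y in pairs) / denom


def leading_lag(a, b, max_lag=9):
    """Lag in [-max_lag, max_lag] at which the CCF of a and b peaks.

    A positive result k suggests that series a leads series b by k steps.
    """
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: cross_correlation(a, b, k))
```

For seven outputs, calling `leading_lag` on each of the 21 unordered pairs would give the candidate ordering information described in the text.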

Figure 6. Measured and two-hour-ahead predicted used swap space.

In Ref. 8, a model is developed to account for the gradual loss of system resources, especially the memory resource. In a client-server system, for example, every client process issues memory requests at varying points in time. An amount of memory is granted to each new request (when there is enough memory available), held by the requesting process for a period of time, and presumably released back to the system resource reservoir when it is no longer in use. A memory leak occurs when the amount of allocated memory is not fully released. The available memory space is gradually reduced as such resource leaks accumulate over time. As a consequence, a resource request that would have been granted in the leak-less situation may not be granted when the system suffers from memory resource leaks. This model accommodates both the leak-free case and the leak-present case. The model relates system degradation to resource requests, releases or resource holding intervals, and memory leaks. These quantities can be monitored and modeled directly from obtainable data measurements (19).

Avritzer and Weyuker (10) monitor production traffic data of a large telecommunication system and describe a rejuvenation strategy that increases system availability and minimizes packet loss. Cassidy et al. (21) have developed an approach to rejuvenation for large online transaction processing servers. They monitor various system parameters over a period of time. Using pattern recognition methods, they come to the conclusion that 13 of those parameters deviate from normal behavior just before a crash, providing sufficient warning to initiate rejuvenation.

4 Implementation of a Software Rejuvenation Agent

The first commercial version of a software rejuvenation agent (SRA) for the IBM xSeries line of cluster servers has been implemented with our collaboration (14-16). The SRA was designed to monitor consumable resources, estimate the time to exhaustion of those resources, and generate alerts to the management infrastructure when the time to exhaustion is less than a user-defined notification horizon. For Windows operating systems, the SRA acquires data on exhaustible resources by reading the registry performance counters and collecting parameters such as available bytes, committed bytes, non-paged pool, paged pool, handles, threads, semaphores, mutexes, and logical disk utilization. For Linux, the agent accesses the /proc directory structure and collects equivalent parameters such as memory utilization, swap space, file descriptors and inodes. All collected parameters are logged on to disk. They are also stored in memory preparatory to time-to-exhaustion analysis.
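As an illustration of the kind of data collection described for Linux, the sketch below parses text in the format of `/proc/meminfo` and derives free memory and used swap space. It is not the SRA's implementation, which is not described here; the function names are hypothetical, and only the standard `MemFree`, `SwapTotal`, and `SwapFree` fields of `/proc/meminfo` are relied upon.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Name: value' pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the memory statistics file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


def sample_resources(info):
    """Extract two exhaustible-resource readings from parsed meminfo data."""
    return {
        "free_memory_kb": info["MemFree"],
        "used_swap_kb": info["SwapTotal"] - info["SwapFree"],
    }
```

A monitoring loop would call `sample_resources(read_meminfo())` at a fixed sampling period, append each reading with its timestamp to a log, and keep a recent window in memory for trend analysis.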

In the current version of the SRA, rejuvenation can be based on elapsed time since the last rejuvenation or on prediction of impending exhaustion. When using timed rejuvenation, a user interface is used to schedule and perform rejuvenation at a period specified by the user. It allows the user to select when to rejuvenate different nodes of the cluster, and to select “blackout” times during which no rejuvenation is to be allowed. Predictive rejuvenation relies on curve-fitting analysis and projection of the use of key resources, using recently observed data. The projected data is compared with prespecified upper and lower exhaustion thresholds, within a notification time horizon. The user specifies the notification horizon and the parameters to be monitored (some parameters believed to be highly indicative are always monitored by default), and the agent periodically samples the data and performs the analysis. The prediction algorithm fits several types of curves to the data in the fitting window. These different curve types have been selected for their ability to capture different types of temporal trends. A model-selection criterion is applied to choose the “best” prediction curve, which is then extrapolated to the user-specified horizon. The several parameters that are indicative of resource exhaustion are monitored and extrapolated independently. If any monitored parameter exceeds the specified minimum or maximum value within the horizon, a request to rejuvenate is sent to the management infrastructure. In most cases, it is also possible to identify which process is consuming the preponderance of the resource being exhausted, in order to support selective rejuvenation of just the offending process or a group of processes.
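The predictive check can be sketched in a much simplified form: fit a curve to the samples in the fitting window, extrapolate it to the notification horizon, and compare the projection with the thresholds. The sketch below fits only a straight line by ordinary least squares, whereas the agent described above fits several curve types and applies a model-selection criterion; the function and parameter names are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares line through (times, values): returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def needs_rejuvenation(times, values, horizon, lower=None, upper=None):
    """Return True if the linear projection crosses a threshold within the horizon.

    times, values -- samples in the fitting window (at least two distinct times)
    horizon       -- notification horizon, in the same units as `times`
    lower, upper  -- optional exhaustion thresholds for the monitored parameter

    Only the end of the horizon is checked: a fitted line is monotonic, so this
    is sufficient when the current fitted value is still within the thresholds.
    """
    slope, intercept = fit_line(times, values)
    projected = slope * (times[-1] + horizon) + intercept
    if lower is not None and projected <= lower:
        return True
    if upper is not None and projected >= upper:
        return True
    return False
```

In a fuller version, each monitored parameter would be checked independently in this way, and a positive result for any of them would trigger a rejuvenation request to the management infrastructure.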

5 Approaches and Methods of Software Rejuvenation

Software rejuvenation can be divided broadly into two approaches as follows:

  • Open-loop approach: In this approach, rejuvenation is performed without any feedback from the system. Rejuvenation, in this case, can be based just on elapsed time (periodic rejuvenation) (4, 27) or instantaneous/cumulative number of jobs on the system (31).

  • Closed-loop approach: In the closed-loop approach, rejuvenation is performed based on information on the system “health.” The system is monitored continuously (in practice, at small deterministic intervals) and data is collected on the operating system resource usage and system activity. This data is then analyzed to estimate the time to exhaustion of a resource that may lead to degradation or a crash of a component or of the entire system. This estimation can be purely time-based and workload-independent (3, 14), or it can be based on both time and system workload (37, 38).

    The closed-loop approach can be further classified by whether the data analysis is done offline or online. Offline analysis uses system data collected over a period of time (usually weeks or months) to estimate the time to rejuvenation. It is best suited for systems whose behavior is fairly deterministic (37, 38). The online closed-loop approach, on the other hand, analyzes system data online as it is collected at deterministic intervals (14). Another approach estimates the optimal time to rejuvenation from system failure data (34).

This classification of approaches to rejuvenation is shown in Fig. 7.

Figure 7. Approaches to software rejuvenation.
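The difference between the two classes can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the units (hours), and the use of a least-squares slope to estimate the time to exhaustion are assumptions, not the estimators used in the cited work.

```python
def open_loop_due(elapsed_hours, period_hours):
    """Open loop: rejuvenate purely on elapsed time, ignoring system state."""
    return elapsed_hours >= period_hours

def time_to_exhaustion(samples, capacity):
    """Closed loop: estimate the hours left until usage reaches `capacity`.

    `samples` is a list of (hour, usage) pairs taken at distinct times.
    The trend is a least-squares slope (an assumed estimator). Returns
    None when usage is not growing, so no exhaustion is predicted.
    """
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    mu = sum(u for _, u in samples) / n
    sxx = sum((t - mt) ** 2 for t, _ in samples)
    slope = sum((t - mt) * (u - mu) for t, u in samples) / sxx
    if slope <= 0:
        return None
    _, last_usage = samples[-1]
    # Project forward from the most recent observation.
    return max(0.0, (capacity - last_usage) / slope)

def closed_loop_due(samples, capacity, lead_time_hours):
    """Rejuvenate when the predicted exhaustion falls within the lead time."""
    tte = time_to_exhaustion(samples, capacity)
    return tte is not None and tte <= lead_time_hours
```

With usage samples `[(0, 100), (1, 110), (2, 120), (3, 130)]` and a capacity of 200, the estimated time to exhaustion is 7 hours, so a closed-loop policy with an 8-hour lead time would trigger while a weekly open-loop policy would not yet be due.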

Rejuvenation is a very general proactive fault management approach and can be performed at different levels: the system level or the application level. An example of system-level rejuvenation is a hardware reboot. At the application level, rejuvenation is performed by stopping and restarting a particular offending application, process, or group of processes; this is also known as partial rejuvenation. When performed on a single node, these rejuvenation approaches can lead to undesired and often costly downtime. Rejuvenation has recently been extended to cluster systems, in which two or more nodes work together as a single system (14, 16). In this case, rejuvenation can be performed with little or no downtime by failing over applications to a spare node.
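The cluster case can be sketched as a rolling procedure in which each node's applications are failed over to a spare before the node is restarted. The `Node` class and `rolling_rejuvenation` function below are hypothetical and only model the ordering of steps; they do not represent any real cluster manager's API.

```python
class Node:
    """Hypothetical cluster node that hosts a list of application names."""

    def __init__(self, name, apps=()):
        self.name = name
        self.apps = list(apps)
        self.rejuvenated = False

    def restart(self):
        # Stand-in for a reboot or process restart; the node must be idle.
        assert not self.apps, "fail over applications before restarting"
        self.rejuvenated = True

def rolling_rejuvenation(active_nodes, spare):
    """Rejuvenate each active node in turn, parking its apps on the spare."""
    assert not spare.apps, "the spare node must start idle"
    log = []
    for node in active_nodes:
        spare.apps, node.apps = node.apps, []   # fail over to the spare
        log.append(f"failover {node.name}->{spare.name}")
        node.restart()                          # rejuvenate the idle node
        log.append(f"restart {node.name}")
        node.apps, spare.apps = spare.apps, []  # fail back to the fresh node
        log.append(f"failback {spare.name}->{node.name}")
    return log
```

Every application stays assigned to some node at each step, which is what keeps downtime small in this model; the time taken by the failover itself is not modeled.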

6 Conclusions

In this article, various analytical models of software aging, and for determining the optimal times to perform rejuvenation, were described. Measurement-based models built on data collected from operating systems were also discussed. The implementation of a software rejuvenation agent in a major commercial server was then briefly described. Finally, various approaches to rejuvenation and rejuvenation granularity were discussed.

In the measurement-based models presented in this article, only aging due to each individual resource has been captured. In the future, one could improve the algorithm used for aging detection to involve multiple parameters simultaneously, for better prediction capability and reduced false alarms. Dependencies between the various system parameters could be studied. The best statistical data analysis method for a given system is also yet to be determined.

End Notes
  • 1

    Although we use the by-now-established phrase “software aging,” it should be clear that no deterioration of the software system per se is implied but rather the software appears to age due to the gradual depletion of resources (8). Likewise, “software rejuvenation” actually refers to rejuvenation of the environment in which the software is executing.

Bibliography

  • 1
    J. Gray and D. P. Siewiorek, High-availability computer systems, IEEE Computer, 1991, pp. 39–48.
  • 2
    M. Sullivan and R. Chillarege, Software defects and their impact on system availability – A study of field failures in operating systems, Proc. 21st IEEE Int'l. Symposium on Fault-Tolerant Computing, 1991, pp. 2–9.
  • 3
    S. Garg, A. van Moorsel, K. Vaidyanathan, and K. Trivedi, A methodology for detection and estimation of software aging, Proc. of 9th Int'l. Symposium on Software Reliability Engineering, Paderborn, Germany, 1998, pp. 282–292.
  • 4
    Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, Software rejuvenation: Analysis, module and applications, Proc. of 25th Symposium on Fault Tolerant Computing, FTCS-25, Pasadena, California, 1995, pp. 381–390.
  • 5
    S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus, Does code decay? Assessing the evidence from change management data, IEEE Trans. Software Eng., 27(1): 1–12, 2001.
  • 6
    D. L. Parnas, Software Aging, Proc. 16th Int'l. Conf. on Software Engineering, Sorrento, Italy, 1994, pp. 279–287.
  • 7
  • 8
    Y. Bao, X. Sun, and K. Trivedi, A workload-based analysis of software aging and rejuvenation, IEEE Trans. Reliability, 54(3): 541–548, 2005.
  • 9
    L. Bernstein, Text of Seminar Delivered by Mr. Bernstein. University Learning Center, George Mason University, January 29, 1996.
  • 10
    A. Avritzer and E. J. Weyuker, Monitoring Smoothly Degrading Systems for Increased Dependability, Empirical Software Eng. J., 2(1): 59–77, 1997.
  • 11
    A. T. Tai, S. N. Chau, L. Alkalaj, and H. Hecht, On-board preventive maintenance: Analysis of effectiveness and optimal duty period, 3rd Int'l. Workshop on Object Oriented Real-time Dependable Systems, Newport Beach, CA, 1997.
  • 12
    L. Bernstein and C. M. R. Kintala, Software Rejuvenation. CrossTalk – J. Defense Software Eng., August 2004.
  • 13
    E. Marshall, Fatal error: How patriot overlooked a scud, Science, 1347, 1992.
  • 14
    V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. Zeggert, Proactive management of software aging, IBM J. R&D, 45(2), 2001.
  • 15
    IBM Netfinity Director Software Rejuvenation – White Paper, Research Triangle Park, NC: IBM Corp., Jan. 2001.
  • 16
    K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi, Analysis and implementation of software rejuvenation in cluster systems, Proc. of the Joint Int'l. Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, 2001.
  • 17
    W. Xie, Y. Hong, and K. S. Trivedi, Software rejuvenation policies for cluster systems under varying workload, Proc. of Tenth Int'l. Pacific Rim Dependable Computing Symp., PRDC 2004, Papeete, Tahiti, French Polynesia, 2004.
  • 18
  • 19
    L. Li, K. Vaidyanathan, and K. S. Trivedi, An approach to estimation of software aging in a web server, Proc. of the Int'l. Symp. on Empirical Software Engineering, ISESE 2002, Nara, Japan, 2002.
  • 20
  • 21
    K. Cassidy, K. Gross, and A. Malekpour, Advanced pattern recognition for detection of complex software aging in online transaction processing servers, Proc. of DSN 2002, Washington D.C., 2002.
  • 22
    C. Fetzer and K. Hostedt, Rejuvenation and failure detection in partitionable systems, Proc. of the Pacific Rim Int'l. Symposium on Dependable Computing, PRDC 2001, Seoul, South Korea, 2001.
  • 23
    Y. Liu, Y. Ma, J. J. Han, H. Levendel, and K. S. Trivedi, Modeling and analysis of software rejuvenation in cable modem termination system, Proc. of the Int'l. Symp. on Software Reliability Engineering, ISSRE 2002, Annapolis, MD, 2002.
  • 24
    T. Boyd and P. Dasgupta, Preemptive module replacement using the virtualizing operating system, Proc. of the Workshop on Self-Healing, Adaptive and Self-Managed Systems, SHAMAN 2002, New York, NY, 2002.
  • 25
    Y. Hong, D. Chen, L. Li, and K. S. Trivedi, Closed loop design for software rejuvenation, Proc. of the Workshop on Self-Healing, Adaptive and Self-Managed Systems, SHAMAN 2002, New York, NY, 2002.
  • 26
    G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox, Microreboot, A technique for cheap recovery, Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004.
  • 27
    S. Garg, A. Puliafito, and K. S. Trivedi, Analysis of software rejuvenation using Markov regenerative stochastic Petri net, Proc. of the Sixth Int'l. Symposium on Software Reliability Engineering, Toulouse, France, 1995, pp. 180–187.
  • 28
    S. Garg, Y. Huang, C. Kintala, and K. S. Trivedi, Time and load based software rejuvenation: Policy, evaluation and optimality, Proc. of the First Fault-Tolerant Symposium, Madras, India, 1995.
  • 29
    S. Garg, Y. Huang, C. Kintala, and K. S. Trivedi, Minimizing completion time of a program by checkpointing and rejuvenation, Proc. 1996 ACM SIGMETRICS Conference, Philadelphia, PA, 1996, pp. 252–261.
  • 30
    A. Pfening, S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, Optimal rejuvenation for tolerating soft failures, Perform. Eval., 27 & 28: 491–506, 1996.
  • 31
    S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, Analysis of preventive maintenance in transactions based software systems, IEEE Trans. Comput., 47(1): 96–107, 1998.
  • 32
    A. Bobbio, A. Sereno, and C. Anglano, Fine grained software degradation models for optimal rejuvenation policies, Perform. Eval., 46: 45–62, 2001.
  • 33
    T. Dohi, K. Goseva-Popstojanova, and K. S. Trivedi, Analysis of software cost models with rejuvenation, Proc. of the 5th IEEE International Symposium on High Assurance Systems Engineering, HASE 2000, Albuquerque, NM, 2000.
  • 34
    T. Dohi, K. Goseva-Popstojanova, and K. S. Trivedi, Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, Proc. of the 2000 Pacific Rim International Symposium on Dependable Computing, PRDC 2000, Los Angeles, CA, 2000.
  • 35
    K. S. Trivedi, Probability and Statistics, with Reliability, Queuing and Computer Science Applications, 2nd ed., New York: Wiley, 2001.
  • 36
    C. Hirel, B. Tuffin, and K. S. Trivedi, SPNP: Stochastic Petri Net Package. Version 6.0. B. R. Haverkort et al. (eds.), TOOLS 2000, Lecture Notes in Computer Science 1786, Heidelberg: Springer-Verlag, 2000, pp. 354–357.
  • 37
    K. Vaidyanathan and K. S. Trivedi, A measurement-based model for estimation of resource exhaustion in operational software systems, Proc. of the Tenth IEEE Int'l. Symposium on Software Reliability Engineering, Boca Raton, Florida, 1999, pp. 84–93.
  • 38
    K. Vaidyanathan and K. S. Trivedi, A comprehensive model for software rejuvenation, IEEE Trans. on Dependable and Secure Computing, Apr. 2005 (in press).
  • 39
    K. S. Trivedi, J. Muppala, S. Woolet, and B. R. Haverkort, Composite performance and dependability analysis, Perform. Eval., 14(3–4): 197–216, 1992.
  • 40
    R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications, New York: Springer-Verlag, 2000.