Fault‐level coverage analysis of multistate cloud‐RAID storage systems

In this paper, a multistate cloud‐RAID (redundant array of independent disks) storage system subject to fault‐level coverage (FLC) is modeled and analyzed. Most existing works on the reliability analysis of cloud‐RAID systems either assume binary‐state storage disks or fail to consider imperfect fault coverage, an inherent behavior of fault‐tolerant systems. This work advances the state of the art by proposing a combinatorial method based on multivalued decision diagrams for analyzing the reliability of a multistate cloud‐RAID system with FLC. FLC is one common type of imperfect fault coverage behavior, where the system fault recovery capability depends on the number of disk faults happening within a certain recovery window. Effects of the functional dependence behavior between the RAID controller and disks are addressed. The method is illustrated through a detailed analysis of an example cloud‐RAID 5 storage system. Numerical results are provided to show the impact of different design parameters on system performance. These results also demonstrate that failure to consider FLC leads to inaccurate system state probabilities, which in turn misleads system design activities based on these probabilities, such as maintenance and optimization.

even leading to the failure of the entire system. Such a behavior is referred to as imperfect fault coverage (IPC), which can greatly affect system performance, particularly system failure probabilities. Therefore, it is important to address this behavior in system modeling and analysis activities. 9,10
Some research efforts have been devoted to the reliability design and modeling of cloud-RAID storage systems. For example, Schnjakin and Meinel used RAID technology with an encryption algorithm to achieve reliability and security of data stored in the cloud. 11 Similarly, Al-Anzi et al suggested a web-based generation of cloud computing for data storage and management, where users' data are encrypted and distributed among various service providers based on RAID technology for higher reliability. 12 Hansen and Archibald proposed a solution combining local and cloud storage in a RAID-like configuration to achieve high data reliability and to increase the amount of storage and the access speed. 13 Sahana et al introduced a data storage technique using file type classification and RAID to improve the reliability and availability of cloud storage systems. 14 Liu and Xing developed a two-level hierarchical methodology combining continuous-time Markov chains and combinatorial multivalued decision diagrams (MDDs) to assess the reliability of cloud-RAID 5 and 6 storage systems with heterogeneous disks from different providers, but without considering the IPC. 15,16 In recent works, 17,18 the effects of the IPC were addressed in the reliability analysis of binary-state cloud storage systems, where the cloud-RAID system and its disks are assumed to be either operational or failed.
To the best of our knowledge, only a few works have addressed the IPC in the reliability analysis of multistate cloud-RAID systems; in particular, Mandava et al addressed an element-level IPC, where the system recovery capability in the case of a disk fault is independent of the states of the other disks in the system. 19 However, in a load-sharing or work-sharing environment, it is more realistic that the fault coverage probability depends on the number of disk faults that have happened within a certain recovery window, which can be modeled by the fault-level IPC or multifault model. 20 In this work, we extend the MDD-based combinatorial method for modeling and evaluating the cloud-RAID system state probabilities and reliability considering the effects of the IPC modeled by fault-level coverage (FLC). A detailed case study on a cloud-RAID 5 system is provided to illustrate the proposed methodology and the effects of different design parameters on the system performance.
The remainder of this paper is organized as follows. Section 2 presents the architecture of a typical multistate cloud-RAID 5 storage system. Section 3 states the preliminary models used in the proposed approach. Section 4 presents the MDD-based combinatorial method for the reliability analysis of cloud-RAID systems considering the effects of multistate behavior and FLC; dependencies of disks on the cloud RAID controller are also addressed. Section 5 presents numerical results of the example system considered. Section 6 gives conclusions and future research directions.
Figure 1 illustrates an example of a cloud-RAID 5 with four disks, where the single-bit parity code is utilized to tolerate the failure of any single disk drive. Data (eg, A) are separated into blocks (eg, A1, A2, A3) that are striped across independent disks forming an array in the cloud-RAID system. 21 All the parity blocks (eg, Ap for A1, A2, A3) are distributed across the multiple disks too. These disks may come from different cloud service providers. The controller manages and controls the disks so that they work together as a logical unit. 22 In other words, the disk drives are functionally dependent on the RAID controller. 23,24
Without considering IPC, a RAID system and its disks can exhibit three disjoint states: Good (G), Degraded (D), and Failed (F). These states and the possible transitions among them can be modeled by a continuous-time Markov chain (CTMC). Based on the CTMC, a set of differential state equations is constructed and then solved using the Laplace transform-based method to obtain the three state probabilities. Further details on the Markov modeling of a single disk and the evaluation of the three state probabilities can be found in the work of Liu and Xing. 16 Due to the IPC, the F state is further classified as a covered failure (CF) or an uncovered failure (UF). 19
The example cloud-RAID 5 system is in state G if at least three out of four disks are in the G state and no disk has suffered a UF. If at least two out of four disks are in the CF state and no disk has suffered a UF, the system is in state CF. Any state between state G and state CF belongs to state D for the system. The system is in state UF if any disk is in the UF state. The system is reliable when it stays in the G or D state. In other words, the system reliability is the probability that the system remains in state G or D during the considered mission time. In this work, the controller is assumed to have perfect fault coverage and to be binary state: operational or failed.
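The state definitions above can be sketched as a small classifier. This is only an illustration for the four-disk example; the function name `system_state` and the string labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def system_state(disk_states):
    """Classify the four-disk cloud-RAID 5 example from its disk states.

    Each entry of disk_states is one of "G", "D", "CF", "UF".
    """
    counts = Counter(disk_states)
    if counts["UF"] > 0:
        return "UF"   # any uncovered disk failure fails the whole system
    if counts["CF"] >= 2:
        return "CF"   # at least two covered disk failures
    if counts["G"] >= 3:
        return "G"    # at least three good disks
    return "D"        # everything between G and CF


print(system_state(["G", "G", "D", "CF"]))  # D
```

The UF check comes first because a single uncovered failure overrides every other condition; the CF and G conditions cannot both hold for four disks, so their relative order does not matter.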

FLC model
The FLC model is often applied to load-sharing, work-sharing, or control system environments. 20,25,26 In the FLC model, the fault coverage probability, denoted by $c$, depends on the number of element faults happening to a particular group within a certain recovery window. Specifically, the first element fault is covered with a probability $c_1$, the second element fault is covered with a probability $c_2$, and so on. These coverage probabilities are typically different.
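As a rough illustration of how fault-level coverage factors combine, the sketch below multiplies the factors along a sequence of faults. It assumes the chained reading used on the MDD paths in Section 4.1 (a path with i covered faults carries the factors c_1 through c_i); the function names and the numeric coverage values are hypothetical.

```python
from math import prod


def all_covered_probability(coverage, m):
    """Probability that the first m faults in the group are all covered."""
    return prod(coverage[:m])  # prod of an empty slice is 1 (no faults)


def uncovered_at(coverage, k):
    """Probability that faults 1..k-1 are covered and fault k is uncovered."""
    return prod(coverage[:k - 1]) * (1.0 - coverage[k - 1])


# Illustrative values only: c_1, c_2, c_3, c_4
coverage = [0.9, 0.8, 0.5, 0.2]
print(all_covered_probability(coverage, 2))
print(uncovered_at(coverage, 3))
```

Under this reading, the events "fault k is the first uncovered one" for k = 1 to 4, together with "all four faults covered", are disjoint and their probabilities sum to 1.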

Multivalued decision diagram
An MDD is a rooted, directed acyclic graph used to represent a multivalued logic function. 27,28 An MDD model constructed for a system state has two sink (leaf) nodes, "0" and "1," representing the system not being or being in the particular state, respectively. Each nonsink node (as illustrated in the left subfigure of Figure 2) corresponds to a multistate (particularly, n-state) component and is associated with an n-valued variable $x$. The MDD rooted at this node encodes an n-valued expression $f$, which can be represented in case format as
$$f = \mathrm{case}(x, f_{x=1}, f_{x=2}, \ldots, f_{x=n}), \tag{1}$$
where $f_{x=i}$ ($i = 1, 2, \ldots, n$) means the expression $f$ evaluated with $x$ taking value $i$. The system MDD model is built in a bottom-up manner. At the bottom level, the event that a component is in a particular state is represented by a basic event MDD. The right subfigure in Figure 2 demonstrates such an MDD representing the n-state component $x$ being in state 2.
The logic operations (AND or OR) between the sub-MDDs are performed based on the manipulation rules of (2),
$$g \lozenge h = \mathrm{case}(x, G_1, \ldots, G_n) \lozenge \mathrm{case}(y, H_1, \ldots, H_n) = \begin{cases} \mathrm{case}(x, G_1 \lozenge H_1, \ldots, G_n \lozenge H_n), & \mathrm{index}(x) = \mathrm{index}(y) \\ \mathrm{case}(x, G_1 \lozenge h, \ldots, G_n \lozenge h), & \mathrm{index}(x) < \mathrm{index}(y) \\ \mathrm{case}(y, g \lozenge H_1, \ldots, g \lozenge H_n), & \mathrm{index}(x) > \mathrm{index}(y) \end{cases} \tag{2}$$
where $g$ and $h$ represent two sub-MDD models with child sub-MDDs $G_i$ and $H_i$, $\lozenge$ represents the logic operation, and index represents the order of variable $x$ or $y$. 29,30 Note that the MDD generation depends on the ordering of all the component variables; different orderings lead to different MDD models. However, the evaluation of these different models provides the same system state probabilities.
Typically, the index or order of each variable is determined using a heuristic method. 31 To apply the rules of (2) for combining the two sub-MDDs into one, the variable indexes of the two root nodes ($x$ for $g$, $y$ for $h$) are compared. If the two root nodes have the same variable index (meaning they belong to the same component), the logic operation is performed between their corresponding child nodes; otherwise, the logic operation is performed between each child node of the root node with the smaller index and the entire sub-MDD rooted at the node with the larger index.
After obtaining the system MDD model for a particular system state $S_k$, the probability of the system being in $S_k$ is obtained by adding the probabilities of all paths from the root node of the MDD to sink node "1."
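The combination and evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `basic_event`, `combine`, and `probability` are hypothetical, states are numbered from 0, sinks are the plain integers 0 and 1, and coverage nodes are not modeled here.

```python
from dataclasses import dataclass
import operator


@dataclass(frozen=True)
class Node:
    index: int        # position of the component in the variable ordering
    children: tuple   # one child (Node or sink 0/1) per component state


def basic_event(index, state, n_states):
    """MDD for the event 'component `index` is in state `state`'."""
    return Node(index, tuple(int(s == state) for s in range(n_states)))


def combine(op, g, h):
    """Combine two sub-MDDs with operator.and_ or operator.or_."""
    g_sink, h_sink = isinstance(g, int), isinstance(h, int)
    if g_sink and h_sink:
        return op(g, h)
    if h_sink or (not g_sink and g.index < h.index):
        # root of g comes first in the ordering: push h into g's children
        index, kids = g.index, tuple(combine(op, c, h) for c in g.children)
    elif g_sink or h.index < g.index:
        # root of h comes first in the ordering: push g into h's children
        index, kids = h.index, tuple(combine(op, g, c) for c in h.children)
    else:
        # same component: combine corresponding children
        index = g.index
        kids = tuple(combine(op, a, b) for a, b in zip(g.children, h.children))
    if all(k == kids[0] for k in kids):
        return kids[0]  # reduction: all children identical, node is redundant
    return Node(index, kids)


def probability(node, probs):
    """Sum of the probabilities of all paths from `node` to sink 1.

    probs[index][state] is the probability of that component state.
    """
    if isinstance(node, int):
        return float(node)
    return sum(probs[node.index][s] * probability(child, probs)
               for s, child in enumerate(node.children))


# Two three-state components with illustrative state probabilities
probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
a = basic_event(0, 0, 3)
b = basic_event(1, 0, 3)
both = combine(operator.and_, a, b)     # both components in state 0
either = combine(operator.or_, a, b)    # at least one component in state 0
print(probability(both, probs), probability(either, probs))
```

The recursive evaluation visits every root-to-sink path once, which is adequate for the small models considered here; a production implementation would normally cache shared sub-MDDs.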

PROPOSED COMBINATORIAL METHOD
In this section, we present the proposed MDD-based method for two cases: perfect controller in Section 4.1 (ie, the controller is fully reliable during the considered mission time) and imperfect controller in Section 4.2 (ie, the controller may fail during the considered mission time). For the latter case, the functional dependence of disk drives on controller A must be considered during the evaluation.

Case 1: Perfect controller
The system state probabilities are evaluated by applying the MDD model in Section 3.2 and addressing FLC on paths from the root node to the sink node "1." For a clear representation, edges representing a disk in different states in the MDD model are denoted by different types of line. In particular, a dash-dot-dot line is for state G, a dash-dot line is for state D, a short dash line is for state CF, and a long dash line is for state UF. The probabilities of the system being in states G, D, CF, and UF are represented by $P_G$, $P_D$, $P_{CF}$, and $P_{UF}$, respectively. Coverage probabilities in the case of $i$ disk faults occurring, denoted by $c_i$ ($i = 1, 2, 3, 4$), are assumed to be identical for different disks. However, the proposed methodology can be easily extended to model different coverage probabilities by using different values of $c_i$ for paths involving different combinations of $i$ disk failures.
Figure 3 illustrates the multistate fault tree (MFT) model describing the combinations of disk states that lead to the entire cloud-RAID 5 occupying state G. Specifically, the system is in state G when at least three out of the four disks are in the G state, and none of the disks is in the UF state. Note that the rules in (2) for MDD generation are only applicable to logic AND and OR operations. Therefore, before applying the rules, the 3/4 and NOT gates in the MFT in Figure 3 are converted to combinations of AND and OR gates of non-UF states of disks. Applying the rules of (2), we generate the MDD model in a bottom-up manner from the converted MFT for state G, where nodes labeled 1, 2, 3, and 4 correspond to the four disks. To address the FLC, the node labeled $c_1$ is added to paths involving a single-disk covered fault. 20 Figure 4 shows the final MDD model, where only paths to sink node "1" (representing the example cloud-RAID 5 system being in state G) are shown for simplicity.

System G state probability
All the paths in Figure 4 reflect the definition of the system being in state G (at least three out of four disks must be in state G to make the entire system occupy state G). When a disk covered fault is involved in a path, node $c_1$ is added to the path.
Based on the MDD generated, the probability of the example cloud-RAID 5 system being in state G, denoted by $P_G$, is calculated as the sum of the probabilities of all paths from the root node to sink node "1" in Figure 4. Equation (3) gives $P_G$, written here in compact form,
$$P_G = \prod_{i=1}^{4} p_{ig} + \sum_{i=1}^{4} \left(p_{id} + p_{if}\,c_1\right) \prod_{j \neq i} p_{jg}, \tag{3}$$
where $p_{ig}$ and $p_{id}$ denote the probability that disk $i$ ($i = 1, 2, 3, 4$) is in state G and state D, respectively; $p_{if}$ represents the occurrence probability of a fault in disk $i$ (thus, $p_{if} c_1$ gives the CF probability of disk $i$). The disk state probabilities $p_{ig}$, $p_{id}$, $p_{if}$ can be estimated by using the Markov model-based method as studied in the works of Liu and Xing 16 and Mandava et al. 19
Figure 5 shows the MFT model of the example cloud-RAID 5 system being in state CF. The system is in state CF when at least two out of the four disks are in the CF state, and none of the disks is in the UF state. Figure 6 illustrates the MDD model generated from the MFT model in Figure 5 by applying the rules of (2). Similar to Figure 4, Figure 6 shows only paths to sink node "1" (representing the example cloud-RAID 5 system being in state CF), where at least two disks occupy state CF. For a path involving $i$ disk covered faults, nodes labeled $c_1, \ldots, c_i$ are added to the path. For example, for the left-most path in Figure 6, two disks are failed covered; thus, nodes representing coverage factors $c_1$ and $c_2$ are added to the path. For the right-most path, all four disks are failed covered; thus, nodes representing coverage factors $c_1$, $c_2$, $c_3$, and $c_4$ are added to the path. The probability of the system being in state CF, $P_{CF}$, can be evaluated as the sum of the probabilities of all disjoint paths from the root node to sink node "1" in Figure 6, as shown in (4).
Figure 7 illustrates the MFT model of the example cloud-RAID 5 system being in state UF. Applying the rules of (2), we generate the MDD model for state UF from the MFT in Figure 7. To address the FLC, nodes labeled $c_i$ and $\bar{c}_i$ are added to paths involving the disks being CF and UF, respectively, where $\bar{c}_i = 1 - c_i$. Figure 8 gives the final MDD model, where only paths to sink node "1" (representing the system being in state UF) are shown for simplicity of representation. All the paths in Figure 8 reflect the definition of the system being in state UF, ie, any disk failing uncovered causes the entire system to occupy the UF state. For example, considering the fourth path from the right, disks 1, 2, and 3 are failed covered and disk 4 is failed uncovered; thus, nodes representing coverage factors $c_1$, $c_2$, $c_3$, and $\bar{c}_4$ are added to the path. Evaluating the MDD in Figure 8, the probability of the system being in state UF, $P_{UF}$, is obtained by adding the probabilities of all paths from the root node to sink node "1," as shown in (5).

System D state probability
The probability of the system being in state D can be obtained as $P_D = 1 - (P_G + P_{CF} + P_{UF})$. For verification, we also generate the MDD model of the example cloud-RAID 5 system being in state D in Figure 9. Similar to the MDD models constructed for the other system states, only paths leading to sink node "1" (representing the example system being in state D) are shown. All the paths in Figure 9 reflect the definition of the system being in state D (ie, any state between state G and state CF). Evaluating the MDD model in Figure 9, we obtain the probability of the system being in state D, denoted by $P_D$, as shown in (6).
The system reliability is the probability that the system occupies state G or D, ie, $R_{system} = P_G + P_D$.
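As a cross-check on the MDD path sums, the four system state probabilities for the perfect-controller case can also be obtained by brute-force enumeration of disk states. The sketch below is an illustration only, not the MDD procedure itself: it assumes that a combination with m disk faults is fully covered with probability c_1 times ... times c_m and otherwise leads to state UF, which matches the path descriptions above. The function name and the numeric inputs are hypothetical.

```python
from itertools import product
from math import prod


def system_state_probabilities(p_g, p_d, p_f, coverage):
    """State probabilities of the four-disk cloud-RAID 5 example under FLC.

    p_g, p_d, p_f: per-disk probabilities of state G, state D, and a fault.
    coverage: [c_1, c_2, c_3, c_4], where c_k covers the k-th disk fault.
    """
    per_disk = list(zip(p_g, p_d, p_f))
    result = {"G": 0.0, "D": 0.0, "CF": 0.0, "UF": 0.0}
    # 0 = disk in G, 1 = disk in D, 2 = disk fault (covered or uncovered)
    for combo in product(range(3), repeat=len(per_disk)):
        p = prod(per_disk[i][s] for i, s in enumerate(combo))
        good, faults = combo.count(0), combo.count(2)
        covered = prod(coverage[:faults])      # all faults covered
        result["UF"] += p * (1.0 - covered)    # at least one fault uncovered
        if faults >= 2:
            result["CF"] += p * covered        # two or more covered failures
        elif good >= 3:
            result["G"] += p * covered         # at least three good disks
        else:
            result["D"] += p * covered         # between G and CF
    return result


# Illustrative inputs: identical disks and identical coverage factors
probs = system_state_probabilities([0.7] * 4, [0.2] * 4, [0.1] * 4, [0.5] * 4)
reliability = probs["G"] + probs["D"]
print(probs, reliability)
```

Enumeration grows as 3 to the power of the number of disks, so it is only practical for small arrays; the MDD model avoids this blow-up by sharing sub-diagrams.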

Case 2: Imperfect controller
To address the imperfectly reliable controller A, we apply the total probability law 30 to evaluate the probability of the example cloud-RAID 5 system being in state $k$ (ie, G, D, CF, UF) as
$$P_{S_k} = (1 - q_A)\,P_k + q_A \Pr(S_k \mid A \text{ fails}), \tag{7}$$
where $q_A$ denotes the failure probability of controller A, and $P_k$ represents the conditional system state $k$ probability given that the controller is operational. In this work, $q_A$ is given as an input parameter or can be derived from the controller's failure time distribution parameter(s). For example, if the controller fails exponentially with constant rate $\lambda_A$, then $q_A = 1 - e^{-\lambda_A t}$. $P_k$ in (7) can be obtained by applying the evaluation method detailed in Section 4.1. In the following, we describe the evaluation of $\Pr(S_k \mid A \text{ fails})$ in (7). When the controller fails, the probability of the system being in state G is simply 0, ie, $\Pr(S_G \mid A \text{ fails}) = 0$. Thus, (7) can be rewritten as (8) for evaluating the system G state probability, where $P_G$ is evaluated using (3):
$$P_{S_G} = (1 - q_A)\,P_G. \tag{8}$$
Similarly, when the controller fails, the entire system fails. In this sense, the probability of the system being in state CF is assumed to be 0, ie, $\Pr(S_{CF} \mid A \text{ fails}) = 0$. Thus, (7) can be rewritten as (9) for evaluating the system CF state probability, where $P_{CF}$ is evaluated using (4):
$$P_{S_{CF}} = (1 - q_A)\,P_{CF}. \tag{9}$$
In addition, $\Pr(S_D \mid A \text{ fails}) = 0$. Thus, (7) can be rewritten as (10) for evaluating the system D state probability, where $P_D$ is evaluated using (6):
$$P_{S_D} = (1 - q_A)\,P_D. \tag{10}$$
Because the entire system fails when the controller fails, $\Pr(S_{UF} \mid A \text{ fails}) = 1$. Thus, (7) can be rewritten as (11) for evaluating the system UF state probability, where $P_{UF}$ is evaluated by (5):
$$P_{S_{UF}} = (1 - q_A)\,P_{UF} + q_A. \tag{11}$$
Table 1 lists 12 different sets of input disk state probabilities grouped by different coverage factors. Specifically, group 1 containing sets 1 to 4 considers $c_1 = c_2 = c_3 = c_4 = 1$ (perfect fault coverage), group 2 containing sets 5 to 8 considers $c_1 = c_2 = c_3 = c_4 = 0.5$, and group 3 containing sets 9 to 12 considers $c_1 = c_2 = c_3 = c_4 = 0.7$. Based on Equations (3), (6), (4), and (5) derived in Section 4.1, we obtain the system state probabilities $P_G$, $P_D$, $P_{CF}$, $P_{UF}$, respectively, for cases where A is perfectly reliable. The results are shown in Table 2.

Case 1: Perfect controller
In group 1 ($c_1 = c_2 = c_3 = c_4 = 1$), all disk faults are fully covered, regardless of the number of faults that have happened. The system UF state probability is therefore always 0 in this group, and the system G, D, and CF state probabilities depend directly on the input disk probabilities of being in states G, D, and failure, respectively. For example, as the disk G state probability increases, the system G state probability increases correspondingly. Based on the results in Table 2, the system D state probability decreases and the system UF and CF state probabilities increase with an increase in the disk failure probability; the system CF state probability increases and the system UF state probability decreases as the coverage factor increases (ie, the probability that a disk fault is covered becomes higher). Comparing the reliability results in groups 2 and 3 with the corresponding ones in group 1 shows that failure to consider the IPC behavior leads to overestimated system reliability, which may further mislead system design and maintenance activities.
Table 3 shows the system state probabilities using nonidentical coverage factors for different numbers of disk faults (columns 2-5), with the disk state probabilities being $p_{ig} = 0.7$, $p_{id} = 0.2$, $p_{if} = 0.1$. For sets 1 and 2, $c_1$ or $c_2$ is 0, meaning that when one of the first two disk faults takes place, it is uncovered, causing the entire system to fail uncovered. Therefore, the system CF state probability is 0. The system G and D state probabilities increase with an increase in $c_1$. For sets 2, 3, 4, 6, 7, and 8, which share the same value of $c_1$, the system probabilities for states G and D are the same because the evaluation of these two system state probabilities depends only on the value of $c_1$. For the same reason, the system probabilities for states G and D in set 5 (with a higher value of $c_1$) are higher than those in sets 2, 3, 4, 6, 7, and 8. With the same values of $c_1$ and $c_2$, a higher value of $c_3$ leads to a higher system CF state probability (set 7 vs set 8).
Table 4 shows the system state probabilities at $t = 1000$ hours considering the imperfect controller with failure rate $\lambda_A = 0.0001$/h, coverage factors of $c_1 = c_2 = c_3 = c_4 = 0.5$, and the different disk state probabilities listed in the table. The equations derived in Section 4.2 are applied to perform the calculations. For set 1, as the disk probability of state G is 0, the example cloud-RAID 5 system has no chance of being in state G; thus, $P_{S_G} = 0$. For set 2, as the disk probability of state D is 0, the example cloud-RAID 5 system has no chance of being in state D; thus, $P_{S_D} = 0$. For set 3, the disks undergo no failures and the system has a probability of 0 of being in state CF, but it can still fail uncovered due to the single-point failure of the controller.
Table 5 presents the system state probabilities at $t = 1000$ hours for the controller with different failure rates. The disk state probabilities used for the analysis are $p_{ig} = 0.8$, $p_{id} = 0.1$, $p_{if} = 0.1$, and the coverage factors are $c_1 = c_2 = c_3 = c_4 = 0.5$. As the controller failure rate increases, the system G, D, and CF state probabilities decrease and the system UF state probability increases due to the larger chance of the single-point failure of the controller.
Table 6 lists the system state probabilities for several different values of the mission time (in hours) with $\lambda_A = 0.0001$/h, disk state probabilities of $p_{ig} = 0.8$, $p_{id} = 0.1$, $p_{if} = 0.1$, and coverage factors of $c_1 = c_2 = c_3 = c_4 = 0.5$. As time proceeds, the controller's failure probability increases, leading to a higher system UF state probability and, thus, lower system G, D, and CF state probabilities.
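The imperfect-controller adjustment used for these results is a direct application of the total probability law from Section 4.2. The sketch below assumes an exponentially distributed controller lifetime; the function names and the conditional state probabilities are illustrative placeholders, not values taken from the tables.

```python
from math import exp


def controller_failure_probability(rate_per_hour, t_hours):
    """q_A for a controller with a constant (exponential) failure rate."""
    return 1.0 - exp(-rate_per_hour * t_hours)


def with_imperfect_controller(conditional, q_a):
    """Apply the total probability law to conditional state probabilities.

    conditional: probabilities of G, D, CF, UF given a working controller.
    A controller failure sends the system to UF with probability 1.
    """
    adjusted = {state: (1.0 - q_a) * p for state, p in conditional.items()}
    adjusted["UF"] += q_a
    return adjusted


q_a = controller_failure_probability(0.0001, 1000.0)
# Placeholder conditional probabilities (must sum to 1)
conditional = {"G": 0.6, "D": 0.2, "CF": 0.05, "UF": 0.15}
print(q_a, with_imperfect_controller(conditional, q_a))
```

Because every non-UF state is scaled by the same factor 1 - q_A, raising the failure rate or the mission time lowers the G, D, and CF probabilities proportionally and moves the lost mass into UF, which is the trend described for Tables 5 and 6.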

CONCLUSION AND FUTURE WORK
Cloud-RAIDs play a crucial role in modern technological systems such as IoT systems by providing ubiquitous data storage solutions. To assure the reliable and safe operation of these systems, it is important to model and evaluate the reliability of the cloud-RAID system. However, existing reliability models have various limitations, such as assuming binary-state disks and/or perfect fault coverage (ie, fully reliable fault recovery mechanisms). This paper relaxes these limitations by modeling a multistate cloud-RAID storage system with the consideration of the FLC, where the system fault recovery capability (coverage factor) depends on the number of disk faults happening within a certain recovery window. An MDD-based method is presented to evaluate the state probabilities of the considered system. The effects of different parameters (ie, coverage factor, failure rate of the RAID controller, and mission time) are demonstrated by analyzing an example cloud-RAID 5 system. The importance of considering the IPC behavior for accurate system state probabilities is also demonstrated through numerical results. The suggested combinatorial method is applicable to homogeneous and heterogeneous disks with arbitrary types of time-to-failure distributions.
In the future, we will apply the proposed methodology to other levels of the cloud-RAID systems. Based on the suggested reliability evaluation method and our preliminary work for binary systems in the work of Mandava and Xing, 32 we are also interested in investigating the optimal resource allocation problem 33 for multistate cloud storage systems subject to IPC. The problem aims to maximize the system reliability, minimize the system cost, or balance the two performance metrics through solving constrained or multiobjective optimization problems. Another direction is to extend the MDD-based methodology to address the cloud-RAID systems performing phased-mission 34 storage tasks.