1. Introduction
Wireless sensor networks (WSNs) are widely used in high-technology and communication systems. A survey of their use, properties, characteristics, and applications is given in [1,2]. A WSN frequently operates in remote areas under harsh environmental conditions. Its nodes operate unattended, and miniaturization places the elements of a node so close together that a single shock can cause several elements to fail simultaneously. Other frequently used structures are high-performance computing (HPC) systems, which aggregate computing power to reach higher performance than individual computers or workstations when solving problems in science, industry, and business. Both structures are formed by a large number of components and systems, generally include redundancy, and are subject to simultaneous failures. In a WSN, maintenance is usually performed by inspections at certain times, which reduces costs and increases effectiveness.
In [3,4], systems under simultaneous failures are studied in the discrete case. In the first, series-parallel and parallel-series systems of order 3 whose component lifetimes do not depend on time are studied, and the reliability is calculated by combinatorial methods. In the second, the reliability of a network is calculated computationally, and the lifetimes of the components are discrete random variables.
Simultaneous failures in continuous time are seldom studied in the literature. In [5], an HPC system with multiple failures is studied. The failure time of the nodes is assumed to follow Weibull distributions, and the authors develop a reliability model for a system with k nodes and simultaneous failures. The reliability, the failure rate, and the mean time to failure of the system are calculated, and a dependence between the reliability measures and the number of nodes is detected. The analytical structure proposed by the authors is a multivariate Weibull model based on the Marshall–Olkin multivariate exponential distribution. Markov processes have been an appropriate model in reliability; see [6,7], among many others. A limitation of Markov processes in modeling systems is that the sojourn times in the states are exponentially distributed or, equivalently, that the instantaneous transition rates are constant in time. This is a severe restriction that is not satisfied in many cases. We present an extension of this last paper to the case in which the sojourn times follow phase-type distributions. The class of phase-type distributions is dense, in the sense of convergence in distribution, in the family of distribution functions defined on the positive real half-line. Thus, the model can be applied to systems in which the sojourn times follow distributions such as the Weibull, Gamma, Erlang, and others; see [8]. Moreover, it can also be applied to datasets of component lifetimes. In both cases, a model governing the system can be constructed by approximating the lifetimes of the components by phase-type distributions. Some models using phase-type distributions in reliability and other domains are [9,10], among others. In [9], a reliability redundancy allocation problem is developed for a non-repairable system with heterogeneous components whose lifetimes follow PH distributions. In [10], these distributions are involved in a queuing model applied to nodes in a WSN.
We propose a model based on matrix-analytic methods (MAMs), extending previous papers. The MAMs have proven to be a flexible structure for modeling the arrival of events and contain many familiar processes as very special cases; see [11,12,13], among others. A survey of the MAMs applied to different domains is given in [14]. Introducing the MAMs in the analysis of systems implies the use of the following elements: the Markovian arrival processes, the phase-type distributions, and the Kronecker operations. They play a central role in the study and calculation of the performance measures and, given their generality, extend many other particular systems. The Kronecker operations are studied in [15,16]. In [17], an updated review of the Markovian arrival process, an essential element in the MAMs, is presented. In [18], state-space modeling techniques are surveyed and compared to help practitioners choose an appropriate model for calculating the reliability and the availability in a dynamic environment. When a model is constructed by using these methods, its applicability is extended, and the stochastic process governing the system is an analytically tractable multidimensional Markov process.
The system we present is subject to shocks. Shock models have been studied in the literature under different methodologies and applied to different systems. The initial systematic study of shock models is [19]; since then, many extensions have followed in different directions. In [20], the behavior of different populations under shock models is studied. In [21], a complete study of the different types of failures and models is performed. In [22], different shock models following non-homogeneous Poisson processes are applied to calculate the reliability of systems. In [23], the availability of a system affected by attacks is analyzed. One of the first papers applying phase-type distributions as a model for the arrival of shocks is [24], extending previous papers. Shock models and continuous-time Markov processes have been considered suitable models for analyzing the ageing of components and systems [25]. The MAP as a model for the arrival of failures is introduced in [26], and it is applied in [27,28,29,30] to study different reliability systems. In [31], a system under shocks, damages, and other elements is modeled by a Markovian arrival process with marked arrivals.
We study an N-system with different units. It is subject to internal failures affecting single units and to shocks that can produce damage and, eventually, the failure of one or several units. Maintenance is performed by inspections, replacing all the down units (if any) with new and identical ones. The case of exponential units has been studied in [6]. In the present paper, we extend it to a more general system: the lifetimes of the units follow PH-distributions, extending the exponential lifetimes; the arrival of shocks is governed by a Markovian arrival process (MAP), extending the Poisson process of the previous paper; the time between inspections follows a PH-distribution, extending the exponential case; and the size of the damage produced by the shocks is introduced in the study and follows a discrete PH-distribution, extending the exponential case, where the number of failures follows a geometric distribution. The present model therefore applies to a wider class of systems than the previous ones. The study does not follow directly from the exponential version: the matrix structure has to be constructed anew and cannot be derived from the exponential case. A Markov process is not directly suitable for systems under simultaneous failures; we handle this by introducing Kronecker operations involving indicator functions, which preserve the Markovian structure for systems whose units are not exponentially distributed.
For this system, a multidimensional Markov process governing the system under the conditions given above is constructed. The main quantities of interest associated with the system can be calculated in a well-structured form in the transient and stationary regimes. In return, the dimensions of the matrices involved are very large, even in systems with a few units. Formally, the construction of the generator is not affected by the orders of the matrices involved, though it is complex due to the simultaneity of the events. We illustrate the general procedure by applying it to a k-out-of-N system, of frequent use in engineering; an application to a k-out-of-3 system is carried out, in which the usual performance measures in reliability are calculated, and it is shown how the model can be applied taking into account the structure of the system and how it operates. In [32], a computational method for determining the reliability of a k-out-of-n system with different components is constructed; in the present paper, we propose an alternative procedure applying the MAMs and calculating other performance measures using algebraic calculations with matrices. A numerical application is added illustrating how the model can be constructed when the units follow Weibull distributions.
The paper contributes to the study of general reliability systems in several ways: (1) the units are different and not exponentially distributed; their lifetimes follow PH-distributions, which allows the approximation of systems with units following general distributions; (2) the arrival of shocks is governed by an MAP, which allows dependence among the interarrival times and includes the usual arrival distributions; (3) the shocks produce damage and, eventually, failure; (4) there are simultaneous failures of units, and simultaneous replacements are carried out after an inspection; (5) the generator construction and the reliability analysis can be applied to N-systems with dissimilar components arranged in different structures. The general hypotheses established for this model and the compact expressions achieved by using the MAM methodology expand its application possibilities.
The paper is organized as follows: in Section 1, we give the definitions of phase-type distributions and Markovian arrival processes and comment on how these operate. In Section 2, the assumptions of the system are established and the multidimensional Markov process governing the system is constructed. In Section 3, the model is applied to a k-out-of-N system, obtaining explicit expressions for the performance measures. Section 4 is a methodological one, in which a detailed numerical application to a 2-out-of-3 system is performed, showing the steps for applying the calculations to other systems.
Preliminaries
The present section introduces the basic definitions used in the paper and some comments about how the MAMs operate.
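Definition 1. If $A = (a_{ij})$ and $B$ are rectangular matrices of orders $m_1 \times m_2$ and $n_1 \times n_2$, respectively, their Kronecker product $A \otimes B$ is the matrix of order $m_1 n_1 \times m_2 n_2$, written in compact form as $A \otimes B = (a_{ij} B)$.
The Kronecker sum of the square matrices $A$ and $B$ of orders p and q, respectively, is defined by $A \oplus B = A \otimes I_q + I_p \otimes B$, where $I_k$ denotes the identity matrix of order k.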
For more details about these operations, see references [12,15].
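The two Kronecker operations of Definition 1 can be illustrated with a minimal sketch in plain Python. The function names `kron`, `identity`, and `kron_sum` and the numerical matrices are hypothetical choices made only for this illustration; they are not taken from the paper.

```python
def kron(A, B):
    """Kronecker product of two rectangular matrices given as lists of rows."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(k):
    """Identity matrix of order k."""
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]


def kron_sum(A, B):
    """Kronecker sum of square matrices A (order p) and B (order q):
    kron(A, I_q) + kron(I_p, B)."""
    p, q = len(A), len(B)
    left = kron(A, identity(q))
    right = kron(identity(p), B)
    return [[x + y for x, y in zip(rl, rr)] for rl, rr in zip(left, right)]


# Hypothetical 2 x 2 matrices, chosen only to make the block structure visible.
A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
K = kron(A, B)      # 4 x 4 matrix made of the blocks a_ij * B
S = kron_sum(A, B)  # 4 x 4 matrix
```

The orders multiply under both operations, which is why the matrices of a model built from several units, a shock process, and an inspection process grow quickly.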
Definition 2. The distribution function on $[0, \infty)$ of a phase-type distribution is $F(t) = 1 - \alpha \exp(Tt) e$, $t \geq 0$. It is associated with a finite Markov process with one absorbent state and initial vector $\alpha$. Matrix $T$ is the submatrix of the generator of the process restricted to the transient states, and it is non-singular. Vector $e$ is a column vector of 1’s. The absorption column vector is denoted by $T^0$, and it satisfies $T^0 = -Te$. The order of matrix $T$ and of the vectors is the same. It is said that the distribution has representation $(\alpha, T)$, and it is written $PH(\alpha, T)$.
$F(t)$ is the distribution function of the first passage time of the Markov process to the absorbent state given the initial vector $\alpha$. The order of the distribution is the order of matrix $T$. It is denoted a PH-distribution. A PH-renewal process is a renewal process whose distribution function is given by a PH-distribution.
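As an illustration of the definition, the following sketch evaluates the distribution function and the mean of a PH-distribution in plain Python. It assumes the standard matrix form of the distribution function (one minus the initial vector times the matrix exponential of the sub-generator times a vector of ones) and the standard mean formula; the names `alpha`, `T`, `ph_cdf`, `ph_mean`, and `solve`, and the Erlang example, are assumptions made for this illustration. The matrix exponential is evaluated by uniformization, one of several possible numerical choices.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ph_mean(alpha, T):
    """Mean -alpha T^{-1} e, obtained by solving T x = -e."""
    x = solve(T, [-1.0] * len(T))
    return sum(a * xi for a, xi in zip(alpha, x))


def ph_cdf(alpha, T, t, tol=1e-12):
    """F(t) = 1 - alpha exp(Tt) e, evaluated by uniformization."""
    n = len(T)
    theta = max(-T[i][i] for i in range(n))  # uniformization rate
    # Substochastic matrix I + T / theta.
    P = [[(1.0 if i == j else 0.0) + T[i][j] / theta for j in range(n)]
         for i in range(n)]
    v = [float(a) for a in alpha]            # holds alpha P^k
    w = math.exp(-theta * t)                 # Poisson weight for k = 0
    surv, mass, k = 0.0, 0.0, 0
    while mass < 1.0 - tol and k < 10_000:
        surv += w * sum(v)
        mass += w
        k += 1
        w *= theta * t / k
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
    return 1.0 - surv


# Hypothetical example: Erlang distribution with 2 phases and rate 3.
alpha = [1.0, 0.0]
T = [[-3.0, 3.0], [0.0, -3.0]]
mean = ph_mean(alpha, T)
cdf_half = ph_cdf(alpha, T, 0.5)
```

For this Erlang example the results can be checked against the closed forms: mean 2/3 and distribution function 1 - exp(-3t)(1 + 3t).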
The discrete version of a PH-distribution is defined in [12]; it is referred to below as the discrete PH-distribution.
Definition 3. Let $D$ be an irreducible infinitesimal generator of a Markov process. Let a sequence of matrices $D_k$, $k = 1, \ldots, n$, be non-negative and matrix $D_0$ have non-negative off-diagonal entries. The diagonal entries of $D_0$ are strictly negative, and it is non-singular. The following is satisfied: $D = \sum_{k=0}^{n} D_k$. Associated with this Markov process, there is a Markov renewal process that generates an arrival process on the real line, operating as follows. It is assumed that there are n different types of arrival. Matrix $D_0$ governs the interarrival times, and matrix $D_k$, $k = 1, \ldots, n$, governs the occurrence of the arrival of type $k$. This is the Markovian arrival process (MAP) associated with the initial Markov process, with parameter matrices $D_k$, $k = 0, 1, \ldots, n$. It is denoted by $MAP(D_0, D_1, \ldots, D_n)$. The order of the MAP is the order of the involved matrices. The states of the Markov process are denominated virtual phases of the MAP, and they follow exponential times.
A $PH(\alpha, T)$-distribution can be interpreted as follows. While operating, the process occupies exponential states (phases) randomly, and it finishes when it reaches the absorbent one. The PH-renewal process associated with this PH-distribution is a renewal one that reinitiates with matrix rate $T^0 \alpha$. A MAP with two parameter matrices $(D_0, D_1)$ extends the concept of PH-renewal process when, after a renewal, the following period initiates according to a rate depending on the phase occupied when the renewal occurs; the matrix including the different initial rates after a renewal is $D_1$. Matrix $D_0$ represents the change of virtual phases not corresponding to renewals. The renewals will be interpreted as shocks arriving to the system.
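The link between a PH-renewal process and a MAP with two parameter matrices can be sketched as follows. The code builds the two matrices from a PH representation (the sub-generator for phase changes without renewal, and the outer product of the absorption vector and the initial vector for renewals) and computes the stationary arrival rate. The names `D0`, `D1`, `ph_renewal_as_map`, `stationary`, and `map_rate` are assumptions for this illustration, and the arrival-rate formula (stationary vector times the renewal matrix times a vector of ones) is the standard one rather than an expression quoted from the paper.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def stationary(Q):
    """Stationary vector pi of an irreducible generator Q: pi Q = 0, pi e = 1."""
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]  # transpose of Q
    A[-1] = [1.0] * n                                    # normalization row
    return solve(A, [0.0] * (n - 1) + [1.0])


def ph_renewal_as_map(alpha, T):
    """Return (D0, D1) = (T, T0 alpha), where T0 = -T e is the absorption vector."""
    T0 = [-sum(row) for row in T]
    D1 = [[t0 * a for a in alpha] for t0 in T0]
    return [row[:] for row in T], D1


def map_rate(D0, D1):
    """Stationary arrival rate pi D1 e of a MAP with matrices (D0, D1)."""
    n = len(D0)
    Q = [[D0[i][j] + D1[i][j] for j in range(n)] for i in range(n)]
    pi = stationary(Q)
    return sum(pi[i] * sum(D1[i]) for i in range(n))


# Hypothetical example: renewal process with Erlang(2, rate 3) interarrival times.
D0, D1 = ph_renewal_as_map([1.0, 0.0], [[-3.0, 3.0], [0.0, -3.0]])
rate = map_rate(D0, D1)
```

In this example the arrival rate equals the reciprocal of the mean interarrival time, 1 / (2/3) = 1.5, as expected for a renewal process.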
2. The System
We consider a system with N independent units under the conditions given in the Introduction. The system is renewed randomly, replacing the non-operational units, if any, with new and identical ones after inspections. The internal failures, the arrival of shocks, the damage caused by the shocks, and the number of units affected by the shocks are random, and they occur according to the following assumptions.
Assumption 1. The time of the internal failure of unit i follows a PH-distribution; the representation and the order may differ from unit to unit.
Assumption 2. The shocks arrive following an MAP with a given initial vector and order a.
Assumption 3. The shocks cause damage or failure. The size of the damage is governed by a discrete PH-distribution with initial probability vector β and order r.
Assumption 4. The time between consecutive inspections follows a PH-distribution of order b. The times of inspection are negligible.
The damage caused by a shock can be interpreted as a counter before the failure, in such a way that, after a shock, the counter of the units changes and the absorption in the discrete PH-distribution governing the damages would indicate the failure of the corresponding unit.
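The counter interpretation can be made concrete with a short sketch that computes the probability that a unit fails exactly at the n-th shock under a discrete PH-distribution. It assumes the standard form of the probability mass function (initial vector times the (n-1)-th power of the transition matrix among damage phases times the vector of absorption probabilities); the names `beta`, `S`, and `dph_pmf`, and the numerical values, are hypothetical.

```python
def dph_pmf(beta, S, n_max):
    """Probabilities P(N = n), n = 1..n_max, of absorption at the n-th step.

    beta: initial probability vector over the damage phases.
    S: substochastic transition matrix among the damage phases.
    """
    n = len(S)
    exit_probs = [1.0 - sum(row) for row in S]  # absorption (failure) probabilities
    v = [float(b) for b in beta]                # holds beta S^(k-1)
    pmf = []
    for _ in range(n_max):
        pmf.append(sum(v[i] * exit_probs[i] for i in range(n)))
        v = [sum(v[i] * S[i][j] for i in range(n)) for j in range(n)]
    return pmf


# Order 1: each shock is fatal with probability 0.2 (geometric number of shocks).
geometric = dph_pmf([1.0], [[0.8]], 3)

# Order 2 (hypothetical): a shock in damage phase 1 moves the counter to phase 2
# with probability 0.7 or is fatal with probability 0.3; in phase 2 it is
# fatal with probability 0.6 and leaves the counter unchanged otherwise.
two_phase = dph_pmf([1.0, 0.0], [[0.0, 0.7], [0.0, 0.4]], 3)
```

The order-1 case reproduces the geometric distribution, which is the special case used in the numerical application.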
First, we define the internal state of the units. The internal state of unit
i is denoted by
. These states are characterized by the internal phase and the phase of damage. Unit
i is operational if it occupies any operational phase of
and the same for
; the corresponding absorption vectors for these distributions are
, and
, respectively. The set associated with this case is
. In other cases, it will be non-operational, the set associated is
and it is formed by the corresponding absorbent phases of the previous PH-distributions. We denote by
the set formed by the union of these sets:
Vector
,
, denotes the set of the internal states of the units. The macro states of the system are determined by vectors
, the virtual phases (
) of the MAP governing the arrival of shocks and the phases (
l) of the PH-distribution governing the time between inspections. They are denoted by
We denote by the number of non-operational units at time t. We will construct the multidimensional Markov process with space of macro states S.
If , unit i is operational and if not, . Vector is a state-vector of the system. Given a state-vector , the occurrence of a failure (internal or shock) or inspection causes a transition to another state-vector , and the changes in the components of vector are indicated by if and in other case (there is no change). This is an indicator of the change of a unit, and vector registers the changes of the units in a transition . The number of units changing in such transition is the sum over i of the components of the vector.
The number of non-operational units in vector is . The possible values for are The set is formed by the state-vectors having m non operational units (non-null components). These sets form a partition of the set of state-vectors. Let two state-vectors and the Kronecker indicator if and if ; expression indicates that vectors have the same number of operational units.
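The partition of the state-vectors by the number of non-operational units can be enumerated directly for small systems. The sketch below assumes the coding suggested by the text, where a null component marks an operational unit and a component equal to 1 marks a non-operational one; the function names are hypothetical. The helper `operational_vectors` uses the k-out-of-N rule studied later in the paper (the system is operational if at least k units are).

```python
from itertools import product


def partition_state_vectors(N):
    """Group the vectors x in {0, 1}^N by the number m of non-operational units."""
    groups = {m: [] for m in range(N + 1)}
    for x in product((0, 1), repeat=N):
        groups[sum(x)].append(x)
    return groups


def operational_vectors(N, k):
    """State-vectors with at least k operational units (at most N - k failed)."""
    groups = partition_state_vectors(N)
    return [x for m in range(N - k + 1) for x in groups[m]]


groups = partition_state_vectors(3)
sizes = [len(groups[m]) for m in range(4)]  # binomial coefficients for N = 3
up_2_of_3 = operational_vectors(3, 2)
```

The number of state-vectors doubles with each added unit, before the phases of the PH-distributions and of the MAP are even taken into account, which is consistent with the remark that the matrices are large even for systems with few units.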
We calculate a general Markov process governing the system considering that the system fails when the last operational unit fails; this is equivalent to a system organized in parallel. We will show that the case of a 1-out-of-N system is obtained from the present model in a straight way. First, the transition rate matrices among the state-vectors are constructed and then the generator including the transition rate matrices among the states. Once is constructed, the transition probability matrix with is calculated from expression , , , by using computational calculations.
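For a small generator, the transition probability matrix can be computed numerically as a matrix exponential. The sketch below assumes the standard relation between the transition probability matrix and the generator (the exponential of the generator multiplied by time) and uses scaling and squaring with a truncated Taylor series, which is adequate only for matrices of moderate norm; a production computation would normally rely on a numerical library. The two-state generator, the rates `lam` and `mu`, and the function names are hypothetical.

```python
import math


def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def expm(M, terms=20, squarings=10):
    """exp(M) by scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scale = 2.0 ** squarings
    A = [[x / scale for x in row] for row in M]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, A)]  # A^k / k!
        result = [[r + s for r, s in zip(rr, rs)] for rr, rs in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical two-state system: state 0 = up, state 1 = down,
# failure rate lam and restoration rate mu.
lam, mu, t = 1.0, 4.0, 0.3
Q = [[-lam, lam], [mu, -mu]]
Pt = expm([[q * t for q in row] for row in Q])

# Closed form of the probability of being up at time t when starting up.
closed_form = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
```

Here `Pt[0][0]` agrees with `closed_form`, and each row of `Pt` sums to one, as required of a transition probability matrix.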
A transition among the state-vectors occurs upon the arrival of one of the following events: a shock, an internal failure, or an inspection. The construction of the Markov process governing the system follows the line in [6], but the matrices cannot be obtained directly from that paper; new matrices are necessary, and the Kronecker operations play a central role. Going forward, the subindices of the identity matrices indicate their order, and the same convention applies to vectors.
The transitions among the state-vectors are determined by the state of the units (up or down) and the indicator of change.
If the system occupies state-vector
and a shock arrives, there is a possible transition to state-vector
(that can be the same vector
x) according to the rates among the phases of the state-vectors; the rate of change of unit
i due to a shock is given by
The first line indicates that there is a change in unit i, the failure of the unit is governed by , and it can occur from any phase (). The Kronecker product expresses the changes in terms of the corresponding matrices and vectors. The second line indicates that there is no change (given by ), and a shock arrives (counter ) producing no change. In the third line, the unit is not operational.
The expression of the change for all units in vector
to the arrival of a shock is calculated using the Kronecker product, and it is
If an internal failure arrives, the rate among the phases of the state-vectors is given by reasoning similarly as above for the arrival of internal failures, and we have for unit
i:
The first line indicates that the failure of the unit is governed by
, and it occurs in any phase of damage (
). The second line indicates that there is a change in an operational phase without damage (
) or there is a change in another unit; in the second summand, a change is produced in one unit different from
i, and this is represented by
. For the total of units, we have
Note that expressions are similar, though in the previous one the shocks can be simultaneous and in the last the internal failures cannot. The difference is that, in expression , simultaneous failures are not allowed; this is performed incorporating the Kronecker’s delta.
After the occurrence of an inspection, the unit
i is affected in two ways:
The first line indicates a replacement of the down unit that reinitiates (
) and the counter reinitiates (
). The second line indicates that nothing occurs. For the total of units, we have
The matrix transition rates among the state-vectors depend on the number of units failing, which is
, and the special case
. They are
The matrix grouping the transition rates between states
in terms of the state-vectors is
and, finally, the generator of the Markov process in terms of the macro states is
If
, the initial block vector of the Markov process is
The number of 0’s in the expression of is the difference between the order of and the order of the first component.
The stationary distribution is obtained as in the exponential case, changing the corresponding matrices [6].
3. k-Out-of-N Systems
The model we have constructed behaves as a parallel system, since the system fails when the last operational unit fails. One advantage of MAMs is that they are versatile and can be applied to systems under other assumptions with the appropriate changes. We dedicate this section to studying k-out-of-N systems under the assumptions and notation in Section 2. This class of systems is frequently used in applications.
The system is operational if at least
k units are. The set of macro-states is
; this set can be partitioned in
and
, operational and non-operational macro-states. Grouping the set of non-operational macro-states in one, denoted by
, the set of macro-states of the system is
. The initial probability vector (
3) and generator
in (
2), are, respectively,
The occupancy probabilities of the operational macro states are
,
The reliability
is the time of the first failure of the system; it is a
-distribution, where
and
are derived from
and
, respectively, eliminating the row and column associated with
. The availability at time
t is given by
Particular systems are the series and parallel ones. As particular cases, we have the 1-out-of-N system (parallel) and the N-out-of-N system (series). The corresponding renewal processes associated with the first time of failure can be studied straightaway as can be seen in references from the authors.
Let a k-out-of-3 system be
. For
, it is a parallel system. The set of operational and non-operational states are
,
, respectively. The failure of the system follows a
-distribution with
The mean time of failure is
. The renewal of the system occurs when an inspection arrives while the system occupies state 3. The time until the renewal follows a
-distribution with
For
,
,
. The failure of the system follows a
-distribution with
The mean time of failure is
. The renewal of the system occurs when an inspection arrives and the system is down; it follows a
-distribution with
For
,
, states
, are non-operational. The first time of failure follows a
with
and
The mean time of failure is
The renewal of the system occurs when an inspection arrives while the system occupies a non-operational state. The time until the renewal of the system follows a
with vector
and
These are the fundamental matrices for calculating the performance measures associated with the system.
4. Numerical Application
The expressions above can be directly applied when the lifetimes of the units are governed by PH-distributions. In engineering, the Weibull and the lognormal distributions, among others, are frequently used. The Weibull distribution plays a central role in reliability; a text presenting different types of Weibull models in an integrated form is [33]. It is a complete survey that classifies different models, includes an inferential study, and covers applications to several domains, with special attention to reliability. We illustrate the numerical application to k-out-of-3 systems with units governed by Weibull distributions. The first step is to fit a PH-distribution, and then to apply the results in the previous section.
The reliability function of a two parametric
Weibull distribution is
with mean and variance given, respectively, by
4.1. Fitting PH-Distributions
We consider three different Weibull distributions corresponding to the three units with the parameters given below. For every one, the mean and the variance are calculated. A fit of a PH-distribution is performed applying the EM algorithm [
8]. The number of data are
, and 1000 iterations are realized. In every case, the first PH-distribution selected for fitting is of order 2; if the fitting is rejected, then a PH-distribution of order 3 is fitted, and so on. The goodness of the fit is calculated using the Kolmogorov–Smirnov test. In the cases we present, two of them are well fitted to PH-distributions of order 2, and the other to a PH-distribution of order 3. In all cases, the statistic calculated
Z from the sample is less than the statistics
, and all the fittings are not rejected.
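The moment calculation and the goodness-of-fit check described above can be sketched in plain Python. The sketch assumes the common scale-shape parametrization of the Weibull distribution, with reliability function exp(-(t/scale)^shape); the parametrization used in the paper may differ. The function names and the numerical values are hypothetical. The function `ks_statistic` returns the Kolmogorov–Smirnov distance between a sample and any distribution function passed as `cdf`, which in the setting of this section would be the fitted PH-distribution.

```python
import math


def weibull_moments(scale, shape):
    """Mean and variance of a Weibull distribution with R(t) = exp(-(t/scale)^shape)."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return scale * g1, scale ** 2 * (g2 - g1 ** 2)


def weibull_cdf(t, scale, shape):
    """Distribution function 1 - R(t) under the same parametrization."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical distribution and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare cdf with the empirical distribution just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Shape 1 reduces to the exponential distribution: mean = scale, variance = scale^2.
mean, var = weibull_moments(2.0, 1.0)

# Tiny hand-checkable case against the uniform distribution function on [0, 1].
d = ks_statistic([0.25, 0.75], lambda u: u)
```

A fit is not rejected when the computed distance is below the critical value of the test for the given sample size and significance level.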
We assume a prefixed unit time. The goodness of the fits guarantees a good approximation of the model and consequently also for the performance measures to the initial system. The calculations using the MAMs are more tractable than the ones with the Weibull distributions. In the following figures, when the first significant decimal figure is after a ten thousandth, it approaches zero.
Unit 1 is assumed to follow a Weibull distribution with
. Then,
,
The representation
of the PH-distribution fitted is
Unit 2 is assumed to follow a Weibull distribution with
. Then,
,
The representation
of the PH-distribution fitted is
Unit 3 is assumed to follow a Weibull distribution with
. Then,
,
The representation
of the PH-distribution fitted is
4.2. System 2-Out-of-3
We apply the results to a 2-out-of-3 system. The matrices of the MAP governing the arrival of shocks are assumed to be
The probability of a fatal shock is , and is the probability that it only causes damage. For simplifying the calculations, the occurrence of a fatal shock after non fatal shocks follows a geometric distribution defined by ; it is a -distribution with . The time between inspections is assumed to be exponentially distributed with mean ; then, and
The rate of failure of the system is
The availability, the reliability, and the rate of system failure for different values of time are given in Table 1. The last column represents the stationary values of the performance measures.
The availability decreases and approaches the stationary value for , the failure rate increases with time, and the reliability decreases quickly.
These are performance measures for the system. We consider the cycles associated with the system. A cycle is the timespan of two consecutive operational and non-operational periods, the length of it is denoted by
. The mean number of renewals is
. The study of these measures in the cycles is relevant, since it allows us to know the behavior of the system in a transient regime. Denoting by
,
the up and down times in a cycle, respectively, we have
. The fraction of time that the system is operational in a cycle is
The values of these measures are . It is deduced that the system is operational close to of the time in a cycle, and the failure rate is less than the one in the transient case for .
In order to improve the previous performance measures, the only quantity that can be controlled by the researcher in the evolution of the system is the inspection. We take this into account to study how the performance measures change in terms of the inspection time, considering the stationary regime. In Table 2, the numerical values of the availability (A), the mean operational time in a cycle, and the fraction of time that the system is operational in a cycle are calculated for different values of the mean time between inspections.
The variation of these performance measures is observed; availability, the mean time operational in a cycle, and the fraction of time that the system is operational in a cycle decrease when the mean time of the inspections increases.
Finally, we give a graphical comparison among the studied 2-out-of-3 system and the series and parallel ones. In Figure 1, the availability of these three systems is plotted.
As expected, the availability is significantly higher when the units are arranged in parallel. Other performance measures could be compared in a similar way, and further numerical analyses can be performed according to the practical interests of the researcher.