1. Introduction
The analysis of mortality and its temporal trends allows a country to understand the dynamics of its population and provides a fundamental guide for establishing economic and social policy. A wide variety of mortality models is available to describe those dynamics. According to Alexopoulos et al. [1], the best-known mortality model, and the most successful in terms of generating extensions, is the Lee–Carter (LC) model. This model decomposes the historical pattern to obtain the trends of mortality and their relationship with the population’s age.
The LC model, proposed in 1992 by Lee and Carter [2], and its different extensions or variants have been applied for modeling and forecasting mortality in insurance and population studies. Most applications have been carried out in developed countries. Callot et al. [3] proposed a modification of the LC model that facilitates the separation of the deterministic and stochastic dynamics, and provided empirical illustrations with mortality data for the United States, Japan, and France to demonstrate the advantages of the modified model. Carfora, Cutillo, and Orlando [4] proposed a quantitative comparison of the leading mortality models (including the basic LC model) to evaluate both their goodness of fit and their forecasting performance on Italian population data. Booth et al. [5] compared five variants or extensions of the Lee–Carter method for mortality forecasting for populations of 10 developed countries (Australia, Canada, Denmark, England, Finland, France, Italy, Norway, Sweden, and Switzerland). Salhi and Loisel [6] proposed a multivariate approach for forecasting pairwise mortality rates of related populations and compared it with the classical LC model using data for England and Wales. Recently, some research has proposed alternative procedures to the classic LC model to obtain the mortality rate. Postigo-Boix et al. [7] presented polynomial functions that considerably reduce the amount of data required to establish the mortality rate.
There are also successful applications of the LC model and its different versions for the mortality data of Central America and South America. Examples include the works by Andreozzi [8] and Belliard and Willians [9] for Argentina; García-Guerrero [10] for Mexico; Lee and Rofman [11] for Chile; Aguilar [12] for Costa Rica; and Díaz et al. [13] for Colombia. These papers show the usefulness of the LC models in analyzing and modeling mortality in developing countries.
As noted, most papers so far have focused on modeling the dynamics of mortality. The Lee–Carter model captures the overall pattern of mortality of the population, both in the age profile and over time, with a fit that may range from excellent to only moderate. The analyst can describe this pattern and the changes in mortality by analyzing the estimates obtained for the model parameters. Typically, however, the model does not reproduce observed mortality exactly, and some of the information on that phenomenon may remain in the residuals vector. It is precisely here that the control chart plays an important role: it attempts to discover other substantial changes in mortality behavior that have not already been captured by the model.
Therefore, in this paper, we go beyond modeling mortality and propose the use of the residuals of LC models to monitor and identify situations of substantial change in the mortality trend. The purpose of the study was to determine whether death probability changed significantly over the studied period. To that end, we identified the times (years) and age intervals whose death probabilities differ significantly from the trend pattern determined by the LC models. For this task, we used a $T^2$ control chart implemented on the residuals of LC models, complemented with the Mason, Tracy, and Young (MTY) decomposition [14] to detect the age range where the change occurred.
A control chart is a straightforward graphical tool, initially proposed by Shewhart in 1927 [15], to verify the temporal stability of a parameter of interest in the probability distribution of a random variable. In this way, the univariate control chart monitors a single variable. Then, in 1947, Hotelling [16] extended the application of control charts to the simultaneous control of two or more random variables, creating the $T^2$ multivariate control chart.
Although control charts were initially proposed to monitor industrial processes (statistical quality control), many papers have evidenced their application to other areas of knowledge. For example, in medicine, Woodall [17] mentioned that control charts are linked to health care surveillance. The recent review work of Vetter and Morrice [18] proposed the use of control charts as a tool that allows health professionals to understand and communicate performance data and improve quality for patient care, anesthesiology, perioperative medicine, critical care, and pain management. Some papers report their use to monitor hospital performance indicators [19], clinical variables in patients [20], and chronic and infectious diseases [21,22,23], and for monitoring the effectiveness of surgical procedures [24].
Additionally, the literature has evidenced the use of control charts for mortality data. Chamberlin et al. [25] proposed using control charts to determine whether the severity of illnesses of patients and mortality rates changed significantly over the five years from 1986 to 1990. Marshall and Mohammed [26] used control charts to monitor mortality rates after coronary artery bypasses. More recently, Urdinola and Rojas-Perilla [27] proposed to use this approach to identify the under-registration of adult mortality in Colombia.
Studies on mortality are important for countries like Colombia, considering that it is in the group of Latin American countries with the highest mortality rates [28]. In addition, its rapid increase in violent crime over the last century must be taken into account. According to Gaviria [29], the homicide rate began its upward trend in the late 1970s and had more than tripled by the early 1990s. For this period, the homicide rate in Colombia was three times those of Brazil and Mexico, seven times that of the United States and 50 times that of a typical European country.
In our context, the fitted mortality models for abridged life tables are designed to simultaneously predict a vector of mortality rates. Therefore, each time a prediction is made, a vector of estimated mortality rates is obtained, one for each age interval, and consequently, a residual vector is generated. The residuals measure the departures between the current mortality rate and the expected rate according to the model. A high residual indicates that the current mortality does not correspond to the observed trend. Consequently, it is suspected that there is a change in the population’s mortality for a specific age range.
According to this, life table monitoring is a multivariate control problem which consists of simultaneously monitoring p random variables (each observed residual is a random variable). Therefore, we propose to monitor the residuals of the mortality model with a control chart, and only for the years identified as out-of-control, use the first term of the MTY decomposition to identify the age range involved in the out-of-control signal.
The proposed methodology is illustrated with the mortality data of Colombia, obtained from “The Latin America Human Mortality Database” [30]. Colombia has experienced a series of demographic phenomena in the last 60 years, such as the progressive increase of the population, which is reflected by a change of the population pyramid, and an increase in life expectancy, which is primarily associated with the drop in mortality rate and the aging of the population. Although the methodology is illustrated through this particular case, our proposal is standard and generalizable to other similar contexts that require sociodemographic explanations of the “anomalies” detected with control charts.
This paper has the following structure:
Section 2 presents the methodology, including definitions of life tables, Lee–Carter models, and standardized deviance residuals, and descriptions of the control charts that were used and the MTY decomposition;
Section 3 describes Colombian mortality data, and the results obtained in the monitoring of those data;
Section 4 discusses the results from a socio-demographic perspective; and
Section 5 provides the main conclusions of the paper.
2. Methodology
In this section, we briefly describe the life tables and Lee–Carter (LC) and Lee–Carter with two terms (LC2) models. Then, more carefully, we present the Hotelling chart for individual observations and the MTY decomposition. These theoretical elements support our proposal to study substantial changes in the mortality of developing countries.
2.1. Life Tables
Period life tables, also known as mortality tables, are a demographic analysis tool that summarizes the information of mortality incidence for a population for a given period. Life tables are classified according to the length of the age interval in which the data are presented: “complete” when containing data for every single age from birth to the last applicable age, and “abridged” when containing data at age intervals, generally 5 years of age for most of the age range [31]. The basic life table functions are $m_x$, $q_x$, $l_x$, $d_x$, $L_x$, $T_x$, and $e_x$. However, the life table does not always publish all of these functions. The interpretation of the life table functions in a complete table would be as follows [32]:
For a fictitious cohort with an incidence of mortality according to the mortality rates $m_x$ that have been defined:
The probability of death $q_x$ is the likelihood that deaths occur in a certain period at each age x.
The number of survivors $l_x$ is the number of individuals from the fictitious cohort that reach the age of x.
The number of deaths $d_x$ is the number of deaths within the fictitious cohort at each age x.
The stationary population $L_x$ is the total time lived for all individuals of the fictitious generation that are x years old.
The total years lived $T_x$ is the total amount of years lived for all individuals of the fictitious generation aged x or more.
The life expectancy $e_x$ is the average number of years that survivors at age x have left to live.
The abridged life table shows estimates based on mortality data from vital statistics and population size obtained from population censuses. Censuses are conducted approximately every 10 years in some countries, such as Argentina, Brazil, and Mexico, among others, and some censuses are carried out with intervals longer than 10 years, as is the case in Colombia [33]. Because ages are often misstated in vital registration when deaths are recorded, developing countries often build the mortality table with age intervals.
In an abridged life table, the interpretation of the functions is similar to the case of the complete life table, except that the ${}_nm_x$, ${}_nq_x$, ${}_nd_x$, and ${}_nL_x$ values relate to the age interval $[x, x+n)$:
The probability of death ${}_nq_x$ is calculated from the death rate ${}_nm_x$:
${}_nq_x = \dfrac{n \, {}_nm_x}{1 + (n - {}_na_x) \, {}_nm_x}$, where ${}_na_x$ is the average number of years lived in the age interval $[x, x+n)$ by individuals dying in it and n is the amplitude of the age interval.
The number of deaths ${}_nd_x$ is the number of individuals from the fictitious generation who died during the age interval $[x, x+n)$.
The stationary population ${}_nL_x$ is the total time lived in the age interval $[x, x+n)$ by all individuals of the fictitious generation.
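As an illustration of the conversion from an interval death rate to a death probability, the following minimal Python sketch applies the standard actuarial relation ${}_nq_x = n \, {}_nm_x / (1 + (n - {}_na_x) \, {}_nm_x)$. The function name `death_probability` and the default ${}_na_x = n/2$ (deaths spread evenly over the interval) are assumptions made only for this example; they are not taken from the data source used in the paper.

```python
def death_probability(m_rate, n, a=None):
    """Convert an interval death rate (nmx) into a death probability (nqx).

    m_rate : death rate in the age interval [x, x+n)
    n      : amplitude of the age interval, in years
    a      : average years lived in the interval by those dying in it;
             assumed to be n/2 when not supplied (illustrative default)
    """
    if a is None:
        a = n / 2.0
    return n * m_rate / (1.0 + (n - a) * m_rate)


# Hypothetical five-year interval with a death rate of 0.01 per person-year.
q = death_probability(0.01, n=5)
```

For very young ages the even-spread assumption is usually poor, so a published value of ${}_na_x$ would normally be passed explicitly through the `a` argument.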
The intervals commonly used to group ages in an abridged life table are $[0, 1), [1, 5), [5, 10), [10, 15), \ldots$ until the final open interval, because the ages usually preferred in a declaration of death are those which end in multiples of five. To ensure a broader view of the dynamics of mortality in a population, it is additionally necessary to visualize the temporal trend of the incidence of mortality. For this, dynamic life tables are used, which correspond to the collection of period life tables, complete or abridged, obtained for each year of a time interval. Hereafter, the total number of age intervals for each period will be denoted by p, and the total number of periods analyzed will be denoted by m.
2.2. Lee–Carter Models
Lee and Carter [2] proposed a simple method for modeling and forecasting mortality: a model of age-specific death rates with a time component and a fixed relative age component, and a time series model (an autoregressive integrated moving average (ARIMA)) of the time component. This method offers three significant advantages: it is a parsimonious demographic model combined with standard statistical time-series methods, forecasting is based on persistent long-term trends, and probabilistic confidence intervals are provided for the forecasts [34].
The Lee–Carter model (LC) expresses the age-specific death rate $m_{x,t}$ as a measure depending on both the age of individuals, x, and the time period, t. The classical LC model is expressed as follows:
$$\ln(m_{x,t}) = a_x + b_x k_t + \epsilon_{x,t}, \tag{1}$$
where $a_x$ is an age-specific parameter that is independent of time (it describes the general mortality profile according to age), $b_x$ is an age-specific parameter that represents how rapidly or slowly mortality at each age varies when the general level of mortality changes, and $k_t$ is the general mortality index which depends on time and reflects the general level of mortality. The errors $\epsilon_{x,t}$ are assumed to be independent, identically distributed $N(0, \sigma^2)$ random variables.
The LC model has a structure which is invariant under some linear transformations of the parameters. For example, for any value of constant c, it is verified that
$$a_x + b_x k_t = (a_x - c \, b_x) + b_x (k_t + c).$$
To ensure the identifiability of the model, Lee and Carter [2] proposed including the following constraints in the model: $\sum_x b_x = 1$ and $\sum_t k_t = 0$.
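The structure of the classical LC model and its identifiability constraints can be sketched in Python with a singular value decomposition of the centered log death rates, which is the estimation route usually associated with the original Lee and Carter formulation. This is only a sketch under stated assumptions: it is not the binomial logit fitting used in this paper, the function name `fit_lee_carter_svd` is hypothetical, and the input is a synthetic, noise-free matrix built for illustration.

```python
import numpy as np


def fit_lee_carter_svd(log_m):
    """Rank-one Lee-Carter fit of a (p ages x m years) matrix of log death rates.

    Returns (a, b, k) with sum(b) = 1 and sum(k) = 0.
    """
    a = log_m.mean(axis=1)                 # a_x: average log rate at each age
    centered = log_m - a[:, None]          # each row now has zero mean over time
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    b = U[:, 0]
    k = s[0] * Vt[0, :]
    scale = b.sum()                        # rescale so that sum(b) = 1
    return a, b / scale, k * scale         # the product b_x * k_t is unchanged


# Synthetic example (illustrative values only): 5 age groups, 21 years.
a_true = np.linspace(-6.0, -1.0, 5)
b_true = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # sums to 1
k_true = np.linspace(10.0, -10.0, 21)               # sums to 0
log_m = a_true[:, None] + np.outer(b_true, k_true)

a_hat, b_hat, k_hat = fit_lee_carter_svd(log_m)
```

Because the rows are centered before the decomposition, the constraint on $k_t$ holds automatically, and only $b_x$ needs rescaling.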
A modification to the Lee–Carter model, called the Lee–Carter model with two terms (LC2), was developed by Renshaw and Haberman [35]. They indicated that the interaction between age and time can be better captured by adding terms to the LC model. The LC2 model is expressed as follows:
$$\ln(m_{x,t}) = a_x + b_x^{1} k_t^{1} + b_x^{2} k_t^{2} + \epsilon_{x,t}. \tag{2}$$
In this paper, we adjust Equations (1) and (2) with the adaptation proposed by Debón, Montes and Puig [36], who suggest modeling the logit of the death probability $q_{x,t}$ considering a binomial distribution for the number of deaths. Thus, the LC model is expressed as:
$$\operatorname{logit}(q_{x,t}) = \ln\left(\frac{q_{x,t}}{1 - q_{x,t}}\right) = a_x + b_x k_t + \epsilon_{x,t},$$
with constraints $\sum_x b_x = 1$, $\sum_t k_t = 0$; and the LC2 model
$$\operatorname{logit}(q_{x,t}) = a_x + b_x^{1} k_t^{1} + b_x^{2} k_t^{2} + \epsilon_{x,t},$$
with constraints $\sum_x b_x^{i} = 1$ and $\sum_t k_t^{i} = 0$, $i = 1, 2$. Details about this fitting using R can be found in Debón et al. [37]. Finally, the LC models are based on historical mortality patterns, and if the trends do not continue to hold, then the models will no longer be valid [38].
2.3. Standardized Deviance Residuals
Residuals are the basis of most diagnostic methods and are often used to analyze the goodness of fit of mortality models. However, as we mentioned before, residuals can identify moments in time and age intervals at which the observed probability of death is substantially different from the pattern of mortality for a period of time. With this objective, we propose using a Hotelling multivariate control chart.
In the goodness-of-fit analysis of mortality models, deviance residuals are often used [36,39,40], taking into account that patterns in the residuals could indicate that the model does not adequately describe all the characteristics of the data [41]. The deviance residuals based on a binomial distribution for the number of deaths at age x are as follows:
$$r_{x,t} = \operatorname{sign}(d_{x,t} - \hat{d}_{x,t}) \sqrt{\frac{2}{\hat{\phi}} \left[ d_{x,t} \ln\left(\frac{d_{x,t}}{\hat{d}_{x,t}}\right) + (l_{x,t} - d_{x,t}) \ln\left(\frac{l_{x,t} - d_{x,t}}{l_{x,t} - \hat{d}_{x,t}}\right) \right]},$$
where $d_{x,t}$ denotes the observed number of deaths, $\hat{d}_{x,t}$ is the deaths estimated by the model, $l_{x,t}$ is the number of persons living at the beginning of the indicated age interval, and $\hat{\phi}$ is an empirical scaling factor estimated by the expression
$$\hat{\phi} = \frac{D}{N - \nu}.$$
Further, $D$ is the total model deviance, $N$ is the number of observations in the data, and $\nu$ is the effective number of parameters [41].
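The computation of binomial deviance residuals can be sketched as follows, assuming the usual binomial deviance contribution $2[d \ln(d/\hat{d}) + (l - d)\ln((l - d)/(l - \hat{d}))]$ and a scaling factor equal to the total deviance divided by the number of observations minus the number of parameters. The function name `deviance_residuals` and all numerical inputs are hypothetical and serve only to show the mechanics.

```python
import numpy as np


def deviance_residuals(d_obs, d_fit, l, n_params):
    """Scaled binomial deviance residuals and the empirical scaling factor.

    d_obs    : observed deaths
    d_fit    : deaths fitted by the mortality model
    l        : persons alive at the start of each age interval
    n_params : effective number of model parameters
    """
    d_obs, d_fit, l = (np.asarray(v, dtype=float) for v in (d_obs, d_fit, l))
    surv = l - d_obs
    with np.errstate(divide="ignore", invalid="ignore"):
        # Cells with a zero count contribute nothing to that term.
        t1 = np.where(d_obs > 0, d_obs * np.log(d_obs / d_fit), 0.0)
        t2 = np.where(surv > 0, surv * np.log(surv / (l - d_fit)), 0.0)
    dev = np.maximum(2.0 * (t1 + t2), 0.0)      # guard against rounding below zero
    phi = dev.sum() / (dev.size - n_params)     # total deviance / (N - nu)
    return np.sign(d_obs - d_fit) * np.sqrt(dev / phi), phi


# Illustrative numbers only.
r, phi = deviance_residuals(
    d_obs=[12, 30, 55, 8], d_fit=[10.0, 33.0, 50.0, 8.5],
    l=[1000, 1000, 1000, 1000], n_params=1,
)
```

Standardizing these residuals additionally requires the leverages of the fitted model, which this sketch does not compute.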
The deviance residuals are usually symmetrical, but their variance and scale are not standard. Therefore, to correct these situations, deviance residuals are usually standardized.
The standardized deviance residuals are defined by
$$r^{*}_{x,t} = \frac{r_{x,t}}{\sqrt{1 - h_{x,t}}},$$
where $h_{x,t}$ is the leverage, the distance between the observation at $(x, t)$ and the center of the observations.
The standardized deviance residuals approximately follow a standard normal distribution when the fitted model is satisfactory. For this reason, the values of these residuals will generally lie between $-2$ and 2 [42]. Moreover, these residuals satisfy the assumptions of the Hotelling $T^2$ control charts. Given the above, the standardized deviance residuals were used to monitor the mortality trend.
2.4. Hotelling Control Charts for Standardized Deviance Residuals
Control charts are useful to determine whether a process has been in a state of statistical control by examining historical data [43]. Specifically, multivariate control charts are used for process-monitoring problems in which several related variables are of interest.
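Hotelling [16] proposed the $T^2$ control chart to meet the objective of simultaneous monitoring of $p$ random variables, which generally have some non-negligible degree of association. Under the assumption that the vector $\mathbf{X} = (X_1, \ldots, X_p)'$ follows a p-variate normal distribution, $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with known mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, the statistic:
$$T^2 = (\mathbf{X} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \tag{3}$$
follows a chi-square distribution with p degrees of freedom.
The $T^2$-chart is then a chart of the $T^2$ statistic vs. the observation number, with an upper control limit (UCL) located at $\chi^2_{\alpha, p}$, which represents the upper $\alpha$ percentile of the chi-square distribution with p degrees of freedom. Here, $\alpha$ is the desired type I error probability.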
Since $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are often unknown, these parameters should be estimated from a reference sample composed of m observations of $\mathbf{X}$. The sample mean vector ($\bar{\mathbf{X}}$) and covariance matrix ($\mathbf{S}$), obtained from the reference sample, are the estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively, for Equation (3).
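According to Tracy [44], when $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are estimated, the Hotelling $T^2$ statistic satisfies
$$T^2 = (\mathbf{X} - \bar{\mathbf{X}})' \mathbf{S}^{-1} (\mathbf{X} - \bar{\mathbf{X}}) \sim \frac{(m-1)^2}{m} B\left(\frac{p}{2}, \frac{m-p-1}{2}\right),$$
where $B\left(\frac{p}{2}, \frac{m-p-1}{2}\right)$ represents a beta distribution with parameters $p/2$ and $(m-p-1)/2$. This distribution depends on the number of variables p and the sample size m, which must satisfy $m > p + 1$ [45].
Therefore, the upper control limit (UCL) for the $T^2$-chart should be located at:
$$UCL = \frac{(m-1)^2}{m} B_{\alpha;\, p/2,\, (m-p-1)/2},$$
where $B_{\alpha;\, p/2,\, (m-p-1)/2}$ is the upper $\alpha$ percentile of a beta distribution with parameters $p/2$ and $(m-p-1)/2$.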
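A minimal Python sketch of the chart with estimated parameters is given below. It assumes that the rows of the input matrix are the m reference observations, that the limit is $\frac{(m-1)^2}{m}$ times the upper $\alpha$ percentile of a beta distribution with parameters $p/2$ and $(m-p-1)/2$, and that $\alpha = 0.05$; the function name `hotelling_t2_phase1` and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import beta


def hotelling_t2_phase1(R, alpha=0.05):
    """T^2 statistics and beta-based UCL for an (m x p) reference sample."""
    m, p = R.shape
    if m <= p + 1:
        raise ValueError("the reference sample must satisfy m > p + 1")
    diff = R - R.mean(axis=0)                 # deviations from the sample mean vector
    S_inv = np.linalg.inv(np.cov(R, rowvar=False))   # inverse sample covariance
    t2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)  # one quadratic form per row
    ucl = (m - 1) ** 2 / m * beta.ppf(1 - alpha, p / 2, (m - p - 1) / 2)
    return t2, ucl


# Simulated stand-in for m = 40 yearly vectors of p = 5 residuals.
rng = np.random.default_rng(0)
R = rng.standard_normal((40, 5))
t2, ucl = hotelling_t2_phase1(R)
signals = np.flatnonzero(t2 > ucl)            # indices of out-of-control periods
```

A useful sanity check on any implementation is that, when the observations are the same ones used to estimate the parameters, the statistics always add up to $(m-1)p$.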
When a point exceeds the upper control limit in a $T^2$ chart, it is interpreted as a signal of change in the distribution of $\mathbf{X}$. Then, it is recommended to carry out an investigation to find the causes that produce the signal or apply some procedure to identify the causal variable(s) of the signal, as the chart itself cannot do this.
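In our proposal, the changes in the dynamics of mortality are evaluated through monitoring the p-dimensional vector of standardized deviance residuals $\mathbf{r}^{*}_t = (r^{*}_{x_1,t}, \ldots, r^{*}_{x_p,t})'$, which was obtained from a Lee–Carter model adjusted for a reference sample of m consecutive periods ($t = 1, \ldots, m$). In this particular application, the mean vector $\bar{\mathbf{r}}^{*}$ and the covariance matrix $\mathbf{S}_{r^{*}}$ are estimated from the reference sample of the m historical standardized residuals vectors. Under this approach, the Hotelling’s $T^2$ statistic takes the form:
$$T^2_t = (\mathbf{r}^{*}_t - \bar{\mathbf{r}}^{*})' \, \mathbf{S}_{r^{*}}^{-1} \, (\mathbf{r}^{*}_t - \bar{\mathbf{r}}^{*}), \quad t = 1, \ldots, m.$$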
In this application context, an observation that exceeds the control limit of Hotelling’s chart is interpreted as a departure from the mortality trend pattern reproduced by fitting any Lee–Carter model. Consequently, this period is labeled as out-of-control, a substantial change in mortality dynamics is suspected, and the second phase of analysis investigates the age intervals that may be involved in the out-of-control diagnosis.
As can be seen, the proposal for multivariate control is more flexible in its assumptions than the usual analysis of residuals, since the condition of complete independence imposed on the error term is now relaxed. Under the multivariate control approach, the p residuals associated with the p age intervals for a particular year form a p-dimensional random vector whose components are not necessarily independent, nor identically distributed. Therefore, the multivariate control chart allows the methodology to be applied even when the model presents local fitting problems. Note that the situation in which the p residuals are independent and identically distributed is a particular case of the multivariate strategy.
Another difference between these strategies for identifying anomalies is related to the number of assessments made in the hypothesis test. In the residuals analysis, each residual is checked individually against tolerable limits of variation, which implies making a set of $p \times m$ comparisons. Conversely, under the multivariate approach, the comparison against the control limit is done for each p-dimensional observation: that is, one comparison for each year. The multivariate control chart thus reduces the number of comparisons to m, which reduces the probability of global type I error. As is known, the probability of global error increases rapidly with the overall number of comparisons made simultaneously [45].
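The effect of the number of comparisons on the global type I error can be illustrated with a short Python calculation. It relies on the simplifying assumption that the comparisons are independent and each uses the same individual level $\alpha$, so that the probability of at least one false alarm is $1 - (1 - \alpha)^k$; the values $p = 20$, $m = 50$, and $\alpha = 0.05$ are hypothetical.

```python
def global_type_i_error(alpha, k):
    """Probability of at least one false alarm in k independent comparisons."""
    return 1.0 - (1.0 - alpha) ** k


p, m, alpha = 20, 50, 0.05                         # illustrative values only
per_residual = global_type_i_error(alpha, p * m)   # one check per residual
per_year = global_type_i_error(alpha, m)           # one check per yearly vector
```

Under these assumptions, `per_year` is always smaller than `per_residual`, which is the reduction described above.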
Finally, it should be noted that the performance of a Hotelling $T^2$ control chart is related to the number of periods m that make up the observation period to which the Lee–Carter model is fitted. For the Hotelling $T^2$ control chart, m defines the sample size used to estimate the parameters of the multivariate probability distribution of the standardized residuals vector. In this sense, through simulation studies, Champ and Jones-Farmer [46] showed that when the sample size m is small, the true false alarm rate of multivariate control charts is usually substantially higher than the established nominal rate. These authors recommend using broader control limits, so that the effect of estimation error on the false alarm rate is absorbed without substantially affecting the chart’s performance in detecting changes.
2.5. MTY Decomposition
Several approaches have been presented for the problem of interpreting a multivariate signal. Mason, Tracy, and Young [14] proposed the MTY method of decomposition to find the causes that produce the signal. The MTY method decomposes the Hotelling $T^2$ statistic into p additive orthogonal components, each of which reveals the contribution of the individual process variable and the relative joint contribution of the same process variable.
For the case of p variables, there are $p!$ different possible MTY decompositions. One such decomposition is given by
$$T^2 = T^2_1 + T^2_{2 \cdot 1} + T^2_{3 \cdot 1,2} + \cdots + T^2_{p \cdot 1, \ldots, p-1}.$$
The first term of the decomposition is called the unconditional term and corresponds to the $T^2$ statistic calculated for a single variable $r^{*}_j$. The expression is the following:
$$T^2_j = \frac{(r^{*}_j - \bar{r}^{*}_j)^2}{s_j^2},$$
where $\bar{r}^{*}_j$ and $s_j$ are the mean and standard deviation estimates of the standardized deviance residuals $r^{*}_j$ obtained with the m historical observations of $r^{*}_j$. In our context, the unconditional term measures the standardized distance between observed mortality in an age interval and the expected pattern according to the model. Thus, when a signal of change is emitted by the $T^2$ statistic, the unconditional term’s high value indicates that the signal of change may be related to the age range j.
The other terms, called the conditional terms, are calculated as
$$T^2_{j \cdot 1, \ldots, j-1} = \frac{(r^{*}_j - \bar{r}^{*}_{j \cdot 1, \ldots, j-1})^2}{s^2_{j \cdot 1, \ldots, j-1}},$$
where $\bar{r}^{*}_{j \cdot 1, \ldots, j-1}$ and $s_{j \cdot 1, \ldots, j-1}$ are the conditional mean and conditional standard deviation estimates of $r^{*}_j$ given $r^{*}_1, \ldots, r^{*}_{j-1}$, respectively.
These parameters can be estimated through the estimation of a linear regression model ($r^{*}_j$ as a response variable and $r^{*}_1, \ldots, r^{*}_{j-1}$ as predictors) with the m historical observations of the standardized deviance residuals vector. When a conditional term’s calculation yields a high value, this is an indication that the signal may be associated with a change in the correlation structure of the variables being monitored. In our application context, where the monitoring variables correspond to the standardized deviance residuals of a Lee–Carter model, the interpretation of this kind of change does not make practical sense. For this reason, the analysis of the conditional terms is not considered in our proposal.
An appropriate F distribution can describe the probability distribution for the unconditional term of the MTY decomposition:
$$T^2_j \sim \frac{m+1}{m} F_{1,\, m-1}.$$
Using this distribution, for a specified $\alpha$ level and a reference sample of size m, the upper control limit for the unconditional term is obtained as follows:
$$UCL = \frac{m+1}{m} F_{\alpha;\, 1,\, m-1},$$
where $F_{\alpha;\, 1,\, m-1}$ is the upper $\alpha$ percentile of the F distribution with 1 and $m-1$ degrees of freedom.
As a result, one can use the F distribution to determine when an individual unconditional term of the decomposition is significantly large and contributes to the signal. In essence, a significant value for an unconditional term implies that the designated variable is out-of-control [45]. For our proposal, we use only the unconditional term of the MTY decomposition to identify the age range involved in the mortality change signal.
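The use of the unconditional terms to point at age ranges can be sketched in Python as follows. The sketch assumes that each term is the squared standardized distance of one residual from its historical mean, and that the limit is $\frac{m+1}{m}$ times the upper $\alpha$ percentile of an F distribution with 1 and $m-1$ degrees of freedom. The function name `mty_unconditional`, the level $\alpha = 0.05$, and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import f


def mty_unconditional(R_ref, r_signal, alpha=0.05):
    """Unconditional MTY terms for one signalled year.

    R_ref    : (m x p) matrix of historical standardized residuals
    r_signal : length-p residual vector of the year flagged by the T^2 chart
    """
    m, _ = R_ref.shape
    mean = R_ref.mean(axis=0)
    sd = R_ref.std(axis=0, ddof=1)             # sample standard deviation per age
    t2_j = ((r_signal - mean) / sd) ** 2       # one term per age interval
    ucl = (m + 1) / m * f.ppf(1 - alpha, 1, m - 1)
    return t2_j, ucl, np.flatnonzero(t2_j > ucl)


# Simulated reference sample; the third age interval is shifted on purpose.
rng = np.random.default_rng(1)
R_ref = rng.standard_normal((30, 4))
r_signal = R_ref.mean(axis=0).copy()
r_signal[2] += 10 * R_ref.std(axis=0, ddof=1)[2]

t2_j, ucl, flagged = mty_unconditional(R_ref, r_signal)
```

In this constructed example only the shifted component exceeds the limit, which mirrors how a single age interval would be singled out after a signal.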
Finally, it is essential to note that our proposal for mortality surveillance based on implementing a multivariate control chart presents some advantages with respect to the simple exercise of verifying model fit through residuals analysis. Usually, the residuals analysis is carried out to validate the distributional assumptions established a priori about the errors in the mortality model: mean zero, homoscedasticity, independence, and equality of probability distribution. Under the established assumptions, the residuals are treated as independent observations of a random variable of zero mean and constant variance. Implicitly, this is equivalent to assuming that the model’s goodness of fit is the same for any age interval and any instant in time, which is usually not satisfied in practice. In an application with real data, the model often fits very well for certain age intervals, but overestimates or underestimates other intervals’ mortality rates. In these cases, there will be a natural difference in error distribution for certain age intervals. Under a residuals analysis, this situation will be identified as a model fitting problem, limiting its use.
On the other hand, our multivariate control proposal is oriented to identify moments of time and age intervals, in which the probability of observed death is substantially different from the mortality pattern that has been collected by the adjusted model. Under this strategy, changes in mortality can be identified in either the model or the control chart.
4. Discussion
The obtained results were compared to a series of events registered in Colombia from a socio-demographic perspective.
The alert detected with the residuals can be related to a change in the causes of death from an epidemiological point of view. Cristancho [51] mentions that according to the National Administrative Department of Statistics (DANE), homicide became the leading cause of death for men in the 1980s. In that decade, there was also an increase in mortality due to non-infectious diseases and external causes. External causes, especially for men, were associated with violence and accidents.
It should be noted that in the 1990s, the so-called emerging diseases increased in Colombia. In 1990, there was a massive dengue epidemic with a 40% fatality rate: its intensity decreased in 1991–1995 [52]. Additionally, in 1991 there was an epidemic outbreak of cholera [53].
In analyzing facts related to public health in Colombia, two crucial events were identified that relate to the years detected as out-of-control. In 1975, the National Health System was established; in the period from 1990–2000, there were changes in mortality related to this Health System Reform.
On the other hand, with the Political Constitution of 1991, a new political administrative division was created in Colombia, creating new departments, which benefited the collection of information in the country’s southern region. The exhaustive work of this division in recovering information increased the number of registered deaths.
5. Conclusions
This paper demonstrates the usefulness of control charts as a tool to detect substantial changes in mortality behavior by monitoring residuals from mortality models. In some developing countries, data are presented in age groups because of misstatements related to age; usually, there is a preference for the declaration of the age of death to occur in multiples of five, and there are various other registration difficulties. Therefore, a question of interest in the demographic and actuarial fields is the use of residuals of a model to monitor the mortality of abridged life tables.
Previous works such as Urdinola and Rojas-Perilla [27] use control charts with only one measure per unit of time to monitor mortality data, which involves building numerous control charts and increases the false alarm rate (out-of-control points). Instead, in this paper, we use a Hotelling $T^2$ multivariate control chart, which simultaneously monitors p random variables (age intervals) per unit of time and thereby dramatically reduces the number of charts to build. Further, this multivariate control chart considers the relationships between the residuals associated with different age intervals. The methodology we propose considers that mortality is a phenomenon that is not stable over time, but instead exhibits trends captured through Lee–Carter models. Consequently, the control charts applied to the residuals of these models can detect the other types of mortality changes that were not previously captured by the models.
The LC and LC2 mortality models identified the principal characteristics of mortality in Colombia. The infant mortality is high and decreases slowly until the age of 15, after which it increases progressively as the population ages. The hump phenomenon is recorded in young adults, mainly in men. This phenomenon has been described before in other papers [13,54] and is mainly explained by the homicides related to internal armed conflict [55] and illicit activities such as illegal drug markets [29], as well as the availability of firearms [28].
The Hotelling control chart is an attractive option to monitor the residuals of the mortality models and thus identify the years in which the mortality differs from the patterns that are captured by the mortality model. With the combined use of the Hotelling chart and the MTY decomposition, the years and age range of that atypical pattern were identified. Two years were identified as out-of-control: 1991 for the residuals of both models for both men and women with an influence of very young ages, and the year 1979 for residuals from the Lee–Carter model for women with the influence of very young ages and very advanced ages.
The Lee–Carter model captures information regarding the phenomenon of violence in Colombia. Therefore, the years identified as out-of-control in the charts are associated with very early or quite advanced ages, at which violence did not claim as many victims. Besides, the mortality changes identified in the control charts pertain to changes in the population’s health conditions or, in future investigations, to new causes of death such as COVID-19. Future studies to evaluate this combined methodology by adding information from new censuses for Colombia would be interesting.
Nevertheless, our proposal for mortality surveillance consists of two analysis tools that work sequentially; therefore, this study has limitations regarding models and control charts. In the first place, the model captures the time trend and the age profile of mortality in the population. Subsequently, the Hotelling control chart and the MTY decomposition identify those years and age ranges whose death probabilities differ substantially from the model trend. Therefore, the chart out-of-control signals are interpreted as possible changes in mortality according to the model trend. For example, the LC model is more straightforward than the LC2 one, as the LC2 model includes more particular additional change structures. The selection of one or the other model will entail a different characterization of the mortality trend, and consequently, a variation in the Hotelling control chart diagnosis.
Finally, we would like to point out that although this paper only applied control charts to the Colombian abridged life tables, the methodology can be extended to abridged life tables in any developing country. It might be useful to look at other datasets and examine whether conclusions are consistent for different countries.