1. Introduction
Extracting trends from time series data is a central task in many fields, including economics, geophysics, climatology and engineering. Extensive research has been done on trend extraction methods, and these methods can be roughly divided into two groups: the smoothing-based approach and the non-smoothing-based approach. The division reflects the dominant role smoothing-based methods have historically played in trend extraction. Research along the smoothing-based path has yielded fruitful results and gained much popularity. Some well-known methods are Henderson filters [
1], seasonal-trend decomposition based on loess (STL) [
2], Hodrick-Prescott filters [
3] and X12-ARIMA [
4], which was updated to X-13-ARIMA in 2013. All these methods give a set of weights that are applied to the data as an averaging operator to give the underlying trend, and they differ mainly in the class of functions used for fitting and the smoothness criterion. They are also referred to as linear filters. For nonlinear filters, various methods have been suggested, see, for example, the optimal order statistic filter [
5], the stack filters [
6] and the median filters [
7].
The non-smoothing-based approach has received more attention in recent years. Two methods, originating from the field of signal processing, have made their way into the field of time series analysis: Singular Spectrum Analysis (SSA) and Empirical Mode Decomposition (EMD). SSA is based on the idea of factorisation: it applies the singular value decomposition to the covariance matrix of trajectory matrices to give a trend [
8]. SSA has been actively developed [
9,
10,
11,
12,
13,
14] and applied to various kinds of data, e.g., climate data [
15,
16], financial data [
17,
18] and geophysical data [
11]. On the other hand, EMD is related to the idea of orthogonal projection: the method decomposes signals into finite, nearly-orthogonal components that admit Hilbert transforms [
19]. While the method works in the time domain, it can also be interpreted as a special case of the Wavelet methods which work in the frequency domain [
20]. Due to its high adaptiveness to nonlinear and non-stationary data, EMD has been widely studied and applied, cf. [
21,
22,
23]. For a comprehensive review of trend extraction methods, readers are referred to Alexandrov, Bianconcini, Dagum, Maass and McElroy [
24].
Out of all the methods noted, we investigate in this work seasonal-trend decomposition based on loess, aka STL, in the context of trend extraction when there is missing data. We study STL because, compared with other more recent and mathematically sophisticated methods, STL has a broader user base. For one, STL is one of the few early methods that give a full decomposition of time series data (into the trend, seasonal and residual components) with almost no assumptions made about the data. Secondly, STL is easy to use and has good properties, e.g., it can handle non-stationary data and has fast convergence. The method also has the capacity to handle missing data; however, this feature does not seem to have been implemented in practice. This poses a challenge to practitioners, as the missing data problem is often encountered. A typical approach to circumvent the difficulty of missing data is to complete the data with imputation methods. Numerous questions arise immediately upon the decision to do so. To name a few: do imputations introduce bias? How reliable are the STL estimates after the imputation? To what extent are imputation methods able to recover missing data in the context of trend extraction with STL? It is the goal of this paper to settle these questions.
Moreover, we remark that imputing missing data before applying STL is not merely a way to avoid re-implementing a version of STL that can handle missing data. In fact, it addresses a problem that was not fully considered in the work of Cleveland et al. [
2] on proposing STL. Cleveland et al. [
2] suggested handling missing data with loess smoothing (in the cycle-subseries smoothing step of the procedure), but this is not possible when the missing data form large gaps in the observed data. By considering imputations before applying STL, we open ourselves up to many imputation methods, so that different types of missing data can be handled, after which STL can be applied.
Regarding imputation methods and the general framework for handling missing data, the major theoretical issues were settled in [
25,
26,
27]. Later developments were mostly about clarifications [
28], implementations [
29,
30], and practical concerns [
31,
32]. One notable imputation method is the multivariate imputations by chained equation (MICE) [
33] which deals with multivariate missing data (both categorical and numerical) and has attracted renewed interest in recent years [
34,
35].
In this work, we study the problem of extracting trends from time series data when some data are missing. In particular, we investigate a general class of procedures that impute the missing data and then extract trends using STL. We refer to them as the imputation-STL procedures. Working under the settings given in [
2], we derive an error bound for the extracted trends in terms of imputation errors. This answers the questions we posed earlier regarding the impact of imputations on the trend estimates. More importantly, it provides a framework for analysing errors of the trend extracted with any imputation-STL procedure. Apart from the theoretical results, we also examine a special case, the loess-STL procedure, through simulation studies. We demonstrate that loess-STL provides reliable trend estimates when the ground-truth trend is smooth and the missing data disperse over the time series. This, together with the theoretical results, justifies the use of the procedure in practice. We also present an application to real data; specifically, we apply the loess-STL procedure to the Antarctic upper air temperature data and make available a profile of temperature trends for further climatological analysis.
The structure of this paper is as follows. In
Section 2, we review the methods loess and STL for time series data and define some terminology. In
Section 3, we present an error analysis of the imputation-STL procedures in the context of trend extraction with missing data. In
Section 4, we present simulation studies with loess-STL procedures. In
Section 5, we apply the loess-STL procedure to a real dataset of radiosonde records of upper air temperature at 22 Antarctic research stations covering the past 50 years, and we conclude in
Section 6.
2. Methods Review and Terminology
In this section, we first review the methods of loess and STL by summarising the work of Cleveland, Cleveland and McRae [
2]. This can be skipped by readers who are familiar with the methods. Then we define some terminology which we will use in the rest of the paper.
2.1. Loess
Locally weighted regression, aka loess, is a nonparametric method in regression analysis. It models the dependent variable as a smooth function of the independent variables; the smooth function is estimated by fitting the data locally with polynomials. The method adapts well to data and has several advantages. First, the flexible form incorporates a wider class of relationships beyond the linear. Second, no prior knowledge about the data is required (other than that the data are a representative sample), so subjective judgement can be avoided when little is known about the relationship between the variables. Third, it is useful for exploratory analysis, e.g., it can serve as a baseline for searching for good parametric models; it can also act as a benchmark against parametric models during model evaluations. However, these advantages come at a cost: like other nonparametric methods, loess requires more data than parametric models to achieve the same precision in the estimates. In the following, we detail the assumptions and the fitting procedure.
2.1.1. Assumptions
Loess assumes a data generating process of $y_i = f(x_i) + \varepsilon_i$, $i = 1, \ldots, N$, where the $y_i$ are observations of the dependent variable $Y$, the $x_i$ are those of the independent variable $X$, $N$ is the total number of observed data points and the $\varepsilon_i$ are independent normal random variables with mean 0 and variance $\sigma^2$. The function $f$ specifies the functional relationship between the dependent variable $Y$ and the independent variable $X$ and is assumed to be smooth. This justifies the use of Taylor's theorem, which gives grounds for approximating functions locally by polynomials. The normality assumption about the data-generating process allows the distributions of the residuals, the fitted values and the residual sum of squares to be represented by known parametric families of distributions. In particular, given the assumption, the residuals and the fitted values are normally distributed provided that $\sigma^2$ is known, and the residual sum of squares follows a chi-squared distribution [36]. These distributional results make it possible to assess the uncertainty in these quantities. Loess also assumes that the estimate $\hat{f}$ approximates $f$ with no bias. The assumption is not unrealistic, as it is shown on p. 62 in [37] that under some mild conditions the estimate is asymptotically unbiased. Each assumption we saw is associated with a particular feature of the method, and it is possible to forgo some of the properties for greater generality of the model. For instance, ref. [38] relaxed the normality assumption using the idea of robust regression by Huber [39]. In this case, a Monte Carlo simulation is then needed to assess the standard error.
2.1.2. Fitting Procedure
Loess approximates the functional relationship $f$ by fitting a polynomial locally at each point $x$ (in the domain of $f$) using points in the neighbourhood of $x$. The fitting uses weighted least squares (WLS) regression. Overall, three quantities are needed in this procedure: the degree of the polynomials to be fit, the size of the neighbourhood and the weights for performing the WLS regression. Regarding the degree of the polynomials to be fit, first-degree or second-degree polynomials are commonly used and are usually sufficient as long as the functional relationship is not too erratic. Quadratic fitting is generally preferred over linear fitting near extrema [37]. Alternatively, the degree of the polynomials can be chosen using M-plots as suggested in [36]. Regarding the neighbourhood size, as it directly controls the smoothness of the estimates, the choice should be made based on the research context. The neighbourhood size is chosen such that the resulting estimate answers the research question in some optimal sense. But in cases where one wants to avoid subjectivity, the neighbourhood size can be chosen through data-driven techniques like cross-validation. Regarding the weights for the WLS regression, we will calculate them based on the tricube weight function
$$W(u) = \begin{cases} (1 - |u|^{3})^{3}, & |u| < 1,\\ 0, & \text{otherwise}, \end{cases}$$
where $u$ is a dummy variable. Concretely, suppose we have $N$ data points, the degree of the polynomial to be fit is $d$ and the size of the neighbourhood is $q$, and we want to fit a polynomial locally at the point $x_0$. First, we identify the $q$ data points, denoted by $x_1, \ldots, x_q$, that are nearest to $x_0$. Next, to each of these points we assign a weight
$$w_j = W\!\left(\frac{|x_j - x_0|}{\max_{1 \le k \le q}|x_k - x_0|}\right), \qquad j = 1, \ldots, q,$$
where $W$ is the tricube weight function and the maximum is taken over the points in the neighbourhood. Then we fit a degree-$d$ polynomial, denoted by $p_{x_0}$, to these points using weighted least squares regression. In the following step, the fitted value of $f$ at $x_0$ is given by $\hat{f}(x_0) = p_{x_0}(x_0)$. Finally, the above procedure is repeated for each data point $x_i$, $i = 1, \ldots, N$, with each $f(x_i)$ being estimated by $p_{x_i}(x_i)$.
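To make the procedure concrete, the following R sketch carries out one such local fit with tricube weights and weighted least squares; the function name local_fit and its arguments are ours and purely illustrative, not part of any existing package.

```r
# Minimal sketch of a single local polynomial fit in loess (illustrative only).
# x, y: data vectors; x0: point at which to estimate f; q: neighbourhood size; d: degree.
local_fit <- function(x, y, x0, q, d = 1) {
  dist <- abs(x - x0)
  idx  <- order(dist)[seq_len(q)]           # indices of the q nearest points
  u    <- dist[idx] / max(dist[idx])        # scaled distances in [0, 1]
  w    <- ifelse(u < 1, (1 - u^3)^3, 0)     # tricube weights
  dat  <- data.frame(xx = x[idx], yy = y[idx])
  fit  <- lm(yy ~ poly(xx, degree = d, raw = TRUE), data = dat, weights = w)
  as.numeric(predict(fit, newdata = data.frame(xx = x0)))
}

# Example usage: estimate f at x0 = 0.5 from noisy observations of sin(2*pi*x).
set.seed(1)
x <- seq(0, 1, length.out = 101)
y <- sin(2 * pi * x) + rnorm(101, sd = 0.1)
local_fit(x, y, x0 = 0.5, q = 21, d = 2)
```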
2.2. STL
Seasonal-trend decomposition based on loess, aka STL, is a method of decomposition of time series. Through iterative smoothing of the data, it decomposes a time series into three components: the trend component, the seasonal component, and the remainder component. From a frequency analysis point of view, what STL does is filter out signals of different frequencies; the signal with the lowest frequency is regarded as the trend, the one with intermediate frequency is the seasonal component, and the ones with the highest frequencies are the remainder (also named the noise in STL). There are several advantages to using STL for time series decomposition. First, it only makes weak assumptions about the data-generating process, so it handles a wide class of data. Second, the computation is fast, and it can handle missing data and outliers. Third, prior knowledge about the components can be incorporated into the model.
To apply STL, six parameters need to be specified: the number of outer loops $n_o$, the number of inner loops $n_i$, the number of cycle-subseries $n_p$, the neighbourhood size for seasonal smoothing $n_s$, the neighbourhood size for trend smoothing $n_t$, and the neighbourhood size for seasonal trend smoothing $n_l$. Cleveland et al. [2] recommend the following choices of parameters. $(n_i, n_o) = (2, 0)$ if resistance to outliers is not needed, and $(n_i, n_o) = (1, 5)$ otherwise; $n_p$ depends on the application, for example, 12 would be an appropriate choice for monthly climate data; $n_s$ is specified by the user to incorporate prior knowledge of the regularity of the seasonal pattern, and must be an odd integer $\ge 7$; $n_l = \operatorname{nextodd}(n_p)$ and $n_t = \operatorname{nextodd}\!\left(\frac{1.5\, n_p}{1 - 1.5/n_s}\right)$, where $\operatorname{nextodd}$ is an operator such that $\operatorname{nextodd}(x)$, for any real number $x$, equals the smallest odd integer greater than or equal to $x$. Readers are referred to the original paper for full details on the reasoning behind these choices.
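As a quick reference, the recommended parameter choices summarised above can be computed as in the following R sketch; the helper names (nextodd, stl_params) are ours and purely illustrative.

```r
# Smallest odd integer greater than or equal to x.
nextodd <- function(x) {
  k <- ceiling(x)
  if (k %% 2 == 0) k + 1 else k
}

# Smoothing spans implied by a given period n_p and seasonal span n_s
# (n_s must be an odd integer >= 7), following the choices summarised above.
stl_params <- function(n_p, n_s) {
  list(n_l = nextodd(n_p),
       n_t = nextodd(1.5 * n_p / (1 - 1.5 / n_s)))
}

stl_params(n_p = 12, n_s = 35)   # e.g., monthly data with a fairly regular seasonal pattern
```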
Suppose we have a time series of $N$ data points $Y_1, \ldots, Y_N$, and the parameters of STL have been chosen; then the STL procedure works as follows. First we initialise the robustness weights $\rho_v = 1$ and the trend $T_v^{(0)} = 0$, $v = 1, \ldots, N$. Then we feed them into the inner loop, which does the following three things.
1. Detrend the data: The procedure subtracts the inputted trend from the data, i.e., it computes $Y_v - T_v^{(k)}$.
2. Extract the seasonal component: this involves 7 minor steps. Denoting the detrended data by $Y_v - T_v^{(k)}$, the steps are:
- (a)
Break the detrended series into $n_p$ cycle-subseries.
- (b)
Smooth each cycle-subseries using loess with the degree of the polynomial set to 1 and neighbourhood size set to $n_s$. Note that smoothed values are computed from the position just prior to the first point to the position just after the last point.
- (c)
Combine the smoothed cycle-subseries to get the seasonal component $C_v^{(k+1)}$.
- (d)
Run a moving average filter of length $n_p$ through $C_v^{(k+1)}$ twice.
- (e)
Then a moving average filter of length 3 once. The result is still denoted as $C_v^{(k+1)}$.
- (f)
Extract the seasonal trend vector $L_v^{(k+1)}$ from the smoothed seasonal component vector $C_v^{(k+1)}$, i.e., apply loess smoothing with the degree of the polynomial set to 1 and neighbourhood size set to $n_l$.
- (g)
Detrend the seasonal component, i.e., compute $S_v^{(k+1)} = C_v^{(k+1)} - L_v^{(k+1)}$.
3. Deseasonalise the data and extract a new trend: The deseasonalising is done by subtracting from the data the detrended seasonal component estimated in the last step, i.e., $Y_v - S_v^{(k+1)}$, and the extraction of the new trend is done by performing loess smoothing on the deseasonalised data with the degree of the polynomial set to 1 and neighbourhood size set to $n_t$, giving $T_v^{(k+1)}$.
After a single pass of the inner loop, we get a revised trend and a seasonal component. If the data have not passed through the inner loop $n_i$ times, then we feed the revised trend into the inner loop again. Otherwise, both the revised trend and the seasonal component are fed into the outer loop, which computes the remainder component $R_v = Y_v - T_v - S_v$ and uses it to update the robustness weights $\rho_v$ (points with large remainders receive small weights, so that they have less influence on the loess smoothings in the next round of inner loops).
After a single pass of the outer loop, we get a full decomposition of the data (i.e., the trend, seasonal and remainder components) and revised robustness weights. If the data have not passed through the outer loop $n_o$ times, then we feed everything into a new round of inner loops. Otherwise, the procedure ends and returns the full decomposition. For readers’ convenience, we give in
Figure 1 a schematic representation of the algorithm.
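For readers who prefer code to pseudocode, the decomposition can be reproduced in R with the stats::stl function [41]; the sketch below is only illustrative, and the parameter values shown (e.g., s.window = 35) are placeholders rather than recommendations of this paper.

```r
# Illustrative STL decomposition of a monthly series in R.
fit <- stl(co2,                      # co2: a built-in monthly time series (frequency 12)
           s.window = 35,            # seasonal smoothing span (odd integer >= 7)
           robust   = FALSE)         # TRUE activates the robust outer loop
head(fit$time.series)                # columns: seasonal, trend, remainder
plot(fit)
```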
2.3. Terminology
We will use the following terminology throughout the remaining sections. We first talk about data. We define
the missing dataset to be the observed dataset that has some data missing,
the complete dataset to be the dataset without any parts missing—assuming it actually exists in the first place, and
the imputed dataset to be the dataset we get after applying imputation methods to the missing dataset.
Next, we talk about trends. The trend is the long-term, low-frequency signal obtained by our procedure, which, in simplified form, deseasonalises the data and then smooths the result to remove short-term fluctuations. We define
the complete trend to be the trend estimated with the complete dataset,
the imputed trend to be the trend estimated with the imputed dataset, and
the true trend to be the true underlying trend.
In
Section 3, we relate the imputed trend to the complete trend in terms of imputation errors. In
Section 4, we verify the result through simulations, demonstrating that the imputed trend can approximate the true trend well. Combining the results from both sections, we see a more complete picture of how missing data affects the trend estimate. This helps us identify the situations where imputation-STL procedures can give reliable trend estimates.
3. Error Analysis of STL with Imputations
In this section, we first analyse the errors of the trend estimates from the imputation-STL class of procedures. Then we investigate a particular case, the loess-STL procedure. Lastly, we conclude the section with some remarks.
3.1. Error Bound for the Estimated Trend from an Imputation-STL Procedure
We first define the terms trend error and imputation error, and then we present our results. The trend error is defined to be the mean of the squared differences between the complete trend and the imputed trend. For the complete dataset $\{y_i\}_{i=1}^{N}$, denote the complete trend by $\{T_i\}_{i=1}^{N}$. For the imputed dataset $\{\tilde{y}_i\}_{i=1}^{N}$, denote the corresponding imputed trend by $\{\tilde{T}_i\}_{i=1}^{N}$. Then the trend error $E_T$ is given by
$$E_T = \frac{1}{N}\sum_{i=1}^{N}\bigl(T_i - \tilde{T}_i\bigr)^{2} = \frac{1}{N}\,\bigl\|\mathbf{T} - \tilde{\mathbf{T}}\bigr\|^{2},$$
where $\mathbf{T}$ and $\tilde{\mathbf{T}}$ are $(T_i)$ and $(\tilde{T}_i)$ ($i = 1, \ldots, N$) in vector notation, and $\|\cdot\|$ denotes the Euclidean norm. Similarly, the imputation error is defined to be the mean of the squared differences between the complete dataset and the imputed dataset. With the notation introduced above, the imputation error $E_I$ is given by
$$E_I = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \tilde{y}_i\bigr)^{2} = \frac{1}{N}\,\bigl\|\mathbf{y} - \tilde{\mathbf{y}}\bigr\|^{2},$$
where $\mathbf{y}$ and $\tilde{\mathbf{y}}$ are $(y_i)$ and $(\tilde{y}_i)$ ($i = 1, \ldots, N$) in vector notation.
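In R, the two quantities defined above can be computed directly from the complete and imputed series and their STL trends; the following sketch uses our own hypothetical object names.

```r
# trend_complete, trend_imputed: trends extracted by STL from the complete
# and the imputed series; y_complete, y_imputed: the two datasets themselves.
trend_error      <- mean((trend_complete - trend_imputed)^2)   # E_T
imputation_error <- mean((y_complete    - y_imputed)^2)        # E_I
```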
Our results can be stated plainly as follows. Assuming the settings given in Cleveland et al. [
2], the trend estimated by an imputation-STL procedure has an error bounded above by a constant multiplying the imputation error. This is useful in two ways. First, one can now explicitly assess the trend error with the imputation error. In other words, our result answers the question “How much error do I get in my estimated trend if my imputations are wrong by X amount?”. Second, given a desired accuracy level, our result specifies the amount of imputation error a data set can tolerate. In addition, it also specifies how much improvement the estimate could have when new data points become available. Further discussion is given at the end of this section; now we present the mathematical statement.
Theorem 1. Suppose a time series is circular, and the parameters of STL are chosen according to Section 2.2; then the trend produced by an imputation-STL type of procedure satisfies
$$\frac{1}{N}\,\bigl\|\mathbf{T} - \tilde{\mathbf{T}}\bigr\|^{2} \;\le\; \frac{C_L}{N}\,\bigl\|\mathbf{y} - \tilde{\mathbf{y}}\bigr\|^{2},$$
where $C_L$ is a constant depending only on $L$, with $L$ being the number of inner loops chosen. We first state the theoretical settings and consequences given in Cleveland et al. [
2] and then several lemmas, before we can prove Theorem 1.
Settings: The data $y_1, \ldots, y_N$ are assumed to come from a circular time series $\{y_i\}_{i \in \mathbb{Z}}$ of period length $N$; namely $y_i = y_j$ if $i \equiv j \pmod{N}$, i.e., $i - j$ is divisible by $N$. Also, the parameter choices follow the recommendations given in
Section 2.2.
Consequences: Denote the operator matrices associated with the operations in steps 2 and 3 of
Section 2.2 by $S$ and $T$ respectively. To be clear, $S$ is the $N \times N$ operator matrix that takes the input $\mathbf{y} - \mathbf{T}^{(k)}$ and outputs the seasonal component $\mathbf{S}^{(k+1)}$, and $T$ is the $N \times N$ operator matrix that takes the input $\mathbf{y} - \mathbf{S}^{(k+1)}$ and outputs the revised trend $\mathbf{T}^{(k+1)}$. Given the above and by [40], we have
- (C1)
$\{y_i\}$ being from a circular time series implies that $S$ and $T$ can be augmented to be circulant matrices;
- (C2)
Enforcing the parameter choices in
Section 2.2 implies that all eigenvalues of $S$ and $T$ are inside or on the unit circle, and $S$ and $T$ each have, at most, one eigenvalue on the unit circle.
The following definition and lemmas are taken from pages 104 and 113–114 in [
40], except Lemma 1(a) which follows directly from the usual definition of induced operator norms.
Lemma 1. Let $A$ be an $N \times N$ matrix and consider $A$ as an operator, then
- (a)
$\|A\mathbf{x}\| \le \|A\|_{\mathrm{op}}\,\|\mathbf{x}\|$ for any $\mathbf{x}$, where $\|\cdot\|$ is the Euclidean norm, and $\|\cdot\|_{\mathrm{op}}$ is the induced operator norm;
- (b)
$\|A\|_{\mathrm{op}} = \sqrt{\lambda_{\max}(A^{*}A)}$, where $\lambda_{\max}(A^{*}A)$ denotes the greatest eigenvalue of $A^{*}A$.
Definition 1. An $N \times N$ matrix of the form
$$C = \begin{pmatrix} c_0 & c_1 & c_2 & \cdots & c_{N-1}\\ c_{N-1} & c_0 & c_1 & \cdots & c_{N-2}\\ \vdots & & & \ddots & \vdots\\ c_1 & c_2 & c_3 & \cdots & c_0 \end{pmatrix}$$
is named a circulant matrix. Lemma 2. If $A$ and $B$ are circulant matrices, then
- (a)
$A + B$ and $AB$ are circulant;
- (b)
All eigenvalues of $A$ are given by $\lambda_j(A) = \sum_{k=0}^{N-1} a_k\, e^{2\pi \mathrm{i}\, jk/N}$, $j = 0, 1, \ldots, N-1$, where $(a_0, a_1, \ldots, a_{N-1})$ is the first row of $A$;
- (c)
$\lambda_j(A + B) = \lambda_j(A) + \lambda_j(B)$ and $\lambda_j(AB) = \lambda_j(A)\,\lambda_j(B)$, where $\lambda_j(A)$ is the j-th eigenvalue of $A$, and $\lambda_j(B)$ and $\lambda_j(AB)$ are similarly defined.
Now we present the proof of Theorem 1.
Proof. Denote the complete data by $\mathbf{y}$ and the resultant imputed data by $\tilde{\mathbf{y}}$. Also, we denote the operator matrix of the trend filter of STL after $k$ iterations of the inner loop by $L_k$; the expression for $L_k$ can be shown to be
$$L_k = \left[\sum_{i=0}^{k-1} (TS)^{i}\right] T\,(I - S). \qquad (1)$$
We want to compare the trend extracted from the complete data and the trend extracted from the imputed data. Hence, we compute the mean squared difference between the two trends, $\frac{1}{N}\|L_k\mathbf{y} - L_k\tilde{\mathbf{y}}\|^{2}$. By Lemma 1(a), we have
$$\frac{1}{N}\,\bigl\|L_k(\mathbf{y} - \tilde{\mathbf{y}})\bigr\|^{2} \;\le\; \|L_k\|_{\mathrm{op}}^{2}\;\frac{1}{N}\,\bigl\|\mathbf{y} - \tilde{\mathbf{y}}\bigr\|^{2}. \qquad (2)$$
Now we evaluate $\|L_k\|_{\mathrm{op}}$. First, from (1), we have
$$\lambda_j(L_k) = \lambda_j\!\left(\left[\sum_{i=0}^{k-1}(TS)^{i}\right] T\,(I - S)\right). \qquad (3)$$
Note that $T$, $S$ (by
(C1)) and $I$ are circulant matrices, so Lemma 2(a) implies that $L_k$ is circulant. It then follows from Lemma 2(c) that $\lambda_j(L_k) = \lambda_j\!\left(\sum_{i=0}^{k-1}(TS)^{i}\right)\lambda_j(T)\,\lambda_j(I - S)$. Next, note that $\sum_{i=0}^{k-1}(TS)^{i}$ and $I - S$ are circulant, and denoting the eigenvalues of $T$ by $t_j$'s and the eigenvalues of $S$ by $s_j$'s, we have by Lemma 2(b),
$$\lambda_j\!\left(\sum_{i=0}^{k-1}(TS)^{i}\right) = \sum_{i=0}^{k-1}(t_j s_j)^{i} \quad\text{and}\quad \lambda_j(I - S) = 1 - s_j.$$
Setting $x_j = t_j s_j$, it follows from
(C2) that $|x_j| \le 1$ and hence $\bigl|\sum_{i=0}^{k-1} x_j^{i}\bigr| \le k$. We will need this later to get the bound on $\|L_k\|_{\mathrm{op}}$. Continuing from (3) and applying Lemma 2(c) give
$$\lambda_j(L_k) = \left[\sum_{i=0}^{k-1}(t_j s_j)^{i}\right] t_j\,(1 - s_j),$$
where $t_j$ and $s_j$ are the
j-th eigenvalues of $T$ and $S$ respectively, and $j = 0, 1, \ldots, N-1$. Note that the indexing strictly follows that of Lemma 2(b). Now we take the modulus on both sides to get
$$|\lambda_j(L_k)| = \left|\sum_{i=0}^{k-1}(t_j s_j)^{i}\right|\,|t_j|\,|1 - s_j|.$$
Let
M be the index where $|\lambda_j(L_k)|$ is maximised, i.e., $|\lambda_M(L_k)| = \max_j |\lambda_j(L_k)|$, and let
t and
s be the
M-th eigenvalues of $T$ and $S$, respectively. Then we have
$$\max_j |\lambda_j(L_k)| = \left|\sum_{i=0}^{k-1}(t s)^{i}\right|\,|t|\,|1 - s| \;\le\; k \cdot 1 \cdot 2 \;=\; 2k,$$
where the last inequality above is implied by
(C2).
Finally, by Lemma 1(b) and the result we referenced previously (for a circulant matrix, $L_k^{*}L_k$ is circulant with eigenvalues $|\lambda_j(L_k)|^{2}$), we have $\|L_k\|_{\mathrm{op}} = \max_j |\lambda_j(L_k)| \le 2k$. Substituting this back into (2), we have
$$\frac{1}{N}\,\bigl\|L_k(\mathbf{y} - \tilde{\mathbf{y}})\bigr\|^{2} \;\le\; \frac{4k^{2}}{N}\,\bigl\|\mathbf{y} - \tilde{\mathbf{y}}\bigr\|^{2}.$$
Now the proof is completed noting that $E_T = \frac{1}{N}\|L_k\mathbf{y} - L_k\tilde{\mathbf{y}}\|^{2}$ by definition and $k = L$ with $L$ being the inner loop size in STL, and $E_I = \frac{1}{N}\|\mathbf{y} - \tilde{\mathbf{y}}\|^{2}$.
As a particular case, if we enforce the parameter choice from
Section 2.2, then we have $L = n_i \le 2$ and hence
$$\frac{1}{N}\,\bigl\|\mathbf{T} - \tilde{\mathbf{T}}\bigr\|^{2} \;\le\; \frac{16}{N}\,\bigl\|\mathbf{y} - \tilde{\mathbf{y}}\bigr\|^{2}. \qquad (4)$$
□
With the expression we derived, we continue our discussion of the result. First, we now have an upper bound for the error of the estimated trend in terms of the imputation error. So if the imputation error is known, then we know how large the trend error would be in the worst-case scenario. In practice, however, the imputation error is unknown; we will discuss how to estimate it in the next section. Second, if we examine closely the right-hand side of (4), we see two quantities affecting the upper bound, namely the total squared imputation error $\sum_{i=1}^{N}(y_i - \tilde{y}_i)^{2}$ and the number of data points “originally” available, $N$. They have the following implications:
Since the constant L is fixed beforehand, its influence on the upper bound becomes negligible when N is far larger than L. For the same reason, large imputation errors for a few data points would not cause too much trouble.
The expression suggests that the imputation errors at individual data points can grow, e.g., we might have a faulty machine that goes wrong consistently once in a while and produces missing data which are then imputed with some error. Remarkably, we know precisely how fast the total imputation error can grow before it is no longer possible to keep the trend error small. Expression (4) says that as long as the total squared imputation error $\sum_{i=1}^{N}(y_i - \tilde{y}_i)^{2}$ grows at a rate (strictly) slower than $N$, we will not lose much precision in our estimated trend.
3.2. Error Bound for STL with Loess Imputation
In this section, we investigate the loess-STL procedure, which is a particular case of the imputation-STL class of procedures. We made this choice because smoothing methods are widely used in different branches of statistics. Furthermore, interpolation by smoothing is a common way to handle missing data. The goal of this section is twofold. One, we illustrate concretely how to apply our error-bound results in practice. Two, this prepares us for the simulation studies we conduct in the next section.
Consider the case where data are missing at every other point, i.e., out of the complete data $y_1, y_2, \ldots, y_N$, the observations $y_2, y_4, y_6, \ldots$ are missing. One such example is a monthly mean temperature time series in which the data are missing every second month. We study this case because it corresponds to the worst-case scenario of the dispersing missing pattern. By a dispersing missing data pattern, we mean there are no consecutive missing data points. In other words, the missingness incidences do not cluster and form any gap of size greater than or equal to 2.
We use the same notation as the previous section, i.e., we denote the time series data by $\{y_i\}_{i=1}^{N}$ and the data with imputed values by $\{\tilde{y}_i\}_{i=1}^{N}$. We also write $y_i = f(x_i) + \varepsilon_i$, where $x_i$ can be regarded as the time at which $y_i$ is observed, $i = 1, \ldots, N$. As it only makes sense to apply loess smoothing when the underlying trend is smooth, we assume $f$ to be twice differentiable. For mathematical convenience, we also assume $f''$ to be bounded, i.e., $|f''(x)| \le D$ for some constant $D > 0$. (Further justification is given at the end of this section.)
In the following, we find the imputation error at a point, i.e., we bound the difference between $f(x_i)$ and the imputed value $\tilde{y}_i$ at a missing point $x_i$. First, by Taylor’s expansion theorem, we have
$$f(x_i + h) = f(x_i) + f'(x_i)\,h + \frac{f''(\xi)}{2}\,h^{2}, \qquad (5)$$
where $\xi$ lies between $x_i$ and $x_i + h$, and $h$ is a small increment. Next, since loess is a linear smoother, we can express it in the form of equivalent kernels. We denote the kernel weight associated with a data point $x_j$ by $w_j(x_i)$ when smoothing at $x_i$; then the loess estimator is given by
$$\tilde{y}_i = \sum_{j \in N(x_i)} w_j(x_i)\, y_j, \qquad (6)$$
where $w_j(x_i) \ge 0$, $\sum_{j \in N(x_i)} w_j(x_i) = 1$ and $N(x_i)$ is the set containing the neighbours of $x_i$.
Note that as we have time series data, we know the data points are equally spaced at 1-unit distance in the time domain. This, together with our assumption on the missing data pattern, implies that we have symmetric neighbourhoods and weights for the loess imputation. Hence, the first-order terms of the Taylor expansions cancel in pairs, and we have
$$\bigl|f(x_i) - \tilde{y}_i\bigr| \;\le\; \frac{D}{2}\sum_{j \in N(x_i)} w_j(x_i)\,(x_j - x_i)^{2} \;=\; D \sum_{j \in N^{+}(x_i)} w_j(x_i)\,(x_j - x_i)^{2}, \qquad (7)$$
where $N^{+}(x_i)$ is the one-sided neighbourhood of $x_i$. Now suppose the size of the one-sided neighbourhood of $x_i$ is $l$. Then the kernel weights $w_j(x_i)$ are given by
$$w_j(x_i) \;=\; \frac{W\!\bigl(|x_j - x_i| \,/\, \max_{k \in N(x_i)}|x_k - x_i|\bigr)}{\sum_{m \in N(x_i)} W\!\bigl(|x_m - x_i| \,/\, \max_{k \in N(x_i)}|x_k - x_i|\bigr)}, \qquad (8)$$
where $W$ is the tricube weight function. Inequality (7) cannot be directly applied to the aforementioned complete data, which are recorded at the odd-indexed time points but missing at the even-indexed time points. However, with a slight variation in the proof we obtain a bound of the same form,
$$\bigl|f(x_i) - \tilde{y}_i\bigr| \;\le\; D \sum_{j \in N^{+}(x_i)} w_j(x_i)\,(x_j - x_i)^{2}, \qquad (9)$$
with the weights in (8) now computed over the observed (odd-indexed) neighbours only, and with the assistance of computer algebra software such as Maple 2020 we have shown that the expression on the right-hand side of (9) is asymptotically proportional to $D\,l^{2}$. We have thus found an upper bound for the imputation error at the point $x_i$.
We can use this bound directly as a conservative estimate of the imputation error. To get the total squared imputation error over the whole time series, we square the individual errors and sum over all the missing data points. In our current setting, we have roughly $N/2$ points missing (ignoring odd-even parity). Assuming for simplicity that the neighbourhood size is the same for all points, the total squared imputation error is bounded, asymptotically, by a quantity of order
$$\frac{N}{2}\,\bigl(c\,D\,l^{2}\bigr)^{2},$$
where $c$ is the proportionality constant obtained above. Now we can substitute this back into (4) and get
$$\frac{1}{N}\,\bigl\|\mathbf{T} - \tilde{\mathbf{T}}\bigr\|^{2} \;\lesssim\; \frac{16}{N}\cdot\frac{N}{2}\,\bigl(c\,D\,l^{2}\bigr)^{2} \;=\; 8\,c^{2} D^{2}\, l^{4}.$$
We arrive at the expression that tells us how large the trend error can be when loess-STL is applied. Overall, we illustrated how to apply our result from
Section 3.1 when a particular imputation method is considered.
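As a small numerical illustration (not part of the original derivation), the bound in (7) can be evaluated in R for the every-other-point missing pattern; the helper below uses our own names and simply computes the tricube-weighted sum of squared distances over a one-sided neighbourhood of size l.

```r
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Evaluate D * sum_{j in one-sided nbhd} w_j(x_i) * (x_j - x_i)^2 for a missing point x_i,
# assuming observed neighbours at distances 1, 3, 5, ... on each side (every other point missing).
imputation_bound <- function(l, D) {
  d_one <- seq(1, by = 2, length.out = l)        # one-sided distances to observed neighbours
  d_all <- c(-rev(d_one), d_one)                 # full symmetric neighbourhood
  w     <- tricube(d_all / max(abs(d_all)))      # tricube weights scaled as in (8)
  w     <- w / sum(w)                            # normalised (equivalent-kernel) weights
  D * sum(w[d_all > 0] * d_one^2)                # one-sided form of the bound in (7)
}

imputation_bound(l = 5, D = 0.01)
```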
3.3. Remarks and Some Practical Concerns
How to estimate $D$, the upper bound of $|f''|$? One way is to smooth the data with any choice of smoother, then differentiate the resulting curve twice and take the maximum absolute value to get $D$. There are various packages in R [41] that can handle this, e.g., the fda package. In some cases, eyeballing can give a rough but quick estimate: one simply traces the curve with a tangent line, records the maximum slope and the minimum slope, and then takes the difference.
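As an illustration of the first suggestion, the following R sketch estimates D by smoothing a series with a smoothing spline and evaluating its second derivative; smooth.spline is used here only as one convenient choice of smoother, and the toy data and variable names are ours.

```r
# Toy data: a smooth trend plus noise (placeholder for the observed series).
x <- 1:200
y <- 0.05 * (x / 10)^2 + rnorm(200, sd = 0.3)

# Estimate D = max |f''| by smoothing and differentiating twice.
fit <- smooth.spline(x, y)                # any reasonable smoother could be used instead
d2  <- predict(fit, x, deriv = 2)$y       # second derivative of the fitted curve
D   <- max(abs(d2))
D
```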
What to do if the estimated
D gives a loose bound? If
D gives a loose bound, or if one feels that the bounded-second-derivative condition is too strong, one can redo the derivation with an alternate form of Taylor’s theorem, given by
$$f(x + h) = f(x) + f'(x)\,h + R(h),$$
where $R(h)$ is the remainder such that $R(h)/h \to 0$ as $h \to 0$. Then one can replace $\frac{f''(\xi)}{2}h^{2}$ in the derivation by $R(h)$, which can also be estimated by what we suggested in the previous point. In this case, one may directly evaluate the imputation error and stop at the first inequality in Equation (7).
In the later section, we will see loess smoothing is applied to subseries instead of the whole time series. In that case, the dispersing missing data assumption should hold for each subseries. This is, however, a relaxation rather than a tightening of assumption.
Is a bounded second derivative well justified? It is a reasonable assumption to make unless one expects the trend to change direction with infinite acceleration. But again, if the condition does not hold, or D is too large to be useful, one can still refer to the second point above to get the result.
It turns out from
Section 3.2 that expression (
9) also provides some guidelines on how to pick the parameter of the imputation method considered. In particular, the (half) neighbourhood size
l can be chosen such that the upper bound is tight.
In principle, any missing data pattern can be analysed in the same way as shown in our derivation. The approach we used is quite standard in analysing linear smoothers. See, for example, Hastie and Tibshirani [
42], Fan and Gijbels [
37].
A minor technical remark: the bound $D$ only needs to hold over the domain where the smoothing is performed.
4. Simulation Studies
In this section, we present simulation studies of the loess-STL procedure. Our goals are to verify the theoretical results from
Section 3 and to assess the applicability of loess-STL to real data. The simulation studies are conducted under 40 settings; the settings are detailed in
Section 4.1. For each setting, 10,000 simulations are run. Each simulation consists of three steps. First, we simulate a dataset and remove some proportions of points from it. Second, we apply the loess-STL procedure to impute the missing data and extract a trend from it. Third, we evaluate the estimated trend. These steps are detailed in
Section 4.2. We present the simulation result in
Section 4.3 and its consistency with the theoretical result in
Section 4.4.
4.1. General Settings
The simulation studies are conducted under 40 settings. Each setting specifies how a time series is generated and how many data points are removed to create missing data. The time series is made up of three components: the trend component, the seasonal component and the remainder component. Two approaches are available for simulating these components: the data-based approach and the model-based approach. In the former approach, we extract the components from some real data with STL (using the
stats::stl function in R [
41]) and use them directly as the simulated components; in the latter approach, we simulate the components from a model we specify. We consider the data-based approach because our goal is to assess the loess-STL procedures’ applicability to real data, so we want the simulated data to share similar characteristics as the real ones. On the other hand, we consider the model-based approach to show that loess-STL not only works on a particular dataset, but could also generalise to other situations. Full details of the two approaches are given in the next section.
An outline of the 40 settings and some information about the real dataset we use are given as follows. We consider:
a model-based approach for the trend component,
a model-based or data-based approach for each of the seasonal component (which is of 12-month frequency) and the remainder component, and
ten missing data proportions, ranging from 5% to 50% at a step of 5%.
These give a combination of $2 \times 2 \times 10 = 40$ settings, which is summarised in
Table 1.
For the data-based approach, we consider the Antarctic upper air temperature data which were obtained from the Integrated Global Radiosonde Archive (IGRA) available at
https://www.ncei.noaa.gov/products/weather-balloon/integrated-global-radiosonde-archive (accessed on 23 December 2022). The IGRA is a comprehensive radiosonde dataset which consists of radiosonde and pilot balloon observations from more than 2800 globally distributed stations. Observations are available at standard and variable pressure levels; meteorological variables include pressure, temperature, geopotential height, relative humidity, dew point depression, and wind direction and speed. For this study, we select the temperature data from 22 Antarctic stations at 16 standard pressure levels for a period of 50 years. Radiosonde observations are usually performed twice a day, at noon and midnight respectively.
The IGRA radiosonde data have undergone quality assurance procedures, including, most notably, the basic checks on the elapsed time and relative humidity as well as an improved selection of a single surface level within soundings in which multiple levels are identified as surface; further information could be found at
https://www.ncei.noaa.gov/data/integrated-global-radiosonde-archive/doc/igra2-readme.txt (accessed on 23 December 2022).
Out of all the noon and midnight time series that would have been fully observed under ideal conditions, only 526 are available, each having at least one observation but possibly many missing values. As the quantity of interest is the macro movement of the temperature data, we aggregate the data using averaging to get monthly noon or midnight data; this also helps reduce the noise (i.e., the local fluctuations) in the data. Note that during the aggregation, a monthly average noon (or midnight) temperature will be missing only when there were no radiosonde observations at all across all noons (or midnights) of that month.
In the next section, we detail the simulation procedure.
4.2. Details of Simulation Studies
For each of the 40 settings, 10,000 simulations are run. Each simulation consists of three steps. First, we simulate an artificial monthly mean temperature time series and remove some proportions of data from it. Second, we apply the loess-STL procedure to impute the missing data and extract a trend from it. Third, we evaluate the estimated trend. Details are given in the remainder of this section.
4.2.1. Simulating a Dataset to Have Missing Data
To simulate a time series, we first search through the 526 time series in the real dataset and collect those with no missing data. Then we sample one time series at random from this pool of ‘perfect’ time series and apply STL to extract the trend, seasonal (of 12-month frequency) and remainder components, denoted by $T_t$, $S_t$ and $R_t$ respectively. These components are used directly as the simulated components in the data-based approach, or they are used to estimate the model parameters in the model-based approach, which we detail in the following.
Trend. We generate a piece-wise linear function and then apply loess smoothing to give a smooth trend. The number of pieces follows a discrete uniform distribution. Each piece occupies the same amount of time (except that the last piece may contain extra points when exact division is not possible). Each slope is sampled from a normal distribution $N(\hat{\mu}, \hat{\sigma}^{2})$, where the parameters $\hat{\mu}$ and $\hat{\sigma}$ are estimated by the method of moments from the month-to-month changes of the extracted trend component $T_t$, with $n$ being the size of the corresponding monthly mean temperature time series used in the current simulation. As for the intercept, the first piece starts at 0, and the subsequent pieces start where the previous pieces end.
Season. We use a sine curve with random magnification at the stationary points. The size of magnification follows a uniform distribution, U(0.8,1.2). To ensure the component is smooth, the points in the neighbourhood of the stationary points are scaled by the same factor; the radius of the neighbourhood is set to be a quarter of the wavelength of the sine curve.
Remainder. We use the normal distribution $N(\hat{\mu}_R, \hat{\sigma}_R^{2})$, where $\hat{\mu}_R$ and $\hat{\sigma}_R$ are estimated by matching the moments of the remainder component $R_t$ of the sampled dataset.
After a time series is simulated, as above, we randomly remove from it a proportion of points assuming an equal chance of removal for each point. The missing proportions we consider are 5% to 50% at a step of 5%.
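A minimal R sketch of the trend simulation and the random removal step might look as follows; the distribution parameters shown (number of pieces, slope scale, missing proportion) are placeholders, not the values used in the paper.

```r
set.seed(42)
n_months <- 498

# Piece-wise linear trend, then loess smoothing (placeholder parameter values).
n_pieces <- sample(2:6, 1)                               # number of linear pieces
breaks   <- floor(seq(1, n_months + 1, length.out = n_pieces + 1))
slopes   <- rnorm(n_pieces, mean = 0, sd = 0.01)
trend    <- numeric(n_months); level <- 0
for (p in seq_len(n_pieces)) {
  idx        <- breaks[p]:(breaks[p + 1] - 1)
  trend[idx] <- level + slopes[p] * seq_along(idx)
  level      <- trend[max(idx)]
}
trend <- predict(loess(trend ~ seq_len(n_months), span = 0.3))  # smooth the kinks

# Remove a proportion of points completely at random to create missing data.
prop_missing <- 0.30
y            <- trend                                    # seasonal + remainder omitted here
miss_idx     <- sample(n_months, size = round(prop_missing * n_months))
y[miss_idx]  <- NA
```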
4.2.2. Loess-STL
Once the time series with missing data is ready, we proceed to impute the missing data with our procedure, i.e., we apply loess smoothing (using a neighbourhood size = 0.75 × No. of available points; the choice of 0.75 follows the default span of loess in R [41]) to the cycle-subseries of the time series and interpolate the missing points. By cycle-subseries, we mean the subseries formed by partitioning the series according to the cycle implied by the research context. For example, for the artificial monthly temperature data simulated here, we partition the data according to months to form 12 subseries. The first series contains the January temperature data over the years; the second series contains the February temperature data over the years, and so on. We consider the imputation as successful if all the missing points are imputed, and as a failure if there is any point that cannot be imputed, owing to, for example, an insufficient number of data points. A sketch of this imputation step in R is given below.
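The following sketch illustrates the cycle-subseries imputation described above; it assumes a monthly series y with NAs (as simulated earlier) and uses our own helper names.

```r
# Impute missing values month by month using loess on each cycle-subseries.
impute_subseries <- function(y, frequency = 12, span = 0.75) {
  for (m in seq_len(frequency)) {
    idx <- seq(m, length(y), by = frequency)       # positions of one cycle-subseries
    sub <- y[idx]
    obs <- which(!is.na(sub))
    mis <- which(is.na(sub))
    if (length(mis) > 0 && length(obs) >= 4) {     # need enough points to fit
      fit      <- loess(sub[obs] ~ obs, span = span, degree = 1,
                        control = loess.control(surface = "direct"))
      sub[mis] <- predict(fit, newdata = data.frame(obs = mis))
    }
    y[idx] <- sub
  }
  y
}

y_imputed <- impute_subseries(y)    # y: the monthly series with NAs from the previous sketch
```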
Next, we apply STL to extract a trend from the imputed dataset. Six parameters need to be specified. Five of them can be chosen automatically as given in
Section 2.2. The only one left is the neighbourhood size for seasonal smoothing, $n_s$. $n_s$ is a tunable parameter that incorporates expert knowledge about the seasonal component into the analysis, but this flexibility opens up a gap to be filled when little is known about the data. We suggest two ways to choose $n_s$ in such a situation. The first is to note that choosing $n_s$ is related to the bias-variance trade-off in finding the best curve that describes the seasonal effect. Hence, one can choose the value that minimises a smoothness-penalised least squares error. This approach makes sense theoretically but requires a large effort to implement. The second way is more ad hoc. Noting that STL uses loess to smooth the data and loess in R uses a span of 0.75 by default, one can set the seasonal span to 0.75 times the length of the cycle-subseries. Admittedly, the value of 0.75 is somewhat arbitrary; the general idea is to avoid extreme cases like 0 and 1 when one does not have much information. For our simulation, we suggested using this ad hoc choice, reflecting that no strong preference is given to either of the extremes 0 and 1 over the other.
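Putting the two steps together, a loess-STL run on the imputed series might look as follows in R; the span-based choice of s.window mirrors the ad hoc suggestion above and should be treated as illustrative.

```r
# Apply STL to the imputed monthly series and extract the trend.
y_ts  <- ts(y_imputed, frequency = 12)
n_sub <- ceiling(length(y_imputed) / 12)            # length of each cycle-subseries (in years)
s_win <- max(7, 2 * floor(0.75 * n_sub / 2) + 1)     # odd integer near 0.75 * subseries length
fit   <- stl(y_ts, s.window = s_win, robust = FALSE)
trend_imputed <- as.numeric(fit$time.series[, "trend"])
```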
4.2.3. Evaluation Measures for the Estimated Trend
For each simulation, we compute two measures to evaluate the estimated trend. The first quantity is the mean squared difference between the complete trend and the imputed trend. We refer to it as the trend error $E_T$. For the second quantity, we first perform an ordinary least squares (OLS) fit of the complete trend and of the imputed trend against time, and then we take the modulus of the difference of the two slope coefficients. We refer to it as the slope error. Note that the slopes express the average (temperature) change per unit of time (month) over the entire timeframe. We consider this quantity because it is frequently used in the context of climate data (see, for example, Turner, Lachlan-Cope, Colwell, Marshall and Connolley [
43]; Zhang [
44]; Steig, Schneider, Rutherford, Mann, Comiso and Shindell [
45]).
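The slope error can be computed with a plain OLS fit in R; trend_complete and trend_imputed below are the hypothetical object names used in the earlier sketches.

```r
# Slope error: absolute difference between the OLS slopes of the two trends over time.
time_idx    <- seq_along(trend_complete)
slope_error <- abs(coef(lm(trend_complete ~ time_idx))["time_idx"] -
                   coef(lm(trend_imputed  ~ time_idx))["time_idx"])
```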
4.3. Results of Simulation Studies
For each of the 40 simulation settings (4 configurations × 10 missing proportions), 10,000 simulations are run; we summarise the results using boxplots in
Figure 2 and
Figure 3. The configurations referred to in the figures are given in
Table 1.
In
Figure 2 we present boxplots of the trend errors (defined in
Section 4.2.3) against the different proportions of missing data under the four configurations. In addition, we label the averages with grey squares. In the following, we summarise our findings.
The average, the median, the interquartile range and the maximum/minimum (excluding the outliers) of the trend errors in each configuration show a near-linear increasing pattern as the proportion of missing points increases. Similar patterns are observed over the four configurations.
Averages of the trend errors are well below 0.2 squared degrees Celsius over all the settings, with the maximum occurring at the 50% missingness setting. At this 50% missingness level, the trend errors only go as high as 0.181 for the average, 0.718 for the maximum (including outliers), and 0.110 for the interquartile range. The results are satisfactory given that the amount of missing data is substantial.
At the 50% missingness level, a few outliers show abnormally large errors in configurations 2 and 4. Two facts contribute to this phenomenon. First, the missingness proportion 50% is a critical point beyond which a dispersing missing-data pattern cannot form. At the critical level, large gaps can sometimes form; this prevents loess-STL from extracting trends accurately and gives rise to large errors. Second, both configurations 2 and 4 use the data-based approach to generate the seasonal component. These seasonal components are relatively more irregular, so it is harder to get an accurate estimate. The large missingness proportion further worsens the situation, leading to large errors in the extracted trend.
Relating to the true trend, in all 40 settings, the 10,000 mean squared differences between the (smoothed) complete trend and the true trend have an average of less than 0.048 and a maximum of less than 0.34. The corresponding figures for the 10,000 mean squared differences between the (smoothed) imputed trend and the true trend are 0.075 and 0.485 respectively. The first two statistics suggest STL can produce reliable trend estimates while the last two statistics suggest when loess imputation is used, STL can be robust against missing data.
Figure 4 gives a visual impression of the comparison between the complete trend, the imputed trend, and the true trend for a randomly simulated time series of length 498 months using Configuration 2 and with a random removal of 40% of the data.
As a supplementary note, the 95%-quantiles of the trend errors are below 0.324 in all 40 settings. Moreover, in each setting the 95%-quantile of the 10,000 trend errors is less than 0.6 times the corresponding maximum error. This suggests that the typical case is generally much better than the worst case.
In
Figure 3, we present boxplots of slope errors (defined in
Section 4.2.3) against the different proportions of missing data under the four configurations. Again, we label the averages using grey squares. Our findings can be summarised as follows:
The average, the median, the interquartile range and the maximum/minimum (excluding the outliers) of the slope errors show a near-linear increasing pattern as the proportion of missing points increases. Similar patterns are observed over the four configurations.
Averages of the slope errors are below 0.001 over all the settings. At the 50% missingness level, the slope errors only go as high as 0.00071 for the average, 0.00385 for the maximum (including outliers), and 0.00074 for the interquartile range.
As a supplementary note, the 95%-quantiles of the slope errors are below 0.00172 in all 40 settings. In each setting, the 95%-quantile of the 10,000 slope errors are less than 0.54 times the corresponding maximum error, again suggesting that the estimate one typically gets is generally much better than that in the worst-case scenario.
4.4. Consistency with Theoretical Results
Our simulation studies have shown that applying loess-STL to incomplete data (up to 50% missing) can produce trend estimates that are close to the ones from applying STL to complete data. Now we relate these results to the theoretical results from
Section 3. We first compute theoretical upper bounds for the trend errors, then we check if the bounds actually hold. Since we know the original data in the simulation studies (as we generated them), we can directly compute the imputation errors and apply Equation (
4) to find the theoretical bounds. The maximum of the trend errors (over 10,000 simulations in each setting) and the theoretical bounds are given in
Table 2 and
Table 3 respectively.
Comparing
Table 2 and
Table 3, we see that the theoretical bounds are effective, showing consistency of the numerical results with our theoretical results. However, we remark that these bounds inevitably become loose as the proportion of missing data gets large. This is because the imputation error only gives information about the mean squared individual imputation errors but not the signs of the individual errors. Thus, in the worst-case scenario where all the individual imputation errors have the same sign, the trend extracted with STL can indeed have a large bias. The bias is more pronounced when the proportions of missing data are large, therefore, the bounds for those proportions are loose.
5. Application
Deriving an accurate trend in meteorological data (e.g., temperature) is important for the detection and attribution of climate change. To derive plausible trends, long-term time series, preferably over several decades, are used. However, these time series often have missing data (e.g., due to failure of instruments) which impact the accuracy of the estimated trend. In remote areas like the Arctic and the Antarctic, the proportion of missing data could be particularly large, but at the same time, accurate trend estimates over these areas are of great importance. For example, the response of the Arctic to global warming is one of the major indicators of climate change, cf. IPCC [
46].
In this section, we apply the loess-STL procedure to the Antarctic temperature data introduced in
Section 4.1. In particular, we apply the loess-STL procedure to the midnight temperature time series collected at the Novolazaravskaja station at 8 different pressure levels from October 1969 to March 2011. We made this choice because the missing data in these time series show a high degree of dispersedness, hence they fulfil the assumptions we proposed. In
Figure 5, we show as an illustration the monthly average midnight temperature time series at the 150 hPa pressure level over Novolazaravskaja station before and after imputation. The circles are the original data points, and the rhombuses are the imputed data points. The time series is 498 months long and has 42 data points missing, which is equivalent to a missing proportion of 8.4%. Upon imputation, STL is applied to extract the trend from the time series. We do this for each of the 8 time series collected at the 8 pressure levels and generate a profile plot displayed in
Figure 6. In the figure, the bars and the dots represent the average temperature change per decade (in degrees Celsius) at the corresponding pressure level. As a reminder, the average change is the slope coefficient (multiplied by 120 months) of the OLS line fitted to the extracted trend. The standard errors are provided using error bars to represent approximately 95% confidence intervals, and a smoothed line is plotted to show the dynamics of the average temperature change over the different pressure levels.
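As a sketch of how each bar and its error bar can be obtained from an extracted trend (using our own object names), one may fit an OLS line to the trend and scale the monthly slope to a decadal change:

```r
# trend_imputed: the STL trend (one value per month) at a given pressure level.
month <- seq_along(trend_imputed)
fit   <- lm(trend_imputed ~ month)
slope <- coef(summary(fit))["month", "Estimate"]         # change per month
se    <- coef(summary(fit))["month", "Std. Error"]

change_per_decade <- 120 * slope                          # degrees Celsius per decade
ci_95 <- change_per_decade + c(-1, 1) * 1.96 * 120 * se   # approximate 95% confidence interval
```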
Figure 6 confirms the climatologists’ understanding that radiosonde observations over Antarctica show warming in the lower troposphere between 850 and 400 hPa, and strong cooling in the upper troposphere between 250 and 150 hPa, over the past 5 decades.