1. Introduction
Since the beginning of the space age, the possibility of using orbiting “observers” able to gather information about the Earth has stimulated the interest of scientific institutions, governments and the military. Among the plethora of data that can be gathered by satellite sensor arrays, Synthetic Aperture Radar (SAR) images and their advanced interferometric processing (InSAR) stand out as some of the most valuable resources for monitoring natural geohazards [1]. This has been made possible by the significant leap in data transmission technologies in recent years, coupled with an increase in computing power, storage capacity and sensor availability, which has substantially expanded the potential applications of satellite imaging data.
The use of interferometric techniques allows the extraction of topographic and kinematic information about the measured ground surface by comparing the phases of successive images, achieving millimeter accuracy against independent geodetic monitoring data (e.g., [2,3,4,5]). As highlighted in the review article by Bernardi et al. [6], there are numerous ongoing efforts to apply conventional statistical frameworks, like time-series analysis and geostatistics, to the output products of InSAR processing. However, the statistical analysis of these products is still in its infancy. The goal of this work is to apply advanced statistical techniques to the analysis of InSAR data to enhance the practical use of these data. In particular, we aim to tackle the problem of detecting early warning signals for geohazards.
In the last decade, various works have addressed the problem of detecting anomalous patterns in time series of ground motions obtained by InSAR processing. Berti et al. [7] classified time series of ground displacements into classes by statistically comparing them to reference displacement patterns selected by experts. Chang and Hanssen [8] proposed a similar approach based on a multiple hypothesis testing procedure to compare the time series to a comprehensive set of alternative models built from a library of canonical kinematic models. Li et al. [9] proposed an approach, based on machine learning algorithms, to classify the time series into five categories. Cigna et al. [10] defined two deviation indexes that assess, respectively, the change in the velocity and/or acceleration and the magnitude of the discontinuity, assuming a linear trend; Tapete and Casagli [11] applied one of the two deviation indexes developed by Cigna et al. [10] across the whole duration of deformation time series to detect trend variations not only in correspondence with specific time periods, but also at other epochs that may not have been known a priori. Notti et al. [12] proposed a methodology for the analysis of deformation time series including the approaches of Berti et al. [7] and Cigna et al. [10] to investigate landslides and land subsidence processes.
All the mentioned approaches are based on an a priori selection of a prespecified model for the trend. The current literature presents a gap in the application of model-free approaches based on advanced statistical methods. The current study aims at filling this gap by applying state-of-the-art statistical techniques. Indeed, the proposed approach does not rely on prior assumptions about the expected motion trend and the shape of the anomaly, and it is, therefore, able to identify anomalies presenting any kind of deviation from the regular trend, by setting as a reference the nearby curves, thus using the data themselves to define what is the “normal” behavior of the displacement evolution. Moreover, the aforementioned works use the spatial distribution of the data only for representation and interpretation of the results. In the proposed approach, the spatial dependence of the data is included in the analysis. To the best of our knowledge, there has yet to be an attempt to integrate temporal and spatial dynamics, or to utilize more sophisticated statistical methods, to extract new insights from these data.
Our aim is to develop an efficient statistically based tool capable of providing key information to feed into geohazard monitoring and early warning systems, based on a Functional Data Analysis (FDA) framework [13]. This statistical methodology is particularly suited for the problem considered, as it appropriately accounts for the intrinsic regular nature of the phenomenon under study. Indeed, geological motions follow a continuous trajectory in time (although sometimes abrupt, as in the case of paroxysmal events) with a degree of regularity given by the forces involved. The proposed technique relies on the assumption that the InSAR data used for the analysis carry enough information to characterize the phenomenon considered, with enough geographical locations to understand its spatial distribution and enough time points to extract its temporal evolution. The proposed method is here applied to an illustrative case (the detection of precursors to a mud volcano eruption in Sicily), but its general formulation makes it suitable for the detection of early warning signals of other volcano eruptions and, more generally, of a wide range of geological events. For example, Moro et al. [14] illustrated the possibility of detecting seismic precursor signals from InSAR data in the case of the 2009 L’Aquila earthquake.
In the eastern part of Caltanissetta, a city on the island of Sicily (Italy), lies the village of Santa Barbara. This village was the site of a significant event: a mud volcano eruption, which we use as a test case in our study. The eruption, which occurred on 11 August 2008, was of such intensity that it caused damage to urban infrastructure up to 2 km away from the main eruptive vent. For a more comprehensive understanding of the event and its geological characteristics, one can refer to Cigna et al. [10] and Madonia et al. [15]. The dataset exploited in this work is generated using a well-established multi-temporal InSAR processing workflow, and the initial results were briefly presented in Fontana et al. [16], where they also underwent a preliminary and exploratory analysis based on FDA. While Fontana et al. [16] already suggested the possibility of using FDA for the analysis of InSAR data, by applying a functional clustering algorithm and proposing a forecasting technique, in this work we innovate by exploiting the FDA framework to tackle the specific problem of early warning detection, and we present a fully fledged algorithm able to detect precursors to the paroxysmal event in the considered application.
2. Materials
The dataset analyzed in this work, already described in Fontana et al. [16], is obtained from 32 ENVISAT Advanced SAR images along ascending track T172. The period of acquisition starts on 12 October 2002 and ends on 7 June 2008, which is the last acquisition date before the eruption of the mud volcano. The data are acquired in C-band (5.6 cm wavelength, 5.3 GHz frequency) and they are characterized by a Line-Of-Sight (LOS) with ∼23° look angle and VV co-polarization. The ground resolution is ∼20 m and the nominal site revisit is 35 days. The algorithm used for InSAR processing is the well-established Small Baseline Subset (SBAS) technique [17,18], a robust multi-temporal InSAR implementation that was originally developed in 2002 and has been widely exploited by the scientific community for a number of geohazard applications at local, regional, national and continental scales (e.g., [19,20,21,22]). In the resulting dataset, 1735 coherent targets are retained and their geographical distribution covers an area of 150 km². The algorithm estimated, for each target, the annual LOS velocity, the time series of LOS displacements, the temporal coherence and the elevation above the reference ellipsoid. The algorithm proposed in Section 3 is applied to the displacement time series. The geographical locations of the coherent targets are represented in Figure 1, and the corresponding time series of LOS displacements are shown in Figure 2. The precision of the dataset, i.e., the standard error, is on average
mm/year for the velocity estimates and
mm for each displacement record, across the whole processed area. These values depend on the quality of the scatterers (e.g., their coherence) and the number of available observations (i.e., images and small baseline interferograms) [23]. The dataset shows a generally stable scenario across the processed area, with LOS displacement velocities between
mm/year for over
of the coherent targets. Some ground displacement velocity peaks of up to
and
mm/year (in the direction away from and towards the sensor, respectively) can be observed (
Figure 1). These correspond to maximum cumulative values of
to
mm LOS displacement over the 2002–2008 period. The ground deformation scenario in Caltanissetta for the time period between 2002 and 2005 was described by Vallone et al. [24]. Moreover, the specific event of the mud volcano eruption, although on a different dataset than the one analyzed in this work, was considered in Cigna et al. [10], where Deviation Indices were computed to semi-automatically identify changes in InSAR time series.
The displacement data considered can be thought of as a discrete sampling (at the time instants of SAR acquisitions) of a continuous phenomenon (the ground deformation). The physical constraints of the deformation dynamics suggest a degree of smoothness of the phenomenon under study. Indeed, the first and second derivatives of the trajectories represent, respectively, the velocity and the acceleration profiles of the displacement. Moreover, we assume the presence of an additive error induced by the measuring process and the InSAR processing. As already explored by Fontana et al. [16], FDA is a suitable framework to properly account for the particular features of the data considered. Indeed, FDA is the branch of statistics whose statistical units are smooth functions of a continuous variable (e.g., time, space or frequency).
3. Methods
We assume an operational scenario of real-time monitoring of an area where an event is expected to happen or has already started to occur: whenever a new satellite image is collected, it is ingested into the processing chain and converted into new information about the ongoing deformation. Apart from the time needed to downlink the image to the ground station and to deliver it through the image provider's ground segment (which is kept short by design for satellite missions intended to provide imagery in emergency contexts), this scenario approaches “real-time” performance and is therefore valuable for civil protection purposes during emergencies.
We select a calibration period of 1 year, using the six scenes from 12 October 2002 to 6 December 2003. This “burn-in” is required by the SBAS procedure to produce reliable displacement series. We then proceed to analyze each subsequent scene, namely from 6 December 2003 to 7 June 2008.
More specifically, let $x_{s}(t)$ be the generic deformation curve, observed on the temporal domain $T$ at location $s = (\mathrm{lat}, \mathrm{lon})$, where $\mathrm{lat}$ and $\mathrm{lon}$ are, respectively, the latitude and the longitude of the point, and let $\bar{t} \in$ [6 December 2003, 7 June 2008] be each time instant after the calibration period. The analysis performed at a given $\bar{t}$ considers only the data in the specific period of time $T_{\bar{t}}$ defined as [12 October 2002, $\bar{t}$]. A representation of the measured points overlaid on a map of the area of interest is available in Figure 1, while their time dynamics can be seen in Figure 2.
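The sequential scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the function name `analysis_windows` is hypothetical, and the toy acquisition dates assume a perfectly regular 35-day revisit, whereas the real series contains gaps.

```python
from datetime import date, timedelta

def analysis_windows(acq_dates, calibration_end):
    """Yield (t_bar, visible_dates) for each scene acquired on or after the
    end of the calibration period. Only acquisitions up to t_bar are visible,
    mimicking what would be known in an operational setting."""
    acq_dates = sorted(acq_dates)
    for t_bar in acq_dates:
        if t_bar >= calibration_end:
            yield t_bar, [d for d in acq_dates if d <= t_bar]

# Toy example (hypothetical dates): one acquisition every 35 days.
start = date(2002, 10, 12)
toy_dates = [start + timedelta(days=35 * i) for i in range(20)]
calibration_end = date(2003, 12, 6)
windows = list(analysis_windows(toy_dates, calibration_end))
```

Each element of `windows` pairs an analysis date with the acquisitions available at that date, so any downstream step can be run without looking into the future.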
The method identifies anomalies by comparing the time series of locations near the point to be monitored. Therefore, we restrict our attention to deformation curves measured in geographical positions that are close enough (in geographical terms) to the area in which the mud volcano is located. In more mathematical terms, with $s_{0}$ being the position of the mud volcano, we restrict our analysis to those geographical positions $s$ for which $d(s, s_{0}) \le r$, where $r$ is a distance threshold and $d(\cdot, \cdot)$ is the geodetic distance between two geographical locations. To compute the geodetic distance, we use the inverse method by Vincenty [25] on the WGS-84 ellipsoid.
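As an illustration of this spatial restriction, the sketch below implements Vincenty's inverse formula on the WGS-84 ellipsoid in plain Python and uses it to select the targets within a given distance of a reference point. The function names `vincenty_inverse` and `within_radius`, as well as the toy coordinates, are assumptions made for this example and are not taken from the study.

```python
import math

def vincenty_inverse(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Geodesic distance in metres between two points given in degrees,
    using Vincenty's inverse method on the WGS-84 ellipsoid."""
    a = 6378137.0                 # WGS-84 semi-major axis (m)
    f = 1 / 298.257223563         # WGS-84 flattening
    b = (1 - f) * a               # semi-minor axis (m)
    U1 = math.atan((1 - f) * math.tan(math.radians(lat1)))  # reduced latitudes
    U2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    L = math.radians(lon2 - lon1)
    sinU1, cosU1 = math.sin(U1), math.cos(U1)
    sinU2, cosU2 = math.sin(U2), math.cos(U2)
    lam = L
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cosU2 * sin_lam,
                               cosU1 * sinU2 - sinU1 * cosU2 * cos_lam)
        if sin_sigma == 0:
            return 0.0            # coincident points
        cos_sigma = sinU1 * sinU2 + cosU1 * cosU2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cosU1 * cosU2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # cos2_alpha is zero only for equatorial lines, where the term vanishes
        cos_2sm = (cos_sigma - 2 * sinU1 * sinU2 / cos2_alpha
                   if cos2_alpha != 0 else 0.0)
        C = f / 16 * cos2_alpha * (4 + f * (4 - 3 * cos2_alpha))
        lam_new = L + (1 - C) * f * sin_alpha * (
            sigma + C * sin_sigma * (
                cos_2sm + C * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        raise ValueError("Vincenty iteration did not converge")
    u2 = cos2_alpha * (a * a - b * b) / (b * b)
    A = 1 + u2 / 16384 * (4096 + u2 * (-768 + u2 * (320 - 175 * u2)))
    B = u2 / 1024 * (256 + u2 * (-128 + u2 * (74 - 47 * u2)))
    d_sigma = B * sin_sigma * (cos_2sm + B / 4 * (
        cos_sigma * (-1 + 2 * cos_2sm ** 2)
        - B / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2)
        * (-3 + 4 * cos_2sm ** 2)))
    return b * A * (sigma - d_sigma)

def within_radius(points, center, radius_m):
    """Indices of the (lat, lon) points lying within radius_m of center."""
    return [i for i, (lat, lon) in enumerate(points)
            if vincenty_inverse(lat, lon, center[0], center[1]) <= radius_m]

# Toy example with hypothetical coordinates (not the real targets).
center = (37.49, 14.09)
points = [(37.495, 14.09), (37.51, 14.09), (37.49, 14.095)]
kept = within_radius(points, center, 750.0)
```

For the short distances involved here a spherical approximation would give nearly the same selection; the ellipsoidal formula is shown because it is the one named in the text.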
To filter out the measurement error and recover the smooth structure of our subset of displacement curves, we employ a smoothing procedure based on a B-spline basis of order 6 (i.e., degree 5). This choice ensures that the acceleration curves, i.e., the second derivatives of the displacement, are cubic B-splines. The smoothing parameter is then selected in a data-driven way by minimizing the generalized cross-validation error, as is commonly done in the FDA realm [13].
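For reference, a common formulation of this step is the penalized least-squares criterion below, followed by the generalized cross-validation (GCV) score used to pick the smoothing parameter. The notation is introduced here for illustration: $y_j$ are the $m$ displacement records of one target at times $t_j$, $\lambda$ is the smoothing parameter, $S_\lambda$ is the smoothing (hat) matrix, and $q$ is the order of the penalized derivative. The value of $q$ is not stated in the text and is left generic.

```latex
\hat{x}_\lambda = \arg\min_{x} \; \sum_{j=1}^{m} \bigl[ y_j - x(t_j) \bigr]^2
  + \lambda \int_{T} \bigl[ D^{q} x(t) \bigr]^2 \, dt ,
\qquad
\mathrm{GCV}(\lambda) =
  \frac{m}{m - \mathrm{df}(\lambda)} \cdot
  \frac{\mathrm{SSE}(\lambda)}{m - \mathrm{df}(\lambda)} ,
\qquad
\mathrm{df}(\lambda) = \operatorname{trace}(S_\lambda),
\quad
\mathrm{SSE}(\lambda) = \sum_{j=1}^{m} \bigl[ y_j - \hat{x}_\lambda(t_j) \bigr]^2 .
```

Larger values of $\lambda$ reduce the effective degrees of freedom $\mathrm{df}(\lambda)$ and give smoother curves; the selected $\lambda$ is the one minimizing $\mathrm{GCV}(\lambda)$ over a grid of candidates.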
After the smoothing, to focus our attention on recent variations in displacement, we restrict, for each analysis date $\bar{t}$, the analysis to $T_{\bar{t},k}$, where $T_{\bar{t},k} = [\bar{t} - k, \bar{t}]$, for a fixed parameter $k$. In other words, for each date after the calibration period in which a scene was acquired, we focus our attention on the portion of the deformation curves covering the $k$ days that precede the time point $\bar{t}$. This parameter represents the “memory” of our method. To have data points in the period $T_{\bar{t},k}$ also for the first time instants considered, the calibration period should have a length greater than or equal to $k$.
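Moreover, since we are interested in relative differences in displacement, we normalize each smoothed displacement curve $x_{s}(t)$ in the following way:
$$\tilde{x}_{s}(t) = x_{s}(t) - x_{s}(\bar{t} - k), \qquad t \in [\bar{t} - k, \bar{t}].$$
In other words, we shift each curve vertically so that its value at $\bar{t} - k$, the first day of the time window, is zero.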
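The windowing and re-centering steps can be illustrated on discretely sampled curves as follows. This is a simplified sketch: the helper name `window_and_recenter` is hypothetical, times are expressed in days since the first acquisition, and the first sample inside the window stands in for the value of the smooth curve at the start of the window.

```python
def window_and_recenter(times, values, t_bar, k):
    """Keep the samples with t_bar - k <= t <= t_bar and shift the curve
    vertically so that the first retained value is zero."""
    kept = [(t, v) for t, v in zip(times, values) if t_bar - k <= t <= t_bar]
    if not kept:
        raise ValueError("no samples fall inside the window")
    v0 = kept[0][1]
    return [t for t, _ in kept], [v - v0 for _, v in kept]

# Toy curve (hypothetical values): one sample every 35 days, linear trend.
times = [35 * i for i in range(13)]
values = [0.5 * i for i in range(13)]
win_t, win_v = window_and_recenter(times, values, t_bar=420, k=365)
```

With a smoothed functional representation one would instead evaluate the fitted curve exactly at the start of the window, so no approximation would be needed.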
We now want to determine, for each $\bar{t}$, which of the normalized curves are outlying with respect to the other curves considered. We use the functional generalization of the classical boxplot [26]. The functional boxplot revolves around the concept of data depth: namely, a method for multi- or infinite-dimensional data that allows the establishment of a natural ordering between points that are deep in the data cloud and points that are shallow. In this specific case, the depth considered is the Modified Band Depth (MBD) by Lopez-Pintado and Romo [27]. Following Liu et al. [28] and Sun and Genton [26], we define the median function as
$$x_{[1]}(t),$$
where $x_{[r]}(t)$, for $r = 1, \dots, n$, is the $r$-th deepest curve according to MBD, so $x_{[1]}(t)$ is the deepest curve and $x_{[n]}(t)$ the shallowest.
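The 50%-central region, i.e., the functional equivalent of the univariate interquartile range, is, with $x_{[r]}(t)$ denoting the $r$-th deepest curve,
$$C_{0.5} = \left\{ (t, x(t)) : \min_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) \le x(t) \le \max_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) \right\}.$$
The upper and lower whiskers are defined as:
$$W_{U}(t) = \max_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) + F \left[ \max_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) - \min_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) \right]$$
and
$$W_{L}(t) = \min_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) - F \left[ \max_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) - \min_{r = 1, \dots, \lceil n/2 \rceil} x_{[r]}(t) \right],$$
where $n$ is the number of functional observations and $F$ is a custom parameter, usually set to 1.5, a standard value in the statistical literature that is commonly used as the default in statistical packages and trades off the type I and type II error rates of the outlier detection procedure. This value also yields a good trade-off between data overfitting and the discrimination performance of the method, as found in an empirical evaluation which we omit for the sake of conciseness. All the curves that fall outside the region delimited by $W_{L}(t)$ and $W_{U}(t)$ for at least one time point are classified as outliers.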
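The outlier detection step can be sketched in plain Python as below, for curves sampled on a common time grid. This is a minimal illustration and not the code used in the study: the function names are hypothetical, the depth is the Modified Band Depth computed over pairs of curves, and the inflation factor (called F in the text) is exposed as the `factor` argument, whose default of 1.5 is the classical boxplot constant.

```python
from itertools import combinations
from math import ceil

def modified_band_depth(curves):
    """Modified Band Depth with bands formed by pairs of curves: for each
    curve, the average proportion of time points at which it lies inside
    the band delimited by a pair, averaged over all pairs."""
    n, m = len(curves), len(curves[0])
    counts = [0] * n
    for a, b in combinations(range(n), 2):
        for j in range(m):
            lo, hi = sorted((curves[a][j], curves[b][j]))
            for i in range(n):
                if lo <= curves[i][j] <= hi:
                    counts[i] += 1
    n_pairs = n * (n - 1) // 2
    return [c / (n_pairs * m) for c in counts]

def functional_boxplot_outliers(curves, factor=1.5):
    """Return (outlier_indices, median_index). A curve is flagged when it
    leaves, at one or more time points, the envelope of the deepest half of
    the curves inflated by factor times the envelope range."""
    n, m = len(curves), len(curves[0])
    depth = modified_band_depth(curves)
    order = sorted(range(n), key=lambda i: -depth[i])  # deepest first
    central = order[:ceil(n / 2)]
    outliers = set()
    for j in range(m):
        lo = min(curves[i][j] for i in central)
        hi = max(curves[i][j] for i in central)
        lower = lo - factor * (hi - lo)
        upper = hi + factor * (hi - lo)
        for i in range(n):
            if not lower <= curves[i][j] <= upper:
                outliers.add(i)
    return sorted(outliers), order[0]

# Toy example (hypothetical data): re-centred linear curves, one diverging.
slopes = [-0.2, -0.1, 0.0, 0.1, 0.2, 3.0]
curves = [[s * t for t in range(5)] for s in slopes]
outliers, median_idx = functional_boxplot_outliers(curves)
```

The pairwise loop is quadratic in the number of curves, which is acceptable for a few dozen targets; rank-based formulas for the depth would scale better on larger sets.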
The essential gain provided by the FDA framework and the functional boxplot is that the resulting methodology accounts not only for the geographical dimension of the problem (through the geographical restriction), but also for the time dimension, since it uses part of the previous deformation history of each point instead of a single scalar value.
The algorithm was tested on the Santa Barbara (Caltanissetta) case study. In this specific case, given the characteristic dimension of the problem in geographical terms, we select a distance threshold of $r = 750$ m. This choice restricts the analysis to 39 points. A representation of them overlaid on a terrain map can be found in Figure 3, while their time dynamics can be seen in Figure 4. Within this area, annual LOS displacement velocities are between −2.8 and +6.4 mm/year, and the cumulative LOS displacements reach
and
mm during the 2002–2008 period.
With respect to the temporal dimension, the $k$ parameter is set to 365 days, since empirical observations indicate that one year is the “natural” temporal scale of paroxysmal phenomena such as the ones in Caltanissetta. The parameter for the smoothing procedure is set as described in the previous section, and equals
. The choice of the distance threshold $r$ and of the memory parameter $k$ is here based on empirical observation of the data, but it is also coherent with the dimension/scale of the mud volcano and the affected/impacted area (see geological information on the event in Cigna et al. [10], and the maps in Brighenti et al. [29]) and, from the temporal point of view, with the potential time scale of any precursors.
4. Results
Some of the sets of time series generated via the data manipulation procedure described above can be observed in the different panels of Figure 5, while the whole set of 27 figures is available in Figure S1. More specifically, Figure 5a represents data at a point in the first part of the observation window, Figure 5b at a point in the middle of the observation period and Figure 5c at a point in the last part of the observation period. In the latter two panels, the two colored curves, which are the ones located at the SW (purple) and SE (blue) points, show possibly outlying dynamics.
Having defined the different time series, the next step in the procedure is represented by computing the functional boxplots. In this case, we set
. Of course, this parameter can be tuned according to the outlier detection task at hand: increasing it reduces the sensitivity of the methodology, which then raises fewer alarms, while decreasing it makes the method more sensitive. A representation of the functional boxplots for the same dates explored in Figure 5 is available in Figure 6. It can be seen that the two “suspect” curves of Figure 5b are not outside the functional whiskers shown in Figure 6b, and are thus not identified as outliers, whereas the same curves do fall outside the whiskers in Figure 5c and Figure 6c (i.e., at the end of the observation period).
The geographical representation of the outliers (Figure 7) sheds additional light on the procedure. Indeed, the two outlying curves correspond to points very close to the mud volcano (Figure 7c). Both points are characterized by anomalous values of LOS displacement but with different signs: the point southwest of the mud volcano features positive values (i.e., it is moving towards the sensor), while the point southeast of it features negative values (i.e., it is moving away from the sensor). This specific configuration, with the two points moving in opposite directions along the ENVISAT ascending LOS, suggests an inflation of the mud volcano area. The two points are located on residential structures in the urban agglomeration south of the volcano, and geological and geomorphological surveys have highlighted the development, in those and other sectors near and south of the volcano, of a series of fractures and shear lineaments which confirm the existence of high stress and strain linked to the presence of the volcano already before the August 2008 event [15,30,31,32].
Both sets of 27 figures, i.e., the functional boxplots and the maps, are available as Figures S2 and S3. The complete series of functional boxplots is also available on a single page as Figure S5.
5. Discussion
By observing the time dynamics and persistence of the signals of deviation given by the procedure, we focus our attention on the period from 16 September 2006 to the day of the last observation, 7 June 2008. The maps and functional boxplots relative to that period can be found in Figure 8. More specifically, we observe two outlying points that switch on relatively early in time and, in the case of the SE outlier, stay on for almost a year and a half before the paroxysmal event. It is immediately evident that the “time persistence” of signals is a fundamental aspect to be taken into consideration when analyzing deformation series using the proposed methodology.
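One simple way to quantify the time persistence of a signal is to count, for each monitored point, the longest run of consecutive analysis dates at which it was flagged. The sketch below is purely illustrative: the function names, the point labels and the alarm sequences are hypothetical and do not reproduce the results of the study.

```python
def longest_alarm_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0
        best = max(best, run)
    return best

def persistent_points(alarms, min_run):
    """Labels of the points flagged on at least min_run consecutive dates."""
    return sorted(label for label, flags in alarms.items()
                  if longest_alarm_run(flags) >= min_run)

# Toy alarm history (hypothetical): one entry per analysis date.
alarms = {
    "SE": [False, True, True, True, True],
    "SW": [False, False, True, False, True],
    "N": [False, False, False, False, False],
}
persistent = persistent_points(alarms, min_run=3)
```

A threshold on the run length filters out isolated alarms, at the price of delaying the warning by the corresponding number of acquisitions.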
To assess the validity of our proposal, and specifically of the use of the FDA approach to solve this kind of problem, we test our methodology against a similar one developed without the use of functional tools. More specifically, we restrict ourselves to the same geographical area (so the distance threshold $r$ is still equal to 750 m) and we use the same memory parameter (so $k = 365$ days). We also perform the same re-centering on the first day of the time window. The essential differences between the “functional” and “scalar” proposals are thus that, in the scalar one, we do not perform any kind of smoothing on the data and, instead of using the functional boxplot, we use a standard boxplot on the latest available observation.
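The scalar benchmark can be sketched as a classical boxplot rule applied to the latest re-centred observation of each point. The function name `scalar_boxplot_outliers`, the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) and the toy values are assumptions made for this example; the text does not specify which quartile definition was used.

```python
from statistics import quantiles

def scalar_boxplot_outliers(latest_values, factor=1.5):
    """Indices of the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR],
    i.e., beyond the whiskers of a classical boxplot."""
    q1, _, q3 = quantiles(latest_values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [i for i, v in enumerate(latest_values) if v < lower or v > upper]

# Toy example (hypothetical values, in mm): last sample of six curves.
latest = [-0.8, -0.4, 0.0, 0.4, 0.8, 12.0]
flagged = scalar_boxplot_outliers(latest)
```

Unlike the functional version, this rule sees only one value per point, so a curve that deviated earlier in the window but has returned close to the others at the latest date is not flagged.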
The series of boxplots for the scalar approach is in Figure 9, and a summary of the comparison is available in Table 1, while the complete results alongside the boxplots are available in Figure S4. It is immediately evident, focusing on the two points close to the mud volcano, that the functional boxplot methodology is able to raise a more robust alarm than the scalar proposal, identifying two points instead of one, and in a slightly more persistent way. In fact, our suggestion to a civil protection agency in this case would have been either to increase the number of scans of the area, as something was most probably happening, or to activate on-site monitoring of the mud volcano site.
In other words, our methodology would allow any practitioner using it to immediately detect points that had markedly anomalous dynamics in the period of interest, and would thus provide them with very useful information to start implementing monitoring and/or mitigation strategies.
From a methodological point of view, the present work represents a step further with respect to previous studies in the field by demonstrating the effective use of FDA in early warning signal detection. In the application considered, the current work extends and complements previous studies on the same use case: as in Cigna et al. [10], precursors of the mud volcano eruption have been identified in InSAR data, but here, no prior assumption on the shape of the anomaly was made; moreover, as in Fontana et al. [16], an FDA approach has been adopted, but here, the specific goal of early warning has been successfully addressed.
The method relies on the proper selection of three parameters: the distance threshold $r$, the memory $k$ and the whisker factor $F$. While the flexibility given by these parameters allows the method to adapt to a wide range of use cases with different geological conditions, the sensitivity of the results to this choice represents a limitation of the proposed approach. The three parameters can be fixed on the basis of prior knowledge of the phenomenon under study, as performed in the current study, since they have a physical meaning and a clear interpretation.
6. Conclusions
In this paper, we propose a novel approach for post-processing InSAR data in order to provide early warning signals for geological hazards. This topic is of paramount importance for the practical use of these data for geohazard monitoring to provide practitioners with useful information for implementing mitigation strategies.
The proposed methodology appropriately takes into account the spatial and temporal dimensions of the problem. Indeed, to exploit the smoothness of the phenomenon, we adopt an FDA approach and we use functional boxplots to identify outlying curves. The proposed methodology does not rely on preselected patterns, unlike previous studies based on predefined deformation trends, and uses the data themselves to set the reference to detect anomalies. Therefore, the presented approach allows more flexibility and can potentially be applied to a wide variety of geological events. In the considered application, the proposed functional approach presents advantages over a scalar one, as the warnings for the paroxysmal event are provided in advance. Moreover, the analysis of the time dimension allowed by the FDA framework used in the methodology provides additional insights to practitioners. In particular, the proposed methodology allows the user to identify precursor signs of the analyzed test case almost a year and a half in advance, and with signals that are persistent. In terms of technical recommendations to a civil protection practitioner, i.e., the target user of our methodology, we believe that fundamental attention has to be paid to the geographical proximity and temporal persistence of raised warnings.
The proposed method could be tested and validated on other cases of mud volcano eruptions featuring different characteristics (such as the Maccalube of Aragona, where many more events and much more activity have been recorded in recent years [33] compared to Santa Barbara), or applied to other paroxysmal events, to assess its capabilities. Another future research direction would be the extension of the current study to find heuristic criteria able to guide the choice of the three parameters characterizing the proposed methodology using information on the geology of the area, or according to specific problem classes (paroxysmal vs. persistent events), or for specific application tasks (volcanic eruptions, landslides, subsidence, monitoring of buildings, etc.). Another valid proposal would be to implement data-driven techniques for this selection, exploiting known techniques in uncertainty quantification such as Conformal Prediction, a novel non-parametric forecasting method based on minimal assumptions.