Article

A SqueeSAR Spatially Adaptive Filtering Algorithm Based on Hadoop Distributed Cluster Environment

1 Faculty of Land Resources Engineering, Kunming University of Science and Technology, Kunming 650093, China
2 Department of Natural Resources of Yunnan Province, Kunming 650224, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(3), 1869; https://doi.org/10.3390/app13031869
Submission received: 17 November 2022 / Revised: 26 January 2023 / Accepted: 29 January 2023 / Published: 31 January 2023
(This article belongs to the Special Issue Big Data Management and Analysis with Distributed or Cloud Computing)

Abstract:
Multi-temporal interferometric synthetic aperture radar (MT-InSAR) techniques analyze a study area using a time series of SAR images, achieving millimeter-level surface subsidence accuracy. To effectively acquire subsidence information in low-coherence areas without obvious features in non-urban regions, an MT-InSAR technique called SqueeSAR was proposed to improve the density of subsidence points in the study area by incorporating distributed scatterers (DS). However, SqueeSAR filters the DS points individually during spatially adaptive filtering, which consumes significant computer memory, leads to low processing efficiency, and poses great challenges for large-area InSAR processing. We propose a spatially adaptive filtering parallelization strategy based on the Spark distributed computing engine in a Hadoop distributed cluster environment, which splits the DS pixel data across different computing nodes for parallel processing and effectively improves the filtering algorithm's performance. To evaluate the effectiveness and accuracy of the proposed method, we conducted a performance evaluation and accuracy verification in and around the main city of Kunming with original Sentinel-1A SLC data provided by ESA. Parallel calculation in a YARN cluster comprising three computing nodes improved the performance of the filtering algorithm by a factor of 2.15 without affecting the filtering accuracy.

1. Introduction

Since the permanent scatterer (PS) interferometry technique [1] was first proposed in 2000, various multi-temporal interferometric synthetic aperture radar (MT-InSAR) techniques have been developed [2,3,4,5,6,7,8,9,10,11,12]. The classical PS-InSAR technique obtains the deformation information of the study area by selecting PSs that are unaffected by spatiotemporal decorrelation, using the amplitude dispersion method to measure the spatio-temporal stability of the point target intensity, and constructing a triangular network over these points. However, in non-urban areas, such as vegetation, bare land, and low-intensity impermeable surfaces, it is often difficult to obtain accurate deformation information owing to the limitations of the algorithm.
In response to the above problem, Ferretti et al. proposed the SqueeSAR technique [13], which considers the different statistical forms of PSs and distributed scatterers (DSs) and introduces a new way to jointly process PSs and DSs. DSs are susceptible to time and space decorrelation factors, which makes it difficult for interferograms to meet the requirements of time-series analysis. Therefore, before deformation settlement, the DS point targets must be screened and optimized to improve the observed signal-to-noise ratio and reduce the error propagation probability [14,15,16].
In contrast to conventional MT-InSAR, SqueeSAR uses the Kolmogorov-Smirnov (KS) test [17] to find the DS point targets with the same statistical characteristics within a given range, and then acquires the surface deformation through the spatially adaptive filtering of these DS point targets. Finally, it reconstructs the phase using the phase triangulation algorithm (PTA) and follows the PS-InSAR data processing chain. Therefore, the core steps of SqueeSAR comprise the homogeneous-pixel test based on the KS test, spatial adaptive filtering, and the phase triangulation algorithm. The SqueeSAR algorithm is limited in terms of spatial adaptive filtering because N(N−1)/2 pairs of interferograms must be generated for phase reconstruction to obtain better surface deformation accuracy. On one hand, considering that a single interferogram can reach hundreds or even thousands of MB (depending on the size of the study area) and that spatial adaptive filtering requires the phase information of all N(N−1)/2 interferograms to be read simultaneously, loading the interferogram phase information entails large memory consumption or even memory overflow. On the other hand, during the entire filtering process, each statistically homogeneous pixel (SHP) identified by the KS test must be filtered, which results in a significant processing time. Spatial adaptive filtering plays an indispensable role in the SqueeSAR algorithm: it averages the phase information of the SHPs identified spatially by the KS test over the N(N−1)/2 interferograms, reducing the speckle noise in homogeneous areas (farmland, forest, etc.) without affecting the phase values of point targets, so that the phase triangulation algorithm can invert the deformation phase with good accuracy.
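For the 32-image Sentinel-1A stack used in the experiments below, N(N−1)/2 works out to 496 interferogram pairs; a one-line check (an illustrative sketch, not part of the original processing chain) makes the combinatorial growth concrete:

```python
from math import comb

def interferogram_pairs(n_images: int) -> int:
    """Number of unique interferogram pairs for an N-image SAR stack: N*(N-1)/2."""
    return comb(n_images, 2)

print(interferogram_pairs(32))  # 496
```

Doubling the stack length roughly quadruples the number of interferograms to be filtered, which is what makes memory consumption the binding constraint.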
Considering current problems, such as the dramatic increase in computational load and complexity of data processing caused by the massive amount of data brought about by the launch of satellites carrying SAR sensors, there is an urgent need for a SAR data processing solution that alleviates low processing performance and hardware limitations.
To relieve the pressure of data storage and reduce the data processing time, so that the deformation rate results of the study area can be obtained quickly, a clustered environment with expandable disk storage combined with a parallelization strategy for SAR data processing is the solution chosen by most researchers. In one study [18], the performance of the small baseline subset InSAR (SBAS) technique was evaluated on a high-performance cluster (HPC) of 16 nodes by analyzing how the different data processing modules of SBAS behave under different parallelization strategies, resulting in an overall data processing performance improvement of nearly 16 times for SBAS. Building on the P-SBAS algorithm proposed in [18], a highly scalable cloud computing scheme has also been proposed; its performance and accuracy were validated by parallel processing on 280 computing nodes of the Amazon Web Services Cloud, compared against the conventional serial processing time as well as external GPS measurement data [19,20,21,22,23]. When the data volume is too large, it is also difficult for the classical PS-InSAR algorithm to obtain the deformation rate of the study area in a short time. Other studies [24,25] proposed a P-CSInSAR parallelization algorithm and performed national deformation mapping: the study area data are downloaded automatically with a bash script, and the workflow is divided into three large modules (image co-registration, interferometry, and time series analysis), each parallelized with a different strategy, which can automatically and efficiently process the Sentinel-1A raw data and greatly improve the data processing performance.
Because it generally takes too long to meet real-time requirements on a central processing unit (CPU)-based workstation or server, parallel SAR imaging on a graphics processing unit (GPU) can fully utilize the GPU's computational resources with the help of the compute unified device architecture (CUDA) [26]. Goldstein filtering is a widely used interferogram filtering algorithm; one study [27] proposed a GPU-based parallel filtering method, explored the relationship between filtering time, data size, and filtering window in the parallel setting, and summarized the advantages and disadvantages of GPU-based parallel computing for filtering. The proposed parallelized processing scheme achieved a 22-fold improvement in processing time while preserving accuracy. In addition, when the data volume is large or includes many data types, distributed storage and processing in a clustered environment can effectively improve the data storage and processing efficiency [28]. Another study [29] proposed a composite minimum-discontinuity phase unwrapping algorithm in a clustered environment, which first divides the original wrapped phase into regular blocks on the main thread and distributes the blocks to idle computing nodes for unwrapping. Finally, the unwrapping results are sent back to the main thread to obtain the final unwrapped phase, which yields a performance improvement.
Traditional high-performance computers are mainly designed for “simple data and complex algorithm” computing and generally use vertical scaling to obtain more computing power [30]. Distributed technology using parallel computing and the dynamic horizontal scaling of computing nodes can solve the problems of big data storage and processing performance to a significant extent. The cloud platform is constructed by cluster building using the popular big data framework Hadoop [31], and the parallelization of individual processing modules of the SAR data based on Spark [32] or MapReduce [33] can reasonably utilize the computer resources and significantly improve the data processing performance [34,35,36,37,38].
Most of the current research on parallelization uses multi-file and multi-core parallel computing to parallelize data processing. However, spatially adaptive filtering requires the filtering of N(N−1)/2 interferograms, which consumes a large amount of memory during the entire data processing process, and the use of multi-core parallelization leads to memory overflow and other problems. In this study, we: (1) focus on the limitations of spatial adaptive filtering in the implementation of the SqueeSAR algorithm, which requires a large amount of memory and significant time for pixel-by-pixel spatial averaging; and (2) build a YARN cluster based on the Hadoop framework, store the data in the dedicated storage nodes of the YARN [39] cluster, use the Spark distributed computing engine to split the data into multiple pieces and distribute them to each computing node for processing through memory interaction, and merge the results to the master node through the aggregation operator to complete the interferogram phase filtering process. The spatially adaptive filtering algorithm was tested on a YARN cluster consisting of three computing nodes with 24 cores and was compared with the conventional (serial) filtering algorithm. Spark distributed computing can achieve up to a 2.15 times improvement compared with the conventional method.

2. Spatial Adaptive Filtering Algorithm in Hadoop Environment

2.1. Spatial Adaptive Filtering Algorithm

Theoretically, the InSAR interferometric phase ψ consists of several components, including the reference ellipsoidal phase φ_ref, the topographic phase φ_top, the deformation phase φ_def, the atmospheric phase φ_atm, and the noise phase φ_noi, as follows:
ψ = φ_ref + φ_top + φ_def + φ_atm + φ_noi
The essence of obtaining the surface deformation of the entire study area is the process of obtaining the deformation phase φ_def by removing or suppressing the remaining phase components.
To reduce speckle noise in uniform areas (e.g., farmland and forests) with high SHP coverage without affecting point targets (e.g., buildings, boulders, etc.), the optimal phase can be estimated by simple spatial averaging of the phase values of the SHPs identified by the KS test over the N(N−1)/2 interferogram pairs; once the number of SHPs reaches a certain threshold (25), the inversion achieves high deformation-rate accuracy [13]. The signal-to-noise ratio of the interferometric fringes can be significantly improved, at the cost of a reduced resolution, using this simple spatial adaptive filtering. In Figure 1, we provide a comparison before and after filtering: Figure 1a,b show the interferogram without and with filtering, respectively. It can be clearly seen that the noise in non-urban areas is significantly reduced. Moreover, as shown in Figure 2 and explained for the DespecKS algorithm proposed in a previous study [13], only the SHPs are averaged by spatial adaptive filtering, preserving the information associated with point-like radar targets that are identified by the KS test as individual pixels (PS) and are therefore not affected by the filtering. By replacing the phase values of the coherence matrix with the spatially filtered interferometric phase, the phase optimization enables a better inversion of the surface deformation. The equation for spatial adaptive filtering is as follows:
NL_i = (1/n) Σ_{j∈Ω} u_j
where i denotes the center pixel of the search window, n denotes the number of SHPs in the search window based on the KS test, Ω denotes the set of SHPs in the window, and u_j denotes the phase value of SHP j. To better understand the spatial adaptive filtering algorithm, we provide the pseudo-code for spatial adaptive filtering in Algorithm 1. The specific steps of spatial adaptive filtering are as follows:
(1)
Read the phase information of the original N(N−1)/2 interferograms.
(2)
Count the DS candidate points (SHPs) in the search window centered on a given pixel; the pixel conforms if the count is larger than the filtering threshold (25).
(3)
For each pixel that meets the requirement, all the phase values of the SHPs in the window centered on it are summed and divided by the number of SHPs, and the phase value of the center pixel is replaced by this average value.
(4)
Repeat Steps 2–3 until all pixels that meet the threshold requirements are replaced by the average phase value.
Algorithm 1. Serial Spatial Adaptive Filtering Algorithm
Input: Three-dimensional complex array of interferogram phase values p[s,i,j]; the row and column data of the SHPs identified by the KS test and the identification information of the pixels in the search window (containing 1 and 0; 1 for SHP and 0 for others); N, M, and W.
Output: Interferogram phase values after filtering as a three-dimensional complex array pnoise
  1 for each n ∊ [0,N] do
  2  for each m ∊ [0,M] do
  3   W[n][m] ⃪ Obtain SHP identification information in the search window
  4   sum[n][m] ⃪ Calculate the sum of the entries of W[n][m]
  5   if sum[n][m] > (Filter Threshold) then
  6    phase[n][m] ⃪ Obtain the interferogram phase values of the corresponding pixel phase[:,n,m]
  7    filter = (phase·W[n][m])/sum[n][m]
  8    pnoise[:,n,m] = filter
  9  end for
  10 end for
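Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration of the pixel-by-pixel averaging, not the authors' implementation; the array layout and names (`phase`, `shp_mask`) are assumptions based on the description above, and border handling is simplified:

```python
import numpy as np

def serial_adaptive_filter(phase, shp_mask, threshold=25):
    """Sketch of Algorithm 1 (serial spatial adaptive filtering).

    phase    : complex array (S, N, M), S = N(N-1)/2 interferograms
    shp_mask : binary array (N, M, wh, ww); per-pixel SHP flags in the
               search window (1 = SHP, 0 = otherwise)
    """
    S, N, M = phase.shape
    wh, ww = shp_mask.shape[2:]
    hh, hw = wh // 2, ww // 2
    pnoise = phase.copy()
    for n in range(N):
        for m in range(M):
            # clip the search window at the image borders
            r0, r1 = max(n - hh, 0), min(n + hh + 1, N)
            c0, c1 = max(m - hw, 0), min(m + hw + 1, M)
            # matching slice of this pixel's SHP flags
            w = shp_mask[n, m,
                         r0 - (n - hh):wh - (n + hh + 1 - r1),
                         c0 - (m - hw):ww - (m + hw + 1 - c1)]
            cnt = w.sum()
            if cnt > threshold:
                win = phase[:, r0:r1, c0:c1]
                # replace the center pixel by the SHP-averaged phase
                pnoise[:, n, m] = (win * w).sum(axis=(1, 2)) / cnt
    return pnoise
```

The double Python loop over all N × M pixels is exactly the serial bottleneck the paper targets: the work per pixel is trivial, but the loop must touch every pixel of every PATCH while the full (S, N, M) phase stack stays resident in memory.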
The spatial adaptive filtering algorithm requires the spatial averaging of each pixel of N(N−1)/2 interferograms, which is not computationally demanding in itself; however, because it processes pixels sequentially, when the number of pixels is large the serial approach not only underutilizes the computing resources, but also requires a considerable amount of time for data processing. We propose a parallelized processing scheme that splits the pixel datasets across a YARN cluster built on Hadoop. We use Spark distributed technology to load the split data into memory and distribute them to each node for filtering, which fully utilizes idle CPU resources while relieving the pressure on memory resources, ensuring no loss of accuracy and improving the data processing performance.

2.2. Hadoop Distributed Framework

With the development of Big Data, it has become difficult for traditional high-performance computers to meet the hardware requirements required for the current data processing, and for most organizations and people working in the Big Data field, high-performance computers are too costly and cannot be iteratively updated and maintained with the growing demand [40,41,42].
Hadoop is an open-source distributed computing platform under the Apache Software Foundation. The emergence of Hadoop makes high-performance computing more attainable: through the Hadoop framework, users can assemble a distributed computing and storage platform with high performance from inexpensive hardware. Hadoop users can use a clustered environment to solve the parallel computing, storage, and management problems of Big Data without understanding the underlying details of the distribution. Hadoop has evolved into a big data processing framework, with MapReduce, HDFS [43], and YARN as its main components, which are widely adopted and deployed in several fields of big data.
HDFS is a cluster-based distributed file system that shares many similarities with existing distributed file systems but also differs from them in important ways. The advantages of HDFS are its high fault tolerance, high reliability, suitability for batch processing, and intended deployment on low-cost hardware. In Figure 3, we provide a principle diagram of storing and reading data in HDFS. Users can read, upload, and save data at various stages of data processing and analysis to HDFS for storage, which can greatly reduce disk Input/Output (I/O) and achieve a performance improvement owing to the replica mechanism of HDFS.
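Staging intermediate products in HDFS uses the standard `hdfs dfs` file system shell; the paths and file names below are hypothetical examples for illustration, not those used in the paper:

```shell
# create a directory and upload SHP data extracted by the KS test
hdfs dfs -mkdir -p /squeesar/shp
hdfs dfs -put shp_patch_01.dat /squeesar/shp/

# inspect the stored files; replicas are managed transparently by HDFS
hdfs dfs -ls /squeesar/shp

# after filtering, retrieve the results from the cluster
hdfs dfs -get /squeesar/output/pnoise_patch_01.dat ./results/
```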
To optimize the scheduling use of CPU, RAM, and I/O resources in a cluster, several conceptual aspects must be considered, such as data dependencies in the overall data processing flow, load imbalance, proper task partitioning, and the relevant scheduling strategies to be employed. YARN is a resource management and scheduling system that provides unified resource management and scheduling for upper-layer applications and can provide reasonable resource monitoring and scheduling management for clusters.
MapReduce is a programming model for the parallel computing of large-scale datasets (1 TB and above), which performs split computations and merge operations on large-scale datasets through its “Map” and “Reduce” primitives. It allows programs to run on distributed systems even when the programmer knows little about distributed parallel programming.
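The split-compute-merge idea behind “Map” and “Reduce” can be mimicked in plain Python (a conceptual word-count sketch only; a real MapReduce job runs these phases distributed across the cluster):

```python
from functools import reduce
from itertools import chain

# Map: split the dataset into records and emit (key, value) pairs
records = ["a b a", "b c", "a"]
mapped = chain.from_iterable(((w, 1) for w in r.split()) for r in records)

# Shuffle: group the emitted values by key
groups = {}
for key, val in mapped:
    groups.setdefault(key, []).append(val)

# Reduce: merge each group to a single result (word counts here)
counts = {k: reduce(lambda x, y: x + y, v) for k, v in groups.items()}
print(counts)  # {'a': 3, 'b': 2, 'c': 1}
```

In the real framework, the intermediate results between the Map and Reduce phases are written to the distributed file system, which is the disk I/O cost discussed in the next subsection.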

2.3. Spark-Space Adaptive Filtering Algorithm

In the field of data computing, frequent I/O operations cause significant losses in data-processing performance [44]. Owing to the limitations of the MapReduce framework, a MapReduce job can only contain one Map and one Reduce call; after the computation is completed, MapReduce writes the results back to disk (the distributed file system) for the next computation, which involves a large amount of I/O and requires considerable time [45]. Many studies have shown that MapReduce-based algorithms suffer a significant performance loss when frequent I/O operations and network communications are performed [46,47,48]. Spark solves these problems in several ways. It is mainly based on in-memory computing, which enhances multi-iteration batch processing and ensures high fault tolerance and scalability, while improving the real-time processing of data in big data environments. As Spark supports in-memory computation and acyclic data flow through a directed acyclic graph (DAG) execution engine, the number of shuffles can be reduced, in most cases, compared with MapReduce. In particular, when a computation does not involve data communication between nodes, Spark can complete the entire algorithm in memory at once, greatly reducing disk I/O operations. As InSAR involves complex data structures and algorithms, one Map and one Reduce call may not be able to complete the entire filtering algorithm, particularly considering that the filtering algorithm needs to load N(N−1)/2 sets of interferograms; thus, MapReduce is not well suited to this task.
A Resilient Distributed Dataset (RDD) is an abstraction of distributed memory that provides a highly constrained shared-memory model in which an RDD is a read-only collection of record partitions. For developers, an RDD can be regarded as a Spark object that itself resides in memory. Under this concept, reading files, computing on data, the resulting file sets, different partitions, dependencies between data, etc., can all be viewed as RDDs. Spark provides rich RDD operators that make it possible to perform the complex algorithmic tasks of interferogram filtering in a single Spark program.
As Spark does not have a distributed storage system or resource scheduling manager, it needs to be stored and managed by other distributed frameworks, and the parallelization solution we use is on a YARN cluster built under the Hadoop framework. In fact, most Spark-based companies and organizations are currently using the Hadoop framework to develop on the YARN cluster. The workflow of Spark on a YARN cluster is shown in Figure 4; we summarize its main principles as follows:
(1)
The user submits the application in the YARN cluster client and launches the driver process;
(2)
The driver process sends a request to the ResourceManager (RM) to launch the ApplicationMaster (AM);
(3)
The RM receives the request and randomly selects a NodeManager (NM) to launch the AM;
(4)
After the AM is launched, it requests container resources from the RM and launches multiple executor processes via the NMs;
(5)
Finally, the executors register themselves back with the driver, which sends tasks to them for data processing. It is worth noting that in an actual production environment, YARN automatically schedules and manages these resources, and the user only needs to set the relevant hardware parameters (memory, cores, etc.) without considering the complex resource scheduling issues.
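Submitting such an application to the YARN cluster then reduces to specifying those hardware parameters on the command line; a hypothetical invocation is shown below (the script name and resource values are illustrative assumptions, not the authors' settings):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --executor-cores 8 \
  --executor-memory 8g \
  --driver-memory 4g \
  adaptive_filter.py
```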
The spatial adaptive filtering algorithm must extract the phase information for each pixel of the generated N(N−1)/2 interferograms according to the predefined search window size, and each pixel must read the corresponding phase information in the interferogram. The frequent use of memory to transfer large data to computational nodes can significantly degrade the performance and even lead to memory overflow when the data volume is large. In this case, Spark provides a broadcast RDD to effectively solve this problem by providing a large copy of the input data for each node through serialization. Figure 5 shows how the data are transferred after the broadcast RDD is used. When the data are broadcasted, they are distributed to the executors of each node, and each executor has a complete copy of the broadcasted data. This reasonably solves the problem of the memory overflow and performance loss caused by oversized shared datasets. In addition, in a distributed computing environment, if the image data are chunked and distributed to each node for parallel processing, the inherent connectivity of the image data itself and the characteristics of the graph computation exhibiting strong coupling need to be taken into account [49]. In fact, in the spatial adaptive filtering algorithm, the interferogram phase information data that need to take into account the above characteristics can be broadcast by the broadcast RDD so that a complete copy of the data is available at each node, thus eliminating the need to consider the loss in accuracy due to these characteristics. The proposed Spark parallelization algorithm mainly splits and computes the row file corresponding to each pixel point. As there is no direct correlation between each pixel point, the concept of RDD can be used to slice the data and distribute the data to each executor by using the memory interaction to start a task for independent computation. 
Finally, the data are merged by the RDD to complete the filtering process.
In Algorithm 2, we present the pseudocode for parallelized processing using Spark. The RDD operations primarily used include textFile, broadcast, map, and collect. Please refer to the definitions and explanations given in Algorithm 2 for the specific calculations and parameters. The principle of Spark-based spatially adaptive filtering is shown in Figure 6. In this process, we mainly used the various operators provided by Spark for data distribution, computation, and merging. First, we read the phase information of the N(N−1)/2 interferograms and the remaining regular parameter information through the master node. As the interferogram phase information requires the entire file to be transferred to each executor and thus occupies a large amount of memory, we serialize the data to the executors using the broadcast operator to reduce data communication and relieve excessive memory pressure. This ensures that the executors can read the phase data as read-only variables without incurring a large performance cost. Subsequently, we created an RDD by reading the SHP data files pre-stored in the HDFS through the textFile operator; these files mainly include the rows and columns of each DS candidate point that meets the threshold requirement (25) in the interferogram and the SHP data identified in the search window. We then parallelized the process by loading these data into memory and splitting and distributing them into multiple generated tasks. It should be noted that the remaining regular parameters are simply distributed through memory interactions, owing to their small data volume, which has a minimal impact on performance.
In addition, Spark provides a map operator that performs spatially adaptive filtering of the data loaded into memory in different tasks: each task filters its interferogram data using the spatially adaptive filtering function imported by the map operator and returns a new distributed dataset. Finally, the resulting data, after spatial adaptive filtering, are collected by the master node through the collect operator and written to the local node to complete the interferogram phase filtering.
Algorithm 2. Parallel Spark-based Spatial Adaptive Filtering Algorithm
Input: Three-dimensional complex array of interferogram phase values p[s,i,j]. The SHP-related data stored in the HDFS are split into t partitions of M/t rows each, where t denotes the number of initiated tasks.
Output: Filtered phase values filter
  1
  2 textFile<path,Task> ⃪ Distribute data to the started tasks
  3
  4 phase_bc = broadcast(p[s,i,j]) ⃪ Broadcast phase values to each Executor
  5
  6 map(lambda x: func(x))
  7 for each n ∊ [0,M/t] do
  8  phase_bc_va ⃪ Obtain the interferogram phase broadcast to the Executor
  9  phase[n][m] ⃪ Obtain the interferogram phase value of the corresponding pixel phase_bc_va[:,n,m]
10  filter = (phase·W[n][m])/sum[n][m]
11  output filter
12 end for
13
14 collect() ⃪ Collect data from each task at the master node
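The broadcast/map/collect pattern of Algorithm 2 can be emulated with Python's standard library alone. In this stand-in sketch, threads stand in for Spark executors, and the names, array sizes, and simplified mean filter are illustrative assumptions: each task filters its share of rows against a shared read-only phase stack (analogous to a broadcast variable), and the partial results are merged at the "master":

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# "Broadcast": every worker reads the same in-memory copy of the phase stack
PHASE = np.ones((6, 8, 8), dtype=complex)   # e.g., N(N-1)/2 = 6 interferograms
SHP_ROWS = list(range(PHASE.shape[1]))      # rows to filter (from HDFS in the paper)

def filter_rows(rows):
    """'map' task: filter each assigned row against the broadcast phase stack."""
    out = {}
    for n in rows:
        # simplified stand-in for the SHP-weighted window average
        out[n] = PHASE[:, n, :].mean(axis=1)
    return out

# split the rows into t partitions, one per task
t = 3
parts = [SHP_ROWS[i::t] for i in range(t)]

with ThreadPoolExecutor(max_workers=t) as pool:
    results = pool.map(filter_rows, parts)

# "collect": merge the partial results at the master
pnoise = {}
for part in results:
    pnoise.update(part)
print(len(pnoise))  # 8 filtered rows
```

In Spark proper, the partitions live on different cluster nodes and the broadcast data are serialized once per executor, but the dataflow (split, independent map tasks over disjoint rows, gather at the driver) is the same.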

3. Experimental Results and Analysis

3.1. Overview of the Experimental Area

Kunming is located in the Yunnan-Guizhou plateau region in southwestern China. The region suffers from inherited subsidence tilt, and in recent years a large number of projects have been constructed around Dianchi Lake in Kunming; these projects are large in scale and cover a wide area, resulting in the compression of the soft soil substratum. In addition, a large amount of groundwater has been extracted because of industrial development, leading to subsidence around Dianchi and the city center [50,51]. Given the decorrelation and atmospheric delay effects caused by the high altitude, complex topography, and extensive vegetation and water coverage in Kunming and its surrounding areas, distributed scatterer InSAR can obtain more surface deformation information than the PS-InSAR technique [52]. The main experimental area of this study is located in the main urban area of Kunming, Yunnan Province, and the surrounding areas, mainly involving the Wuhua, Xishan, Guandu, Chenggong, and Jinning districts. The specific extent of the experimental area is shown in Figure 7. The test dataset is 32-view descending-track Sentinel-1A data provided by ESA covering the Kunming area, collected between May 2020 and May 2021; the entire dataset comprises 139 GB of raw files.

3.2. Data Processing Flow

In the field of InSAR parallelization, secondary development based on open-source software is a common approach [53]. The most widely disseminated open-source packages are Doris, a data preprocessing software developed at Technische Universiteit Delft, and StaMPS, a time-series analysis software developed at Stanford University [54]. Doris is an open-source InSAR processing software written in C++ with a modular structure, where each module implements the different algorithms of the processing steps, and users can choose the most appropriate interferometry module for data processing according to the actual situation. It is worth noting that each pre-processing module of Doris, in the process of generating interferograms, requires different parameters to be set to generate the datasets; secondary development and processing can then be based on these datasets, which is a characteristic of Doris. StaMPS focuses on the time-series analysis modules. In fact, StaMPS can perform the pre-processing of the SAR data (co-registration, interferometry, differential interferometry, coordinate system conversion, etc.) by calling Doris, with the identification and filtering of SHPs performed in the later stage. Finally, it reconstructs the phase information of the coherence map using the phase triangulation algorithm and completes the time series analysis by combining the extracted DS and PS points to obtain the subsidence rate of the study area. Notably, because Doris and StaMPS only support PS-InSAR data processing, we implemented the entire DS point extraction and phase optimization steps of SqueeSAR in Python. Therefore, our parallelization algorithm was also developed and implemented in Python.
The entire data processing flow of the SqueeSAR algorithm is shown in Figure 8. When the differential interferometry is completed based on Doris, StaMPS is used to generate the interferogram sequence and partition the data into blocks named PATCHes; the main purpose of this step is to prevent memory overflow when reading an oversized dataset. By setting the search window size and amplitude dispersion index, the KS test algorithm is used for the initial selection of the DS points from the raw intensity images, and the extracted data are stored in the HDFS for the next step, the parallel calculation of the spatial adaptive filtering. Subsequently, the N(N−1)/2 interferograms are generated with Doris according to the baseline information, and the interferogram extent is partitioned according to the chunk size of the PATCHes. The data of each PATCH are read by the driver in the Spark distributed computing engine, and the voluminous interferogram phase data are sent to the executors via the broadcast RDD. The data stored in the HDFS are then distributed to each executor for filtering, and the results from each executor are collected by the master node through the collect operator and written to the storage node disk. Finally, the phase triangulation algorithm is used to invert the phase of the filtered interferograms to obtain more accurate phase information and extract the DS candidate points. Time-series analysis is performed jointly with the PS candidate points to obtain the deformation rate of the study area.

3.3. Comparative Experiments and Analysis of Results

The key to the spatially adaptive filtering process is the ability to determine whether two image pixels of an interferometric data stack are statistically homogeneous. Once a suitable estimation window is defined for each image pixel, careful selection of the SHPs allows the noise in the amplitude data to be removed, the interferometric phase values to be filtered, and the coherence values of the interferogram to be estimated correctly [10]. Both the conventional serial approach and the parallel approach adopted in this study use a 15 × 21 search window and a DS candidate point threshold of 25 as the parameter conditions for DS candidate point extraction.
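The SHP test behind these parameters can be sketched as follows. This is a minimal illustration, not the production implementation: it applies a hand-rolled two-sample KS statistic (with the standard asymptotic critical value) to synthetic Rayleigh-distributed amplitudes, using the 15 × 21 window and the threshold of 25 homogeneous neighbours quoted above; the function names and the significance level of 0.05 are assumptions.

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic: max distance between empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

def shp_count(amp_stack, row, col, half_win=(7, 10), alpha=0.05):
    """Count statistically homogeneous pixels around (row, col) in a
    (2*7+1) x (2*10+1) = 15 x 21 search window."""
    n_img, n_rows, n_cols = amp_stack.shape
    ref = amp_stack[:, row, col]
    # Asymptotic critical value: reject homogeneity when D > c(alpha)*sqrt(2/n)
    d_crit = np.sqrt(-0.5 * np.log(alpha / 2.0)) * np.sqrt(2.0 / n_img)
    count = 0
    for r in range(max(0, row - half_win[0]), min(n_rows, row + half_win[0] + 1)):
        for c in range(max(0, col - half_win[1]), min(n_cols, col + half_win[1] + 1)):
            if (r, c) == (row, col):
                continue
            if ks_statistic(ref, amp_stack[:, r, c]) <= d_crit:
                count += 1
    return count

rng = np.random.default_rng(0)
amp = rng.rayleigh(scale=1.0, size=(30, 40, 40))  # synthetic amplitude stack
is_ds_candidate = shp_count(amp, 20, 20) >= 25    # threshold from the text
```

Because the synthetic field is homogeneous by construction, most of the 314 neighbours pass the test and the centre pixel is retained as a DS candidate.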
To properly evaluate the accuracy and performance of the parallelized spatial adaptive filtering proposed in this paper, the interferogram phase was first filtered serially with the conventional approach, and the processing time and filtered phase information were recorded. The interferogram was then processed with the parallelized spatial adaptive filtering. In the parallel case, the results of the split data are gathered by the master node via a collect operation and written to disk in the same way as in the conventional algorithm; the two approaches therefore differ only in how the data are processed.
Owing to limited hard disk space, we cropped a smaller area of the image strips to verify the proposed parallelized processing scheme. The experimental area measured 2600 × 11,800 pixels, and the data were divided into 72 (8 × 9) PATCHs for processing with StaMPS. Figure 9 presents the annual average subsidence rate map of the study area after phase optimization and the time-series solution based on the SqueeSAR algorithm, which processes the DS and PS points jointly; in total, information on 2,401,517 PS points over the entire study area was collected through the time-series analysis. As this study focuses on parallelization, the subsidence areas themselves are analyzed only briefly.
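The 8 × 9 partitioning above can be sketched as a simple grid split. This is an illustrative reconstruction (StaMPS performs the actual partitioning, and whether 8 refers to rows or columns is an assumption here); the helper computes non-overlapping block bounds that exactly tile the 2600 × 11,800-pixel scene.

```python
def patch_grid(n_rows, n_cols, grid_rows, grid_cols):
    """Yield (r0, r1, c0, c1) PATCH bounds tiling the scene without overlap."""
    for i in range(grid_rows):
        for j in range(grid_cols):
            yield (i * n_rows // grid_rows,
                   (i + 1) * n_rows // grid_rows,
                   j * n_cols // grid_cols,
                   (j + 1) * n_cols // grid_cols)

# 72 PATCHs for the 2600 x 11,800-pixel experimental area
patches = list(patch_grid(2600, 11800, 8, 9))
```

Integer-division bounds keep the blocks within one pixel of equal size, which helps balance the per-PATCH workload across executors.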

3.3.1. Accuracy Evaluation

In the comparison experiments, the first point worth considering is the consistency in accuracy, which determines whether the parallelization algorithm is suitable for practical applications. In general, splitting data tends to affect accuracy to varying degrees, depending on the parallelization strategy used. In our case, however, only the row and column files of the data and the SHP files identifying each pixel are split, and the row/column and SHP correspondences are preserved throughout; parallelization by data splitting therefore causes no loss of filtering accuracy. To observe the change in the interferogram phase before and after filtering more intuitively, we used the cpxfiddle script of Doris to generate the interferograms before and after spatial adaptive filtering with 3/1 multilooking. Figure 10 shows a differential interferogram of the study area before filtering. Figure 11 shows the differential interferogram after spatial adaptive filtering: Figure 11a after filtering with conventional processing and Figure 11b after filtering with the parallel algorithm. It is clear from Figure 11 that there is no significant difference between the interferograms filtered by the two methods. Moreover, after filtering the interferogram phase with the two processing methods, we randomly sampled 100,000 pixels for an accuracy comparison; the phase values calculated by the two methods are exactly the same, confirming our earlier reasoning. Broadcasting via the broadcast RDD thus not only mitigates the performance loss of transmitting large data volumes, but also guarantees that no accuracy is lost.
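The sampling check described above can be sketched as follows. The two phase arrays here are stand-ins (the parallel result is a copy of the serial one, mirroring the bit-identical outcome reported); the array size is reduced for illustration, and only the 100,000-sample comparison logic reflects the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the serially and parallel-filtered interferogram phases;
# identical by construction, as the experiment found them to be.
phase_serial = rng.uniform(-np.pi, np.pi, size=(2600, 1180))
phase_parallel = phase_serial.copy()

# Draw 100,000 random pixel positions and compare the two results exactly.
rows = rng.integers(0, phase_serial.shape[0], size=100_000)
cols = rng.integers(0, phase_serial.shape[1], size=100_000)
identical = np.array_equal(phase_serial[rows, cols], phase_parallel[rows, cols])
```

An exact equality test (rather than a tolerance) is appropriate here because the parallel scheme only redistributes which node computes each pixel; it does not change the arithmetic.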
A comparison of Figure 10 and Figure 11 shows that when the interferogram has undergone spatial adaptive filtering, the speckle noise of the interferogram in non-urban areas is significantly reduced, and the overall quality of the interferogram is significantly improved.

3.3.2. Performance Evaluation

The parallelization experiments used a YARN cluster of three nodes deployed in a Hadoop environment, consisting of one master node and two worker nodes, each equipped with an 8-core CPU (Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70 GHz) and 64 GB of RAM. The master node acts as both a data node and a computation node, so it is allocated 2 TB of disk space for the intermediate and final data generated by the SqueeSAR processing chain; the worker nodes act only as computation nodes and therefore require less disk space. The software configuration of the cluster is listed in Table 1. The proposed parallelization strategy filters the data with single-core and 2-core processing units per executor. In this configuration, owing to YARN's load-balancing technology, the cluster does not suffer from memory overflow.
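A cluster setup like this is typically expressed through standard Spark configuration properties. The sketch below is a hedged illustration: the property keys are real Spark configuration names, but the specific values (executor memory in particular) are assumptions chosen to fit the 64 GB nodes described above, not settings reported by the authors.

```python
# Illustrative Spark-on-YARN settings for the three-node cluster described
# in the text; values are assumptions, not the authors' exact configuration.
spark_conf = {
    "spark.master": "yarn",
    "spark.executor.instances": "8",  # number of executors started
    "spark.executor.cores": "2",      # 2-core processing units per executor
    "spark.executor.memory": "6g",    # assumed: keeps 8 executors within 64 GB
}

# Rendered as spark-submit arguments (the YARN-side FIFO Scheduler choice
# discussed later lives in yarn-site.xml, not in these Spark properties).
submit_args = " ".join(f"--conf {k}={v}" for k, v in spark_conf.items())
```

Keeping executor memory well under the node total leaves headroom for YARN overheads, which is consistent with the authors' observation that the cluster avoided memory overflow.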
To evaluate the performance of the spatial adaptive filtering in a distributed cluster environment, we used the speedup as a metric. It was estimated on a common dataset while varying the number of executors processing the data, to quantify the improvement as computational resources are added. Speedup is defined as follows:
S_p = T_seq / T_par(p)
where T_seq is the execution time of the conventional algorithm, T_par(p) is the execution time with p executors processing the data in parallel, and S_p is the speedup. In addition, the choice of resource-allocation strategy for different data streams strongly affects the load and task division of a cluster. YARN clusters offer three main resource-allocation strategies: the FIFO Scheduler, the Fair Scheduler, and the Capacity Scheduler. Considering that the filtering algorithm only requires spatial averaging of the data within the search window, we use the FIFO Scheduler for resource scheduling, which can effectively utilize idle computing resources.
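The speedup definition can be written directly as a one-line helper; the example figures come from the whole-dataset result reported later in the paper (3.68 h serial vs. 1.84 h parallel).

```python
def speedup(t_seq, t_par):
    """S_p = T_seq / T_par(p): serial time over parallel time with p executors."""
    return t_seq / t_par

# Whole-dataset result reported later in the text: 3.68 h -> 1.84 h
overall = speedup(3.68, 1.84)  # exactly 2.0
```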
In Spark, the dataset must be partitioned appropriately for efficient task scheduling, so we evaluated performance with different numbers of partitions. In a YARN cluster, a reasonable number of partitions keeps the data transfer time and task start-up overhead balanced against the task processing time. When there are too many tasks, start-up time dominates, lengthening the overall processing time; when there are too few, data transfer time dominates and the processing time is likewise extended. To maximize performance, we evaluated the filtering performance of the entire YARN cluster with different task counts. Table 2 compares the processing times for different numbers of tasks when eight executors were started (each assigned two core processing units). As Table 2 shows, the task count has a significant impact on performance: the processing time drops from 99 s with 240 tasks to 86 s with 480 tasks, an overall improvement of 13.1%. When the number of tasks increases from 480 to 640, the processing time increases rather than decreases, confirming our suspicion that task start-up time dominates in this regime. Therefore, in the subsequent performance evaluation of the Spark-based parallel filtering algorithm, we used 480 tasks for data partitioning as the reference.
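The measurements in Table 2 reproduce the figures quoted above directly; the short calculation below uses those published timings to recover the best task count and the 13.1% improvement.

```python
# Run times from Table 2 (8 executors, 2 cores each): task count -> seconds
run_times = {240: 99, 320: 95, 400: 90, 480: 86, 560: 89, 640: 93}

best_tasks = min(run_times, key=run_times.get)  # 480 tasks is fastest
improvement = (run_times[240] - run_times[480]) / run_times[240]  # (99-86)/99
```

The U-shape of these timings (improving to 480 tasks, then worsening) is the balance point between per-task start-up overhead and per-task data volume discussed above.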
To effectively evaluate the performance of our proposed algorithm, we parallelized the entire dataset with executors assigned different numbers of processing units and compared the performance of single-core and 2-core executors. Figure 12, Figure 13, Figure 14 and Figure 15 depict the speedup and processing time for spatially adaptive filtering of a single PATCH file and of the entire dataset, for the 2-core and single-core allocations, respectively, in the YARN cluster environment. The entire dataset comprised 72 PATCH files, and filtering the interferograms with the conventional algorithm took 3.68 h in total. It is clear from Figure 13 and Figure 15 that the parallelized processing time falls below that of the conventional filtering algorithm once the cluster uses two or more cores. However, the two parallelization strategies differed significantly for the same number of allocated cores. With a single core per executor, overall processing performance decreased once the number of executors exceeded 16. As shown in Figure 12, the speedup of the 2-core strategy grew only slowly as executors were added, and for the same total number of cores the single-core strategy achieved a lower speedup. The main reason for this difference is the frequent data communication between nodes: as the number of executors increases, broadcasting the large dataset to the executors gradually dominates, and despite the additional processing units, serializing the broadcast variable consumes more time.
Although the entire dataset achieves a performance boost in the clustered environment, the effect is not dramatic, for several reasons. First, a YARN cluster is essentially a high-performance system composed of multiple computers; its data exchange lags slightly behind that of a traditional high-performance computer, and memory interaction and network bandwidth partially limit computational performance. In addition, frequent data splitting and merging operations degrade data processing performance, particularly when they become too frequent (which is almost inevitable as the number of executors grows). Finally, the modest data volume and the relatively simple computation of the filtering algorithm are perhaps the key factors limiting overall performance growth. Owing to limited hardware (only three nodes) and Spark's restriction that broadcast data must serialize to under 2 GB, we filtered only a small area and could not fully exploit the advantages of Spark's distributed computing for Big Data.

4. Conclusions

In this study, we proposed a parallelization solution for the SqueeSAR spatially adaptive filtering algorithm based on the Spark computing engine in a Hadoop cluster environment. This solution not only reduces the processing time of the filtering algorithm without affecting the accuracy of the conventional algorithm, but also makes reasonable use of idle computing resources and fully exploits the advantages of distributed cluster computing and storage. It can provide a feasible and valuable data processing solution for researchers and institutions engaged in InSAR work. In addition, because the solution is implemented in Spark, a popular distributed computing framework, it is highly portable, well suited to distributed cloud computing platforms, and easier to implement and more broadly applicable than the traditional Message Passing Interface (MPI) architecture. From the perspective of performance optimization, we determined the impact of different parameters on the performance of the filtering algorithm during the evaluation. Using different resource-allocation mechanisms for different datasets can effectively balance the cluster load and lead to reasonable use of distributed cluster resources.
The parallelized processing scheme was combined with the mature Hadoop distributed cluster framework to run the SqueeSAR spatial adaptive filtering algorithm. A total of 496 interferograms (114 GB in total) generated from 32 scenes of Sentinel-1A satellite data covering the main urban area of Kunming were used for performance testing and accuracy comparison. With the study area divided into 72 PATCHs, the spatial adaptive filtering time for the entire area was compressed from 3.68 h to 1.84 h; for larger study areas spanning hundreds of PATCHs, the speedup should be more pronounced. It is worth mentioning that, owing to the limited hardware, the scalability of the cluster was not explored in this study. With respect to resource utilization, when the data volume is very large, dynamically adding and removing nodes allows computation and storage resources to track huge data flows, greatly reducing the idleness of computational resources. In future work, we will focus on the cloud environment, address Spark's inability to serialize broadcast data over 2 GB, and provide stronger support for InSAR applications in the Big Data environment by exploiting the advantages of the cloud.

Author Contributions

Conceptualization, Y.L. (Yongning Li), W.S. and B.J.; software, X.Z. and Y.L. (Yongfa Li); validation, K.C.; resources, W.S. and Y.L. (Yongning Li); writing, Y.L. (Yongning Li) and W.S.; supervision, B.J.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 42161067) and the Yunnan Province Key Research and Development Program (No. 202202AD080010).

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the Copernicus program for free access to Sentinel-1 images processed in this analysis. The authors would also like to thank the open-source projects Hadoop and Spark provided by the Apache Foundation for their contributions to distributed computing and storage and Doris and StaMPS developers for their contributions to SAR data processing.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ferretti, A.; Prati, C.; Rocca, F. Permanent scatterers in SAR interferometry. IEEE Trans. Geosci. Remote Sens. 2001, 39, 8–20.
  2. Berardino, P.; Fornaro, G.; Lanari, R.; Sansosti, E. A new algorithm for surface deformation monitoring based on small baseline differential SAR interferograms. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2375–2383.
  3. Hooper, A. A multi-temporal InSAR method incorporating both persistent scatterer and small baseline approaches. Geophys. Res. Lett. 2008, 35, 1–5.
  4. Lanari, R.; Mora, O.; Manunta, M.; Mallorqui, J.; Berardino, P.; Sansosti, E. A small-baseline approach for investigating deformations on full-resolution differential SAR interferograms. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1377–1386.
  5. Hooper, A.; Zebker, H.; Segall, P.; Kampes, B. A new method for measuring deformation on volcanoes and other natural terrains using InSAR persistent scatterers. Geophys. Res. Lett. 2004, 31, 1–5.
  6. Lv, X.; Yazıcı, B.; Zeghal, M.; Bennett, V.; Abdoun, T. Joint-Scatterer Processing for Time-Series InSAR. IEEE Trans. Geosci. Remote Sens. 2014, 52, 7205–7221.
  7. Fornaro, G.; Verde, S.; Reale, D.; Pauciullo, A. CAESAR: An Approach Based on Covariance Matrix Decomposition to Improve Multibaseline–Multitemporal Interferometric SAR Processing. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2050–2065.
  8. Dong, J.; Zhang, L.; Tang, M.; Liao, M.; Xu, Q.; Gong, J. Mapping landslide surface displacements with time series SAR interferometry by combining persistent and distributed scatterers: A case study of Jiaju landslide in Danba, China. Remote Sens. Environ. 2018, 205, 180–198.
  9. Jiang, M.; Ding, X.L.; Li, Z.W. Homogeneous pixel selection algorithm for multitemporal InSAR. Chin. J. Geophys. 2018, 61, 4767–4776.
  10. Gao, Y.; Gao, F.; Dong, J.; Wang, S. Change detection from synthetic aperture radar images based on channel weighting-based deep cascade network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 4517–4529.
  11. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662.
  12. Kang, M.S.; Baek, J.M. Efficient SAR Imaging Integrated With Autofocus via Compressive Sensing. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
  13. Ferretti, A.; Fumagalli, A.; Novali, F.; Prati, C.; Rocca, F.; Rucci, A. A New Algorithm for Processing Interferometric Data-Stacks: SqueeSAR. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3460–3470.
  14. Zhu, J.; Li, Z.; Hu, J. Research progress and methods of InSAR for deformation monitoring. Acta Geod. Cartogr. Sin. 2017, 46, 1717.
  15. Lin, H.; Ma, P.; Weixi, W. Urban infrastructure health monitoring with spaceborne multi-temporal synthetic aperture radar interferometry. Acta Geod. Cartogr. Sin. 2017, 46, 1421.
  16. Mingzhou, W.; Tao, L.I.; Liming, J.; Kan, X.; Wenhao, W. An Improved Coherent Targets Technology for Monitoring Surface Deformation. Acta Geod. Cartogr. Sin. 2016, 45, 36.
  17. Massey, F.J., Jr. The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 1951, 46, 68–78.
  18. Casu, F.; Elefante, S.; Imperatore, P.; Zinno, I.; Manunta, M.; Luca, C.D.; Lanari, R. SBAS-DInSAR Parallel Processing for Deformation Time-Series Computation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 3285–3296.
  19. Zinno, I.; Elefante, S.; Mossucca, L.; Luca, C.D.; Manunta, M.; Terzo, O.; Lanari, R.; Casu, F. A First Assessment of the P-SBAS DInSAR Algorithm Performances Within a Cloud Computing Environment. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 4675–4686.
  20. De Luca, C.; Cuccu, R.; Elefante, S.; Zinno, I.; Manunta, M.; Casola, V.; Rivolta, G.; Lanari, R.; Casu, F. An On-Demand Web Tool for the Unsupervised Retrieval of Earth’s Surface Deformation from SAR Data: The P-SBAS Service within the ESA G-POD Environment. Remote Sens. 2015, 7, 15630–15650.
  21. Zinno, I.; Mossucca, L.; Elefante, S.; Luca, C.D.; Casola, V.; Terzo, O.; Casu, F.; Lanari, R. Cloud Computing for Earth Surface Deformation Analysis via Spaceborne Radar Imaging: A Case Study. IEEE Trans. Cloud Comput. 2016, 4, 104–118.
  22. Zinno, I.; Casu, F.; Luca, C.D.; Elefante, S.; Lanari, R.; Manunta, M. A Cloud Computing Solution for the Efficient Implementation of the P-SBAS DInSAR Approach. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 802–817.
  23. De Luca, C.; Zinno, I.; Manunta, M.; Lanari, R.; Casu, F. Large areas surface deformation analysis through a cloud computing P-SBAS approach for massive processing of DInSAR time series. Remote Sens. Environ. 2017, 202, 3–17.
  24. Duan, W.; Zhang, H.; Wang, C.; Tang, Y. Multi-Temporal InSAR Parallel Processing for Sentinel-1 Large-Scale Surface Deformation Mapping. Remote Sens. 2020, 12, 3749.
  25. Tang, Y.; Wang, C.; Zhang, H.; You, H.; Zhang, W.; Duan, W.; Wang, J.; Dong, L. Parallel CS-InSAR for Mapping Nationwide Deformation in China. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 25 January 2021; pp. 3392–3395.
  26. Da-di, M.; Yu-xin, H.; Tao, S.; Rui, S.; Xiao-bo, L. Airborne SAR real-time imaging algorithm design and implementation with CUDA on NVIDIA GPU. J. Radars 2013, 2, 481–491.
  27. Sheng, G.; Qi-Ming, Z.; Jian, J.; Cun-Ren, L.; Qing-xi, T. Parallel processing of InSAR interferogram filtering with CUDA programming. Sci. Surv. Mapp. 2015, 1, 54–68.
  28. Popov, S.E.; Potapov, V.P. A Fast Search Algorithm for SqueeSAR Distributed Scatterers in the Problem of Calculating Displacement Velocities. Program. Comput. Softw. 2021, 47, 426–438.
  29. Zhong, H.; Tang, J.; Zhang, S.; Huang, P. Combined Minimum Discontinuity Phase Unwrapping Based on Clusters. Geomat. Inf. Sci. Wuhan Univ. 2019, 44, 1363–1368.
  30. Sadooghi, I.; Martin, J.H.; Li, T.; Brandstatter, K.; Maheshwari, K.; Ruivo, T.P.P.D.L.; Garzoglio, G.; Timm, S.; Zhao, Y.; Raicu, L. Understanding the Performance and Potential of Cloud Computing for Scientific Applications. IEEE Trans. Cloud Comput. 2017, 5, 358–371.
  31. White, T. Hadoop: The Definitive Guide; O’Reilly Media, Inc.: Newton, MA, USA, 2012.
  32. Zaharia, M.; Chowdhury, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, Boston, MA, USA, 22–25 June 2010; p. 10.
  33. Ibrahim, S.; Jin, H.; Lu, L.; Wu, S.; Shi, X. Evaluating MapReduce on Virtual Machines: The Hadoop Case. In Proceedings of the Cloud Computing, First International Conference, CloudCom 2009, Beijing, China, 1–4 December 2009; pp. 519–528.
  34. Shi, J.; Qiu, Y.; Minhas, U.F.; Jiao, L.; Wang, C.; Reinwald, B.; Özcan, F. Clash of the titans: MapReduce vs. Spark for large scale data analytics. Proc. VLDB Endow. 2015, 8, 2110–2121.
  35. Yipeng, W.; Yongzhi, Z.; Chaoying, Z.; Xiaojie, L.; Yingyun, Z. Design and analysis of cloud platform for landslide monitoring in Heifangtai, Gansu province based on GPS and InSAR data. Bull. Surv. Mapp. 2019, 8, 106.
  36. Huang, W.; Meng, L.; Zhang, D.; Zhang, W. In-Memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3–19.
  37. Yipeng, W.; Yongzhi, Z.; Chaoying, Z.; Yulei, L.; Zhang, T. Design of landslide monitoring cloud platform based on cloud computing technology: Taking Jingyang as an example. Bull. Surv. Mapp. 2019, 3, 128.
  38. Li, Z.; Su, D.; Zhu, H.; Li, W.; Zhang, F.; Li, R. A Fast Synthetic Aperture Radar Raw Data Simulation Using Cloud Computing. Sensors 2017, 17, 113.
  39. Vavilapalli, V.K.; Murthy, A.C.; Douglas, C.; Agarwal, S.; Konar, M.; Evans, R.; Graves, T.; Lowe, J.; Shah, H.; Seth, S.; et al. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, CA, USA, 1–3 October 2013.
  40. Sadashiv, N.; Kumar, S.M.D. Cluster, grid and cloud computing: A detailed comparison. In Proceedings of the 2011 6th International Conference on Computer Science Education (ICCSE), Singapore, 3–5 August 2011; pp. 477–482.
  41. Xia, J.B.; Wei, Z.K.; Fu, K.; Chen, Z. Review of research and application on Hadoop in cloud computing. Comput. Sci. 2016, 43, 6–11.
  42. Reyes-Ortiz, J.L.; Oneto, L.; Anguita, D. Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf. Procedia Comput. Sci. 2015, 53, 121–130.
  43. Borthakur, D. HDFS architecture guide. Hadoop Apache Proj. 2008, 53, 2.
  44. Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big Data and cloud computing: Innovation opportunities and challenges. Int. J. Digit. Earth 2017, 10, 13–53.
  45. Verma, A.; Mansuri, A.H.; Jain, N. Big data management processing with Hadoop MapReduce and spark technology: A comparison. In Proceedings of the 2016 Symposium on Colossal Data Analysis and Networking (CDAN), Indore, India, 18–19 March 2016; pp. 1–4.
  46. Pan, F.; Yue, Y.; Xiong, J.; Hao, D. I/O characterization of big data workloads in data centers. In Proceedings of the Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Salt Lake City, UT, USA, 1 March 2014; pp. 85–97.
  47. Zhang, Y.; Gao, Q.; Gao, L.; Wang, C. iMapReduce: A distributed computing framework for iterative computation. J. Grid Comput. 2012, 10, 47–68.
  48. Kambatla, K.; Chen, Y. The Truth About MapReduce Performance on SSDs. In Proceedings of the 28th Large Installation System Administration Conference (LISA14), Seattle, WA, USA, 9–14 November 2014; pp. 118–126.
  49. Yu, G.; Gu, Y.; Bao, Y.B.; Wang, Z.G. Large scale graph data processing on cloud computing environments. Chin. J. Comput. 2011, 34, 1753–1767.
  50. Xiong, P.; Zuo, X.; Li, Y.; You, H. Application of dual-polarized Sentinel-1 data to subsidence monitoring in Kunming. Prog. Geophys. 2020, 35, 1317–1322.
  51. Guo, S.P.; Zhang, W.F.; Kang, W.; Zhang, T.; Li, Y. The Study on Land Subsidence in Kunming by Integrating PS, SBAS and DS InSAR. Remote Sens. Technol. Appl. 2022, 37, 460–473.
  52. Dai, K.; Tie, Y.; Xu, Q.; Feng, Y.; Zhuo, G.; Shi, X. Early identification of potential landslide geohazards in alpine-canyon terrain based on SAR interferometry—A case study of the middle section of Yalong river. J. Radars 2020, 9, 554–568.
  53. Marinkovic, P.S.; Hanssen, R.F.; Kampes, B.M. Utilization of parallelization algorithms in InSAR/PS-InSAR processing. In Proceedings of the Envisat & ERS Symposium, Salzburg, Austria, 6–10 September 2004; p. 572.
  54. Hooper, A.; Spaans, K.; Bekaert, D.; Cuenca, M.C.; Arıkan, M.; Oyen, A. StaMPS/MTI Manual; Delft Institute of Earth Observation and Space Systems, Delft University of Technology: Delft, The Netherlands, 2010; pp. 1–44.
Figure 1. Comparison before and after filtering: (a) before filtering and (b) after filtering (the resultant figure is after cropping a differential interferogram in the study area).
Figure 2. Pixel identification information within the search window (12 × 13) obtained based on KS test.
Figure 3. Hadoop Distributed File System (HDFS) framework and processing schematic.
Figure 4. Principle of Spark on YARN operation.
Figure 5. The principle of distributing data to executors via Broadcast.
Figure 6. Spark-based parallelization algorithm for spatial adaptive filtering (red represents the different operators provided by Spark).
Figure 7. Study area (including the main urban area of Kunming and its surroundings).
Figure 8. SqueeSAR data processing flow.
Figure 9. Subsidence rate map of the study area obtained after time series analysis.
Figure 10. A differential interferogram of the study area, generated with Doris, before filtering.
Figure 11. Comparison of the two processing methods: (a) filtering with conventional data processing, (b) filtering with the Spark parallel algorithm.
Figure 12. Number of executors versus speedup ratio for the case of three nodes: the x-axis represents the number of executors, where the first value indicates that one executor is started in the case of a single core, and the remaining values start 2 cores. The y-axis represents the speedup.
Figure 13. Comparison of filtering processing time with increasing number of executors for the entire study area (all PATCHs): the x-axis represents the number of executors, where the first value indicates that one executor is started in the case of a single core and the remaining values all start 2 cores. The y-axis represents the filtering processing time.
Figure 14. The relationship between the number of executor-allocated single-core processing units and the speedup in the case of three nodes: the x-axis represents the number of executors. The y-axis represents the speedup as the number of executors increases.
Figure 15. Comparison of filtering processing time with increasing number of executors (each with a single core) for the entire study area (all PATCHs): the x-axis represents the number of executors; the y-axis represents the filtering processing time.
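The executor and per-core configurations compared in Figures 12–15 are typically fixed at job submission time. As a minimal sketch (not taken from the paper), such a run might be launched with `spark-submit` on the YARN cluster; the driver script name and the memory value are illustrative assumptions.

```shell
# Sketch: submit the filtering job to the YARN cluster with an explicit
# executor count and cores per executor (values are illustrative assumptions).
# --num-executors : number of executor processes spread across the three nodes
# --executor-cores: cores per executor (1 for the single-core runs of Figures 14-15)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 8 \
  --executor-cores 2 \
  --executor-memory 4g \
  squeesar_filtering.py
```

Varying `--num-executors` and `--executor-cores` in this way is what produces the different points along the x-axes of Figures 12–15.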
Table 1. YARN cluster software configuration information.
Name | Configuration
Operating system | Linux RedHat 4.8.5
Hadoop | Version 3.2.2
Spark | Version 3.0.2
Java | Version 1.8
Python | Version 2.7.5
Doris (master node installation only) | Version 5.0.3
StaMPS (master node installation only) | Version 4.1
Table 2. Processing time of a single PATCH for different tasks with 8 executors.
Task Number | Run Time/s
240 | 99
320 | 95
400 | 90
480 | 86
560 | 89
640 | 93
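Table 2 shows a sweet spot in the task count: run time falls as parallelism increases, then rises again once scheduling overhead dominates. A small Python sketch that reads off the best-performing setting from the measured values (data copied directly from Table 2; the speedup here is relative to the slowest setting in the table, not the serial baseline used for the 2.15x figure in the abstract):

```python
# Run times (seconds) for a single PATCH with 8 executors, from Table 2.
run_times = {240: 99, 320: 95, 400: 90, 480: 86, 560: 89, 640: 93}

# The best setting is the task count with the lowest run time.
best_tasks = min(run_times, key=run_times.get)
print(best_tasks, run_times[best_tasks])  # -> 480 86

# Speedup of each setting relative to the slowest setting in the table.
slowest = max(run_times.values())
speedup = {n: round(slowest / t, 2) for n, t in run_times.items()}
print(speedup[best_tasks])  # -> 1.15
```

On these measurements, 480 tasks minimizes the run time; larger task counts add scheduling overhead faster than they add useful parallelism.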
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Li, Y.; Song, W.; Jin, B.; Zuo, X.; Li, Y.; Chen, K. A SqueeSAR Spatially Adaptive Filtering Algorithm Based on Hadoop Distributed Cluster Environment. Appl. Sci. 2023, 13, 1869. https://doi.org/10.3390/app13031869

