*Article* **Cloud and Snow Segmentation in Satellite Images Using an Encoder–Decoder Deep Convolutional Neural Networks**

**Kai Zheng <sup>1</sup>, Jiansheng Li <sup>1,</sup>\*, Lei Ding <sup>2</sup>, Jianfeng Yang <sup>1</sup>, Xucheng Zhang <sup>1</sup> and Xun Zhang <sup>1</sup>**


**Abstract:** The segmentation of cloud and snow in satellite images is a key step for subsequent image analysis, interpretation, and other applications. In this paper, a cloud and snow segmentation method based on a deep convolutional neural network (DCNN) with an enhanced encoder–decoder architecture, ED-CNN, is proposed. In this method, the atrous spatial pyramid pooling (ASPP) module is used to enhance the encoder, while the decoder is enhanced with the fusion of features from different stages of the encoder, which improves the segmentation accuracy. Comparative experiments show that the proposed method is superior to DeepLabV3+ with Xception and with ResNet50. Additionally, a rough-labeled dataset containing 23,520 images and a fine-labeled dataset of 310 images from the TH-1 satellite were created, on which we studied the relationship between the quality and quantity of labels and the performance of cloud and snow segmentation. Through experiments on the same network with different datasets, we found that cloud and snow segmentation performance is related more closely to the quantity of labels than to their quality. Namely, for the same labeling cost, using only rough-labeled images performs better than using rough-labeled images plus 10% fine-labeled images.

**Keywords:** satellite image; semantic segmentation; encoder–decoder; CNN; TH-1; cloud and snow detection; label quality

#### **1. Introduction**

With satellites becoming indispensable infrastructure for the development of the national economy, the acquisition of remote sensing images (RSIs) has become easier. RSIs are widely used in a variety of fields such as infrastructure, agriculture, forestry, geology, hydrology, transportation, and disaster prediction. However, 66.7% of the Earth's surface is covered by clouds [1], which is a major factor restricting the application of optical RSIs. Additionally, because cloud and snow share similar characteristics (e.g., high reflectivity) in optical bands, traditional methods such as threshold-based approaches cannot distinguish them and often lead to misjudgment. This greatly hinders the automatic processing of RSIs. Furthermore, there are deeper needs for cloud and snow detection, such as the construction of an atmospheric reflectance database, which can serve the retrieval of atmospheric aerosols [2]. Therefore, it is of great significance to segment cloud and snow quickly, accurately, and automatically.

A number of image segmentation methods have been proposed since the 1970s, among which the most classic is the Otsu method [3], proposed by Nobuyuki Otsu in 1979. It exhaustively searches for the threshold that maximizes the between-class variance, thus segmenting an image into foreground and background. Due to the high reflectivity of clouds in optical bands, the Otsu method can be used to segment cloud from the background in visible images. However, cloud and snow

**Citation:** Zheng, K.; Li, J.; Ding, L.; Yang, J.; Zhang, X.; Zhang, X. Cloud and Snow Segmentation in Satellite Images Using an Encoder–Decoder Deep Convolutional Neural Networks. *ISPRS Int. J. Geo-Inf.* **2021**, *10*, 462. https://doi.org/10.3390/ ijgi10070462

Academic Editors: Gloria Bordogna and Cristiano Fugazza

Received: 21 May 2021 Accepted: 1 July 2021 Published: 6 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

often share similar characteristics in optical bands, making it hard to distinguish them from each other, and it is almost impossible to segment cloud from all other objects using Otsu in a single pass. Although many methods have been proposed to improve Otsu (such as multi-Otsu), they suffer from limitations such as a large amount of computation and low robustness. Figure 1 shows the cloud segmentation results of the Otsu method, which produces totally different results on similar images.

**Figure 1.** The segmentation examples of the Otsu algorithm. (**a**) Cloud; (**b**) binarization image; (**c**) segmentation result.

The cloud and snow segmentation method based on elevation assistance [4] exploits the difference in elevation among cloud, snow, and other objects. The three-dimensional geometric features of cloud are obtained through the dense matching of multiple images, after which the differences between clouds and background objects are identified by comparison with existing elevation information. Although this kind of method has high accuracy, it involves many precondition-dependent and time-consuming operations, such as dense matching and digital elevation model registration.

In recent decades, with the development of pattern recognition and machine learning, researchers have studied intelligent methods for cloud segmentation and achieved good results. Amato et al. [5] applied principal component analysis (PCA) to image cloud detection based on statistical theory. Merchant et al. [6] proposed a cloud detection algorithm based on fully probabilistic Bayesian theory. Zhao Xiao [7] used fuzzy C-means clustering to iteratively cluster samples by minimizing an objective function and used a support vector machine (SVM) for classification, which yields better segmentation results when empirical knowledge is available, although the required human intervention greatly limits segmentation efficiency. There are also sparse perceptual classifiers, autoencoders, and other methods. Generally speaking, the time cost of machine-learning-based algorithms increases linearly with the number of pixels in the image. Therefore, for large-scale RSIs, the running time of such algorithms is often very long and can hardly satisfy the needs of real-world applications.

Recently, deep learning has made great progress in image classification, image segmentation, object detection, and other vision tasks. In 2015, Long et al. [8] proposed the fully convolutional network (FCN) and applied it to image semantic segmentation. Unlike

the classical CNN, which uses fully connected layers after the convolution layers to obtain a fixed-length feature vector for classification, the FCN operates on images of flexible size and performs pixel-level segmentation. For example, Shao et al. [9] proposed a method based on the multiscale feature model MF-CNN, which can detect thin and thick clouds in RSIs and achieves good detection accuracy on Landsat 8 data. In summary, deep-learning-based methods are emerging in cloud and snow segmentation of RSIs.

ResNet [10] is a feature extraction backbone proposed by He et al., built on the idea of residual learning to solve the problems of gradient vanishing/explosion and network degradation in traditional CNNs as the number of layers increases. A ResNet module bypasses the input to the output through an additional connection, preserving the integrity of the input. The network then only needs to learn the residual between input and output, which eases training as the network deepens and helps to retain more of the original semantic information.
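The residual idea can be sketched in a few lines of plain Python (a toy illustration, not the paper's implementation): the block returns its input plus a learned transformation of it, so the layers only model the difference.

```python
def residual_block(x, transform):
    """Output = x + F(x): the block only needs to learn the residual F(x)
    between input and output, while the skip connection preserves the input."""
    return [xi + fi for xi, fi in zip(x, transform(x))]

# Hypothetical learned transformation (stands in for conv/BN/activation layers).
f = lambda x: [0.5 * v for v in x]

residual_block([1.0, 2.0], f)  # -> [1.5, 3.0]
```

If `transform` collapses to zero, the block degenerates to the identity, which is why very deep residual networks do not degrade the way plain stacks of layers do.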

DeepLab [11–13] is a family of deep-learning semantic segmentation models proposed by Google, which combine an encoder–decoder FCN structure with the ASPP module to fuse multiscale features and make better use of image features than the plain FCN. The encoder–decoder structure was reintroduced in DeepLabV3+, which combines the Xception backbone [14] with the ASPP module as the encoder, using atrous convolutions with different dilation rates to extract features. After feature fusion and 4× upsampling, the result is fused in the decoder with the low-level features extracted by Xception, and another 4× upsampling produces the segmentation results. Thanks to ASPP, DeepLabV3+ improves the extraction of multiscale semantic information and achieves state-of-the-art accuracy on multiple datasets.
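The effect of the atrous rates in ASPP can be illustrated with a small calculation (a sketch; the rates 6, 12, and 18 are the ones commonly used in DeepLabV3+, not values stated by this paper): a k × k kernel with dilation rate r covers an effective extent of k + (k − 1)(r − 1) pixels, so parallel branches see multiple scales at the same parameter cost.

```python
def effective_kernel(k, rate):
    """Effective spatial extent of a dilated (atrous) k x k kernel."""
    return k + (k - 1) * (rate - 1)

# Typical ASPP branches: 3x3 convolutions at increasing dilation rates.
for rate in (1, 6, 12, 18):
    print(f"rate {rate:2d} -> effective kernel {effective_kernel(3, rate)}")
```

A rate-18 branch thus sees a 37 × 37 neighbourhood with only nine weights, which is what lets ASPP capture large cloud masses and thin wisps in parallel.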

TH-1 [15], the first generation of transmission stereo mapping satellite in China, carries three types of camera payloads (five cameras in total): a three-linear-array CCD camera, a two-meter-resolution panchromatic camera, and a ten-meter-resolution RGB camera. Our research data consist of 470 tiles of RSIs with RGB bands taken by the TH-1 satellite from 2018 to 2019. A total of 200 tiles were collected on 20 September 2018, while the others were collected on 5 April 2019. To ensure the generalization ability of the network, the selected images cover various underlying surfaces, acquisition seasons, and temporal phases, considering different geographical locations, climatic conditions, and cloud features. The longitude and latitude ranges are 28°35′ E–120°05′ E and 5°25′ N–60°05′ N, respectively, covering different ground surfaces such as deserts, grasslands, cities, and mountains. Additionally, in order to segment clouds and snow simultaneously, 175 scene images with snow were selected.

Despite these problems, threshold-based methods are still used in actual production. To solve the above problems, we propose a cloud and snow segmentation method based on a DCNN with an encoder–decoder structure. Compared with traditional methods, it does not rely on prior knowledge for feature selection and extraction. We combine the advantages of ResNet50 [10] and the encoder–decoder structure and improve the decoder to realize the simultaneous segmentation of cloud and snow in RSIs. By improving the network structure and using the exponential linear unit (ELU) activation function [16] and the focal loss function [17], we aim to optimize cloud edge segmentation and enhance the generalization ability of the network. On the other hand, as DCNNs become the mainstream of image semantic segmentation, the production of high-quality pixel-level labels for semantic segmentation has attracted more and more attention. However, accurate image annotation requires a lot of manpower, and not all existing datasets are fine-labeled. Although there are more and more DCNNs with ever-higher semantic segmentation accuracy, research based on open-source datasets such as Microsoft COCO [18] and PASCAL-VOC-2012 [19] mainly focuses on innovating the network structure and improving segmentation accuracy, without analyzing the impact of label quality and quantity on the results. Therefore, the other direction of this paper is to explore the influence of different data quality and quantity on the performance of cloud and snow segmentation in RSIs. Our major research contributions are summarized as follows:

First, we propose an end-to-end DCNN framework with an encoder–decoder architecture, ED-CNN, which improves the decoder by fusing features from different encoding stages. The outputs of ASPP, after a 1 × 1 convolution and 4× upsampling, are concatenated with the enhanced low-level features from the enhanced decoder. The concatenated feature maps then pass through a 3 × 3 convolution and 4× upsampling to recover the original size and segment the image pixels. Second, we present a TH-1 satellite dataset, which contains 23,520 coarse-labeled images with annotations. Additionally, a fine-labeled dataset of 310 images is added to support our experiments. Third, experiments have been conducted on different datasets, including TH-1 images of different temporal phases and Google Earth images, which demonstrate that the proposed network is superior to DeepLabV3+ with Xception and with ResNet50 and can be applied to multisource RSIs. Finally, we discuss the effects of labeling quality and quantity through extensive experiments with the proposed network. It is demonstrated that the performance of cloud and snow segmentation is positively related mainly to the labeling quantity. Namely, a smaller rough-labeled dataset plus some fine-labeled images (10% of the total) performs on par with a larger rough-labeled dataset of the same total image quantity. Furthermore, under the same labeling cost, the larger rough-labeled dataset exceeds the smaller rough-labeled dataset plus a few fine-labeled images.

#### **2. Methodology**

#### *2.1. Datasets Establishment*

First, we produced BMP images. The data files of TH-1 RSIs were converted from the four-channel TIF format to BMP images with a resolution of 6000 × 6000 pixels. Second, the dataset was divided and the images were cut. The 470 tiles of images were divided into the training set, the validation set, and the testing set at a ratio of 3:1:1. Since the original images were large and took up too much memory, they could not be fed into the network directly. Therefore, we cut the images into patches of 480 × 360 pixels. A total of 23,520 images were generated, including 13,924 training images, 4798 validation images, and 4798 testing images. Third, the images were rough-labeled. Labelme [20] was used to roughly mark the cloud area of each image and generate JSON files, which were converted in batches into label images of the same size. The rough-labeled images and their masks are shown in Figure 2, where red represents cloud, green represents snow, and black represents the background. Fourth, some images were fine-labeled. In order to verify the influence of fine and rough annotation on the training results (a detailed description is in Section 2.3.2), 310 extra images were randomly selected and labeled carefully, taking about six times the per-image cost of rough labeling, and then transformed into label images as in Step 3. Fine-labeled images have more accurate edge marking (errors of less than 5 pixels). The fine-labeled images and their masks are shown in Figure 3, where label 1 represents cloud, label 2 represents snow, and label 0 represents the background, located in the lower right corner of the image. Fifth, image preprocessing was performed. Reference [21] showed that network performance can be effectively improved by data augmentation. In order to enhance the generalization ability of the network and prevent overfitting, the dataset was augmented. The augmentation operations include vertical flips, horizontal flips, contrast changes, etc. The original and augmented images are shown in Figure 4.
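The augmentation operations above can be sketched as follows (a minimal pure-Python illustration on a tiny 2 × 2 "image"; the actual pipeline works on image files, and the same flip must also be applied to the label mask so image and label stay aligned):

```python
def vertical_flip(img):
    """Reverse the order of rows (top-bottom flip)."""
    return img[::-1]

def horizontal_flip(img):
    """Reverse each row (left-right flip)."""
    return [row[::-1] for row in img]

def adjust_contrast(img, factor, mid=128):
    """Scale each pixel's deviation from mid-gray, clipped to [0, 255]."""
    return [[min(255, max(0, round(mid + factor * (p - mid)))) for p in row]
            for row in img]

img = [[10, 20],
       [30, 40]]
vertical_flip(img)    # [[30, 40], [10, 20]]
horizontal_flip(img)  # [[20, 10], [40, 30]]
```

Flips are label-preserving for cloud and snow (there is no canonical orientation in nadir satellite imagery), which is why they are safe defaults here.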

**Figure 2.** The coarse-labeled images and their masks. (**a**) Cloud; (**b**) mask; (**c**) snow; (**d**) mask; (**e**) cloud and snow; (**f**) mask.


**Figure 3.** The fine-labeled images and their masks. (**a**) Cloud, (**b**) rough mask, (**c**) fine mask, (**d**) snow, (**e**) rough mask, (**f**) fine mask, (**g**) cloud and snow, (**h**) rough mask, and (**i**) fine mask.

**Figure 4.** Original and augmented images. (**a**) Original image, (**b**) vertical flip, (**c**) horizontal flip, and (**d**) contrast change.

#### *2.2. Methods*

2.2.1. Network Backbone

How to extract features properly and effectively from different kinds of RSIs is a key problem. Cloud and snow in RSIs mostly present a planar structure: their semantic information is simple, whereas their detailed information is rich, which places a high demand on detail extraction. For example, Xception, with 22.8 M parameters [14], is suited to segmentation tasks with many kinds of objects; meanwhile, it requires huge computational resources and is difficult to train, so it is not fully suitable for the task of cloud and snow segmentation. In this paper, the ResNet50 backbone was selected as the encoder to extract features of cloud and snow (as shown in Figure 5). The parameter size of ResNet50 was only 0.85 M [10], and more direct connections were added in the network. Considering its advantages, such as fewer parameters, easier training, and faster convergence, it is more suitable for cloud and snow segmentation than Xception.

**Figure 5.** Network architecture of the proposed method.

#### 2.2.2. Enhanced Decoder

In encoder–decoder architectures such as DeepLabV3+, the decoder subnet gradually recovers the spatial information and is usually not as powerful as the encoder. In this regard, besides replacing the backbone with ResNet50, we added skip connections in the decoder. Specifically, we selected features from stages 1, 3, 4, and 5 of ResNet50 to construct a top-down feature map pyramid, which enriches the semantic representation of low-level features to better utilize the spatial information, as shown in Figure 5. The high-resolution low-level feature maps and the semantically rich high-level feature maps are fused, which quickly constructs a decoder with better semantic information from four stages instead of a single stage, without an obvious increase in cost.

As Figure 6 shows, the feature map from stage 5 was first passed through a 1 × 1 convolution and 2× upsampled, then added to the output of stage 4, which had also passed through a 1 × 1 convolution. The channel numbers of these stages were all set to 256, ensuring that the feature maps could be added. Second, the summed feature map was 2× upsampled again to the size of the output of stage 3, and the process was repeated down to the output of stage 1 to obtain the fused feature map. Third, the feature map of the enhanced decoder was concatenated with the feature map generated by the ASPP module after 3 × 3 convolution, batch normalization, and ELU operations. Finally, the segmentation map was obtained by a 3 × 3 convolution and 4× upsampling.
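The top-down fusion described above can be sketched with NumPy (a toy illustration under simplifying assumptions: the 1 × 1 projections to 256 channels are taken as given, nearest-neighbour upsampling stands in for the network's upsampling, and consecutive maps are assumed to differ by a factor of 2 in spatial size):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return np.kron(f, np.ones((1, 2, 2)))

def top_down_fuse(pyramid):
    """Fuse encoder features from deepest to shallowest: each deeper map
    is 2x-upsampled and added element-wise to the next shallower one."""
    deepest_first = pyramid[::-1]
    fused = deepest_first[0]
    for shallower in deepest_first[1:]:
        fused = upsample2x(fused) + shallower
    return fused

# Toy pyramid: three stages with spatial sizes 8, 4, 2 and C = 256 channels.
pyramid = [np.ones((256, 8, 8)), np.ones((256, 4, 4)), np.ones((256, 2, 2))]
fused = top_down_fuse(pyramid)
fused.shape  # (256, 8, 8)
```

Keeping all stages at 256 channels is what makes the element-wise sums legal without any further reshaping, exactly as the text notes.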

**Figure 6.** Network workflow of the proposed method (⊕ means element-wise sum).

#### 2.2.3. Loss Function

In practice, there are many kinds of clouds with different shapes. Generally, the proportion of thin clouds and cirrus clouds is smaller than that of thick clouds. The training dataset in this paper also reflects this characteristic, containing less data for thin clouds, cirrus clouds, and snow. Cross-entropy (CE) loss cannot balance the learning of such under-represented samples. Its formula is as follows:

$$CE(p_t) = -\log(p_t) \tag{1}$$

Through the combination of different parameters, focal loss [17] can solve the problem of sample imbalance in the semantic segmentation task; it is an improved version of the CE loss obtained by adding a weight. Its formula is as follows:

$$FL(p_t) = -(\lambda - p_t)^\gamma \log(p_t) \tag{2}$$

where *λ* and *γ* are two hyperparameters and *p<sub>t</sub>* is the predicted probability of the true label. (*λ* − *p<sub>t</sub>*)<sup>*γ*</sup> can be regarded as the weight applied to Equation (1). Reference [17] sets *γ* = 2 and *λ* = 1. When the prediction for a certain category is accurate, i.e., *p<sub>t</sub>* is close to 1, the weight is close to 0; the more inaccurate the prediction, i.e., the closer *p<sub>t</sub>* is to 0, the closer the weight is to 1. Thus, the loss weight is small for easily distinguished samples and larger for hard-to-distinguish ones, which retains the loss value of difficult samples and reduces the loss value of simple samples. We set *γ* = 2 and *λ* = 1.
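A minimal sketch of Equation (2) in Python makes the re-weighting concrete (with γ = 2 and λ = 1, as used in the paper):

```python
import math

def focal_loss(p_t, gamma=2.0, lam=1.0):
    """Focal loss for one prediction: the factor (lam - p_t)**gamma
    down-weights easy samples relative to cross-entropy -log(p_t)."""
    return -((lam - p_t) ** gamma) * math.log(p_t)

focal_loss(0.9)  # easy sample: ~0.001, almost no contribution
focal_loss(0.1)  # hard sample: ~1.87, close to the full CE loss of ~2.30
```

The easy sample's loss is cut by two orders of magnitude while the hard sample keeps most of its cross-entropy, which is exactly the balancing behaviour the text describes for rare thin clouds and snow.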

#### 2.2.4. Activation Function

The exponential linear unit (ELU) was proposed in [16]; it pushes the mean activation toward 0, thus accelerating network convergence and effectively mitigating problems such as gradient vanishing. If the input to the ELU layer is x, its output is given by Equation (3), where α is a positive hyperparameter. We adopted the ELU activation after the convolution layers.

$$f(x) = \begin{cases} x, & x \ge 0 \\ \alpha(e^{x} - 1), & x < 0 \end{cases} \tag{3}$$
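For reference, Equation (3) translates directly into code (a sketch with α = 1, a common default; the paper does not state its α):

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x >= 0; alpha * (exp(x) - 1) for x < 0, so
    negative outputs saturate at -alpha and the mean activation stays near 0."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

elu(2.0)   # 2.0 (unchanged, like ReLU)
elu(-1.0)  # about -0.632 (nonzero gradient, unlike ReLU)
```

Unlike ReLU, negative inputs still produce a smooth, bounded output, which is what keeps gradients flowing and the layer mean near zero.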

#### 2.2.5. Evaluation Metrics

Here we used pixel accuracy (PA) and mean intersection over union (MIoU) as the evaluation metrics.

1. PA: the simplest and most direct indicator, which calculates the ratio of the number of correctly classified pixels to the total number of pixels. The calculation is shown in Equation (4).

$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \tag{4}$$

2. MIoU: it calculates the ratio between the intersection and the union of two sets, that is, the intersection-over-union between the ground-truth segmentation and the algorithm's segmentation. This ratio can be expressed as the number of true positives (intersection) divided by the total number of true positives, false negatives, and false positives (union). The IoU is calculated per class and then averaged. The calculation is shown in Equation (5).

$$MIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \tag{5}$$
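Both metrics can be computed from a confusion matrix p, where p[i][j] counts pixels of class i predicted as class j (a small self-contained sketch; the 2-class matrix below is made up for illustration):

```python
def pixel_accuracy(cm):
    """PA: correctly classified pixels (diagonal) over all pixels."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def mean_iou(cm):
    """MIoU: per-class TP / (TP + FN + FP), averaged over all classes."""
    k = len(cm)
    ious = []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # missed pixels of class i
        fp = sum(cm[j][i] for j in range(k)) - tp  # pixels wrongly labeled i
        ious.append(tp / (tp + fn + fp))
    return sum(ious) / k

# Toy 2-class confusion matrix (rows: ground truth, columns: prediction).
cm = [[50, 10],
      [ 5, 35]]
pixel_accuracy(cm)  # 0.85
mean_iou(cm)        # ~0.735
```

Note that MIoU is always at most PA and penalizes false positives that PA ignores, which is why the paper reports both.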

#### *2.3. Experiments*

#### 2.3.1. Experimental Platform

The experimental hardware was a Lenovo workstation with a 2.1 GHz Intel Xeon CPU and an NVIDIA Titan Xp GPU. We used the TensorFlow + Keras framework to build the deep learning models. The network training was carried out with data augmentation. The initial learning rate was set to 3 × 10<sup>−4</sup> and the learning-rate decay factor was set to 0.1. Using the ELU activation function and the Adam optimizer [22], the batch size was set to 8 according to the capacity of the GPU.
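The learning-rate policy can be sketched as a standard step-decay schedule (the base rate 3 × 10⁻⁴ and drop factor 0.1 are from the paper; the drop interval of 20 epochs is an assumed value, since the paper does not state it):

```python
def step_decay(epoch, base_lr=3e-4, drop=0.1, epochs_per_drop=20):
    """Multiply the base learning rate by `drop` every `epochs_per_drop`
    epochs. epochs_per_drop is an assumed value, not stated in the paper."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

for e in (0, 19, 20, 40):
    print(f"epoch {e:2d}: lr = {step_decay(e):.1e}")
```

Such a function can be plugged into a Keras `LearningRateScheduler` callback so the optimizer's rate is updated at the start of every epoch.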

#### 2.3.2. Group Experiment Settings

We set the experiments in two stages, numbered from 0 to 4, as shown in Table 1. First, the performance of different networks was evaluated on dataset 1, which included 23,520 images. Second, a smaller dataset 2, consisting of 6660 rough-labeled images and 310 fine-labeled images chosen from dataset 1, was used to explore the influence of the quantity and quality of image labels on the accuracy of cloud and snow segmentation.

**Table 1.** Experiment setting.

#### **3. Results and Analysis**

#### *3.1. Training Process and Results*

First, we used dataset 1 to train the proposed network, ResNet50, and Xception for 60 epochs each. Figure 7 shows the loss and accuracy comparison of the three networks on the training and validation sets. The red line represents the proposed method, while the yellow and green lines represent ResNet50 and Xception, respectively.

**Figure 7.** Results of three networks on dataset 1. (**a**) Training loss; (**b**) training PA; (**c**) training MIoU; (**d**) validation loss; (**e**) validation PA; (**f**) validation MIoU.

Second, dataset 2 of 6660 images was randomly chosen from dataset 1 and divided into 4660 training, 1000 validation, and 1000 testing images. Another 310 fine-labeled images were added to the training set. Four groups of experiments were performed to explore the influence of the data-use strategy on the proposed network. Figure 8 shows the results.

**Figure 8.** Results of the proposed network on dataset 2. (**a**) Training loss; (**b**) training PA; (**c**) training MIoU; (**d**) validation loss; (**e**) validation PA; (**f**) validation MIoU.

#### *3.2. Comparison and Analysis*

First, the testing results of the methods on dataset 1 are shown in Table 2. On the testing set of 4798 images, the PA of cloud and snow segmentation obtained by the proposed method was 90.3%, while the PA of Xception, ResNet50, and DANet [23] was 87.4%, 89.2%, and 88.9%, respectively. The MIoU of the proposed method was 81.1%, exceeding Xception, ResNet50, and DANet by 4.2%, 0.6%, and 0.8%, respectively. When the proposed enhanced decoder was used to enhance the low-level features, the PA and MIoU of the network increased markedly.

**Table 2.** Results of the three networks on dataset 1.


Second, four groups of experiments were conducted on dataset 2 to explore the influence of data quality and quantity on the proposed network. The results are shown in Table 3. Groups 1 and 2 show that a 10% reduction in the amount of training data reduces the overall performance. Groups 2 and 3 show that replacing the removed 10% with fine-labeled data had a slight negative effect on performance; in our opinion, adding data with different characteristics when the training set is not large enough leads to this problem. Additionally, on average, it took six times as long to produce a fine-labeled image as a rough-labeled one. However, group 4 shows that 4660 rough-labeled images yielded better results than the other three groups.


**Table 3.** Results of a different data use strategy on dataset 2.

Finally, a qualitative analysis of typical images was conducted to compare the results of the proposed method with the Otsu method and the other two networks (Xception and ResNet50). Figure 9 shows the comparison between the traditional Otsu method and the three networks, where the red part is cloud and the green part is snow. When the illumination conditions were not ideal and the contrast with the underlying surface was low, Otsu produced a false alarm in Figure 9b, and it could not segment cloud and snow at the same time (as in Figure 9d). The reason is that the Otsu algorithm does not consider neighborhood information during segmentation and is sensitive to noise. In Figure 9a,c, Xception missed some pixels; it produced a wrong result in Figure 9b and a completely wrong one in Figure 9d. The performance of the proposed network was slightly better than that of ResNet50 in preserving spatial details.

**Figure 9.** Comparison of the segmentation results of the Otsu method and the three networks.

Furthermore, TH-1 images from different temporal phases were chosen to verify the generalization performance of the proposed method. The results are shown in Figure 10. They demonstrate that the proposed method can accurately segment cloud and snow, showing a good generalization ability.

**Figure 10.** Testing results of the proposed method on the TH-1 image acquired from different temporal phases. (**a**) Cloud; (**b**) snow ground.

#### **4. Conclusions**

#### *4.1. The Proposed Method for Cloud and Snow Segmentation*

In this paper, we proposed an end-to-end cloud and snow segmentation network for TH-1 RSIs, which combines the advantages of the encoder–decoder architecture and an enhanced decoder. On the one hand, it avoids the shortcomings of traditional cloud detection algorithms (such as parameter dependence, long running time, and a limited scope of application). It achieved an MIoU of 81.1% on 4798 testing images and reduced the segmentation time (for a single 480 × 360 image) to 49.2 ms, which basically meets the requirements of image preprocessing. On the other hand, enhancing the decoder proved to be a useful way to improve segmentation performance by exploiting features from different encoder stages, bridging the gap between different levels of features. Additional experiments show that the method can also be applied to images acquired by other sensors, as shown in Figure 11.

**Figure 11.** Testing results of the proposed method on a Google Earth image.

There is still room for further improvement in the cloud and snow segmentation method proposed in this paper, such as improving the training accuracy and using additional multisource RSIs for transfer learning. Another research direction is to use multispectral information to solve the problem of distinguishing cloud, fog, and snow in mixed regions.

#### *4.2. Influence of Different Datasets on Segmentation Performance*

Given a certain network, we found that its segmentation performance was positively related mainly to the number of training images and labels. Specifically, when the training time was sufficient, more training images led to higher accuracy, whereas replacing 10% of the rough-labeled data with fine-labeled data had a slight negative effect on performance. Considering the small number of categories and the low complexity of cloud and snow segmentation, our conclusion is that, for the same labeling time, better results are achieved by only roughly labeling the data. Instead of spending more manual resources on fine-labeled masks, roughly labeling more data can achieve the same segmentation accuracy.

There is room for further research on the effects of label quality and quantity, such as quantifying the pixel error of coarse labels and exploring the effect of error types and magnitudes on cloud and snow segmentation results.

**Author Contributions:** Conceptualization, Kai Zheng and Jiansheng Li; methodology, Kai Zheng; software, Kai Zheng and Lei Ding; validation, Kai Zheng and Jianfeng Yang; formal analysis, Xun Zhang; investigation, Xucheng Zhang; resources, Xucheng Zhang; data curation, Jianfeng Yang; writing—original draft preparation, Kai Zheng; writing—review and editing, Jiansheng Li; visualization, Lei Ding; supervision, Jiansheng Li. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not available.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Soft Integration of Geo-Tagged Data Sets in J-CO-QL**<sup>+</sup>

**Paolo Fosci and Giuseppe Psaila \***

Department of Management, Information and Production Engineering, University of Bergamo, Viale Marconi 5, 24044 Dalmine, Italy

**\*** Correspondence: giuseppe.psaila@unibg.it; Tel.: +39-035-205-2355

**Abstract:** The possibility offered by the current technology to collect and store data sets regarding public places located on the Earth globe is posing new challenges, as far as the integration of these data sets is concerned. Analysts usually need to perform such an integration from scratch, without performing complex and long preprocessing or data-cleaning tasks, as well as without performing training activities that require tedious and long labeling of data; furthermore, analysts now have to deal with the popular *JSON* format and with data sets stored within *JSON* document stores. This paper demonstrates that a methodology based on soft integration (i.e., data integration performed through soft computing and fuzzy sets) can now be effectively applied from scratch, through the *J-CO* Framework, which is a stand-alone tool devised to process *JSON* data sets stored within *JSON* document stores, possibly by performing soft querying on data sets. Specifically, the paper provides the following contributions: (1) It presents a soft-computing technique for integrating data sets describing public places, without any preliminary pre-processing, cleaning and training, which can be applied from scratch; (2) it presents current capabilities for soft integration of *JSON* data sets, provided by the *J-CO* Framework; (3) it demonstrates the effectiveness of the soft integration technique; (4) it shows how a stand-alone tool able to support soft computing (as the *J-CO* Framework) can be effective and efficient in performing data-integration tasks from scratch.

**Keywords:** off-line integration of geo-tagged data sets; data sets about public places; soft integration methodology; effective soft integration through a stand-along tool

#### **1. Introduction**

Integrating geo-spatial information has become a crucial task in the current world. In fact, in the era of Open Data and Big Data, a plethora of sources can provide both authoritative and non-authoritative data sets concerning places. The situation is further complicated by the fact that social media provide people with tools for describing places in a non-controlled way. For example, *Facebook* provides its users with the functionality to define a "page"; a specific category of page describes a "public place", such as restaurants, pubs, hairdressers, universities, parks and so on; through its API (Application Programming Interface), pages can be queried on the basis of their category, location, coordinates, and so on. Another interesting service is *Google Places*: it is a sub-service of *Google Maps*; the *Google Places* API can be used to query its corpus to find places of interest, on the basis of category, location, and so on; this corpus is built by *Google Maps* by integrating both authoritative and non-authoritative data, the latter given by users through the social interface provided by *Google Maps*.

In the current scenario, it is very easy to collect data sets from multiple sources, such that these data sets provide geo-tagged information about public places. Since current APIs of social media and Open-Data portals provide data as (possibly) geo-tagged *JSON* documents (*JSON* stands for JavaScript Object Notation, see [1]), *JSON* document stores are the natural storage in which to save such data sets. Consequently, integrating geo-tagged data sets describing public places calls for suitable tools that are able to work on *JSON* document stores. This is the reason why, at the University of Bergamo (Italy), we are devising [2–5] an innovative tool, called the *J-CO* Framework, to perform the complex integration and querying of (possibly geo-tagged) *JSON* data sets.

**Citation:** Fosci, P.; Psaila, G. Soft Integration of Geo-Tagged Data Sets in J-CO-QL+. *ISPRS Int. J. Geo-Inf.* **2022**, *11*, 484. https://doi.org/10.3390/ijgi11090484

Academic Editors: Wolfgang Kainz and Huayi Wu

Received: 4 June 2022; Accepted: 7 September 2022; Published: 13 September 2022

Nevertheless, integrating geo-tagged data sets describing public places is not a novel problem; in general, traditional approaches rely on machine-learning techniques that require a preliminary training phase. In [6], the problem was addressed in a different way, because the context of "online" aggregation was considered: a fuzzy relation was defined, which provides an easy-to-compute metric suitable for the online integration of data about public places; the experiments demonstrated that the approach is effective and comparable, in terms of effectiveness, with off-line classification techniques. However, the technique presented in [6] was hard-coded within the software prototype; this allowed us to include some pre-processing steps on strings, performed on the fly immediately after place descriptors were acquired. Nevertheless, the approach seems to be general and could be applied to integrating data sets in an off-line way too, with data sets stored within *JSON* stores. Paradoxically, this apparently small change of context constitutes a significant challenge: the straightforward solution would be to hard-code the technique into a software tool again, but this approach is not coherent with the world of *JSON* document stores. We think that exploiting a stand-alone tool able to query *JSON* stores is preferable, since it is transparent and comprehensible for analysts; however, a stand-alone tool for processing *JSON* data sets is necessarily less flexible than a programming language. Thus, the challenge is the following: is it possible to identify a stand-alone tool and adapt the integration technique presented in [6] to the case of off-line integration of geo-tagged *JSON* data sets from *JSON* stores?

The current evolution of *J-CO-QL*+, the query language of the *J-CO* Framework, provides constructs for evaluating the membership of *JSON* documents to fuzzy sets [7–9]. Thus, the straightforward idea of checking the current capability of *J-CO-QL*<sup>+</sup> for integrating *JSON* data sets describing public places naturally emerged: specifically, since *J-CO-QL*<sup>+</sup> is able to deal with complex soft querying of *JSON* data sets and the technique presented in [6] for the online integration of data sets concerning public places is based on fuzzy relations, we had the intuition of mixing the two approaches. In other words, given two sets of place descriptors represented as *JSON* documents and stored within a *JSON* document store, in this paper, we experiment with the application of the fuzzy technique presented in [6] (to be precise, a slightly evolved version of it) by means of the *J-CO* Framework, in an off-line manner. The goal is to verify that this approach is suitable in an off-line context, without previous training activities and without human intervention to label data sets for driving the learning phase (typical of classification techniques). Ultimately, we want to demonstrate that the availability of a stand-alone tool such as the *J-CO* Framework, which is able to process *JSON* data sets by applying soft computing and fuzzy sets, indeed provides analysts with a powerful means to address a common data-integration problem in an effective and (possibly) efficient way.

Summarizing, the contribution of the paper is manifold: (1) Presenting a soft-computing technique for integrating data sets describing public places, without any preliminary preprocessing, cleaning and training, which can be applied from scratch; (2) presenting current capabilities for soft integration of *JSON* data sets, as they are provided by *J-CO-QL*+; (3) demonstrating the effectiveness of the soft integration technique in a harder context than that considered in [6]; (4) showing how a stand-alone tool able to support soft computing (as *J-CO-QL*+) can be effective and efficient in performing data-integration tasks from scratch.

The paper is organized as follows. Section 2 presents relevant related work concerned with the paper. Section 3 provides a brief introduction to relevant concepts concerning fuzzy-set theory. Section 4 introduces the main features of the *J-CO* Framework. Section 5 precisely explains the addressed problem and introduces the methodology we follow, which relies on the concept of fuzzy relation. Section 6 presents and discusses the script written by means of *J-CO-QL*+, which practically applies the methodology presented in Section 5; each single instruction is explained, in order to illustrate how it behaves and its contribution within the script. Section 7 reports the results of an experimental evaluation, in which we evaluated effectiveness and, marginally, execution times. Finally, Section 8 draws the conclusions and possible future work.

#### **2. Related Work**

This paper embraces two different research lines: soft querying on databases in general and on *JSON* document stores in particular (Section 2.1), as well as the integration of data sets describing public places (Section 2.2).

#### *2.1. Soft Querying on Databases*

Providing data users with capabilities for flexibly querying databases is an old challenge. In particular, when selection conditions can rely on vague predicates, queries become "soft", meaning that they are tolerant to thresholds (e.g., given a Boolean predicate price <= 30 to select cheap products, a product whose price is 30.45 is not selected, while instead it could be of interest) and selected items could be ranked on the basis of their relevance to the selection condition. Fuzzy sets appeared as the formal framework to specify soft selection conditions [10]. Since relational-database technology dominated the panorama of database technology, many works were conducted to propose an extension of SQL (the standard query language for relational databases) towards soft querying based on fuzzy sets. Some popular proposals are *SQLf* [11,12] (for which we can mention an attempt to implement it [13]) and its extension named *SQLf3* (which copes with constructs introduced in *SQL3*), as well as *FQUERY for Access* [14,15] (designed to operate on databases managed by Microsoft Access). Among all these proposals, *SoftSQL* [16–18] provided users with a statement to define non-trivial "linguistic predicates", to be used in the extended SELECT statement to select table rows through linguistic predicates. The interested reader can find various surveys on the topic [19,20]; in particular, the work [21] is a very large handbook that summarizes all research work on this topic.

The advent of *NoSQL* (Not only SQL) databases [22], i.e., databases that do not rely on the classical relational model, has started a novel era in data management. In particular, the popularity obtained by the *JSON* (JavaScript Object Notation) format to represent any kind of complex data confronts data engineers with the (novel with respect to relational databases) concept of "*JSON* document store", i.e., a database that stores *JSON* documents in a native way. The most famous *JSON* document store is *MongoDB* [23], but many others are available (such as *CouchDB* [24], exploited within the block-chain platform called HyperLedger Fabric [25]). As a result, this novel scenario is revamping the topic of soft querying on databases, this time on *NoSQL* databases in general and on *JSON* document stores in particular.

An extension of MQL, the *MongoDB* query language, is proposed in [26]; in this extension, called "fMQL", "fuzzy labels" can be used to query *JSON* documents, since they are equivalent to linguistic predicates; unfortunately, the work [26] does not provide any indication about how to define fuzzy labels. A further limitation of the proposal is that, for each single *JSON* document, only one membership degree is implicitly evaluated (in contrast, *J-CO-QL*<sup>+</sup> allows for dealing with many membership degrees for each single document).

Finally, the work [27] proposes an approach for soft querying *JSON* documents: the corpus of *JSON* documents is preliminarily translated into fuzzy RDF triples [28]; then, the query is translated into *fSPARQL* [29], a fuzzy extension of *SPARQL* [30]. In our opinion, this approach is not suitable for processing *JSON* documents, because it does not work on the original documents, but on an alternative representation of them.

#### *2.2. Integrating Data Sets Describing Public Places*

The topic of aggregating information about public places coming from internet sources has been investigated in the last decade. Many different approaches have been followed.

For example, the work [31] adopts the *DAS* technique to integrate data about public places uniquely by exploiting string similarity on names, in particular by comparing the two strings without and with tokenization.

The work [32] compares different string-similarity metrics with various machine learning methods, to solve the problem of toponym matching. The results demonstrate that machine learning methods (in particular, classifiers) perform better than string-similarity metrics. Obviously, they cannot be applied from scratch, without preliminary labeling and training activities. Similarly, the work [33] exploits a neural network to perform "toponym matching", i.e., pairing strings that represent the same location.

The work [34] addresses the problem of "geo-spatial data conflation" (the general name of the problem addressed in this paper) by adopting an entropy-based technique: the key idea is to use phonetic transcriptions, to compensate for mistakes in writing names.

Another work that can be considered as related to this paper is [35], in which "semantic aligning" of heterogeneous geo-spatial data sets (GDs) is addressed. Specifically, it proposed an efficient similarity matching technique, which integrates various category systems simultaneously.

Finally, the closest work to this paper is [6]: the present paper can be considered its natural evolution. Specifically, in [6] a complex fuzzy relation is defined to perform public-place conflation in an online way. A comparison with a famous classification technique (i.e., "Random-Forest" classifiers) was performed, showing that the fuzzy approach is comparably effective. Here, the definition of the fuzzy relation is improved to cope with non-cleaned names and addresses, and it is applied within the context of the *J-CO* Framework for the off-line integration of *JSON* data sets.

#### **3. Basic Notions on Fuzzy Sets**

In [36], Zadeh introduced the Fuzzy-Set Theory. It rapidly became clear that it had (and still has) an enormous potential to be successfully applied to many areas of computer science, such as decision making, control theory, expert systems, artificial intelligence, natural-language processing, and so on. Here, we report some basic concepts, which constitute the basis to understand the main contribution of this paper.

**Definition 1.** *Fuzzy Set. Consider a "universe set" U. A fuzzy set (or type-1 fuzzy set) A in U (A* ⊆ *U) is a mapping A* : *U* → [0, 1]*. The value A*(*x*) *is referred to as the* membership degree *of the element x to the fuzzy set A. Alternatively, the notation µA*(*x*) ∈ [0, 1] *can be used.*

Clearly, given an item *x* ∈ *U*, if *A*(*x*) = 0, this means that *x* does not belong at all to *A*; an intermediate value 0 < *A*(*x*) < 1 means that *x* partially belongs to *A* (the greater the value, the higher its degree of membership); if *A*(*x*) = 1, this means that the item *x* fully belongs to *A*.

Consequently, a fuzzy set is "empty" if and only if its membership function is identically zero for each *x* ∈ *U*.

Furthermore, given two fuzzy sets *A* in *U* and *B* in *U*, they are "equal" (denoted as *A* = *B*), if and only if *A*(*x*) = *B*(*x*) (alternatively, *µA*(*x*) = *µB*(*x*)) for all *x* ∈ *U*.

Operators on fuzzy sets can be easily defined, by extending the classical operators on traditional sets.

**Definition 2.** *Union, Intersection and Complement. Consider a universe U and two fuzzy sets A in U and B in U.*

*The* union *of two fuzzy sets A and B, denoted as S* = *A* ∪ *B , generates a novel fuzzy set S whose membership function is S*(*x*) = *max*(*A*(*x*), *B*(*x*))*, for each x* ∈ *U (alternatively, µS*(*x*) = *max*(*µA*(*x*), *µB*(*x*))*).*

*The* Intersection *of two fuzzy sets A and B, denoted as I* = *A* ∩ *B, generates a novel fuzzy set I whose membership function is I*(*x*) = *min*(*A*(*x*), *B*(*x*))*, for each x* ∈ *U (alternatively, µI*(*x*) = *min*(*µA*(*x*), *µB*(*x*))*).*

*The* Complement *of a fuzzy set A, denoted as C* = ¬*A, generates a novel fuzzy set C whose membership function is C*(*x*) = 1 − *A*(*x*)*, for each x* ∈ *U (alternatively, µC*(*x*) = 1 − *µA*(*x*)*).*

Classical logical operators are mapped onto operators on fuzzy sets: the OR operator is mapped onto the union; the AND operator is mapped onto the intersection; the NOT operator is mapped onto the complement.

Fuzzy sets are useful to represent vague concepts, which characterize many real-life application contexts. For example, if the universe is the set of people, we could think of dividing them into "young" and "old". However, is a person whose age is 40 actually young or old? He/she is a little bit young and a little bit old, neither fully young nor fully old.

Various other operators on fuzzy sets can be defined. In the following definition, we introduce the "weighted aggregation" operator.

**Definition 3.** *Weighted Aggregation. Given a universe U and two fuzzy sets A in U and B in U, the* weighted aggregation *operator W* = *wagβ*(*A*, *B*) *(with β* ∈ [0, 1]*) generates a new fuzzy set W whose membership function is defined as W*(*x*) = *β* × *A*(*x*) + (1 − *β*) × *B*(*x*) *(alternatively, µW*(*x*) = *β* × *µA*(*x*) + (1 − *β*) × *µB*(*x*)*).*
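As an illustration, the operators of Definitions 2 and 3 can be sketched in Python, representing a fuzzy set as a dictionary that maps items to membership degrees (the representation is merely illustrative, not the one used by *J-CO-QL*+):

```python
# Sketch of the fuzzy-set operators of Definitions 2 and 3.
# A fuzzy set is a dict mapping items to degrees in [0, 1];
# missing items implicitly have degree 0.

def union(a, b):
    # S(x) = max(A(x), B(x))
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def intersection(a, b):
    # I(x) = min(A(x), B(x))
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def complement(a, universe):
    # C(x) = 1 - A(x), evaluated over an explicit universe
    return {x: 1.0 - a.get(x, 0.0) for x in universe}

def wag(a, b, beta):
    # weighted aggregation: W(x) = beta * A(x) + (1 - beta) * B(x)
    return {x: beta * a.get(x, 0.0) + (1.0 - beta) * b.get(x, 0.0)
            for x in a.keys() | b.keys()}
```

For instance, aggregating a *PopularPlaces* and a *CheapRestaurants* fuzzy set with `wag(popular, cheap, 0.5)` yields a fuzzy set that weights popularity and cheapness equally.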

**Example 1.** *Through the membership degree, it is possible to denote partial membership of an item x* ∈ *U to A; this way, vague linguistic concepts can be modeled. For example, given a public place p, its membership to the PopularPlaces fuzzy set could be partial, denoting a place that is not so popular; thus, the membership degree measures its degree of popularity, for example on the basis of the number of likes obtained on social media.*

*Suppose that on the same universe of public places, we conceive the CheapRestaurants fuzzy set, whose membership degree denotes the perception that a public place is cheap (this perception could be induced by analyzing menus published on social media).*

*We now illustrate how to aggregate the PopularPlaces and the CheapRestaurants fuzzy sets to obtain interesting places.*


*The three above-mentioned searches are examples of "soft queries", where selection conditions are expressed in a vague way; the resulting membership degree denotes the "relevance" of an item to the soft query.*

*Furthermore, notice that when the names given to fuzzy sets linguistically characterize items in a proper way, these names can be used in soft conditions to linguistically express them.*

**Definition 4.** *Fuzzy Relation. Consider two universes U*1 *and U*2*. A* fuzzy relation *R on U*1 *and U*2 *is defined as R* : *U*1 × *U*2 → [0, 1]*. R*(*x*1, *x*2) ∈ [0, 1]*, with x*1 ∈ *U*1 *and x*2 ∈ *U*2*, is the membership degree of the relation between x*1 *and x*2*; the meaning of the relation is linguistically expressed by the name of the relation.*

Through the concept of fuzzy relation, it is possible to model the strength of a relation between two items *x*1 ∈ *U*1 and *x*2 ∈ *U*2. Nevertheless, notice that a fuzzy relation is a particular case of fuzzy set in the universe *U* = *U*1 × *U*2. Thus, we can reformulate the relation as *R* : *U* → [0, 1], where *x* = ⟨*x*1, *x*2⟩ ∈ *U*; consequently, we can write, in an equivalent way, either *R*(*x*1, *x*2) or *R*(⟨*x*1, *x*2⟩).

In this paper, we work on the universe of *JSON* documents. So, given a document *d* ∈ *U*, the focus will be on the evaluation of its membership degrees to one or more fuzzy sets.

#### **4. The** *J-CO* **Framework**

The *J-CO* Framework is a suite of software tools able to process *JSON* data sets, in a way that is independent of the data source. In fact, it is able to obtain data sets both from *JSON* document stores (such as *MongoDB*) and from web sources. It is built around the *J-CO-QL*<sup>+</sup> language and it contains various tools, as illustrated in Figure 1.


**Figure 1.** The *J-CO* Framework.

#### *4.1. The Query Language*

*J-CO-QL*<sup>+</sup> is the current evolution of the original *J-CO-QL* (see [3–5]): like its predecessor, it is designed to provide high-level and declarative statements, which do not require programming skills to be used; by means of them, it is possible to specify complex procedures (scripts) that are able to retrieve, integrate, transform and save *JSON* data sets. With respect to its predecessor, *J-CO-QL*<sup>+</sup> maintains the same approach, but revises the syntax and semantics of statements, to improve their usability and effectiveness. Hereafter, we present its data model and its execution model.

#### 4.1.1. Data Model

Here, we present the data model on which *J-CO-QL*<sup>+</sup> relies.

- The root-level ~fuzzysets field is used to represent the membership degrees of a *d* document to fuzzy sets. It works as a "key-value" map: given a field within ~fuzzysets, the field name is the name of the fuzzy set to which the membership degree has been evaluated; the value is a real number in the range [0, 1], which denotes the membership degree. This way, given a *d* document, it is possible to represent its membership to many fuzzy sets.
- The root-level ~geometry field represents geometries (also called "geo-tagging") of spatial entities represented as *JSON* documents. In this paper, we do not make use of geometries (the interested reader can refer to [5]).
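As an illustrative sketch, a document carrying membership degrees to two fuzzy sets could look as follows, here written as a Python dictionary (all field names other than ~fuzzysets are hypothetical):

```python
# Hypothetical shape of a document after evaluating its membership
# to two fuzzy sets; ~fuzzysets acts as a key-value map from
# fuzzy-set names to membership degrees in [0, 1].
doc = {
    "name": "The Red Lion",
    "city": "Manchester",
    "~fuzzysets": {
        "PopularPlaces": 0.8,      # degree of membership to PopularPlaces
        "CheapRestaurants": 0.35,  # degree of membership to CheapRestaurants
    },
}
```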

#### 4.1.2. Execution Model

The execution model is the same presented in previous publications [5,7]. Hereafter, we briefly summarize it.

- *tc* is called the "temporary collection", since it is a collection of *JSON* documents that passes through the pipe of instructions and contains the temporary results of the query process.
- *IR* is the "Intermediate-Results database", i.e., a database that is exclusive to the query process, used to store intermediate results to be used later.
- *DBS* is the set of "database descriptors", used to handle connections with external *JSON* document stores.
- *FO* is the set of "Fuzzy Operators" defined within the query; they allow for evaluating membership degrees to fuzzy sets (see Section 6.2).
- *JSF* is the set of user-defined "JavaScript Functions"; they are defined throughout the query to complete the computational capabilities of the query language (see [38]).

#### **5. Problem and Methodology**

In this section, we discuss the premises from which the paper has originated and introduce the problem we addressed as a case study (Section 5.1). Then, we present the methodological framework, which this work relies on (Section 5.2).

#### *5.1. Premises and Problem*

In [6], a fuzzy method for the online aggregation of POIs (Points of Interest) is presented. The problem addressed in that paper can be summarized as follows: if a web application has to integrate descriptors of public places (or POIs) caught on the fly from external services, the decision whether two descriptors actually describe the same public place or not must be taken in real time: techniques that require off-line work cannot be used.

In [6], it was proved that the technique can reach very high levels of accuracy, absolutely comparable with off-line techniques; consequently, here, we argue that the same technique could be effectively adopted to integrate two data sets describing public places in an off-line way. In particular, the novel support for soft querying [7] provided by *J-CO-QL*<sup>+</sup> (the query language of the *J-CO* Framework) has modified the scenario: in fact, the *J-CO* Framework is a stand-alone tool designed for manipulating and querying collections of *JSON* data sets. Consequently, it is natural to explore the possibility of exploiting it to apply the fuzzy technique presented in [6] for integrating two collections of public-place descriptors coming from two different sources, by adopting a database approach (querying data by means of a query language) instead of hard-coding the methodology with a programming language.

Hereafter, we present the problem. Then, Section 5.2 presents an improved formulation of the fuzzy technique presented in [6] that will be applied in *J-CO-QL*<sup>+</sup> scripts (discussed in Section 6).

**Problem 1.** *Consider two collections of descriptors D*1 *and D*2*. A* descriptor *d (such that either d* ∈ *D*1 *or d* ∈ *D*2*) describes a public place; we assume that d is a tuple whose minimal shape is d* = ⟨*name*, *address*, *lat*, *lon*⟩*, where d*.*name is the name of the public place, d*.*address is the raw address (i.e., as it is provided by the data source, without any pre-processing or cleaning) of the public place, while d*.*lat and d*.*lon are, respectively, the latitude and longitude of the public place. Depending on the source, these fields could be missing (either null value or zero-length string).*

*Supposing that descriptors in D*1 *and D*2 *are related to the same municipality, we want to build the collection SP* = {*p*1, *p*2, . . .} *of pairs of descriptors pi* : ⟨*d*1,*h*, *d*2,*k*⟩ *(with d*1,*h* ∈ *D*1 *and d*2,*k* ∈ *D*2*) such that it is very likely that d*1,*h and d*2,*k actually describe the same public place.*

#### *5.2. Fuzzy Relation for Matching Public Places*

The key contribution of [6] is a fuzzy relation called *MatchingPlaces*. Given two descriptors *d*1 and *d*2, it is written as *MatchingPlaces*(*d*1, *d*2). Its membership degree denotes the possibility that *d*1 and *d*2 describe the same place. If we consider the universe *P* = *D*1 × *D*2 of pairs *pi* : ⟨*d*1,*h*, *d*2,*k*⟩, through the *MatchingPlaces* fuzzy relation we want to build the fuzzy set *PRP* in *P* of Possibly-Relevant Pairs, for which the membership degree of *pi* is *PRP*(*pi*) > 0.

To actually decide whether descriptors in a *p<sup>i</sup>* pair actually describe the same public place, a minimum threshold *α* ∈ [0, 1] is used to focus on Relevant Pairs *RP* ⊆ *PRP*, where *RP*(*pi*) ≥ *α*.

However, given a descriptor *d*1,*h* ∈ *D*1, it could appear several times in *RP*, because there might be many relevant pairs in which it is involved. For each *d*1,*h* ∈ *D*1, the subset *RP*1,*h* ⊆ *RP* is the set of pairs *pi* ∈ *RP* such that *pi*.*d*1 = *d*1,*h*; if *RP*1,*h* is not empty, the pair *p*1,*h* ∈ *RP*1,*h* such that *RP*1,*h*(*p*1,*h*) ≥ *RP*1,*h*(*pi*), for all *pi* ∈ *RP*1,*h*, appears in *SP* (because the two paired descriptors are actually supposed to describe the same place).
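The selection of the best pair for each *d*1,*h* descriptor can be sketched as follows, representing each relevant pair as a tuple (the tuple layout is an assumption made only for illustration):

```python
# Sketch of building SP from the relevant pairs RP: for each d1 descriptor,
# keep only the pair with the highest membership degree.
# Each pair is a hypothetical tuple (d1_id, d2_id, degree).

def select_best_pairs(rp):
    best = {}
    for d1_id, d2_id, degree in rp:
        # keep the pair with the maximum degree for this d1 descriptor
        if d1_id not in best or degree > best[d1_id][2]:
            best[d1_id] = (d1_id, d2_id, degree)
    return list(best.values())
```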

In the remainder of this section, we introduce the complete formal framework.

#### 5.2.1. Basic Functions and Relations

The *MatchingPlaces* fuzzy relation is defined by means of some basic functions and fuzzy relations.

Given two pairs of coordinates, i.e., *lat*1, *lon*1 and *lat*2, *lon*2 (denoting the latitude and longitude of two points on the Earth globe), the *Distance* function computes the "Geodesic Distance" [39] between the two points in km; it is denoted as *Distance*(*lat*1, *lon*1, *lat*2, *lon*2). On this basis, it is possible to define the *Close* membership function that, given a distance *dist* (in km), determines the extent to which the distance denotes that two points are close; it is denoted as *Close*(*dist*). An example of a typical membership function for this concept (the same exploited in [6]) is depicted in Figure 2: notice that, on the basis of the geodesic distance between the two points, the membership degree is 1 when the distance is between 0 and 50 m; then, it linearly decreases from 50 up to 1000 m. In Section 6.2, we will define a more sophisticated membership function.

The *Close* membership function can be used as the basis for defining the *ClosePlaces* fuzzy relation: given two place descriptors *d*<sup>1</sup> and *d*2, *ClosePlaces*(*d*1, *d*2) = *Close*(*Distance* (*d*1.*lat*, *d*1.*lon*, *d*2.*lat*, *d*2.*lon*)).

**Figure 2.** Sample *Close* membership function taken from [6].
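As a sketch, the piecewise-linear *Close* membership function of Figure 2 can be written as follows (distances in km, with breakpoints at 50 m and 1000 m as described above):

```python
# Sketch of the Close membership function of Figure 2:
# degree 1 up to 50 m, linearly decreasing to 0 at 1000 m.

def close(dist_km):
    if dist_km <= 0.05:
        return 1.0
    if dist_km >= 1.0:
        return 0.0
    # linear descent between 0.05 km and 1.0 km
    return (1.0 - dist_km) / (1.0 - 0.05)
```

The *ClosePlaces* relation is then obtained by composing this function with the geodesic distance between the two descriptors' coordinates.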

Given two strings *s*<sup>1</sup> and *s*2, the *Similar* fuzzy relation is denoted as *Similar*(*s*1,*s*2). As a membership function, any string-similarity metric whose value is in the range [0, 1] could be used; in [6], the Jaro-Winkler similarity metric [40–43] was used; here, we still use it, but in a more sophisticated way (see Section 6.2).

Based on the *Similar* relation, which is defined on the universe of strings, it is possible to define two derived relations that are defined on the universe of descriptor pairs *P* = *D*<sup>1</sup> × *D*2.

The *SimilarAddress* fuzzy relation denotes the extent to which the *address* fields of the two descriptors are similar; it is defined as *SimilarAddress*(*d*1, *d*2) = *Similar*(*d*1.*address*, *d*2.*address*).

The *SimilarName* fuzzy relation denotes the extent to which the *name* fields of the two descriptors are similar; it is defined as *SimilarName*(*d*1, *d*2) = *Similar*(*d*1.*name*, *d*2.*name*).
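A minimal sketch of the *Similar*, *SimilarName* and *SimilarAddress* relations follows; as a stand-in for the Jaro-Winkler metric used in the paper, it uses Python's difflib ratio, which also yields values in [0, 1] (the dictionary-based descriptor shape is an illustrative assumption):

```python
from difflib import SequenceMatcher

def similar(s1, s2):
    # stand-in string-similarity metric in [0, 1];
    # the paper actually uses the Jaro-Winkler metric
    return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()

def similar_name(d1, d2):
    # SimilarName(d1, d2) = Similar(d1.name, d2.name)
    return similar(d1["name"], d2["name"])

def similar_address(d1, d2):
    # SimilarAddress(d1, d2) = Similar(d1.address, d2.address)
    return similar(d1["address"], d2["address"])
```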

#### 5.2.2. The *SameLocation* Relation

The *MatchingPlaces* relation is obtained by previously evaluating the *SameLocation* fuzzy relation. It is denoted as *SameLocation*(*d*1, *d*2). Its membership function changes depending on whether the fields concerning geographical aspects (i.e., address and coordinates) in *d*1 and *d*2 are missing or not. Hereafter, we provide three different definitions of the *SameLocation* relation, one for each sub-case to deal with.

• **Case A: missing address(es).** If *d*1.*address* is missing, or *d*2.*address* is missing, or both, but the two pairs of coordinates are available, only the latter can be used to evaluate the *SameLocation* relation.

• **Case B: missing coordinate(s).** If at least one among *d*1.*lat*, *d*1.*lon*, *d*2.*lat* and *d*2.*lon* is missing, but both addresses are available, only the similarity between addresses can be used.

• **Case C: full information.** If addresses and coordinates are available in both descriptors, both the similarity between addresses and the closeness of coordinates can be combined.
Once the three cases of interest have been identified, it is possible to define the *SameLocation* relation.

**Definition 5.** *Case A. Given two descriptors d*1 *and d*2*, for which either d*1.*address or d*2.*address or both are missing, while the lat and lon fields are not null in both d*1 *and d*2*, the SameLocation relation is defined as follows:*

*SameLocation*(*d*1, *d*2) = *ClosePlaces*(*d*1, *d*2)

*i.e., the membership degree of the SameLocation relation coincides with the membership degree of the ClosePlaces relation.*

**Definition 6.** *Case B. Given two descriptors d*1 *and d*2 *for which at least one among d*1.*lat, d*1.*lon, d*2.*lat and d*2.*lon is null, while both d*1.*address and d*2.*address are available, the SameLocation relation is defined as follows:*

*SameLocation*(*d*1, *d*2) = *SimilarAddress*(*d*1, *d*2)

*i.e., the membership degree of the SameLocation relation coincides with the membership degree of the SimilarAddress relation.*

**Definition 7.** *Case C. Given two descriptors d*1 *and d*2*, for which all the fields d*1.*address, d*1.*lat, d*1.*lon, d*2.*address, d*2.*lat and d*2.*lon are available, the SameLocation relation is defined as follows:*

*SameLocation*(*d*1, *d*2) = *wagβgeo* (*SimilarAddress*(*d*1, *d*2), *ClosePlaces*(*d*1, *d*2))

*i.e., the membership degree of the SameLocation relation is the weighted aggregation of the Similar Address relation and of the ClosePlaces relation; βgeo* ∈ [0, 1] *is the weight of the first term (the similarity between addresses).*

In Section 6.3, we use *βgeo* = 0.55: this way, the similarity between addresses slightly prevails over closeness; indeed, if two addresses are very similar, their similarity contributes more than coordinates; this way, the effect of erroneous coordinates that give rise to high distances is mitigated.

5.2.3. Global *MatchingPlaces* Relation

At this point, we can define the global *MatchingPlaces* relation.

**Definition 8.** *Given two descriptors d*1 *and d*2*, for which both fields d*1.*name and d*2.*name are available, and for which the SameLocation relation is defined and SameLocation*(*d*1, *d*2) ≥ *αgeo (with αgeo* ∈ [0, 1]*), the MatchingPlaces relation is defined as follows:*

*MatchingPlaces*(*d*1, *d*2) = *wagβname* (*SimilarName*(*d*1, *d*2), *SameLocation*(*d*1, *d*2))

*i.e., the membership degree of the MatchingPlaces relation is obtained by aggregating the membership degrees of the SimilarName relation and of the SameLocation relation, by means of the weighted aggregator with weight βname for the similarity of names.*

In Section 6.4, we set *βname* = 0.6: this way, the similarity between names prevails over the membership degree of the *SameLocation* relation. The rationale is the following: two similar names contribute 60% of the final degree, while the remaining 40% is given by the geographical contribution. However, in order to discard pairs of descriptors whose geographical contribution is not significant, the *αgeo* threshold is introduced: if the membership degree of the *SameLocation* fuzzy relation is less than *αgeo*, *d*1 and *d*2 are no longer considered eligible to describe the same place. Two places can have very similar names (even identical, as with two restaurants of the same chain), but if it is doubtful that they are reasonably close, they could be a wrong pair. In Section 6.3, we set this threshold as *αgeo* = 0.4.

The membership degree of the *MatchingPlaces* fuzzy relation is used to determine whether a pair *pi* actually belongs to the *RP* set of relevant pairs, i.e., *RP*(*pi*) ≥ *α* means *MatchingPlaces*(*pi*.*d*1, *pi*.*d*2) ≥ *α*. In Section 6.4, we set this threshold as *α* = 0.8, because in our experiments (see Section 7.1), we found that this threshold gives the best effectiveness.
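The interplay between the *βname* weight and the two thresholds *αgeo* and *α* can be sketched in Python; the function name and the example degrees below are ours, chosen for illustration only:

```python
def matching_places(similar_name, same_location, beta_name=0.6, alpha_geo=0.4):
    """Membership degree of the MatchingPlaces relation (Definition 8).

    Pairs whose SameLocation degree is below alpha_geo are not eligible
    at all, regardless of how similar the names are; None models this.
    """
    if same_location < alpha_geo:
        return None
    return beta_name * similar_name + (1 - beta_name) * same_location

# A pair enters the RP set of relevant pairs when its degree reaches alpha = 0.8:
alpha = 0.8
degree = matching_places(0.95, 0.70)      # 0.6*0.95 + 0.4*0.70 = 0.85
relevant = degree is not None and degree >= alpha
```

Note that `matching_places(1.0, 0.3)` returns `None`: even two identical names cannot rescue a pair whose geographical evidence is below *αgeo*.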

#### **6. Presenting the Script**

In this section, we provide the technical contribution of the paper. Specifically, we demonstrate how the current version of *J-CO-QL*<sup>+</sup> is able to perform the soft integration of two collections containing *JSON* documents that describe public places, obtained from two different data sources.

#### *6.1. Data Set*

A *MongoDB* database called ijgiDb contains two collections of *JSON* documents: the first one is called FacebookDescriptors and its documents are descriptors of pages that present public places mostly located in the area of Manchester (UK); the second collection is called GoogleDescriptors and its documents are descriptors of places mostly located in the area of Manchester (UK) as well, obtained from *Google Places*. The FacebookDescriptors collection contains 5738 documents, while the GoogleDescriptors collection contains 5214 documents. Figure 3a shows a sample document in the FacebookDescriptors collection, while Figure 3b reports a sample document in the GoogleDescriptors collection. The reader can notice that *Facebook* descriptors clearly separate the address (in the fbStreet field), the city name (in the fbCity field) and the ZIP code (in the fbZip field). In contrast, within a *Google Places* descriptor, the content of the gAddress field is less clean, because it contains the city name too. This also demonstrates that we are working on names and addresses as they are provided by *Facebook* and *Google Places*, without any pre-processing or cleaning (in [6], addresses were cleaned from numbers and urban designations, such as "street"). Consequently, here, we are addressing a less favorable situation.

**Figure 3.** Examples of documents representing place descriptors. (**a**) Example of document in the FacebookDescriptors collection. (**b**) Example of document in the GoogleDescriptors collection.

#### *6.2. Defining Fuzzy Operators*

We start by presenting the *J-CO-QL*<sup>+</sup> script. The first part of the script is reported in Listing 1.

The key concept provided by *J-CO-QL*<sup>+</sup> to evaluate membership degrees of *JSON* documents is the concept of "fuzzy operator". Such an operator is called within soft conditions: given some actual parameters (expressions based on document fields), the operator returns a membership degree. This degree will be used to evaluate the overall membership degree of a document to a specific fuzzy set.

**Listing 1.** *J-CO-QL*<sup>+</sup> script: fuzzy operators.

```
1. CREATE FUZZY OPERATOR Close
    PARAMETERS
      distance TYPE Float
    PRECONDITION 
        distance >= 0
    EVALUATE 
        distance
    POLYLINE
      [ (0.00, 1.00), (0.05, 1.00), (0.20, 0.50), (0.60, 0.10), (1.00, 0.00) ];
2. CREATE FUZZY OPERATOR Similar
    PARAMETERS
      st1 TYPE String,
      st2 TYPE String
    EVALUATE 
        JARO_WINKLER_SIMILARITY(st1, st2)
    POLYLINE 
      [ (0.00, 0.00), (0.60, 0.40), (0.70, 0.80), (0.80, 1.00), (1.00, 1.00) ];
3. CREATE FUZZY OPERATOR WeightedAggregationBeta
    PARAMETERS
      f1 TYPE Float,
      f2 TYPE Float,
      beta TYPE Float
    PRECONDITION
      f1 IN_RANGE [0, 1] AND
      f2 IN_RANGE [0, 1] AND
      beta IN_RANGE [0, 1]
    EVALUATE 
      f1*beta + f2*(1-beta)
    POLYLINE 
      [ (0.00, 0.00), (1.00, 1.00) ];
```
#### 6.2.1. The Close Fuzzy Operator

The instruction on line 1 of the *J-CO-QL*<sup>+</sup> script in Listing 1 defines the Close fuzzy operator: it evaluates the degree of closeness of two places, on the basis of the distance between them. Hereafter, we describe the instruction in detail.


Figure 4a reports the polyline defined for the Close fuzzy operator. Notice that it is not the same as the one defined in [6] (reported in Figure 2): in fact, we opted for a function that immediately penalizes distances between 50 m and 600 m, because two places in the same neighborhood are no longer perceived as very close when their distance becomes greater than 100 m.
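The piecewise-linear membership function defined by a POLYLINE clause can be sketched as plain linear interpolation. The helper below is our own (J-CO-QL<sup>+</sup> evaluates POLYLINE natively), and it assumes the Close operator receives distances expressed in kilometers, so that 0.05 reads as 50 m and 0.60 as 600 m:

```python
def polyline(points):
    """Membership function interpolating linearly between (x, y) points.

    Inputs below the first x map to the first y; above the last x, to the last y.
    """
    def mu(x):
        if x <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return points[-1][1]
    return mu

# The polyline of the Close fuzzy operator in Listing 1 (distance in km):
close = polyline([(0.00, 1.00), (0.05, 1.00), (0.20, 0.50), (0.60, 0.10), (1.00, 0.00)])
# close(0.05) -> 1.0 (50 m: fully close); close(0.20) -> 0.5; close(2.0) -> 0.0
```

The same helper serves the other two operators: only the list of breakpoints changes.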

**Figure 4.** Membership functions for the fuzzy operators in Listing 1. (**a**) Close; (**b**) Similar; (**c**) WeightedAggregationBeta.

#### 6.2.2. The Similar Fuzzy Operator

The instruction on line 2 of the *J-CO-QL*<sup>+</sup> script in Listing 1 creates the Similar fuzzy operator. Its goal is to evaluate a membership degree on the basis of the similarity degree of two strings. The operator is described in detail hereafter.
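The JARO\_WINKLER\_SIMILARITY built-in function used by the Similar operator can be sketched as follows; this is a standard, unoptimized implementation of the Jaro–Winkler measure (our own sketch, not J-CO-QL<sup>+</sup> code). Its score is then mapped through the POLYLINE in Listing 1, which pushes scores above 0.8 to full membership:

```python
def jaro(s1, s2):
    """Jaro similarity: matched characters within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, j = 0, 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2  # half-transpositions come in pairs
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

score = jaro_winkler("MARTHA", "MARHTA")   # the classic example: ~0.961
```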


#### 6.2.3. The WeightedAggregationBeta Fuzzy Operator

The instruction on line 3 of the *J-CO-QL*<sup>+</sup> script in Listing 1 defines the third fuzzy operator. This is called WeightedAggregationBeta and its goal is to perform the "weighted aggregation" *wag<sup>β</sup>* (see Definition 3). In fact, *J-CO-QL*<sup>+</sup> does not provide such an operator in its language; through the WeightedAggregationBeta fuzzy operator, we show how to introduce novel fuzzy concepts. The fuzzy operator is described in detail hereafter.
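The weighted aggregation performed by the operator is a plain convex combination; since its POLYLINE is the identity [(0, 0), (1, 1)], the computed value passes through unchanged, so the Python sketch below (our own, for illustration) omits it:

```python
def weighted_aggregation_beta(f1, f2, beta):
    """wag_beta (Definition 3): convex combination of two membership degrees.

    The assertion mirrors the PRECONDITION clause of Listing 1.
    """
    assert all(0.0 <= v <= 1.0 for v in (f1, f2, beta))
    return f1 * beta + f2 * (1 - beta)

# With beta = 0.55 (the beta_geo weight), addresses slightly prevail over closeness:
same_location = weighted_aggregation_beta(0.9, 0.5, 0.55)   # 0.495 + 0.225 = 0.72
```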


#### *6.3. Retrieving and Pairing Descriptors*

Once the three fuzzy operators are defined, it is time to start working on the data set. This is done by the second part of the *J-CO-QL*<sup>+</sup> script, which is reported in Listing 2.

**Listing 2.** *J-CO-QL*<sup>+</sup> script: retrieving and joining collections.

```
4. USE DB ijgiDb
     ON SERVER MongoDB 'http://127.0.0.1:27017';
5. JOIN OF COLLECTIONS
     FacebookDescriptors@ijgiDb AS f, GoogleDescriptors@ijgiDb AS g 
   CASE
     // Case A: missing address(es)
     WHERE( FIELD .f.fbStreet IS NULL OR 
            FIELD .g.gAddress IS NULL OR
                  .f.fbStreet = "" OR
                  .g.gAddress = "" ) AND
          FIELD .f.fbLatitude IS NOT NULL AND 
          FIELD .f.fbLongitude IS NOT NULL AND 
          FIELD .g.gLatitude IS NOT NULL AND 
          FIELD .g.gLongitude IS NOT NULL
       GENERATE
        CHECK FOR
          FUZZY SET ClosePlaces USING 
                    Close ( GEODESIC_DISTANCE( .f.fbLatitude, .f.fbLongitude, 
                                              .g.gLatitude, .g.gLongitude ) ),
          FUZZY SET SameLocation USING ClosePlaces
        ALPHACUT 0.4 ON SameLocation
     // Case B: missing coordinate(s)
     WHERE FIELD .f.fbStreet IS NOT NULL AND 
            FIELD .g.gAddress IS NOT NULL AND
                  .f.fbStreet != "" AND
                  .g.gAddress != "" AND
            ( FIELD .f.fbLatitude IS NULL OR 
             FIELD .f.fbLongitude IS NULL OR
             FIELD .g.gLatitude IS NULL OR 
             FIELD .g.gLongitude IS NULL )
        GENERATE
          CHECK FOR 
            FUZZY SET SimilarAddress USING Similar (.f.fbStreet, .g.gAddress), 
            FUZZY SET SameLocation USING SimilarAddress
          ALPHACUT 0.4 ON SameLocation
     // Case C: addresses and coordinates all available
     WHERE FIELD .f.fbStreet IS NOT NULL AND
            FIELD .g.gAddress IS NOT NULL AND
                  .f.fbStreet != "" AND
                  .g.gAddress != "" AND
            FIELD .f.fbLatitude IS NOT NULL AND 
            FIELD .f.fbLongitude IS NOT NULL AND 
            FIELD .g.gLatitude IS NOT NULL AND 
            FIELD .g.gLongitude IS NOT NULL
       GENERATE
        CHECK FOR 
          FUZZY SET SimilarAddress USING Similar ( .f.fbStreet, .g.gAddress ), 
          FUZZY SET ClosePlaces USING 
                    Close ( GEODESIC_DISTANCE( .f.fbLatitude, .f.fbLongitude,
                                              .g.gLatitude, .g.gLongitude ) ),
          FUZZY SET SameLocation USING
                      WeightedAggregationBeta( MEMBERSHIP_OF(SimilarAddress), 
                                              MEMBERSHIP_OF(ClosePlaces), 0.55 )
        ALPHACUT 0.4 ON SameLocation;
```
The instruction on line 4 connects the query process to the database. After this instruction, it will be possible to access the ijgiDb database to retrieve and store collections.

The JOIN OF COLLECTIONS instruction on line 5 retrieves the two source collections (called FacebookDescriptors and GoogleDescriptors) and creates all possible pairs of documents contained in the two collections. Then, the subsequent CASE clause evaluates a pool of conditions on these pairs, so as to evaluate fuzzy sets on the actually interesting pairs and discard the others. The instruction is explained in detail hereafter.

• The instruction retrieves the FacebookDescriptors collection from the ijgiDb database and aliases it as f; similarly, it retrieves the GoogleDescriptors collection from the same database and aliases it as g.

For each *f* document from the f collection and for each *g* document from the g collection, a new *d* document is created. This document contains two fields: the first one is called f and its value is the source *f* document; the second one is called g and its value is the source *g* document. The *d* document is further processed by the subsequent CASE clause.

Figure 5 reports an example of *d* document, which is obtained by joining the two sample documents reported in Figure 3; notice the names of the root-level fields.

```
{
 "f" : {
   "id" : 266,
   "idLink" : "1761andLilysBar",
   "fbName" : "1761 & Lily's Bar",
   "fbCity" : "Manchester",
   "fbCountry" : "United Kingdom",
   "fbLatitude" : 53.4802297,
   "fbLongitude" : -2.2435781,
   "fbStreet" : "2 Booth Street",
   "fbZip" : "M2 4AT"
 },
 "g" : {
   "gId" : "ChIJ--wWob6xe0gRBF_8…",
   "gName" : "Hope Studios",
   "gAddress" : "52 Newton Street, Ma…",
   "gCity" : "Manchester",
   "gLatitude" : 53.4821537,
   "gLongitude" : -2.2322586
 }
}
```
**Figure 5.** Example of document generated by the JOIN OF COLLECTIONS instruction on line 5 of the *J-CO-QL*<sup>+</sup> script, before the CASE clause.
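The pairing step performed by JOIN OF COLLECTIONS can be sketched as a Cartesian product; the miniature collections below are hypothetical, with field names mirroring Figure 3:

```python
# Hypothetical miniature collections; field names mirror Figure 3.
facebook = [{"fbName": "1761 & Lily's Bar"}, {"fbName": "Koffee Pot"}]
google = [{"gName": "Hope Studios"}, {"gName": "The Koffee Pot"}]

# JOIN OF COLLECTIONS: one d document per (f, g) pair,
# with root-level fields named after the aliases f and g.
pairs = [{"f": f, "g": g} for f in facebook for g in google]

# On the full data set, this Cartesian product yields
# 5738 * 5214 = 29,917,932 candidate pairs.
```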

- **–** The first WHERE branch deals with case *A* of the *SameLocation* fuzzy relation, defined in Definition 5. The condition is true if the value for the fbStreet field is missing, or the value for the gAddress field is missing, or both are missing, while all coordinates are available. If a *d* document meets the condition, the GENERATE block further processes *d* through the CHECK FOR clause, whose goal is to evaluate the membership degrees of *d* to fuzzy sets.

Specifically, two FUZZY SET branches are present: the former evaluates the ClosePlaces fuzzy set, the latter evaluates the SameLocation fuzzy set.

The membership degree to the ClosePlaces fuzzy set is obtained by the associated USING clause: this is a "soft condition", in which fuzzy operators (such as those defined in Section 6.2) and fuzzy-set names can be composed by the usual (fuzzy) logical operators AND, OR and NOT; the resulting membership degree is the membership degree to the evaluated fuzzy set. If this is the first membership degree evaluated for *d*, then *d* does not have the special ~fuzzysets field: in this case, the field is added and, within it, a single field is present, with the same name as the evaluated fuzzy set, whose value is the computed membership degree. In contrast, if the ~fuzzysets field is already present, it is extended with one extra internal field, describing the membership degree to the newly evaluated fuzzy set.

Specifically, the first branch evaluates the membership degree to the ClosePlaces fuzzy set, by means of the Close fuzzy operator (see Listing 1), which is called passing the geodesic distance computed by the GEODESIC\_DISTANCE built-in function.
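The GEODESIC\_DISTANCE built-in function can be approximated with the haversine formula on a spherical Earth (the actual built-in may adopt a more precise ellipsoidal model); this sketch is ours, for illustration:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a spherical Earth (R = 6371 km)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlam = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Distance between the two sample descriptors of Figure 5: well under 1 km,
# yet far enough for the Close polyline to penalize it.
d = haversine_km(53.4802297, -2.2435781, 53.4821537, -2.2322586)
```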

The second FUZZY SET branch evaluates the membership degree to the SameLocation fuzzy set, by assuming that it coincides with the ClosePlaces fuzzy set (see Definition 5). Finally, the ALPHACUT clause discards the *d* document from the output temporary collection if its membership degree to the SameLocation fuzzy set is less than 0.4; remember that this is the *αgeo* threshold mentioned within Definition 8. Figure 6a reports a sample document generated by the first WHERE branch; notice the presence of the ~fuzzysets field and its inner fields.


The third FUZZY SET branch evaluates the membership degree to the fuzzy set named SameLocation. According to Definition 7, it is obtained by calling the WeightedAggregationBeta fuzzy operator, whose goal is to perform the weighted aggregation: it receives the two values to aggregate (in the range [0, 1]) and the *β* weight.

The USING soft condition calls the WeightedAggregationBeta fuzzy operator, passing the membership values to the SimilarAddress fuzzy set and to the ClosePlaces fuzzy set, which are obtained by means of the MEMBERSHIP\_OF built-in function (which extracts the membership degree from within the ~fuzzysets field). The third parameter is the constant value 0.55: this is the *βgeo* weight presented and discussed in Definition 7. The ALPHACUT clause discards the evaluated document if its membership degree to the SameLocation fuzzy set is less than 0.4 (the *αgeo* threshold mentioned in Definition 8).
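The progressive construction of the ~fuzzysets field can be sketched as follows; the document and the membership degrees are made-up values, and the two helper functions are ours (hypothetical counterparts of the CHECK FOR clause and of the MEMBERSHIP\_OF built-in):

```python
# Hypothetical d document; the degrees below are made-up values.
doc = {"f": {"fbStreet": "2 Booth Street"}, "g": {"gAddress": "52 Newton Street"}}

def check_for(doc, fuzzy_set, degree):
    """Record a membership degree inside the special ~fuzzysets field
    (the field is added on first use, then extended)."""
    doc.setdefault("~fuzzysets", {})[fuzzy_set] = degree

def membership_of(doc, fuzzy_set):
    """Counterpart of the MEMBERSHIP_OF built-in function."""
    return doc["~fuzzysets"][fuzzy_set]

check_for(doc, "SimilarAddress", 0.30)
check_for(doc, "ClosePlaces", 0.90)
same_location = 0.55 * membership_of(doc, "SimilarAddress") + \
                0.45 * membership_of(doc, "ClosePlaces")    # beta_geo = 0.55
check_for(doc, "SameLocation", same_location)

# ALPHACUT 0.4 ON SameLocation keeps this document, since 0.57 >= 0.4.
```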

Figure 6c reports a sample document generated by the third branch; notice that the ~fuzzysets field has three inner fields.


The temporary collection produced by the instruction on line 5 of Listing 2 contains heterogeneous documents, as far as the structure of the ~fuzzysets field is concerned, but all of them have the inner SameLocation field, denoting the membership degree to the SameLocation fuzzy set; it will be used in the next instruction, to evaluate the membership degree to the MatchingPlaces fuzzy set.

Furthermore, notice that SameLocation, ClosePlaces and SimilarAddress are called "fuzzy sets", while they were defined in Section 5 as "fuzzy relations": this is not a mistake, but the consequence of the fact that *JSON* documents represent pairs of descriptors; consequently, fuzzy relations on pairs are translated into fuzzy sets on *JSON* documents.

#### *6.4. Relevant Pairs*

All documents contained in the temporary collection produced by the instruction on line 5 (Listing 2) have a membership degree to the SameLocation fuzzy set no less than *αgeo* = 0.4, as required by Definition 8. The FILTER instruction on line 6 in Listing 3 actually evaluates the membership degree to the MatchingPlaces fuzzy set, which corresponds to the *MatchingPlaces* relation defined in Definition 8. The FILTER instruction on line 6 is described in detail hereafter.


The three sample documents reported in Figure 6 become as reported in Figure 7; notice the presence of the MatchingPlaces inner field within the ~fuzzysets field.

**Listing 3.** *J-CO-QL*<sup>+</sup> script: matching places.

```
6. FILTER 
   CASE
     WHERE WITH .f.fbName, .g.gName AND
            KNOWN FUZZY SETS SameLocation 
       GENERATE
         CHECK FOR 
          FUZZY SET SimilarName USING Similar( .f.fbName, .g.gName ),
          FUZZY SET MatchingPlaces USING 
            WeightedAggregationBeta ( MEMBERSHIP_OF (SimilarName), 
                                      MEMBERSHIP_OF (SameLocation), 0.60)
         ALPHACUT 0.8 ON MatchingPlaces
         BUILD{
          .f : .f,
          .g : .g,
          .rank: MEMBERSHIP_OF(MatchingPlaces) }
         DEFUZZIFY;
7. SAVE AS RelevantPairs@ijgiDb;
```


**Figure 7.** Examples of documents transformed by the FILTER instruction on line 6 before the BUILD section. (**a**) Example for Case A. (**b**) Example for Case B. (**c**) Example for Case C.



**Figure 8.** Examples of documents generated by the FILTER instruction on line 6. (**a**) Example for Case A. (**b**) Example for Case B. (**c**) Example for Case C.

The instruction on line 7 in Listing 3 saves the temporary collection into the ijgiDb database, with name RelevantPairs. Its documents contain the most promising pairs of descriptors (remember the *RP* set mentioned in Section 5.2), but it could happen that, e.g., the same *Google Places* descriptor is associated with more than one *Facebook* descriptor. Clearly, in this case, the pair with the highest rank must be chosen (i.e., building the final *SP* set mentioned in Section 5.2). This is discussed in Section 6.5.

#### *6.5. Choosing the Best Pairs*

The last part of the *J-CO-QL*<sup>+</sup> script is reported in Listing 4. It actually chooses the best pair involving each single *Google Places* descriptor among those obtained by line 6 in Listing 3. Indeed, the original *J-CO-QL* language (from which *J-CO-QL*<sup>+</sup> derives) was designed to cope with this kind of task too (see [3,4]). Hereafter, we briefly describe this last part of the script.


As a result, the temporary collection generated by line 10 contains as many documents as pairs in the RankedPairs collection, but now they are tagged with their relative order (for each *Google Places* descriptor) on the basis of the rank field. Figure 9b reports an example of an unnested document.



**Figure 9.** Examples of documents during selection of BestPairs in Listing 4. (**a**) Example of document after GROUP instruction on line 9. (**b**) Example of document during EXPAND instruction on line 10 before the BUILD clause. (**c**) Example of document after EXPAND instruction on line 10.

**Listing 4.** *J-CO-QL*<sup>+</sup> script: selecting the best pairs.

```
8. GET COLLECTION RelevantPairs@ijgiDb;
9. GROUP
   PARTITION WITH .g.gId
     BY .g.gId 
     INTO .gGroup
       ORDER BY .rank TYPE NUMERIC DESC;
10. EXPAND
     UNPACK WITH .gGroup
       ARRAY .gGroup
       TO .gPair;
11. FILTER 
     CASE WHERE WITH .gPair AND
               .gPair.position = 1
       GENERATE
         BUILD {
          .f : .gPair.item.f,
          .g : .gPair.item.g,
          .rank : .gPair.item.rank };
12. SAVE AS SamePlaces@ijgiDb;
```
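The group-order-unpack-filter sequence of Listing 4 can be sketched in Python; the relevant pairs below are hypothetical, with the same *Google Places* descriptor (gId "g1") matched by two distinct *Facebook* descriptors:

```python
from itertools import groupby

# Hypothetical relevant pairs; field names mirror the script's documents.
relevant_pairs = [
    {"f": {"fbName": "The Koffee Pot"}, "g": {"gId": "g1"}, "rank": 0.97},
    {"f": {"fbName": "Koffee Pot"},     "g": {"gId": "g1"}, "rank": 0.88},
    {"f": {"fbName": "Hope Studios"},   "g": {"gId": "g2"}, "rank": 0.91},
]

# GROUP ... PARTITION WITH .g.gId ... ORDER BY .rank DESC,
# then keep the document at position 1 of each group:
relevant_pairs.sort(key=lambda p: (p["g"]["gId"], -p["rank"]))
best_pairs = [next(group) for _, group in
              groupby(relevant_pairs, key=lambda p: p["g"]["gId"])]
```

Each *Google Places* descriptor survives exactly once, paired with its highest-ranked *Facebook* descriptor, as in the SamePlaces collection saved by line 12.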
#### **7. Experimental Evaluation**

In this section, we report a brief evaluation of the results that can be obtained by the *J-CO-QL*<sup>+</sup> script. We exploited the same data set adopted in [6], related with the city of Manchester (UK). Remember, from Section 6.1, that the FacebookDescriptors collection contains 5738 descriptors, while the GoogleDescriptors collection contains 5214 descriptors. Both collections contain descriptors about a variety of different public places, such as restaurants, pubs, hairdressers, universities, parks and so on.

#### *7.1. Effectiveness*

In order to evaluate the effectiveness of the method, we performed a sensitivity analysis by varying the value of the *α* threshold from 0.5 to 0.99.

We used again the same test set adopted in [6]: it contained a total of 400 pairs, selected among the 5738 × 5214 total pairs and evaluated by a human as *Good* or *Bad*. We randomly selected 300 pairs from the starting 400 pairs and extracted the related 300 *Google Places* descriptors and 300 *Facebook* descriptors; among all possible pairs built from them, 103 pairs were labeled as *Good* (and obviously the remaining 197 as *Bad*).

Then, we ran the script on these two reduced collections of descriptors. Table 1 reports the results of our experiments. Specifically, the first column reports the values of the alpha-cut *α*; the second and third columns report the number of relevant pairs saved by line 7 of the *J-CO-QL*<sup>+</sup> script (Listing 3) into the RelevantPairs collection and the number of pairs generated by line 12 of the script (Listing 4) and saved into the SamePlaces collection, respectively. Then, columns 4 to 7 report the number *TP* of true positive pairs, the number *TN* of true negative pairs, the number *FP* of false positive pairs and the number *FN* of false negative pairs, respectively. Finally, the last three columns report "Precision" (defined as *TP*/(*TP* + *FP*)), "Recall" (defined as *TP*/(*TP* + *FN*)) and "Accuracy" (defined as (*TP* + *TN*)/(*TP* + *TN* + *FP* + *FN*)), respectively. These three values are depicted in Figure 10: the *x*-axis reports the values of the alpha-cut *α*; precision is depicted by the blue line, recall by the red line and accuracy by the black line.
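The three measures can be computed as below; note that the confusion-matrix counts in the example are hypothetical, chosen only to be consistent with the values reported for *α* = 0.8 over the 300-pair test set (103 *Good*, 197 *Bad*):

```python
def metrics(tp, tn, fp, fn):
    """Precision, recall and accuracy, as defined in the text."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Hypothetical counts consistent with the reported precision/recall/accuracy:
p, r, a = metrics(tp=101, tn=193, fp=4, fn=2)
# round(p, 3), round(r, 3), round(a, 3) -> 0.962, 0.981, 0.98
```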

Analyzing Table 1 and Figure 10, it is possible to see that the best combination of values for precision, recall and accuracy was obtained for *α* = 0.8: precision is 0.962, recall is 0.981 and accuracy is 0.980. Indeed, this value for *α* appears to be the best compromise between the need to keep as many pairs as possible and the fact that those pairs actually describe the same place, even though names, addresses and coordinates are different. The reader can further notice that higher values of *α* give rise to a precision of 1 with poor recall, while lower values of *α* give rise to a recall of 1 with poor precision. To conclude this analysis, notice that with *α* = 0.85, the accuracy is the same as the one obtained for *α* = 0.8; it could be considered a valid alternative choice, with better precision but lower recall.


**Table 1.** Sensitivity analysis.

**Figure 10.** Sensitivity analysis of precision, recall and accuracy.

Thus, we can state that the novel formulation of the *MatchingPlaces* relation and the complex membership functions adopted for the Similar fuzzy operator and for the Close fuzzy operator are effective, provided that *α* = 0.8.

We can consider the results reported in [6] as a baseline for further evaluating the effectiveness of the novel formulation of the technique.

Remember that the version presented in [6] (for online aggregation) performed pre-processing tasks on names and addresses, so as to clean them from urban designations and numbers. In contrast, the present version does not. The main reason is that this kind of pre-processing and cleaning is not easy to perform within *J-CO-QL*<sup>+</sup> scripts; however, the flexibility of the CREATE FUZZY OPERATOR statement, together with the possibility of defining complex shapes for membership functions, proves effective. Consequently, the proper baseline to consider is the best result presented in [6]. There, a comparison with a machine-learning technique, namely "Random Forest" classification, was performed by applying it on the same data set. Results are reported in Table 2: for the three considered techniques, precision, recall and F1-score (defined as 2 × (*precision* × *recall*)/(*precision* + *recall*)) are reported. Notice that in [6], the proposed technique was as effective as random-forest classifiers; the current version outperforms them, even though names and addresses are neither pre-processed nor cleaned. Observe that the old version of the fuzzy technique and the Random-Forest technique, applied on the data set describing public places in Manchester (UK), obtain identical effectiveness; this is why [6] states that the two techniques are comparable.


**Table 2.** Comparison with experimental results presented in [6].

Consequently, we can say that the current version improves the old one and is suitable to be executed as a *J-CO-QL*<sup>+</sup> script. Furthermore, it still maintains the advantage provided by the old version in comparison with classification techniques, i.e., it can be applied from scratch, without knowing the data set; in contrast, classification techniques ask for a training step on previously labeled training sets, which is a time-consuming and critical activity.

| Technique | Precision | Recall | F1-Score |
|---|---|---|---|
| Best of [6] | 0.931 | 0.931 | 0.931 |
| Random Forest (from [6]) | 0.931 | 0.931 | 0.931 |

#### *7.2. About Execution Times*

Before concluding this work, we report some considerations about execution times.

Usually, this aspect is not considered in the literature on the integration of geographical data sets: authors focus on the effectiveness of the proposed techniques, but do not consider efficiency. However, in our opinion, this is not a negligible aspect for the practical use of integration techniques, in particular with large data sets to integrate.

In this paper, the goal is neither to provide the most efficient technique, nor to evaluate execution times of a plethora of techniques proposed in the literature. Here, the goal is to observe "what to expect" while running the *J-CO-QL*<sup>+</sup> script on a real data set, such as the one we used for our experiments.

We decided to consider a working environment that could be a common situation: analysts are equipped with stand-alone PCs, on which they perform their daily activities. On these PCs, they might have a running instance of *MongoDB*, which stores their data sets about geographical places, as well as an installation of the *J-CO* Framework; indeed, analysts are not necessarily equipped with super-computers. Consequently, we ran our experiments on a laptop PC powered by an Intel quad-core i7-8550U processor, running at 1.80 GHz, equipped with 16 GB RAM and a 250 GB Solid-State Drive, and running the Java Virtual Machine version 1.8.0\_251 (the *J-CO-QL*<sup>+</sup> *Engine* is written in the Java programming language).

Table 3 reports the execution times of the script discussed in this paper. We used the full data set reporting descriptors of public places located in Manchester (UK). Remember that it contains 5738 *Facebook* descriptors and 5214 *Google* descriptors.

Table 3 reports the execution times observed for each single instruction in the script; the right-most column reports the cumulative execution time after each instruction. Clearly, the overall execution time is dominated by the JOIN OF COLLECTIONS instruction, which actually builds 5738 × 5214 = 29,917,932 pairs of descriptors; it takes 216.61 s. The other instructions contribute less than 4 s altogether, so that the overall execution time is 220.804 s, i.e., about 3.6 min.

In terms of user perception, waiting a few minutes is acceptable in this context, in which near-real-time performance is not expected. Furthermore, we remark that once the two data sets to integrate are available, the *J-CO-QL*<sup>+</sup> script can be applied from scratch, and a few minutes later the integrated data set is obtained. This is a remarkable advantage compared with the adoption of classification techniques, because there is no need to build training sets labeled by humans; this activity can take from several hours to several days (depending on the size of the training set), is prone to errors and misunderstandings, and its effectiveness depends on the way the training sets are built before labeling.


**Table 3.** Execution times of the *J-CO-QL*<sup>+</sup> script.

#### **8. Conclusions and Future Work**

To conclude, it is time to summarize and discuss the contribution of the paper, as well as to sketch future developments.

#### *8.1. Conclusions*

This paper addresses the problem of softly integrating data sets describing public places, when these data sets are represented as *JSON* documents and are stored within a *JSON* document store. Specifically, fuzzy-set theory provides the formal framework for the integration methodology presented in the paper: the *MatchingPlaces* fuzzy relation is the core of the proposed methodology. Then, the soft integration method is applied in a practical way by means of the *J-CO* Framework: a script (or query) is written in *J-CO-QL*<sup>+</sup> (the query language of the *J-CO* Framework), which implements the soft integration method by exploiting its fuzzy capabilities. This is the main contribution of the paper: showing that a novel stand-alone tool (the *J-CO* Framework), suitable for performing the soft integration of geo-tagged *JSON* data sets stored within *JSON* document stores, is now available for analysts and spatial-data engineers.

Hereafter, we report some final considerations.


#### *8.2. Future Work*

As future work, many activities are planned along the development of the *J-CO* Framework and its application to data-integration problems.


The *J-CO* Framework is available on a public GitHub repository (https://github.com/JcoProjectTeam/JcoProjectPage, accessed on 1 September 2022).

**Author Contributions:** Conceptualization and methodology, Giuseppe Psaila; software, Paolo Fosci; writing—original draft preparation, Giuseppe Psaila; writing—review and editing, Paolo Fosci and Giuseppe Psaila; experiments, Paolo Fosci. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** All collection data sets used for the experiments reported in this paper can be freely downloaded from the *J-CO* Framework GitHub repository at: https://github.com/JcoProjectTeam/JcoProjectPage/tree/main/papers/dataset/ijgi2022.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

