Article

Semantic Segmentation of Airborne LiDAR Data in Maya Archaeology

1 Department of Cybernetics and Artificial Intelligence, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 042 00 Košice, Slovakia
2 Center for Mesoamerican Studies, Faculty of Arts, Comenius University in Bratislava, Gondova 2, 811 02 Bratislava, Slovakia
3 Department of Theoretical Geodesy, Faculty of Civil Engineering, Slovak University of Technology in Bratislava, Radlinského 11, 810 05 Bratislava, Slovakia
* Author to whom correspondence should be addressed.
Remote Sens. 2020, 12(22), 3685; https://doi.org/10.3390/rs12223685
Submission received: 13 October 2020 / Revised: 6 November 2020 / Accepted: 7 November 2020 / Published: 10 November 2020

Abstract

Airborne LiDAR has produced large amounts of data for archaeological research over the past decade. Labeling this type of archaeological data is a tedious process. We used a data set from the Pacunam LiDAR Initiative survey of the lowland Maya region in Guatemala. The data set contains ancient Maya structures that were manually labeled and, to a large extent, ground verified. We built and compared two deep learning based models, U-Net and Mask R-CNN, for semantic segmentation. The segmentation models were used in two tasks: identification of areas of ancient construction activity, and identification of the remnants of ancient Maya buildings. The U-Net based model performed better in both tasks and was capable of correctly identifying 60–66% of all objects and 74–81% of medium-sized objects. The quality of the resulting prediction was evaluated using a variety of quantifiers. Furthermore, we discuss the problems of re-purposing the archaeological style labeling for the production of valid machine learning training sets. Ultimately, we outline the value of these models for archaeological research and present a road map toward a useful decision support system for recognition of ancient objects in LiDAR data.

Graphical Abstract

1. Introduction

In 2016, the Pacunam LiDAR Initiative (PLI) undertook the largest LiDAR survey to date of the lowland Maya region, mapping a total of 2144 km² of the Maya Biosphere Reserve in Petén, Guatemala [1]. The result of this endeavor is a standardized and consistent data set that enables archaeological studies of an area covered by tropical forest that limits on-the-ground research. LiDAR can produce high-resolution data sets quickly and cheaply; therefore, a much larger portion of the given area is likely to be covered soon. LiDAR-derived products can be easily integrated into a Geographic Information System (GIS). In our case, the product has the form of a high-resolution digital elevation model (DEM). The revealed micro-topography is analyzed and interpreted by experts aided by various enhancements (calculation of slope angle, sky-view factor, openness, prismatic openness, etc.) and visualizations (2D and 3D rendering) [1,2,3], see the example in Figure 1. This task is very demanding in terms of time and personnel. Our goal is to simplify the identification of objects of interest in DEMs using semantic segmentation based on Deep Learning (DL) [4]. The first step is the production of valid training data from the DEM and the geo-referenced vector objects stored in GIS. We have investigated how the objects are labeled and what the relation is between the objects and the actual topographic features present today. Further, we describe the pipeline of object mask creation, data pre-processing, building the DL models, post-processing, and thresholding the outputs. The DL architectures of choice here were U-Net [5] and Mask R-CNN [6]. The results are evaluated and discussed. Based on the experimental results, we propose how to build a decision support system that aids object labeling.

2. Acquisition and Processing of LiDAR-Derived Terrain Models in Archaeology

2.1. Using Airborne LiDAR in Tropical Environment

In archaeology, the laser scanning system LiDAR has an advantage over other remote sensing methods, such as aerial photography and satellite imaging, in that it is capable of penetrating the vegetation canopy and revealing the features of the bare earth [1]. LiDAR is an active sensor that illuminates objects with laser beam pulses to detect their distance and reflectivity. Distance to the object is calculated from the measured time of flight of the pulse and the known speed of light. When using airborne LiDAR, the position and orientation of the sensor are also recorded. The localization of the sensor is usually obtained by fusing GPS positioning data (a combination of satellite signals and a ground GPS reference unit signal) with inertial measurement unit data. The sensor scans the surface below the aircraft with laser pulses emitted in rapid succession defined by the pulse repetition frequency (PRF) [7]. The laser footprint represents the area illuminated by the laser pulse on the surface. Both the laser footprint and the PRF limit the quality of the measurement. A smaller laser footprint is better, because the pulse is reflected from fewer objects, and a higher PRF is better, because a higher scanning resolution can be obtained. The PRF is especially a limiting factor when a bare-earth model of forested regions is to be created. A certain percentage of laser shots will be completely blocked by the foliage, so that none of the laser energy reaches the ground and returns to the sensor at all. While temperate forests block 70–80% of laser shots, the thick rain forests in the tropics can block up to 96% of them, leaving only very sparse data to reconstruct from when the aircraft moves fast [8]. The reflections from the canopy, the undergrowth, and the ground can be detected as spikes in the reflected energy and are called returns.
Computational methods are used to reconstruct a geo-referenced point cloud from the recorded data, which is usually transformed to fit a regular grid using interpolation, thus creating, for example, a digital elevation model (DEM) or a digital terrain model (DTM). Airborne LiDAR has been used for archaeology in Central and South America since 2000. Moller and Fernandez-Diaz [9] have surveyed the evolution of the technologies used in the area since then. The data used in our study were acquired with a Teledyne Optech Titan MW by PLI in 2016, see Figure 2. Fernandez-Diaz et al. [8] describe the capabilities of the LiDAR model used for this study in great detail. Titan uses a total of three separate laser channels, each firing at a PRF of up to 300 kHz. Each laser is set at a different angle, with one pointing straight down, one pointing 3.5° forward, and one pointing 7° forward, thus improving the chances that part of a laser beam will reach the ground.

2.2. Overview of Deep Learning Methods for Semantic Segmentation

The goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented; this can be viewed as pixel-wise classification of the input image. Established computer vision methods such as SIFT are now often replaced with Convolutional Neural Networks (CNNs), and sometimes hybridized with them [10]. Using the rectified linear unit (ReLU) activation function in hidden layers, as opposed to the traditionally used sigmoid function, together with other modifications, enabled backpropagation learning in very deep networks, in a process called Deep Learning (DL) [11]. Various CNN architectures usually differ only in layer organization, rather than in layer types. Most of these architectures, like GoogleNet [12] or ResNet [13], are known from the ImageNet competition [14] and incorporate residual connections that improve training performance [15].
Fiorucci et al. [16] argue that, due to the ’black box’ nature of modern DL classifiers, they are only slowly being adopted in cultural heritage applications, but pioneers in this field aim to make DL methods a standard tool in this domain [17,18,19,20,21]. LiDAR sensing produces large amounts of data that are processed with the aid of a wide range of computer vision methods. Processing large image data is usually done by splitting the original images into small overlapping segments, also called tiles [22]. The tiles are further processed using a classification or a semantic segmentation algorithm, often based on DL. In classification, the whole tile is categorized. In instance segmentation [23], a separate segmentation mask is created for every object instance.
Encoder–decoder type architectures, such as the one described by Long et al. [24], replace the dense output layers with up-sampling layers to obtain a segmentation mask at the output. U-Net is a convolutional neural network that was originally developed for biomedical image segmentation [5]. It consists of almost symmetric contractive and expansive paths, connected with shortcut connections into a U-shaped architecture. Figure 3 shows an example of the U-Net architecture as published in the original paper. One downside of this approach is that it allows only for semantic segmentation and not for instance segmentation. Semantic segmentation was often achieved by extending the capabilities of region-based CNNs (R-CNN) [25], which identify the position of an object by estimating a bounding box. The region proposal module speeds up the object localization process. Faster R-CNN [26] advanced this line of work by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements.
Mask R-CNN [6] added instance segmentation capabilities to Faster R-CNN using a feature up-sampling technique. Instance segmentation uses an additional object detection step to obtain the individual instances of all classes in an image. Mask R-CNN surpassed all previous state-of-the-art single-model results on the COCO instance segmentation task [27]. To the two Faster R-CNN outputs for each candidate object (a class label and a bounding-box offset), Mask R-CNN added a third branch that outputs the object mask. The additional mask output requires the extraction of a much finer spatial layout of an object. One of the key elements of Mask R-CNN is pixel-to-pixel alignment, which was the main missing feature of Fast/Faster R-CNN.

2.3. Deep Learning Methods for LiDAR-Derived Terrain Models in Archaeology

As DL found its way into many application areas, archaeology was not an exception. Verschoof-van der Vaart and Lambers [17] introduced a new technique based on Faster R-CNN for automated detection of multiple classes of archaeological objects in LiDAR data gathered from the central part of the Netherlands. The experiments showed that it was possible to detect and categorize prehistoric barrows and Celtic fields, but not charcoal kilns. The authors reported variability in the performance of Faster R-CNN between the experiments. In follow-up research, the authors were able to identify the charcoal kilns as well, and proposed an iterative workflow bridging remote sensing, machine learning, and citizen science to address the aforementioned issues [28]. While we do not explicitly follow this workflow, many of its aspects are implicitly included in our approach.
Moyes and Montgomery [29] addressed discovering cave entrances in the Chiquibul Forest Reserve, a heavily forested area in western Belize. Caves are an important part of Maya cultural heritage, as they were utilized by the ancient Maya people as ritual spaces. The goal of the authors was to locate and investigate sinkholes and other types of cave entrances with point cloud modeling using local relief models. They confirmed a high rate of accuracy in locating cave entrances that varied in both size and morphology. Here, we focus on high-level processing that involves recognition of objects of various types. Somrak et al. [30] developed a CNN to classify Maya structures of the Chactún archaeological site in Campeche, Mexico. They investigated how various parameters impact the performance of the network. Specifically, they used 6 different mixtures of input layers, 2 different edge buffer sizes, and different data augmentations, and froze various layers of the CNNs. The resulting models exceeded an overall classification accuracy of 95%.
Gallwey et al. [18] presented a U-Net [5] based DL method for semi-automated detection of historic mining pits in aerial LiDAR data. A CNN that was pre-trained on a lunar LiDAR data set was reapplied in archaeology. The feature extracting part of a trained CNN may work well across a wide range of applications; therefore, it is possible to reuse it with limited or no retraining.
While a body of research [18,31,32,33] has utilized transfer learning with varying degrees of success, Pires de Lima and Marfurt [31] present a systematic review whose results show that transfer learning is a powerful tool for remote sensing. Furthermore, they show that models pre-trained on more generic natural images outperform those pre-trained on domain-specific data sets.
Politz et al. and Kazimi et al. [19,20] investigated the use of several CNN based approaches for the detection of archaeological objects in airborne LiDAR data. The authors classified input tiles into several classes and combined the output to form a heatmap correlated with the probability of the presence of an object of interest at a given position. The authors also explored the use of Mask R-CNN for instance segmentation. Both approaches showed promising results.
To summarize the literature review, processing LiDAR data using advanced computer vision methods such as DL has a big potential, not only for archaeology. The number of DL applications in this area is slowly increasing. The problems mentioned in the publications often concern the selection of a suitable DL approach. There is a lack of investigation of the problems related to the production of a training data set. Banaszek et al. [34] and Soroush et al. [21] elaborate on the shortcomings of relying on the experts’ knowledge for data set labeling. In their works, they identify how uncertainties, bias, and errors of the expert can degrade the resulting classifier’s accuracy. Quintus et al. [35] make several recommendations for the labeling of archaeological LiDAR data sets. Specifically, they show a need for quantitative assessment of the labels through field validation and confidence intervals for the expert labels. The experts labeling the objects usually follow standards that are not necessarily optimal for machine learning. This may lead to noisy or conflicting features. The lack of ground truth verification of the objects’ labels may be another problem. Data pre-processing comprises an important part of the model production. As such, it requires a responsible approach and well thought out methods.

3. Uaxactun Data Set

The PLI LiDAR survey of the lowland Maya region mapped an area totaling 2144 km² of the Maya Biosphere Reserve (MBR) in Petén, Guatemala. Using these data, it was possible for the first time to distinguish agriculture, physical infrastructure, and ancient settlements across a broad swath of the central Maya Lowlands using a standardized and consistent data set [1]. From these data, we had at our disposal the area of the ancient Maya city of Uaxactun with a polygon of app. 160 km². The rectangular lowland region in Northern Guatemala, roughly 9 × 17 km, Figure 4, is archaeologically investigated by the Regional Archaeological Project Uaxactun (PARU) of Comenius University in Bratislava, Slovakia. Besides buildings and platforms, the area contains agricultural fields, roads, and other important ancient infrastructure. Because every category has different features, we decided to build a separate model for each one. Here, we focused on the remains of ancient construction activity because of its high importance for field research. Specifically, we focus on semantic segmentation of ancient Maya structures such as buildings (houses, temples, pyramids) and platforms, where the remains of the buildings either still stand (if they were made of stone), or which are empty (if the buildings were made of a perishable material). Five thousand and eighty structures were identified in the whole research polygon, grouped in 72 settlements of various sizes. The comparative basis was formed by data from the same polygon, which passed the so-called malerization, i.e., the redrawing of archaeological structures visible in LiDAR into schematic icons evoking their shape, size, and height. The most prominent problem is that the ancient structures are outlined as they once stood, and not as the remains are dispersed now. The present features do not perfectly overlap with the past ones. Labeling was performed by PARU experts and was subsequently validated to a large extent directly in the field as ground-truthing in 2017 and 2019 [1,36,37]. The PLI data captured a variety of topography and ancient settlements, which provide an unprecedented chance to expand studies of the ancient Maya civilization, for example, to estimate regional population levels, determine patterns of settlement density, calculate agricultural capacity, and evaluate the extent of infrastructural investment in water storage.
The LiDAR raster data were in the standard GeoTIFF format. The last return (ground reflection) was used. The DEM values form a bare-earth terrain model on a regular grid, 1 × 1 m per pixel. The scanner was designed to produce a nominal laser density of 15 pulses/m² from a flying altitude of 650 m above ground level. However, due to varying vegetation density, canopy, and reflectivity, the average ground densities range from 1.1 to 5.3 returns per m² [1]. The values are given in meters above sea level (m.a.s.l.) using floating point numbers represented on 32 bits. We selected two rectangular areas, both app. 5 × 9 km, the northern rectangle for training and validation and the southern one for testing. We avoided using data from the central part of the research area because of the presence of larger recent structures and excavated ancient structures, see Figure 4. PARU experts labeled the data using geo-referenced vector objects in GIS. The full separation of the training and testing areas and their 1:1 division enabled us to construct output pictures with confidence in the generality of the reported results.
PLI’s preliminary field validation suggests that the identifications are accurate and slightly conservative. Further field validation suggests a 15% increase in the number of structures [1]. While we currently do not possess any confidence intervals for the labels, the work on creating ground truth data is still on-going, and a certain level of noise in the data is to be expected. In addition, the Uaxactun data set is labeled by a wide group of PARU experts, which reduces the bias of the resulting labels.

4. Semantic Segmentation of Maya Ruins by CNNs

We will refer to the group of algorithms that are chained to process DEM data as the model. A trained CNN forms a part of it. The steps needed to produce a working model are the following:
  • Object mask creation.
  • CNN input preparation.
  • CNN usage, which consists of:
    (a) CNN training,
    (b) CNN output processing,
    (c) thresholding.
  • Semantic segmentation quality evaluation.
The CNNs used in this study are U-Net and Mask R-CNN.

4.1. Object Mask Creation

Hutson [38] uses the term prism for the imperfect 2D interpretations of the ancient Maya buildings. Vectorized prisms are also used to label buildings in our GIS data. The geo-referenced vector depictions of the objects of interest were stored in the *.shp file format.
Let us have a binary M × N image mask matrix M composed of zero-valued elements (indicating absence of an object) and one-valued elements (indicating presence of an object). Let ¬M implement pixel-wise logical negation of M.
Let ∪ represent the union of two image mask matrices, implemented as pixel-wise logical disjunction (OR operator) of the elements, and let ∩ represent the intersection of two image mask matrices, implemented as pixel-wise logical conjunction (AND operator) of the elements.
We used two output masks for the objects of interest:
  • Structures, containing buildings and platforms the buildings are constructed on.
  • Mounds, containing buildings only.
To obtain the Structures mask, we implemented a script that used the stored series of ⟨x, y⟩ coordinates of the vertices of the polygonal chain outlining an object or a portion of it. ‘NaN’ tags in the series indicate the start of a new polygonal chain belonging to the same object. In the optimal case, the polygonal chains form polygons. The procedure to create the output mask of a single object is illustrated in Figure 5 and was implemented as follows (a minimal code sketch is given after the list):
  • The input series is split into multiple polygonal chains defined by sequentially connected ⟨x, y⟩ coordinates. Each occurrence of ‘NaN’ in the series is considered to be the start of a new polygonal chain, Figure 5a.
  • Each polygonal chain is processed:
    (a) If the polygonal chain contains 3 or more vertices, and the first and the last vertex are identical, a polygon is produced.
    (b) If the polygonal chain contains 3 or more vertices, and the first and the last vertex are not identical, the last vertex is connected to the nearest vertex of the polygonal chain and a polygon is created.
    (c) If the polygonal chain contains fewer than 3 vertices, it is discarded.
  • This process can result in zero, one, or multiple 2D polygons representing the output structure. The structure is then reconstructed as a union of the areas enclosed by the polygons, Figure 5b.
  • The areas delimited by the vertices of the individual polygons are filled, Figure 5c.
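The sketch below illustrates one possible implementation of this procedure, assuming NumPy and scikit-image; the function names (split_chains, rasterize_structure) are ours, and rule (b) is simplified to closing the chain to its first vertex rather than to the nearest vertex.

```python
# Hypothetical sketch of the Structures mask creation described above;
# not the authors' original script.
import numpy as np
from skimage.draw import polygon as fill_polygon

def split_chains(series):
    """Split a vertex series into polygonal chains at 'NaN' separators."""
    chains, current = [], []
    for x, y in series:
        if np.isnan(x) or np.isnan(y):
            if current:
                chains.append(current)
            current = []
        else:
            current.append((x, y))
    if current:
        chains.append(current)
    return chains

def rasterize_structure(series, mask_shape):
    """Create a binary mask (1 = object) for one structure outline."""
    mask = np.zeros(mask_shape, dtype=np.uint8)
    for chain in split_chains(series):
        if len(chain) < 3:                      # rule (c): discard short chains
            continue
        if chain[0] != chain[-1]:               # rule (b): close an open chain
            chain = chain + [chain[0]]          # (nearest-vertex closing simplified)
        xs, ys = zip(*chain)
        rr, cc = fill_polygon(ys, xs, shape=mask_shape)  # pixels inside the polygon
        mask[rr, cc] = 1                        # union of all polygon areas
    return mask
```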
Pixel-wise, the Mounds mask is a subset of the Structures mask. The Structures mask focuses on the areas of ancient human construction activity in general, i.e., buildings and platforms. The training area contained app. 600 and the area for testing app. 1300 ancient Maya structures of various sizes. Thirty per cent of this area was dedicated to validation. Figure 6 shows a detail with examples of the objects of interest.
The Mounds mask shows the assumed remains of the buildings only, without platforms. From the computer vision perspective, it is of higher quality because it is more specific, and the mounds (remains of buildings covered by a relatively thin layer of soil) differ from natural phenomena more than the platforms. From the archaeological perspective, building segmentation enables archaeologists to estimate the ancient population size. The area for training and validation contained app. 670 and the area for testing app. 1400 mounds of various sizes.
The polygonal approach to mask creation was somewhat problematic because of imperfections in the PARU labeling. For example, if the starting and the ending vertex of a polygonal chain did not align, some areas were not masked correctly. This usually produced small fragments of noise in the segmentation mask that needed to be removed by further processing or ignored. Therefore, a less refined but efficient method was used to produce the Mounds mask. We used a simple rasterization algorithm to project the vector polygonal chains onto a binary image O with a resolution of 0.5 × 0.5 m per pixel (double the resolution of the original DEM): if a line segment intersects with a pixel in O, the value of the pixel becomes 1; otherwise the pixel value is 0. Let Bgr (background) be the largest connected component in the binary image ¬O (the negative of the projection canvas O). The objects of interest are then ¬Bgr. The resulting object mask is obtained by downsampling ¬Bgr to half its size, back to the 1 × 1 m per pixel resolution.
This approach, although not universal, proved to be resistant to the imperfections in manual labeling using the vector objects. It does not solve situations when objects are enclosed within other objects. This may lead to incorrect masking, but the buildings in our data were not that complex and the resulting object mask was nearly faultless.
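A minimal sketch of this rasterization-and-connected-component procedure follows, assuming scikit-image line drawing and SciPy connected-component labeling; the function name and the block-wise downsampling rule (object if any subpixel is object) are our assumptions.

```python
# Hypothetical sketch of the Mounds mask procedure described above.
import numpy as np
from skimage.draw import line as draw_line
from scipy import ndimage

def mounds_mask(chains, dem_shape):
    """chains: list of polygonal chains in DEM (1 m per pixel) coordinates."""
    h, w = dem_shape
    O = np.zeros((2 * h, 2 * w), dtype=bool)        # 0.5 m per pixel canvas
    for chain in chains:
        pts = np.round(np.asarray(chain) * 2).astype(int)
        for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
            rr, cc = draw_line(y0, x0, y1, x1)       # pixels crossed by the segment
            O[np.clip(rr, 0, 2 * h - 1), np.clip(cc, 0, 2 * w - 1)] = True
    labels, n = ndimage.label(~O)                    # components of the negative image
    if n == 0:
        return np.zeros(dem_shape, dtype=np.uint8)
    sizes = ndimage.sum(np.ones_like(labels), labels, index=range(1, n + 1))
    bgr = labels == (np.argmax(sizes) + 1)           # largest component = background
    objects = ~bgr                                   # everything else = objects
    # downsample back to 1 m per pixel
    return objects.reshape(h, 2, w, 2).any(axis=(1, 3)).astype(np.uint8)
```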

4.2. CNN Input Preparation

The CNNs accept a matrix (or a tensor) of a certain size at the input. The U-Net CNN used here takes a 64 × 64 × 1 matrix. We prepared tiles of this size from the selected DEM area with a stride of 16 pixels both horizontally and vertically; there was a 75% overlap between subsequent tiles. The DEM values were normalized into the [0, 1) interval in two steps:
  • The minimal altitude in m.a.s.l within the tile was subtracted from all values.
  • The result was divided by a normalization constant.
In the first step, the overall altitude information is eliminated, as it should not influence the resulting segmentation. The second step ensured that the values are within the required range, considering the maximal estimated elevation difference that can appear within a tile. The local elevation difference ratios remained preserved.
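A minimal sketch of this tiling and normalization, assuming NumPy; the tile size (64) and stride (16) follow the text, while the exact value of the normalization constant NORM is our assumption.

```python
# Tile extraction with the two-step normalization described above (a sketch).
import numpy as np

TILE, STRIDE, NORM = 64, 16, 30.0   # NORM ~ assumed maximal elevation span within a tile (m)

def extract_tiles(dem):
    """Yield normalized 64x64x1 tiles from a DEM given in m.a.s.l."""
    h, w = dem.shape
    for y in range(0, h - TILE + 1, STRIDE):
        for x in range(0, w - TILE + 1, STRIDE):
            tile = dem[y:y + TILE, x:x + TILE].astype(np.float32)
            tile -= tile.min()            # step 1: remove absolute altitude
            tile /= NORM                  # step 2: scale into [0, 1)
            yield tile[..., np.newaxis]   # add channel axis for the CNN
```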
The source of the output data was the object mask, i.e., the occupancy grid produced as described in Section 4.1. We accepted the PARU expert labeled areas as containing ancient remains with 100% certainty, as we had no additional confidence information. We also assumed that the present outline may be wider than labeled, due to the objects falling apart over centuries. The output pixels inside the PARU labeled outline received the value of 1.0. The pixels near the outline received values that gradually decreased toward zero with the distance from the object. This “confidence shadow” was 10 m long for the Structures mask and 4 m long for the Mounds mask. When assembling the training set, only the tiles that contained objects of interest were included. Simple data augmentation was performed: besides the original tiles, their horizontally and vertically flipped and rotated images (in multiples of 90°) were included too.
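One way to produce such a soft target is sketched below, assuming a Euclidean distance transform and a linear decay over the shadow length; the authors' exact ramp function is not specified in the text.

```python
# Confidence-shadow target construction (a sketch under the stated assumptions).
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_target(mask, shadow_m=10.0):
    """mask: binary object mask (1 m per pixel). Returns float targets in [0, 1]."""
    dist = distance_transform_edt(mask == 0)           # distance of background pixels to the object
    target = np.clip(1.0 - dist / shadow_m, 0.0, 1.0)  # assumed linear decay over the shadow length
    target[mask == 1] = 1.0                            # inside the labeled outline: full confidence
    return target
```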
The input was prepared in a similar manner for Mask R-CNN. Because this model used a pre-trained backbone, the input size was 1024 × 1024 × 3, and the input preparation algorithm was adjusted accordingly. The same tile was used in all 3 channels of the input tensor.

4.3. U-Net CNN Usage

The architecture we used in our experiments has been implemented using Keras and TensorFlow [39]. The total number of trainable parameters was 1,177,649 and the total number of layers was 49. The 64 × 64 input prepared from the DEM data was gradually contracted to a 4 × 4 × 256 tensor by a succession of convolution, batch normalization, activation, max pooling, and dropout layers. The expansive path, which used a succession of concatenation, dropout, transposed convolution, and batch normalization layers, expanded the contracted tensor back to the input size. The output layer contained neurons with the sigmoid activation function with the range of (0, 1). We used the Adam optimization algorithm to minimize the mean absolute error (MAE). MAE is more tolerant toward outliers than, for example, the mean squared error [40].
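A simplified Keras sketch of a U-Net of this kind is shown below (64 × 64 × 1 input, contraction to 4 × 4, sigmoid output, Adam with MAE). The filter counts, dropout rates, and exact layer ordering are our assumptions and will not reproduce the reported 1,177,649 parameters exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return x

def build_unet(input_shape=(64, 64, 1)):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    # contracting path: 64 -> 32 -> 16 -> 8 -> 4
    for filters in (16, 32, 64, 128):
        x = conv_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(0.1)(x)
    x = conv_block(x, 256)                      # 4 x 4 x 256 bottleneck
    # expansive path with shortcut (skip) connections
    for filters, skip in zip((128, 64, 32, 16), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])
        x = layers.Dropout(0.1)(x)
        x = conv_block(x, filters)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel confidence in (0, 1)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")             # Adam minimizing mean absolute error
    return model
```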
Once trained, the CNN is used to make predictions on novel data. Let matrix T represent the testing area of the DEM. For simpler notation, let T(x, y) be a 64 × 64 portion of T, with x, y representing the coordinates of the upper left corner of T(x, y) within T. Let Tp(x, y) represent an input to U-Net prepared from T(x, y) as described in Section 4.2. For every Tp(x, y), U-Net produces a 64 × 64 output matrix O:

O = UNet(Tp(x, y))

The values of O are from the range (0, 1). The input Tp(x, y) can be taken from the testing set with various strides s. We have used the same stride for the x and y directions. For example, starting from the upper left corner of T (x = 1, y = 1), we can obtain Tp(1, 1), Tp(1 + s, 1), Tp(1 + 2s, 1), …, Tp(1, 1 + s), …. Let Tw and Th be the width and height of T, respectively. Let OT denote the output of U-Net for T. OT has the same dimensions as T and is initialized with zeros. OT is calculated with the following Algorithm 1:
Algorithm 1 Assembling U-Net predictions
x ← 1
while x + 64 ≤ Tw do
   y ← 1
   while y + 64 ≤ Th do
       OT(x, y) ← OT(x, y) + UNet(Tp(x, y))
       y ← y + s
   end while
   x ← x + s
end while
Stride s cannot be larger than the input size (64 in our case). We have observed that better results are obtained if s is a fraction of the input size, one quarter or one half. Every pixel is then evaluated several times in a slightly different context, and the U-Net outputs for these pixels are summed. We used s = 32 (every pixel is evaluated 4 times). Depending on s, the values of OT range between 0 and the number of evaluations of a pixel. The pixels near the boundaries of T may not be evaluated as many times as other pixels. This problem can be solved, for example, by zero padding T if outside data are not available.
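A plain NumPy rendering of Algorithm 1 is sketched below; `model` is assumed to be the trained Keras U-Net and `prepare_tile` the normalization from Section 4.2.

```python
# Sliding-window assembly of U-Net predictions with stride s (a sketch).
import numpy as np

def assemble_predictions(model, T, prepare_tile, s=32, tile=64):
    """Sum overlapping U-Net outputs over the testing area T (2D array)."""
    Th, Tw = T.shape
    OT = np.zeros((Th, Tw), dtype=np.float32)
    for y in range(0, Th - tile + 1, s):
        for x in range(0, Tw - tile + 1, s):
            inp = prepare_tile(T[y:y + tile, x:x + tile])[np.newaxis, ...]
            OT[y:y + tile, x:x + tile] += model.predict(inp, verbose=0)[0, ..., 0]
    return OT   # heat map; each interior pixel accumulates (tile / s) ** 2 evaluations
```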
OT can be interpreted as a heat map, with the values of OT corresponding to the model’s confidence. The heat map is useful as a pre-final product. Our goal was to produce a binary classification of every pixel, so we applied thresholding to the OT values. The resulting product is a binary semantic segmentation of T. It is necessary to determine the threshold value. The question is on what optimality criterion the threshold should be based.
To find the threshold value, we have constructed OT for the training area. Using the ground truth, we calculated the quality quantifiers (see Section 4.5) for the training set when using various thresholds. Using numerical optimization, threshold values that optimize one of the following criteria were found:
  • Classification accuracy (A, maximization).
  • Number of misclassified pixels further than 10 m from true positives over all misclassified pixels (MOR10R, minimization).
  • Intersection over Union (IoU, maximization).
When optimized for criteria other than the above, the thresholds converged to the same value as for IoU, and therefore the binary segmentation results would be the same. The found thresholds were then used on the U-Net output for the testing area. If the stride s used on the testing area is different from the stride that was used on the training area, the value of the found threshold must be multiplied by a corresponding coefficient reflecting the number of times each pixel was evaluated in the former and the latter case.
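A minimal sketch of the IoU-optimizing threshold search on the training-area heat map is shown below; a simple grid scan stands in for the unspecified numerical optimizer, and the upper bound of the scan reflects the number of evaluations per pixel.

```python
# Grid search for the IoU-optimizing binary segmentation threshold (a sketch).
import numpy as np

def iou_pos(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def find_threshold(OT, gt, n_evals=4, steps=200):
    """Scan candidate thresholds in [0, n_evals] and keep the IoU-best one."""
    best_t, best_iou = 0.0, -1.0
    for t in np.linspace(0.0, float(n_evals), steps):
        iou = iou_pos(OT > t, gt)
        if iou > best_iou:
            best_t, best_iou = t, iou
    return best_t, best_iou
```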
We have observed that smoothing the U-Net output OT before thresholding may improve the binary segmentation of the Structures mask. We have used averaging box filters of various sizes for this purpose.
To summarize, the U-Net model usage follows this pipeline:
  • Preparing the training set.
  • Training the U-Net CNN.
  • Finding optimal thresholds for binary segmentation.
  • Preparing novel inputs from the testing area.
  • Evaluating the inputs to produce output matrix OT .
  • Optional—smoothing the output matrix.
  • Thresholding: producing binary segmentation.

4.4. Mask R-CNN Usage

For comparison of experimental results and for verification, we selected Mask R-CNN [6], more precisely, the implementation by Abdulla [41]. Compared to the U-Net architecture, Mask R-CNN is more advanced because it also enables instance segmentation, but we did not use this feature in the presented research.
As a backbone, we used ResNet101 pre-trained on the COCO data set, from a DeepLab model zoo [42]. The last four stages of the backbone are merged into a Feature Pyramid Network (FPN), which is used to detect objects at different scales and forms the input to the RPN. We reduced the size of the RPN anchor boxes to be appropriate for the average size of the detected objects, which is smaller than in the COCO data set. We trained only the network heads, i.e., everything except the ResNet101 backbone. This transfer learning approach required a three-band image at the input, so we stacked the original input three times along the third axis. The input size was 1024 × 1024. We prepared the CNN inputs in the same manner as for the U-Net CNN, save for the different tile size. The output of Mask R-CNN is a set of output matrices, one for every detected instance. In this case, the universal threshold used to obtain the binary segmentation was 0.5, because it always led to the best results. Since we were only looking for semantic segmentation, we combined all binary segmentation masks from the Mask R-CNN network using the ∪ operator (logical OR, defined above) into a single segmentation mask. Otherwise, we used the same data division routines for training, validation, and testing as with the U-Net model.
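A hedged configuration sketch for the Mask R-CNN implementation by Abdulla [41] (the matterport Mask_RCNN package) is given below. The reduced anchor scales, weight file path, and epoch count shown here are our assumptions; `dataset_train` and `dataset_val` are assumed to be prepared `mrcnn.utils.Dataset` objects.

```python
from mrcnn.config import Config
from mrcnn import model as modellib

class MoundsConfig(Config):
    NAME = "maya_mounds"
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1 + 1                       # background + object of interest
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)  # smaller anchors for small structures (assumed values)

config = MoundsConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")
# start from COCO weights, skipping the heads whose shapes depend on NUM_CLASSES
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
# train only the network heads, keeping the ResNet101 backbone frozen
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=100, layers="heads")
```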

4.5. Semantic Segmentation Quality Evaluation

To measure the segmentation quality, we used a range of quality quantifiers. Because we have built a separate model for every task, the segmentation was always binary. The following quantifiers were based on the confusion matrix with counts of pixels that were classified as true negative (TN), true positive (TP), false negative (FN), and false positive (FP). P is the number of real positive pixels and N is the number of real negative pixels in the ground truth mask.
  • A = accuracy
    A = (TP + TN) / (P + N)
  • TPR = true positive rate (sensitivity, recall)
    TPR = TP / P
  • TNR = true negative rate (specificity)
    TNR = TN / N
  • BA = balanced accuracy
    BA = (TPR + TNR) / 2
  • PPV = positive predictive value (precision)
    PPV = TP / (TP + FP)
  • NPV = negative predictive value
    NPV = TN / (TN + FN)
  • F1 = harmonic mean of precision and sensitivity
    F1 = 2 · PPV · TPR / (PPV + TPR)
  • MCC = Matthews correlation coefficient
    MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC is a balanced measure which can be used even if the classes are of very different sizes, as in our case [43]. In addition to these measures, we have calculated the Intersection over Union (IoU, Jaccard index) for the objects of interest class, for the background class, and the average IoU. Let ‖M‖ be the count of all non-zero elements of an M × N binary matrix M:

‖M‖ = Σ_{i=1}^{M} Σ_{j=1}^{N} M_{i,j}
Let I be the predicted binary image mask and let GT be the ground truth image mask. IoU_pos for the positive class is calculated as:

IoU_pos(I, GT) = ‖I ∩ GT‖ / ‖I ∪ GT‖

IoU_neg for the negative class is calculated as:

IoU_neg(I, GT) = ‖¬I ∩ ¬GT‖ / ‖¬I ∪ ¬GT‖

And IoU is calculated as:

IoU(I, GT) = (IoU_pos(I, GT) + IoU_neg(I, GT)) / 2
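A straightforward NumPy implementation of these pixel-wise quantifiers (confusion-matrix measures and the class-wise IoU) is sketched below; the function name is ours.

```python
# Pixel-wise quality quantifiers computed from a predicted and a ground truth mask.
import numpy as np

def pixel_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    TP = np.logical_and(pred, gt).sum()
    TN = np.logical_and(~pred, ~gt).sum()
    FP = np.logical_and(pred, ~gt).sum()
    FN = np.logical_and(~pred, gt).sum()
    P, N = gt.sum(), (~gt).sum()
    TPR, TNR, PPV = TP / P, TN / N, TP / (TP + FP)
    iou_pos = TP / np.logical_or(pred, gt).sum()
    iou_neg = TN / np.logical_or(~pred, ~gt).sum()
    return {
        "A": (TP + TN) / (P + N),
        "TPR": TPR, "TNR": TNR, "BA": (TPR + TNR) / 2,
        "PPV": PPV, "NPV": TN / (TN + FN),
        "F1": 2 * PPV * TPR / (PPV + TPR),
        "MCC": (TP * TN - FP * FN) /
               np.sqrt(float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))),
        "IoU_pos": iou_pos, "IoU_neg": iou_neg,
        "IoU": (iou_pos + iou_neg) / 2,
    }
```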
We have proposed the MOR10R quantifier, which is calculated as the number of misclassified pixels further than 10 m from true positives over all pixels. The idea behind it is to minimize the proportion of misclassified pixels outside the area that will be scrutinized by the expert, which is near the true positives. We hypothesize that the errors close to true positives are therefore less detrimental. The actual implementation uses binary dilation (⊕) by a 3 × 3 cross-shaped binary structuring element S. Let ⊕₁₀ denote the dilation repeated 10 times. Each repetition expands the objects in the image by one pixel in all directions. Let TP be the image mask indicating the true positives, calculated as:
TP = I ∩ GT
Let MIS be the image mask indicating the misclassified pixels, calculated as the negation of an implementation of the XNOR operator:

MIS = ¬((I ∩ GT) ∪ (¬I ∩ ¬GT))
MOR10R is then calculated as:

MOR10R(I, GT) = (‖MIS‖ − ‖(TP ⊕₁₀ S) ∩ MIS‖) / (P + N)
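A sketch of the MOR10R computation using SciPy's binary dilation with a 3 × 3 cross-shaped structuring element repeated 10 times; the function name is ours.

```python
# MOR10R: fraction of misclassified pixels further than 10 px (10 m) from true positives.
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

def mor10r(pred, gt):
    """pred, gt: binary masks at 1 m per pixel; lower MOR10R is better."""
    tp = np.logical_and(pred, gt)                       # true positives
    mis = np.logical_xor(pred, gt)                      # misclassified pixels (negated XNOR)
    S = generate_binary_structure(2, 1)                 # 3x3 cross
    near_tp = binary_dilation(tp, structure=S, iterations=10)
    far_errors = np.logical_and(mis, ~near_tp).sum()    # errors outside the 10 m neighborhood of TP
    return far_errors / gt.size                         # normalized by all pixels (P + N)
```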
The above quantifiers are based on a pixel-wise match between the prediction mask and the ground truth mask. The objects in the image masks form connected components that are relatively small and more or less regular blobs. This is especially the case for the Mounds mask, where the buildings are predominantly delineated by rectangles. Let B_GT ∈ {B_GT,1, …, B_GT,M} be a blob in the ground truth mask and B_PR ∈ {B_PR,1, …, B_PR,N} a blob in the prediction mask.
Our intention was to express how many of the existing objects would be discovered by the models, how many would be ignored, and how many B_PR would not overlap with any B_GT. Undiscovered and falsely identified objects incur a cost in terms of archaeological research. We have also investigated how often the models discover objects of various sizes. To localize an object in the LiDAR data, the predicted and the ground truth blobs do not need to overlap completely. Based on this, we have calculated the hit rate. B_GT is considered hit if there exists at least one B_PR such that B_GT ∩ B_PR ≠ ∅. If n is the count of B_GT that have been hit, the total hit rate is n/M.
We have calculated the hit rates for sets of B_GT grouped by size into small, medium, and large objects. Using the cumulative histogram of the counts of objects of various sizes, we identified the size S_50% such that 50% of B_GT are smaller than or equal to S_50% (small objects) and the size S_95% such that 95% of B_GT are smaller than or equal to S_95%. Medium objects are larger than S_50% and smaller than or equal to S_95%. Large objects are larger than S_95%.
The character of the Mounds mask, with mostly regular and non-overlapping blobs, is also suitable for reporting the total number of predicted B_PR, the number of hit B_GT, and the number of falsely positively predicted B_PR. These counts show the proportion of the discovered objects and of the predicted objects that do not actually exist.
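One possible implementation of this blob-based evaluation, assuming SciPy connected-component labeling; the size thresholds correspond to S_50% and S_95% in the text, and the function name is ours.

```python
# Blob hit rates grouped by object size (a sketch).
import numpy as np
from scipy import ndimage

def hit_rates(pred, gt):
    """Return overall hit rate and hit rates for small/medium/large GT blobs."""
    gt_lbl, n_gt = ndimage.label(gt)
    sizes = ndimage.sum(gt, gt_lbl, index=range(1, n_gt + 1))
    s50, s95 = np.percentile(sizes, [50, 95])
    hits = np.zeros(n_gt, dtype=bool)
    for i in range(1, n_gt + 1):
        # a ground truth blob is hit if any predicted pixel overlaps it
        hits[i - 1] = np.logical_and(pred, gt_lbl == i).any()
    groups = {
        "small": sizes <= s50,
        "medium": (sizes > s50) & (sizes <= s95),
        "large": sizes > s95,
    }
    rates = {name: hits[sel].mean() if sel.any() else float("nan")
             for name, sel in groups.items()}
    rates["total"] = hits.mean()
    return rates
```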

5. Experiments

We have tested both U-Net and Mask R-CNN based models on a rectangular area with dimensions 4910 × 8350 m in the southern part of the research area. Data used for testing were not used in the training and validation process at any point.

5.1. Segmentation of Structures

The U-Net CNN was initialized and trained without transfer learning. It took 400 epochs to converge. We investigated how the segmentation thresholds optimized for various criteria and smoothing of the OT output matrix influence the results on the training set. Based on this, we consider the U-Net model that uses a 30 × 30 averaging box filter applied to OT and the IoU-optimizing threshold for binary segmentation to be the best performing model, outperforming the Mask R-CNN model by a narrow margin. Mask R-CNN was trained for 100 epochs; no output smoothing was used.
Table 1 summarizes the quality quantifiers calculated for the models. Figure 7 shows predictions of the U-Net model on a subset of the testing region containing an ancient settlement.
When we compare the two models relative to each other, they perform approximately on the same level, with U-Net being slightly better judging by the IoU and MCC values. This does not necessarily imply that one model is more suitable for the purpose of identifying ancient Maya structures for archaeological research. We also investigated how successful the models were in discovering objects of various sizes. Fifty per cent of the objects in the Structures testing set were below 220 m² (small objects, with a 13 m² minimum), and 95% of the objects were below 2209 m² (medium objects are larger than 220 m² and smaller than 2209 m²). The remaining 5% of the objects were large, with the largest object covering 43,984 m². Table 2 summarizes the hit rates of the models. U-Net surpasses Mask R-CNN here in all measures. The blobs in the Structures mask vary a lot in size and in shape, so counting the discovered and falsely predicted objects may be misleading, for example, because a predicted blob often covers several distinct ground truth objects. Therefore, we do not report these counts for the Structures mask. Pixel-based quantifiers are better suited for comparison in this case. Figure 8a shows another detail of the U-Net model’s classification results in the segmentation of Structures.
Because the segmentation thresholds were determined on the training area, the question arises how close they are to the optimal thresholds for the testing area. Table 3 shows the estimates of the thresholds optimized for various quality indicators on the training and on the testing areas, together with the resulting IoU. Based on the results, we can conclude that the thresholds are very close to the optimal values and are therefore valid to use. Please note that in the case of the threshold optimized for accuracy, the optimal threshold decreased the IoU. This is possible because IoU is a different criterion than the one the threshold was optimized for.

5.2. Segmentation of Mounds

The process of training the models for this task differed only in using a different object mask. In this case, the U-Net model significantly outperformed Mask R-CNN. The U-Net model was reinitialized and required app. 500 epochs to converge. Based on the results on the training set, we have determined the optimal values of the various criteria-optimizing thresholds. We have observed that the thresholds differ significantly only when optimized for accuracy, IoU, and MOR10R. When optimized for other criteria, the solution converged to the IoU-optimizing one. Based on the results on the training area, we determined that the U-Net model with no smoothing of the output matrix and with the IoU-optimizing threshold had the best performance. This was also the case on the testing area. Mask R-CNN was trained for 100 epochs; no output smoothing was used.
Table 4 summarizes the quality quantifiers calculated for the models on the testing area. We included three variants of the U-Net model that performed similarly. Smoothing the U-Net model’s output deteriorated its performance in all measures except for MOR10R. Visualizations of the segmentation errors of the best U-Net model are in Figure 8b and Figure 9. Figure 8 also enables a side by side comparison of the output of the best performing U-Net models over the same area in the semantic segmentation of Structures and Mounds.
We investigated the hit rates and the counts of discovered and falsely predicted objects on the testing area. The total number of objects in the Mounds testing area was 1422. Fifty per cent of the objects were sized below 74 m² (small objects, the smallest being 3 m²), and 95% of the objects were below 268 m² (medium objects are larger than 74 m² and smaller than 268 m²). The remaining 5% of the objects were large, with the largest object covering 5510 m². Table 5 summarizes the hit rates of the models. With the best performing U-Net model, the objects that were hit were hit on average by 103 true positive predictions. Please note that the counts of the correctly predicted objects and the falsely predicted objects do not have to add up to the total number of predicted objects. It may happen that a predicted object overlaps with more than one ground truth object and that several predicted objects overlap with a single ground truth object.

6. Discussion

The experimental results indicate the superiority of the U-Net based, randomly initialized models over the Mask R-CNN models that used transfer learning. On the Structures object mask, the models achieved similar results. We did not find a clear reason why Mask R-CNN under-performed on the Mounds object mask. More experiments with various backbones and various parameter settings are needed.
The best performing model trained on the Structures mask sometimes produced false positive predictions in areas that resembled platforms but were of natural origin. On the other hand, some of the mounds were probably misclassified because of the lack of distinctive features captured by the DEM. Figure 10 shows examples of segmentation errors. Figure 10b,c indicate that combining more models by ensembling or boosting could improve the resulting segmentation quality.
By investigating the erroneous predictions of the models, we found possible mistakes in the original object labeling that would render the predictions correct. We must also consider the quality of interpretation by the PARU experts, which may not be a homogeneous comparative quality criterion. While the eastern test area represents the remains of the relatively well researched Maya city Jimbal and its surroundings, verified by excavations and extensive ground-truthing, the western test area, called Sakapuk, has more scattered jungle settlements and only very poorly verified structures. It is therefore possible that in an area with a more homogeneous verification of archaeological structures, the results would be slightly different. In any case, due to the number of objects and their inaccessibility, it is not possible to expect that the basis for evaluation could be chosen ideally. We have formulated several suggestions for the archaeological labeling of LiDAR data:
  • If the labeling was done without ground-truthing, ideally several experts should evaluate each object and indicate the confidence in their decision. The confidence level is valuable information when calculating the CNN loss.
  • If presence of an object was verified by ground-truthing, this should be indicated.
  • When labeling objects of ancient construction activity, besides the maler outlining, the area covering the present-day features belonging to the object should also be indicated.
  • When labeling objects such as looting trenches, agricultural fields, etc., the entire area of the objects’ features should be indicated (for example the trench and the debris hill).
We have tried to identify a single self-standing quantifier that would reflect the semantic segmentation quality well. None of the quantifiers used is optimal, but in our experiments IoU served this purpose the best. MOR10R did not perform as expected, and we will try to develop the idea further. Additional quality aspects need to be considered in the future, for example, the decision support value for the labeling expert and the cost of false positive and false negative predictions. Exact criteria to evaluate the decision support value need to be formulated in an interdisciplinary discussion, as do the methods of evaluating the cost of errors of various types. We assume that including these quantifiers in the CNN learning process in the form of a novel loss function would improve the resulting segmentation quality.
Estimating the decision support value of the system is arguably the most important outcome of the presented research. Canuto et al. [1] state that human experts failed to discover app. 15% of the objects in the research area, but we do not have information on the rates of their false positives and negatives. Judging from the overall hit rate of the U-Net model on the Mounds mask, the system is helpful, but a further improvement by 10%, while lowering the number of falsely predicted objects, would be needed for the system to become a useful research tool. However, we consider the success in identifying medium-sized structures and mounds (74% and 81%, respectively) to be particularly important. Namely, very small objects, where we have a lower success rate, are at the same time the most subjective factor in the interpretation of PARU experts, where interchangeability with natural elements such as uprooted trees, large anthills, or landslides is relatively high. The medium-sized structures are the most objective factor of identification from a practical point of view.

7. Conclusions and Future Work

To identify a certain type of structure captured by LiDAR, the work of at least one, and usually several, experts with many years of experience in the field is required. These experts can identify individual structures from the data. In the case of the PARU Uaxactun project, whose data we used as a basis of our study, this search process lasted from several months to one year of expert work. However, this recognition process can also be learned by a machine, because its characteristics are objectively definable.
Several authors have discussed the problems of applying computer vision to the identification of anthropogenic objects in archaeological prospection using remote sensing data. Casana [44] argues that expert-led analysis may be the best means of generating nuanced, contextual understandings of complex archaeological landscapes. Traviglia et al. [45], on the other hand, state that ongoing research demonstrates there are major benefits in developing computer vision applications for this purpose. The proposed system does not and cannot have the ambition to replace an archaeologist in examining a LiDAR image. However, it can significantly help to speed up the process of reviewing and evaluating tens and hundreds of square kilometers of visual data, often of an ambiguous nature. It can offer suggestions based on learned recognition, which can be constantly expanded. It is in this assistant function that we see the greatest potential.
We have performed semantic segmentation of ancient Maya structures using two DL models based on U-Net and Mask R-CNN. U-Net performed better in our experiments. Re-purposing the original labeling of the source LiDAR data for DL is not without problems, and we have suggested updates to the methodology. A self-standing numerical quantifier that would optimally reflect the quality of the semantic segmentation was not found. We have proposed a novel measure based on counting the misclassified pixels outside a ten-meter radius from true positives, with limited success. The most suitable of the used quantifiers was IoU.
If the percentage of false positives and negatives were similar to that of the human work of experts (and we now have reason to believe that we are approaching this goal), an application based on the proposed procedure could save a large amount of energy and money. We believe that the results we have reached open the door to this possibility. In its development, we can rely on cooperation with the PARU project, which has a wealth of data and an archaeological team that has the capacity to perform ground-truthing of the proposed results. In cooperation with field practice, our development is not only theoretical, but is planned from the beginning as a practical tool for archaeological research. The Maya Lowlands, especially the Guatemalan province of Petén, are covered with a dense layer of jungle, and for a very long time archaeological research there will be based on LiDAR generated data. We plan to identify not only architectural structures but also accompanying infrastructure, such as water sources, roads, and agricultural elements. Another research direction is the classification of the found objects as, for example, astronomical observatories (so-called E Groups), triadic groups, certain characteristic settlement clusters, etc. Because some of these elements are characteristic only of certain periods, it is also possible to capture sets of elements that outline the chronological relevance of the captured objects.
From the technological vantage point, future work includes optimization of the data set labeling process and training sets’ production, further experimental analysis of the CNNs’ setup and training, and development of a specific loss function that would reflect the segmentation’s decision support value for the human expert.
Finally, the instruments we create could also have a wider impact on archaeological research in other parts of the world. In northern Petén, where no recent settlements exist, every identified structure can be attributed to the ancient Maya culture, so the number of possible mistakes decreases significantly. It is precisely this area, and others like it, where a future application could bring considerable help to archaeologists. We believe that it can prominently accelerate the pace of research and the quality of its outputs in the future.

Author Contributions

Data curation: M.K., T.L.; Conceptualization: M.B., M.J., T.T., P.S., M.K., T.L.; Methodology: M.B., M.J., T.T., P.S., M.K., T.L.; Software: M.B., M.J., T.T.; Validation: M.B., M.J., T.T., M.K., T.L.; Supervision: M.B., P.S., M.K., T.L.; Writing-original draft: M.B., M.J., T.T., M.K., T.L.; Funding acquisition: P.S., M.K., T.L.; Visualization: M.B., M.J., T.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Slovak Research and Development Agency under the contract APVV-17-0648, by the Scientific Grant Agency under the contract VEGA 1/0858/17 and by AI4EU—A European AI On Demand Platform and Ecosystem H2020-825619.

Acknowledgments

We thank the PACUNAM-funded PLI project for making LiDAR data available for further research. The authors also gratefully acknowledge PARU experts that labeled and ground verified the data used in this research, and students of the program Intelligent Systems at Dept. of Cybernetics and AI that contributed to the initial research.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

References

  1. Canuto, M.; Estrada-Belli, F.; Garrison, T.; Houston, S.; Acuña, M.; Kováč, M.; Marken, D.; Nondédéo, P.; Auld-Thomas, L.; Cyril, C.; et al. Ancient Lowland Maya Complexity as Revealed by Airborne Laser Scanning of Northern Guatemala. Science 2018, 361, 1–12. [Google Scholar] [CrossRef] [PubMed] [Green Version]
2. Doneus, M. Openness as visualization technique for interpretative mapping of airborne lidar derived digital terrain models. Remote Sens. 2013, 5, 6427–6442.
3. Zakšek, K.; Oštir, K.; Kokalj, Ž. Sky-view factor as a relief visualization technique. Remote Sens. 2011, 3, 398–415.
4. Kattenborn, T.; Eichel, J.; Fassnacht, F.E. Convolutional Neural Networks enable efficient, accurate and fine-grained segmentation of plant species and communities from high-resolution UAV imagery. Sci. Rep. 2019, 9, 17656.
5. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015.
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
7. Anderson, K.; Hancock, S.; Disney, M.; Gaston, K.J. Is waveform worth it? A comparison of LiDAR approaches for vegetation and landscape characterization. Remote Sens. Ecol. Conserv. 2016, 2, 5–15.
8. Fernandez-Diaz, J.C.; Carter, W.E.; Glennie, C.; Shrestha, R.L.; Pan, Z.; Ekhtari, N.; Singhania, A.; Hauser, D.; Sartori, M. Capability assessment and performance metrics for the Titan multispectral mapping lidar. Remote Sens. 2016, 8, 936.
9. Moller, A.; Fernandez-Diaz, J.C. Airborne Lidar for Archaeology in Central and South America. Lidar Magazine, 1 January 2019; 9.
10. Yan, K.; Wang, Y.; Liang, D.; Huang, T.; Tian, Y. CNN vs. SIFT for Image Retrieval: Alternative or complementary? In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 407–411.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
12. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
14. Aloysius, N.; Geetha, M. A Review on Deep Convolutional Neural Networks. In Proceedings of the 2017 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 6–8 April 2017; pp. 0588–0592.
15. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
16. Fiorucci, M.; Khoroshiltseva, M.; Pontil, M.; Traviglia, A.; Del Bue, A.; James, S. Machine Learning for Cultural Heritage: A Survey. Pattern Recognit. Lett. 2020, 133, 102–108.
17. Verschoof-van der Vaart, W.B.; Lambers, K. Learning to Look at LiDAR: The Use of R-CNN in the Automated Detection of Archaeological Objects in LiDAR Data from the Netherlands. J. Comput. Appl. Archaeol. 2019, 2.
18. Gallwey, J.; Eyre, M.; Tonkins, M.; Coggan, J. Bringing lunar LiDAR back down to earth: Mapping our industrial heritage through deep transfer learning. Remote Sens. 2019, 11, 1994.
19. Kazimi, B.; Thiemann, F.; Sester, M. Object Instance Segmentation in Digital Terrain Models. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, Salerno, Italy, 3–5 September 2019; pp. 488–495.
20. Politz, F.; Kazimi, B.; Sester, M. Classification of Laser Scanning Data Using Deep Learning. In Proceedings of the 38th Scientific-Technical Annual Conference of the DGPF and PFGK18, Deutsche Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation (DGPF), Munich, Germany, 7–9 March 2018; Volume 27, pp. 597–610.
21. Soroush, M.; Mehrtash, A.; Khazraee, E.; Ur, J.A. Deep learning in archaeological remote sensing: Automated qanat detection in the Kurdistan region of Iraq. Remote Sens. 2020, 12, 500.
22. Zeggada, A.; Melgani, F.; Bazi, Y. A deep learning approach to UAV image multilabeling. IEEE Geosci. Remote Sens. Lett. 2017, 14, 694–698.
23. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Martinez-Gonzalez, P.; Garcia-Rodriguez, J. A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput. 2018, 70, 41–65.
24. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
25. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014.
26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
27. Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
28. Lambers, K.; Verschoof-van der Vaart, W.B.; Bourgeois, Q.P. Integrating remote sensing, machine learning, and citizen science in Dutch archaeological prospection. Remote Sens. 2019, 11, 794.
29. Moyes, H.; Montgomery, S. Locating cave entrances using lidar-derived local relief modeling. Geosciences 2019, 9, 98.
30. Somrak, M.; Džeroski, S.; Kokalj, Ž. Learning to classify structures in ALS-derived visualizations of ancient Maya settlements with CNN. Remote Sens. 2020, 12, 2215.
31. Pires de Lima, R.; Marfurt, K. Convolutional Neural Network for Remote-Sensing Scene Classification: Transfer Learning Analysis. Remote Sens. 2020, 12, 86.
32. Zingman, I.; Saupe, D.; Penatti, O.A.; Lambers, K. Detection of fragmented rectangular enclosures in very high resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4580–4593.
33. Trier, Ø.; Salberg, A.; Pilø, L. Semi-automatic Mapping of Charcoal Kilns from Airborne Laser Scanning Data Using Deep Learning. In Proceedings of the 44th Conference on Computer Applications and Quantitative Methods in Archaeology (CAA2016), Oslo, Norway, 29 March–2 April 2016; pp. 219–231.
34. Banaszek, Ł.; Cowley, D.C.; Middleton, M. Towards national archaeological mapping. Assessing source data and methodology—A case study from Scotland. Geosciences 2018, 8, 272.
35. Quintus, S.; Day, S.S.; Smith, N.J. The efficacy and analytical importance of manual feature extraction using lidar datasets. Adv. Archaeol. Pract. 2017, 5, 351–364.
36. Kováč, M. Verificaciones de los Rasgos Agrícolas Identificados por LiDAR. In Nuevas Excavaciones en Uaxactun IX; CMS-CHRONOS: Bratislava, Slovakia, 2019; pp. 205–220.
37. Kováč, M. Recorridos, Verificaciones y Rescate en Uaxactun y Alrededor. In Nuevas Excavaciones en Uaxactun IX; CMS-CHRONOS: Bratislava, Slovakia, 2019; pp. 220–230.
38. Hutson, S.R. “Unavoidable Imperfections”: Historical Contexts for Representing Ruined Maya Buildings. In The Past Presented: Archaeological Illustration in the Americas; Dumbarton Oaks Research Library Collection: Washington, DC, USA, 2012.
39. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 9 November 2020).
40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
41. Abdulla, W. Mask R-CNN for Object Detection and Instance Segmentation on Keras and TensorFlow. 2017. Available online: https://github.com/matterport/Mask_RCNN (accessed on 9 November 2020).
42. TensorFlow DeepLab Model Zoo. 2020. Available online: https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md (accessed on 9 November 2020).
43. Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 2017, 12, e0177678.
44. Casana, J. Global-Scale Archaeological Prospection using CORONA Satellite Imagery: Automated, Crowd-Sourced, and Expert-led Approaches. J. Field Archaeol. 2020, 45, S89–S100.
45. Traviglia, A.; Cowley, D.; Lambers, K. Finding common ground: Human and computer vision in archaeological prospection. AARGnews 2016, 53, 11–24.
Figure 1. Examples of visualizations of ancient Maya ruins from the Uaxactun data set (see Section 3) using (a) slope angle and (b) positive openness.
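As a small illustration of how an enhancement such as the slope-angle visualization in Figure 1a can be derived from the DEM, a minimal NumPy sketch is given below. The function name and the assumption of a square cell size are ours; the actual visualizations in the paper follow the cited relief-visualization literature.

```python
import numpy as np

def slope_degrees(dem, cell_size):
    """Slope angle in degrees from a DEM, assuming square cells of size cell_size (m)."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)      # partial derivatives along rows and columns
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
```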
Figure 2. The Titan multispectral LiDAR sensor integrated into a DHC-6 Twin Otter aircraft: (a) Overview of installation layout from the port side of sensor head; (b) View from front looking aft, sensor control rack is in the foreground, sensor head in the background; (c) View of the sensor head through the mapping port of the aircraft [8].
Figure 3. U-Net architecture example [5]. Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
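For orientation, a heavily reduced Keras sketch of the encoder–decoder layout in Figure 3 is shown below. This is not the network used in this study; the depth, filter counts, input size, and the use of "same" padding (instead of the unpadded convolutions shown in the original U-Net figure) are illustrative choices made only for this example.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, n_filters):
    """Two 3x3 convolutions with ReLU, the repeated building block of U-Net."""
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    return x

def small_unet(input_shape=(512, 512, 1), base_filters=32):
    """A reduced U-Net-style encoder-decoder with skip connections."""
    inputs = layers.Input(shape=input_shape)

    # contracting path
    c1 = conv_block(inputs, base_filters)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, base_filters * 2)
    p2 = layers.MaxPooling2D(2)(c2)

    # bottleneck
    b = conv_block(p2, base_filters * 4)

    # expanding path with copied feature maps (skip connections)
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.Concatenate()([u2, c2]), base_filters * 2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.Concatenate()([u1, c1]), base_filters)

    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)  # per-pixel probability
    return Model(inputs, outputs)

model = small_unet()
model.compile(optimizer="adam", loss="binary_crossentropy")
```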
Figure 4. The Uaxactun research area displayed as digital elevation model (DEM).
Figure 5. Visualization of the transformation of the vector representation of a prism into the output mask. (a): The prism as outlined by a PARU expert. (b): Union of definite polygons. (c): The final output mask with holes filled.
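Figure 5 summarizes how an expert-drawn vector prism is converted into a raster training mask: the constituent polygons are merged and the interior holes are filled. The following is a minimal sketch of one possible implementation; the choice of shapely, rasterio and SciPy, the function name, and the assumption that the polygons arrive as shapely geometries in the DEM's coordinate system are ours, not taken from the paper.

```python
from shapely.ops import unary_union
from rasterio import features
from scipy.ndimage import binary_fill_holes

def prism_to_mask(polygons, transform, out_shape):
    """Rasterize the polygons outlining one prism into a binary output mask.

    polygons  : iterable of shapely Polygons in the DEM coordinate system
    transform : affine transform of the target DEM raster
    out_shape : (rows, cols) of the target DEM raster
    """
    merged = unary_union(list(polygons))              # union of the polygons (Figure 5b)
    raster = features.rasterize([(merged, 1)], out_shape=out_shape,
                                transform=transform, fill=0, dtype="uint8")
    return binary_fill_holes(raster.astype(bool))     # fill interior holes (Figure 5c)
```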
Figure 6. Example of the masks used to identify Structures and Mounds.
Figure 7. Predictions of the U-Net model, Structures.
Figure 8. Side-by-side comparison of the U-Net model predictions (detail): (a) on the Structures mask and (b) on the Mounds mask.
Figure 9. Predictions of the U-Net model, Mounds object mask.
Figure 10. Examples of U-Net segmentation errors. (a): Recent features falsely classified as mounds (false positives). (b): False negative segmentation, probably due to the absence of the features that define a mound. (c): Output of the U-Net trained on the Structures mask, which is able to partially identify the area shown in (b) correctly because of the presence of platform features.
Table 1. Models' quality quantifiers, Structures mask. Superior values are in bold.
Model      U-Net    Mask R-CNN
A          0.9875   0.9881
BA         0.7958   0.7484
TPR        0.5973   0.5002
TNR        0.9944   0.9966
PPV        0.6524   0.7246
NPV        0.9929   0.9912
F1         0.6236   0.5918
MOR10R     0.4894   0.5774
IoU_pos    0.4531   0.4203
IoU_neg    0.9874   0.9879
IoU        0.7202   0.7041
MCC        0.6179   0.6015
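As a reading aid, the sketch below shows how the pixel-wise quantifiers in Tables 1 and 4 can be computed from a binary reference mask and a binary prediction. It is our own minimal reconstruction, not the authors' evaluation code: BA is assumed to be balanced accuracy and IoU the mean of the per-class values IoU_pos and IoU_neg (both assumptions are consistent with the numbers in the tables), while MOR10R is omitted because its definition is given in the body of the paper.

```python
import numpy as np

def mask_quality_quantifiers(y_true, y_pred):
    """Pixel-wise quality quantifiers for a binary segmentation mask."""
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)

    tp = float(np.sum(y_true & y_pred))    # true positives
    tn = float(np.sum(~y_true & ~y_pred))  # true negatives
    fp = float(np.sum(~y_true & y_pred))   # false positives
    fn = float(np.sum(y_true & ~y_pred))   # false negatives

    eps = 1e-12  # guard against division by zero
    tpr = tp / (tp + fn + eps)             # sensitivity (recall)
    tnr = tn / (tn + fp + eps)             # specificity
    ppv = tp / (tp + fp + eps)             # precision
    npv = tn / (tn + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    bal_acc = 0.5 * (tpr + tnr)
    f1 = 2.0 * ppv * tpr / (ppv + tpr + eps)
    iou_pos = tp / (tp + fp + fn + eps)    # IoU of the object class
    iou_neg = tn / (tn + fp + fn + eps)    # IoU of the background class
    iou = 0.5 * (iou_pos + iou_neg)        # mean IoU over both classes
    mcc = (tp * tn - fp * fn) / (
        np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps)

    return {"A": acc, "BA": bal_acc, "TPR": tpr, "TNR": tnr, "PPV": ppv,
            "NPV": npv, "F1": f1, "IoU_pos": iou_pos, "IoU_neg": iou_neg,
            "IoU": iou, "MCC": mcc}
```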
Table 2. Hit rates of the models for objects of various sizes in the Structures mask. Superior values are in bold.
Model        Small    Medium   Large    Total
U-Net        0.4318   0.7388   0.9808   0.6054
Mask R-CNN   0.4155   0.6408   0.9615   0.5503
Table 3. Predicted and optimal thresholds for U-Net model optimized on the training set for various criteria.
Optimized for   Predicted   IoU with Predicted   Optimal   IoU with Optimal
Accuracy        10.012      0.7126               10.3429   0.7059
MOR10R          9.2119      0.7202               9.2116    0.7202
IoU             9.2119      0.7202               9.0501    0.7204
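Table 3 compares thresholds predicted by the model with thresholds found by direct optimization on the training set. A minimal sketch of such a direct search is given below, under the assumption that the network produces a continuous score map that is binarized by a single global threshold; the candidate range of 8 to 11 is purely illustrative and chosen only to match the magnitude of the values in the table.

```python
import numpy as np

def iou_pos(y_true, y_pred):
    """Positive-class intersection over union for two binary masks."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return tp / (tp + fp + fn + 1e-12)

def optimal_threshold(score_map, y_true, metric, candidates):
    """Exhaustively search candidate thresholds on the training area and
    return the one that maximizes the given metric (higher is better)."""
    best_t, best_value = None, -np.inf
    for t in candidates:
        value = metric(y_true, score_map >= t)
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value

# Illustrative usage with synthetic data; real inputs would be the network's
# continuous output and the training-area reference mask.
score_map = np.random.uniform(8.0, 11.0, size=(256, 256))
y_true = score_map >= 9.5
t, v = optimal_threshold(score_map, y_true, iou_pos, np.linspace(8.0, 11.0, 301))
```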
Table 4. Models’ quality quantifiers, Mounds object mask. Superior values are in bold.
Model       U-Net     U-Net      U-Net     M-RCNN
Smoothing   none      none       30 × 30   none
Criterion   IoU_ave   Accuracy   MOR10R    n/a
A           0.9966    0.9969     0.9897    0.9967
BA          0.7708    0.7408     0.8208    0.5906
TPR         0.5436    0.4827     0.6506    0.1813
TNR         0.9983    0.9988     0.9910    0.9998
PPV         0.5515    0.6143     0.2176    0.7891
NPV         0.9982    0.9980     0.9986    0.9969
F1          0.5473    0.5406     0.3256    0.2949
MOR10R      0.4705    0.4752     0.3194    0.7558
IoU_pos     0.3768    0.3704     0.1944    0.1729
IoU_neg     0.9967    0.9969     0.9897    0.9967
IoU         0.6868    0.6836     0.5921    0.5848
MCC         0.5456    0.5430     0.3721    0.3773
Table 5. Hit rates of the models for objects of various sizes in the Mounds mask and counts of the predicted objects. OPT: objects predicted total; OCP: objects correctly predicted; OFP: objects falsely predicted. The total number of objects in the Mounds testing area was 1422. Superior values are in bold.
Model    Smoothing   Criterion   Small    Medium   Large    Total    OPT    OCP   OFP
U-Net    none        IoU         0.5577   0.8056   0.9305   0.6882   1509   978   531
U-Net    none        Accuracy    0.5183   0.7978   0.9167   0.6643   1454   944   510
U-Net    30 × 30     MOR10R      0.5254   0.7445   0.8611   0.6411   495    911   127
M-RCNN   none        n/a         0.1167   0.3384   0.5714   0.2585   440    389   37
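To make the object-level columns of Table 5 concrete, the sketch below derives such counts with connected-component labeling. The overlap rule used here (a ground-truth object counts as hit if any of its pixels is predicted, and a predicted component counts as false if it covers no ground-truth pixel) is an assumption made only for this illustration; the authors' exact matching criterion is described in the body of the paper.

```python
import numpy as np
from scipy import ndimage

def object_level_counts(y_true, y_pred):
    """Object-level counts in the spirit of Table 5 (OPT, OCP, OFP)."""
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)

    true_labels, n_true = ndimage.label(y_true)   # ground-truth objects
    pred_labels, n_pred = ndimage.label(y_pred)   # predicted objects (OPT)

    # ground-truth objects covered by at least one predicted pixel (OCP)
    hit_labels = np.unique(true_labels[y_pred])
    ocp = int(np.count_nonzero(hit_labels))       # label 0 is background

    # predicted components that cover no ground-truth pixel (OFP)
    matched_pred = np.unique(pred_labels[y_true])
    ofp = n_pred - int(np.count_nonzero(matched_pred))

    return {"objects_total": n_true, "OPT": n_pred, "OCP": ocp, "OFP": ofp}
```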
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
