1. Introduction
The advent of deep neural networks has created a significant opportunity to automate the documentation and condition assessment of historic infrastructure. A large proportion of the world’s critical transportation infrastructure dates from the mid-19th century [1] and much of it is constructed from brick masonry. Against a backdrop of increased utilisation and tighter maintenance budgets, many of these structures are deteriorating due to their age. As a result, there is substantial pressure to find cost-effective structural assessment methods. In addition, reducing the required time on site to conduct these assessments leads to shorter infrastructure closures and improved health and safety outcomes.
Automated condition assessment procedures could save time both on site and during post-inspection analysis. These methods enable structural defects, such as cracking and spalling, to be automatically labelled onto a digital model of a structure directly from photographic or lidar data. Deep learning-based methods can semantically segment individual damage locations and create detailed structural models that aid an engineer in conducting a structural condition assessment from an office setting. Such methods have been widely developed for modern concrete infrastructure [2]. However, the variety of materials, construction geometries and historic deteriorations present in older masonry structures makes it challenging to develop an automated method that can be trusted to operate effectively in practice [3]. Masonry joints complicate the lining surface profile, making it difficult to apply traditional computer vision methods to detect lining abnormalities in these tunnels [4].
Masonry lined tunnels are particularly difficult to inspect due to the frequent and widespread occurrence of surface damage, such as spalling, efflorescence and water ingress, that can obscure more critical structural issues, such as lining deformations and dislocations [5]. Furthermore, as constructing new tunnels can be expensive and risky [6] and wholescale refurbishment is complex [1], tunnels often need to be continually maintained beyond their design life. This leads to the creation of a wide variety of often poorly documented local patch repairs. In order to consistently differentiate between surface damage and lining deformations, [7] developed a workflow that segments each individual brick instance on a tunnel’s lining from lidar data. They then analysed surface spalling damage on a brick-by-brick basis. A key step in this workflow involves training a deep learning model to semantically segment masonry joint locations so that each masonry block instance can be identified. It is vital that condition assessment tasks are digitalised, as automated digital analysis workflows enable better standardisation and traceability of the reasoning behind maintenance recommendations [8]. Generating consistent and reliable automated structural condition assessments would also pave the way for more effective predictive maintenance strategies [9,10], reducing the cost and improving the safety of an asset manager’s portfolio.
Multiple studies have tackled the task of masonry joint segmentation using supervised deep learning [11,12,13,14], and there has been a significant focus on creating a generalisable method that performs well on unseen structures. Despite the excellent performance achieved within these studies on the datasets analysed, there has been limited industry uptake of these methods. Given the safety-critical nature of tunnel condition assessments, for these methods to be practical in real-world applications, engineers need to be able to trust that such methods will yield acceptable performance on the specific structure to be analysed. In addition, given the variety of masonry tunnel lining surfaces and the black box nature of deep neural networks, it is difficult to intuitively determine whether a structure’s features are in or out of the distribution of the training data and what the requirements are for a specific trained network to perform well. Even for more homogeneous concrete structures, concerns about generalisability have limited adoption.
There are multiple methods that can improve the performance of neural networks on unseen structures, including data augmentation [15], domain adaptation [16] and active learning [17]. However, even when these methods have been adopted, it is vital that the performance can be quantified in the target application for the method to be trusted. For the case of masonry joint segmentation, this would typically involve manually labelling a section of a tunnel before assessing the segmentation performance against standard quality metrics such as the Intersection Over Union (IOU) or the receiver operating characteristic (ROC). This requires time-consuming manual annotation by a trained operator. A further issue with this approach is that lining damage and historical repairs are usually localised, so assessments performed on sample lining data taken from one part of the tunnel may not be representative of the performance throughout the tunnel length.
For deep learning-based tunnel analysis workflows to be effectively applied, it is vital that an engineer can quantify the uncertainty in their predictions [18]. Ideally, this would involve generating uncertainty maps so that further analysis can be conducted to verify the performance or manually correct a model’s segmentation in localised areas. This retains the time-saving benefits of these automated workflows where they perform well but reduces the risk of unverified analysis where segmentations may be inaccurate.
There have been substantial recent developments in uncertainty quantification for deep learning computer vision applications in the healthcare field [19]. However, these methods have seen limited application in infrastructure condition assessment tasks. The aim of this study was to evaluate which uncertainty quantification methods have the most potential for validating deep learning models applied to real-world railway tunnel condition assessment tasks. This study compared methods to determine which were most applicable to the task of masonry joint semantic segmentation from lidar data. It considered how uncertainty quantification could be applied in real-world masonry tunnel condition assessments and demonstrated the utility of creating uncertainty maps. This study also assessed the correlation between segmentation uncertainty and performance. Overall, this study aimed to provide guidance on where uncertainty quantification methods could provide value for an engineer and which would need further development for their insights to be useful and readily interpretable.
2. Methods
As most modern deep learning approaches are black boxes, researchers have looked at various methods to verify whether a neural network prediction is trustworthy. While many studies have focused on how to explain a model’s conclusion [20], for the task of semantic segmentation, many of these methods have been shown to be either misleading or overly complex, which restricts their interpretability [21]. As a result, this study restricted its analysis to assessing the validity of neural network predictions through uncertainty quantification methods. In the field of deep learning research, uncertainty quantification methods focus on identifying, characterising and quantifying the level of uncertainty in a model’s predictions.
2.1. Types of Uncertainty
Uncertainty can be classified into two types: aleatoric and epistemic [22]. Aleatoric uncertainty refers to inherent randomness in the data that can cause changes in the model predictions. In this application, it could be caused by any effect with unpredictable values at a specific pixel, such as noise in the lidar scan or the surface roughness of the masonry lining. Arguably, any feature relationship that is present in the training data but too complex for the network to characterise could also be considered a cause of aleatoric uncertainty, as its impacts are effectively random [23].
Epistemic uncertainty represents uncertainty in the output caused by a lack of understanding of the target domain by the model. This is caused by test data having features that are either not represented in the training set or not properly characterised by the model. This results in the test images being out of the distribution of relationships learned by the network, leading to an unpredictable performance. Features that cause epistemic uncertainty will be those most impacted by the domain shift between different tunnels. Examples would be masonry block sizing and shapes; levels and types of damage observed; and differences in the mortar joint composition, condition and thickness.
Ideally, the pixelwise softmax probability output by the neural network should represent the level of confidence that the model has in its prediction. However, modern convolutional neural network designs are often challenging to properly calibrate within the training data domain and tend to yield confident but incorrect predictions on out-of-domain test data [24]. As a result, uncertainty quantification methods have been developed to assess a neural network’s uncertainty when applied to a specific dataset. Monte Carlo dropout (MCD) and test time augmentation (TTA) are two commonly used methods that were chosen for this study.
Unsupervised methods for anomaly segmentation and associated uncertainty estimation have also been developed [25]. These methods typically involve training a student network against a teacher network and have recently been applied to medical images [26] and structural surface damage detection [27]. However, these methods are not designed to quantify the aleatoric and epistemic uncertainties of the output of existing trained models, so they are not examined further.
2.2. Monte Carlo Dropout (MCD)
Monte Carlo dropout enables the estimation of epistemic uncertainty by varying the neural network design when testing the target dataset. Dropout was initially developed to prevent model overfitting and involves randomly omitting feature detectors during training each time the gradients are updated [28]. This prevents the network from learning overly complex and probably meaningless feature relationships that are unique to individual training samples and would otherwise reduce performance on test data.
While the network is usually frozen and all neurons are retained for testing, MCD involves applying dropout during testing. The testing data are run through the network with different neurons dropped out each time and the resulting variations in output give an indication of the level of uncertainty in the network’s predictions. First proposed by [29] and theoretically proven as a Bayesian equivalent in [30], test time dropout approximates the posterior distribution of the network’s weights by Monte Carlo sampling the network’s predictions. By assessing the model variance on a particular test image, it is possible to assess the epistemic uncertainty of the prediction [31].
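As a concrete illustration, the sketch below shows a minimal MCD inference loop in PyTorch. It assumes a trained binary segmentation model containing dropout layers; the function name, sample count and sigmoid output head are illustrative assumptions rather than details taken from any of the cited implementations.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, image: torch.Tensor, n_samples: int = 100):
    """Run n_samples stochastic forward passes with dropout left active.

    image: (1, C, H, W) input patch. Returns the per-pixel mean foreground
    probability and its standard deviation across the sampled passes.
    """
    model.eval()  # keep normalisation layers in inference mode
    for module in model.modules():
        # Re-enable only the dropout layers; the rest of the network stays frozen.
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.sigmoid(model(image)) for _ in range(n_samples)], dim=0
        )
    return probs.mean(dim=0), probs.std(dim=0)
```

The per-pixel standard deviation returned here is exactly the quantity used later in this paper to build uncertainty maps.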
MCD has been applied relatively extensively for semantic segmentation tasks in the medical field, where quantifying the uncertainty of medical imaging analyses is vital for a clinician to make informed decisions about a patient’s treatment options [32,33,34,35,36]. However, while some studies have applied MCD for the uncertainty quantification of construction object segmentation [37] and concrete damage assessment [38,39,40], it has not been applied to semantic segmentation tasks on older, less homogeneous infrastructure, such as masonry lined tunnels. This is despite the need for similar safety-critical decisions to be made on these structures based on a neural network segmentation output.
2.3. Test-Time Augmentation (TTA)
Test-time augmentation (TTA) is a commonly applied method for estimating the heteroscedastic aleatoric uncertainty in a model’s prediction. While data augmentation during training to improve test time performance is well documented [15], Ref. [41] were the first to apply test-time augmentation to help understand the aleatoric uncertainty in a model’s output. They made 128 copies of their test samples and ran them through standard training image augmentations to produce 128 variants of their test data. They then put these images into their trained image classification network and observed the variations in outputs between the transformed images. Ref. [42] later formalised this method and assessed its potential for uncertainty quantification using the Volume Variation Coefficient (VVC) of segmented structures for brain tumour segmentation. They showed a negative correlation between the VVC, which is calculated by dividing the segmentation volume variance by its mean, and the segmentation Dice score. Although it is less commonly adopted than MCD, TTA has been used for the uncertainty quantification of other types of medical images [43,44]. It has not previously been applied to structural condition assessments.
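A minimal TTA loop can be sketched in the same style as the MCD example above. Geometric augmentations are inverted before the outputs are compared so that all predictions are spatially aligned; the augmentation set, noise level and sample count here are illustrative assumptions.

```python
import torch

def tta_predict(model, image, n_samples=50, noise_sigma=0.05):
    """image: (1, 1, H, W). Returns per-pixel mean and std over augmented passes."""
    model.eval()
    preds = []
    with torch.no_grad():
        for _ in range(n_samples):
            x = image + noise_sigma * torch.randn_like(image)  # photometric: additive noise
            flip_h = bool(torch.rand(1) < 0.5)
            flip_v = bool(torch.rand(1) < 0.5)
            if flip_h:
                x = torch.flip(x, dims=[-1])
            if flip_v:
                x = torch.flip(x, dims=[-2])
            y = torch.sigmoid(model(x))
            # Undo the geometric transforms so all outputs are aligned.
            if flip_h:
                y = torch.flip(y, dims=[-1])
            if flip_v:
                y = torch.flip(y, dims=[-2])
            preds.append(y)
    probs = torch.stack(preds, dim=0)
    return probs.mean(dim=0), probs.std(dim=0)
```

Unlike MCD, the network itself is deterministic here; all of the output variation comes from perturbing the input, which is why TTA targets aleatoric rather than epistemic uncertainty.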
2.4. Study Contributions
The objective of this research was to analyse the potential of MCD and TTA for quantifying the uncertainty of neural networks trained for masonry joint segmentation from lidar data. This will increase the trustworthiness of automated masonry-lined tunnel condition assessment procedures that rely on deep learning-based masonry joint segmentation. This study aimed to provide a method for automatically highlighting anomalous segmentations to an engineer so that they can be manually adjusted or removed. It provides the following three contributions:
A comparison of MCD with TTA for structural condition assessment uncertainty quantification.
An analysis of whether the uncertainty can effectively predict performance.
Consideration of the usefulness of generated uncertainty maps.
This paper briefly outlines the procedure for training a neural network for masonry joint semantic segmentation developed by [7]. It then adapts the model to analyse two uncertainty quantification methods: Monte Carlo dropout and test-time augmentation. This study compared these methods and investigated the correspondence between the uncertainty and model performance.
2.5. Datasets
Lidar surveys were taken of the linings of four different masonry-lined railway tunnels in the southwest of England. Managed by Network Rail, the UK’s railway infrastructure asset owner, the tunnels were built in the 1850s at the time of construction of the railway. Tunnels 1 and 3 were lined with stone masonry, while Tunnels 2 and 4 were brick-lined. They were all in a serviceable condition; however, large areas of shallow spalling and mortar loss (<10 mm depth), alongside efflorescence, were present on the masonry. Three-dimensional point clouds were created from the lidar data. These were grid-sampled to a minimum point spacing of 5 mm to reduce the computing power required for subsequent analysis.
2.6. Data Preprocessing
The 3D point clouds were further processed to prepare them for neural network training. Although there are multiple state-of-the-art 3D deep learning methods for the semantic segmentation of 3D point clouds [45], applying 2D vision methods to rasterised depth map images of the tunnel lining has been shown to be the most effective for masonry tunnel joint semantic segmentation [13]. The data preparation workflow followed in [7] was therefore used in this study.
A cylinder was fitted to the tunnel point cloud using principal component analysis, and the point cloud was then unrolled to flatten the tunnel lining. The vertical offsets of the resulting point cloud were then rasterised into a 2D float32 image depicting a depth map of the tunnel lining. The pixel intensities in the depth map image corresponded to the out-of-plane distance of each point from an ideal cylindrical tunnel lining. The joint locations were manually labelled on these images to create ground truth joint location masks. The joints were labelled as constant 9-pixel-width lines using QGIS 3.38.2. This was challenging in places, as in some locations the mortar joints were very narrow or the mortar was level with the brick surface. This made it difficult to identify each brick from the depth map images. Additionally, as shown in Figure 1, the surface damage made it difficult to visually determine the locations of the masonry joints in the point cloud from either the 3D or intensity data.
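A minimal numpy sketch of this unrolling step is given below. The ideal cylinder radius r0, the 5 mm pixel size and the last-write rasterisation are simplifying assumptions; the published workflow in [7] should be consulted for the exact procedure.

```python
import numpy as np

def unroll_to_depth_map(points: np.ndarray, r0: float, px: float = 0.005):
    """points: (N, 3) lidar points in metres. Returns a float32 depth map."""
    centred = points - points.mean(axis=0)
    # The first principal component of the cloud approximates the tunnel axis.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    axis = vt[0]
    s = centred @ axis                      # distance along the tunnel axis
    radial = centred - np.outer(s, axis)    # component normal to the axis
    r = np.linalg.norm(radial, axis=1)
    theta = np.arctan2(radial @ vt[2], radial @ vt[1])  # angle around the axis
    # Unrolled plan coordinates: along-axis vs. circumferential distance.
    u = ((s - s.min()) / px).astype(int)
    v = ((theta - theta.min()) * r0 / px).astype(int)
    depth = np.full((v.max() + 1, u.max() + 1), np.nan, dtype=np.float32)
    # Offsets from the ideal cylinder become pixel intensities. Where several
    # points fall in one pixel, the last write wins; a real pipeline would aggregate.
    depth[v, u] = r - r0
    return depth
```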
After the labelling step, the images were upsampled such that each tunnel had the same average number of pixels per masonry block. Tunnel 3 had the largest masonry blocks, so Tunnels 1, 2 and 4 were upsampled to match. The data from each tunnel were then split into training, validation and testing sets in a 3:1:1 ratio. Finally, the images were split into 384 × 384-pixel patches, as sketched below. This was performed to ensure that the neural networks could be trained within the available VRAM constraints.
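The patch-splitting step itself is straightforward; a minimal sketch follows, in which edge remainders are discarded for simplicity (an assumption, not necessarily the choice made in this study).

```python
import numpy as np

def split_into_patches(image: np.ndarray, size: int = 384):
    """Tile a rasterised depth map into non-overlapping size x size patches."""
    h, w = image.shape[:2]
    return [
        image[i:i + size, j:j + size]
        for i in range(0, h - size + 1, size)
        for j in range(0, w - size + 1, size)
    ]
```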
2.7. Neural Network Training
In order to focus on the uncertainty quantification performance, a basic U-Net style neural network was chosen for the analysis. The network architecture from [46] was used. This consisted of 4 downsampling convolutional layers in an encoder followed by 4 upsampling convolutional layers in the decoder. The model was trained using the hyperparameters outlined in Table 1 on a 10-core Intel Xeon 6138 CPU with 48 GB of system RAM and an Nvidia V100 GPU.
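For orientation, a compact PyTorch sketch of such a network is given below. The channel widths, dropout placement and single-channel input are illustrative assumptions and not the exact architecture of [46].

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with ReLU activations.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512), p_drop=0.5):
        super().__init__()
        self.enc = nn.ModuleList()
        c = 1  # single-channel depth map input (an assumption)
        for w in widths:                      # 4 downsampling stages
            self.enc.append(conv_block(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(conv_block(c, c * 2), nn.Dropout2d(p_drop))
        self.up = nn.ModuleList()
        self.dec = nn.ModuleList()
        c = c * 2                             # bottleneck output channels
        for w in reversed(widths):            # 4 upsampling stages
            self.up.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.dec.append(conv_block(w * 2, w))
            c = w
        self.head = nn.Conv2d(widths[0], 1, 1)  # joint / non-joint logits

    def forward(self, x):
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)   # keep pre-pooling features for skip connections
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```

A 384 × 384 input is divisible by 2 four times, so the four pooling and four transposed-convolution stages return the output to the input resolution.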
2.8. Quantifying Epistemic and Aleatoric Uncertainties
It is necessary to holistically assess the performance of each uncertainty quantification method given the wide possible variations in the feature distributions of both the trained neural network selected and the target tunnel lining. As the aim of the uncertainty quantification was to indicate the neural network’s performance and applicability to a specific tunnel, performance on both the in- and out-of-distribution tunnel data needed to be assessed.
Four different tunnels were available for this study, so the neural network was trained three times using different hyperparameters to simulate a total of 12 different domain-shift scenarios. This created different levels of epistemic uncertainty. The details of these networks are described in Table 2. The differences between the trained networks acted as a proxy for the wide variety of possible differences in features between the training and testing data tunnels due to their geometries, material types and damage levels. The aleatoric uncertainty also needed to be modelled. In order to artificially create uncertainty in the test data, random Gaussian and Perlin noise was added to each test image. Two levels of noise were chosen, with scale factors of 0.15 and 0.3 applied to the magnitude of the noise. All data augmentations were implemented using Albumentations 1.4.10.
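The corruption step can be sketched as follows. Albumentations has no built-in Perlin transform, so a smoothly interpolated value-noise grid stands in for Perlin noise here; the grid resolution and the scaling relative to the depth-map contrast are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def corrupt(depth: np.ndarray, scale: float, grid: int = 12, rng=None):
    """Add Gaussian plus Perlin-like noise to a depth map, scaled by `scale`
    (0.15 or 0.3 in this study)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = depth.shape
    gaussian = rng.standard_normal((h, w))
    # A smoothly interpolated coarse random grid approximates Perlin noise.
    coarse = rng.standard_normal((grid, grid))
    perlin_like = zoom(coarse, (h / grid, w / grid), order=3)[:h, :w]
    amplitude = scale * np.nanstd(depth)  # scale relative to depth-map contrast
    return depth + amplitude * (gaussian + perlin_like)
```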
2.9. Data Augmentations
Training data augmentations are transformations applied to training data to artificially increase its volume and variety. This widens the feature distribution of the training data and helps to increase the robustness of the trained network so that it generalises better to different datasets. Networks A, B and C were trained with random vertical and horizontal flips. Through trial and error, the data augmentations that led to the best test data performance were determined and applied for Network A. Network A was trained with the following additional augmentations applied randomly (a sketch of such a pipeline follows the list):
Brightness shifts—offsets in all pixel values. This simulated different tunnel shapes.
Contrast adjustments—scaled the pixel values. This simulated different mortar joint depths.
Perlin noise—randomly generated gradient offsets. This simulated masonry surface roughness changes.
Gaussian noise—random pixel value offsets. This simulated noise that occurred during the data collection due to the accuracy of the laser scanning equipment.
Elastic transforms—random displacement maps. This simulated masonry deformations.
Crop and resize—up- and downscaling of the input image. This simulated different masonry block sizes.
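The sketch below assembles an Albumentations pipeline of the kind described above. The probabilities and limits are illustrative assumptions, exact transform signatures vary between Albumentations versions, and the Perlin-style noise is delegated to a hypothetical custom function.

```python
import albumentations as A

def add_perlin(image, **kwargs):
    # Placeholder for a custom Perlin-style noise function (Albumentations
    # has no built-in Perlin transform); see the `corrupt` sketch above.
    return image

train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    A.GaussNoise(var_limit=(0.001, 0.01), p=0.3),   # lidar sensor noise
    A.Lambda(image=add_perlin, p=0.3),              # masonry surface roughness
    A.ElasticTransform(alpha=50, sigma=8, p=0.3),   # masonry deformations
    A.RandomResizedCrop(height=384, width=384, scale=(0.6, 1.0), p=0.5),  # block sizes
])

# Geometric transforms are applied to the joint mask automatically when it is
# passed alongside the image:
# out = train_transform(image=depth_patch, mask=joint_mask)
```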
3. Results
3.1. Neural Network Performance
Each network was trained on Tunnel 1 before being applied to the test data of all four tunnels. The performance was assessed using the Intersection Over Union (IOU) score, which compares the number of True Positive (TP) predictions with the False Positive (FP) and False Negative (FN) predictions. It is considered more useful than Accuracy, Precision or Recall values alone, as it only considers the target positive class and is insensitive to class imbalance. It is calculated as shown in Equation (1):

IOU = TP / (TP + FP + FN)    (1)
For masonry block documentation, it is vital that the segmented masonry joints fully enclose each block in order to separate block instances. The exact location of each masonry joint, while contributing to the masonry block segmentation performance, is less critical than the identified block sizes and shapes, which are dependent on the joint connectivity. While a segmentation with a high masonry joint IOU would also generate a good block segmentation, a small offset of the detected joints leads to a substantial decrease in the joint IOU, despite having a minimal impact on the overall block documentation performance. Typically, the IOU would be applied to assess the performance of the positive class (masonry joint locations) alone. However, in this case, the IOU was calculated blockwise on the negative class using the method developed in [47]. A connected component analysis was conducted to identify the separate masonry block instances. The segmented blocks were then assigned by centroid location to their nearest ground truth block, and the IOU was calculated between each of these assignments. The average IOU per detected block was then calculated to give the blockwise IOU used here.
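The sketch below gives one possible reading of this blockwise IOU using scipy connected-component labelling; the centroid-matching and tie-breaking details are assumptions rather than the exact method of [47].

```python
import numpy as np
from scipy import ndimage

def blockwise_iou(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """pred_joints, gt_joints: boolean masks where True marks mortar joints."""
    # Connected components of the negative (block) class.
    pred_blocks, n_pred = ndimage.label(~pred_joints)
    gt_blocks, n_gt = ndimage.label(~gt_joints)
    gt_centroids = np.array(
        ndimage.center_of_mass(~gt_joints, gt_blocks, list(range(1, n_gt + 1)))
    )
    ious = []
    for i in range(1, n_pred + 1):
        mask = pred_blocks == i
        c = np.array(ndimage.center_of_mass(mask))
        # Assign the predicted block to its nearest ground truth block centroid.
        j = 1 + np.argmin(np.linalg.norm(gt_centroids - c, axis=1))
        gt_mask = gt_blocks == j
        ious.append((mask & gt_mask).sum() / (mask | gt_mask).sum())
    return float(np.mean(ious)) if ious else 0.0
```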
As using MCD required training with dropout enabled, the networks were trained with and without dropout in order to assess whether using dropout had negative performance impacts. The performances of the networks trained without dropout are shown in Table 3 and the outputs are visualised in Figure 2. It is clear that Network A had a better generalisation performance than B and C. The overfitting of Network B to Tunnel 1 yielded the best results on the Tunnel 1 test data at the expense of Tunnels 2, 3 and 4’s performances. The performances on Tunnels 2 and 4 were worse than on Tunnel 3. This was likely due to the differences in features between the stone- and brick-lined tunnels. Although the decrease in performance when moving from Network A to Networks B and C is qualitatively visible in Figure 2 for Tunnels 2 and 4, the IOU shown in Table 3 decreased disproportionately. This was because even when the joints were segmented largely correctly, small gaps in the joints connected adjacent block instances, which led to a substantial breakdown in the block segmentation performance.
The model was then trained with dropout enabled, with a dropout probability of 0.5 for each neuron. As shown in Table 4, this produced slightly higher but largely similar results. Tunnel 4’s performance was particularly improved. This was possibly due to the regularising effect of using dropout.
The impact of the artificially added noise on the neural network performance was also assessed. Figure 3 shows how the added noise impacted the segmentations of a section of testing data taken from Tunnel 3. Figure 4 visualises how the distribution of performance decreased for each tunnel segment when the amount of added noise was increased. It can clearly be seen that for every network, adding noise decreased the performance. For Network B, the change was less pronounced. This was likely because it achieved a poor performance on Tunnels 2, 3 and 4 in the no-added-noise case, but a good performance on Tunnel 1. Adding noise had little impact when the performance was already low. This is reflected in the visualisations in Figure 3.
3.2. Uncertainty Metrics
The MCD and TTA each produced a set of output segmentations that needed to be compared for the uncertainty to be quantified. Two different uncertainty metrics were chosen: the UMIOU and the Area Variation Coefficient (AVC), which, analogously to the VVC described in Section 2.3, divides the variance of the segmented joint area across samples by its mean.
Uncertainty maps were also generated for each segmented image patch by calculating, for each pixel, the standard deviation of the segmentation predictions across the samples; each pixel value on the uncertainty map was then set to the calculated standard deviation.
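A minimal sketch of these computations is shown below, assuming a stack of per-pixel joint probabilities produced by either MCD or TTA sampling; the AVC definition used here is inferred from the VVC analogy in Section 2.3.

```python
import numpy as np

def uncertainty_outputs(prob_stack: np.ndarray, threshold: float = 0.5):
    """prob_stack: (n_samples, H, W) per-pixel joint probabilities."""
    uncertainty_map = prob_stack.std(axis=0)           # per-pixel standard deviation
    areas = (prob_stack > threshold).sum(axis=(1, 2))  # segmented joint area per sample
    avc = areas.var() / max(areas.mean(), 1e-9)        # area variation coefficient
    return uncertainty_map, avc
```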
3.3. Test-Time Augmentation Results
Test-time augmentation was applied to each of the three neural networks after they had been trained with dropout enabled. Gaussian noise, Perlin noise, brightness and contrast shifts, and random vertical and horizontal flips were chosen as the augmentations. Each 384 × 384 test image crop was augmented in 50 different ways. The UMIOU and AVC were calculated between the 50 segmentations for each image.
The generated uncertainty maps are visualised in Figure 5 for an image patch with varying levels of added noise in the input image. For Network A, although the AVC increased with increasing noise levels, visually, there appeared to be fewer areas of uncertainty. As can be seen in Figure 3, the network identified fewer joint locations as the noise increased. This suggests that if the noise level is significant enough to completely obscure image features, then the network will be more confident in its prediction, even though it is incorrect. The AVC was able to reflect the increased uncertainty because it is normalised by the area of predicted joints, so it was robust to the decreased segmentation area. While Networks B and C showed decreases in the UMIOU with the increased noise, it is not clear from viewing the uncertainty maps alone that the level of uncertainty increased.
The distribution of increases in the AVC over all the image test patches is visualised within the histograms in Figure 6. The mean AVC value increased for all the networks when the level of added noise was increased. For Network B, the increases were lower. As Network B had a poor generalisation performance, it was unable to effectively characterise the out-of-distribution data. It was less able to reduce its prediction confidence in uncertain situations. As a result, it likely produced universally more confident predictions due to incorrectly identifying the features in the test data as being those from the training data.
The AVC and UMIOU uncertainties on each patch are plotted against the segmentation performance in Figure 7. No clear relationships between the uncertainty value and the segmentation performance were observed for the TTA output.
3.4. Monte Carlo Dropout Results
The effectiveness of applying Monte Carlo dropout was analysed for each of the three trained neural networks. Dropout was applied during inference with a probability of 0.5. Each 384 × 384 test image crop was put through the trained network 100 times. The UMIOU and AVC were calculated between each of the 100 Monte Carlo-sampled segmentations for each image. The generated uncertainty maps are visualised in Figure 8 for an image patch from both a section of Tunnel 1 data, which should have a low level of epistemic uncertainty, and a section of Tunnel 3, which should have a higher level of epistemic uncertainty.
For Network A, the uncertainty maps clearly show the locations where varying segmentations are possible. For example, there was clearly uncertainty in Network A’s segmentation at the top of the Tunnel 1 patch. Observing the regularity of the masonry, it was likely that at this location, masonry cracks were being predicted as joints. For Network B, there was a substantial increase in both the AVC and the level of uncertainty that could be qualitatively observed. The change in the epistemic uncertainty was the most extreme for Network B, as it was overfitted to Tunnel 1. This demonstrates how uncertainty maps can be used to determine which tunnel areas are likely to be within the distribution of the trained model.
The correlation between the MCD uncertainty and segmentation performance can be observed in Figure 9. There was a weak negative correlation between the AVC and IOU for Networks B and C; however, it was less pronounced for Network A. This was possibly because Network A generalised better to unseen data than Networks B and C, which would have caused the epistemic uncertainty to have less of a negative performance impact. An alternate explanation is that Network A could have captured a broader variety of feature relationships than Networks B and C due to its exposure to a wider variety of training data. These more complex relationships would be more adversely impacted by applying dropout than the simpler relationships in Networks B and C. This would have led to the MCD generating universally higher levels of uncertainty and suggests that MCD can be used to indicate the generalisability of a trained network. A further analysis covering a wider variety of datasets would need to be conducted to fully separate the impacts of these factors.
3.5. Uncertainty Visualisations
Having analysed the capabilities of each uncertainty quantification method, it is important to be able to visualise each metric in an accessible and informative way. We propose projecting the AVC scores of each 384 × 384 image patch onto the output segmentation maps. This enables an engineer to efficiently scan the tunnel lining for areas with high uncertainty. The pixelwise uncertainty maps should be inspected when a specific area requires a more detailed inspection. They can then be used to view alternate possible segmentations. An example of this is shown applied to a section of the Tunnel 3 test data in Figure 10. The correlation between the segmentation performance and MCD uncertainty can be observed.
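A minimal matplotlib sketch of this projection is shown below; the patch layout, colour map and transparency are presentation assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_avc_overlay(segmentation: np.ndarray, patch_avc: np.ndarray, patch: int = 384):
    """segmentation: (H, W) joint mask; patch_avc: (H//patch, W//patch) AVC scores."""
    # Expand each patch score to pixel resolution so it aligns with the map.
    avc_map = np.kron(patch_avc, np.ones((patch, patch)))
    h, w = segmentation.shape
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.imshow(segmentation, cmap="gray")
    im = ax.imshow(avc_map[:h, :w], cmap="inferno", alpha=0.4)  # heat overlay
    fig.colorbar(im, ax=ax, label="AVC (uncertainty)")
    ax.set_axis_off()
    plt.show()
```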
Without uncertainty maps, it would be necessary for an engineer to inspect every segmentation map in detail to validate the segmentations. However, Section 3.3 and Section 3.4 show that although high levels of uncertainty were correlated with a poor performance, it is possible for a patch with a low epistemic or aleatoric uncertainty to also generate a poor segmentation. As a result, areas with a poor performance, where the segmentation needs to be manually analysed and corrected, cannot be exclusively determined using uncertainty values. It is necessary for an engineer to take a holistic approach when identifying locations with a poor segmentation performance. The following workflow is suggested:
1. An engineer should identify the typical masonry block dimensions from the segmentation maps. If there are multiple types of masonry present, then the engineer should conduct the following steps over each type of masonry in turn, as uncertainty values are not directly comparable between areas with substantially different properties.
2. The TTA and MCD image patches should be sorted by their uncertainty levels.
3. Starting from the patches with the highest uncertainty, the patch predictions should be observed alongside the pixelwise uncertainty values and the input depth map. If the predicted joint locations do not appear realistic, then the segmentation should be manually corrected. The MCD pixel uncertainty maps reflect the segmentation outputs of variants of the trained neural network and can therefore be used as a guide to identify more realistic segmentation candidates.
4. Step 3 should be conducted for patches with progressively smaller uncertainties until the observed patches have qualitatively acceptable segmentations.
It is necessary to account for areas where the segmentation performance may be poor despite a low identified level of uncertainty. These regions are likely caused by abnormalities in the input depth map arising from tunnel features that were not encountered during training and are challenging to accurately identify from the depth map alone. While many of these cases will lead to epistemic uncertainty, it is possible for the network to be confidently incorrect if a joint is not visible in the depth map. This may occur, for example, if the mortar is level with the masonry surface. In addition, high levels of noise are not always detected by TTA as aleatoric uncertainty if no reasonable segmentations can be generated. As a result, the engineer should conduct further pixelwise segmentation verification in areas they have identified as anomalous during their on-site qualitative inspection of the tunnel.
Although this method is not guaranteed to remove all incorrect segmentations, it is a cost-effective procedure for improving the segmentation performance given limited available manual analysis time and would substantially reduce the analysis time compared with the full manual labelling of masonry block locations.
4. Discussion and Conclusions
Both test time augmentation (TTA) and Monte Carlo dropout (MCD) generated uncertainty maps that could aid in the interpretability of deep learning-based masonry joint segmentations. Using MCD enabled alternate possible segmentations to be viewed, which visualised the sensitivity of the output to the neural network training environment. The Area Variation Coefficient (AVC) could be used to demonstrate how increased epistemic uncertainty led to decreased segmentation performance. However, the uncertainty values generated by the Monte Carlo dropout were sensitive to aleatoric uncertainty. Adding noise decreased the AVC of the MCD uncertainties.
Applying TTA showed how small changes in the input data led to changes in the segmentation output. Qualitatively, the produced TTA uncertainty maps were strongly dependent on how robust a trained network was to noise. The AVC of the TTA outputs was increased with increased noise levels, which enabled it to be used as an indicator of the quality of the input images and the resulting aleatoric uncertainty. However, it was not shown to correlate with the segmentation performance, as this was strongly driven by the epistemic uncertainty caused by the domain shift between the tunnels analysed.
For both the TTA and MCD scenarios, the UMIOU was not shown to be a useful metric of uncertainty. The UMIOU did not correlate in most cases with the AVC or the perceived level of uncertainty observed in the uncertainty maps. As a result, the AVC is proposed as the standard segmentation uncertainty evaluation metric.
There was a substantial runtime increase when implementing the uncertainty quantification methods. The runtime increase was proportional to the number of augmentations or dropout variations that were assessed, since the inference needed to be computed for every Monte Carlo and augmentation sample. For this study, implementing TTA and MCD increased the inference time by approximately 1500%, as 100 MCD samples and 50 TTA samples were used. Reducing the number of samples reduced the effectiveness of the methods, as it was necessary to generate a distribution of outputs in order to confidently estimate their mean and standard deviation. With a low number of Monte Carlo samples, key augmentation/dropout permutations could be missed, generating misleading results. It is recommended that MCD and TTA are only implemented on standard office hardware when the computational cost of inference is small. Alternatively, cloud computing services could be rented for the inference process. This would prevent the analysis time from forming a bottleneck in conducting a condition assessment without requiring the purchase of expensive specialist hardware that may only have occasional use over the lifetime of a project.
To conclude, applying both TTA and MCD provided valuable insights into the uncertainty of masonry block segmentation outputs, and AVC score maps could be effectively used to indicate to a tunnel inspector the locations where a neural network has high levels of uncertainty. Within a specific tunnel, the AVC value correlated with the segmentation performance, which can enable an inspector to easily identify the most effective locations to conduct manual validation or correction of the output masonry joint segmentation map. However, there are limitations with using uncertainty alone as a proxy for segmentation performance since their power is dependent on the specific trained neural network. It is suggested that the level of uncertainty is calibrated against the segmentation performance on samples of unseen testing data before being applied in practice. A well-trained and generalised neural network should generate more strongly correlated uncertainty scores. However, it is still necessary to conduct a qualitative visual inspection of a tunnel lining to identify where obvious lining anomalies may impact the segmentation performance.
In order to determine how uncertainty maps could be integrated into real-world tunnel condition assessments, further work needs to be conducted to analyse how accessible and interpretable these methods are for engineers who are not familiar with machine learning. Furthermore, asset managers need to be consulted to determine how far uncertainty quantification methods would improve their perception of the trustworthiness of automated tunnel analysis workflows.