1. Introduction
The advent of deep neural networks has created a significant opportunity to automate the documentation and condition assessment of historic infrastructure. A large proportion of the world’s critical transportation infrastructure dates from the mid-19th century [1] and much of it is constructed from brick masonry. Against a backdrop of increased utilisation and tighter maintenance budgets, many of these structures are deteriorating due to their age. As a result, there is substantial pressure to find cost-effective structural assessment methods. In addition, reducing the required time on site to conduct these assessments leads to shorter infrastructure closures and improved health and safety outcomes.
Automated condition assessment procedures could save time both on site and during post-inspection analysis. These methods enable structural defects, such as cracking and spalling, to be automatically labelled onto a digital model of a structure directly from photographic or lidar data. Deep learning-based methods can semantically segment individual damage locations and create detailed structural models that aid an engineer in conducting a structural condition assessment from an office setting. Such methods have been widely developed for modern concrete infrastructure [2]. However, the variety of materials, construction geometries and historic deteriorations present in older masonry structures makes it challenging to develop an automated method that can be trusted to operate effectively in practice [3]. Masonry joints complicate the lining surface profile, making it difficult to apply traditional computer vision methods to detect lining abnormalities in these tunnels [4].
Masonry lined tunnels are particularly difficult to inspect due to the frequent and widespread occurrence of surface damage, such as spalling, efflorescence and water ingress, that can obscure more critical structural issues, such as lining deformations and dislocations [5]. Furthermore, as constructing new tunnels can be expensive and risky [6] and wholescale refurbishment is complex [1], tunnels often need to be continually maintained beyond their design life. This leads to the creation of a wide variety of often poorly documented local patch repairs. In order to consistently differentiate between surface damage and lining deformations, [7] developed a workflow that segments each individual brick instance on a tunnel’s lining from lidar data. They then analysed surface spalling damage on a brick-by-brick basis. A key step in this workflow involves training a deep learning model to semantically segment masonry joint locations so that each masonry block instance can be identified. It is vital that condition assessment tasks are digitalised, as automated digital analysis workflows enable better standardisation and traceability of the reasoning behind maintenance recommendations [8]. Generating consistent and reliable automated structural condition assessments would also pave the way for more effective predictive maintenance strategies [9,10], reducing the cost and improving the safety of an asset manager’s portfolio.
Multiple studies have tackled the task of masonry joint segmentation using supervised deep learning [11,12,13,14], and there has been a significant focus on creating a generalisable method that performs well on unseen structures. Despite the excellent performance achieved within these studies on the datasets analysed, there has been limited industry uptake of these methods. Given the safety-critical nature of tunnel condition assessments, for these methods to be practical in real-world applications, engineers need to be able to trust that such methods will yield acceptable performance on the specific structure to be analysed. In addition, given the variety of masonry tunnel lining surfaces and the black box nature of deep neural networks, it is difficult to intuitively determine whether a structure’s features are in or out of the distribution of the training data and what the requirements are for a specific trained network to perform well. Even for more homogeneous concrete structures, concerns about generalisability have limited adoption.
There are multiple methods that can improve the performance of neural networks on unseen structures, including data augmentation [15], domain adaptation [16] and active learning [17]. However, even when these methods have been adopted, it is vital that the performance can be quantified in the target application for the method to be trusted. For the case of masonry joint segmentation, this would typically involve manually labelling a section of a tunnel before assessing the segmentation performance against standard quality metrics such as the Intersection Over Union (IOU) or the receiver operating characteristic (ROC). This requires time-consuming manual annotation by a trained operator. A further issue with this approach is that lining damage and historical repairs are usually localised, so assessments performed on sample lining data taken from one part of the tunnel may not be representative of the performance throughout the tunnel length.
For deep learning-based tunnel analysis workflows to be effectively applied, it is vital that an engineer can quantify the uncertainty in their predictions [18]. Ideally, this would involve generating uncertainty maps so that further analysis can be conducted to verify the performance or manually correct a model’s segmentation in localised areas. This retains the time-saving benefits of these automated workflows where they perform well but reduces the risk of unverified analysis where segmentations may be inaccurate.
There have been substantial recent developments in uncertainty quantification for deep learning computer vision applications in the healthcare field [19]. However, these methods have seen limited application in infrastructure condition assessment tasks. The aim of this study was to evaluate which uncertainty quantification methods have the most potential for validating deep learning models applied to real-world railway tunnel condition assessment tasks. This study compared methods to determine which were most applicable to the task of masonry joint semantic segmentation from lidar data. It considered how uncertainty quantification could be applied in real-world masonry tunnel condition assessments and demonstrated the utility of creating uncertainty maps. This study also assessed the correlation between segmentation uncertainty and performance. Overall, this study aimed to provide guidance on where uncertainty quantification methods could provide value for an engineer and which would need further development for their insights to be useful and readily interpretable.
2. Methods
As most modern deep learning approaches are black boxes, researchers have looked at various methods to verify whether a neural network prediction is trustworthy. While many studies have focused on how to explain a model’s conclusion [20], for the task of semantic segmentation, many of these methods have been shown to be either misleading or overly complex, which restricts their interpretability [21]. As a result, this study restricted its analysis to assessing the validity of neural network predictions through uncertainty quantification methods. In the field of deep learning research, uncertainty quantification methods focus on identifying, characterising and quantifying the level of uncertainty in a model’s predictions.
2.1. Types of Uncertainty
Uncertainty can be classified into two types: aleatoric and epistemic [22]. Aleatoric uncertainty refers to inherent randomness in the data that can cause changes in the model predictions. In this application, it could be caused by any effect with unpredictable values at a specific pixel, such as noise in the lidar scan or the surface roughness of the masonry lining. Arguably, any feature relationship that is present in the training data but too complex for the network to characterise could also be considered a cause of aleatoric uncertainty, as its impacts are effectively random [23].
Epistemic uncertainty represents uncertainty in the output caused by a lack of understanding of the target domain by the model. This is caused by test data having features that are either not represented in the training set or not properly characterised by the model. This results in the test images being out of the distribution of relationships learned by the network, leading to an unpredictable performance. Features that cause epistemic uncertainty will be those most impacted by the domain shift between different tunnels. Examples would be masonry block sizing and shapes; levels and types of damage observed; and differences in the mortar joint composition, condition and thickness.
Ideally, the pixelwise softmax probability output by the neural network should represent the level of confidence that the model has in its prediction. However, modern convolutional neural network designs are often challenging to properly calibrate within the training data domain and tend to yield confident but incorrect predictions on out-of-domain test data [24]. As a result, uncertainty quantification methods have been developed to assess a neural network’s uncertainty when applied to a specific dataset. Monte Carlo dropout (MCD) and test time augmentation (TTA) are two commonly used methods that were chosen for this study.
Unsupervised methods for anomaly segmentation and associated uncertainty estimation have also been developed [25]. These methods typically involve training a student network against a teacher network and have recently been applied to medical images [26] and structural surface damage detection [27]. However, these methods are not designed to quantify the aleatoric and epistemic uncertainties of the output of existing trained models, so they are not examined further.
2.2. Monte Carlo Dropout (MCD)
Monte Carlo dropout enables the estimation of epistemic uncertainty by varying the neural network design when testing the target dataset. Dropout was initially developed to prevent model overfitting and involves randomly omitting feature detectors during training each time the gradients are updated [28]. This prevents the network from learning overly complex and probably meaningless feature relationships that are unique to individual training samples and would otherwise reduce performance on test data.
While the network is usually frozen and all neurons are retained for testing, MCD involves applying dropout during testing. The testing data are run through the network with different neurons dropped out each time and the resulting variations in output give an indication of the level of uncertainty in the network’s predictions. First proposed by [29] and theoretically proven as a Bayesian equivalent in [30], test time dropout approximates the posterior distribution of the network’s weights by Monte Carlo sampling the network’s predictions. By assessing the model variance on a particular test image, it is possible to assess the epistemic uncertainty of the prediction [31].
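As a concrete illustration, the sketch below shows a minimal MCD inference loop in PyTorch. It assumes a trained binary segmentation model containing dropout layers; the function name, sample count and sigmoid output head are illustrative assumptions rather than details taken from any of the cited implementations.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, image: torch.Tensor, n_samples: int = 100):
    """Run n_samples stochastic forward passes with dropout left active.

    image: (1, C, H, W) input patch. Returns the per-pixel mean foreground
    probability and its standard deviation across the sampled passes.
    """
    model.eval()  # keep normalisation layers in inference mode
    for module in model.modules():
        # Re-enable only the dropout layers; the rest of the network stays frozen.
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.sigmoid(model(image)) for _ in range(n_samples)], dim=0
        )
    return probs.mean(dim=0), probs.std(dim=0)
```

The per-pixel standard deviation returned here is exactly the quantity used later in this paper to build uncertainty maps.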
MCD has been applied relatively extensively for semantic segmentation tasks in the medical field, where quantifying the uncertainty of medical imaging analyses is vital for a clinician to make informed decisions about a patient’s treatment options [32,33,34,35,36]. However, while some studies have applied MCD for the uncertainty quantification of construction object segmentation [37] and concrete damage assessment [38,39,40], it has not been applied to semantic segmentation tasks on older, less homogeneous infrastructure, such as masonry lined tunnels. This is despite the need for similar safety-critical decisions to be made on these structures based on a neural network segmentation output.
2.3. Test-Time Augmentation (TTA)
Test-time augmentation (TTA) is a commonly applied method for estimating the heteroscedastic aleatoric uncertainty in a model’s prediction. While data augmentation during training to improve test time performance is well documented [15], Ref. [41] were the first to apply test-time augmentation to help understand the aleatoric uncertainty in a model’s output. They made 128 copies of their test samples and ran them through standard training image augmentations to produce 128 variants of their test data. They then put these images into their trained image classification network and observed the variations in outputs between the transformed images. Ref. [42] later formalised this method and assessed its potential for uncertainty quantification using the Volume Variation Coefficient (VVC) of segmented structures for brain tumour segmentation. They showed a negative correlation between the VVC, which is calculated by dividing the segmentation volume variance by its mean, and the segmentation Dice score. Although it is less commonly adopted than MCD, TTA has been used for the uncertainty quantification of other types of medical images [43,44]. It has not previously been applied to structural condition assessments.
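A minimal TTA loop can be sketched in the same style as the MCD example above. Geometric augmentations are inverted before the outputs are compared so that all predictions are spatially aligned; the augmentation set, noise level and sample count here are illustrative assumptions.

```python
import torch

def tta_predict(model, image, n_samples=50, noise_sigma=0.05):
    """image: (1, 1, H, W). Returns per-pixel mean and std over augmented passes."""
    model.eval()
    preds = []
    with torch.no_grad():
        for _ in range(n_samples):
            x = image + noise_sigma * torch.randn_like(image)  # photometric: additive noise
            flip_h = bool(torch.rand(1) < 0.5)
            flip_v = bool(torch.rand(1) < 0.5)
            if flip_h:
                x = torch.flip(x, dims=[-1])
            if flip_v:
                x = torch.flip(x, dims=[-2])
            y = torch.sigmoid(model(x))
            # Undo the geometric transforms so all outputs are aligned.
            if flip_h:
                y = torch.flip(y, dims=[-1])
            if flip_v:
                y = torch.flip(y, dims=[-2])
            preds.append(y)
    probs = torch.stack(preds, dim=0)
    return probs.mean(dim=0), probs.std(dim=0)
```

Unlike MCD, the network itself is deterministic here; all of the output variation comes from perturbing the input, which is why TTA targets aleatoric rather than epistemic uncertainty.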
2.4. Study Contributions
The objective of this research was to analyse the potential of MCD and TTA for quantifying the uncertainty of neural networks trained for masonry joint segmentation from lidar data. This will increase the trustworthiness of automated masonry-lined tunnel condition assessment procedures that rely on deep learning-based masonry joint segmentation. This study aimed to provide a method for automatically highlighting anomalous segmentations to an engineer so that they can be manually adjusted or removed. It provides the following three contributions:
A comparison of MCD with TTA for structural condition assessment uncertainty quantification.
An analysis of whether the uncertainty can effectively predict performance.
Consideration of the usefulness of generated uncertainty maps.
This paper briefly outlines the procedure for training a neural network for masonry joint semantic segmentation developed by [7]. It then adapts the model to analyse two uncertainty quantification methods: Monte Carlo dropout and test-time augmentation. This study compared these methods and investigated the correspondence between the uncertainty and model performance.
2.5. Datasets
Lidar surveys were taken of the linings of four different masonry-lined railway tunnels in the southwest of England. Managed by Network Rail, the UK’s railway infrastructure asset owner, the tunnels were built in the 1850s at the time of construction of the railway. Tunnels 1 and 3 were lined with stone masonry, while Tunnels 2 and 4 were brick-lined. They were all in a serviceable condition; however, large areas of shallow spalling and mortar loss (<10 mm depth), alongside efflorescence, were present on the masonry. Three-dimensional point clouds were created from the lidar data. These were grid-sampled to a minimum point spacing of 5 mm to reduce the computing power required for subsequent analysis.
2.6. Data Preprocessing
The 3D point clouds were further processed to prepare them for neural network training. Although there are multiple state-of-the-art 3D deep learning methods for the semantic segmentation of 3D point clouds [45], applying 2D vision methods to rasterised depth map images of the tunnel lining has been shown to be the most effective for masonry tunnel joint semantic segmentation [13]. The data preparation workflow followed in [7] was therefore used in this study.
A cylinder was fitted to the tunnel point cloud using principal component analysis, and the point cloud was then unrolled to flatten the tunnel lining. The vertical offsets of the resulting point cloud were then rasterised into a 2D float32 image depicting a depth map of the tunnel lining. The pixel intensities in the depth map image corresponded to the out-of-plane distance of each point from an ideal cylindrical tunnel lining. The joint locations were manually labelled on these images to create ground truth joint location masks. The joints were labelled as constant 9-pixel-width lines using QGIS 3.38.2. This was challenging in places, as in some locations the mortar joints were very narrow or the mortar was level with the brick surface. This made it difficult to identify each brick from the depth map images. Additionally, as shown in Figure 1, the surface damage made it difficult to visually determine the locations of the masonry joints in the point cloud from either the 3D or intensity data.
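A minimal numpy sketch of this unrolling step is given below. The ideal cylinder radius r0, the 5 mm pixel size and the last-write rasterisation are simplifying assumptions; the published workflow in [7] should be consulted for the exact procedure.

```python
import numpy as np

def unroll_to_depth_map(points: np.ndarray, r0: float, px: float = 0.005):
    """points: (N, 3) lidar points in metres. Returns a float32 depth map."""
    centred = points - points.mean(axis=0)
    # The first principal component of the cloud approximates the tunnel axis.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    axis = vt[0]
    s = centred @ axis                      # distance along the tunnel axis
    radial = centred - np.outer(s, axis)    # component normal to the axis
    r = np.linalg.norm(radial, axis=1)
    theta = np.arctan2(radial @ vt[2], radial @ vt[1])  # angle around the axis
    # Unrolled plan coordinates: along-axis vs. circumferential distance.
    u = ((s - s.min()) / px).astype(int)
    v = ((theta - theta.min()) * r0 / px).astype(int)
    depth = np.full((v.max() + 1, u.max() + 1), np.nan, dtype=np.float32)
    # Offsets from the ideal cylinder become pixel intensities. Where several
    # points fall in one pixel, the last write wins; a real pipeline would aggregate.
    depth[v, u] = r - r0
    return depth
```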
After the labelling step, the images were upsampled such that each tunnel had the same average number of pixels per masonry block. Tunnel 3 had the largest masonry blocks, so Tunnels 1, 2 and 4 were upsampled to match. The data from each tunnel were then split into training, validation and testing sets in a 3:1:1 ratio. Finally, the images were split into 384 × 384-pixel patches, as sketched below. This was performed to ensure that the neural networks could be trained within the available VRAM constraints.
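The patch-splitting step itself is straightforward; a minimal sketch follows, in which edge remainders are discarded for simplicity (an assumption, not necessarily the choice made in this study).

```python
import numpy as np

def split_into_patches(image: np.ndarray, size: int = 384):
    """Tile a rasterised depth map into non-overlapping size x size patches."""
    h, w = image.shape[:2]
    return [
        image[i:i + size, j:j + size]
        for i in range(0, h - size + 1, size)
        for j in range(0, w - size + 1, size)
    ]
```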
2.7. Neural Network Training
In order to focus on the uncertainty quantification performance, a basic U-Net style neural network was chosen for the analysis. The network architecture from [46] was used. This consisted of 4 downsampling convolutional layers in an encoder followed by 4 upsampling convolutional layers in the decoder. The model was trained using the hyperparameters outlined in Table 1 on a 10-core Intel Xeon 6138 CPU with 48 GB of system RAM and an Nvidia V100 GPU.
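For orientation, a compact PyTorch sketch of such a network is given below. The channel widths, dropout placement and single-channel input are illustrative assumptions and not the exact architecture of [46].

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with ReLU activations.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512), p_drop=0.5):
        super().__init__()
        self.enc = nn.ModuleList()
        c = 1  # single-channel depth map input (an assumption)
        for w in widths:                      # 4 downsampling stages
            self.enc.append(conv_block(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(conv_block(c, c * 2), nn.Dropout2d(p_drop))
        self.up = nn.ModuleList()
        self.dec = nn.ModuleList()
        c = c * 2                             # bottleneck output channels
        for w in reversed(widths):            # 4 upsampling stages
            self.up.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.dec.append(conv_block(w * 2, w))
            c = w
        self.head = nn.Conv2d(widths[0], 1, 1)  # joint / non-joint logits

    def forward(self, x):
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)   # keep pre-pooling features for skip connections
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```

A 384 × 384 input is divisible by 2 four times, so the four pooling and four transposed-convolution stages return the output to the input resolution.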
2.8. Quantifying Epistemic and Aleatoric Uncertainties
It is necessary to holistically assess the performance of each uncertainty quantification method given the wide possible variations in the feature distributions of both the trained neural network selected and the target tunnel lining. As the aim of the uncertainty quantification was to indicate the neural network’s performance and applicability to a specific tunnel, performance on both the in- and out-of-distribution tunnel data needed to be assessed.
Four different tunnels were available for this study, so the neural network was trained three times using different hyperparameters to simulate a total of 12 different domain-shift scenarios. This created different levels of epistemic uncertainty. The details of these networks are described in Table 2. The differences between the trained networks acted as a proxy for the wide variety of possible differences in features between the training and testing data tunnels due to their geometries, material types and damage levels. The aleatoric uncertainty also needed to be modelled. In order to artificially create uncertainty in the test data, random Gaussian and Perlin noise was added to each test image. Two levels of noise were chosen, with scale factors of 0.15 and 0.3 applied to the magnitude of the noise. All data augmentations were implemented using Albumentations 1.4.10.
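The corruption step can be sketched as follows. Albumentations has no built-in Perlin transform, so a smoothly interpolated value-noise grid stands in for Perlin noise here; the grid resolution and the scaling relative to the depth-map contrast are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def corrupt(depth: np.ndarray, scale: float, grid: int = 12, rng=None):
    """Add Gaussian plus Perlin-like noise to a depth map, scaled by `scale`
    (0.15 or 0.3 in this study)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = depth.shape
    gaussian = rng.standard_normal((h, w))
    # A smoothly interpolated coarse random grid approximates Perlin noise.
    coarse = rng.standard_normal((grid, grid))
    perlin_like = zoom(coarse, (h / grid, w / grid), order=3)[:h, :w]
    amplitude = scale * np.nanstd(depth)  # scale relative to depth-map contrast
    return depth + amplitude * (gaussian + perlin_like)
```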
2.9. Data Augmentations
Training data augmentations are transformations applied to training data to artificially increase its volume and variety. This widens the feature distribution of the training data and helps to increase the robustness of the trained network so that it generalises better to different datasets. Networks A, B and C were trained with random vertical and horizontal flips. Through trial and error, the data augmentations that led to the best test data performance were determined and applied for Network A. Network A was trained with the following additional augmentations applied randomly (a sketch of such a pipeline follows the list):
Brightness shifts—offsets in all pixel values. This simulated different tunnel shapes.
Contrast adjustments—scaled the pixel values. This simulated different mortar joint depths.
Perlin noise—randomly generated gradient offsets. This simulated masonry surface roughness changes.
Gaussian noise—random pixel value offsets. This simulated noise that occurred during the data collection due to the accuracy of the laser scanning equipment.
Elastic transforms—random displacement maps. This simulated masonry deformations.
Crop and resize—up- and downscaling of the input image. This simulated different masonry block sizes.
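The sketch below assembles an Albumentations pipeline of the kind described above. The probabilities and limits are illustrative assumptions, exact transform signatures vary between Albumentations versions, and the Perlin-style noise is delegated to a hypothetical custom function.

```python
import albumentations as A

def add_perlin(image, **kwargs):
    # Placeholder for a custom Perlin-style noise function (Albumentations
    # has no built-in Perlin transform); see the `corrupt` sketch above.
    return image

train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    A.GaussNoise(var_limit=(0.001, 0.01), p=0.3),   # lidar sensor noise
    A.Lambda(image=add_perlin, p=0.3),              # masonry surface roughness
    A.ElasticTransform(alpha=50, sigma=8, p=0.3),   # masonry deformations
    A.RandomResizedCrop(height=384, width=384, scale=(0.6, 1.0), p=0.5),  # block sizes
])

# Geometric transforms are applied to the joint mask automatically when it is
# passed alongside the image:
# out = train_transform(image=depth_patch, mask=joint_mask)
```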
3. Results
3.1. Neural Network Performance
Each network was trained on Tunnel 1 before being applied to the test data of all four tunnels. The performance was assessed using the Intersection Over Union (IOU) score, which compares the number of True Positive (TP) predictions with the False Positive (FP) and False Negative (FN) predictions. It is considered more useful than Accuracy, Precision or Recall values alone, as it only considers the target positive class and is insensitive to class imbalance. It is calculated as shown in Equation (1):

IOU = TP / (TP + FP + FN)    (1)
For masonry block documentation, it is vital that the segmented masonry joints fully enclose each block in order to separate block instances. The exact location of each masonry joint, while contributing to the masonry block segmentation performance, is less critical than the identified block sizes and shapes, which are dependent on the joint connectivity. While a segmentation with a high masonry joint IOU would also generate a good block segmentation, a small offset of the detected joints leads to a substantial decrease in the joint IOU, despite having a minimal impact on the overall block documentation performance. Typically, the IOU would be applied to assess the performance of the positive class (masonry joint locations) alone. However, in this case, the IOU was calculated blockwise on the negative class using the method developed in [47]. A connected component analysis was conducted to identify the separate masonry block instances. The segmented blocks were then assigned by centroid location to their nearest ground truth block, and the IOU was calculated between each of these assignments. The average IOU per detected block was then calculated to give the blockwise IOU used here.
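The sketch below gives one possible reading of this blockwise IOU using scipy connected-component labelling; the centroid-matching and tie-breaking details are assumptions rather than the exact method of [47].

```python
import numpy as np
from scipy import ndimage

def blockwise_iou(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """pred_joints, gt_joints: boolean masks where True marks mortar joints."""
    # Connected components of the negative (block) class.
    pred_blocks, n_pred = ndimage.label(~pred_joints)
    gt_blocks, n_gt = ndimage.label(~gt_joints)
    gt_centroids = np.array(
        ndimage.center_of_mass(~gt_joints, gt_blocks, list(range(1, n_gt + 1)))
    )
    ious = []
    for i in range(1, n_pred + 1):
        mask = pred_blocks == i
        c = np.array(ndimage.center_of_mass(mask))
        # Assign the predicted block to its nearest ground truth block centroid.
        j = 1 + np.argmin(np.linalg.norm(gt_centroids - c, axis=1))
        gt_mask = gt_blocks == j
        ious.append((mask & gt_mask).sum() / (mask | gt_mask).sum())
    return float(np.mean(ious)) if ious else 0.0
```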
As using MCD required training with dropout enabled, the networks were trained with and without dropout in order to assess whether using dropout had negative performance impacts. The performances of the networks trained without dropout are shown in Table 3 and the outputs are visualised in Figure 2. It is clear that Network A had a better generalisation performance than B and C. The overfitting of Network B to Tunnel 1 yielded the best results on the Tunnel 1 test data at the expense of Tunnels 2, 3 and 4’s performances. The performances on Tunnels 2 and 4 were worse than on Tunnel 3. This was likely due to the differences in features between the stone- and brick-lined tunnels. Although the decrease in performance when moving from Network A to Networks B and C is qualitatively visible in Figure 2 for Tunnels 2 and 4, the IOU shown in Table 3 decreased disproportionately. This was because even when the joints were segmented largely correctly, small gaps in the joints connected adjacent block instances, which led to a substantial breakdown in the block segmentation performance.
The model was then trained with dropout enabled, with a dropout probability of 0.5 for each neuron. As shown in Table 4, this produced slightly higher but largely similar results. Tunnel 4’s performance was particularly improved. This was possibly due to the regularising effect of using dropout.
The impact of the artificially added noise on the neural network performance was also assessed. Figure 3 shows how the added noise impacted the segmentations of a section of testing data taken from Tunnel 3. Figure 4 visualises how the distribution of performance decreased for each tunnel segment when the amount of added noise was increased. It can clearly be seen that for every network, adding noise decreased the performance. For Network B, the change was less pronounced. This was likely because it achieved a poor performance on Tunnels 2, 3 and 4 in the no-added-noise case, but a good performance on Tunnel 1. Adding noise had little impact when the performance was already low. This is reflected in the visualisations in Figure 3.
3.2. Uncertainty Metrics
The MCD and TTA each produced a set of output segmentations that needed to be compared for the uncertainty to be quantified. Two different uncertainty metrics were chosen: the UMIOU and the Area Variation Coefficient (AVC), which, analogously to the VVC described in Section 2.3, divides the variance of the segmented joint area across samples by its mean.
Uncertainty maps were also generated for each segmented image patch by calculating, for each pixel, the standard deviation of the segmentation predictions across the samples; each pixel value on the uncertainty map was then set to the calculated standard deviation.
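A minimal sketch of these computations is shown below, assuming a stack of per-pixel joint probabilities produced by either MCD or TTA sampling; the AVC definition used here is inferred from the VVC analogy in Section 2.3.

```python
import numpy as np

def uncertainty_outputs(prob_stack: np.ndarray, threshold: float = 0.5):
    """prob_stack: (n_samples, H, W) per-pixel joint probabilities."""
    uncertainty_map = prob_stack.std(axis=0)           # per-pixel standard deviation
    areas = (prob_stack > threshold).sum(axis=(1, 2))  # segmented joint area per sample
    avc = areas.var() / max(areas.mean(), 1e-9)        # area variation coefficient
    return uncertainty_map, avc
```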
3.3. Test-Time Augmentation Results
Test-time augmentation was applied to each of the three neural networks after they had been trained with dropout enabled. Gaussian noise, Perlin noise, brightness and contrast shifts, and random vertical and horizontal flips were chosen as the augmentations. Each 384 × 384 test image crop was augmented in 50 different ways. The UMIOU and AVC were calculated between the 50 segmentations for each image.
The generated uncertainty maps are visualised in Figure 5 for an image patch with varying levels of added noise in the input image. For Network A, although the AVC increased with increasing noise levels, visually, there appeared to be fewer areas of uncertainty. As can be seen in Figure 3, the network identified fewer joint locations as the noise increased. This suggests that if the noise level is significant enough to completely obscure image features, then the network will be more confident in its prediction, even though it is incorrect. The AVC was able to reflect the increased uncertainty because it is normalised by the area of predicted joints, so it was robust to the decreased segmentation area. While Networks B and C showed decreases in the UMIOU with the increased noise, it is not clear from viewing the uncertainty maps alone that the level of uncertainty increased.
The distribution of increases in the AVC over all the image test patches is visualised within the histograms in Figure 6. The mean AVC value increased for all the networks when the level of added noise was increased. For Network B, the increases were lower. As Network B had a poor generalisation performance, it was unable to effectively characterise the out-of-distribution data. It was less able to reduce its prediction confidence in uncertain situations. As a result, it likely produced universally more confident predictions due to incorrectly identifying the features in the test data as being those from the training data.
The AVC and UMIOU uncertainties on each patch are plotted against the segmentation performance in Figure 7. No clear relationships between the uncertainty value and the segmentation performance were observed for the TTA output.
3.4. Monte Carlo Dropout Results
The effectiveness of applying Monte Carlo dropout was analysed for each of the three trained neural networks. Dropout was applied during inference with a probability of 0.5. Each 384 × 384 test image crop was put through the trained network 100 times. The UMIOU and AVC were calculated between each of the 100 Monte Carlo-sampled segmentations for each image. The generated uncertainty maps are visualised in Figure 8 for an image patch from both a section of Tunnel 1 data, which should have a low level of epistemic uncertainty, and a section of Tunnel 3, which should have a higher level of epistemic uncertainty.
For Network A, the uncertainty maps clearly show the locations where varying segmentations are possible. For example, there was clearly uncertainty in Network A’s segmentation at the top of the Tunnel 1 patch. Observing the regularity of the masonry, it was likely that at this location, masonry cracks were being predicted as joints. For Network B, there was a substantial increase in both the AVC and the level of uncertainty that could be qualitatively observed. The change in the epistemic uncertainty was the most extreme for Network B, as it was overfitted to Tunnel 1. This demonstrates how uncertainty maps can be used to determine which tunnel areas are likely to be within the distribution of the trained model.
The correlation between the MCD uncertainty and segmentation performance can be observed in Figure 9. There was a weak negative correlation between the AVC and IOU for Networks B and C; however, it was less pronounced for Network A. This was possibly because Network A generalised better to unseen data than Networks B and C, which would have caused the epistemic uncertainty to have less of a negative performance impact. An alternate explanation is that Network A could have captured a broader variety of feature relationships than Networks B and C due to its exposure to a wider variety of training data. These more complex relationships would be more adversely impacted by applying dropout than the simpler relationships in Networks B and C. This would have led to the MCD generating universally higher levels of uncertainty and suggests that MCD can be used to indicate the generalisability of a trained network. A further analysis covering a wider variety of datasets would need to be conducted to fully separate the impacts of these factors.
3.5. Uncertainty Visualisations
Having analysed the capabilities of each uncertainty quantification method, it is important to be able to visualise each metric in an accessible and informative way. We propose projecting the AVC scores of each 384 × 384 image patch onto the output segmentation maps. This enables an engineer to efficiently scan the tunnel lining for areas with high uncertainty. The pixelwise uncertainty maps should be inspected when a specific area requires a more detailed inspection. They can then be used to view alternate possible segmentations. An example of this is shown applied to a section of the Tunnel 3 test data in Figure 10. The correlation between the segmentation performance and MCD uncertainty can be observed.
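A minimal matplotlib sketch of this projection is shown below; the patch layout, colour map and transparency are presentation assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_avc_overlay(segmentation: np.ndarray, patch_avc: np.ndarray, patch: int = 384):
    """segmentation: (H, W) joint mask; patch_avc: (H//patch, W//patch) AVC scores."""
    # Expand each patch score to pixel resolution so it aligns with the map.
    avc_map = np.kron(patch_avc, np.ones((patch, patch)))
    h, w = segmentation.shape
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.imshow(segmentation, cmap="gray")
    im = ax.imshow(avc_map[:h, :w], cmap="inferno", alpha=0.4)  # heat overlay
    fig.colorbar(im, ax=ax, label="AVC (uncertainty)")
    ax.set_axis_off()
    plt.show()
```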
Without uncertainty maps, it would be necessary for an engineer to inspect every segmentation map in detail to validate the segmentations. However, Section 3.3 and Section 3.4 show that although high levels of uncertainty were correlated with a poor performance, it is possible for a patch with a low epistemic or aleatoric uncertainty to also generate a poor segmentation. As a result, areas with a poor performance, where the segmentation needs to be manually analysed and corrected, cannot be exclusively determined using uncertainty values. It is necessary for an engineer to take a holistic approach when identifying locations with a poor segmentation performance. The following workflow is suggested:
1. An engineer should identify the typical masonry block dimensions from the segmentation maps. If there are multiple types of masonry present, then the engineer should conduct the following steps over each type of masonry in turn, as uncertainty values are not directly comparable between areas with substantially different properties.
2. The TTA and MCD image patches should be sorted by their uncertainty levels.
3. Starting from the patches with the highest uncertainty, the patch predictions should be observed alongside the pixelwise uncertainty values and the input depth map. If the predicted joint locations do not appear realistic, then the segmentation should be manually corrected. The MCD pixel uncertainty maps reflect the segmentation outputs of variants of the trained neural network and can therefore be used as a guide to identify more realistic segmentation candidates.
4. Step 3 should be conducted for patches with progressively smaller uncertainties until the observed patches have qualitatively acceptable segmentations.
It is necessary to account for areas where the segmentation performance may be poor despite a low identified level of uncertainty. These regions are likely caused by abnormalities in the input depth map arising from tunnel features that were not encountered during training and are challenging to accurately identify from the depth map alone. While many of these cases will lead to epistemic uncertainty, it is possible for the network to be confidently incorrect if a joint is not visible in the depth map. This may occur, for example, if the mortar is level with the masonry surface. In addition, high levels of noise are not always detected by TTA as aleatoric uncertainty if no reasonable segmentations can be generated. As a result, the engineer should conduct further pixelwise segmentation verification in areas they have identified as anomalous during their on-site qualitative inspection of the tunnel.
Although this method is not guaranteed to remove all incorrect segmentations, it is a cost-effective procedure for improving the segmentation performance given limited available manual analysis time and would substantially reduce the analysis time compared with the full manual labelling of masonry block locations.
4. Discussion and Conclusions
Both test time augmentation (TTA) and Monte Carlo dropout (MCD) generated uncertainty maps that could aid in the interpretability of deep learning-based masonry joint segmentations. Using MCD enabled alternate possible segmentations to be viewed, which visualised the sensitivity of the output to the neural network training environment. The Area Variation Coefficient (AVC) could be used to demonstrate how increased epistemic uncertainty led to decreased segmentation performance. However, the uncertainty values generated by the Monte Carlo dropout were sensitive to aleatoric uncertainty. Adding noise decreased the AVC of the MCD uncertainties.
Applying TTA showed how small changes in the input data led to changes in the segmentation output. Qualitatively, the produced TTA uncertainty maps were strongly dependent on how robust a trained network was to noise. The AVC of the TTA outputs was increased with increased noise levels, which enabled it to be used as an indicator of the quality of the input images and the resulting aleatoric uncertainty. However, it was not shown to correlate with the segmentation performance, as this was strongly driven by the epistemic uncertainty caused by the domain shift between the tunnels analysed.
For both the TTA and MCD scenarios, the UMIOU was not shown to be a useful metric of uncertainty. The UMIOU did not correlate in most cases with the AVC or the perceived level of uncertainty observed in the uncertainty maps. As a result, the AVC is proposed as the standard segmentation uncertainty evaluation metric.
There was a substantial runtime increase when implementing the uncertainty quantification methods. The runtime increase was proportional to the number of augmentations or dropout variations that were assessed, since the inference needed to be computed for every Monte Carlo and augmentation sample. For this study, implementing TTA and MCD increased the inference time by approximately 1500%, as 100 MCD samples and 50 TTA samples were used. Reducing the number of samples reduced the effectiveness of the methods, as it was necessary to generate a distribution of outputs in order to confidently estimate their mean and standard deviation. With a low number of Monte Carlo samples, key augmentation/dropout permutations could be missed, generating misleading results. It is recommended that MCD and TTA are only implemented on standard office hardware when the computational cost of inference is small. Alternatively, cloud computing services could be rented for the inference process. This would prevent the analysis time from forming a bottleneck in conducting a condition assessment without requiring the purchase of expensive specialist hardware that may only have occasional use over the lifetime of a project.
To conclude, applying both TTA and MCD provided valuable insights into the uncertainty of masonry block segmentation outputs, and AVC score maps could be effectively used to indicate to a tunnel inspector the locations where a neural network has high levels of uncertainty. Within a specific tunnel, the AVC value correlated with the segmentation performance, which can enable an inspector to easily identify the most effective locations to conduct manual validation or correction of the output masonry joint segmentation map. However, there are limitations with using uncertainty alone as a proxy for segmentation performance since their power is dependent on the specific trained neural network. It is suggested that the level of uncertainty is calibrated against the segmentation performance on samples of unseen testing data before being applied in practice. A well-trained and generalised neural network should generate more strongly correlated uncertainty scores. However, it is still necessary to conduct a qualitative visual inspection of a tunnel lining to identify where obvious lining anomalies may impact the segmentation performance.
In order to determine how uncertainty maps could be integrated into real-world tunnel condition assessments, further work needs to be conducted to analyse how accessible and interpretable these methods are for engineers who are not familiar with machine learning. Furthermore, asset managers need to be consulted to determine how far uncertainty quantification methods would improve their perception of the trustworthiness of automated tunnel analysis workflows.