Component Identification and Depth Estimation for Structural Images Based on Multi-Scale Task Interaction Network
Abstract
1. Introduction
2. Methods
2.1. Multi-Task Deep Learning in Computer Vision
2.2. Multi-Scale Task Interaction Strategy
2.3. Pixel Affinity
2.4. Evaluation Metrics
3. Dataset
4. Experiments and Results
4.1. Experimental Setup
4.2. Computational Efficiency Results
4.3. Component Segmentation Results
4.4. Depth Estimation Results
5. Conclusions
- (1) Component segmentation and depth estimation are closely related tasks with the potential to enhance each other's performance. The matched ratio of pixel pairs whose segmentation affinity agrees with their depth affinity reaches roughly 50–70% across the dataset, indicating that the two tasks are well suited to a multi-task learning strategy (a sketch of this measure follows this list).
- (2) The proposed multi-task framework is superior to single-task networks in computational efficiency. Quantitative results show that MTI-Net trains and infers faster and has a lower memory footprint. Because the multi-task framework distills features in layers shared across tasks, it avoids modeling each task from scratch and cuts computational cost.
- (3) Component segmentation and depth estimation are combined under the multi-task learning strategy. Compared with the conventional single-task network, mean IoU for component segmentation rises from 96.84% to 99.14%. Incorporating depth information supplies spatial context for building images and substantially improves component identification.
- (4) The RMSE of depth estimation decreases from 0.63 m for the single-task network to 0.27 m for the multi-task network. The proposed multi-task, multi-scale deep learning network performs well on both tasks, and the component labels and depth maps provide each other with auxiliary information, enabling more accurate structural inspection.
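To make the matched-ratio measure in conclusion (1) concrete, the following is a minimal sketch of one way to estimate it: sample random pixel pairs and count how often "same component label" agrees with "similar depth". The pair-sampling scheme and the depth-similarity threshold `tau` are illustrative assumptions, not the paper's exact affinity definition from Section 2.3.

```python
import numpy as np

def matched_ratio(seg, depth, n_pairs=100_000, tau=0.5, seed=None):
    """Fraction of random pixel pairs whose segmentation affinity
    (same component label) agrees with their depth affinity
    (absolute depth difference below tau metres)."""
    rng = np.random.default_rng(seed)
    seg, depth = seg.ravel(), depth.ravel()
    i = rng.integers(0, seg.size, n_pairs)
    j = rng.integers(0, seg.size, n_pairs)
    same_label = seg[i] == seg[j]
    similar_depth = np.abs(depth[i] - depth[j]) < tau
    return float(np.mean(same_label == similar_depth))

# Example on synthetic maps (real inputs would be a label map and a depth map):
# seg = np.random.randint(0, 8, (480, 640)); depth = 10 * np.random.rand(480, 640)
# print(matched_ratio(seg, depth))
```

A high matched ratio means the two tasks agree on where object boundaries fall, which is exactly the regularity a multi-task network can exploit.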
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Kilic, G.; Caner, A. Augmented Reality for Bridge Condition Assessment Using Advanced Non-Destructive Techniques. Struct. Infrastruct. Eng. 2021, 17, 977–989.
- Mishra, M.; Lourenço, P.B.; Ramana, G.V. Structural Health Monitoring of Civil Engineering Structures by Using the Internet of Things: A Review. J. Build. Eng. 2022, 48, 103954.
- Sofi, A.; Jane Regita, J.; Rane, B.; Lau, H.H. Structural Health Monitoring Using Wireless Smart Sensor Network—An Overview. Mech. Syst. Signal Process. 2022, 163, 108113.
- Gordan, M.; Sabbagh-Yazdi, S.-R.; Ismail, Z.; Ghaedi, K.; Carroll, P.; McCrum, D.; Samali, B. State-of-the-Art Review on Advancements of Data Mining in Structural Health Monitoring. Measurement 2022, 193, 110939.
- Tian, Y.; Chen, C.; Sagoe-Crentsil, K.; Zhang, J.; Duan, W. Intelligent Robotic Systems for Structural Health Monitoring: Applications and Future Trends. Autom. Constr. 2022, 139, 104273.
- Akbar, M.A.; Qidwai, U.; Jahanshahi, M.R. An Evaluation of Image-Based Structural Health Monitoring Using Integrated Unmanned Aerial Vehicle Platform. Struct. Control Health Monit. 2019, 26, e2276.
- Insa-Iglesias, M.; Jenkins, M.D.; Morison, G. 3D Visual Inspection System Framework for Structural Condition Monitoring and Analysis. Autom. Constr. 2021, 128, 103755.
- Dong, C.-Z.; Catbas, F.N. A Review of Computer Vision–Based Structural Health Monitoring at Local and Global Levels. Struct. Health Monit. 2021, 20, 692–743.
- Deng, J.; Singh, A.; Zhou, Y.; Lu, Y.; Lee, V.C.-S. Review on Computer Vision-Based Crack Detection and Quantification Methodologies for Civil Structures. Constr. Build. Mater. 2022, 356, 129238.
- Spencer, B.F.; Hoskere, V.; Narazaki, Y. Advances in Computer Vision-Based Civil Infrastructure Inspection and Monitoring. Engineering 2019, 5, 199–222.
- Lenjani, A.; Yeum, C.M.; Dyke, S.; Bilionis, I. Automated Building Image Extraction from 360° Panoramas for Postdisaster Evaluation. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 241–257.
- Wogen, B.E.; Choi, J.; Zhang, X.; Liu, X.; Iturburu, L.; Dyke, S.J. Automated Bridge Inspection Image Retrieval Based on Deep Similarity Learning and GPS. J. Struct. Eng. 2024, 150, 04023238.
- Yeum, C.M.; Choi, J.; Dyke, S.J. Automated Region-of-Interest Localization and Classification for Vision-Based Visual Assessment of Civil Infrastructure. Struct. Health Monit. 2019, 18, 675–689.
- Yeum, C.M.; Dyke, S.J.; Benes, B.; Hacker, T.; Ramirez, J.; Lund, A.; Pujol, S. Postevent Reconnaissance Image Documentation Using Automated Classification. J. Perform. Constr. Facil. 2019, 33, 04018103.
- Aloisio, A.; Rosso, M.M.; De Leo, A.M.; Fragiacomo, M.; Basi, M. Damage Classification after the 2009 L’Aquila Earthquake Using Multinomial Logistic Regression and Neural Networks. Int. J. Disaster Risk Reduct. 2023, 96, 103959.
- Yilmaz, M.; Dogan, G.; Arslan, M.H.; Ilki, A. Categorization of Post-Earthquake Damages in RC Structural Elements with Deep Learning Approach. J. Earthq. Eng. 2024, 1–32.
- Khankeshizadeh, E.; Mohammadzadeh, A.; Arefi, H.; Mohsenifar, A.; Pirasteh, S.; Fan, E.; Li, H.; Li, J. A Novel Weighted Ensemble Transferred U-Net Based Model (WETUM) for Postearthquake Building Damage Assessment From UAV Data: A Comparison of Deep Learning- and Machine Learning-Based Approaches. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701317.
- Marano, G.C.; Quaranta, G. A New Possibilistic Reliability Index Definition. Acta Mech. 2010, 210, 291–303.
- Gao, Y.; Mosalam, K.M. Deep Transfer Learning for Image-Based Structural Damage Recognition. Comput.-Aided Civ. Infrastruct. Eng. 2018, 33, 748–768.
- Liang, X. Image-Based Post-Disaster Inspection of Reinforced Concrete Bridge Systems Using Deep Learning with Bayesian Optimization. Comput.-Aided Civ. Infrastruct. Eng. 2019, 34, 415–430.
- Saida, T.; Rashid, M.; Nemoto, Y.; Tsukamoto, S.; Asai, T.; Nishio, M. CNN-Based Segmentation Frameworks for Structural Component and Earthquake Damage Determinations Using UAV Images. Earthq. Eng. Eng. Vib. 2023, 22, 359–369.
- Wang, Y.; Jing, X.; Chen, W.; Li, H.; Xu, Y.; Zhang, Q. Geometry-Informed Deep Learning-Based Structural Component Segmentation of Post-Earthquake Buildings. Mech. Syst. Signal Process. 2023, 188, 110028.
- Narazaki, Y.; Hoskere, V.; Hoang, T.A.; Fujino, Y.; Sakurai, A.; Spencer, B.F., Jr. Vision-Based Automated Bridge Component Recognition with High-Level Scene Consistency. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 465–482.
- Narazaki, Y.; Hoskere, V.; Hoang, T.A.; Spencer, B.F., Jr. Automated Bridge Component Recognition Using Video Data. arXiv 2018, arXiv:1806.0682.
- Kim, H.; Yoon, J.; Sim, S.-H. Automated Bridge Component Recognition from Point Clouds Using Deep Learning. Struct. Control Health Monit. 2020, 27, e2591.
- Lee, J.S.; Park, J.; Ryu, Y.-M. Semantic Segmentation of Bridge Components Based on Hierarchical Point Cloud Model. Autom. Constr. 2021, 130, 103847.
- Kim, H.; Kim, C. Deep-Learning-Based Classification of Point Clouds for Bridge Inspection. Remote Sens. 2020, 12, 3757.
- Xia, T.; Yang, J.; Chen, L. Automated Semantic Segmentation of Bridge Point Cloud Based on Local Descriptor and Machine Learning. Autom. Constr. 2022, 133, 103992.
- Hoskere, V.; Narazaki, Y.; Hoang, T.A.; Spencer, B.F., Jr. MaDnet: Multi-Task Semantic Segmentation of Multiple Types of Structural Materials and Damage in Images of Civil Infrastructure. J. Civ. Struct. Health Monit. 2020, 10, 757–773.
- Vandenhende, S.; Georgoulis, S.; Van Gool, L. MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 527–543.
- Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
- Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-Stitch Networks for Multi-Task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
- Gao, Y.; Ma, J.; Zhao, M.; Liu, W.; Yuille, A.L. NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3205–3214.
- Liu, S.; Johns, E.; Davison, A.J. End-to-End Multi-Task Learning with Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1871–1880.
- Lu, Y.; Kumar, A.; Zhai, S.; Cheng, Y.; Javidi, T.; Feris, R. Fully-Adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5334–5343.
- Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 675–684.
- Zhang, Z.; Cui, Z.; Xu, C.; Yan, Y.; Sebe, N.; Yang, J. Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4106–4115.
- Zhang, Z.; Cui, Z.; Xu, C.; Jie, Z.; Li, X.; Yang, J. Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 235–251.
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5693–5703.
- Hoskere, V.; Narazaki, Y.; Spencer, B.F. Physics-Based Graphics Models in 3D Synthetic Environments as Autonomous Vision-Based Inspection Testbeds. Sensors 2022, 22, 532.
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
Model | Training Time | Inference Latency | Parameters | FLOPs
---|---|---|---|---
Single model a ¹ | −33% | −45% | −29% | −39%
Single model b ² | −67% | −55% | −71% | −61%
Benchmark a + b | 89,076 s | 285 s | 11.9 M | 30.6 G
MTI-Net | −19% | −38% | −20% | −14%

Percentages are reductions relative to the Benchmark a + b row, i.e., the combined cost of running both single-task models: MTI-Net requires 19% less training time, 38% less inference latency, 20% fewer parameters, and 14% fewer FLOPs than the two single-task networks together.
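The savings above come from sharing one backbone across both task heads instead of duplicating it in two single-task networks. Below is a hedged toy accounting of that effect in PyTorch; the tiny convolutional encoder is a stand-in for the HRNet backbone MTI-Net actually uses, and the 8-class segmentation head simply mirrors the component table that follows.

```python
import torch.nn as nn

def backbone() -> nn.Sequential:
    # Toy stand-in encoder (the paper's network uses an HRNet backbone).
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    )

def head(out_channels: int) -> nn.Module:
    # Lightweight per-task prediction head.
    return nn.Conv2d(128, out_channels, kernel_size=1)

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Two single-task networks duplicate the backbone...
single_seg = nn.Sequential(backbone(), head(8))    # 8 component classes
single_depth = nn.Sequential(backbone(), head(1))  # 1-channel depth map
separate = n_params(single_seg) + n_params(single_depth)

# ...whereas the multi-task network shares one backbone between two heads.
shared = n_params(backbone()) + n_params(head(8)) + n_params(head(1))

print(f"two single-task nets: {separate:,} params; shared backbone: {shared:,}")
```

Since the heads are small relative to the encoder, sharing the encoder roughly halves the backbone cost, which is why the multi-task rows in the table sit well below the benchmark.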
Component | Single-Task IoU (%) | MTI-Net IoU (%) | Improvement (%)
---|---|---|---
Wall | 98.2195 | 99.6910 | 1.4715
Beam | 97.2283 | 99.3113 | 2.0830
Column | 97.1979 | 99.5937 | 2.3958
Window frame | 89.0310 | 98.3641 | 9.3331
Window pane | 97.5756 | 99.7225 | 2.1469
Balcony | 99.0902 | 99.7582 | 0.6680
Slab | 96.5930 | 96.7125 | 0.1195
Ignore | 99.7461 | 99.9583 | 0.2122
Mean | 96.8352 | 99.1389 | 2.3038
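For reference, the per-class IoU values in the table can be computed from a confusion matrix as in the generic sketch below; this is the standard metric definition, not code from the paper, and the 8-class count and integer label maps are assumptions matching the table above.

```python
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, n_classes: int) -> np.ndarray:
    """Per-class intersection-over-union from integer label maps."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # accumulate confusion matrix
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return intersection / np.maximum(union, 1)

# Mean IoU as reported in the table (values in [0, 1]; multiply by 100 for %):
# miou = per_class_iou(pred, gt, n_classes=8).mean()
```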
Single-Task RMSE (m) | MTI-Net RMSE (m) | Reduction (m)
---|---|---
0.6314 | 0.2662 | 0.3652
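The RMSE reported above is the standard root-mean-square depth error in metres over valid pixels. A minimal sketch follows; the zero-as-invalid masking convention is an assumption, not the paper's stated protocol.

```python
import numpy as np

def depth_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root-mean-square depth error in metres over valid pixels."""
    valid = gt > 0  # assume zero marks pixels without ground-truth depth
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```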
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).