1. Introduction
As the service life of infrastructure such as bridges increases, it becomes exposed to growing safety risks from decay and deterioration, and regular inspection is needed to determine and maintain its safety. Bridge inspection is a critical step in bridge maintenance. The traditional approach, in which expert inspectors use manual tools and sensors, is highly subjective, as well as time-consuming, laborious, and costly. In recent years, with the development of computer vision (CV) technology, unmanned aerial vehicles (UAVs) can help improve bridge inspection by acquiring image data without obstructing traffic, and by accessing areas that are difficult for humans to reach. Other image acquisition methods may require inspectors to use large equipment or to work under safety risks [1]. Thus, automation and semi-automation of UAV-based inspection work has been advocated. The most significant part of such automated inspection is the identification of the captured bridge images to obtain information on bridge type and components, providing the essential input for a more detailed diagnosis. Additionally, the ability to identify structural details of a bridge allows the UAV to determine its next action, or helps it locate its position on the bridge. The application of a range of artificial intelligence technologies, such as deep learning, computer vision, and image processing, is enabling UAVs to identify and judge targets automatically.
Against this background of limitations in conventional inspection systems, bridge inspection and diagnosis based on artificial intelligence technologies have been widely studied, with applications in nondestructive testing, damage detection and diagnosis, bridge dynamics, and static load evaluation. A suite of intelligent inspection equipment and technologies, such as UAVs and robots, together with data science algorithms such as data mining, computer vision, and deep learning, enhances bridge inspection techniques and effectively improves their accuracy and efficiency [2,3]. As the core technology of intelligent image recognition, computer vision automatically extracts valuable information from image data to qualitatively or quantitatively understand or represent the physical world. Computer vision methods can be used to automate traditional human vision tasks; modern computer vision is largely built on deep learning, whose principles are described in Section 2. The initial efforts to apply computer vision methods started in the 1960s, and attempted to extract information about the shape of objects using edges and primitive shapes [4]. As image pattern techniques developed, computer vision methods began to address more complex perceptual problems, including optical character recognition [5], face recognition [6], and pedestrian and vehicle detection. Currently, the main research areas in computer vision are image classification, object detection, and image segmentation [5,6].
Image classification is one of the most standard applications in computer vision, used mainly for face recognition [7] and object and scene recognition [8]. Image classification assigns an image to a single category, usually corresponding to the most prominent object in the image. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which started in 2010, has driven research into CNN-based algorithms for image recognition [9]. Face recognition, which identifies individuals based on facial features, is also a widely researched area, and its accuracy has been significantly improved [10]. Currently, CNN models commonly used for image classification include AlexNet, VGG-Net, ZF-Net, and GoogLeNet.
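As a minimal illustration of the image classification task described above, a classifier ultimately reduces to selecting the highest-scoring class for an image. The class labels and score vector below are hypothetical placeholders, not outputs of any model in this paper; in practice the scores would come from a trained CNN such as ResNet50.

```python
# Sketch: top-1 image classification as argmax over class scores.
# Labels and scores are hypothetical; a trained CNN would supply the scores.

def classify(scores, labels):
    """Return the label with the highest score (top-1 classification)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

bridge_types = ["girder", "arch", "truss", "cable-stayed"]
logits = [1.2, 0.3, 2.7, -0.5]  # hypothetical CNN output scores
print(classify(logits, bridge_types))  # "truss" has the largest score
```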
Object detection identifies multiple targets in an image and localizes them by outputting bounding boxes. Object detection can be applied to intelligent surveillance [11], autonomous driving [12], and security systems. Commonly used object detection models include Fast R-CNN, YOLO, SSD, MobileNet, and ShuffleNet.
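Detected bounding boxes are conventionally evaluated against ground truth using intersection over union (IoU). A minimal sketch follows; the `[x1, y1, x2, y2]` corner format is an assumption for illustration, not a convention fixed by this paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# Two partially overlapping boxes: overlap 25, union 175, IoU = 1/7
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.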
Image segmentation can be categorized into semantic segmentation, instance segmentation, and panoptic segmentation. Semantic segmentation classifies each pixel in an image into a corresponding class, thus achieving pixel-level classification. Instance segmentation performs pixel-level classification and additionally distinguishes individual instances within each class. Panoptic segmentation is a generalization of semantic and instance segmentation: unlike semantic segmentation, it must distinguish individual object instances, and in addition its target segments are required to be non-overlapping. Image segmentation has been applied to many fields, such as medical imaging, pedestrian detection, and traffic control. Commonly used segmentation models include FCN [13], DeepLab [14], DenseNet [15], and Mask R-CNN [16].
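The pixel-level classification performed by semantic segmentation can be sketched as a per-pixel argmax over class-score maps. The 2×2 image, the three classes, and the score values below are hypothetical; a segmentation network would produce such score maps for full-resolution images.

```python
import numpy as np

# Sketch: semantic segmentation as a per-pixel argmax over C class-score maps.
# scores has shape (C, H, W); values are hypothetical network outputs.
scores = np.array([
    [[0.9, 0.2], [0.1, 0.8]],   # class 0 (e.g., background)
    [[0.1, 0.7], [0.6, 0.1]],   # class 1 (e.g., girder)
    [[0.0, 0.1], [0.3, 0.1]],   # class 2 (e.g., pier)
])
label_map = scores.argmax(axis=0)  # shape (H, W): one class index per pixel
print(label_map)
```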
Combining computer vision technology with remote cameras and UAV-based image acquisition offers a promising non-contact solution for civil infrastructure assessment. Such a system can automatically and consistently transform image or video data into useful information [11]. Computer vision applications for civil infrastructure are now recognized as critical components of improved inspection and monitoring, and many studies have examined the use of CV techniques for the detection and monitoring of civil infrastructure. Spencer and Narazaki et al. [17] reviewed recent advances in CV techniques for civil infrastructure assessment, presenting applications of computer vision to infrastructure inspection and monitoring. Inspection applications include using images to characterize structural components, identify local and overall visible damage, and detect changes. Monitoring applications include measuring static strain and quantifying static and dynamic displacement for modal analysis. Zhu et al. [18] addressed the influence of subjective and empirical factors in manual inspection by using transfer learning and CNNs to automatically analyze and identify large numbers of bridge inspection images, improving the accuracy and efficiency of detection and identification. Suzuki et al. [19] used a CNN to determine the degree of damage to bridge components, and conducted a comparative study of two bridges with multiple classifications of bridge damage based on the CNN. Combined with an analysis of questionnaires completed by manual inspectors, the method was able to effectively identify the degree of damage to bridge components. Liang et al. [20] proposed a deep learning method based on Bayesian optimization to analyze post-disaster inspection images of reinforced concrete bridge systems, and used different convolutional neural networks to achieve an intelligent evaluation of bridge performance at three levels: system failure classification, bridge component detection, and local damage localization. Dung et al. [21] proposed a transfer learning method based on deep convolutional neural networks; by fine-tuning the fully connected layers together with the top convolutional layers of a pretrained VGG16 model, combined with data augmentation, excellent crack detection performance at steel bridge connections was obtained. Kurisu et al. [22] constructed a CNN model to determine the level of damage to bridge members based on image data acquired from manual bridge inspections. Grad-CAM was applied to verify the CNN detection model, providing a visual basis for determining the damage level. By comparing the Grad-CAM heat maps, they determined that the features the CNN used to judge the damage level were consistent with those used by inspection engineers.
Many previous studies have focused on the detection and evaluation of specific damage by applying CNN-based deep learning to image data. However, the automation and semi-automation of bridge inspection work should include not only damage detection, but also the recognition of bridge type and components, in order to provide an efficient approach for UAV bridge inspection systems. In this study, we categorized the recognition tasks involved in automatic bridge inspection into three levels, based on the recognition distance and the applicable computer vision technique. A CNN model for each task was constructed, and its performance and applicability were evaluated.
The remainder of this paper is organized as follows. Section 2 describes multilevel bridge detection and component segmentation, and explains the relationship of the related CNN principles to CV techniques. Sections 3–5 present our CNN model constructions and their results for each of the three levels of recognition tasks: Section 3 covers the classification of bridge type using ResNet50 for the far-distance recognition level; Section 4 covers bridge component detection with YOLOv3 for the mid-distance recognition level; and Section 5 covers the segmentation of bridge components with Mask R-CNN for the close-distance recognition level. Section 6 summarizes the study and discusses future work.
2. CV-Based Framework for Multilevel Bridge Inspection
Figure 1 shows the overall structure of the multilevel structural component detection and segmentation model for bridge inspection. There are three levels of detection/inspection: bridge types, bridge components, and structural members. Image data are assumed to be collected by drones at far to close distances from the target bridge. The acquired images are processed to form the training data sets for the corresponding CV techniques. In this study, specific CNN models were used to verify the appropriateness of this idea.
Figure 2 illustrates the UAV-assisted bridge inspection scenario to which this paper applies.
Most computer vision techniques use a CNN as their backbone, combined with specific algorithms that implement the CV functions. A CNN consists of an input layer, an output layer, and hidden layers in between. The hidden layers comprise convolution layers, activation layers, pooling layers, and fully connected layers.
Figure 3 shows a sample of overall CNN architecture.
The convolution layers perform convolution operations to extract feature information [23]. The pooling layer is placed after the convolution layer and is also known as the subsampling layer; it reduces the width and height of the feature maps while retaining their depth, preventing overfitting during model training and preserving the significant features of the input data. The most commonly used pooling methods are average pooling and max pooling. The fully connected layer is located after the convolution and pooling layers, and takes the role of a classifier in the overall convolutional neural network. Due to the large number of parameters in the fully connected layer, some networks, such as ResNet and GoogLeNet, use a global average pooling layer instead of a fully connected layer to integrate the learned depth features. Finally, a loss function such as the Softmax cross-entropy loss is used as the network objective function to make the final judgment.
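The two pooling operations mentioned above can be sketched in a few lines of numpy. This is illustrative only (real frameworks implement pooling as optimized layers over batched, multi-channel tensors); here a single feature map with even height and width is assumed.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) feature map (H, W even)."""
    h, w = x.shape
    # Group pixels into 2x2 blocks, then take the max of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def global_avg_pool(x):
    """Global average pooling: a single value summarizing the feature map."""
    return x.mean()

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 2.],
                 [2., 2., 3., 1.]])
print(max_pool_2x2(fmap))    # halves width and height
print(global_avg_pool(fmap)) # one scalar per feature map
```

Replacing a fully connected layer with global average pooling, as in ResNet and GoogLeNet, reduces each feature map to one value and thus removes most of the classifier's parameters.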
The activation layer is combined with the pooling and convolution layers to introduce non-linearity during training. The commonly used activation functions are ReLU, Sigmoid, and Tanh. The normalization layer, implemented by the Softmax function, is the last layer; it outputs the results of the network and predicts the classification of the data. Specifically, in this layer a normalized exponential function converts the input scores into the probability that each sample belongs to each class.
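The normalized exponential function described above can be sketched as follows; subtracting the maximum score before exponentiation is a standard numerical-stability detail, not part of the mathematical definition.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # shift for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.1])  # hypothetical network output scores
print(probs, probs.sum())         # probabilities sum to 1
```

The predicted class is the index of the largest probability, which (since softmax is monotonic) is also the index of the largest raw score.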