Article

Data-Augmented Deep Learning Models for Abnormal Road Manhole Cover Detection

1 Key Laboratory of Electromagnetic Wave Information Technology and Metrology of Zhejiang Province, China Jiliang University, Hangzhou 310018, China
2 Department of Building, School of Design and Environment, National University of Singapore, Singapore 119077, Singapore
* Author to whom correspondence should be addressed.
Sensors 2023, 23(5), 2676; https://doi.org/10.3390/s23052676
Submission received: 2 February 2023 / Revised: 20 February 2023 / Accepted: 27 February 2023 / Published: 1 March 2023
(This article belongs to the Special Issue AI and Big Data Analytics in Sensors and Applications)

Abstract

Anomalous road manhole covers pose a potential risk to road safety in cities. In the development of smart cities, computer vision techniques based on deep learning are used to detect anomalous manhole covers automatically and avoid these risks. One important problem is that a large amount of data is required to train a road anomalous manhole cover detection model. Since the number of anomalous manhole covers is usually small, creating training datasets quickly is a challenge. To expand the dataset and improve model generalization, researchers usually copy samples from the original data and paste them into other data. In this paper, we propose a new data augmentation method that uses data absent from the original dataset as samples, automatically selects the pasting position of manhole cover samples, and predicts the transformation parameters via prior visual experience and perspective transformations, so that the pasted samples more accurately match the actual shape of manhole covers on a road. Without any other data augmentation, our method raises the mean average precision (mAP) by at least 6.8 compared with the baseline model.

1. Introduction

Manhole covers are an integral part of the road, and their working condition is of great importance to the safety of drivers and pedestrians. As cities continue to expand, manhole covers become more widespread and numerous, making manual supervision increasingly difficult.
In recent years, deep learning methods [1,2,3] have been increasingly applied to object detection [4,5,6,7]. Thus, attention has been paid to the automatic detection of abnormal manhole covers [8,9]. Vehicles equipped with video cameras have great potential for anomalous manhole cover detection. Unlike traditional methods [10,11], object detection models based on state-of-the-art convolutional networks require a large amount of training data. However, when collecting data with vehicle cameras, we found very few anomalous manhole covers on the carriageway, far fewer than needed to train the model.
To tackle this problem, we focus on using data augmentation methods [12] to improve the data efficiency of abnormal manhole cover detection. Copy–paste augmentations create copies of samples from the dataset and paste them into other samples, which can alleviate the shortcomings of the original dataset. When using this kind of augmentation, we can adjust hyperparameters, such as the number of objects pasted from the source image and the extent of scale jittering, to find the most effective way to train our deep learning model. Prior work [13] pastes object samples at random or models the surrounding visual context to decide the location and size of pasted objects. In contrast, we propose a new strategy that uses perspective transformation and segmentation to decide the shape and size of manhole covers before pasting them onto the target image, providing significant boosts to object detection models on the manhole cover detection task.
In this paper, a new data augmentation method based on copy and paste is proposed for road abnormal manhole cover detection. The proposed method is evaluated on a self-built abnormal manhole cover dataset. Our contributions are as follows:
  • A sample expansion method for the abnormal manhole cover dataset is proposed. This method obtains a variety of anomalous cover samples from images using geometric information and perspective transformations, providing samples for subsequent data augmentation.
  • Using the extracted abnormal manhole cover samples, we propose a visually guided copy–paste data augmentation method for abnormal manhole covers, namely VGCopy-paste. This method combines prior visual and spatial information to paste anomalous manhole cover samples onto images more intuitively, alleviating sample imbalance and the insufficient number of samples during training.
  • Better performance under different training configurations and epochs compared with current state-of-the-art object detection models: the experimental results show that networks trained with our data augmentation method achieve higher accuracy and faster convergence than networks with the same configuration that do not use it.

2. Materials and Methods

2.1. Data Augmentation for Deep Learning

Object detection is crucial in many downstream tasks. Detecting various objects on the road, such as pedestrians [14], vehicles [15], traffic signs [16], road markings [17], etc., in high-resolution images from the vehicle’s camera is necessary to deploy self-driving cars safely. To train a deep learning model with sufficient generalization ability, the total loss on the validation set should decrease gradually as training proceeds. Many efforts to improve model performance center on changing the backbone architecture, which may increase the number of model parameters and make training more challenging. Instead of increasing model complexity, image data augmentation exploits the semantic invariance of an image to introduce prior knowledge via random horizontal flipping, color jittering, random cropping, and other transformations of the original image to improve performance.
The above image transformations do not affect other images in the dataset, and no additional objects are added to the transformed image. Mixing images is another kind of data augmentation; its main idea is to generalize the training data artificially by blending two images. Zhang et al. proposed Mixup [18], which randomly picks two images, A and B, from the training set (each randomly flipped horizontally), computes their pixel-wise mean in the color channel dimension, and mixes their labels in the same proportion. For object detection, Mixup makes the training samples unnatural and gives them unclear class assignments, which may confuse the model. Sangdoo Yun et al. proposed CutMix [19]. Similarly to Mixup, it mixes two images from the training set; the difference is that instead of mixing at the pixel level, it replaces a rectangular region of the original image with a sample block from another image. Likewise, the Mosaic data augmentation [20] proposed by Alexey Bochkovskiy picks several different images and stitches them into one composite image after random cropping. It increases the diversity of images, enriches image backgrounds, and effectively enlarges the batch size during training, but it is not very friendly to datasets with many small objects. Combining augmentations that lack object awareness can massively inflate the dataset size and, when training data are limited, may lead to overfitting. A minimal sketch of the blending idea is given below.
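The following is a minimal Python sketch of Mixup as described above; the function name and the Beta-distribution parameter are illustrative rather than taken from the cited implementations (CutMix differs in swapping a rectangular block instead of blending pixel-wise).

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.2):
    """Blend two images and their (one-hot) labels; lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)              # mixing coefficient in (0, 1)
    mixed_img = lam * img_a + (1.0 - lam) * img_b   # pixel-wise blend
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_img, mixed_label
```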
Copy–paste augmentation and CutMix have something in common: both paste targets from other images onto the original image. The difference is that the former copies only the precise pixels of the object, not an entire rectangular area containing the object and part of the background. Both Nikita Dvornik et al. [21] and Georgios Georgakis et al. [22] trained deep learning models to extract the semantics of the image scene and determine the pasting position of objects. InstaBoost [23] also trained a deep learning model to extract contextual information from images; however, it does not copy objects from other images but only copies and pastes objects already present in the current image. Golnaz Ghiasi et al. [13] pasted objects at randomly selected positions; they did not model the surrounding context and directly composited targets from different backgrounds into one image regardless of whether the relative size and position of the objects are intuitively appropriate. Unlike [13], we model the context and consider the actual size and angle of the manhole covers pasted onto the image.

2.2. Deep Learning Manhole Cover Detection

The traditional inspection method for road manhole covers is on-site manual inspection, which carries many safety hazards during rush hours. With the development of lidar systems, inspection methods are becoming more integrated and multifunctional. Zhanying Wei et al. [24] used multiple symmetrically arranged cameras combined with high-density lidar to obtain high-density point clouds and ultra-high-resolution ground images; they detected manhole covers by combining a histogram of oriented gradients (HOG) descriptor with symmetry features and a support vector machine (SVM). However, mobile lidar and multiple cameras are very expensive, and processing high-precision images takes a long time, resulting in a long detection cycle. Haotian Ren et al. [25] improved YOLOv4 and proposed a manhole cover detection method that integrates image depth information. Due to the lack of training data, their model was trained on images crawled from the Internet; the uneven quality of these images prevents such data from training a sufficiently robust model. Baoding Zhou et al. [26] used a smartphone fixed on a vehicle to photograph manhole covers, recorded the vibration experienced by the vehicle when passing over them with the phone’s accelerometer and gyroscope, and calculated the instantaneous acceleration; combining the two, they trained a model that judges the settlement amplitude of manhole covers.

3. Data-Augmented Deep Learning Model

A normal manhole cover should sit flush with the road’s surface while maintaining an intact appearance. Damage to the surface of a manhole cover or a deviation in its position poses a threat to the safety of vehicles and pedestrians. To discuss the data augmentation of manhole covers more clearly, we divide abnormal manhole covers into three categories: “Damaged”, “Dislocated”, and “Missing”. Here, “Damaged” denotes cracks or extra holes in the cover’s appearance; “Dislocated” denotes a cover that is not flush with the road’s surface, including bulging and depressed covers; and “Missing” denotes an absent cover, leaving the road surface exposed with a hole. If a dislocated manhole cover is also damaged, it is classified as “Dislocated”.
The overall design of our method is shown in Figure 1. Two phases in the following subsections are presented: abnormal manhole cover sample expansion and visually guided copy–paste data augmentation.

3.1. Abnormal Manhole Cover Sample Expansion

Due to the small number of abnormal manhole covers on the road, mobile devices were additionally used to find and photograph abnormal manhole covers from multiple locations. Because the viewing angle of a mobile device differs from that of the vehicle’s camera, and visual features change with distance, the captured samples must be further transformed before being pasted onto images in the dataset. In prior copy–paste-like work, instance segmentation masks (provided with the dataset or made by us) are used to copy an object from its original location, random transformations are applied, and the copy is pasted into other training images. If a manhole cover image captured by a mobile device were pasted directly into the vehicle-camera dataset without processing, the composite image would look unnatural, and the pasted sample would carry over background features from its source image. The model would then detect the copied-and-pasted covers with high accuracy but fail on real data, because the cover’s shape, angle, and color would not match, which is contrary to the aim of this paper.
The principal idea behind our algorithm is to use ellipses to fit the shape of manhole covers and use perspective transformation to transform them into regular circular manhole covers.
To restore an irregular elliptical manhole cover to a circular one, a standard ellipse is fitted to the boundary of the manhole cover in the image; the transformation matrix is then solved from point pairs via perspective transformation, restoring the image to a circular manhole cover. Equations (1)–(3) show the process of projecting the object onto a new view plane:
$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = H \begin{bmatrix} u \\ v \\ w \end{bmatrix} \quad (1)$$
$$H = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \quad (2)$$
where $u$ and $v$ are the coordinates of the object in the original image, and $H \in \mathbb{R}^{3 \times 3}$ is the transformation matrix. From the equations above, the coordinates of the object in the new view plane can be expressed as follows:
$$x = \frac{x'}{w'} = \frac{a_{11}u + a_{12}v + a_{13}}{a_{31}u + a_{32}v + a_{33}}, \qquad y = \frac{y'}{w'} = \frac{a_{21}u + a_{22}v + a_{23}}{a_{31}u + a_{32}v + a_{33}}$$

Stacking the four point pairs $(p_{ri}, p_{si})$ yields the linear system

$$\begin{bmatrix}
p_{r1x} & p_{r1y} & 1 & 0 & 0 & 0 & -p_{r1x}p_{s1x} & -p_{r1y}p_{s1x} \\
0 & 0 & 0 & p_{r1x} & p_{r1y} & 1 & -p_{r1x}p_{s1y} & -p_{r1y}p_{s1y} \\
p_{r2x} & p_{r2y} & 1 & 0 & 0 & 0 & -p_{r2x}p_{s2x} & -p_{r2y}p_{s2x} \\
0 & 0 & 0 & p_{r2x} & p_{r2y} & 1 & -p_{r2x}p_{s2y} & -p_{r2y}p_{s2y} \\
p_{r3x} & p_{r3y} & 1 & 0 & 0 & 0 & -p_{r3x}p_{s3x} & -p_{r3y}p_{s3x} \\
0 & 0 & 0 & p_{r3x} & p_{r3y} & 1 & -p_{r3x}p_{s3y} & -p_{r3y}p_{s3y} \\
p_{r4x} & p_{r4y} & 1 & 0 & 0 & 0 & -p_{r4x}p_{s4x} & -p_{r4y}p_{s4x} \\
0 & 0 & 0 & p_{r4x} & p_{r4y} & 1 & -p_{r4x}p_{s4y} & -p_{r4y}p_{s4y}
\end{bmatrix}
\begin{bmatrix} a_{11} \\ a_{12} \\ a_{13} \\ a_{21} \\ a_{22} \\ a_{23} \\ a_{31} \\ a_{32} \end{bmatrix}
=
\begin{bmatrix} p_{s1x} \\ p_{s1y} \\ p_{s2x} \\ p_{s2y} \\ p_{s3x} \\ p_{s3y} \\ p_{s4x} \\ p_{s4y} \end{bmatrix} \quad (3)$$
where $x$ and $y$ are the coordinates of the object in the new view plane after the perspective transformation, and $(p_{rix}, p_{riy})$ and $(p_{six}, p_{siy})$ are the coordinates of points $p_{ri}$ and $p_{si}$, $i \in \{1, 2, 3, 4\}$, respectively. Normally, $a_{33}$ is normalized to 1, so the perspective transformation matrix has 8 degrees of freedom; thus, four point correspondences determine a two-dimensional perspective transformation. Given a standard ellipse $e$, to obtain the appropriate four points from the edge of the manhole cover in the picture taken by the mobile device and to determine the diameter of the restored cover, we first fit the manhole cover in the image by adjusting the major axis, minor axis, and inclination angle of $e$; we then use the four vertices $p_{e1}, p_{e2}, p_{e3}, p_{e4}$ of ellipse $e$ to construct its circumscribed rectangle $r$. To ensure that restoring the cover to a circular shape does not lose its appearance features, we form four point pairs between the vertices $p_{r1}, p_{r2}, p_{r3}, p_{r4}$ of the circumscribed rectangle $r$ and the vertices $p_{s1}, p_{s2}, p_{s3}, p_{s4}$ of the square view plane $s$ whose side equals the long side of the rectangle, and we compute the corresponding transformation matrix $H$. Finally, Equations (1) and (3) are used to project the irregular elliptical manhole cover onto the square view plane via $H$, forming a circular manhole cover.
An example of an image taken by the mobile device and the result of recovering the shape of the manhole cover via perspective transformation are presented in Figure 2. The four vertices of bounding rectangle $r$ of ellipse $e$ and the four vertices of the square view plane form four point pairs, $(p_{r1}, p_{s1})$, $(p_{r2}, p_{s2})$, $(p_{r3}, p_{s3})$, and $(p_{r4}, p_{s4})$, used to solve transformation matrix $H$. A code sketch of this rectification step follows.
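Assuming a binary mask of the cover is available (e.g., from manual annotation), the rectification can be sketched with OpenCV as follows; `rectify_cover` and all variable names are illustrative, not the authors' released code, and `cv2.getPerspectiveTransform` solves the same four-point system as Equation (3).

```python
import cv2
import numpy as np

def rectify_cover(image, cover_mask):
    """Fit an ellipse to the cover boundary and warp it into a circle."""
    contours, _ = cv2.findContours(cover_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)          # cover boundary
    (cx, cy), (d1, d2), angle = cv2.fitEllipse(contour)   # ellipse e

    # Four vertices of the rectangle r circumscribing ellipse e.
    box = cv2.boxPoints(((cx, cy), (d1, d2), angle)).astype(np.float32)

    # Square view plane s with side length equal to the long side of r,
    # so the restored circular cover keeps its resolution.
    side = max(d1, d2)
    square = np.array([[0, 0], [side, 0], [side, side], [0, side]],
                      dtype=np.float32)

    # Solve H from the four point pairs (Equation (3)) and warp. In practice
    # the corner ordering of `box` must be matched to that of `square`.
    H = cv2.getPerspectiveTransform(box, square)
    return cv2.warpPerspective(image, H, (int(side), int(side)))
```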

3.2. Visually Guided Copy–Paste Data Augmentation

In this subsection, we address the problem of pasting manhole cover samples. The major steps can be grouped into two stages: (1) the pasting method of abnormal manhole cover samples and (2) the adaptive pasting method combined with scene semantics information.

3.2.1. Pasting Method of Abnormal Manhole Cover Samples

A new dataset containing only circular manhole covers was produced after using the matching point pairs to construct homography matrix $H$ and extracting the covers from the original images. Because the collectors used different devices and the position, size, and angle of covers vary significantly across images, the size differences between manhole covers in the new dataset are large, whereas, in practice, the size of each type of manhole cover is fixed. The round cover samples cannot be pasted onto images directly: a restored round manhole cover can be seen as having been photographed vertically from directly above, while the vehicle camera’s shooting angle is not perpendicular to the ground.
A new homography matrix, $H_2$, is constructed via the perspective transform to paste the circular manhole cover onto the target image. As shown in Figure 3, suppose $m$ and $m'$ are the images of plane $\pi$ in two cameras; here, plane $\pi$ is the circular manhole cover, $m$ is an image taken vertically with a mobile device from directly above $\pi$, and $m'$ is an image of $\pi$ taken from the perspective of the vehicle’s camera. The unit normal vector of plane $\pi$ in the mobile device coordinate system is $n$, and the distance from $\pi$ to the optical center of the mobile device (the coordinate origin) is $d$; plane $\pi$ can then be expressed by Equation (4):
$$\frac{1}{d} n^{T} X_i = 1, \quad X_i \in \pi \quad (4)$$
where $X_i$ denotes the coordinates of the 3D point $X$ in the mobile device’s coordinate system; the coordinates of $X$ in the vehicle camera’s coordinate system, $X_j$, are given by Equation (5):
$$X_j = R X_i + T \quad (5)$$
where $R \in \mathbb{R}^{3 \times 3}$ denotes the rotation matrix, and $T \in \mathbb{R}^{3 \times 1}$ denotes the translation vector. The homography matrix $H$ relating the two camera coordinate systems through the same plane $\pi$ is obtained by combining Equations (4) and (5), as shown in Equation (6):
$$X_j = R X_i + T \cdot \frac{1}{d} n^{T} X_i = \left( R + \frac{1}{d} T n^{T} \right) X_i = H X_i, \qquad H = R + \frac{1}{d} T n^{T} \quad (6)$$
The $H$ above maps 3D points between the two camera coordinate systems; the points must still be projected onto the 2D imaging plane coordinate systems. Equations (7)–(9) show the conversion of $H$ to $H_2$ using the two cameras’ internal parameter matrices:
$$x_i = K_i X_i, \qquad x_j = K_j X_j \quad (7)$$
$$K_j^{-1} x_j = H K_i^{-1} x_i \;\Longrightarrow\; x_j = K_j H K_i^{-1} x_i \quad (8)$$
$$H_2 = K_j \left( R + \frac{1}{d} T n^{T} \right) K_i^{-1} \quad (9)$$
where $K_i$ denotes the internal parameter matrix of the mobile device, $K_j$ denotes the internal parameter matrix of the vehicle’s camera, and $H_2$ denotes the homography matrix between $m$ and $m'$.
When processing images, we adjust the pose of the virtual camera via rotation matrix $R$ and translation vector $T$ so that it moves from the shooting direction of the mobile device to that of the vehicle’s camera. Equations (10) and (11) give $R$ and $T$:
$$R = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \quad (10)$$
$$T = \begin{bmatrix} 0 \\ d\sin\theta \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ d' - d\cos\theta \end{bmatrix} = \begin{bmatrix} 0 \\ d\sin\theta \\ d' - d\cos\theta \end{bmatrix} \quad (11)$$
where $d'$ denotes the distance between the vehicle’s camera and the manhole cover, and the subscripts $i$ and $j$ refer to the shooting directions of the mobile device and the vehicle’s camera, respectively.
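A brief NumPy sketch of assembling $H_2$ from Equations (9)–(11) follows; the translation vector mirrors the reconstruction of Equation (11) above, and all names and the plane normal are assumptions for illustration.

```python
import numpy as np

def plane_homography(theta, d, d_prime, K_i, K_j):
    """Homography mapping the top-down cover image into the vehicle view.

    theta   -- inclination between the two viewing directions (radians)
    d       -- distance from the cover plane to the mobile device
    d_prime -- distance from the vehicle's camera to the cover
    """
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(theta), -np.sin(theta)],
                  [0.0, np.sin(theta),  np.cos(theta)]])    # Equation (10)
    T = np.array([[0.0],
                  [d * np.sin(theta)],
                  [d_prime - d * np.cos(theta)]])           # Equation (11)
    n = np.array([[0.0], [0.0], [1.0]])  # cover-plane normal, mobile frame

    H = R + (T @ n.T) / d                # Equation (6)
    return K_j @ H @ np.linalg.inv(K_i)  # Equation (9)
```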

3.2.2. Adaptive Pasting Method Combined with Scene Semantics Information

The mobile devices used by the image collectors all differ, and the vehicle’s camera was not calibrated before shooting. Since the images contain no depth information, $K_i$, $K_j$, $d$, and $d'$ cannot be computed directly. In principle, a deep learning model could predict these parameters from the paste position, but training it would require a large amount of data that is unavailable in real-world scenarios. Observing the images taken by the vehicle’s camera, the contour of a manhole cover located farther from the camera cannot be observed clearly because of the shooting angle, and only covers on part of the road show a complete contour. Thus, only the lower half of the image is used for pasting during data augmentation. Parameter $d$ is set to 1, and the perspective transformation parameters $\theta$, $d'$, position, and others are manually adjusted and recorded so that the pasted manhole cover resembles the original cover in the image; meanwhile, we set
$$K_i = K_j = \begin{bmatrix} \frac{w}{2} & 0 & \frac{w}{2} \\ 0 & \frac{w}{2} & \frac{h}{2} \\ 0 & 0 & 1 \end{bmatrix}$$
where $w$ and $h$ are the width and height of the manhole cover image. As shown in Figure 4, a total of 21 groups were recorded, and the data were fitted with a least-squares polynomial so that an appropriate homography matrix can be generated automatically for any position in the image taken by the vehicle’s camera. In practice, we built a processing tool with a UI to simplify recording and adjusting the perspective transformation parameters; manually pasting manhole cover samples with the tool proceeds at about 1–2 samples per minute and took about 30 min in total.
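The least-squares fitting step might look as follows in NumPy; the sample arrays are placeholders standing in for the 21 manually recorded groups, not the paper's measurements.

```python
import numpy as np

# Normalized vertical paste positions (fraction of image height) and the
# manually tuned parameters recorded for them -- placeholder values only.
y_pos = np.array([0.67, 0.73, 0.79, 0.85, 0.91])
theta = np.array([80.0, 74.0, 68.0, 61.0, 55.0])   # inclination, degrees
dist = np.array([14.0, 11.0, 8.5, 6.0, 4.0])       # d', arbitrary units

# Fit each parameter with a least-squares polynomial of the position.
theta_fit = np.polynomial.Polynomial.fit(y_pos, theta, deg=2)
dist_fit = np.polynomial.Polynomial.fit(y_pos, dist, deg=2)

def params_at(y):
    """Predict perspective parameters for a paste position y in [0.67, 0.91]."""
    return float(theta_fit(y)), float(dist_fit(y))
```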
To solve the problem of the unnatural appearance of composite images, we followed the approach of [27] to fuse the color of the manhole cover to the color of the background image. The implementation details are described in Section 4.3.
A lightweight deep learning semantic segmentation algorithm for road segmentation is proposed to automatically paste the manhole cover at an appropriate position in images. The architecture of the road segmentation model is shown in Table 1. MobileNetV3-small is used as the backbone network, and feature fusion is performed in a UNet-like decoder via skip connections. Because depthwise separable convolutions have few parameters and little computation, our road segmentation model is deep enough to extract image features while remaining small and efficient, so predicting where a manhole cover can be pasted costs few computing resources. Object samples can then be pasted automatically using the road segmentation model and the fitted perspective transformation parameters.
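A rough PyTorch sketch of such a network is shown below: a MobileNetV3-small encoder with a UNet-style decoder and skip connections. The split points and channel widths are illustrative and do not reproduce Table 1 exactly; treat it as a sketch of the design, not the authors' network.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class MobileUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        feats = mobilenet_v3_small(weights=None).features
        # Split the backbone at its stride transitions to expose skip maps.
        self.stage1 = feats[:1]    # 1/2 resolution, 16 ch
        self.stage2 = feats[1:2]   # 1/4 resolution, 16 ch
        self.stage3 = feats[2:4]   # 1/8 resolution, 24 ch
        self.stage4 = feats[4:9]   # 1/16 resolution, 48 ch
        self.up3 = self._up(48, 24)
        self.up2 = self._up(24 + 24, 16)
        self.up1 = self._up(16 + 16, 16)
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16 + 16, num_classes, 1))

    @staticmethod
    def _up(c_in, c_out):
        """Bilinear upsample followed by a 3x3 conv."""
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s1 = self.stage1(x); s2 = self.stage2(s1)
        s3 = self.stage3(s2); s4 = self.stage4(s3)
        d3 = self.up3(s4)                        # back to 1/8
        d2 = self.up2(torch.cat([d3, s3], 1))    # back to 1/4
        d1 = self.up1(torch.cat([d2, s2], 1))    # back to 1/2
        return self.head(torch.cat([d1, s1], 1)) # full resolution logits
```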
The VGCopy-paste algorithm is detailed in Algorithm 1.
Algorithm 1 VGCopy-Paste Data Augmentation for Road Manhole Cover Detection
(1)  Input the abnormal manhole cover image taken by a mobile device;
(2)  Fit the cover edge with an ellipse and use Equations (2) and (3) to calculate H;
(3)  Extract the abnormal manhole cover sample;
(4)  Input the image taken by the vehicle’s camera;
(5)  Find the pasting position of the cover sample in the vehicle-camera image via the road segmentation model;
(6)  Calculate H2 via Equation (9) and paste the sample into the image;
(7)  Repeat steps 1 to 6 until all images in the training set are augmented.
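Wiring together the pieces sketched earlier (`rectify_cover`, `params_at`, `plane_homography`, and a road mask from the segmentation model), one pass of Algorithm 1 for a single image might look like the following; the glue code, sampling, and blending are all illustrative.

```python
import random
import cv2
import numpy as np

def vg_copy_paste(road_img, cover_rgba, road_mask, K):
    """Paste one rectified cover sample into a vehicle-camera image."""
    h, w = road_img.shape[:2]

    # 1. Sample a paste position on the segmented road, restricted to the
    #    allowed vertical band (0.67-0.91 of the image height).
    ys, xs = np.nonzero(road_mask)
    band = (ys > 0.67 * h) & (ys < 0.91 * h)
    y, x = random.choice(list(zip(ys[band], xs[band])))

    # 2. Look up perspective parameters for this position and build H2
    #    (d is set to 1, as in the text; K_i = K_j = K).
    theta_deg, d_prime = params_at(y / h)
    H2 = plane_homography(np.radians(theta_deg), 1.0, d_prime, K, K)

    # 3. Warp the RGBA cover sample, translate it to (x, y), and blend
    #    it over the background via its alpha channel.
    Tr = np.array([[1, 0, x], [0, 1, y], [0, 0, 1]], dtype=float)
    warped = cv2.warpPerspective(cover_rgba, Tr @ H2, (w, h))
    alpha = (warped[..., 3:] > 0).astype(road_img.dtype)
    return road_img * (1 - alpha) + warped[..., :3] * alpha
```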

4. Discussion

4.1. Experimental Data

A road manhole cover dataset was built to train the road manhole cover detection and classification model by continuously shooting along roads with an engineering vehicle equipped with a fixed-angle camera. A Hikvision DS-TCC200 was selected as the vehicle’s camera; the camera sensor’s acquisition frequency was 50 Hz, and the shooting frequency was set to once per second.
On roads with different vegetation coverage, shadows cast on manhole covers affect their visual characteristics. A total of 22,872 photos were collected, covering roads with high vegetation, roads around buildings, and urban highways without buildings, so that the experimental data cover all kinds of roads. In addition, 82 close-range images of abnormal manhole covers in several outdoor scenes were taken with handheld phones for data augmentation.
Figure 5 and Table 2 show the classification of manhole covers. “Dislocated” represents raised or depressed covers; “Damaged” indicates cracks or holes on the surface of a cover; “Missing” indicates that the entire cover is missing and the inspection passage is exposed; “Normal” indicates that the cover’s appearance is complete and it is in the correct position. In total, 60% of the abnormal manhole covers (before data augmentation) were used as training samples and 40% as testing samples. Examples of VGCopy-paste are shown in Figure 6.
UESTC All-Day Scenery [28] (UAS) is an all-day outdoor road image segmentation dataset. It contains a total of 6380 images across four conditions: dusk, night, rainy, and sunny weather. The performance of the road segmentation models was evaluated on the UAS test set.

4.2. Models

In our experiments, FCN [29], UNet [30], FastestDet [31], YOLOv5 [32], CenterNet [33], Retinanet [34], and YOLOv7 [35] were adopted as baselines. As the road manhole cover dataset is far smaller than typical public datasets, FastestDet was tested as a lightweight network, and of the YOLOv5 series only YOLOv5s was tested to avoid overfitting. In addition, CenterNet with a DLA34 backbone and Retinanet with a depth of 34 were tested. For the YOLOv7 series, only the basic model without expansion was tested. FCN, UNet, and Mobile-UNet were evaluated on the UAS test set. Baselines were retrained using the corresponding open implementations. The experimental results show the impact of our data augmentation on model performance.

4.3. Implementation Details

The experiments were implemented with the PyTorch deep learning framework. The only parameter modified was the number of training epochs; the other augmentations in each model’s configuration were disabled, and the remaining default configurations provided by the authors were kept. FastestDet, YOLOv5, and YOLOv7 were trained for 100 epochs, while CenterNet and RetinaNet were trained for 50 epochs. The trained models were tested on the test dataset, and the mean average precision (mAP) was used as the evaluation metric to compare detection models.
For our VGCopy-paste, the range of locations where manhole covers are pasted must be restricted to avoid the visual features of different cover types becoming too similar with increasing distance. As shown in Figure 7, the pasting range along the height of the background image is set from 0.67 to 0.91, and the corresponding inclination angle of the covers ranges from 55° to 80° in our experiments.
Each road segmentation model was retrained on the UAS dataset for 300 epochs, and the mean IOU was used as the evaluation metric of model performance.
When pasting a target, we applied RainNet [27] to make the manhole cover blend into the background more realistically. RainNet treats image harmonization as a style transfer problem; we adopted a 512 × 512 resolution model trained on the iHarmony4 [36] dataset. As shown in Figure 8, the images taken by the onboard camera have a resolution of 1920 × 1080, so the composite image after data augmentation cannot be used directly as RainNet input; a mask separating foreground from background must also be provided. We therefore extract a 512 × 512 image block centered on the manhole cover as the RainNet input and subtract the original background image from the composite image to obtain the foreground mask; a sketch of this step follows.
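A sketch of this input preparation, assuming the composite and the original background are available as arrays, might look as follows; the helper name is illustrative.

```python
import numpy as np

def rainnet_inputs(composite, background, cx, cy, size=512):
    """Crop a size x size block around the pasted cover and build its mask."""
    h, w = composite.shape[:2]
    x0 = int(np.clip(cx - size // 2, 0, w - size))
    y0 = int(np.clip(cy - size // 2, 0, h - size))
    crop = composite[y0:y0 + size, x0:x0 + size]
    bg = background[y0:y0 + size, x0:x0 + size]
    # Foreground = pixels where the composite differs from the background.
    mask = (np.abs(crop.astype(np.int16) - bg.astype(np.int16))
            .sum(axis=2) > 0).astype(np.uint8)
    return crop, mask
```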
RainNet uses the foreground mask to transfer the style of the background to the foreground, and this image harmonization alleviates the incompatibility between the pasted manhole cover and the background. Figure 9 compares results with and without RainNet.

4.4. Main Result and Analysis

The performance of detection models trained with different methods is shown in Table 3, and the evaluation results of the road segmentation models are shown in Table 4. The AP50 and AP75 evaluation metrics are adopted from mAP [37]. With VGCopy-paste, the performance of every tested model improved to varying degrees on the road manhole cover detection task. Our Mobile-UNet performs better on the road segmentation task with fewer parameters.
Our method was compared with a simple random copy–paste method to clarify the decisive role that prior visual experience plays in VGCopy-paste. Neither method lets a pasted sample overlap an original sample in the image. The experiment was run on multiple object detection models. As shown in Table 3, when the images taken by the mobile phone are simply added to the training set without any copy–paste augmentation, the generalization ability of each model is weaker than with either of the other two methods. Both random copy–paste and VGCopy-paste copy more samples into the dataset, and the performance of all models improves; our method achieves better performance while adding the same number of samples as random copy–paste.
In addition, we evaluated various data augmentation methods on YOLOv5s; the comparison results are shown in Table 5. Since manhole covers occupy only a small part of the image and the appearance of distant covers resembles the noise blocks introduced by Cutout, the model is disturbed by the introduced noise. HSV augmentation adjusts the contrast and saturation of the image to make details more prominent; however, in abnormal manhole cover detection, the street background is complex and the pavement is full of fine lines and markings, so changes in background color can cause cover features to be confounded with pavement features. Although Mixup can blend multiple images into new ones, it does not work well for detection tasks in complex scenes, and Random affine performs poorly as well; neither method addresses the low performance caused by data imbalance. Copy–paste-based methods effectively alleviate data imbalance, but random pasting without controlling the paste range confuses the model, because covers are pasted at impossible positions and overlap the complex background.
Network performance was validated with and without VGCopy-paste at different training epochs to verify the role of VGCopy-paste throughout training. The experimental results are shown in Figure 10. VGCopy-paste increases the training efficiency of the model: the precision gain of the original YOLOv5 and YOLOv7 starts to shrink at 30 epochs, and accuracy then rises only slowly, whereas with VGCopy-paste both models achieve better performance in less training time.
The strength of VGCopy-paste is determined by the paste range and the number of pastes per image. Models with different parameters were retrained and validated on the test set; the experimental results are shown in Table 6. As the number of pasted samples increases, the AP50 of YOLOv5 decreases slightly, but its AP75 drops significantly. When the number of pasted samples is two, YOLOv7’s AP50 reaches its best value, but its AP75 drops by four points. The reason may be that pasted covers introduce characteristics of their source images, changing the image distribution of the training set; as the number of pasted covers grows, the gap between the training-set and test-set distributions widens until its impact on the model exceeds the contribution of the extra samples to data balance. The AP improved when the lower bound of the paste range was raised from 0.58 to 0.67, but dropped rapidly as it approached 0.8. In our experiments, the number of pasted samples is set to 1 and the paste range to 0.67–0.91. The higher the paste position, the smaller the cover becomes after the perspective transformation; when the cover is too small, the visual characteristics of different cover types are so similar that the model cannot effectively distinguish them by appearance.

5. Conclusions

In this paper, a deep learning framework for abnormal manhole cover detection in urban systems is presented, and a new data augmentation method is proposed to alleviate the shortage of training samples for abnormal road manhole covers. After extracting manhole cover samples with the proposed extraction method, even a simple copy–paste algorithm, as in [13], can greatly improve the model. Beyond that, a perspective transformation guided by the prior visual experience provided by road semantic segmentation, with parameters predicted by linear fitting, pastes different types of abnormal cover samples onto the target image, yielding a better-performing model than random copying. According to the experimental results, the proposed data augmentation method successfully increases the number of abnormal manhole cover samples in the training set and thereby enhances abnormal manhole cover detection performance: with the deep learning models tested, AP50 reached over 82 and is at least 6.8 higher than the baseline model under the same configuration.
During training, deep-learning-based data augmentation is more time-consuming than traditional methods. However, combining prior visual experience with data augmentation generates training data for abnormal manhole covers in a more intuitive way: the same sample can generate data with different appearance features in different images, greatly increasing data efficiency on limited datasets.
In future work, we will focus on combining camera self-calibration with computer-vision-based data augmentation to remove the restriction that the camera must have a fixed viewing angle, achieving augmentation for different angles in different scenes and further improving the accuracy of object detection tasks.

Author Contributions

Formal analysis, X.Y. and D.Z.; investigation, X.Y., K.Y. and D.Z.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y., K.Y., D.Z., L.Y., D.Q. and H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Projects in Zhejiang Province, grant numbers 2020C03104, 2022C01082, and 2022C01005.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Part of the code used in this study is freely available at an open-source version control website: https://github.com/XavierYu404/Data-Augmented-Deep-Learning-Model-for-Abnormal-Road-Manhole-Cover-Detection.git (accessed on 26 February 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yan, K.; Chen, X.; Zhou, X.; Yan, Z.; Ma, J. Physical Model Informed Fault Detection and Diagnosis of Air Handling Units Based on Transformer Generative Adversarial Network. IEEE Trans. Ind. Inform. 2022, 19, 2192–2199.
  2. Yan, K.; Guo, X.; Ji, Z.; Zhou, X. Deep transfer learning for cross-species plant disease diagnosis adapting mixed subdomains. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 1.
  3. Yan, K.; Dai, Y.; Xu, M.; Mo, Y. Tunnel surface settlement forecasting with ensemble learning. Sustainability 2019, 12, 232.
  4. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 213–229.
  5. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7479–7489.
  6. Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  8. Liu, W.; Chen, D.; Yin, P.; Yang, M.; Li, E.; Xie, M.; Zhang, L. Small manhole cover detection in remote sensing imagery with deep convolutional neural networks. ISPRS Int. J. Geo-Inf. 2019, 8, 49.
  9. Qing, L.; Yang, K.; Tan, W.; Li, J. Automated detection of manhole covers in MLS point clouds using a deep learning approach. Int. Symp. Geosci. Remote Sens. 2020, 2020, 1580–1583.
  10. Fu, X.; Jing, W.; Guiran, C.; Huiyu, Z. Manhole cover intelligent detection and management system. In Proceedings of the International Conference on Electronic, Odessa, Ukraine, 23–27 May 2016; Volume 40, pp. 986–988.
  11. Guo, X.; Liu, B.; Wang, L. Design and implementation of intelligent manhole cover monitoring system based on NB-IoT. In Proceedings of the International Conference on Robots & Intelligent System, Haikou, China, 15–16 June 2019; pp. 207–210.
  12. Shorten, C.; Khoshgoftaar, T. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60.
  13. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. Comput. Vis. Pattern Recognit. 2021, 2021, 2918–2928.
  14. Wu, J.; Zhou, C.; Yang, M.; Zhang, Q.; Yuan, J. Temporal-context enhanced detection of heavily occluded pedestrians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–20 June 2020; pp. 13427–13436.
  15. Hu, H.; Cai, Q.; Wang, D.; Lin, J.; Sun, M.; Krähenbühl, P.; Darrell, T.; Yu, F. Joint monocular 3D vehicle detection and tracking. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5389–5398.
  16. Serna, C.; Ruichek, Y. Traffic signs detection and classification for European urban environments. IEEE Trans. Intell. Transp. Syst. 2019, 21, 4388–4399.
  17. Feng, Z.; Li, M.; Stolz, M.; Kunert, M.; Wiesbeck, W. Lane detection with a high-resolution automotive radar by introducing a new type of road marking. IEEE Trans. Intell. Transp. Syst. 2019, 20, 2430–2447.
  18. Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412.
  19. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6022–6031.
  20. Bochkovskiy, A.; Wang, C.; Liao, H. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  21. Dvornik, N.; Mairal, J.; Schmid, C. Modeling Visual Context is Key to Augmenting Object Detection Datasets. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Volume 11216, pp. 364–380.
  22. Georgakis, G.; Mousavian, A.; Berg, A.C.; Kosecka, J. Synthesizing training data for object detection in indoor scenes. Robot. Sci. Syst. 2017, 479–488.
  23. Fang, H.; Sun, J.; Wang, R.; Gou, M.; Li, Y.; Lu, C. InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 682–691.
  24. Wei, Z.; Yang, M.; Wang, L.; Ma, H.; Zhong, R. Customized mobile LiDAR system for manhole cover detection and identification. Sensors 2019, 19, 2422.
  25. Ren, H.; Deng, F. Manhole cover detection using depth information. J. Phys. Conf. Ser. 2021, 1865, 012037.
  26. Zhou, B.; Zhao, W.; Guo, W.; Li, L.; Zhang, D.; Mao, Q.; Li, Q. Smartphone-based road manhole cover detection and classification. Autom. Constr. 2022, 140, 104344.
  27. Ling, J.; Xue, H.; Song, L.; Xie, R.; Gu, X. Region-aware adaptive instance normalization for image harmonization. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9357–9366.
  28. Zhang, Y.; Chen, H.; He, Y.; Ye, M.; Cai, X.; Zhang, D. Road segmentation for all-day outdoor robot navigation. Neurocomputing 2018, 314, 316–325.
  29. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. Pattern Recognit. Image Process. 2015, 9351, 234–241.
  31. FastestDet Release v1.0. Available online: https://github.com/dog-qiuqiu/FastestDet (accessed on 3 August 2022).
  32. YOLOv5 Release v6.2. Available online: https://github.com/ultralytics/yolov5/tree/v6.2 (accessed on 3 August 2022).
  33. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
  34. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
  35. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
  36. Cong, W.; Zhang, J.; Niu, L.; Liu, L.; Ling, Z.; Li, W.; Zhang, L. DoveNet: Deep Image Harmonization via Domain Verification. Comput. Vis. Pattern Recognit. 2019, 8391–8400.
  37. Wu, P.; Li, H.; Zeng, N. FMD-Yolo: An efficient face mask detection method for COVID-19 prevention and control in public. Image Vis. Comput. 2022, 117, 104341.
Figure 1. Overview of our proposed VGCopy-paste data augmentation for road manhole cover detection. We used the image taken by the vehicle’s camera and the abnormal manhole cover image taken by the mobile device as the input. Using the road semantic segmentation algorithm to obtain prior visual information, that is, the road segmentation map, we found the corresponding perspective transformation parameters for pasting and finally paste the extracted manhole cover samples onto the road.
Figure 2. The process of recovering the shape of the manhole cover based on perspective transformation. The convex and concave abnormal well covers may have the same complete appearance as normal well covers, but they are often not aligned with the road surface. Therefore, the thickness of the convex well covers and the concave depth of the concave well covers need to be considered when extracting their samples.
Figure 3. Example of a damaged manhole cover under the mobile equipment coordinate system and vehicle camera coordinate system.
Figure 4. Linear fitting perspective transformation parameters, from left to right, are the width and angle of the pasted manhole cover sample and the distance from the vehicle’s camera.
Figure 5. Example of three categories of abnormal manhole covers: (A,B) denote “Dislocated”, (C) denotes “Damaged”, and (D) denotes “Missing”.
Figure 6. Examples of VGCopy-paste.
Figure 7. Example of pasting range of a manhole cover. The red area in the image is the pasting range.
Figure 8. Example of image block capture and mask generation. The red ellipse represents the pasting position of the manhole cover.
Figure 9. The situation of using or not using the image harmonization algorithm for the same manhole cover in different scenes, where (a1,b1) do not use the image harmonization algorithm, and (a2,b2) use the image harmonization algorithm.
Figure 10. Comparison of model performance in different training epochs, where “Ours” denotes the model that used VGCopy-paste during training, and “None” denotes the model that did not use any data augmentation method during training.
Table 1. The architecture of the lightweight Mobile-UNet network.
| No. | Input | Operator | Description | No. Parameters |
|---|---|---|---|---|
| 1 | 512 × 512 × 3 | conv2d, 3 × 3 | First conv layer | 464 |
| 2 | 256 × 256 × 16 | bneck, 3 × 3 | Inverted ResBlock, s = 2 | 744 |
| 3 | 128 × 128 × 16 | bneck, 3 × 3 | Inverted ResBlock, s = 2 | 3864 |
| 4 | 64 × 64 × 24 | bneck, 3 × 3 | Inverted ResBlock, s = 1 | 5416 |
| 5 | 64 × 64 × 24 | bneck, 5 × 5 | Inverted ResBlock, s = 2 | 13,736 |
| 6 | 32 × 32 × 40 | bneck, 5 × 5 | Inverted ResBlock, s = 1 | 57,264 |
| 7 | 32 × 32 × 40 | bneck, 5 × 5 | Inverted ResBlock, s = 1 | 57,264 |
| 8 | 32 × 32 × 40 | bneck, 5 × 5 | Inverted ResBlock, s = 1 | 21,968 |
| 9 | 32 × 32 × 48 | bneck, 5 × 5 | Inverted ResBlock, s = 1 | 29,800 |
| 10 | 32 × 32 × 48 | bneck, 5 × 5 | Inverted ResBlock, s = 2 | 91,848 |
| 11 | 16 × 16 × 96 | bneck, 5 × 5 | Inverted ResBlock, s = 1 | 294,096 |
| 12 | 16 × 16 × 96 | bneck, 5 × 5 | Inverted ResBlock, s = 1 | 294,096 |
| 13 | 16 × 16 × 96 | shortcut + upsample | Connect to No. 12 layer | 117,824 |
| 14 | 32 × 32 × 24 | shortcut + upsample | Connect to No. 9 layer | 39,376 |
| 15 | 64 × 64 × 16 | shortcut + upsample | Connect to No. 4 layer | 56,608 |
| 16 | 128 × 128 × 64 | upsample | upsample to 512 × 512 × 2 | 84,800 |

Total No. parameters: 1,169,168
Table 2. Classification of manhole covers and the quantity of various covers collected.
| Category | No. of Samples Shot by the Vehicle’s Camera | No. of Samples Shot by Smartphone | No. of Samples after Data Augmentation |
|---|---|---|---|
| Dislocated | 202 | 36 | 1549 |
| Damaged | 81 | 26 | 1035 |
| Missing | 41 | 20 | 809 |
| Normal | 4781 | 0 | 4781 |
Table 3. Road manhole cover detection on our test set, where “Without copy-paste” denotes the abnormal manhole cover images that are added to the training set without additional processing. “Random copy-paste” denotes extracting abnormal manhole cover samples and random pasting onto the training set. “VGCopy-paste” denotes extracting abnormal manhole covers and using VGCopy-paste. AP50 and AP75 evaluation metrics are adopted from mAP.
| Model | Method | Params | AP50 | AP75 |
|---|---|---|---|---|
| FastestDet | Without copy–paste | 4.74 M | 63.2 | 30.7 |
| FastestDet | Random copy–paste | 4.74 M | 67.4 | 31.2 |
| FastestDet | VGCopy-paste | 4.74 M | 76.1 | 46.0 |
| YOLOv5s | Without copy–paste | 7.02 M | 61.9 | 30.9 |
| YOLOv5s | Random copy–paste | 7.02 M | 78.3 | 41.7 |
| YOLOv5s | VGCopy-paste | 7.02 M | 80.0 | 54.6 |
| CenterNet-DLA34 | Without copy–paste | 20.17 M | 64.1 | 22.0 |
| CenterNet-DLA34 | Random copy–paste | 20.17 M | 70.5 | 38.7 |
| CenterNet-DLA34 | VGCopy-paste | 20.17 M | 70.9 | 39.9 |
| Retinanet-D34 | Without copy–paste | 31.52 M | 50.5 | 17.8 |
| Retinanet-D34 | Random copy–paste | 31.52 M | 70.3 | 24.3 |
| Retinanet-D34 | VGCopy-paste | 31.52 M | 70.7 | 25.3 |
| YOLOv7 | Without copy–paste | 37.21 M | 73.4 | 35.9 |
| YOLOv7 | Random copy–paste | 37.21 M | 80.5 | 50.6 |
| YOLOv7 | VGCopy-paste | 37.21 M | 81.8 | 56.5 |
Table 4. Road segmentation models evaluated on the UAS test set.
| Model | Params | meanIOU |
|---|---|---|
| FCN | 35.31 M | 91.9 |
| UNet | 4.32 M | 96.5 |
| Ours | 1.17 M | 97.9 |
Table 5. Performance of different data augmentation methods in the abnormal road manhole cover detection task.
| Method | AP50 | AP75 |
|---|---|---|
| Mixup | 60.1 | 33.1 |
| Cutout | 46.6 | 13.4 |
| Random affine | 63.1 | 27.8 |
| HSV augmentation | 40.8 | 13.1 |
| Random copy–paste | 78.3 | 41.7 |
| VGCopy-paste | 80.0 | 54.6 |
Table 6. Sensitive analysis on different hyperparameter configurations. AP50 and AP75 evaluation metrics were adopted from mAP.
| No. of Pasted Samples | Model | Pasted Range | AP50 | AP75 |
|---|---|---|---|---|
| 1 | YOLOv5 | 0.58–0.91 | 79.5 | 49.9 |
| 1 | YOLOv5 | 0.67–0.91 | 80.0 | 54.6 |
| 1 | YOLOv5 | 0.77–0.91 | 75.4 | 48.9 |
| 1 | YOLOv5 | 0.67–0.91 | 80.0 | 54.6 |
| 2 | YOLOv5 | 0.67–0.91 | 79.7 | 51.5 |
| 3 | YOLOv5 | 0.67–0.91 | 79.0 | 49.8 |
| 1 | YOLOv7 | 0.58–0.91 | 81.6 | 55.8 |
| 1 | YOLOv7 | 0.67–0.91 | 81.8 | 56.5 |
| 1 | YOLOv7 | 0.77–0.91 | 80.0 | 54.9 |
| 1 | YOLOv7 | 0.67–0.91 | 81.8 | 56.5 |
| 2 | YOLOv7 | 0.67–0.91 | 82.8 | 52.5 |
| 3 | YOLOv7 | 0.67–0.91 | 81.0 | 51.2 |