
Stower-13: A Multi-View Inspection Image Dataset for the Automatic Classification and Naming of Tension Towers

School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(10), 1858; https://doi.org/10.3390/electronics13101858
Submission received: 18 April 2024 / Revised: 8 May 2024 / Accepted: 9 May 2024 / Published: 10 May 2024

Abstract

In high-voltage transmission lines, the tension tower must withstand the tension load of the overhead conductors for long periods, which makes it prone to damage and an important focus of line inspection. In the modern inspection process, operation and maintenance personnel mostly use unmanned aerial vehicles (UAVs) to photograph the various parts of a tension tower, obtain inspection images, and then manually classify and name the resulting mass of images, which is both inaccurate and inefficient. To address these problems, this paper collects a large number of real UAV inspection images of the parts of tension towers and proposes the Stower-13 inspection image dataset, which is used to train classification models for the automatic classification and naming of inspection images. Based on this dataset, the paper also proposes an improved MobileViT model that introduces the Scale-Correlated Pyramid Convolution Block Attention Block (SCPCbam) module, which adds a Convolutional Block Attention Module (CBAM) to each of the four branches of the original Scale-Correlated Pyramid Convolution (SCPC) module, strengthening the extraction of multi-scale image information and improving classification accuracy. A series of experiments shows that the proposed dataset helps the model learn the relevant feature information and that the improved MobileViT model extracts spatial feature information effectively, achieving higher classification accuracy than other models of the same type. The approach can therefore cope with a wide range of conditions that arise in practice and meets the practical need for automatically naming transmission line inspection images.

1. Introduction

In recent times, Unmanned Aerial Vehicles (UAVs) [1,2] have become a mainstay in the realm of transmission line inspections. High-definition camera equipment, mounted on these UAVs, captures detailed images of various components of the transmission towers, thereby facilitating the acquisition of inspection images. Subsequently, operation and maintenance personnel are tasked with the painstaking process of manually identifying the characteristic elements within these images, classifying the images accordingly, and assigning appropriate names based on the unique features identified within each image. This meticulous process plays a pivotal role in enabling the subsequent overhaul and maintenance of the transmission lines.
However, the sheer volume of inspection images presents a daunting challenge. The manual classification and naming of these images is not only inefficient but also subject to the subjective judgement of the operation and maintenance personnel, thereby creating a propensity for errors in image naming. This situation underscores an immediate need for the development of a precise and efficient automatic naming method for inspection images, a solution that could significantly enhance the efficiency, accuracy, and standardization of the naming process undertaken by operation and maintenance personnel.
Despite the urgency of this need, existing automatic naming methodologies continue to grapple with the following issues:
  • There is an acute shortage of comprehensive, publicly available datasets related to transmission tower inspections. Existing datasets predominantly focus on transmission tower power components, such as insulators, vibration-proof hammers, etc., while categorized datasets encompassing all aspects of transmission tower parts remain largely unavailable.
  • The naming accuracy of existing automatic naming algorithms for transmission tower inspection images leaves much to be desired. As a result, operation and maintenance personnel can only use the algorithm’s naming results as a reference, necessitating a manual analysis of inspection image names. This process is not only inefficient, but it also escalates the overall costs associated with transmission line maintenance.
To address the aforementioned challenges, this paper establishes a comprehensive multi-angle tension tower inspection image dataset named Stower-13. The dataset consists of inspection images captured by UAVs in an actual field environment and manually named under the guidance of operation and maintenance personnel. Each image name is derived from the main part shown in the image (e.g., insulator) combined with its location information (e.g., large side or small side), and all names were approved by the responsible staff of a grid company, in line with industry standards. The dataset exhibits the following features:
  • It contains a multitude of UAV inspection images utilized for categorization, covering a series of perspectives of the tower from various heights and angles.
  • The dataset comprises 21 classes of inspection images captured by the UAV at the same tower from different parts and angles, including large and small side corridors. After screening and eliminating low-quality images, such as overexposed and dark images, the dataset encompasses 7073 inspection images of 270 tension-resistant towers. Further detailed information will be introduced in the subsequent sections. This dataset holds considerable significance in the classification task of automatic naming for inspection images.
In this paper, we employ the MobileViT model, trained and tested on Stower-13, and introduce the Scale-Correlated Pyramid Convolution Block Attention Block (SCPCbam) module on top of this network. By comparing against various classification models, we validate the effectiveness of the proposed model. The experimental results show that the Stower-13 inspection image dataset helps the deep learning model learn deep features of the specific tension tower parts to be classified, which further improves classification performance and enables more accurate and efficient classification and naming of inspection images.
The primary contributions of this paper are as follows:
  • This paper presents the construction of a novel tension tower inspection image dataset, named Stower-13, designed specifically for training automatic classification models of tension tower images. Stower-13 comprises diverse views capturing different sections and heights of the tension tower, ranging from its base to its head. All images are captured by UAVs during the real-world inspection process of a transmission line, thus reflecting the actual conditions including lighting variations and obstacles.
  • A refined lightweight MobileViT network model is introduced in this paper for training and testing the dataset. This model incorporates the SCPCbam module to enhance the network’s capacity to extract deep features while simultaneously conserving parameters and computational resources. Consequently, the accuracy of the classification model is improved, offering a novel intelligent framework for the automated naming of inspection images.
  • This paper proposes enhancements to the original loss function to address challenges related to category imbalance and the learning of difficult-to-categorize samples. These improvements aim to refine the model’s ability to handle imbalanced data distributions and to better accommodate challenging instances, thus enhancing the overall performance of the classification system.

2. Related Work

2.1. Automatic Naming of Inspection Images

The initial stage of intelligent inspection predominantly relies on manually naming inspection images, which proves to be labor-intensive and costly. Manual organization and analysis of images consume significant time and resources, and are susceptible to the subjective judgments of operation and maintenance personnel, leading to potential inaccuracies. Consequently, the manual naming approach falls short in addressing the demands of large-scale inspection image classification tasks.
However, the landscape of image classification has been revolutionized with the rapid advancement of deep learning models. Intelligent models have found extensive applications in classification problems, including the categorization and automatic naming of tower parts in the realm of electric power inspection. Such applications are apt for standardizing the processing of large volumes of inspection images, thereby enhancing the efficiency of operation and maintenance personnel.
In recent years, scholars have made noteworthy contributions in this domain. For instance, Silva et al. [3] and Qin et al. [4] used radar point cloud data to build models that incorporate the latitude and longitude of the tower to facilitate the classification of tower parts. Similarly, Feng et al. [5] employed the PointCNN model to train on point cloud data and classify point cloud images. However, this approach is significantly constrained by distance, impeding stable and accurate tower classification.
In other studies, Odo et al. [6] and Souza et al. [7] designed UAV inspection platforms to classify high-resolution aerial images of medium- and low-voltage towers, aiding operation and maintenance personnel in locating and troubleshooting faulty parts. Falahatnejad et al. [8] and Lukačević et al. [9] proposed Generative Adversarial Network (GAN) models to enhance the resolution of UAV inspection images, improving image quality and facilitating the identification of faulty parts.
Michalski et al. [10], Cao et al. [11], and Li et al. [12] manually annotated the location and category of transmission towers in satellite images to implement the classification and naming of towers. Liu et al. [13], Xu et al. [14], Liu et al. [15], and Liao et al. [16] developed intelligent annotation platforms to mark inspection images, which facilitates subsequent line maintenance by operation and maintenance personnel.
Despite these strides, research in the field of inspection image classification and automatic naming remains relatively sparse. The challenges of low classification accuracy and irregular naming results have hindered progress in this domain. In response to these challenges, this paper introduces the Stower-13 dataset, which is standardized for naming. This universal dataset is utilized to extract image features using the deep learning network model. Subsequently, the trained model classifies and names the inspection images, thereby realizing the automatic naming of UAV inspection images.

2.2. Power Inspection Image-Related Datasets

In the past, most inspection images focused on capturing key components of transmission lines, such as vibration-proof hammers and insulators. These datasets were primarily utilized for defect detection in critical components and small target detection tasks. Datasets like InsuGenSet, ID, and KCIGD were developed for this purpose. However, these datasets often fail to capture the complexity of real-world backgrounds and environments in transmission line inspection images. InsuGenSet and ID primarily aimed to enhance the resolution and realism of insulator images, while KCIGD focused on augmenting background complexity compared to CPLID. Typically, these datasets only contain detailed views of individual parts of the pole tower.
To achieve automatic classification and naming of inspection images, the dataset must encompass inspection images from multiple viewpoints and parts for effective deep learning network training. The Stower-13 dataset, utilized in this paper, includes other inspection parts, including insulator images, providing a more realistic representation of the background information of inspection images.
Characterized by diverse viewpoints and high complexity of the pole and tower parts, Stower-13 propels the model to comprehend deeper features of the classification target. This promotes the learning of inspection image target features and the development of automatic naming of inspection images. Simultaneously, it enables the localization of the inspection images’ shooting location, which is beneficial for the subsequent maintenance and repair work by operation and maintenance personnel.

3. Stower-13 Dataset

3.1. Dataset Collection

Currently, much of the research focused on transmission line tower inspection images revolves around datasets tailored for insulator defect detection, such as CPLID, InsuGenSet, and ID. However, these datasets have limited applicability, primarily in classification and target detection tasks specific to particular tower equipment. Moreover, they lack real-world background environments and noise interference.
In our study, we aimed to classify all parts of the pole tower and achieve automatic naming of inspection images. To this end, we utilized a self-constructed dataset comprising UAV inspection images captured in real environments for both training and testing. This dataset comprises 7073 inspection images with dimensions of 5472 × 3078 pixels and includes elevation information to aid subsequent classification tasks. These images were sourced from a power grid company, where inspectors employ UAVs to inspect several 500 kV overhead transmission lines. Notably, images from different inspection sites offer multiple shooting angles and feature complex, real backgrounds, encompassing diverse landscapes such as mountains, forests, farmlands, and urban areas.
Given that the dataset is derived entirely from real-world environments, it is subject to various weather conditions, including sun exposure, cloudy skies, fog, and rainfall. Furthermore, it encompasses complex background noise and variations in light intensity.
In essence, the Stower-13 dataset utilized in this study encapsulates a spectrum of real-world background and noise parameters, offering diverse viewpoints. This aligns with the objectives of our paper, aiming to achieve automatic naming of inspection images. Additionally, it provides a valuable resource for training and testing in subsequent research endeavors within this domain.

3.2. Dataset Composition

The Stower-13 dataset utilized in this paper covers more than 270 tension-resistant towers from 500 kV overhead transmission lines. The tower inspection images adhere to grid inspection shooting standards and were captured from multiple viewpoints. To obtain inspection images suitable for training, the UAV should photograph the tower part as clearly as possible while keeping a safe distance from the shooting site, with sufficient light and without interference from obstacles such as trees. In practice, idealised inspection images cannot always be obtained; we therefore ensured that the UAV kept a safe distance while shooting and filtered the collected images to remove low-quality ones affected by unfavourable factors such as insufficient or excessive lighting, rainy or foggy weather, and large areas of obstruction, retaining only suitable images for training. These images were organized by the operation and maintenance personnel into relevant folders for subsequent training.
Each tension tower comprises 21 types of inspection parts, spanning the lower, middle, and upper phases of the tower. These parts include the large side cross-arm suspension point, large side insulator string, large side wire clip, small side cross-arm suspension point, small side insulator string, small side wire clip, jumper cross-arm suspension point, and voltage limiting ring, among others. Here, “large side” denotes the direction of the drone’s flight, connecting to the next tower, while “small side” refers to the opposite direction of the drone’s flight, connecting with the previous tower, as indicated by the red arrow in the figure below. This distinction is illustrated below in Figure 1.
Each inspection site within the Stower-13 dataset comprises multiple inspection photos, totaling 7074 inspection images. Upon analysis, it was observed that the high similarity among images of the same part across different phases could compromise subsequent classification accuracy. To address this, the inspection images carry height information, which allows images of the same part in different phases to be separated directly by elevation. Consequently, this paper consolidates images of the same part across different phases into one class, resulting in a streamlined classification scheme comprising 13 classes.
Furthermore, the final naming requirements for the inspection images include additional information such as the tower number, which can be directly derived from the image’s latitude and longitude coordinates. It is worth noting that the model employed in this paper exclusively focuses on classifying and automatically naming image parts.
Compared to other existing datasets, the Stower-13 dataset employed in this study exhibits the following distinctive characteristics:
(1) Enhanced standardized category classification: the inspection image dataset covers all inspection parts of 500 kV transmission line tension-resistant towers, photographed by the relevant operation and maintenance personnel in strict adherence to grid inspection specifications.
(2) Varied perspectives: the dataset contains multiple images of each inspection site, captured from diverse shooting angles, which facilitates the extraction of the corresponding features by the classification network.
(3) Authenticity: captured under real weather and environmental conditions, the dataset offers images with intricate background and noise information, which enhances the classification network's resilience to interference.
Figure 2, below, showcases some sample images from the inspection image dataset.
Table 1, below, shows the categories of photographed parts included in the inspection image dataset.
In this paper, the training, validation, and test sets are structured according to an 8:1:1 ratio. Specifically, the training set comprises 5668 images, the validation set consists of 703 images, and the test set encompasses 703 images. The composition of these sets is illustrated in Figure 3 below. Although the validation and test sets contain the same number of images, their contents do not overlap, so this does not affect the subsequent experiments.
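For readers reproducing the split, the following minimal Python sketch shows how an 8:1:1 partition of this kind could be produced with torchvision; the folder layout ("stower13/<class_name>/*.jpg") and the transform are assumptions, not the authors' pipeline.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Minimal sketch of an 8:1:1 split; folder layout and transform are assumptions.
transform = transforms.Compose([
    transforms.Resize((336, 336)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("stower13", transform=transform)

n_total = len(dataset)
n_train = int(0.8 * n_total)
n_val = int(0.1 * n_total)
n_test = n_total - n_train - n_val  # remainder goes to the test set

generator = torch.Generator().manual_seed(0)  # fixed seed for a reproducible split
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test], generator=generator
)
print(len(train_set), len(val_set), len(test_set))
```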

4. Automatic Naming of Inspection Images

4.1. Lightweight Network MobileViT

The Transformer’s self-attention mechanism excels at capturing global information, but its network model demands extensive data training to achieve high accuracy. Unlike CNNs, Transformers lack spatial inductive bias, necessitating a large amount of data for effective learning. Attempts to mitigate this by introducing absolute positional coding encounter challenges when input and output sizes differ, rendering transfer learning ineffective. Interpolation methods, though utilized, often lead to decreased accuracy.
Conversely, CNNs possess an inductive bias, incorporating substantial prior information and requiring relatively fewer parameters for learning. However, CNNs struggle with spatial information acquisition, impacting their performance to some extent.
The MobileViT model represents an enhancement of the original Transformer architecture by integrating a hybrid design of CNN and Transformer elements. This hybrid approach leverages a CNN to introduce spatial induction bias, compensating for the Transformer’s inherent shortcomings. Additionally, incorporating CNN components accelerates network convergence and enhances stability during the training process. Consequently, the MobileViT model, despite its reduced parameter count, achieves high accuracy rates. Notably, compared to traditional lightweight CNN networks, MobileViT exhibits superior accuracy.
At the heart of the MobileViT model lies the MobileViT block, which constitutes the core of its architecture. Figure 4 provides a detailed illustration of this model’s structure.
The MobileViT model adopts a multi-step approach to feature modeling. Initially, the feature map undergoes local modeling via a convolutional layer with a kernel size of n × n, followed by the adjustment of channel numbers using a 1 × 1 convolutional layer. Subsequently, global modeling is executed through the Unfold–Transformer–Fold structure, in which the feature map is divided into patches, typically 2 × 2 in size, each comprising four pixels. Within the self-attention mechanism, each pixel attends only to the pixels occupying the same relative position (shown in the same color in Figure 4) in the other patches.
After global modeling, channel numbers are restored to their original size through another 1 × 1 convolutional layer. The resulting feature map is then combined with the original input feature maps via shortcut branches. Finally, feature fusion is performed using a convolutional layer with a kernel size of n × n to generate the output.
To optimize computational efficiency, MobileViT restricts self-attention so that each pixel position only attends to the corresponding positions in other patches rather than to every pixel. This strategy reduces computational redundancy, which is particularly advantageous for image data, where substantial redundancy exists. By minimizing redundant computations, MobileViT achieves comparable accuracy while significantly reducing parameter requirements.
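To make the Unfold–Transformer–Fold idea concrete, the following simplified PyTorch sketch groups pixels by their position within each 2 × 2 patch and applies a Transformer encoder across patches. It illustrates the structure described above rather than reproducing the official MobileViT implementation; the channel and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Simplified sketch of a MobileViT-style block (not the authors' exact code).

    Local n x n conv -> 1 x 1 conv -> unfold into 2 x 2 patches ->
    Transformer over pixels sharing the same patch position -> fold back ->
    1 x 1 conv -> concatenate with the input -> n x n fusion conv.
    """

    def __init__(self, in_ch, d_model=96, n_layers=2, patch=2, kernel=3):
        super().__init__()
        self.patch = patch
        self.local_conv = nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2)
        self.proj_in = nn.Conv2d(in_ch, d_model, 1)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=2 * d_model, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.proj_out = nn.Conv2d(d_model, in_ch, 1)
        self.fusion = nn.Conv2d(2 * in_ch, in_ch, kernel, padding=kernel // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        y = self.proj_in(self.local_conv(x))          # local modelling + channel change
        d = y.shape[1]
        # Unfold: group pixels by their position inside each p x p patch.
        y = y.reshape(b, d, h // p, p, w // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), d)
        y = self.transformer(y)                        # global modelling across patches
        # Fold back to the original spatial layout.
        y = y.reshape(b, p, p, h // p, w // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(b, d, h, w)
        y = self.proj_out(y)                           # restore channel count
        return self.fusion(torch.cat([x, y], dim=1))   # fuse with the shortcut branch

out = MobileViTBlockSketch(32)(torch.randn(1, 32, 56, 56))
print(out.shape)  # torch.Size([1, 32, 56, 56])
```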
MobileViT integrates the strengths of the CNN's spatial inductive bias and the ViT's global processing, surpassing lightweight networks such as MobileNet with equivalent parameter counts. As a result, MobileViT was selected as the backbone network in this study; it satisfies the requirements of inspection and operation and maintenance personnel for lightweight, low-latency operation on resource-constrained devices, and its adoption facilitates automatic naming of inspection images on mobile devices.

4.2. Convolution Block Attention Module

The Convolutional Block Attention Module (CBAM) is designed to address the challenges faced by traditional neural networks in processing information across various scales, shapes, and orientations. The CBAM introduces two types of attention mechanisms: channel attention and spatial attention. Channel attention enhances feature representations across different channels, while spatial attention focuses on extracting key information from different locations in space. This dual attention mechanism is illustrated in Figure 5 below.
In the channel attention module, to enhance the feature representation of each channel, global maximum pooling and global average pooling are first performed on the input feature map to compute the maximum and average feature values of each channel. The resulting pooled feature vectors are fed into a shared fully connected layer, which learns the attention weight of each channel; the outputs for the max-pooled and average-pooled vectors are combined to obtain the final attention weight vector. The attention weights are kept between 0 and 1 by a Sigmoid activation function and are then multiplied with each channel of the original feature map to obtain the attention-weighted channel feature map.
The spatial attention module emphasizes the importance of different locations in the image. For the input feature map, maximum pooling and average pooling are performed along the channel dimension to generate features with different contextual scales; the pooled features are concatenated along the channel dimension to obtain a feature map containing contextual information at different scales, which is then processed by a convolutional layer to generate the spatial attention weights. A Sigmoid activation function again keeps the weights between 0 and 1. The obtained weights are applied to the original feature map to weight the features at each spatial location, highlighting important regions and reducing the influence of unimportant ones.
CBAM multiplies the above two output features element by element to obtain the final attention enhancement feature. This augmented feature is used as an input to the subsequent network, retaining critical information while suppressing noise and irrelevant information.
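As an illustration of the mechanism just described, the following PyTorch sketch implements a CBAM-style module with channel attention followed by spatial attention, arranged sequentially as in the original CBAM design; the reduction ratio and spatial kernel size are assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of a Convolutional Block Attention Module following the
    description above (channel attention followed by spatial attention);
    details such as the reduction ratio are assumptions."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP applied to global max- and avg-pooled vectors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: conv over the concatenated channel-wise max and mean maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # --- channel attention ---
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * self.sigmoid(avg + mx)
        # --- spatial attention ---
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        attn = self.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * attn

print(CBAM(64)(torch.randn(2, 64, 21, 21)).shape)  # torch.Size([2, 64, 21, 21])
```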

4.3. Scale-Correlated Pyramid Convolution Block Attention Block

Duta et al. [17] introduced pyramidal convolution to address the limitations of CNNs in capturing detailed information and adequately representing the actual receptive field. Pyramidal convolution uses convolution kernels of varying scales to extract multi-scale information. Typically, the size of the convolution kernel decreases sequentially from top to bottom, while the number of channels increases sequentially. This setup facilitates multi-level feature fusion, enhancing the network's ability to capture information at different scales.
Wu et al. [18] further proposed that feature extraction at different scales should be correlated and mutually beneficial. Their approach, Scale-Correlated Pyramid Convolution (SCPC), effectively learns scale-related features by using small-scale features to complement large-scale features, filling in the gaps present in the larger-scale features. The specific calculation process is as follows:
Assuming that $M$ denotes the input of SCPC, a $1 \times 1$ convolutional transform is first performed:

$$M_1 = \mathrm{Conv}_{1 \times 1}(M)$$

Then, $M_1$ is uniformly divided into four feature maps along the channel dimension:

$$M_2^1, M_2^2, M_2^3, M_2^4 = \mathrm{Split}(M_1)$$

Next, multi-scale learning is performed through the scale-correlated structure:

$$M_3^1 = \mathrm{Conv}_{3 \times 3}^{a_1}(M_2^1)$$

$$M_3^i = \mathrm{Conv}_{3 \times 3}^{a_i}(M_2^i + M_3^{i-1}), \quad i \in \{2, 3, 4\}$$

where $\mathrm{Conv}_{3 \times 3}^{a_i}(\cdot)$ is a $3 \times 3$ dilated convolution with dilation rate $a_i$. Finally, the multi-scale features are concatenated and a residual connection is added:

$$O = \mathrm{Conv}_{3 \times 3}(\mathrm{Concat}(M_3^1, M_3^2, M_3^3, M_3^4)) + M$$

where $O$ is the output and $O = H(M)$. In the module used in this paper, a CBAM module is added after each dilated convolution to process information of different scales, shapes, and orientations. The specific SCPCbam structure is shown in Figure 6 below.
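To make the branch structure concrete, the following PyTorch sketch implements the computation above under stated assumptions: the dilation rates (1, 2, 4, 8) are illustrative rather than the paper's values, and CBAM refers to the class sketched in Section 4.2; this is not the authors' released code.

```python
import torch
import torch.nn as nn

class SCPCbamSketch(nn.Module):
    """Sketch of the SCPCbam computation described above: split into four
    branches, scale-correlated 3 x 3 dilated convolutions, a CBAM after each
    branch, concatenation, and a residual connection. The dilation rates
    (1, 2, 4, 8) are assumptions; CBAM is the sketch from Section 4.2."""

    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.conv_in = nn.Conv2d(channels, channels, 1)
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=r, dilation=r) for r in rates
        )
        self.attn = nn.ModuleList(CBAM(c) for _ in rates)
        self.conv_out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, m):
        m1 = self.conv_in(m)                       # 1 x 1 transform
        parts = torch.chunk(m1, 4, dim=1)          # split along the channel dimension
        outs, prev = [], None
        for i, (conv, cbam) in enumerate(zip(self.branches, self.attn)):
            inp = parts[i] if prev is None else parts[i] + prev   # scale correlation
            prev = cbam(conv(inp))                 # dilated conv followed by CBAM
            outs.append(prev)
        o = self.conv_out(torch.cat(outs, dim=1))  # fuse multi-scale features
        return o + m                               # residual connection

print(SCPCbamSketch(64)(torch.randn(1, 64, 42, 42)).shape)  # torch.Size([1, 64, 42, 42])
```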
In general, increasing the kernel size of a convolutional kernel comes with a significant increase in parameters and computational complexity. However, small convolutional kernels used in CNNs cannot effectively cover large regions of the input. To address this, CNNs employ convolutional chains of downsampling layers to progressively reduce input size and increase the network’s perceptual field. Nonetheless, this approach encounters two primary challenges:
Theoretical versus actual receptive field: although the theoretical receptive field may cover a large portion of the input, or even all of it, the actual (effective) receptive field is smaller in practice, leading to a discrepancy between expected and observed results.
Downsampling without sufficient contextual information: downsampling the input without adequate contextual information harms the learning process and the network's recognition capabilities.
Pyramidal convolution offers a promising solution to these challenges. By decreasing kernel size from top to bottom while increasing the number of channels, pyramidal convolution provides sufficient contextual information while achieving a better receptive field.
Additionally, Scale-Correlated Pyramid Convolution improves upon traditional pyramid convolution by employing a structure with four parallel branches to enhance multi-level feature fusion. Furthermore, in this paper, a CBAM module is added after the dilated convolution of each branch. This addition enables the model to focus more on channel-wise feature expression and target positional information, thereby enhancing the overall feature extraction capability of the network.

4.4. Automatic Naming Model for Inspection Images

This paper introduces an improved MobileViT [19] model aimed at increasing the classification accuracy on inspection images in the Stower-13 dataset. The improved model uses MobileViT-S as its foundational feature extraction network and incorporates the improved Scale-Correlated Pyramid Convolution (SCPC) module to strengthen the extraction of multi-scale information from images. In addition, as a lightweight network, the trained improved MobileViT model can perform real-time detection of inspection images.
The specific enhancement entails integrating the SCPCbam module into the original MobileViT model. This module is added after the final 1 × 1 convolutional layer. By performing this step, the feature representation across different channels is strengthened, and key information extraction from various spatial locations is enhanced. This augmentation further bolsters the network’s capability to extract multi-scale information. The specific architecture of the improved MobileViT model is depicted in Figure 7 below.
In this paper, inspection images undergo categorization and naming using the aforementioned network. Initially, inspection images were captured using a drone to photograph the tension-resistant tower. Subsequently, the images underwent preprocessing and were cropped to a size of 336 × 336, serving as input for the classification network. Using the trained weights, the inspection images were predicted to obtain the category and confidence level of each image.
However, due to the high degree of similarity among the top, middle, and bottom phases of the tension-resistant tower’s inspection images, directly using the same part of the image in the classification network does not yield complete name information. To address this, additional attribute information of the inspection image was utilized, including latitude, longitude, and elevation data, to augment the name of the image.
This augmentation process involves utilizing latitude and longitude information to assign the image to a specific tower number. Additionally, by leveraging elevation information, the phase of the tower to which the image belongs is determined. The complete process is illustrated in Figure 8 below.
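The following Python sketch illustrates the naming logic described above: the classifier supplies the part class, the nearest known tower is found from latitude and longitude, and the phase is inferred from elevation. The tower coordinates, elevation bounds, helper names, and naming pattern are all hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class InspectionMeta:
    latitude: float
    longitude: float
    elevation: float  # metres, assumed to be read from the image metadata / flight log

# Hypothetical lookup data: tower positions and per-tower phase elevation bounds.
TOWER_POSITIONS = {"N105": (30.1234, 120.5678)}
PHASE_BOUNDS = {"N105": [("bottom phase", 0, 35), ("middle phase", 35, 55), ("top phase", 55, 999)]}

def nearest_tower(lat: float, lon: float) -> str:
    """Assign the image to the closest known tower (simple squared-distance match)."""
    return min(TOWER_POSITIONS, key=lambda t: (TOWER_POSITIONS[t][0] - lat) ** 2
                                              + (TOWER_POSITIONS[t][1] - lon) ** 2)

def phase_from_elevation(tower: str, elevation: float) -> str:
    for name, lo, hi in PHASE_BOUNDS[tower]:
        if lo <= elevation < hi:
            return name
    return "unknown phase"

def name_image(part_class: str, meta: InspectionMeta) -> str:
    """Combine the classifier output with positional metadata into the final name."""
    tower = nearest_tower(meta.latitude, meta.longitude)
    phase = phase_from_elevation(tower, meta.elevation)
    return f"{tower}_{phase}_{part_class}"

print(name_image("large side insulator", InspectionMeta(30.1235, 120.5677, 48.0)))
# e.g. "N105_middle phase_large side insulator"
```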
The Focal loss function serves as our choice for addressing the sample imbalance issue present in the Stower-13 dataset. Given the dataset’s inherent characteristics, where certain image parts are challenging to classify while others are relatively straightforward, the Focal loss function offers an effective solution.
This loss function mitigates the impact of easily classifiable samples and emphasizes the training of more challenging samples. It achieves this by introducing a modulating factor to the standard cross-entropy loss function. The Focal loss function is defined as follows, with reference to the cross-entropy loss function:
$$FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t)$$

where the weight $\alpha$ helps to deal with the imbalance between categories, $(1 - p_t)^{\gamma}$ is the modulating factor, and $\gamma \geq 0$ is the adjustable focusing parameter. The larger the value of $\gamma$, the more the weight of easy-to-classify samples is reduced, allowing the model to focus its attention on the difficult-to-classify samples.
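A minimal PyTorch sketch of this multi-class Focal loss is given below; the α and γ values shown are common defaults rather than the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sketch of the multi-class Focal loss described above; alpha and gamma
    are typical defaults, not necessarily the paper's settings."""
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                           # predicted probability of the true class
    loss = -alpha * (1 - pt) ** gamma * log_pt  # down-weights easy (high-pt) samples
    return loss.mean()

logits = torch.randn(8, 13)                     # 13 classes, matching Stower-13
targets = torch.randint(0, 13, (8,))
print(focal_loss(logits, targets))
```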

5. Experimental Results and Discussion

5.1. Operating Environment and Parameter Setting

In this study, the MobileViT-S network pre-trained on ImageNet was used as the backbone. In the training phase, all images were resized to 336 × 336 using random cropping and bilinear interpolation, together with uniform data augmentation operations. During training, the inspection images were passed through the MobileViT-S backbone, after which multi-scale information fusion was performed by the improved SCPC module to extract the target feature vectors, which were then classified by the newly added 13-dimensional classification layer. In the testing phase, each image in the test set was classified, the confidence of each class was obtained, and the class with the highest confidence was output as the result.
The experiments were run on an Ubuntu 18.04 operating system with an NVIDIA GeForce RTX 3060 GPU, and the models were trained and tested with Python 3.9.18, PyTorch 1.12.1, and CUDA 11.4.
The parameters of the improved MobileViT training model are shown in Table 2 below:
In the training phase of the improved MobileViT model, freeze training was used to increase training speed. During the freezing stage, the backbone feature extraction network was frozen while the layers related to the classification head remained trainable, and their weights were iteratively optimized. Because the freezing stage involves fewer parameter updates and lower hardware requirements, it avoids overly random network weight initialisation when the dataset contains relatively few samples, improves training efficiency, accelerates model convergence, and reduces training time.
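A minimal sketch of this freeze-training stage is shown below; the attribute names model.backbone and model.classifier are assumptions about the model layout, while the optimizer and learning rate follow Table 2.

```python
from torch.optim import AdamW

# Sketch of the freeze-training stage described above: the backbone is frozen
# and only the classification head is optimized. The attribute names
# (model.backbone, model.classifier) are assumptions about the model layout.
def freeze_backbone(model):
    for p in model.backbone.parameters():
        p.requires_grad = False          # backbone weights are not updated
    for p in model.classifier.parameters():
        p.requires_grad = True           # classification head stays trainable

def build_optimizer(model, lr=2e-4):
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=lr)       # AdamW and lr = 0.0002 follow Table 2
```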

5.2. Data Enhancement

For the inspection image dataset, to avoid overfitting during training, data augmentation was performed on each image before training, including random cropping to 336 × 336, scaling, and random flipping. Figure 9 below shows a specific example of this data augmentation.
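A possible torchvision version of this augmentation pipeline is sketched below; the scale range and flip probability are assumptions.

```python
from torchvision import transforms

# Sketch of the training-time augmentation described above (random crop to
# 336 x 336, scaling, random flipping); the exact scale range and flip
# probability are assumptions.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.6, 1.0)),  # random crop + rescale
    transforms.RandomHorizontalFlip(p=0.5),               # random flipping
    transforms.ToTensor(),
])
```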

5.3. Ablation Experiments and Discussion

5.3.1. Impact of Improved MobileViT Model on Classification Results

Ablation experiments were performed on the Stower-13 dataset with 100 training epochs to test the effect of the original SCPC module and the improved SCPCbam module on the MobileViT model. The specific module settings and results are shown in Table 3 below:
Analyzing the experimental results in the table above, we find that, compared with the original MobileViT model, adding the SCPC module increases the number of parameters by 1.8% and the inference time by only 1.8 ms, while the accuracy rises by 1.21% to 96.52%. Adding the improved SCPCbam module increases the number of parameters by 4.0% and the inference time by 3.8 ms, while the accuracy rises by 2.28% to 97.59%. In the remaining misclassifications, the two most likely categories usually have similar confidence levels, so the operation and maintenance personnel can quickly determine and correct the true category from these two candidates, which meets practical engineering needs. These results show that the SCPC module greatly enhances the network's ability to reason over multi-scale information while effectively mitigating interference from image background information. In addition, introducing the CBAM module after each branch of the original SCPC module further enhances the model's extraction capability, both in the feature representation of each channel and in the precise localization of the target, thus improving the overall classification accuracy.

5.3.2. Effect of Image Resolution on Classification Results

Inspection images of varying sizes encompass diverse fine-grained information. To investigate the impact of these differing image sizes on classification outcomes, we altered the input image dimensions to identify the scale optimally suited for the model employed in this study. The specific image sizes and their corresponding accuracies are meticulously detailed in Table 4, presented below:
The experimental results reveal a clear trend: the model achieves its peak accuracy of 96.52% when the input dimension is set to 336 × 336, whereas alternative input scales lead to a discernible reduction in classification accuracy. We postulate that this arises because certain image categories are highly similar and contain recurring detail information. When the input size is excessively large, the model becomes vulnerable to interference from repetitive detail, leading to a decline in classification accuracy; conversely, an overly small input size impedes the model's ability to learn the detail information effectively, which also reduces accuracy. In light of these findings, we standardized the input size at 336 × 336 for the inspection images used as training and testing samples in all subsequent experiments.

5.3.3. Effect of Different Backbone Networks on Classification Results

To affirm the superior performance of the model utilized in this study on the Stower-13 dataset, we embarked on a series of experiments using various backbone networks. These experiments were conducted without the use of pre-training weights and involved a training epoch of 50. The aim was to compare the accuracy of the model on the validation set post convergence. The specific outcomes of these comparative analyses are meticulously detailed in Table 5, as presented below:
The performance of the different backbone network validation sets is shown in Figure 10 below:
The experimental results clearly demonstrate the significant superiority of the improved MobileViT model used in this study over the original MobileViTV2 model, particularly in terms of classification accuracy. The improved model also exhibits enhanced performance when compared to similar lightweight networks, achieving a classification accuracy of 94.32%. Upon further analysis, it becomes evident that the SCPCbam module employed in this study outperforms traditional attention mechanisms in terms of its capacity to extract feature information from images, thereby attaining a higher classification accuracy. This underscores the efficacy of the improved MobileViT model in handling complex image classification tasks.

5.3.4. Classification Results of Inspection Images under Different Lighting Backgrounds

In real-world scenarios, UAV inspection images are profoundly influenced by weather conditions, leading to complex lighting information. These images are susceptible to issues such as overexposure in bright weather and inadequate lighting in rainy or foggy conditions, which can significantly lower the model’s classification accuracy. To validate the robustness of the proposed MobileViT model against varying background lighting conditions, we conducted tests using inspection images captured under different lighting scenarios. The results of these experiments are illustrated in Figure 11 below.
Figure 11 comprises various sub-figures, where (a) represents the classification effect of inspection images under insufficient lighting conditions, (b) illustrates the classification effect under normal lighting conditions, and (c) depicts the classification effect under overexposure. The prediction results indicate that the MobileViT model employed in this study, with the assistance of the SCPCbam module, can focus more intensively on the subject features of the image. This enables the model to enhance its resistance to background lighting information interference. Consequently, it demonstrates a robust ability to handle inspection images under diverse real-world lighting conditions, maintaining high classification accuracy.

5.3.5. Classification Results of Similar Parts of Inspection Images

The inspection images of the tension tower of a double-circuit transmission line present a unique challenge due to the high degree of similarity between images of the same part taken from the large side and the small side. As illustrated in Figure 12 below, (a) shows images of three parts from the large side, while (b) shows images of the same three parts from the small side. The same parts on the large and small sides differ substantially only in the orientation of certain features, while the main feature components and background information remain similar. This results in a low degree of differentiation between the images, making accurate classification more challenging.
The experimental results indicate that the enhanced MobileViT network utilized in this research significantly improves the perception of target location information via the SCPCbam module. It also enhances the extraction of spatial features, enabling the differentiation of images of the same parts from the large side and small side of the tower. This leads to an improvement in the overall classification accuracy of the model.

6. Conclusions

This paper presents the construction of the Stower-13 inspection image dataset, which encompasses multi-view and multi-background images of different parts of double-circuit tension-resistant towers. The dataset serves as a valuable resource for tasks such as automatic naming during grid inspection and defect detection of inspected parts, thereby contributing to the advancement of intelligent models in this domain.
Furthermore, this paper introduces an improved network model based on the MobileViT architecture, with the primary objective of automating the naming task for inspection images. To enhance the model’s ability to extract detailed information, the SCPCbam module is incorporated, thereby improving the overall classification accuracy of the network.
Through a series of experiments, this paper determines the most suitable cropping size and backbone network by evaluating image classification results with varying resolutions and backbone networks. Subsequent experiments validated the effectiveness of the proposed module, demonstrating its superiority over other networks in extracting detailed information. The overall classification accuracy achieved in the experiments reached 97.59%, confirming the feasibility and efficacy of the proposed module.
In summary, this paper presents a comprehensive approach to automatic naming of inspection images, leveraging the Stower-13 dataset and an improved MobileViT-based network model with the SCPCbam module. The experimental results validate the effectiveness of the proposed approach and contribute to the advancement of intelligent models for inspection tasks in the field.
Indeed, the sheer volume of images generated by automatic inspection processes presents a significant challenge in terms of processing efficiency for operation and maintenance personnel. While the proposed approach in this paper demonstrates promising results, there remains a need to balance the trade-offs between training cost, model complexity, and computational efficiency.
In future research, addressing these challenges could involve developing models with lower training costs and higher accuracy. This could be achieved through innovations in network architectures, optimization algorithms, and training strategies. For example, exploring lightweight network architectures that strike a balance between model complexity and performance could reduce training costs while maintaining high accuracy. Additionally, techniques such as knowledge distillation, transfer learning, and model quantization could further optimize training efficiency without compromising accuracy.
Moreover, leveraging the Stower-13 dataset provided in this paper can serve as a valuable resource for future research endeavors in the field of automatic naming of inspection images and defect detection. Researchers can utilize this dataset to develop and benchmark new models, validate existing approaches, and address real-world engineering needs more effectively.
By continually refining and innovating upon existing methodologies, researchers can contribute to the development of more efficient and accurate solutions for automatic inspection tasks, ultimately enhancing the productivity and effectiveness of operation and maintenance personnel in practical engineering settings.

Author Contributions

Conceptualization, Y.L. and M.X.; methodology, Y.L.; software, Y.L., Y.C. and J.Y.; validation, Y.L., Y.C., K.W. and Z.Y.; formal analysis, E.Z.; investigation, Y.L.; resources, E.Z.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L.; visualization, Y.L.; supervision, Y.L. and M.X.; project administration, Y.L. and E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Du, Q.; Dong, W.; Su, W.; Wang, Q. UAV inspection technology and application of transmission line. In Proceedings of the 2022 IEEE 5th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 23–25 September 2022; pp. 594–597.
  2. Hui, D.; Zhang, Y.; Miao, H.; Wang, Z.; Zhang, G.; Zhao, L. Research on Details Enhancement Method of UAV Inspection Image for Overhead Transmission Line. In Proceedings of the 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), Xiangtan, China, 10–11 August 2019; pp. 27–31.
  3. Silva, F.; Amaro, N. Transmission Tower Classification Using Point Cloud Similarity. In Proceedings of the APCA International Conference on Automatic Control and Soft Computing, Caparica, Portugal, 6–8 July 2022; pp. 609–618.
  4. Qin, X.; Wu, G.; Lei, J.; Fan, F.; Ye, X.; Mei, Q. A novel method of autonomous inspection for transmission line based on cable inspection robot lidar data. Sensors 2018, 18, 596.
  5. Feng, Z.; Wang, X.; Zhou, X.; Hu, D.; Li, Z.; Tian, M. Point Cloud Extraction of Tower and Conductor in Overhead Transmission Line Based on PointCNN Improved. In Proceedings of the 2023 3rd International Symposium on Computer Technology and Information Science (ISCTIS), Chengdu, China, 7–9 July 2023; pp. 1009–1014.
  6. Odo, A.; McKenna, S.; Flynn, D.; Vorstius, J. Towards the automatic visual monitoring of electricity pylons from aerial images. In Proceedings of the VISAPP 2020: 15th International Conference on Computer Vision Theory and Applications, Valletta, Malta, 27–29 February 2020; pp. 566–573.
  7. Souza, B.J.; Stefenon, S.F.; Singh, G.; Freire, R.Z. Hybrid-YOLO for classification of insulators defects in transmission lines based on UAV. Int. J. Electr. Power Energy Syst. 2023, 148, 108982.
  8. Falahatnejad, S.; Karami, A.; Nezamabadi-pour, H. PTSRGAN: Power transmission lines single image super-resolution using a generative adversarial network. Int. J. Electr. Power Energy Syst. 2024, 155, 109607.
  9. Lukačević, I.; Lagator, D. Requirements for High-Quality Thermal Inspection of the Transmission Lines. In Proceedings of the International Conference on Organization and Technology of Maintenance, Osijek, Croatia, 12 December 2022; pp. 84–95.
  10. Michalski, P.; Ruszczak, B.; Lorente, P.J.N. The implementation of a convolutional neural network for the detection of the transmission towers using satellite imagery. In Proceedings of the International Conference on Information Systems Architecture and Technology, Wrocław, Poland, 15–17 September 2019; pp. 287–299.
  11. Cao, G.; Liu, Y.; Fan, Z.; Chen, L.; Mei, H.; Zhang, S.; Wang, J.; Han, X. Research on small-scale defect identification and detection of smart grid transmission lines based on image recognition. In Proceedings of the 2021 IEEE 4th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), Shenyang, China, 19–21 November 2021; pp. 423–427.
  12. Li, H.; Yang, Z.; Han, J.; Lai, S.; Zhang, Q.; Zhang, C.; Fang, Q.; Hu, G. TL-Net: A Novel Network for Transmission Line Scenes Classification. Energies 2020, 13, 3910.
  13. Liu, J.; Wang, X.; Cui, Y.; Huang, X.; Zhang, L.; Zhang, Y. An Intelligent Annotation Platform for Transmission Line Inspection Images. In Proceedings of the International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, Fuzhou, China, 30 July–1 August 2022; pp. 673–681.
  14. Xu, C.; Li, Q.; Zhou, Q.; Zhang, S.; Yu, D.; Ma, Y. Power line-guided automatic electric transmission line inspection system. IEEE Trans. Instrum. Meas. 2022, 71, 1–18.
  15. Liu, Z.; Wu, G.; He, W.; Fan, F.; Ye, X. Key target and defect detection of high-voltage power transmission lines with deep learning. Int. J. Electr. Power Energy Syst. 2022, 142, 108277.
  16. Liao, Y.; Jiang, X.; Zhang, Z.; Zheng, H.; Li, T.; Chen, Y. The Influence of Wind Speed on the Thermal Imaging Clarity Based Inspection for Transmission Line Conductors. IEEE Trans. Power Deliv. 2022, 38, 2101–2109.
  17. Duta, I.C.; Liu, L.; Zhu, F.; Shao, L. Pyramidal convolution: Rethinking convolutional neural networks for visual recognition. arXiv 2020, arXiv:2006.11538.
  18. Wu, Y.H.; Liu, Y.; Zhang, L.; Cheng, M.M.; Ren, B. EDN: Salient object detection via extremely-downsampled network. IEEE Trans. Image Process. 2022, 31, 3125–3136.
  19. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
Figure 1. Schematic diagram of the large and small side orientation.
Figure 2. Examples of inspection image shooting part categories.
Figure 3. Dataset partitioning.
Figure 4. Schematic diagram of MobileViT block.
Figure 5. Schematic diagram of CBAM module.
Figure 6. Schematic diagram of SCPCbam.
Figure 7. Schematic diagram of overall model.
Figure 8. Flowchart of automatic classification and naming of inspection images.
Figure 9. Example of training dataset enhancement.
Figure 10. Validation set results of different backbone networks.
Figure 11. Image classification results under different lighting backgrounds.
Figure 12. Classification results of images of similar parts on different sides.
Table 1. Inspection image shooting part categories.

Label | Class | Label | Class
1 | Full view | 8 | Large side strain clamp
2 | Base | 9 | Small side cross-arm hanging point
3 | Ground line | 10 | Small side insulator
4 | Large side corridor | 11 | Small side strain clamp
5 | Small side corridor | 12 | Jumper cross-arm hanging point
6 | Large side cross-arm hanging point | 13 | Grading ring
7 | Large side insulator | |
Table 2. Table of training parameters.

Parameter | Value | Parameter | Value
Input Shape | 336 × 336 | Learning Rate | 0.0002
Optimizer | AdamW | Loss Function | Focal Loss
Epochs | 100 | Batch Size | 8
Table 3. Ablation experiments of the improved MobileViT on the inspection photos dataset.

MobileViT Backbone | SCPC | SCPCbam | Epochs | Precision/% | Params/M | Time/ms
✓ | | | 100 | 95.31 | 4.95 | 14.8
✓ | ✓ | | 100 | 96.52 | 5.04 | 16.6
✓ | | ✓ | 100 | 97.59 | 5.15 | 18.6
Table 4. Ablation study of different input sizes on the inspection photos dataset.

Image Size | Epochs | Accuracy
224 | 100 | 95.44%
336 | 100 | 96.52%
448 | 100 | 95.08%
Table 5. Comparison of results based on different models.

Model | Epochs | Accuracy
SqueezeNet | 50 | 87.85%
MobileNetV2 | 50 | 93.87%
MobileViTV2 | 50 | 93.25%
MobileNetV3 | 50 | 93.58%
GhostNetV2 | 50 | 94.74%
Ours | 50 | 95.63%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
