Article

Underwater Dam Crack Image Classification Algorithm Based on Improved VanillaNet

Sisi Zhu, Xinyu Li, Gang Wan, Hanren Wang, Shen Shao and Pengfei Shi
1 Hubei Technology Innovation Center for Smart Hydropower, Wuhan 430019, China
2 College of Artificial Intelligence and Automation, Hohai University, Changzhou 213200, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(7), 845; https://doi.org/10.3390/sym16070845
Submission received: 6 May 2024 / Revised: 9 June 2024 / Accepted: 21 June 2024 / Published: 4 July 2024
(This article belongs to the Section Computer)

Abstract: In the task of classifying images of cracks in underwater dams, symmetry serves as a crucial geometric feature that helps distinguish cracks from other structural elements. However, the asymmetric distribution of positive and negative samples in underwater dam crack image datasets creates a long-tail problem. This asymmetry, coupled with the subtle nature of crack features, leads to inadequate feature extraction by existing convolutional neural networks and thus reduced classification accuracy. To address these issues, this paper improves VanillaNet. First, the Seesaw Loss function is introduced to tackle the long-tail problem in classifying underwater dam crack images, enhancing the model's ability to recognize tail categories. Second, the Adaptive Frequency Filtering Token Mixer (AFF Token Mixer) is incorporated to improve the model's capture of crack image features and raise classification accuracy. Finally, label smoothing is applied to prevent overfitting to the training data and improve generalization. Experimental results demonstrate that the proposed improvements significantly enhance classification accuracy for underwater dam crack images: the optimized algorithm achieves average accuracy 1.29% and 0.64% higher than the relatively stronger models ConvNeXtV2 and RepVGG, respectively, and 2.66% higher than VanillaNet. The improved model also outperforms other mainstream networks.

1. Introduction

Over the past few decades, artificial neural networks have made significant advances. Convolutional neural networks (CNNs) have achieved remarkable results in optical image recognition, prompting researchers to apply them increasingly to tasks such as image detection, segmentation, and classification. However, the underwater dam crack image dataset has an uneven distribution of positive and negative samples, and the crack features in these images are subtle. Directly applying existing CNNs to the classification of underwater dam crack images therefore yields insufficient feature extraction and low recognition accuracy, ultimately degrading classification performance.
In 2012, the introduction of AlexNet [1], a deep convolutional neural network developed by Alex Krizhevsky and coworkers that consists of eight learned layers and achieved state-of-the-art performance in large-scale image recognition benchmarks, signaled a new phase in the development of convolutional neural networks. Building on this foundation, ResNet introduced identity mappings through shortcut connections, enabling very deep networks to be trained and to perform at a high level across a wide range of computer vision applications, such as image classification, object detection, and semantic segmentation. The integration of manually designed modules in these models, coupled with the continuous increase in network complexity, has undoubtedly enhanced the representational capabilities of deep neural networks, sparking a research trend toward training networks with ever more complex architectures to achieve higher performance.
In addition to convolutional architectures, Dosovitskiy et al. [2] introduced the Transformer architecture to image recognition tasks, demonstrating its potential with large-scale training data. Zhai et al. [3] investigated the scaling laws of the visual transformer architecture, achieving an impressive top-1 accuracy of 90.45% on the ImageNet dataset [4], indicating that, like convolutional networks, deeper Transformer architectures tend to perform better. Wang et al. [5] further proposed extending the depth of the Transformer to 1000 layers to achieve higher accuracy. Liu et al. [6] re-examined the design space of neural networks, introducing ConvNeXt, which achieved performance comparable to state-of-the-art transformer architectures.
While well-optimized deep and complex networks have achieved satisfactory performance, their increasing complexity poses challenges for deployment. For instance, the shortcut operations in ResNet consume a significant amount of off-chip memory traffic, as they require the merging of features from different layers. Moreover, complex operations such as the axial shifts in AS-MLP [7] and the shifting window self-attention in Swin Transformer [8] necessitate sophisticated engineering implementations, including rewriting CUDA code.
These challenges call for a simplification of neural network design paradigms. However, the development of ResNet [9] appears to have led to the abandonment of architectures that rely solely on convolutional layers (without additional modules such as shortcut connections), primarily because the performance gains from simply stacking convolutional layers fell short of expectations. Plain networks without shortcut connections suffer from the vanishing gradient problem, so a 34-layer plain network performs worse than an 18-layer one. Moreover, simple networks such as AlexNet and VGGNet have been significantly outpaced by deep and complex networks such as ResNets and ViT. Consequently, the design and optimization of neural networks with simple architectures have received less attention. Addressing this gap by developing models that are both high-performing and concise would be of great value.
VanillaNet [10] is an innovative neural network architecture that improves the model’s efficiency and performance by simplifying the network structure. It avoids overly deep network layers, shortcut connections, and complex operations such as self-attention mechanisms, resulting in a series of simplified networks that address inherent complexity issues, making it very suitable for environments with limited resources. To train VanillaNet, a “deep training” strategy has been designed to mitigate the negative impacts brought by the simplified network. This approach starts with several layers that include nonlinear activation functions. As training progresses, these nonlinear layers are gradually eliminated, making the network converge more easily while maintaining inference speed.
In this paper, an improved underwater dam crack image classification algorithm is proposed based on VanillaNet, raising the model's classification accuracy on underwater dam crack images. First, the Seesaw Loss function is introduced into VanillaNet, effectively addressing the long-tail problem in the classification of underwater dam crack images. Second, the Adaptive Frequency Filtering Token Mixer is introduced to strengthen the model's ability to capture crack image features, thus improving classification accuracy. Finally, a label smoothing technique is applied to avoid overfitting to the training data and to improve the generalization performance of the model. Experiments show that the optimized algorithm improves average accuracy by 1.29% and 0.64% over ConvNeXtV2 and RepVGG, the two strongest baselines, respectively, and by 2.66% over VanillaNet.

2. Background Study

Detection of underwater dam cracks using visible light image processing technology has emerged as a crucial method. This technology offers convenience, intuitiveness, efficiency, cost-effectiveness, and non-destructiveness, all of which cater to the demands of crack detection and classification of dam structures. Nonetheless, there remain certain challenges in underwater dam crack detection and classification. On one hand, the presence of absorption, scattering, and convolution effects results in underwater dam crack images exhibiting significant uneven brightness and blurred details. Consequently, even within deep learning methodologies, effectively classifying cracks by comparing sample images with test images proves to be challenging. On the other hand, the uneven distribution of positive and negative samples within the underwater dam crack image dataset gives rise to a long-tail problem. This issue causes conventional machine learning models to prioritize categories with more samples during the training phase, often overlooking categories with fewer samples. Consequently, the model may fail to adequately learn features of these under-represented categories, leading to reduced classification accuracy in these instances. Conversely, with ample samples available for head categories, models can readily acquire sufficient features, thereby frequently achieving superior classification performance. Addressing these challenges holds significant importance for the maintenance and repair of dam cracks.
The evolution of image classification technology has shifted from early rule-based methods to contemporary machine-learning-based approaches. Initially, methods relied on manually crafted feature extraction techniques like edge detection and texture analysis, resulting in restricted classification performance, especially in intricate scenarios [11]. However, the progression of machine learning, notably deep learning, has heralded significant advancements in image classification technology. The integration of convolutional neural networks (CNNs) in particular has played a pivotal role in driving this progress. CNNs possess the ability to autonomously learn feature representations from images, thereby diminishing the necessity for manual feature engineering and enhancing classification accuracy.
Deep complex networks exhibit outstanding performance, yet their high complexity poses deployment challenges, such as the considerable off-chip memory traffic demanded by shortcut operations in ResNet, and the intricate engineering needed for advanced operations in AS-MLP and Swin Transformer. This underscores a need for rethinking simplified neural network designs. Nonetheless, the development of networks like ResNet suggests that architectures solely reliant on convolutional layers are gradually being phased out due to limited performance gains and degradation in simple networks attributable to the vanishing gradient problem. Neural networks with simpler architectures, such as AlexNet and VGGNet, have lagged behind deep complex networks in performance metrics. Hence, the pursuit of designing and optimizing neural network models that are both high-performing and concise holds considerable significance.
Existing image classification algorithms have focused on improving detection accuracy. Ruan et al. [12] proposed a simple and effective neural network attention module that extracts information from the input features at different scales to improve classification accuracy and efficiency. Cheng et al. [13] proposed a repeated attention mechanism that effectively combines multi-scale features, yielding higher recognition accuracy on multi-domain datasets. Song et al. [14] proposed a self-cascading neural network, built on an improved convolutional neural network, that considers both global and local feature information to improve network performance. Although these algorithms achieve high performance, they also introduce a large computational overhead.
This paper presents an enhanced algorithm for classifying underwater dam crack images, building upon VanillaNet. This approach integrates the Seesaw Loss and Adaptive Frequency Filtering Token Mixer (AFF Token Mixer) into VanillaNet. These enhancements effectively tackle the long-tail problem and improve the model’s capability to capture features of crack images. Moreover, the application of label smoothing techniques mitigates overfitting to the training data. Consequently, these refinements lead to heightened accuracy in underwater dam crack image classification.

3. Improved VanillaNet Image Classification Algorithm for Underwater Dam Crack Images

3.1. VanillaNet Network Model

VanillaNet, as a simplified and efficient neural network model, has demonstrated excellent performance in numerous image classification tasks. Its advantage in inference speed makes it particularly outstanding in scenarios requiring rapid processing. The streamlined network structure reduces computational complexity while ensuring strong classification performance, especially on standardized datasets.
However, when VanillaNet is applied to specific and complex scenarios, such as the identification and classification of underwater dam cracks, its performance declines. The difficulty of recognizing underwater dam crack images is significantly higher than that of general image classification tasks. These images are often affected by variable lighting, murky water, visual obstructions, and diverse crack morphologies. These factors greatly increase the complexity of the classification task. Due to the simplicity of VanillaNet’s structure, it may lack sufficient representational capability to handle these complex features, leading to reduced classification accuracy. Therefore, it is necessary to make some modifications to VanillaNet to better suit the characteristics of the underwater dam crack dataset.
The network structure of VanillaNet is shown in Figure 1. Most advanced classification network architectures are composed of three main parts: the stem block, the main body, and the fully connected layer. The stem block is responsible for converting the input image’s three channels into multiple channels and performing downsampling. The main body carries out feature extraction, typically divided into four stages, each consisting of multiple identical units (blocks), which reduce the resolution of the feature maps while increasing the number of channels after each stage. Finally, there is a fully connected layer that outputs the final classification results. VanillaNet follows this general framework but differs in that each stage contains only one network layer, achieving an extreme simplification of the network.
Next, let us delve into the architecture of VanillaNet using the 6-layer example. The stem block employs a 4 × 4 × 3 × C convolutional layer with a stride of 4, which maps the image from 3 channels to C channels of features, following a common approach. In stages 1, 2, and 3, max pooling layers with a stride of 2 are used to reduce the size of the feature maps, with the channel count doubling at each stage. In stage 4, the channel count is not increased, as it is followed by an average pooling layer. The final layer is a fully connected layer that outputs the classification results. Each convolutional layer has a kernel size of 1 × 1 to maintain the feature map information while minimizing computational cost. An activation function is applied after each 1 × 1 convolutional layer. To simplify the training process, batch normalization is added after each layer. For VanillaNets with different numbers of layers, additional blocks are incorporated at each stage, and the detailed architectural specification is shown in Table 1. Notably, VanillaNet does not include skip connections, as experiments have shown that adding shortcuts provides minimal performance improvement. This also simplifies the architecture, making it extremely easy to implement without branches and additional convolutional blocks.
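For concreteness, the following PyTorch sketch assembles a VanillaNet-6-style network with the channel widths from Table 1. It is an illustrative reconstruction, not the authors' implementation: the deep-training activation schedule and layer merging are omitted, and placing a ReLU after every layer is a simplification.

```python
import torch
import torch.nn as nn

class VanillaNet6Sketch(nn.Module):
    """Illustrative VanillaNet-6-style layout: one layer per stage."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # Stem: 4 x 4 convolution with stride 4 maps 3 channels to 512.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 512, kernel_size=4, stride=4),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
        )

        def stage(c_in: int, c_out: int, pool: bool) -> nn.Sequential:
            # Each stage is a single 1 x 1 convolution, optionally followed
            # by 2 x 2 max pooling that halves the feature-map resolution.
            layers = [nn.Conv2d(c_in, c_out, kernel_size=1),
                      nn.BatchNorm2d(c_out),
                      nn.ReLU(inplace=True)]
            if pool:
                layers.append(nn.MaxPool2d(2))
            return nn.Sequential(*layers)

        self.stage1 = stage(512, 1024, pool=True)    # 56 -> 28
        self.stage2 = stage(1024, 2048, pool=True)   # 28 -> 14
        self.stage3 = stage(2048, 4096, pool=True)   # 14 -> 7
        self.stage4 = stage(4096, 4096, pool=False)  # channel count kept
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(4096, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)  # (B, 512, 56, 56) for a 224 x 224 input
        x = self.stage4(self.stage3(self.stage2(self.stage1(x))))
        return self.head(x)

# A 224 x 224 RGB image maps to num_classes logits.
logits = VanillaNet6Sketch()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```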
Overall, the VanillaNet design philosophy emphasizes simplicity and efficiency while maintaining high performance, yielding significant advantages in many real-world applications. The architecture strikes a balance between parameter count and computational cost, enhancing portability and practicality: its wide layers provide enough parameters to extract image features effectively despite the shallow depth. In addition, the deep training strategy and adaptive activation functions enhance the network's nonlinear representation. These properties make VanillaNet an efficient, portable, and practical neural network, bringing new ideas and directions to the field of deep learning.

3.2. Deep Training Strategy

The main idea of the deep training strategy is to train, at the start of training, two convolutional layers with an activation function between them rather than a single convolutional layer. As training progresses, the activation function gradually degenerates into an identity mapping, and at the end of training the two convolutional layers can be merged into one to reduce inference time. This kind of idea is also widely applied in convolutional neural networks (CNNs).
To combine an activation function $A(x)$ (which can be a common function such as ReLU or Tanh) with the identity mapping, the modified activation function is expressed as

$$A'(x) = (1 - \lambda)A(x) + \lambda x$$

Here, $\lambda$ is a hyperparameter that balances the degree of nonlinearity of the modified activation function $A'(x)$. Denoting the current training epoch as $e$ and the total number of deep training epochs as $E$, we set $\lambda = e/E$. At the beginning of training ($e = 0$), $A'(x) = A(x)$, so the network has strong nonlinear capability. As training converges, $A'(x) = x$, meaning there is no longer an activation function between the two convolutional layers. How these two layers are then merged is demonstrated below.
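A minimal sketch of this schedule in PyTorch follows; the module and method names are illustrative, not taken from the official VanillaNet code.

```python
import torch.nn as nn

class DeepTrainingActivation(nn.Module):
    """A'(x) = (1 - lambda) * A(x) + lambda * x, with lambda = e / E."""

    def __init__(self, act: nn.Module = None):
        super().__init__()
        self.act = act if act is not None else nn.ReLU()
        self.lam = 0.0  # fully nonlinear at the start of training

    def set_epoch(self, epoch: int, total_epochs: int) -> None:
        # Called once per epoch; lambda grows linearly from 0 to 1,
        # so the activation decays to the identity by the final epoch.
        self.lam = epoch / total_epochs

    def forward(self, x):
        return (1.0 - self.lam) * self.act(x) + self.lam * x
```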
First, each batch normalization layer is merged with its preceding convolutional layer into a single convolutional layer, a common technique. Denote the weight matrix of a convolutional layer with $C_{in}$ input channels, $C_{out}$ output channels, and kernel size $k \times k$ as $W \in \mathbb{R}^{C_{out} \times (C_{in} \times k \times k)}$, and its bias as $B \in \mathbb{R}^{C_{out}}$. The scaling factor, shift, mean, and standard deviation in batch normalization are denoted $\gamma, \beta, \mu, \sigma \in \mathbb{R}^{C_{out}}$, respectively. The merged weights and biases are

$$W'_i = \frac{\gamma_i}{\sigma_i} W_i, \qquad B'_i = \frac{(B_i - \mu_i)\,\gamma_i}{\sigma_i} + \beta_i$$

where the subscript $i \in \{1, 2, \ldots, C_{out}\}$ denotes the $i$-th output channel.
After merging the convolutional and batch normalization layers, the two $1 \times 1$ convolutions are merged. Let $x \in \mathbb{R}^{C_{in} \times H \times W}$ and $y \in \mathbb{R}^{C_{out} \times H' \times W'}$ be the input and output features; the convolution operation can be expressed as

$$y = W * x = W \cdot \mathrm{im2col}(x) = W \cdot X$$

where $*$ denotes convolution, $\cdot$ denotes matrix multiplication, and $X \in \mathbb{R}^{(C_{in} \times 1 \times 1) \times (H' \times W')}$ is the input unfolded via the im2col operation into a matrix matching the shape of the convolutional kernel. For $1 \times 1$ convolutions, im2col reduces to a simple reshape, since there are no overlapping sliding windows. Therefore, denoting the weight matrices of the two convolutional layers as $W_1$ and $W_2$, the merge of the two layers without an activation function between them is expressed as

$$y = W_1 * (W_2 * x) = W_1 \cdot W_2 \cdot \mathrm{im2col}(x) = (W_1 \cdot W_2) \cdot X$$
Thus, the two 1 × 1 convolutions can be merged into one without any loss in inference speed. It is worth noting that although the deep training technique increases training FLOPs, it does not affect the inference cost, which is particularly important in engineering applications.
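The two merging steps can be sketched in PyTorch as follows. This is a simplified illustration: in practice $\sigma_i$ is $\sqrt{\mathrm{Var}_i + \epsilon}$, and both convolutions are assumed to already carry biases (as they do after BN fusion).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # W'_i = (gamma_i / sigma_i) W_i;  B'_i = (B_i - mu_i) gamma_i / sigma_i + beta_i
    sigma = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / sigma
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

@torch.no_grad()
def merge_1x1_convs(first: nn.Conv2d, second: nn.Conv2d) -> nn.Conv2d:
    # y = W1 * (W2 * x), where `first` carries W2 and `second` carries W1.
    # For 1 x 1 kernels im2col is a reshape, so the weight matrices multiply
    # directly and the biases compose as W1 b2 + b1.
    w1 = second.weight.flatten(1)  # (C_out, C_mid)
    w2 = first.weight.flatten(1)   # (C_mid, C_in)
    merged = nn.Conv2d(first.in_channels, second.out_channels,
                       kernel_size=1, bias=True)
    merged.weight.copy_((w1 @ w2).view(second.out_channels, first.in_channels, 1, 1))
    merged.bias.copy_(w1 @ first.bias + second.bias)
    return merged
```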

3.3. Seesaw Loss

Due to the asymmetric distribution of positive and negative samples in underwater dam crack image datasets, a long-tail problem arises. To address this, the algorithm in this study employs the Seesaw Loss function [15]. The long-tail issue in image classification refers to the situation where the number of samples in minority classes (tail classes) is significantly less than that in majority classes (head classes), resulting in an asymmetric distribution. This phenomenon is common in natural and real-world data distributions and poses a challenge: standard machine learning models tend to focus on classes with more samples during training while neglecting those with fewer samples.
In the case of a long-tail distribution, the model’s performance is often significantly affected because the model may fail to effectively learn the features of the tail classes, leading to low classification accuracy in these classes. In contrast, the head classes, due to the abundance of samples, allow the model to easily learn sufficient features, thus often performing well in classification.
The long-tail recognition task has recently received increasing attention as these issues are closer to real-world applications. A representative solution to this problem is loss reweighting. Loss reweighting methods use different reweighting strategies based on the statistical data of each class to adjust the losses of different classes. Other common methods involve rebalancing the number of instances for each class, such as repetition factor sampling and class-balanced sampling, both of which are based on the number of samples per class. Different sampling strategies can be employed at different training stages to form a multi-stage training process. A recent study proposed a decoupled training procedure, initially using natural sampling to train a good representation network, followed by class-balanced sampling to fine-tune the classifier. There are also attempts to modify the classifier to improve the performance of tail classes, for example, using different classifiers for different class groups or using two classifiers trained by different data samplers.
Classifiers trained with the widely used cross-entropy (CE) loss tend to be highly biased on long-tail datasets, resulting in significantly lower accuracy for tail classes compared to head classes. The primary reason is that the gradients from positive samples are overwhelmed by those from negative samples. Therefore, the algorithm in this paper uses Seesaw Loss to mitigate the overwhelming effect of negative sample gradients on tail classes and to compensate for the gradients of misclassified samples, thereby avoiding false positives.
First, recall the widely used cross-entropy (CE) [16] loss in the current framework:

$$L_{ce}(z) = -\sum_{i=1}^{C} y_i \log(\sigma_i), \qquad \sigma_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

where $z = [z_1, z_2, \ldots, z_C]$ and $\sigma = [\sigma_1, \sigma_2, \ldots, \sigma_C]$ are the classifier's logit scores and probabilities, respectively, and $y_i \in \{0, 1\}$, $1 \le i \le C$, is the one-hot label. For a training sample belonging to class $i$, the gradients with respect to $z_i$ and $z_j$ ($j \ne i$) are

$$\frac{\partial L_{ce}(z)}{\partial z_i} = \sigma_i - 1, \qquad \frac{\partial L_{ce}(z)}{\partial z_j} = \sigma_j$$
This shows that a sample from class $i$ penalizes the classifier for class $j$ in proportion to the predicted probability of class $j$. When the number of instances of class $i$ greatly exceeds that of class $j$, the classifier for class $j$ receives penalties from most samples and very little positive signal, which suppresses the predicted probability of class $j$ and leads to low classification accuracy for tail classes. Figure 2 graphically shows how the cross-entropy loss function and Seesaw Loss differ on the long-tail problem.
To mitigate this issue, a feasible solution is to reduce the negative-sample gradient that head classes impose on tail classes in the gradient expression above. Seesaw Loss is therefore defined as

$$L_{seesaw}(z) = -\sum_{i=1}^{C} y_i \log(\hat{\sigma}_i), \qquad \hat{\sigma}_i = \frac{e^{z_i}}{\sum_{j \ne i}^{C} S_{ij}\, e^{z_j} + e^{z_i}}$$
The gradient with respect to a negative class $j$ then becomes

$$\frac{\partial L_{seesaw}(z)}{\partial z_j} = S_{ij}\, \frac{e^{z_j}}{e^{z_i}}\, \hat{\sigma}_i$$
Here, $S_{ij}$ acts as an adjustable balancing factor between classes. By carefully designing $S_{ij}$, Seesaw Loss adjusts the penalty that positive class $i$ imposes on negative class $j$. Seesaw Loss defines $S_{ij}$ through a mitigation factor and a compensation factor, as follows:

$$S_{ij} = M_{ij} \cdot C_{ij}$$
The mitigation factor $M_{ij}$ reduces the penalty on tail class $j$ based on the proportion of instances between tail class $j$ and head class $i$. The compensation factor $C_{ij}$ increases the penalty on class $j$ when tail class $i$ is misclassified as head class $j$. The mitigation and compensation factors are detailed in the following paragraphs.
Throughout the training process, the number of instances of each class $i$ is accumulated online and denoted $N_i$. As shown in Figure 3, for a training sample with positive label $i$, the mitigation factor adjusts the penalty on a negative label $j$ according to the ratio of $N_i$ to $N_j$, as follows:

$$M_{ij} = \begin{cases} 1, & \text{if } N_i \le N_j \\ \left(\dfrac{N_j}{N_i}\right)^{p}, & \text{if } N_i > N_j \end{cases}$$
When class $i$ is more common than class $j$, Seesaw Loss reduces the penalty imposed on class $j$ by samples of class $i$ by a factor of $(N_j/N_i)^p$; otherwise, it maintains the penalty on the negative class to limit misclassification. The exponent $p$ is a hyperparameter that adjusts the degree of mitigation. It is worth noting that Seesaw Loss accumulates instance counts during training rather than obtaining statistics from the entire dataset in advance. This strategy has two benefits. First, it can be applied when the distribution of the whole training set is unavailable, for example when training samples arrive from a data stream. Second, the training samples of each class may be affected by the data sampling method used [17], and online accumulation is robust to the sampling method. During training, the mitigation factor is uniformly initialized and smoothly updated to approximate the true data distribution.
Although the mitigation factor effectively balances the gradients of head and tail classes, it may lead to more false positives because the penalty on tail classes is reduced, and adjusting $p$ in $M_{ij}$ alone cannot eliminate false positives because it applies to an entire class. The compensation factor therefore focuses on misclassified samples rather than whole classes. As shown in Figure 3, when misclassification occurs, i.e., the predicted probability of a negative label $j$ exceeds that of class $i$, this factor compensates for the reduced gradient. The compensation factor $C_{ij}$ is calculated as follows:
$$C_{ij} = \begin{cases} 1, & \text{if } \sigma_j \le \sigma_i \\ \left(\dfrac{\sigma_j}{\sigma_i}\right)^{q}, & \text{if } \sigma_j > \sigma_i \end{cases}$$
For a training sample with positive label $i$, if the predicted probability of any negative class $j$ exceeds that of class $i$, the compensation factor increases the penalty on class $j$ by a factor of $(\sigma_j/\sigma_i)^q$, where $q$ is a hyperparameter controlling the scale. Otherwise, $C_{ij} = 1$ and only the mitigation factor $M_{ij}$ takes effect.
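For illustration, the following PyTorch sketch implements the loss as described, folding $\log S_{ij}$ into the negative-class logits so that a standard cross-entropy reproduces $\hat{\sigma}_i$. The defaults $p = 0.8$ and $q = 2.0$ follow the original Seesaw Loss paper; the surrounding structure is our own simplification, not the authors' code.

```python
import torch
import torch.nn.functional as F

class SeesawLossSketch(torch.nn.Module):
    def __init__(self, num_classes: int, p: float = 0.8, q: float = 2.0):
        super().__init__()
        self.p, self.q = p, q
        # Online accumulation of per-class instance counts N_i.
        self.register_buffer("counts", torch.ones(num_classes))

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        num_classes = logits.size(1)
        onehot = F.one_hot(targets, num_classes).float()
        self.counts += onehot.sum(dim=0)  # accumulate N_i during training

        # Mitigation factor: M_ij = (N_j / N_i)^p when N_i > N_j, else 1.
        ratio = self.counts[None, :] / self.counts[:, None]  # ratio[i, j] = N_j / N_i
        M = ratio.clamp(max=1.0) ** self.p

        # Compensation factor: C_ij = (sigma_j / sigma_i)^q when sigma_j > sigma_i.
        probs = logits.softmax(dim=1).detach()
        sigma_i = probs.gather(1, targets[:, None])           # (B, 1)
        C = (probs / sigma_i).clamp(min=1.0) ** self.q        # (B, C)

        S = M[targets] * C  # per-sample S_ij, with i the positive label
        # Rescale negative-class logits by log(S_ij); the positive logit is
        # untouched, so cross-entropy yields -log(sigma_hat_i).
        negatives = 1.0 - onehot
        seesaw_logits = logits + (S + 1e-12).log() * negatives
        return F.cross_entropy(seesaw_logits, targets)
```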
In summary, in addressing the long-tail problem of underwater dam crack image datasets, this study employs Seesaw Loss, an innovative loss function that dynamically adjusts loss weights to solve the problem of sample imbalance. Seesaw Loss combines mitigation and compensation factors, focusing not only on the overall class weight balance but also specifically enhancing the correction for misclassified samples. This strategy optimizes the training process on two levels: on the one hand, it is suitable when the distribution of the entire training set is unknown or dynamically changing, making the model more adaptable; on the other hand, it maintains robustness to data distribution biases produced by different sampling methods.

3.4. Adaptive Frequency Filtering Token Mixer

The Adaptive Frequency Filtering Token Mixer [18] (AFF Token Mixer) serves as a powerful tool for enhancing classification accuracy in underwater dam crack image classification. In specific image recognition tasks, such as underwater dam crack detection, this method can efficiently mix global features in the frequency domain to capture subtle crack characteristics that traditional methods may overlook. Crack identification typically requires capturing both global features and detailed information from the image, and the AFF Token Mixer provides exactly this capability. Inspired by this module, the AFF Token Mixer is applied before each downsampling operation in VanillaNet, allowing each downsampled feature map to retain more effective global and local information.
The AFF Token Mixer employs Fourier transforms to project latent representations into the frequency domain, where semantic-adaptive frequency filtering is conducted through element-wise multiplication. Mathematically, this process is equivalent to token mixing with a dynamic convolution kernel whose spatial extent matches the resolution of the original latent representation. By exploiting frequency-domain deep learning, the AFF Token Mixer addresses the computational demands of deploying visual transformers, large-kernel convolutional neural networks (CNNs), and multilayer perceptrons (MLPs) on mobile devices.
The mainstream neural network operations, namely, CNNs, Transformers, and MLPs, each have their methods of token mixing. CNNs mix tokens by learning convolutional kernel weights, where the size of the spatial kernel determines the mixing range. Typically, these weights are fixed and the range is often local. Transformers mix tokens by pairwise correlations between query tokens and key tokens within a local or global scope. While these weights are semantically adaptive, the computational complexity makes them very costly in terms of computation. MLPs typically use fixed weights to mix tokens within manually designed scopes [19,20,21], but these weights lack good semantic adaptability. The AFF Token Mixer aims to provide a universal token mixer for lightweight neural networks, with three advantages: computational efficiency, semantic adaptability, and effectiveness on a global scale.
Specifically, the AFF Token Mixer can transform underwater images into the frequency domain, where global and local features are represented in a more distinguishable manner. In this way, the model can learn subtle differences reflecting crack features at different frequencies, thereby more accurately locating and classifying cracks in the original image space. Figure 4 shows the network structure of the adaptive frequency filtering token mixer.
In deep learning architectures, a commonly used mathematical tool is the convolution theorem of the Fourier transform. It states that convolution in one domain is mathematically equivalent to element-wise multiplication in the frequency domain after Fourier transformation. To create a lightweight and fast architecture, the fast Fourier transform (FFT) is employed to obtain the frequency-domain representation of features. Specifically, given features $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, this feature set can be viewed as a series of tokens in latent space. Via the FFT, the feature $X$ is represented in the frequency domain as $X_F$, given by the Fourier transform formula

$$X_F(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X(h, w)\, e^{-2\pi i \left( \frac{uh}{H} + \frac{vw}{W} \right)}$$

Features of $X_F$ at different spatial positions correspond to different frequency components of $X$; each component aggregates global information from $X$, and the FFT computes them with a complexity of $O(N \log N)$.
Next, applying the convolution theorem above, the frequency representation $X_F$ of $X$ is filtered with a learnable, instance-specific mask, enabling efficient global token mixing. An inverse FFT then transforms the filtered $X_F$ back to a feature representation in the original latent space. This process can be expressed as

$$\hat{X} = \mathcal{F}^{-1}\left[ \mathcal{M}(\mathcal{F}(X)) \odot \mathcal{F}(X) \right]$$

Here, $\mathcal{M}(\mathcal{F}(X))$ is the mask tensor learned from $X_F$, with the same size and shape as $X_F$. As illustrated in Figure 4, to keep the network as lightweight as possible, $\mathcal{M}(\cdot)$ is implemented as an efficient 1 × 1 convolution (linear) layer, followed by a ReLU activation function and another linear layer. $\odot$ denotes the Hadamard product (element-wise multiplication), and $\mathcal{F}^{-1}(\cdot)$ is the inverse Fourier transform. $\hat{X}$ can be regarded as the result of globally and adaptively mixing the tokens of $X$, which is mathematically equivalent to token mixing with a large dynamic convolution kernel.
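A hedged sketch of this mixer using torch.fft is given below. The mask network follows the two-layer pointwise design described above; stacking the real and imaginary parts of the spectrum along the channel axis is one plausible way to handle the complex values and is not necessarily identical to the published module.

```python
import torch
import torch.nn as nn

class AFFTokenMixerSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # M(.): 1 x 1 conv -> ReLU -> 1 x 1 conv, producing one complex
        # mask value per frequency bin (real and imaginary channels).
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        x_f = torch.fft.rfft2(x, norm="ortho")            # complex, (B, C, H, W//2+1)
        z = torch.cat([x_f.real, x_f.imag], dim=1)        # real-valued view for M(.)
        m_real, m_imag = self.mask_net(z).chunk(2, dim=1)
        mask = torch.complex(m_real, m_imag)
        # X_hat = F^{-1}[ M(F(X)) ⊙ F(X) ]: element-wise filtering in the
        # frequency domain, then back to the spatial domain.
        return torch.fft.irfft2(mask * x_f, s=x.shape[-2:], norm="ortho")
```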

3.5. Label Smoothing

Label smoothing [22] is a technique that improves the generalization performance of deep learning models by introducing a certain “fuzziness” into the labels. In traditional training without label smoothing, one-hot encoding is commonly used, where the correct class is assigned a probability of 1, and all other classes have a probability of 0. This can lead to overfitting of the model to the training data, as the model may try to predict the correct class with very high confidence during training, thus ignoring samples that are not as certain.
Label smoothing addresses this by softening the hard target labels so that the model does not become overly confident. Specifically, as shown in Table 2, for a problem with $K$ classes, instead of assigning a probability of 1 to the correct class, it assigns a slightly smaller probability, $1 - \xi$, while giving a small positive probability, $\xi/(K-1)$, to each of the other classes. This adjustment encourages the model to be more conservative in its predictions without completely ignoring any class.
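A small sketch of this scheme follows; the helper names are ours. Note that PyTorch's built-in nn.CrossEntropyLoss(label_smoothing=...) implements a closely related variant that spreads $\xi/K$ over all classes, including the true one, rather than $\xi/(K-1)$ over the others.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels: torch.Tensor, num_classes: int, xi: float = 0.1) -> torch.Tensor:
    # 1 - xi on the true class and xi / (K - 1) on each remaining class,
    # matching Table 2 (K = 6, xi = 0.1 gives 0.9 and 0.02).
    targets = torch.full((labels.size(0), num_classes), xi / (num_classes - 1))
    targets.scatter_(1, labels[:, None], 1.0 - xi)
    return targets

def label_smoothing_ce(logits: torch.Tensor, labels: torch.Tensor, xi: float = 0.1) -> torch.Tensor:
    targets = smoothed_targets(labels, logits.size(1), xi).to(logits.device)
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Example with the six crack classes (K = 6):
logits = torch.randn(4, 6)
labels = torch.tensor([0, 2, 5, 1])
loss = label_smoothing_ce(logits, labels)
```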
Another advantage of label smoothing is that it helps the model learn smoother probability distributions. Without label smoothing, the model may make extreme predictions for certain indistinguishable samples. By introducing label smoothing, the model can pay more attention to other classes in these situations, aiding the model in recognizing and learning the potential ambiguity of these samples.

4. Experimental Environment and Dataset

4.1. Experimental Environment

For the experiments, the following computer configuration was used: an Intel Core i5-10400F CPU, an Nvidia GeForce RTX 3090 graphics card, and 32 GB of DDR4 3200 MHz RAM. The operating system was Ubuntu 22.04. The deep learning framework utilized was PyTorch version 1.10.0, and all programming was conducted in a Python 3.8 environment. The deep learning parameters used for the experiments in this study are detailed in Table 3.
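Assuming that the learning rate adjustment multiplier and step in Table 3 describe a step-decay schedule (our interpretation; the paper does not spell this out), the training setup could be configured as follows:

```python
import torch

model = torch.nn.Linear(10, 6)  # placeholder standing in for the improved VanillaNet
optimizer = torch.optim.SGD(model.parameters(), lr=2e-4,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 0.3 every 5 epochs (interpreting Table 3).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)

for epoch in range(200):  # "number of iterations": 200
    ...  # one training pass over the dataset with batch size 4
    scheduler.step()
```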

4.2. Dataset

Dams composed of concrete will inevitably develop cracks on their structural surfaces over time due to extended service life and limitations in construction techniques. These cracks present a variety of forms and can generally be categorized into six types, as shown in Figure 5: web cracks, bursting cracks, irregular short cracks, vertical cracks, transverse cracks, and tilt cracks. Web cracks usually appear on the surface of concrete and form a complex web pattern, caused mainly by shrinkage of the surface layer. Bursting cracks are sudden cracks in the concrete surface caused by internal stress concentrations or material defects. Irregular short cracks, as their name suggests, are cracks of varying lengths and random orientations that may be caused by excessive localized loads or stress concentrations at member joints. Vertical cracks arise because concrete shrinks during hardening and drying while being restrained by the structure. Transverse cracks are easily generated by concentrated external loads (e.g., water pressure, earth pressure) or internal stresses within the structure, especially in the middle or near the bottom of the dam. Tilt cracks run diagonally through the concrete structure and are usually caused by oblique tensile stresses. The presence of these cracks not only affects the appearance of the dam but may also threaten its structural safety, requiring appropriate repair measures depending on the type and severity of the cracks.
In the experimental section of this study, we constructed a dataset specifically for underwater dam crack detection. The images in this dataset were captured directly by underwater cameras in real dam environments, ensuring the authenticity and practicality of the data. From a large collection of video footage, 250 representative original underwater dam crack images were selected using frame-by-frame extraction. To train the crack detection model accurately, these images were subsequently imported into the Labelme annotation tool, where each frame underwent precise manual annotation, ensuring the accuracy and consistency of the annotations. To augment the dataset and enhance the model’s generalization ability, a series of data augmentation operations were applied to these annotated original images. Specifically, transformations such as rotation, cropping, and horizontal flipping were performed to simulate possible appearances of underwater dam cracks at different angles and scales, thereby increasing the diversity of the data. Through these carefully designed data augmentation strategies, the dataset ultimately comprised 1200 high-quality underwater crack images with a resolution of 448 × 448 pixels.
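As an illustration, such a pipeline could be written with torchvision; the rotation angle and crop scale below are assumptions, since the paper does not state the exact parameters.

```python
from torchvision import transforms

# Rotation, cropping, and horizontal flipping, as described above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                # assumed angle range
    transforms.RandomResizedCrop(448, scale=(0.8, 1.0)),  # crop back to 448 x 448
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```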
To achieve clearer visual recognition of dam cracks, the dataset adopted the binarization step in image processing, which converts complex and variable underwater environment images into concise black-and-white binary images. As shown in Figure 6, this is the result of binarizing a web crack image. In these processed images, cracks are presented as high-contrast white pixels, while the surrounding background remains pure black. Such high contrast greatly simplifies the process of distinguishing cracks from the background. This approach allows for more accurate extraction of key information such as crack size and shape from the images, providing important reference data for assessing the structural integrity and extent of damage to dams.
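One way to reproduce such a binarization step with OpenCV is sketched below; Otsu thresholding and the inversion are assumptions, since the paper does not name its exact thresholding method, and the file names are hypothetical.

```python
import cv2

gray = cv2.imread("crack.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
# Cracks are darker than the surrounding concrete, so an inverted threshold
# renders them as white pixels on a black background.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cv2.imwrite("crack_binary.png", binary)
```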
The original dataset consists of 1200 binarized crack images enhanced by techniques such as rotation and flipping. In view of the need for large-scale samples in deep learning, this study also extends the training set by selecting labeled images from a ground crack dataset whose texture properties are similar to those of the underwater crack images. After screening and enhancement, 5000 binary crack images with a resolution of 448 × 448 are finally obtained for model training. In the experiments, each category is divided into training, test, and validation sets in a ratio of approximately 8:1:1; the distribution of the number of images per category is shown in Table 4. The crack images are categorized into six types: web cracks, bursting cracks, irregular short cracks, vertical cracks, transverse cracks, and tilt cracks, with corresponding category labels 0, 1, 2, 3, 4, and 5 for training convenience.

5. Experimental Results and Analysis

5.1. Ablation Experiment

To validate the improvements proposed in this paper (Seesaw Loss, the AFF Token Mixer, and label smoothing) and identify their specific impact on performance, a series of ablation experiments is first conducted to evaluate each improvement individually on the classification task. By comparing the per-category classification accuracy before and after each change, as well as the overall average accuracy, the utility of each improvement is analyzed quantitatively. The experiments also test the combined effect of these improvements on overall model performance, verifying whether they further enhance robustness and classification efficiency when used together. Through these comparisons, this study aims to validate the effectiveness of each technical improvement and ensure that the proposed algorithm achieves better performance in practical applications.
The classification performance of the model is judged by the accuracy rate, calculated as

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Number of Instances}}$$
where true positives refers to the number of instances correctly predicted as positive classes and true negatives refers to the number of instances correctly predicted as negative classes.
Looking at Table 5, the models using the Seesaw Loss technique achieved accuracy improvements in the six categories of web cracks, bursting cracks, irregular short cracks, vertical cracks, transverse cracks, and tilt cracks. The accuracy increased from 86.15%, 86.57%, 89.17%, 90.48%, 89.25%, and 86.94% to 87.36%, 87.81%, 90.35%, 90.71%, 90.14%, and 88.17%, respectively. Particularly noteworthy is the significant accuracy improvement contributed by Seesaw Loss, especially in handling web cracks and bursting cracks, which are more complex and have fewer samples. This further confirms its advantages in addressing the long-tail problem.
Next, after incorporating the AFF Token Mixer on top of Seesaw Loss, the model's classification performance improved further. The AFF Token Mixer strengthens the model's overall understanding of crack features by optimizing the relationships and information flow between features. The results show that adding the AFF Token Mixer further increased classification accuracy, most notably for irregular short cracks and transverse cracks, which rose from 90.35% and 90.14% to 91.48% and 91.94%, respectively. This confirms its significant effect on learning global and local features.
Finally, the proposed complete model, which combines Seesaw Loss and the AFF Token Mixer, further improved its performance by incorporating the label smoothing optimization measure. In terms of classification accuracy across all categories, the proposed model outperformed configurations using Seesaw Loss or the AFF Token Mixer alone, achieving an overall average accuracy of 90.75%. This achievement highlights the positive contributions of each component in the comprehensive synergy of the model, validating the effectiveness and superiority of the proposed algorithm for underwater dam crack image classification tasks.

5.2. Contrast Experiment

In the comparative experiments section, the performance of the proposed model in underwater dam crack classification tasks is validated by comparing it with current mainstream image classification models. These comparative models include ResNet, ShuffleNetV2 [23], ConvNeXtV2 [24], and RepVGG [25]. These models are widely adopted and have excellent performance in the field. The purpose of the comparative experiments is to highlight the advantages of the proposed model in terms of classification accuracy, generalization ability, and precision in recognizing various categories.
The confusion matrix clearly demonstrates the classification performance of the proposed model on each category in the dataset, as shown in Figure 7. The classification accuracy of every category exceeds 88%, and the accuracies for irregular short cracks, vertical cracks, transverse cracks, and tilt cracks exceed 90%, indicating that the model produces very few false positives among samples predicted as positive. Web cracks and bursting cracks, which did not exceed 90%, were mainly affected by the class imbalance in the data: the model is biased toward categories with more samples and achieves lower accuracy on categories with fewer samples. The values in the off-diagonal regions of the confusion matrix are small, indicating a low misclassification rate, and the dark shading along the diagonal shows that the model is robust on correctly classified samples.
The results in Table 6 show that the proposed algorithm performs excellently across the different crack recognition tasks. For the six categories (web cracks, bursting cracks, irregular short cracks, vertical cracks, transverse cracks, and tilt cracks), the proposed model achieves accuracies of 88.37%, 89.52%, 91.89%, 92.53%, 92.05%, and 90.13%, respectively, leading the compared models in five of the six categories. In particular, on vertical cracks, one of the harder types to recognize, its accuracy reaches 92.53%, and its overall average accuracy of 90.75% is the highest among all models.
In comparison, RepVGG performs closest to the proposed model in these tasks, achieving good results on bursting cracks and vertical cracks in particular, but it still falls slightly behind in every category. Although ShuffleNet V2 and ConvNeXt V2 perform well in certain categories, with ConvNeXt V2 even edging ahead on tilt cracks, they cannot match the proposed algorithm in overall average accuracy, and ResNet trails all of the compared networks. In the complex underwater environment, these models still fall short of the demand for high-accuracy crack classification.
Overall, the model proposed in this paper not only outperforms the comparison model in terms of classification accuracy, but also excels in comprehensive performance. This fully demonstrates the practicality and effectiveness of the algorithm proposed in this paper in the task of underwater dam crack classification.

6. Conclusions

The focus of this paper is to address the insufficient accuracy of the VanillaNet algorithm in classifying dam cracks in complex underwater environments. Key techniques, namely Seesaw Loss, the AFF Token Mixer, and label smoothing, are introduced into the VanillaNet architecture to enhance classification performance and generalization on imbalanced data. Seesaw Loss, a novel loss function, effectively solves the long-tail problem by dynamically adjusting the weighting of positive and negative samples, improving the recognition rate on small-sample categories. The AFF Token Mixer enhances the representation of complex crack image features by adaptively filtering and mixing features in the frequency domain. The label smoothing technique smooths the model's decision boundary, avoids over-sensitivity to noisy labels, and improves robustness. A series of experiments shows that each improvement significantly raises performance on all types of crack identification; the proposed method shows particular advantages on irregular and fine crack images. Further comparison experiments show that the improved VanillaNet outperforms the original model and other mainstream classification algorithms across performance metrics, confirming the advancement and practicality of the proposed algorithm. In underwater dam monitoring, crack classification algorithms help prevent rupture or seepage accidents by identifying and classifying cracks in the dam structure so that potential structural problems are detected in time; depending on the type and severity of the cracks, a reasonable maintenance and repair program can then be developed to extend the life of the dam and save maintenance costs. Future research can further optimize the model to improve its accuracy, and multimodal fusion combining sonar, optical, and infrared images can improve the accuracy and robustness of crack detection.

Author Contributions

Conceptualization, S.Z. and X.L.; Methodology, S.Z., G.W. and S.S.; Validation, X.L., G.W. and S.S.; Formal analysis, X.L. and G.W.; Resources, P.S.; Writing—original draft, S.Z. and S.S.; Writing—review & editing, S.Z., X.L. and H.W.; Visualization, S.S.; Supervision, P.S.; Funding acquisition, P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China (Grant No. 2022YFB4703400), the Jiangsu Province Natural Science Foundation (Grant No. BK20231186), the Open Research Fund of Hubei Technology Innovation Center for Smart Hydropower (Grant No. 1523020038), and the Changzhou Sci&Tech Program (Grant No. CE20235053).

Data Availability Statement

The data supporting the results of this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
3. Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12104–12113.
4. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
5. Wang, H.; Ma, S.; Dong, L.; Huang, S.; Zhang, D.; Wei, F. DeepNet: Scaling transformers to 1000 layers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–14.
6. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
7. Lian, D.; Yu, Z.; Sun, X.; Gao, S. AS-MLP: An axial shifted MLP architecture for vision. arXiv 2021, arXiv:2107.08391.
8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
10. Chen, H.; Wang, Y.; Guo, J.; Tao, D. VanillaNet: The power of minimalism in deep learning. arXiv 2023, arXiv:2305.12972.
11. Lu, D.; Weng, Q. A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 2007, 28, 823–870.
12. Ruan, F.; Dang, L.; Ge, Q.; Zhang, Q.; Qiao, B.; Zuo, X. Dual-path residual "shrinkage" network for side-scan sonar image classification. Comput. Intell. Neurosci. 2022, 2022, 6962838.
13. Cheng, Z.; Huo, G.; Li, H. A multi-domain collaborative transfer learning method with multi-scale repeated attention mechanism for underwater side-scan sonar image classification. Remote Sens. 2022, 14, 355.
14. Song, Y.; He, B.; Liu, P. Real-time object detection for AUVs using self-cascaded convolutional neural networks. IEEE J. Ocean. Eng. 2019, 46, 56–67.
15. Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Chen, K.; Liu, Z.; Loy, C.C.; Lin, D. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704.
16. Zhang, Z.; Fan, X.; Xie, Y.; Xu, H. An edge detection method based artificial bee colony for underwater dam crack image. In Proceedings of the Biomedical Imaging and Sensing Conference, Yokohama, Japan, 25–27 April 2018; Volume 10711, pp. 199–202.
17. Zawad, M.R.S.; Zawad, M.F.S.; Rahman, M.A.; Priyom, S.N. A comparative review of image processing based crack detection techniques on civil engineering structures. J. Soft Comput. Civ. Eng. 2021, 5, 58–74.
18. Huang, Z.; Zhang, Z.; Lan, C.; Zha, Z.J.; Lu, Y.; Guo, B. Adaptive frequency filters as efficient global token mixers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6049–6059.
19. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
20. Liu, R.; Lehman, J.; Molino, P.; Petroski Such, F.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the CoordConv solution. arXiv 2018, arXiv:1807.03247.
21. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
22. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? arXiv 2019, arXiv:1906.02629.
23. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
24. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142.
25. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742.
Figure 1. The network structure of VanillaNet.
Figure 2. Difference between the cross-entropy loss function and Seesaw Loss for the long-tail problem.
Figure 3. Calculation of the mitigation and compensation factors.
Figure 4. Network structure of the Adaptive Frequency Filtering Token Mixer.
Figure 5. Pictures of the different classes of dam cracks.
Figure 6. Image of web cracks after binarization.
Figure 7. Confusion matrix.
Table 1. Detailed architecture specifications.

| Stage | Input | VanillaNet-5 | VanillaNet-6 | VanillaNet-7/8/9/10/11/12/13 |
|---|---|---|---|---|
| stem | 224 × 224 | 4 × 4, 512, stride 4 | 4 × 4, 512, stride 4 | 4 × 4, 512, stride 4 |
| stage 1 | 56 × 56 | [1 × 1, 1024] × 1, MaxPool 2 × 2 | [1 × 1, 1024] × 1, MaxPool 2 × 2 | [1 × 1, 1024] × 1, MaxPool 2 × 2 |
| stage 2 | 28 × 28 | [1 × 1, 2048] × 1, MaxPool 2 × 2 | [1 × 1, 2048] × 1, MaxPool 2 × 2 | [1 × 1, 2048] × 1, MaxPool 2 × 2 |
| stage 3 | 14 × 14 | [1 × 1, 4096] × 1, MaxPool 2 × 2 | [1 × 1, 4096] × 1, MaxPool 2 × 2 | [1 × 1, 4096] × 1/2/3/4/5/6/7, MaxPool 2 × 2 |
| stage 4 | 7 × 7 | – | [1 × 1, 4096] × 1 | [1 × 1, 4096] × 1 |
| classifier | 7 × 7 | AvgPool 7 × 7; 1 × 1, 1000 | AvgPool 7 × 7; 1 × 1, 1000 | AvgPool 7 × 7; 1 × 1, 1000 |
Table 2. Difference between one-hot and label smoothing in label coding.

| Label Processing | Web Cracks | Bursting Cracks | Irregular Short Cracks | Vertical Cracks | Transverse Cracks | Tilt Cracks |
|---|---|---|---|---|---|---|
| One-hot encoding | 1 | 0 | 0 | 0 | 0 | 0 |
| Label smoothing | 0.9 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
Table 3. The various parameters of the deep learning system.

| Network Parameter | Value |
|---|---|
| Initial learning rate | 0.0002 |
| Momentum | 0.9 |
| Weight decay | 0.0001 |
| Optimizer type | SGD |
| Learning rate adjustment multiplier | 0.3 |
| Array step | 5 |
| Batch size | 4 |
| Number of iterations | 200 |
Table 4. Distribution of the number of images per category in the dataset used in this study.

| Class | Web Cracks | Bursting Cracks | Irregular Short Cracks | Vertical Cracks | Transverse Cracks | Tilt Cracks |
|---|---|---|---|---|---|---|
| Number | 185 | 245 | 1167 | 1835 | 1226 | 342 |
Table 5. Comparison of accuracy (%) in the ablation experiments of this study.

| Algorithm | Web Cracks | Bursting Cracks | Irregular Short Cracks | Vertical Cracks | Transverse Cracks | Tilt Cracks | Average |
|---|---|---|---|---|---|---|---|
| Baseline | 86.15 | 86.57 | 89.17 | 90.48 | 89.25 | 86.94 | 88.09 |
| Seesaw Loss | 87.36 | 87.81 | 90.35 | 90.71 | 90.14 | 88.17 | 89.10 |
| SL + ATM | 88.14 | 89.26 | 91.48 | 92.05 | 91.94 | 89.74 | 90.44 |
| Proposed | 88.37 | 89.52 | 91.89 | 92.53 | 92.05 | 90.13 | 90.75 |

SL + ATM in the table stands for Seesaw Loss + AFF Token Mixer.
Table 6. The accuracy results (%) of the comparison experiments.

| Algorithm | Web Cracks | Bursting Cracks | Irregular Short Cracks | Vertical Cracks | Transverse Cracks | Tilt Cracks | Average |
|---|---|---|---|---|---|---|---|
| ResNet | 85.12 | 85.35 | 89.84 | 89.13 | 88.43 | 89.74 | 87.94 |
| ShuffleNet V2 | 86.93 | 87.42 | 90.11 | 89.49 | 90.02 | 88.16 | 88.69 |
| ConvNeXt V2 | 87.37 | 87.16 | 88.74 | 91.58 | 91.39 | 90.53 | 89.46 |
| RepVGG | 87.95 | 88.91 | 91.23 | 91.79 | 91.04 | 89.73 | 90.11 |
| Proposed | 88.37 | 89.52 | 91.89 | 92.53 | 92.05 | 90.13 | 90.75 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
