Article

Variational Color Shift and Auto-Encoder Based on Large Separable Kernel Attention for Enhanced Text CAPTCHA Vulnerability Assessment

by Xing Wan 1,2,*, Juliana Johari 2 and Fazlina Ahmat Ruslan 2,*

1 School of Intelligent Manufacturing, Leshan Vocational and Technical College, Leshan 614000, China
2 School of Electrical Engineering, Universiti Teknologi MARA (UiTM), Shah Alam 40450, Malaysia
* Authors to whom correspondence should be addressed.
Information 2024, 15(11), 717; https://doi.org/10.3390/info15110717
Submission received: 5 October 2024 / Revised: 1 November 2024 / Accepted: 6 November 2024 / Published: 7 November 2024
(This article belongs to the Special Issue Computer Vision for Security Applications)

Abstract

Text CAPTCHAs are crucial security measures deployed on websites worldwide to deter unauthorized intrusions. Although CAPTCHA recognition is an effective method for assessing their security, the anti-attack features built into text CAPTCHAs limit how effectively they can be evaluated. This study introduces a novel color augmentation technique called Variational Color Shift (VCS) to boost the recognition accuracy of different networks. VCS predicts a color shift range for every input image and then resamples the image within that range to generate a new image, thus expanding the original dataset and improving training effectiveness. In contrast to Random Color Shift (RCS), which treats the color offsets as hyperparameters, VCS estimates color shifts by reparametrizing points sampled from a uniform distribution using offsets predicted for each image, which makes the color shifts learnable. To better balance computation and performance, we also propose two variants of VCS: Sim-VCS and Dilated-VCS. In addition, to address the overfitting caused by the disturbances in text CAPTCHAs, we propose an Auto-Encoder (AE) based on Large Separable Kernel Attention (AE-LSKA) to replace the convolutional modules with large kernels in the text CAPTCHA recognizer. This new module employs an AE to compress the interference while expanding the receptive field using Large Separable Kernel Attention (LSKA), reducing the impact of local interference on model training and improving the overall perception of characters. The experimental results show that the recognition accuracy of the model after integrating the AE-LSKA module is improved by at least 15 percentage points on both the M-CAPTCHA and P-CAPTCHA datasets. The results also demonstrate that color augmentation using VCS is more effective at enhancing recognition, achieving higher accuracy than RCS and PCA Color Shift (PCA-CS).

1. Introduction

With the rapid development of the Internet, companies and individuals can freely use it for business, communication, and entertainment, resulting in significant lifestyle changes [1]. While cyberspace provides convenience, it also carries security risks. CAPTCHA is a protection mechanism proposed to enhance network security. At the beginning of the 21st century, Luis von Ahn and colleagues proposed CAPTCHA [2]. As a widely used public security program, its main function is to distinguish humans from machines, so the CAPTCHA mechanism blocks access by malicious bot programs. When designing a CAPTCHA security mechanism, it is considered safe if the success rate of computer cracking is less than 10% [3]. As the first line of defense for an information system, CAPTCHA provides the most direct and effective security guarantee for various systems. Today, most websites worldwide use CAPTCHAs to defend against network attacks and web spiders. It is crucial to enhance the security of CAPTCHAs while maintaining their user-friendliness [4]. In recent years, deep learning technology has provided many ideas and technical frameworks for the automatic attack and protection of CAPTCHAs [5]. Neural networks can automatically extract the most important features, thus avoiding the manual design of image operators. However, as the security design of CAPTCHAs continues to advance, higher demands are placed on the capabilities of CAPTCHA security evaluators, which must be continuously improved to effectively assess the security of CAPTCHAs.
There are about ten categories of CAPTCHAs, and the most common type is the text-based CAPTCHA [6]. These CAPTCHAs present distorted text with varying backgrounds and noise for the user to recognize. Text-based CAPTCHA, the earliest CAPTCHA method, is simple to generate and deploy. Because of its small storage footprint and rapid loading speed, many websites have gradually adopted it, making it the most widely used CAPTCHA mechanism [7]. Despite its widespread use, the security of text-based CAPTCHA images has gradually decreased with the development of optical character recognition (OCR) [8]. To increase recognition difficulty, the fonts, colors, shades, positions, backgrounds, and angles are varied, and noise lines and points, distortions, borders, and overlaps are incorporated [9].
Several other image CAPTCHA systems serve as alternatives to text-based CAPTCHAs, including sliding verification codes, click-based CAPTCHAs, drag-and-drop CAPTCHAs, selection-based CAPTCHAs, drawing-based CAPTCHAs, and interactive CAPTCHAs [10]. Typically, users must move or drag these CAPTCHAs with the mouse, then click or place them based on the provided hint. These mechanisms are more difficult to break since attacking them requires not only classification but also localization. Therefore, object detection models such as YOLOX [11], YOLOv6 [12], and YOLOv7 [13] are often adopted. As one-stage detectors, the YOLO series offers fast detection, high precision, and robustness. Like text CAPTCHAs, these CAPTCHAs embed many anti-attack mechanisms, making it difficult for attackers to improve identification performance.
The best way to find the security vulnerabilities of text CAPTCHAs is to evaluate them with different attacking methods [14]. To improve the recognition accuracy of models, one useful approach is to collect as many CAPTCHAs as possible from the target websites. However, many websites have implemented anti-spider systems that restrict access. As a result, collecting enough CAPTCHAs from the target website is time-consuming and often unrealistic. Many researchers have therefore developed data augmentation methods to increase data volume and diversity based on the few images collected from websites. Common data augmentation methods include image clipping, image combination, random resizing, rotation, flipping, and color shift [15]. Among them, color shift is commonly used in CAPTCHA recognition tasks, as many data augmentation methods that are widely used in classification tasks are unsuitable for CAPTCHA breaking due to the interfering designs of CAPTCHAs [16].
This research presents novel approaches to color shift methodologies by enhancing existing Random Color Shift algorithms and systematically evaluating their performances across a variety of datasets and models. Furthermore, we introduce the AE-LSKA framework, designed to identify characters within images from a more global perspective, effectively minimizing the disruptive influence of localized interferences found in CAPTCHAs. The study makes three significant contributions. Firstly, we developed the VCS, Sim-VCS, and Dilated-VCS techniques utilizing the variation method, which allows for adjustable upper and lower bounds that adapt to specific input CAPTCHAs. Secondly, we propose an AE-based approach incorporating Large Separable Kernel Attention, which exhibits high suitability and low complexity for CAPTCHA recognition tasks characterized by significant levels of noise and interference. Lastly, comprehensive experiments were conducted on both simple and complex datasets utilizing strong and weak models, evaluating multiple metrics. The results convincingly demonstrate the effectiveness and superiority of the proposed VCS, Sim-VCS, Dilated-VCS, and AE-LSKA methods. These contributions collectively advance the state of CAPTCHA recognition technology by offering more resilient and adaptable solutions.

2. Datasets and Algorithms

To evaluate our proposed methods comprehensively, we adopt two models with different performances validated on two datasets with distinct complexities, as well as our proposed algorithms, as shown in Table 1.
In recent years, several text CAPTCHA-breaking networks have been proposed, such as CapsuleNet [17], transformer-based recognizers [18], and ConvNet [19]. Deep-CAPTCHA is an innovative model introduced in 2020 by Noury et al. and is widely used to break text CAPTCHAs for security assessment [20]. One of the strengths of Deep-CAPTCHA is its ability to adapt to various types of text CAPTCHAs, including those with noise, distortions, and different fonts. The model comprises three convolution layers, one fully connected layer, one flattened layer, one dense layer, and one softmax layer, as shown in Figure 1. The red boxes on the left and right sides of the figure represent the true and predicted probabilities of the letter A, respectively. By computing the binary cross-entropy (BCE) loss between the two probabilities, the gradient for backpropagation is obtained.
Adaptive-CAPTCHA is a newer text CAPTCHA breaker built on Deep-CAPTCHA; it was proposed in 2024 and has a stronger ability to recognize text CAPTCHAs with correlations between characters thanks to its Convolutional and Recurrent Neural Network (CRNN) module [21]. In addition, this model can effectively remove interference and noise by introducing an Adaptive Fusion Filtering Network (AFFN). Compared with Deep-CAPTCHA, Adaptive-CAPTCHA can more effectively counter the anti-attack mechanisms embedded in text CAPTCHAs while keeping the number of parameters smaller. The model's detailed architecture is shown in Figure 2.
This study employs two distinct CAPTCHA datasets for experimentation: M-CAPTCHA and P-CAPTCHA. P-CAPTCHA is generated using the Python captcha library with default settings, featuring minimal interference and simpler fonts, as shown in Figure 3, while M-CAPTCHA includes complex, distorted characters with Markov distribution, as shown in Figure 4.
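For reference, P-CAPTCHA-style samples can be generated with the Python captcha library under its default settings, as sketched below; the character set, label length, and output paths are illustrative assumptions rather than the exact generation script used for the dataset.

```python
# Minimal sketch: P-CAPTCHA-style samples via the Python "captcha" library with
# default settings. Character set, label length, and paths are assumptions.
import os
import random
import string

from captcha.image import ImageCaptcha

os.makedirs("p_captcha", exist_ok=True)
generator = ImageCaptcha()                    # default size, fonts, and interference
charset = string.ascii_uppercase + string.digits

for i in range(3000):                         # 3000 samples, matching the dataset size
    label = "".join(random.choices(charset, k=4))
    generator.write(label, os.path.join("p_captcha", f"{label}_{i}.png"))
```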
Recent advancements in color augmentation have spurred improvements in CAPTCHA recognition. PCA-CS, proposed by Krizhevsky et al., augments images by adjusting the RGB channels using eigenvalues [22]. Wang et al. proposed RCS, which adjusts color channels using randomly sampled offsets; however, it struggles with setting optimal hyperparameters, risking either unrealistic image alterations or insufficient augmentation impact [23]. Ishkov et al. further refined this line of work by proposing Adaptive Channel Weights (ACWs), which dynamically modify the importance of the RGB channels using learnable parameters and proved beneficial for enhancing CAPTCHA recognition accuracy [24]. The common problem with these color shift methods is that their offsets are hyperparameters rather than optimal values generated according to the target dataset.
Owing to the interference in text CAPTCHAs, integrating lightweight attention mechanisms can markedly enhance recognition performance. This can be achieved through plug-and-play modules such as the Convolutional Block Attention Module (CBAM) [25], Squeeze-and-Excitation (SE) [26], Shuffle Attention (SA) [27], Efficient Channel Attention (ECA) [28], ParNet [29], and Global Context (GC) [30]. A recent development in the field of attention mechanisms is LSKA [31]. It builds upon Visual Attention Networks (VANs) by leveraging separable convolution kernels to enhance attention efficiency [32]. This module demonstrates performance comparable to standard large kernel attention modules, outperforming Vision Transformers (ViTs) [33] and ConvNeXt [34].
However, the attention mechanism itself can be affected by interference. Therefore, the use of AE to compress data and reduce invalid interfering information represents an effective means of enhancing the attentional mechanism [35].

3. Methods

VCS is a color augmentation method that generates new images by automatically adjusting the image color distribution according to image features. It is an effective way to support vision tasks such as CAPTCHA recognition, where it is challenging to collect sufficient data from websites. Two simplified schemes are also proposed to balance performance and computation. The novelty of VCS lies in the fact that it automatically calculates the optimal color shift instead of relying on hyperparameter settings as in other color augmentation methods. Furthermore, the other proposed lightweight module, AE-LSKA, employs an AE to compress the interference while utilizing LSKA to expand the receptive field, thereby enhancing the abilities of models. Both VCS and AE-LSKA can significantly improve CAPTCHA recognition accuracy.

3.1. VCS

As previously mentioned, determining the upper and lower offsets is crucial for color augmentation. However, it is not straightforward to find the optimal values through human experience. RCS typically sets the range of image pixel value offsets between 10% and 50%, while PCA-CS just performs an overall shift on three color channels of an image. However, adopting fixed values overlooks the variations between different datasets and individual images. For instance, some images have nearly saturated pixel values, which could lead to overflow if the pixel values are increased significantly. Additionally, an excessively large offset might compromise the contextual realism of the CAPTCHA; conversely, a small offset may not yield effective data augmentation. Therefore, it is essential to find offset ranges tailored to the specific characteristics of each CAPTCHA.
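To make the contrast concrete, a fixed-range random color shift in the spirit of RCS can be sketched as below; the exact implementation in [23] may differ, and the multiplicative form and clamping are assumptions.

```python
import torch

def random_color_shift(x: torch.Tensor, max_offset: float = 0.5) -> torch.Tensor:
    """Fixed-range RCS-style augmentation sketch: each RGB channel of an image
    x (shape (3, H, W), values in [0, 1]) is scaled by a factor drawn from
    [1 - max_offset, 1 + max_offset], independent of the image content."""
    scale = 1.0 + (torch.rand(3, 1, 1) * 2.0 - 1.0) * max_offset
    return (x * scale).clamp(0.0, 1.0)
```

Because max_offset is a global hyperparameter, the same range is applied to every image, which is exactly the limitation that VCS removes.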
The proposed VCS generated the upper and lower offsets for each image by extracting its features. It is worth noting that the two offsets are different and were generated separately. We supposed that the input image was $X \in \mathbb{R}^{3 \times H \times W}$ and the output image was $Y \in \mathbb{R}^{3 \times H \times W}$, where 3 represents the number of RGB color channels, $H$ represents the height of the image, and $W$ represents the width of the image. For each offset (lower or upper), we first used three adaptive pooling layers, $\mathrm{AdaptivePool}_{op}: \mathbb{R}^{3 \times H \times W} \rightarrow \mathbb{R}^{3 \times 1 \times 3}$, $op \in \{\max, \min, \mathrm{avg}\}$, to obtain the maximum, minimum, and mean. Next, VCS used the merge function $\mathrm{Concat}: \mathbb{R}^{3 \times 3 \times 1 \times 3} \rightarrow \mathbb{R}^{3 \times 3 \times 3}$ to merge them and then passed the merged tensor to a predictive convolutional layer $F_i: \mathbb{R}^{3 \times 3 \times 3} \rightarrow \mathbb{R}^{3 \times 1 \times 1}$, $i \in \{L, U\}$, with a $3 \times 3$ kernel size to generate $\mathrm{offset}_L \in \mathbb{R}^{3 \times 1 \times 1}$ and $\mathrm{offset}_U \in \mathbb{R}^{3 \times 1 \times 1}$. These two outputs were finally passed through an activation function, $\mathrm{Sigmoid}: \mathbb{R}^{3 \times 1 \times 1} \rightarrow \mathbb{R}^{3 \times 1 \times 1}$, and combined with the unit vector $I \in \mathbb{R}^{3 \times 1 \times 1}$ to obtain the lower and upper limits, as shown in Equations (1) and (2):
$$\mathrm{Lower} = I - \mathrm{Sigmoid}\big(F_L(\mathrm{Concat}(\mathrm{AdaptivePool}_{\{\max,\min,\mathrm{avg}\}}(X)))\big) \quad (1)$$
$$\mathrm{Upper} = I + \mathrm{Sigmoid}\big(F_U(\mathrm{Concat}(\mathrm{AdaptivePool}_{\{\max,\min,\mathrm{avg}\}}(X)))\big) \quad (2)$$
Based on these estimated $\mathrm{Lower}$ and $\mathrm{Upper}$ values, they could be used to reparametrize the image scaling weight $W \in \mathbb{R}^{3 \times H \times W}$ sampled from a uniform distribution to obtain the new weight $W' \in \mathbb{R}^{3 \times H \times W}$, as shown in Equation (3). This approach drew upon the sampling methodology of the Variational Auto-Encoder (VAE) [36]. Eventually, the original image was multiplied by the weighting coefficients to obtain the final generated image, where $\odot$ represents the Hadamard product, and we scaled the sampled values using the upper and lower limits by the broadcasting rule.
$$W' = \mathrm{Lower} + (\mathrm{Upper} - \mathrm{Lower}) \odot W, \qquad W \sim U(0, 1) \quad (3)$$
After obtaining the new weight $W'$, we multiplied it by the input image $X$ to obtain the final new image $Y$, as shown in Equation (4):
$$Y = X \odot W' \quad (4)$$
Figure 5 divides the entire VCS procedure into four stages. In the first stage, three statistics were extracted from the input image using three adaptive pooling layers. Based on these statistics, in the second stage, two predictive convolution layers generated the color offsets for the upper and lower limits. In the third stage, these offsets were used to reparametrize the initial weights sampled from the uniform distribution to obtain the final weights. In the last stage, the weights were used to scale the input image to generate the new image.
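A minimal PyTorch sketch of this four-stage procedure, following Equations (1)-(4) for a single (3, H, W) image, is given below; the handling of the pooled statistics and the weight initialization are assumptions where the text does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VCS(nn.Module):
    """Sketch of Variational Color Shift (Eqs. (1)-(4)) for one (C, H, W) image."""

    def __init__(self, channels: int = 3, k_predict: int = 3):
        super().__init__()
        # Stage 2: two predictive convolutions, one per offset (lower / upper)
        self.f_lower = nn.Conv2d(channels, channels, kernel_size=k_predict)
        self.f_upper = nn.Conv2d(channels, channels, kernel_size=k_predict)

    def _statistics(self, x: torch.Tensor) -> torch.Tensor:
        # Stage 1: adaptive max / min / mean pooled to (C, 1, 3), merged to (C, 3, 3)
        p_max = F.adaptive_max_pool2d(x, (1, 3))
        p_min = -F.adaptive_max_pool2d(-x, (1, 3))      # min pooling via negated max
        p_avg = F.adaptive_avg_pool2d(x, (1, 3))
        return torch.cat([p_max, p_min, p_avg], dim=-2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        stats = self._statistics(x).unsqueeze(0)            # (1, C, 3, 3)
        lower = 1.0 - torch.sigmoid(self.f_lower(stats))    # Eq. (1), (1, C, 1, 1)
        upper = 1.0 + torch.sigmoid(self.f_upper(stats))    # Eq. (2)
        # Stage 3: reparametrize uniform samples with the predicted limits, Eq. (3)
        w = lower.squeeze(0) + (upper - lower).squeeze(0) * torch.rand_like(x)
        # Stage 4: scale the input image, Eq. (4)
        return x * w
```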
For the offset prediction, VCS adopted two predictive convolutions, each with parameters (PARAMs) of $C \cdot C \cdot K_{predict} \cdot K_{predict}$. While the three adaptive pooling layers in the first stage did not contribute to the PARAMs, they did increase the number of floating point operations. If $K_{predict}$ denotes the kernel size of $F_i$, $i \in \{L, U\}$ ($K_{predict} = 3$), and $C$ denotes the number of input and output channels ($C = 3$), the total PARAMs of VCS are as shown in Equation (5). Since two predictive convolutions were used, there is a factor of two in front:
$$\mathrm{PARAMs}_{VCS} = 2 \cdot C^2 \cdot K_{predict}^2 = 162 \quad (5)$$
Floating point operations (FLOPs) were used to evaluate computational complexity. Apart from the convolution and pooling layers, the sigmoid function also contributed to the calculation, with the same dimension as the color offsets. In the convolution layer, each parameter contributed one addition and one multiplication, resulting in a value twice the PARAMs of VCS. The three pooling operations required FLOPs equal to the number of pixels in the CAPTCHA, yielding three times the image resolution. Finally, there were two more sigmoid operations followed by addition; after combining like terms with FLOPs of $3 \cdot H \cdot W$, the total is as shown in Equation (6):
$$\mathrm{FLOPs}_{VCS} = 2 \cdot \mathrm{PARAMs}_{VCS} + 9 \cdot H \cdot W + 2 \cdot C = 330 + 9 \cdot H \cdot W \quad (6)$$

3.2. Sim-VCS

To suit the portion of CAPTCHA recognition networks deployed on mobile devices, we considered further reducing the computational complexity of the model and proposed a simplified version with low computational complexity. In this algorithm, the three pooling layers were replaced with three 3 × 3 constant tensors. Consequently, the upper and lower offsets were independent of the individual input images, thus reducing the amount of computation. However, this implies that each image received the same offsets, derived from the entire dataset, regardless of the individual image's color distribution. Sim-VCS maintained the same PARAMs as VCS because it still retained the two predictive convolutional layers, which ensured that the offsets were learnable, as shown in Equation (7):
$$\mathrm{PARAMs}_{Sim\text{-}VCS} = 2 \cdot C^2 \cdot K_{predict}^2 = 162 \quad (7)$$
The reduction in FLOPs was mainly due to the removal of the three adaptive pooling layers, as the number of operations in pooling is proportional to the size of the input image. Therefore, the FLOPs of Sim-VCS were related only to the PARAMs of the convolution, the activation, and the addition operation, as illustrated in Equation (8). The difference in FLOPs arises because Sim-VCS lacks the pooling operations, whose cost in VCS depends on the dimensions of the input image.
$$\mathrm{FLOPs}_{Sim\text{-}VCS} = 2 \cdot \mathrm{PARAMs}_{Sim\text{-}VCS} + 2 \cdot C + 3 \cdot H \cdot W = 330 + 3 \cdot H \cdot W \quad (8)$$
Figure 6 illustrates the distinction between VCS and Sim-VCS: the former derives its statistics from the input image, while the latter uses a constant.
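Under the same assumptions as the VCS sketch above, Sim-VCS can be expressed by feeding a constant tensor to the predictive convolutions in place of the pooled statistics; the value of the constant (ones) is an assumption.

```python
import torch
import torch.nn as nn

class SimVCS(nn.Module):
    """Sketch of Sim-VCS: the per-image statistics of VCS are replaced by a fixed
    3x3 tensor, so the learned offsets are shared by the whole dataset."""

    def __init__(self, channels: int = 3, k_predict: int = 3):
        super().__init__()
        self.f_lower = nn.Conv2d(channels, channels, kernel_size=k_predict)
        self.f_upper = nn.Conv2d(channels, channels, kernel_size=k_predict)
        # constant tensor standing in for the three adaptive pooling layers
        self.register_buffer("const", torch.ones(1, channels, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lower = 1.0 - torch.sigmoid(self.f_lower(self.const)).squeeze(0)  # (C, 1, 1)
        upper = 1.0 + torch.sigmoid(self.f_upper(self.const)).squeeze(0)
        w = lower + (upper - lower) * torch.rand_like(x)   # same reparametrization as VCS
        return x * w
```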

3.3. Dilated-VCS

As mentioned before, VCS samples the mean, maximum, and minimum statistics of the input image. However, in some conditions, these statistics may not have fully captured the color changes in the image. To obtain more detailed image color distribution, this study proposed a feature extraction scheme based on dilated convolution, as shown in Figure 7. It should be noted that sampling density was inversely proportional to the dilation factor. When the dilation factor was 1, it was a full sampling of the input image. However, due to the large amount of interference in the CAPTCHA image, the higher the sampling density, the higher the risk of overfitting. Therefore, it was necessary to balance sampling density and overfitting, which could be improved by adding a dropout layer.
By constructing a dilated convolution, we could randomly sample the pixel values at certain positions in the image. Dilated convolution provided more detail about the input image than using pooling layers to extract the maximum, minimum, and mean values. The benefit of using dilated convolution instead of regular convolution was that the significant interference present in CAPTCHAs can easily lead to overfitting with standard convolutional sampling, reducing the model's generalization. Let the kernel size of the dilated convolution module, $K_s$, be equal to the stride, $s$, and let $d$ denote the dilation rate. The PARAMs of Dilated-VCS were determined only by $K_s$ and $K_{predict}$, as shown in Equation (9):
$$\mathrm{PARAMs}_{Dilated\text{-}VCS} = 2 \cdot C^2 \cdot K_s^2 + 2 \cdot C^2 \cdot K_{predict}^2 = 18 \cdot K_s^2 + 162 \quad (9)$$
The FLOPs for a convolution operation were twice the number of its kernel parameters. However, the output dimension of the sampling convolution was 3 × 3; therefore, the FLOPs of the sampling convolution layer also had to be multiplied by a factor of 9, as illustrated by Equation (10):
$$\mathrm{FLOPs}_{Dilated\text{-}VCS} = 324 \cdot K_s^2 + 330 + 3 \cdot H \cdot W \quad (10)$$
We supposed that the input image size was (3, 64, 192), and we considered the adaptive pooling to be a special convolution with a kernel size of (64, 64). Table 2 demonstrates that the PARAMs and FLOPs of Dilated-VCS are proportional to the square of the sampling kernel size. Thus, selecting an appropriate kernel size ensured optimal performance for Dilated-VCS without exceeding the FLOPs of VCS. Compared to Dilated-VCS, VCS has fewer PARAMs, with only 162.
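For the 64 × 192 CAPTCHAs considered here, the sampling stage of Dilated-VCS can be sketched with the first configuration of Table 2 (kernel (4, 4), dilation (7, 21), stride (22, 64), padding (1, 0)), which maps the image to a 3 × 3 feature map; the placement of the dropout layer is an assumption.

```python
import torch
import torch.nn as nn

class DilatedVCS(nn.Module):
    """Sketch of Dilated-VCS: a dilated convolution sparsely samples a
    (3, 64, 192) image into a (3, 3, 3) feature map that feeds the same
    predictive convolutions as VCS."""

    def __init__(self, channels: int = 3, dropout: float = 0.3):
        super().__init__()
        self.sampler = nn.Conv2d(channels, channels, kernel_size=(4, 4),
                                 dilation=(7, 21), stride=(22, 64), padding=(1, 0))
        self.dropout = nn.Dropout(dropout)      # counteracts overfitting to CAPTCHA noise
        self.f_lower = nn.Conv2d(channels, channels, kernel_size=3)
        self.f_upper = nn.Conv2d(channels, channels, kernel_size=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.dropout(self.sampler(x.unsqueeze(0)))              # (1, C, 3, 3)
        lower = 1.0 - torch.sigmoid(self.f_lower(feats)).squeeze(0)     # (C, 1, 1)
        upper = 1.0 + torch.sigmoid(self.f_upper(feats)).squeeze(0)
        w = lower + (upper - lower) * torch.rand_like(x)
        return x * w
```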

3.4. AE-LSKA

One of the major difficulties in text CAPTCHA recognition is that the model is highly susceptible to the many disturbances present during training and learns too many invalid local details. In this paper, we embedded AE-LSKA into the existing recognition models in place of the convolution modules with large kernels, so that the models pay more attention to the overall information of the image and the impact of local interference on the recognition results is reduced. LSKA decomposes 2D convolution kernels into cascaded 1D kernels in the horizontal and vertical directions, reducing computational complexity and memory usage while enabling the use of large kernels. This design improves model efficiency and robustness, prioritizing object shape over texture, which is consistent with the requirements of vision tasks.
To expand the receptive field, many architectures integrated larger convolution kernels. We first considered replacing these large kernel convolutions, which typically had a kernel size of 5 × 5, with an AE module consisting of two small 3 × 3 kernel convolutions in series. This structure not only reduced the PARAMs and FLOPs but also compressed the number of intermediate channels, which effectively prevented the forward propagation of interfering information. In addition, the AE was followed by an LSKA module, which enhanced the receptive field and accelerated gradient propagation.
Figure 8 shows that the 5 × 5 convolution in the original architecture was replaced with AE-LSKA modules using different dilated kernels. It should be noted that Conv stands for convolution, DW-Conv stands for depth-wise convolution, and DW-D-Conv stands for depth-wise dilated convolution. The receptive field of the LSKA was determined by the kernel size as well as the dilation. For the 5 × 5 convolution layer, we constructed five receptive fields of 5, 7, 11, 15, and 23, respectively. The symbols $c_1$, $c_2$, and $c\_$ represent the number of channels in the input, output, and intermediate layers, respectively.
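A minimal PyTorch sketch of one AE-LSKA block is shown below, using the k = 7 configuration of Table 3 (depth-wise kernels 3 + 3 with dilation 2); the activation functions are an assumption not fixed by the text.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """LSKA sketch: 2-D depth-wise kernels decomposed into cascaded horizontal and
    vertical 1-D kernels (DW-Conv), a dilated 1-D pair (DW-D-Conv), and a 1x1
    projection; the result gates the input (attention)."""

    def __init__(self, dim: int, k_dw: int = 3, k_dwd: int = 3, dilation: int = 2):
        super().__init__()
        self.dw_h = nn.Conv2d(dim, dim, (1, k_dw), padding=(0, k_dw // 2), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (k_dw, 1), padding=(k_dw // 2, 0), groups=dim)
        pad = (k_dwd // 2) * dilation
        self.dwd_h = nn.Conv2d(dim, dim, (1, k_dwd), padding=(0, pad),
                               dilation=(1, dilation), groups=dim)
        self.dwd_v = nn.Conv2d(dim, dim, (k_dwd, 1), padding=(pad, 0),
                               dilation=(dilation, 1), groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.proj(self.dwd_v(self.dwd_h(self.dw_v(self.dw_h(x)))))
        return x * attn          # attention gating


class AELSKA(nn.Module):
    """AE-LSKA sketch replacing a 5x5 convolution: two 3x3 convolutions compress
    and restore the channels (the auto-encoder), then LSKA widens the receptive
    field."""

    def __init__(self, c_in: int, c_out: int, k_dwd: int = 3, dilation: int = 2):
        super().__init__()
        # auto-encoder: two 3x3 convolutions with the middle channels kept at c_in
        self.encoder = nn.Sequential(nn.Conv2d(c_in, c_in, 3, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))
        self.lska = LSKA(c_out, k_dwd=k_dwd, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.lska(self.decoder(self.encoder(x)))
```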
Table 3 compares the original 5 × 5 convolution with AE-LSKAs using different parameters. Here, for simplicity, we assumed that the numbers of input and output channels were the same ($c_1 = c_2 = C$), the input and output feature sizes were constant, and the stride was 1. Typically, the number of channels in the middle layer satisfied $c\_ = c_1$. The sizes of the output features could be disregarded because they were identical. Since the number of channels $C$ is typically not less than 3, the PARAMs and FLOPs of the 5 × 5 convolution were much larger than those of all the AE-LSKA modules, indicating that AE-LSKA is an efficient approach for reducing model weights.

4. Results

The experiments are conducted on an NVIDIA GeForce RTX 3060 12 GB GPU and an Intel Core i5-8265U CPU @ 1.60 GHz, with PyTorch version 2.2 and Python version 3.11.9, as shown in Table 4. The training and validation phases are accelerated using the GPU in conjunction with CUDA. To evaluate the inference time, the CPU platform is employed for testing the different models. The M-CAPTCHA dataset, with 5000 randomly chosen samples from Kaggle "https://www.kaggle.com/datasets/sanluo/mcaptcha" (accessed on 15 June 2024), and the P-CAPTCHA dataset, with 3000 samples generated using the Python captcha library, are adopted. The datasets are divided into training and validation sets at a ratio of 80% to 20% and used to train Deep-CAPTCHA and Adaptive-CAPTCHA. The number of training epochs is based on the experimental observation that the model's loss and accuracy metrics remain stable after 130 epochs. A batch size of 256 makes full use of the RTX 3060's available GPU memory. In addition to PARAMs and FLOPs, we focus on the Average Attack Success Rate (AASR) and the loss during training and validation, as shown in Table 4.
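The data pipeline corresponding to Table 4 can be sketched as follows; the placeholder tensors (image size 3 × 64 × 192 and an assumed 5-character, 26-class label layout) stand in for the actual dataset files and label encoding, which the text does not spell out.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder tensors standing in for the 5000 M-CAPTCHA images (3 x 64 x 192)
# and their one-hot character labels (assumed 5 characters over A-Z).
images = torch.rand(5000, 3, 64, 192)
labels = torch.zeros(5000, 5, 26)

dataset = TensorDataset(images, labels)
n_train = int(0.8 * len(dataset))                                   # 80%/20% split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)  # batch size 256
val_loader = DataLoader(val_set, batch_size=256)
# The recognizers are then trained for 130 epochs with the BCE loss shown in Figure 1.
```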

4.1. Performance of VCS and Variants

Figure 9a shows the validation results of the Adaptive-CAPTCHA on the M-CAPTCHA dataset. According to the figure, the accuracy of VCS shows a steady upward trend throughout the validation process and finally reaches nearly 80%, which is the highest among all algorithms. Among the RCSs with different thresholds, RCS-1.0 and RCS-0.7 have relatively higher accuracies. Furthermore, in the absence of a color augmentation algorithm, the AASR is at its lowest, approximately 35%, which is 45 percentage points less than the highest value. Figure 9b demonstrates that the model also achieves similar results on P-CAPTCHA, with VCS having the best performance at more than 70%. In the absence of color shift augmentation, the AASR is approximately 60%. In addition, the performance improvement from VCS is smaller on P-CAPTCHA because it contains fewer disturbances and patterns.
Figure 10a shows that the AASRs of all algorithms are less than 30% using Deep-CAPTCHA on the M-CAPTCHA, in which case the best performance is achieved without any color shift enhancement. However, VCS still performs close to the best RCS. Figure 10b shows the performance of Deep-CAPTCHA on the P-CAPTCHA. The AASR of VCS is about 35%, which is significantly higher than RCS by at least three percentage points and seven percentage points higher than when not using any color offset.
Figure 11a,b show the performance of Adaptive-CAPTCHA after integrating VCS and PCA-CS on M-CAPTCHA and P-CAPTCHA, respectively. The AASR with PCA-CS outperforms the model without any color shift and has a faster convergence rate relative to VCS. However, compared to VCS, even the AASR of the best-performing PCA-CS is still about 40 and 5 percentage points lower on each dataset, respectively.
In Figure 12a,b, PCA-CS-0.5 exhibits nearly the best AASR, indicating that PCA-CS is robust to weak recognizers and can guarantee the generalization of features in the dataset. This is due to the consistency of the channel color distribution maintained by PCA-CS through singular value decomposition.
Table 5 compares the AASRs of Dilated-VCS with various dropout and dilated convolutional kernels, demonstrating that all algorithms achieved an AASR of 99.9% when trained on both datasets using Adaptive-CAPTCHA. Dilated-VCS (kernel = 4, dropout = 0.3) integrated into Adaptive-CAPTCHA achieves an accuracy of 72% on M-CAPTCHA, while the best score is obtained on P-CAPTCHA by adopting a kernel size of 8 and dropout rate of 0.3. In addition, Deep-CAPTCHA has a larger optimal convolution kernel both on P-CAPTCHA and M-CAPTCHA.

4.2. Experimental Analysis of AE-LSKA

Figure 13a,b illustrate that AE-LSKAs with larger dilation kernels exhibit slower convergence rates during the training of Adaptive-CAPTCHA across both datasets. However, all models achieve convergence after sufficient training epochs.
Figure 14a demonstrates that Adaptive-CAPTCHA achieves an AASR of approximately 88% with AE-LSKA (k = 7) integrated, exceeding the performance of the original model by 49 percentage points and that of the AE-only model by 8 percentage points. Furthermore, Figure 14b reveals that AE-LSKA (k = 7) has the best accuracy rate of about 83%.
Figure 15a,b show the results for when AE-LSKA is integrated into Deep-CAPTCHA. Compared to Adaptive-CAPTCHA, the steeper AASR curves indicate that LSKA enhances global attention while increasing the model’s learning difficulty. AE-LSKA (k = 11) achieved approximately 33% accuracy on M-CAPTCHA, an improvement of roughly 6 percentage points over the original model. Nevertheless, the results demonstrate that the model integrating only AE exhibits the highest accuracy on P-CAPTCHA, with AE-LSKA (k = 7) ranking as the second-highest performer. In addition, it can be seen from the figure that the performance of AE-LSKA (k = 23) is relatively low on both datasets because the overly large kernel size makes the model unable to learn effectively.
Figure 16a illustrates the AASR improvements achieved for the recognition of individual characters from A to Z on the M-CAPTCHA dataset when the Adaptive-CAPTCHA model is enhanced with AE-LSKA (k = 7). Similarly, Figure 16b presents the results for the same model on P-CAPTCHA, where it is evident that most characters experience a notable enhancement in recognition accuracy.
Table 6 provides an in-depth comparison of an AE integrated with various attention modules to validate PARAMs, FLOPs, and AASR using two distinct CAPTCHA recognition models: Deep-CAPTCHA and Adaptive-CAPTCHA. It can be observed that when only the AE module is incorporated into the model, the number of parameters and the amount of computation are the lowest among all configurations.
For Deep-CAPTCHA, the AE + LSKA (k = 11) configuration achieves a high AASR of 32.9%, the second-highest among the validated models. The highest AASR, 48.9%, is observed with the AE + PNA configuration; however, this comes at the cost of 379.51 M FLOPs. In contrast, AE + LSKA (k = 11) provides a more balanced trade-off at 178.38 M FLOPs.
For Adaptive-CAPTCHA, AE + LSKA (k = 7) maintains its strong performance. Specifically, this configuration achieves an impressive AASR of 89.8% with 229.59 M FLOPs, making it the highest-performing configuration among the attention modules validated and demonstrating superior accuracy and computational efficiency. While AE + PNA offers the highest accuracy for Deep-CAPTCHA, its FLOPs are much larger than those of the other configurations, and it has the lowest accuracy for Adaptive-CAPTCHA.

5. Discussion

5.1. Analysis of Color Shift Augmentation Techniques

Among the RCS algorithms with varying thresholds, RCS-0.7 and RCS-1.0 demonstrate relatively higher accuracy, suggesting that increased variability improves the training performance of robust Adaptive-CAPTCHAs in complex scenarios like M-CAPTCHA. In general, PCA-CS has the lowest overall performance, followed by RCS.
When applied to complex datasets with weak recognizers, RCS can increase the model’s learning difficulty, thereby decreasing recognition accuracy. Therefore, the color shift algorithms are contingent upon the model and dataset and should only be employed when the original dataset is insufficient for adequate model training. Additionally, we observe that the RCS curves exhibit greater jitter on both datasets, indicating that RCS stability is lower compared to VCS and PCA-CS.
VCS shows optimal performance in nearly all scenarios, except when training on complex datasets with weak recognizers. This suggests that VCS effectively enhances data diversity while increasing training complexity, which is efficient for simple datasets regardless of model robustness.
For Dilated-VCS, the substantial interference in the CAPTCHA leads strong recognizers to overfit. Thus, combining a small kernel with a high dropout rate enhances the model’s generalization capabilities on the validation set. Dilated-VCS and Sim-VCS perform less effectively and exhibit an inverse relationship regarding computational complexity but maintain a better balance between performance and computation.

5.2. Effectiveness of AE-LSKA

Incorporating AE-LSKA significantly enhances the performance of Adaptive-CAPTCHA and Deep-CAPTCHA across both datasets. Experimental results reveal that increasing the LSKA kernel size slows learning curve convergence and increases computational complexity.
The optimal kernel size choice depends on both the dataset and the model. Adaptive-CAPTCHA with AE-LSKA (k = 7) yields optimal results on the M-CAPTCHA, while AE-LSKA (k = 15) achieves the best score on the P-CAPTCHA. For Deep-CAPTCHA, AE-LSKA (k = 11) and AE achieve the highest recognition accuracy on M-CAPTCHA and P-CAPTCHA, respectively.
The improvement from AE-LSKA is significantly more pronounced for strong recognizers than for weak models. Moreover, for low-performance models, using LSKA with a larger kernel may lead to a substantial decrease in recognition accuracy due to the increased complexity of the learning process.
Overall, AE is an effective technique for compressing noise while simultaneously reducing both FLOPs and PARAMs. In contrast, LSKA increases FLOPs and parameters while expanding the receptive field. AE-LSKA strikes an optimal balance between accuracy and computation for both CAPTCHA recognition models. These findings validate the effectiveness of LSKA in improving the performance of AE-based CAPTCHA solvers, providing robust solutions that are adaptable to various computational constraints.

5.3. Limitations and Future Work

The algorithms proposed in this study have only been evaluated on text CAPTCHA datasets and have not been validated on datasets from other domains. Future research will assess the effectiveness and versatility of the algorithms in general downstream tasks, including image classification, image denoising, object detection, and image segmentation.
Beyond CAPTCHA recognition, the proposed methods hold potential for broader applications. The VCS approach, which allows for learnable color transformations, provides significant advantages in scenarios that require robust augmentation techniques for contaminated images, such as image denoising, where subtle variations in image quality can affect model performance. In medical image analysis, VCS could address challenges arising from variations in image acquisition or patient conditions, thereby enhancing the accuracy of disease detection and diagnosis.
Similarly, the AE-LSKA module, with its efficient mitigation of interference and expansion of the receptive field, shows significant promise in computationally demanding computer vision tasks. For instance, in underwater acoustic detection, where noisy data and occlusions are common, AE-LSKA could effectively reduce noise while maintaining real-time computation. Future research will explore these broader applications further and optimize the algorithms for specific use cases.

6. Conclusions

In conclusion, this study presents two innovative algorithms designed to improve the effectiveness of text CAPTCHA evaluation systems. The first contribution involves the development of a VCS algorithm specifically for color data augmentation in text CAPTCHAs, accompanied by two simplified variants, Sim-VCS and Dilated-VCS. A thorough analysis of their mathematical foundations and algorithmic processes has been conducted, offering detailed insights into their operational mechanisms. The second contribution is the integration of an Auto-Encoder with Large Separable Kernel Attention, termed AE-LSKA, which replaces convolutional modules within architectures. Experimental evaluations show that the VCS algorithm significantly enhances the AASR of both strong and weak models, outperforming existing color augmentation algorithms like RCS and PCA-CS. Moreover, AE-LSKA consistently improves model performance across all validated datasets. It not only surpasses most attention mechanisms concerning AASR but also achieves an optimal balance between accuracy and computational efficiency, as measured by FLOPs. These advancements underscore the potential of VCS and AE-LSKA as effective tools for text CAPTCHA recognition, providing a dual approach that addresses both data augmentation and architectural optimization. These findings indicate significant improvements in CAPTCHA processes, paving the way for the development of more robust and efficient models for practical applications.

Author Contributions

X.W. carried out all the experiments and wrote the first draft of the paper. F.A.R. provided guidance on the innovations and the overall structure, and J.J. revised the first draft of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Leshan Special Robot Engineering Technology Research Center.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Setiawan, A.B.; Sastrosubroto, A.S. Strengthening the Security of Critical Data in Cyberspace, a Policy Review. In Proceedings of the 2016 International Conference on Computer, Control, Informatics and its Applications (IC3INA), Tangerang, Indonesia, 3–5 October 2016; pp. 185–190. [Google Scholar]
  2. von Ahn, L.; Blum, M.; Hopper, N.J.; Langford, J. CAPTCHA: Using Hard AI Problems for Security. In Advances in Cryptology—EUROCRYPT 2003; Biham, E., Ed.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 294–311. [Google Scholar]
  3. Yan, J.; El Ahmad, A.S. Usability of CAPTCHAs or Usability Issues in CAPTCHA Design. In Proceedings of the 4th Symposium on Usable Privacy and Security—SOUPS ’08, Pittsburgh, PA, USA, 23–25 July 2008; ACM Press: Pittsburgh, PA, USA, 2008; p. 44. [Google Scholar]
  4. Alsuhibany, S.A. Evaluating the Usability of Optimizing Text-Based CAPTCHA Generation. Int. J. Adv. Comput. Sci. Appl. IJACSA 2016, 7, 164–169. [Google Scholar] [CrossRef]
  5. Wang, J.; Qin, J.; Xiang, X.; Tan, Y.; Pan, N. CAPTCHA Recognition Based on Deep Convolutional Neural Network. Math. Biosci. Eng. 2019, 16, 5851–5861. [Google Scholar] [CrossRef] [PubMed]
  6. Guerar, M.; Verderame, L.; Migliardi, M.; Palmieri, F.; Merlo, A. Gotta CAPTCHA ’Em All: A Survey of 20 Years of the Human-or-Computer Dilemma. ACM Comput. Surv. 2022, 54, 1–33. [Google Scholar] [CrossRef]
  7. Chellapilla, K.; Larson, K.; Simard, P.Y.; Czerwinski, M. Building Segmentation Based Human-Friendly Human Interaction Proofs (HIPs). In Human Interactive Proofs; Baird, H.S., Lopresti, D.P., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1–26. [Google Scholar]
  8. Zhang, J.; Sang, J.; Xu, K.; Wu, S.; Zhao, X.; Sun, Y.; Hu, Y.; Yu, J. Robust CAPTCHAs Towards Malicious OCR. IEEE Trans. Multimed. 2021, 23, 2575–2587. [Google Scholar] [CrossRef]
  9. Wang, P.; Gao, H.; Guo, X.; Xiao, C.; Qi, F.; Yan, Z. An Experimental Investigation of Text-Based CAPTCHA Attacks and Their Robustness. ACM Comput. Surv. 2023, 55, 196:1–196:38. [Google Scholar] [CrossRef]
  10. Xing, W.; Mohd, M.R.S.; Johari, J.; Ruslan, F.A. A Review on Text-Based CAPTCHA Breaking Based on Deep Learning Methods. In Proceedings of the 2023 International Conference on Computer Engineering and Distance Learning (CEDL), Shanghai, China, 29 June–1 July 2023; pp. 171–175. [Google Scholar]
  11. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  12. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  13. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  14. Walia, J.S.; Odugoudar, A. Vulnerability Analysis of Captcha Using Deep Learning. In Proceedings of the 2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG), online, 8 December 2023; pp. 1–7. [Google Scholar]
  15. Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.-T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A Comprehensive Survey on Data Augmentation. arXiv 2024, arXiv:2405.09591. [Google Scholar]
  16. Bursztein, E.; Martin, M.; Mitchell, J.C. Text-Based CAPTCHA Strengths and Weaknesses. In Proceedings of the Proceedings of the 18th Acm Conference on Computer & Communications Security (CCS 11), Chicago, IL, USA, 17–21 October 2011; Assoc Computing Machinery: New York, NY, USA, 2011; pp. 125–137. [Google Scholar]
  17. Mocanu, I.G.; Yang, Z.; Belle, V. Breaking CAPTCHA with Capsule Networks. Neural Netw. 2022, 154, 246–254. [Google Scholar] [CrossRef] [PubMed]
  18. Shi, Y.; Liu, X.; Han, S.; Lu, Y.; Zhang, X. A Transformer Network for CAPTCHA Recognition. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Information Systems, Chongqing, China, 28–30 May 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1–5. [Google Scholar]
  19. Qing, K.; Zhang, R. An Efficient ConvNet for Text-Based CAPTCHA Recognition. In Proceedings of the 2022 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Penang, Malaysia, 22–25 November 2022; pp. 1–4. [Google Scholar]
  20. Noury, Z.; Rezaei, M. Deep-CAPTCHA: A Deep Learning Based CAPTCHA Solver for Vulnerability Assessment. arXiv 2020, arXiv:2006.08296. [Google Scholar] [CrossRef]
  21. Wan, X.; Johari, J.; Ruslan, F.A. Adaptive CAPTCHA: A CRNN-Based Text CAPTCHA Solver with Adaptive Fusion Filter Networks. Appl. Sci. 2024, 14, 5016. [Google Scholar] [CrossRef]
  22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. Acm 2017, 60, 84–90. [Google Scholar] [CrossRef]
  23. Wang, X.; Yu, J. Learning to Cartoonize Using White-Box Cartoon Representations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 8087–8096. [Google Scholar]
  24. Ishkov, D.O.; Terekhov, V.I. Text CAPTCHA Traversal with ConvNets: Impact of Color Channels. In Proceedings of the 2022 4th International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE), Moscow, Russia, 17–19 March 2022; pp. 1–5. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  26. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  27. Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. In Proceedings of the ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar]
  28. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  29. Goyal, A.; Bochkovskiy, A.; Deng, J.; Koltun, V. Non-Deep Networks. Adv. Neural Inf. Process. Syst. 2022, 35, 6789–6801. [Google Scholar]
  30. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  31. Lau, K.W.; Po, L.-M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN. Expert. Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
  32. Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual Attention Network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar] [CrossRef]
  33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  34. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  35. Chen, S.; Guo, W. Auto-Encoders in Deep Learning—A Review with New Perspectives. Mathematics 2023, 11, 1777. [Google Scholar] [CrossRef]
  36. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends® Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
Figure 1. The structure and training process of Deep-CAPTCHA.
Figure 2. Adaptive-CAPTCHA improved on Deep-CAPTCHA.
Figure 3. Samples of P-CAPTCHA.
Figure 4. Samples of M-CAPTCHA.
Figure 5. The flowchart of VCS.
Figure 6. The difference between VCS and Sim-VCS.
Figure 7. The dilated convolution-based sampling of Dilated-VCS.
Figure 8. The AE-LSKAs with different dilated kernels and receptive fields.
Figure 9. AASR comparison between VCS and RCS using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 10. AASR comparison between VCS and RCS using Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 11. AASR comparison between VCS and PCA-CS using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 12. AASR comparison between VCS and PCA-CS using Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 13. Learning process of AE-LSKAs with different dilated kernels integrated into Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 14. Validation accuracy comparison of AE-LSKAs with different dilated kernels integrated into Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 15. Validation accuracy comparison of AE-LSKAs with different kernels integrated into Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 16. Individual character accuracy comparison using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Table 1. Algorithms and datasets.

Dataset/Model | Description
Deep-CAPTCHA | Weak CAPTCHA recognizer
Adaptive-CAPTCHA | Strong CAPTCHA recognizer
P-CAPTCHA | Simple CAPTCHA dataset
M-CAPTCHA | Complex CAPTCHA dataset
AE-LSKA | New feature extraction module
VCS, Sim-VCS, Dilated-VCS | New color shift methods
Table 2. Comparison of VCS and Dilated-VCS.

Type | Input | $K_s$ | Dilated Rate | Stride | Padding | PARAMs | FLOPs
Dilated-VCS | (64, 192) | (4, 4) | (7, 21) | (22, 64) | (1, 0) | 450 | 5514
Dilated-VCS | (64, 192) | (8, 8) | (3, 9) | (22, 64) | (1, 0) | 1314 | 21,066
Dilated-VCS | (64, 192) | (22, 22) | (1, 3) | (22, 64) | (1, 0) | 8874 | 156,978
VCS | (64, 192) | (64, 64) | (1, 1) | (64, 64) | (0, 0) | 162 | 73,728
Table 3. PARAMs and FLOPs of AE-LSKAs.

 | Conv 5 × 5 | AE-LSKA5 | AE-LSKA7 | AE-LSKA11 | AE-LSKA15
Kernel (LSKA) | / | 3 + 3 | 3 + 3 | 3 + 5 | 3 + 5
Dilation | / | 2 | 2 | 2 | 3
PARAMs | $25 C^2$ | $10 \cdot C^2 + 12 \cdot C$ | $10 \cdot C^2 + 12 \cdot C$ | $10 \cdot C^2 + 16 \cdot C$ | $10 \cdot C^2 + 16 \cdot C$
FLOPs | $50 C^2$ | $20 \cdot C^2 + 24 \cdot C$ | $20 \cdot C^2 + 24 \cdot C$ | $20 \cdot C^2 + 32 \cdot C$ | $20 \cdot C^2 + 32 \cdot C$
Table 4. Hardware and software for experiments.

Type | Name | Specification | Version
Hardware | Graphics Processing Unit (GPU) | NVIDIA GeForce RTX 3060 12 GB | -
Hardware | Central Processing Unit (CPU) | Intel(R) Core(TM) i5-8265U @ 1.60 GHz | -
Software | PyTorch | - | 2.2
Software | Python | - | 3.11.9
Software | Cuda | - | 12.2.140
Software | Cudnn | - | 9.1.0
Dataset | M-CAPTCHA | 5000 images | -
Dataset | P-CAPTCHA | 3000 images | -
Model | Deep-CAPTCHA | - | -
Model | Adaptive-CAPTCHA | - | -
Evaluation | AASR, Loss, PARAMs, FLOPs | - | -
Setup | Dataset splitting | 80% for training and 20% for validation | -
Setup | Epochs | 130 | -
Setup | Batch size | 256 | -
Setup | Metrics | AASR, PARAMs, FLOPs, Loss | -
Table 5. AASRs of Dilated-VCSs with different settings.

Algorithm | Dilated Kernel | Dropout | Model | Train/Val AASR (%) P-CAPTCHA | Train/Val AASR (%) M-CAPTCHA
Dilated-VCS | 4 | 0.0 | Deep-CAPTCHA | 96.9/37.0 | 95.0/26.2
Dilated-VCS | 4 | 0.3 | Deep-CAPTCHA | 97.2/36.1 | 95.6/26.6
Dilated-VCS | 8 | 0.0 | Deep-CAPTCHA | 96.4/37.8 | 94.6/27.2
Dilated-VCS | 8 | 0.3 | Deep-CAPTCHA | 97.3/37.9 | 94.8/25.7
Dilated-VCS | 22 | 0.0 | Deep-CAPTCHA | 97.2/36.8 | 93.3/25.2
Dilated-VCS | 22 | 0.3 | Deep-CAPTCHA | 97.0/39.1 | 93.7/24.0
Dilated-VCS | 4 | 0.0 | Adaptive-CAPTCHA | 99.9/73.2 | 99.9/69.2
Dilated-VCS | 4 | 0.3 | Adaptive-CAPTCHA | 99.9/73.1 | 99.9/72.0
Dilated-VCS | 8 | 0.0 | Adaptive-CAPTCHA | 99.9/71.6 | 99.9/70.1
Dilated-VCS | 8 | 0.3 | Adaptive-CAPTCHA | 99.9/74.0 | 99.9/69.0
Dilated-VCS | 22 | 0.0 | Adaptive-CAPTCHA | 99.9/67.6 | 99.9/57.6
Dilated-VCS | 22 | 0.3 | Adaptive-CAPTCHA | 99.9/68.2 | 99.9/55.6
VCS | - | - | Deep-CAPTCHA | 94.0/35.8 | 93.7/26.6
VCS | - | - | Adaptive-CAPTCHA | 99.9/74.5 | 99.9/78.0
Sim-VCS | - | - | Deep-CAPTCHA | 95.6/35.6 | 93.4/26.2
Sim-VCS | - | - | Adaptive-CAPTCHA | 99.9/72.4 | 99.9/76.9
Table 6. A comparison of AE with attention modules on the M-CAPTCHA dataset.

Algorithm | Model | PARAMs | FLOPs | AASR (%)
Conv (Baseline) | Deep-CAPTCHA | 6.46 M | 212.78 M | 28.5
AE | Deep-CAPTCHA | 6.40 M | 146.13 M | 28.9
AE + LSKA (k = 7) | Deep-CAPTCHA | 6.41 M | 176.02 M | 28.4
AE + LSKA (k = 11) | Deep-CAPTCHA | 6.41 M | 178.38 M | 32.9
AE + CBAM (ratio = 8) | Deep-CAPTCHA | 6.40 M | 148.37 M | 25.0
AE + CBAM (ratio = 16) | Deep-CAPTCHA | 6.40 M | 148.37 M | 31.2
AE + ECA (ratio = 2) | Deep-CAPTCHA | 6.40 M | 146.72 M | 26.1
AE + ECA (ratio = 4) | Deep-CAPTCHA | 6.40 M | 146.72 M | 21.1
AE + GC (ATT + ADD) | Deep-CAPTCHA | 6.41 M | 146.78 M | 28.7
AE + GC (AVG + MUL) | Deep-CAPTCHA | 6.41 M | 146.73 M | 23.8
AE + SA (groups = 8) | Deep-CAPTCHA | 6.40 M | 146.43 M | 25.4
AE + SA (groups = 16) | Deep-CAPTCHA | 6.40 M | 146.45 M | 22.9
AE + SE (ratio = 8) | Deep-CAPTCHA | 6.40 M | 146.72 M | 20.9
AE + SE (ratio = 16) | Deep-CAPTCHA | 6.40 M | 146.72 M | 23.0
AE + PNA | Deep-CAPTCHA | 6.48 M | 379.51 M | 48.9
Conv (Baseline) | Adaptive-CAPTCHA | 3.82 M | 259.80 M | 30.3
AE | Adaptive-CAPTCHA | 3.29 M | 192.97 M | 79.1
AE + LSKA (k = 7) | Adaptive-CAPTCHA | 3.39 M | 229.59 M | 89.8
AE + LSKA (k = 11) | Adaptive-CAPTCHA | 3.39 M | 232.10 M | 88.3
AE + CBAM (ratio = 8) | Adaptive-CAPTCHA | 3.31 M | 195.32 M | 85.4
AE + CBAM (ratio = 16) | Adaptive-CAPTCHA | 3.30 M | 195.29 M | 77.3
AE + ECA (ratio = 2) | Adaptive-CAPTCHA | 3.29 M | 193.60 M | 82.6
AE + ECA (ratio = 4) | Adaptive-CAPTCHA | 3.29 M | 193.60 M | 85.9
AE + GC (ATT + ADD) | Adaptive-CAPTCHA | 3.38 M | 193.74 M | 66.3
AE + GC (AVG + MUL) | Adaptive-CAPTCHA | 3.38 M | 193.69 M | 81.8
AE + SA (groups = 8) | Adaptive-CAPTCHA | 3.29 M | 193.29 M | 84.3
AE + SA (groups = 16) | Adaptive-CAPTCHA | 3.29 M | 193.31 M | 84.3
AE + SE (ratio = 8) | Adaptive-CAPTCHA | 3.31 M | 193.62 M | 84.0
AE + SE (ratio = 16) | Adaptive-CAPTCHA | 3.30 M | 193.61 M | 79.7
AE + PNA | Adaptive-CAPTCHA | 4.27 M | 489.68 M | 24.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
