1. Introduction
Presently, numerous daily life activities which include communication, travel and tours, education, online E-commerce, entertainment, and most importantly money transactions are carried out by connecting to the internet. In order to perform such website tasks, legitimate end-users have to provide their private information by registering to the website. In creating a website account or registering, some hackers tend to write malicious programs which waste the website resources by making automatic false registrations called bots. These false registrations may affect the website and make it vulnerable to hackers, where they can go further to steal end-users’ private information or intercept their transactions. Therefore, in order to differentiate between human end-users and machine bots, cybersecurity practitioners generate CAPTCHAs as a form of security mechanism in website applications to defend end-users’ private information from automated malicious attacks.
CAPTCHA, which stands for (Completely Automated Public Turing test to tell Computers and Humans Apart), provides a way for web service providers to make some assumptions about whether an end-user is human or robot [
1]. The most famous way of protecting internet forms is to create a special image made up of letters and numbers and then require the user to enter it in a special textbox, as seen in
Figure 1. CAPTCHAs serve as a network security approach, which are used for websites to avoid automatic form entering, spamming, automatic voting, etc. [
2].
The majority of text-based CAPTCHAs consist of English uppercase letters (A to Z), English lowercase letters (a to z), and numerals (0 to 9) [
3]. These text-based CAPTCHAs are distorted text images which can be misrecognized by computers or robots but can highly be recognized by humans. The letters in text-based CAPTCHAs are sometimes overlapped, rotated, or an addition of curvature to the text image [
4].
The recognition accuracy based on humans for effective CAPTCHAs is at least 80%, while less than 0.01% recognition accuracy is based on computers [
2,
5]. Sometimes, since some companies or website owners want to satisfy their customers to easily recognize these CAPTCHAs, they are quickly compelled to generate loose CAPTCHAs, especially text-based CAPTCHAs with white backgrounds.
The CAPTCHA recognition technology in the area of traditional image processing, is divided into image preprocessing, positioning, character segmentation, and character recognition. This traditional approach, however, is difficult to form an accurate template set as a result of the adhered and complex CAPTCHA images such as rotation and overlapping [
2]. Any CAPTCHA is vulnerable to hackers, regardless of the image-generating algorithm [
1]. However, with a constant ethical vulnerability assessment, the loopholes in the CAPTCHA security can be assessed and then reduced.
The pipeline of CAPTCHA breaking consists of efforts, approaches, and a model that attempts to increase accuracy [
1]. According to Kolupaev et al. [
1], a neural network could recognize a single letter easier than a human could, and this fact did not depend on font style, rotation, or distortion. This means that if a hacker is well invested in building a neural network, the only defense mechanism is to make the characters hard to separate, but with the possibility to separate in order to provide good training data for the CNN to train on and then break these CAPTCHAs with high accuracy. This is a goal we want to achieve in the current study.
Recently, deep learning, which is a subfield in machine learning [
6], is one of the demanded areas in artificial intelligence research. Deep learning has achieved good performance and great success in many applications such as scene recognition [
7], image recognition [
8], object detection [
9,
10,
11,
12,
13], and image restoration [
14,
15,
16]. In contrast to the traditional pattern recognition technique, the huge advantage of deep learning is its effectiveness in learning features actively without artificial design [
2].
The advancement of deep learning has been so dramatic that tasks previously thought not easily addressable by computers and used as CAPTCHA to prevent spam are now possible to solve. Based on this observation, we are motivated to develop an efficient and accurate CNN architecture to break text-based CAPTCHAs with white backgrounds. The main idea is to perform a vulnerability assessment by developing a faster and high accuracy model. The significance of this study will create an awareness for the CAPTCHA-generating practitioners to assess the strengths and weaknesses of their generating algorithms in order to first observe if they can detect security loopholes in the text-based CAPTCHAs before deploying on websites.
This will assist them in generating an improved text-based CAPTCHA in the area of an automated challenge and respond against attacks dispensed by automated machines or bots. This study recognizes and acknowledges the previous CAPTCHA breaking techniques which have implemented their proposed CNN architectures using standard convolution layers. However, since a wide range of websites still use text-based CAPTCHAs, the trust of the end-users must not be betrayed through malicious attacks. CAPTCHA-generating practitioners must imitate the hackers through ethical means to perform a cybersecurity vulnerability assessment by developing faster, robust, and efficient algorithms than can outperform.
Therefore, we propose an efficient and accurate Depth-wise Separable Convolutional Neural Network to break text-based CAPTCHAs. Several CAPTCHA breaking practitioners have implemented their CNN architectures using standard CNNs. However, to the best of our knowledge, this is the first CAPTCHA breaking study to adopt the Depth-wise Separable CNN to break text-based CAPTCHAs.
In this current work, we are not utilizing the whole CAPTCHA image as an input for training the network, but we are rather splitting the images for the individual characters. We prove that, with a good and practical approach for creating the right training data for the Convolutional Neural Network (CNN), one may not require to fine-tune the existing state-of-the-art models to break CAPTCHAs, especially with a white background. This is due to the fact that these existing CNN architectures have a high number of parameters, may take a longer time to train, and the model sizes may relatively increase, which may consume a lot of memory and resources.
Therefore, this study makes the problem simple by extracting all the single letters from the CAPTCHA images with a 2-pixel margin around the edges, annotate, and then label them automatically for the CNN architecture to be trained on. The CNN has to use these single letters as inputs rather than the whole CAPTCHA image relative to the current problem at hand. A detailed description of the technique used to extract the single characters is given in the next section.
In this approach, with a small number of convolution layers in a well-developed CNN architecture, as proposed in this work, the experimental results show that the proposed technique performed better to break text-based CAPTCHAs with rotation, overlapping, and with a white background. One of the importance of CAPTCHA breaking research is to find security loopholes in CAPTCHAs, as a form of a vulnerability assessment technique, to assist CAPTCHA generating practitioners in generating improved text-based CAPTCHAs.
2. Materials and Methods
This section presents the dataset used in this work, the problem definition, and a practical way to tackle it. In addition, it provides the idea behind the proposed CAPTCHA breaking algorithm, the internal structure of the proposed computation-efficient CNN model, and its evaluation metrics.
2.1. Data Description
The CAPTCHA breaking research is very sensitive, and for that matter, there are no publicly available standard datasets of CAPTCHA images to be used. As a result, we are motivated to obtain CAPTCHA images either by retrieving them from real websites or generating them using libraries. Therefore, we adopted a low-cost approach to generate CAPTCHA images using the open source python CAPTCHA library (
https://pypi.org/project/captcha/ (accessed on 20 December 2020)). We obtained 10,000 CAPTCHA images, which are similar to real world CAPTCHA images such as the Weibo CAPTCHA scheme. The images contain four characters consisting of numerals and uppercase English letters. We excluded the [0,1,O,I] characters to avoid confusion to both humans and machines. The sample of the generated CAPTCHA images can be seen from
Figure 2a. The dataset poses a challenge where some of the characters are overlapping as seen in
Figure 2b. The approach for mitigating the problem is discussed in the next section. The dataset and codes used for implementing this current research study are available at (
https://github.com/sm-multimedia/Text-based-CAPTCHA-Breaking- (accessed on 27 January 2021)).
2.2. Problem Definition
The problem definition in this current study is to input a text-based CAPTCHA image which has rotated, overlapping characters, and especially with a white background, and then break the same CAPTCHA with a high accuracy rate, as illustrated in
Figure 3a. The problem can be solved by using deep learning to break the characters in the CAPTCHA image simultaneously, as shown in
Figure 3b.
However, in contrast to the approach in
Figure 3b, the problem is made simple by splitting the images for each single character, annotating, and labelling them to generate a new training dataset for a simple and efficient proposed CNN architecture. The graphical representation of this scenario is illustrated in
Figure 3c.
2.3. Basic Idea of the Single Character Extraction (SCE) Algorithm
Based on
Figure 3c, the SCE algorithm was developed which helped generate a new training dataset. The original text-based CAPTCHA images are loaded from the disk and split for the individual characters to output a single character. The general formula for the new training dataset set is expressed as:
where
T represents the new number of training dataset,
C represents the number of characters in each CAPTCHA image, and
N represents the total number of CAPTCHA images.
The main concept backing the proposed technique is to make the problem simple, especially when dealing with four-character CAPTCHA images with white backgrounds. The original text-based CAPTCHA images are loaded and converted to grayscale images for faster preprocessing. The base filenames of the original CAPTCHA images contain the labels which can be extracted, the images are then binarized through a thresholding technique. Thresholding the images ensures that the backgrounds of the images are black, while the foregrounds are white. This part of the preprocessing is a critical step in the workflow since it assists in finding the outlines of each of the characters in the CAPTCHA images. The contours of the characters in the threshold images are looped over and located. The bounding boxes for the contours are computed with a 2-pixel margin around the edges to extract each single letter. In the course of the extraction, the overlapping characters are detected. This part is as critical as the thresholding, if not handled well, it will end up generating a bad training dataset.
Therefore, this study checked the problem by comparing the width and height of the contour, if the width divided by the height is greater than a certain number of threshold, in the case of this study, the threshold was 1.25 based on the experiments, then the letter is too wide to be a single letter, which is then split into two. Finally, the extracted single letters are annotated and labelled for training. The single characters can be resized to a desirable input size for the Convolutional Neural Network (CNN) architecture. The result of the Single Character Extraction (SCE) algorithm is used to train a computation-efficient Convolutional Neural Network model to break the CAPTCHA characters. A graphical description of the SCE algorithm is illustrated in
Figure 4.
2.4. Network Structure and Parameters of the Proposed CAPTCHA Breaking Depth-Wise Saparable CNN
The proposed CNN architecture is inspired by MobileNets [
17], but in contrast to MobileNets, we introduce a lot of dropout layers and without a pointwise convolutional layer. The architecture of our proposed CNN model consists of six Depth-wise Separable Convolutional layers, seven Batch Normalization layers, seven ReLU activation function layers, four Dropout layers, three Max-pooling layers, one flatten layer, one fully connected layer, and one output Softmax layer.
The network architecture uses exclusively 3 × 3 Depth-wise Separable Convolutional filters, stacked on top of each other preceding the performed max-pooling. This current research work adopted the Depth-wise Separable Convolution rather than the standard convolution layers. Since it is more efficient, it requires less memory, less computation, and at some situation, can perform better than the standard convolution. In simplicity, the architecture of the proposed CNN model uses three (DEPTHWISE_CONV => ReLU => POOL) blocks with an increasing stacking number of filters. The first block consists of one Depth-wise Separable Convolution layer with 32 filters, followed by the Rectified Linear Unit (ReLU) activation function, which is popularly used in deep learning for faster and effective training, Batch Normalization, Max-pooling, and Dropout, which regularize the network by adding noise to the output feature maps of each layer. The second block consists of two Depth-wise Separable Convolution layers with 64 filters each, followed by two Rectified Linear Unit (ReLU) activation functions at the end of each Convolution layer, two Batch-Normalizations at the end of each ReLU, a Max-pooling, and a Dropout. The third block consists of three Depth-wise Separable Convolution layers with 64 filters each, followed by three Rectified Linear Unit (ReLU) activation functions at the end of each Convolution layer, three (3) Batch Normalization layers at the end of each ReLU layer, a Max-pooling, and a Dropout layer. The Batch Normalization and Dropout layers ensure the stability of the model and prevent over-fitting, respectively. The proposed architecture has only one fully-connected layer which consists of a flattened layer, followed by 256 hidden neurons, Rectified Linear Unit (ReLU) activation function, Batch Normalization, Max-pooling, and Dropout. The output layer consists of 32 nodes, which include the number of characters to be predicted by the network with a Softmax layer.
In the Depth-wise Separable operation, convolution is applied to a single channel at a time as compared to the standard CNN, in which it is done for all the M channels. Therefore, in the depth-wise convolution, the filters or kernels will be of
Dk ×
Dk × 1 in size. Suppose there are M channels in the input image, then the M filters are needed. As a result, the output will be of
Dp ×
Dp ×
M in size. The operation of a single convolution needs
Dk ×
Dk multiplications. As the filters slide by
Dp ×
Dp times across all the M channels, then the total number of multiplications is equal to
M ×
Dp ×
Dp ×
Dk ×
Dk. Therefore, for the Depth-wise Separable Convolution, the cost operation can be expressed as:
The network architecture of the proposed CNN model can be seen in
Figure 5, and the full description of the proposed CAPTCHA Breaking CNN is illustrated in
Figure 6.
2.5. Evaluation Metrics
The proposed CAPTCHA breaking CNN model was evaluated using Accuracy, Precision, and Recall. The formulas for Accuracy, Precision, and Recall are given as:
where
TP is true positive,
FP is false positive,
TN is true negative, and
FN is false negative based on the model’s prediction.
4. Discussion
CAPTCHA breaking is an ethical and essential form of the vulnerability assessment technique to evaluate the strength of security in text based CAPTCHAs before being deployed on web applications. Text-based CAPTCHA breaking algorithms can be grouped into two main divisions, segmentation-based and segmentation-free. The segmentation-based algorithms are categories of segmentation [
24] and character recognition [
25]. When it comes to CAPTCHA breaking research, it may not always matter which kind of algorithm category it may fall under. However, what matters most is the speed and high accuracy which can be achieved by the proposed algorithm to break the CAPTCHA.
Based on previous works, Simard et al. [
26] and Jaderberg et al. [
27] used a traditional technique such as image processing to locate a single number or character regions in an image, and then segment them in order to recognize the individual characters. Yan et al. [
28] used the segmentation technique to segment Microsoft CAPTCHAs with a 60% rate of recognition [
2].
Furthermore, CAPTCHA image segmentation has been implemented using the vertical projection [
29,
30,
31] approach based on the work of Zhang et al. [
32]. In their work, they enhanced the vertical projection to treat the characters with a combination of size features of the characters and their locations with a vertical projection histogram [
3]. The connected component algorithm has been used to segment Yahoo and Google CAPTCHA schemes [
5]. However, according to Thobhani et al. [
3], the vertical projection and connected component algorithms require massive preprocessing procedures, which are computationally expensive and consume a lot of time.
In the work of Yu et al. [
33], they adopted a low-cost approach based on open source python libraries to generate CAPTCHA images, and then proposed a peak segmentation algorithm together with the Convolutional Neural Network (CNN) to recognize their generated CAPTCHAs. They defined their peak segmentation as one which relies on the writing order of CAPTCHA from left to right. The whole CAPTCHA is compressed into the x-axis by summing up the values in the y-axis. In their character recognition of the CNN architecture, they constructed a convolutional input layer using the ReLU activation layer of 28 × 34 in input size. Their CNN model consisted of standard CNN layers, max-pooling layer, 20% dropout layer, a flatten layer, and an output layer with Softmax activation. Their model obtained a 99.32% training accuracy and 92.37% validation accuracy.
Hu et al. [
2] proposed a technique based on the Convolutional Neural Network to recognize CAPTCHA and avoided the traditional image processing technology including location and segmentation [
2]. Stark et al. [
34] also used the segmentation-free algorithm based on the Convolutional Neural Network to recognize CAPTCHAs. Another CAPTCHA recognition study based on the combination of Convolutional Neural Network and attention-based Recurrent Neural Network has been achieved under the segmentation-free algorithm. In this kind of network architecture, the CNN serves as the feature extractor to obtain meaningful information from the CAPTCHA image such as feature vectors, and a variant of RNN, such as the Long- Short Term Memory (LSTM) network, which is used to transform the feature vectors into a text sequence. Even though this kind of model has a high recognition rate, according to Thobhani et al. [
3], the architecture of the CNN-RNN model is relatively complicated and could also result in increased memory and storage size.
In the work of Kwon et al. [
4], they generated their CAPTCHA images using a two-step style-transfer learning in deep neural networks. Then, they tried to break the generated CAPTCHAs using a fine-tuned VGGNet Convolutional Neural Network based on the transfer learning approach.
In the work of Thobhani et al. [
3], they proposed an attached binary image algorithm. In their ABI algorithm, they made a specific number of copies of the input CAPTCHA image, which is equal to the number of characters in the input CAPTCHA image. Then, they attached unique binary images to each copy. Thereafter, they built a CNN architecture to use their ABI algorithm to recognize CAPTCHAs with a white background and CAPTCHAs with a noisy background. Their CNN architecture consisted of 17 standard convolutional layers, five max-pooling layers, one flatten layer, one dropout layer, and one output Softmax layer. After training their model on the CAPTCHA dataset of four characters with a white background, which is from the Weibo website for 120 epochs with a 128 batch size, their model obtained accuracies for training, validating, and testing of 98.45%, 93.26%, and 92.68%, respectively.
This current study is inspired by the above observed assumptions, and is motivated to contribute to CAPTCHA breaking research in order to detect the loopholes in cybersecurity based on text-based CAPTCHAs which are loosely generated. The significance of our research will assist website owners or companies to think twice and generate CAPTCHAs which are hard to break by machines in order to protect the private information of their end-users.
Therefore, in this current work, rather than using the whole text-based CAPTCHA image as the training input, the original CAPTCHA image is split for the individual characters based on the SCE algorithm. As a result, a single character of 20 × 20 pixels in input shape is used as a training input.
The competitive results of the proposed text-based CAPTCHA breaking Depth-wise Separable CNN model have been compared with notable image recognition CNN architectures such as MobileNets, DenseNet, ShuffleNet, and Xception through transfer learning. This study has made the problem simple, and for that matter, a network with a complex number of parameters may not be necessary.
After training the fine-tuned mobileNets architecture for 10 epochs, the model obtained training and validation accuracies of 98.3% and 99.4%, respectively. The training and validation losses were 0.0657 and 0.0211, respectively. The size of the fine-tuned MobileNets model obtained was 37.76 MB. Based on the fine-tuned DenseNet, after 10 epochs, the model obtained training and validation accuracies of 99.1% and 99.6%, respectively. The training and validation losses were 0.0487 and 0.0122, respectively. The size of the DenseNet model after training on the custom CAPTCHA dataset was 82.43 MB. For the same 10 epochs, the fine-tuned ShuffleNet model obtained training and validation accuracies of 97.9% and 98.4%, respectively. The training and validation losses were 0.1156 and 0.0790, respectively. The size of the ShuffleNet model after training on the custom CAPTCHA dataset was 14.25 MB. After training the fine-tuned Xception for 10 epochs, the model obtained training and validation accuracies of 99.3% and 99.6%, respectively. The training and validation losses were 0.0256 and 0.0115, respectively. The size of the Xception model after training on the custom CAPTCHA dataset was 239.79 MB.
Based on the transfer learning models observed above, all the models performed good on the training and validation sets. However, we also noticed a lot of parameters in the fine-tuned CNN architectures and that may demand unnecessary storage and memory sizes relative to the current problem at hand.
Therefore, this study proposed an efficient and accurate CAPTCHA breaking CNN architecture which is flexible, less complex, and above all, has a high breaking accuracy. Our proposed CNN architecture consisted of six Depth-wise Separable Convolutional layers, seven Batch Normalization layers, seven ReLU activation function layers, four Dropout layers, three Max-pooling layers, one flattened layer, one fully connected layer, and one output Softmax layer. The number of parameters in our proposed model is far less than the fine-tuned CNN architectures, as shown in Table 2. Our proposed CNN model is the first CNN architecture research work to utilize a Depth-wise Separable Convolution to perform text-based CAPTCHA breaking. After 10 epochs, our proposed model achieved training and validation accuracies of 99.5% and 99.8%, respectively. The training and validation losses were 0.0164 and 0.0090, respectively. The size of our proposed model after training on the custom CAPTCHA dataset was 2.36 MB.
As shown in
Table 1, we observed that, our proposed model has lesser number of trainable and non-trainable parameters as compared with the fine-tuned CNN models. The training and validation accuracies of our proposed model were higher as compared with the fine-tuned CNN models. It took less time to train our proposed model for 7 s and 7 ms/step for 10 epochs on the GPU. And lastly, after training, the model size of our proposed CNN model was lesser as compared with the fine-tuned CNN models.
By performing inference on the testing datasets, the results are shown in
Table 2. We observed that, all the models obtained accuracies of more than 90% on the testing dataset. However, in terms of the breaking time, our proposed model is simple, flexible, efficient, and used lesser time to break the text-based CAPTCHAs as compared with the fine-tuned CNN models. Examples of text-based CAPTCHA breaking by our proposed model are shown in
Figure 16. The limitation of our proposed CAPTCHA breaking model is that it performs better on CAPTCHA images with rotated, overlapping characters, and with a white background. It has not been tested on CAPTCHA images with distortion, strikethrough, and with a complex noisy background.