4.1. Datasets
In this paper, we conducted training and testing on three widely used public datasets: ICDAR 2015 for multi-directional text, Total-Text for curved text, and MSRA-TD500 for multilingual text.
Figure 8 shows the performance of our proposed algorithm on different types of text instances. The second column presents the probability maps, the third column the threshold maps, and the fourth column the binarized maps, together illustrating the intermediate steps and final outputs of the algorithm.
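For reference, in DBNet-style detectors the approximate binary map is obtained by passing the difference between the probability map P and the threshold map T through a steep sigmoid. A minimal PyTorch sketch, assuming the amplification factor k = 50 used in the original DBNet:

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    # Soft, differentiable substitute for hard thresholding:
    # B = 1 / (1 + exp(-k * (P - T))).
    return torch.sigmoid(k * (prob_map - thresh_map))
```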
The ICDAR 2015 dataset contains 1000 images for training and 500 images for testing. It consists almost entirely of English text and includes text at various scales and orientations, as well as some blurred images. Training and testing on this dataset enables a text detection algorithm to better adapt to a wide range of application scenarios.
The Total-Text dataset comprises 1255 images for training and 300 images for testing, drawn from diverse scenes and environments, including streetscapes, billboards, signs, and newspapers. The text exhibits diverse forms, and most images feature curved text. Training and testing on this dataset significantly improves a text detection algorithm's perception of text of various shapes.
The MSRA-TD500 dataset contains both Chinese and English text images, with 300 images for training and 200 images for testing, covering both indoor and outdoor scenes. Indoor images mainly include signs and door numbers, while outdoor images involve complex backgrounds such as guideboards, billboards, and warning signs. Training and testing on this dataset effectively enhances the generalization ability of a text detection algorithm across languages, enabling it to accurately detect text in various environments.
4.2. Experimental Configuration
In this study, we selected Python 3.7 as the programming language and used the deep learning framework PyTorch 1.5 to conduct the experiments. The entire experimental process was accelerated with an NVIDIA GeForce RTX 3090 graphics card. Initially, we trained the network on the SynthText synthetic dataset [41] for 100k iterations and then fine-tuned the pre-trained model for 1200 epochs on the real datasets. We strictly adhered to the officially provided datasets without any modification, ensuring the validity of our experiments. In our experimental settings, we set the initial learning rate $lr_{0}$ to 0.007, the decay power $p$ to 0.9, the weight decay to 0.0001, the number of cascaded FPEMs to 2, the momentum to 0.9, and the batch size to 16. Adam [42] was adopted as our training optimizer. The learning rate was continuously reduced using the following polynomial ("poly") decay formula:

$$ lr = lr_{0} \times \left( 1 - \frac{iter}{max\_iter} \right)^{p} $$

where $iter$ denotes the current iteration and $max\_iter$ the total number of training iterations.
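A minimal sketch of this schedule in Python (the function name and the example iteration counts are illustrative):

```python
def poly_learning_rate(lr0: float, iteration: int, max_iter: int,
                       power: float = 0.9) -> float:
    # lr = lr0 * (1 - iteration / max_iter) ** power
    return lr0 * (1.0 - iteration / max_iter) ** power

# With the settings above (lr0 = 0.007, power = 0.9),
# e.g., halfway through a 100k-iteration pre-training run:
current_lr = poly_learning_rate(0.007, iteration=50_000, max_iter=100_000)
```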
During the experiments, we employed three data augmentation techniques to expand the datasets: random rotation, random splitting, and random flipping. Because excessively small text regions are difficult to detect, we ignored them during label generation and excluded them from training. Since the scale of the test images significantly affects detection performance, we maintained the aspect ratio of the test images during inference and resized the inputs by setting an appropriate height for each dataset, as sketched below.
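For instance, the aspect-ratio-preserving resize at inference time can be sketched as follows; the helper name and the rounding of the width to a multiple of the network stride are our assumptions:

```python
import cv2

def resize_for_inference(image, target_height: int, stride: int = 32):
    # Scale so the height matches target_height while preserving the
    # aspect ratio; round the width to a multiple of the network stride
    # so downsampled feature maps align (stride value is an assumption).
    h, w = image.shape[:2]
    scale = target_height / h
    new_w = max(stride, int(round(w * scale / stride)) * stride)
    return cv2.resize(image, (new_w, target_height))
```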
4.3. Evaluation Index
In the experiments, a prediction was counted as correct if the Intersection over Union (IOU) between the predicted text box and the corresponding label box exceeded 0.5. The IOU is calculated as follows:

$$ IOU = \frac{\left| A_{p} \cap A_{g} \right|}{\left| A_{p} \cup A_{g} \right|} $$

In the above formula, $A_{p}$ represents the area of the predicted text box, and $A_{g}$ represents the area of the label text box.
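For axis-aligned boxes, the computation can be sketched as below; note that the official benchmark protocols match general polygons, so this is a simplified illustration:

```python
def box_iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2); returns intersection / union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction is counted as correct when box_iou(pred, gt) > 0.5.
```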
We use three main performance metrics, precision, recall, and F-score, to evaluate the detection performance of the model. The number of true text boxes correctly predicted as text boxes is recorded as $TP$, the number of true text boxes incorrectly predicted as background is recorded as $FN$, and the number of background regions incorrectly predicted as text boxes is recorded as $FP$. Precision, recall, and F-score are calculated as follows:

$$ Precision = \frac{TP}{TP + FP} $$

$$ Recall = \frac{TP}{TP + FN} $$

$$ F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} $$
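A minimal sketch of these three metrics, given matched counts under the IOU > 0.5 criterion:

```python
def detection_metrics(tp: int, fp: int, fn: int):
    # Precision: fraction of predicted boxes that match a ground-truth box.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of ground-truth boxes that are matched.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score
```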
4.4. Ablation Experiment
To comprehensively evaluate the effectiveness of each proposed module and its impact on overall performance, we conducted ablation experiments on the improved model. The core idea of an ablation experiment is to gradually add or modify specific parts of the model and observe the effect of these changes on performance, thereby gaining a deeper understanding of the model's internal working mechanisms. We added the DSAF, the PFFM, and the SAM to the original DBNet algorithm and compared performance before and after each addition to evaluate the contribution of these modules.
In this paper, we conducted a series of ablation experiments on the ICDAR 2015, Total-Text, and MSRA-TD500 datasets. Performing ablation experiments on different datasets helps better validate the generalization ability of the model, facilitating a comprehensive assessment of its performance, robustness, and application scope.
For the backbone in the ablation experiments, we chose ResNet-18. Compared to ResNet-50, ResNet-18 has fewer parameters and requires fewer computational resources for training and evaluation, resulting in faster training and inference. Additionally, ResNet-18 has lower model complexity, making it easier to observe the impact of each module on model performance and facilitating analysis of the ablation results. The results of the ablation experiments on the three datasets are presented in Table 1, Table 2 and Table 3. Since the SAM is applied to the concatenated feature map, we combined it with the PFFM in the fourth row of each dataset's ablation table to analyze the SAM's impact on the overall performance of the network.
As can be seen from Table 1, on the ICDAR 2015 dataset, adding the DSAF module improves precision, recall, and F-score over the original DBNet model by 0.4%, 0.52%, and 0.47%, respectively. Employing the PFFM yields precision, recall, and F-score 1.13%, 0.51%, and 0.79% higher than the DBNet model using FPN. Adding the SAM after the PFFM further increases precision, recall, and F-score to 87.53%, 78.86%, and 82.97%, respectively. By incorporating all of these modules, our method achieves a precision of 87.65%, a recall of 79.45%, and an F-score of 83.34% on this dataset, improvements of 1.53%, 1.64%, and 1.56% over the original DBNet model.
As evident from Table 2, on the Total-Text dataset, introducing the DSAF module increases precision, recall, and F-score by 1.44%, 2.84%, and 2.25%, respectively, compared with the original DBNet network. Utilizing the PFFM yields precision, recall, and F-score 0.62%, 3.21%, and 2.08% higher than the original DBNet model. Incorporating the SAM after the PFFM further raises precision, recall, and F-score to 87.35%, 79.28%, and 83.11%, respectively. By integrating all of these modules, our method achieves a precision of 87.41%, a recall of 79.32%, and an F-score of 83.16% on this dataset, improvements of 0.83%, 4.0%, and 2.61% over the original DBNet model.
As shown in Table 3, on the MSRA-TD500 dataset, introducing the DSAF module increases precision, recall, and F-score by 0.77%, 1.57%, and 1.2%, respectively, compared with the original DBNet network. Utilizing the PFFM yields precision, recall, and F-score 1.33%, 1.15%, and 1.23% higher than the DBNet model with FPN. Incorporating the SAM after the PFFM further raises precision, recall, and F-score to 87.26%, 81.58%, and 84.32%, respectively. By integrating all of these modules, our method achieves a precision of 87.53%, a recall of 82.52%, and an F-score of 84.95% on this dataset, improvements of 1.78%, 2.26%, and 2.04% over the original DBNet model.
As the three tables above show, embedding the DSAF module into the ResNet feature extraction network combines local and global attention, enhancing the extraction of text feature information and the capture of contextual information. Compared with the original feature fusion, the PFFM has stronger feature fusion capabilities and significantly improves model robustness. Additionally, the SAM applied to the concatenated feature map further enhances detection performance on diverse texts.
4.5. Experimental Results
In this paper, we plotted the training loss curves on the three datasets. For the backbone, we chose ResNet-50. Compared to ResNet-18, ResNet-50 has more layers and more parameters, giving it stronger feature learning capability and better generalization. Accordingly, the loss curves of ResNet-50 are more stable than those of ResNet-18.
As can be seen from Figure 9, the convergence speed and stability differ across datasets owing to differences in data and sample diversity. However, on all three datasets, the loss curve of our proposed model decreases rapidly during the initial training stage and then gradually stabilizes, indicating that the model progressively learns the regularities in the data and converges toward an optimal solution; this indirectly indicates that our proposed model has good learning ability. In addition, the loss curves on these three datasets reach stable convergence during training without large fluctuations in the later stages, which also indicates that our proposed model has good stability.
We compared our method with other methods on the multi-directional text dataset ICDAR 2015, the curved text dataset Total-Text, and the multilingual text dataset MSRA-TD500. The experimental results are presented in
Table 4,
Table 5 and
Table 6.
The comparison of our model with other models on the multi-directional ICDAR 2015 dataset is shown in Table 4. From Table 4, we can see that the precision, recall, and F-score of Ours-ResNet-18 are 87.7%, 79.5%, and 83.4%, respectively, which are 1.6%, 1.7%, and 1.7% higher than the original DB-ResNet-18. The precision, recall, and F-score of Ours-ResNet-50 are 87.9%, 83.4%, and 85.6%, respectively, 0.5%, 2.0%, and 1.3% higher than the original DB-ResNet-50. The precision, recall, and F-score of Ours-ResNet-50 (1152) are 89.6%, 84.2%, and 86.8%, respectively, 1.1%, 0.4%, and 0.7% higher than the original DB-ResNet-50 (1152). These results show that our model outperforms the original model in multi-directional text detection and achieves good results on all three metrics compared with previous classical methods.
The comparison of our proposed model with other models on the curved-text Total-Text dataset is shown in Table 5. The precision, recall, and F-score of Ours-ResNet-18 are 87.4%, 79.3%, and 83.2%, respectively, which are 0.7%, 1.8%, and 1.7% higher than the original DB-ResNet-18. The precision, recall, and F-score of Ours-ResNet-50 are 88.5%, 82.4%, and 85.3%, respectively, 2.3%, 2.2%, and 2.1% higher than the original DB-ResNet-50. These results show that our model detects curved text more accurately than the original model and achieves good results on all three metrics compared with previous classical methods.
The comparison of our model with other models on the multilingual MSRA-TD500 dataset is shown in Table 6. In Table 6, the precision, recall, and F-score of Ours-ResNet-18 are 87.5%, 82.5%, and 84.9%, respectively, which are 1.8%, 2.3%, and 2.0% higher than the original DB-ResNet-18. The precision, recall, and F-score of Ours-ResNet-50 are 89.3%, 85.2%, and 87.2%, respectively, 0.5%, 2.8%, and 1.7% higher than the original DB-ResNet-50. These results show that our model achieves better detection performance on multilingual text and good results on all three metrics compared with previous classical methods.
Figure 10 below illustrates the detection results of the original DBNet model and our improved model on the ICDAR 2015, MSRA-TD500, and Total-Text datasets. As the comparisons show, the original DBNet model misses detections on all three datasets, whereas our proposed model avoids such missed detections more effectively. The three images were randomly selected from the three datasets, further demonstrating the generalization ability and robustness of our model. The results show that the DSAF module, the PFFM, and the SAM effectively enhance the detection of text features.
The above results show that our proposed model achieves superior detection performance on the multi-directional text dataset ICDAR 2015, the curved text dataset Total-Text, and the multilingual text dataset MSRA-TD500. These datasets contain text from most indoor and outdoor scenes, indicating that our model performs excellently in natural scene text detection. The experiments demonstrate that the attention fusion (DSAF) module and the cascade feature fusion (PFFM) module are very important for text feature extraction and feature fusion, significantly improving the detection accuracy of the original algorithm. The added SAM also improves detection performance to some extent. In summary, the model outperforms existing methods on scene text detection tasks and can effectively and accurately detect text in various scenes.