The ICDAR 2015 dataset contains 1000 training images and 500 test images. The images were captured incidentally, without deliberate framing or viewpoint adjustment, so the viewpoint is essentially random and many images exhibit tilt and blur; text can therefore appear in any direction and at any position. No post-processing was applied to improve image quality, which further increases the difficulty of detection.
The Total-Text dataset is a public benchmark for curved text detection. It contains curved English text on commercial signage in real scenes and comprises 1555 images: 1255 for training and 300 for testing.
The MSRA-TD500 dataset is a multilingual, multi-category dataset containing both Chinese and English text. It comprises 500 images, 300 for training and 200 for testing, captured with cameras mainly indoors (offices and shopping malls) and outdoors (streets). The indoor images include signs, doorplates, and warning plates; the outdoor images include guide boards and billboards against complex backgrounds.
4.2. Experiments and Discussion
To validate each module proposed in this paper, we conducted detailed ablation experiments on the multi-oriented text dataset ICDAR2015, the curved text dataset Total-Text, and the multilingual text dataset MSRA-TD500. Three performance metrics, precision (P), recall (R), and the F-measure (F), are used to evaluate the detection performance of the model and to verify the effectiveness of the proposed residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) modules. All experiments were trained in the same environment, and a “√” indicates that the corresponding module was used. The results are listed in Table 2, Table 3 and Table 4.
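For completeness, the three metrics follow their standard definitions over true positives (TP), false positives (FP), and false negatives (FN), with F the harmonic mean of P and R:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
```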
As Table 2 shows, on the ICDAR2015 dataset, adding the RCB module alone raises recall and F value over the original DB model by about 4.68% and 1.56%, respectively, and adding the TB-AFF module alone raises them by about 4.82% and 2.03%. With both modules, the method achieves 79.48% recall, 87.26% precision, and an F value of 83.19% on natural scene text images, preserving the integrity of the text information during detection. Compared with the original model at the same precision, recall and F value increase by 5.68% and 2.39%, respectively. The network combining both modules therefore outperforms the network using either the RCB or the TB-AFF module alone.
As Table 3 shows, on the Total-Text dataset, relative to our local reproduction of the original DB model, introducing the RCB module improves recall and F value by about 4.56% and 2.12%, respectively, and introducing the TB-AFF module improves them by about 5.33% and 2.10%. With both modules, the method achieves 78.95% recall, 87.37% precision, and an F value of 82.95%. Compared with the original model at the same precision, recall and F value improve by 5.15% and 2.15%, respectively. Again, combining the two modules outperforms using either one alone.
As Table 4 shows, on the MSRA-TD500 dataset, relative to our local reproduction of the original DB model, introducing the RCB module increases recall and F value by about 7.82% and 3.35%, respectively, and introducing the TB-AFF module increases them by about 6.78% and 2.95%. With both modules, the method achieves 83.33% recall, 88.02% precision, and an F value of 85.61%. Compared with the original model at the same precision, recall and F value increase by 9.53% and 4.81%, respectively. Once more, the combined network outperforms either module used alone.
These observations can be summarized as follows. In the residual correction branch (RCB), we introduce an average-pooling downsampling operation that relates all positions within each pooling window. The experimental results show that an 18-layer backbone equipped with the proposed RCB substantially improves the baseline. This indicates that a network with a residual correction branch generates richer and more discriminative feature representations than plain convolution, helping to locate more complete text instances and to stay confined to semantic regions even when they are small. To overcome the semantic and scale inconsistency between input features, our two-branch attention feature fusion (TB-AFF) module combines local feature information with global feature information and thereby captures context more effectively. The experiments show that the multi-scale attention fusion network (MSAFN) equipped with the TB-AFF module improves a strong baseline at a small parameter budget. This suggests that feature fusion in deep networks deserves explicit attention, that a suitable attention mechanism for fusion can yield better results, and that improving the quality of feature fusion is more productive than blindly increasing network depth.
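As a concrete illustration of the fusion idea, the following is a minimal PyTorch sketch of a two-branch attention fusion block. It is not the released TB-AFF implementation; it assumes the common attentional-feature-fusion pattern in which a local (pointwise-convolution) branch and a global (pooled) branch jointly produce a gate that weighs the two inputs, and the class name and layer sizes are illustrative.

```python
# Minimal sketch of a two-branch attention fusion block (illustrative, not the
# paper's released TB-AFF code). A local branch models per-position channel
# attention; a global branch supplies image-level context; their sum gates the
# two input feature maps.
import torch
import torch.nn as nn

class TwoBranchAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(channels // reduction, 1)
        # Local branch: per-position channel attention via 1x1 convolutions.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global branch: image-level context via global average pooling.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        s = x + y  # initial integration of the two feature maps
        w = torch.sigmoid(self.local_branch(s) + self.global_branch(s))
        return w * x + (1.0 - w) * y  # attention-weighted fusion

# Usage: fuse an upsampled deep feature with a shallow lateral feature.
fuse = TwoBranchAttentionFusion(channels=256)
deep, shallow = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
out = fuse(deep, shallow)  # shape: (1, 256, 40, 40)
```

Because the gate w depends on both a per-pixel and a pooled view of the summed inputs, such a block can reconcile features whose semantics and scales disagree, which is the role the TB-AFF module plays in the fusion stage.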
We also ran experiments on the downsampling factor r; the results are shown in Table 5. We find that the smaller r is, the higher the FLOPs. For r = 3, 4, 5, the network FLOPs are similar, and at r = 4 the F value reaches its relatively best result. We further verified this on Ours-ResNet-50 and found that r = 4 gives a good balance between FLOPs and F value. The experiments in Table 5 were carried out on the improved ResNet-18, and FLOPs were computed with a 3 × 640 × 640 input.
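The trade-off in Table 5 can be reproduced in outline with a FLOP counter. The snippet below is a sketch, not the evaluation code of this paper: it isolates a pooled context branch (average pooling by r followed by a 3 × 3 convolution, in the spirit of the RCB) and assumes the fvcore library for counting; the channel width is illustrative. Since pooling by r shrinks both spatial dimensions by a factor of r, the convolution's cost falls roughly as 1/r², which explains why r = 3, 4, 5 yield similar overall FLOPs.

```python
# Sketch: measuring the FLOPs-versus-r trade-off (assumes fvcore is installed).
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis, parameter_count

def pooled_branch(channels: int, r: int) -> nn.Module:
    # Average-pool by r, then convolve on the reduced map; the conv cost
    # scales roughly as 1/r^2 with the downsampling factor.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=r, stride=r),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )

for r in (2, 3, 4, 5):
    branch = pooled_branch(64, r).eval()
    x = torch.randn(1, 64, 640, 640)  # 640 x 640 input, matching Table 5's protocol
    flops = FlopCountAnalysis(branch, x).total()
    params = parameter_count(branch)[""]
    print(f"r={r}: {flops / 1e9:.3f} GFLOPs in the branch, {params} params")
```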
In natural scene text detection, text mostly appears as characters or text lines. The residual correction branch (RCB) enlarges the network's receptive field by downsampling the feature map, thereby modeling the context around each spatial location and allowing the network to detect the text in the image more accurately and completely. The ablation results above confirm this, showing that the proposed RCB is effective.
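The following PyTorch sketch shows one plausible realization of the idea just described: average-pool the feature map by a factor r, convolve at the reduced resolution so that each kernel covers an r-times larger neighborhood, upsample, and add the result back to the identity path as a correction. The class name and layer choices are illustrative assumptions, not the released code.

```python
# Illustrative sketch of a residual correction branch built around
# average-pooling downsampling (assumed design, not the paper's exact layers).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualCorrectionBranch(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        # Pooling by r lets the 3x3 conv below see an r-times larger
        # neighborhood of the original feature map.
        self.pool = nn.AvgPool2d(kernel_size=r, stride=r)
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        ctx = self.conv(self.pool(x))  # context modeled at 1/r resolution
        ctx = F.interpolate(ctx, size=(h, w), mode="bilinear", align_corners=False)
        return x + ctx  # residual correction of the identity path

rcb = ResidualCorrectionBranch(channels=64, r=4)
feat = torch.randn(1, 64, 160, 160)
out = rcb(feat)  # same shape as feat, with an enlarged effective receptive field
```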
Figure 6 shows the visualization results of the baseline and of our method. In each unit of the figure, the second column is the probability map, the third column is the threshold map, and the fourth column is the approximate binary map. The results indicate that the residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) modules play an important role in text feature extraction during training, effectively strengthening the model's attention to text features, making good use of the extracted features, and improving the detection accuracy of scene text.
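For reference, the three maps are related through DBNet's differentiable binarization, which the DB-based model uses: the approximate binary map is a steep sigmoid of the difference between the probability map P and the learned threshold map T, with amplification factor k = 50 in the original DB formulation.

```python
# Differentiable binarization as defined in the original DB paper:
# B = sigmoid(k * (P - T)), with k = 50.
import torch

def differentiable_binarization(P: torch.Tensor, T: torch.Tensor, k: float = 50.0) -> torch.Tensor:
    return torch.sigmoid(k * (P - T))
```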
We compare the proposed method with other state-of-the-art methods on the multi-oriented text dataset ICDAR2015, the curved text dataset Total-Text, and the multilingual text dataset MSRA-TD500. The experimental results are shown in Table 6, Table 7 and Table 8.
On the Total-Text curved text dataset, our algorithm is compared with other algorithms in Table 6. Our model outperforms segmentation-based algorithms such as TextSnake, PSENet, and TextField on all three metrics. Ours-ResNet-18 (800 × 800) achieves 78.95% recall, 87.37% precision, and an F value of 82.95%, surpassing the original DB-ResNet-18 (800 × 800) by about 0.67%, 3.55%, and 2.25%, respectively. Ours-ResNet-50 (800 × 800) achieves 82.19% recall, 88.06% precision, and an F value of 85.03%, with recall and F value about 3.76% and 3.79% higher than the original DB-ResNet-50 (800 × 800), respectively. These results show that the model can adapt to curved text of arbitrary shape and that, in most cases, the proposed method clearly outperforms the other methods.
We also measured the parameter counts and computational complexity of the compared models; the results are likewise listed in Table 6. Ours-ResNet-18 (800 × 800) adds only a small amount of parameters and complexity over the baseline while achieving a performance improvement of 2.25%. For Ours-ResNet-50 (800 × 800), although the parameters and complexity increase, the performance improves by about 4%. Compared with PSE-1s, our model requires fewer parameters and lower complexity while achieving better performance.
On the multi-oriented text dataset ICDAR2015, the comparison between our algorithm and other algorithms is shown in Table 7. Ours-ResNet-18 (1280 × 736) achieves 79.48% recall, 87.26% precision, and an F value of 83.19%, exceeding the original DB-ResNet-18 (1280 × 736) in recall and F value by about 5.68% and 2.39%, respectively. Ours-ResNet-50 (1280 × 736) achieves 79.83% recall, 87.82% precision, and an F value of 83.63%, exceeding the original DB-ResNet-50 (1280 × 736) by about 2.03% and 0.73%, respectively. Ours-ResNet-50 (2048 × 1152) achieves 84.26% recall, 88.21% precision, and an F value of 86.19%, exceeding the original DB-ResNet-50 (2048 × 1152) by about 4.96% and 1.99%, respectively. Compared with the original model, the new network thus raises recall and achieves better overall detection performance.
In addition, our model outperforms regression-based algorithms such as the RRD (rotation-sensitive regression detector) on these metrics. Ours-ResNet-18 (1280 × 736) outperforms the EAST algorithm by about 3.66%, 5.98%, and 4.99% in precision, recall, and F value, respectively. The Corner algorithm can predict two adjacent texts as one text instance, resulting in inaccurate detection [34]. The SPN (Short Path Network) algorithm has poor robustness on curved text instances. When the candidate region predicted in the first stage contains only part of a text instance, the SRPN (Scale-based Region Proposal Network) algorithm cannot correctly predict the boundary of the whole text instance in the second stage [35]. Compared with the EAST, Corner, SPN, and SRPN algorithms, our model makes full use of semantic information to improve the accuracy of text-pixel prediction and classification, reduces the interference of background pixels on small-scale text, and exploits rich feature information to improve the localization of text instances.
On the long-text dataset MSRA-TD500, the comparison between our algorithm and other algorithms is shown in Table 8. Ours-ResNet-18 (512 × 512) achieves 77.15% recall, 90.16% precision, and an F value of 83.15%, exceeding the original DB-ResNet-18 (512 × 512) in recall and F value by about 4.46% and 3.95%, respectively. Ours-ResNet-18 (736 × 736) achieves 83.33% recall, 88.02% precision, and an F value of 85.61%, exceeding the original DB-ResNet-18 (736 × 736) by about 7.63% and 2.81%, respectively. Ours-ResNet-50 (736 × 736) achieves 84.71% recall, 89.80% precision, and an F value of 87.18%, exceeding the original DB-ResNet-50 (736 × 736) by about 5.51% and 2.28%, respectively. In addition, Ours-ResNet-18 (736 × 736) outperforms the segmentation-based TextSnake and CRAFT algorithms on all three metrics.
Figure 7 shows the visualization results of our method and the original DBNet on different types of text instances. Notably, the images were randomly selected from the three datasets, which better demonstrates the robustness of our model.
In Figure 7a, the baseline misses part of the text (i.e., “CA”), whereas our method detects it. In Figure 7b,c, the baseline falsely detects non-text regions as text; our method avoids these false detections. In Figure 7d, the baseline misses part of the text (i.e., “1”), whereas our method detects it. In Figure 7e, the baseline misses the English text in the middle, while our method detects it accurately. In Figure 7f, the baseline splits “COFFEE” into two text regions even though it carries a single semantic unit and should be detected as one text region; our method detects it as a whole.
The above results show that the proposed algorithm improves detection on the multi-oriented text dataset ICDAR2015, the curved text dataset Total-Text, and the multilingual text dataset MSRA-TD500, achieving strong precision, recall, and F value on natural scene text detection benchmarks while remaining efficient. The experiments show that the residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) modules are important for text feature extraction and for enhancing location information, and that they improve the detection accuracy of the original algorithm without sacrificing detection efficiency. In challenging scenes such as uneven illumination, low resolution, and complex backgrounds, the model also copes effectively with drastic scale changes of text and detects scene text accurately, which we attribute to the proposed network design.