Figure 1.
Overall flow of prediction. Images are passed to both MobileNetV2 and the balanced decoder, and post-processing generates the predicted results.
Figure 2.
Structure of the proposed CAST. The five stages on the left represent MobileNetV2, while the center box is the balanced decoder. IRB denotes the inverted residual block, and IRB* is a special case without any expansion. Convolutions of kernel size 3 × 3 are denoted by Conv. Here c indicates the number of channels, and OS represents the output stride.
Figure 3.
Structure of an IRB, which consists of three steps: expansion, depth-wise convolution, and projection. Here t represents the expansion factor, and c indicates the number of channels. W and H are the width and height of the feature map, respectively.
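As a rough illustration of the three steps named in this caption, the sketch below traces the feature-map shape through an IRB and counts its convolution weights. It assumes the standard MobileNetV2 layout (1 × 1 expansion, 3 × 3 depth-wise convolution, 1 × 1 linear projection), ignores biases and batch-normalization parameters, and uses a hypothetical function name and argument names that do not come from the paper.

```python
def irb_shapes_and_weights(h, w, c_in, c_out, t, stride=1, kernel=3):
    """Trace feature-map shapes through an inverted residual block (IRB).

    Assumes the standard MobileNetV2 layout; biases and batch-norm
    parameters are ignored. Returns (list of (step, (H, W, C)), weight count).
    """
    hidden = c_in * t                      # channels after expansion
    h_out = (h + stride - 1) // stride     # "same" padding, ceil division
    w_out = (w + stride - 1) // stride

    steps = [("input", (h, w, c_in))]
    weights = 0

    if t != 1:
        # 1) Expansion: 1x1 convolution, c_in -> t * c_in channels.
        steps.append(("expansion", (h, w, hidden)))
        weights += c_in * hidden

    # 2) Depth-wise convolution: one kernel x kernel filter per channel.
    steps.append(("depthwise", (h_out, w_out, hidden)))
    weights += kernel * kernel * hidden

    # 3) Projection: linear 1x1 convolution, t * c_in -> c_out channels.
    steps.append(("projection", (h_out, w_out, c_out)))
    weights += hidden * c_out

    return steps, weights


# Hypothetical block: 64x64 feature map, 32 channels in and out, t = 6.
steps, weights = irb_shapes_and_weights(h=64, w=64, c_in=32, c_out=32, t=6)
for name, shape in steps:
    print(f"{name:<10} {shape}")
print("weights:", weights)
```

With t = 1 the expansion step is skipped in this sketch, which is one plausible reading of a block "without any expansion".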
Figure 4.
Structures of four comparable decoders, where OS is the output stride, c is the number of channels, and convolutions of kernel size 3 × 3 are denoted by Conv. (a) Decoder of EAST, (b) decoder of CRAFT, (c) decoder using IRBs for all stages, and (d) a different decoder using IRBs for all stages. To give decoder (d) a similar complexity, its expansion factor is set to 3.
Figure 5.
The process of ground truth generation. (a) Score map generation by shrinking the word boxes. (b) RBOX generation based on the distance from the given pixel to the four edges.
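The two steps in this caption can be sketched for the simplest case of an axis-aligned word box. This is a minimal illustration under stated assumptions, not the paper's procedure: the shrink ratio of 0.3, the rule of shrinking by a fraction of the shorter side, and all function names are hypothetical, and the rotation angle that an RBOX also carries is omitted because it is zero for an axis-aligned box. The distances are assumed to be measured to the edges of the original, unshrunk box.

```python
def shrink_box(box, ratio):
    """Shrink an axis-aligned box (x_min, y_min, x_max, y_max) toward its center.

    Each side moves inward by `ratio` times the shorter box side.
    """
    x_min, y_min, x_max, y_max = box
    margin = ratio * min(x_max - x_min, y_max - y_min)
    return (x_min + margin, y_min + margin, x_max - margin, y_max - margin)


def inside(pixel, box):
    """True if the pixel (x, y) lies within the box, borders included."""
    x, y = pixel
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def edge_distances(pixel, box):
    """Distances from a pixel (x, y) to the top, right, bottom, and left edges."""
    x, y = pixel
    x_min, y_min, x_max, y_max = box
    return (y - y_min, x_max - x, y_max - y, x - x_min)


word_box = (10.0, 20.0, 110.0, 60.0)       # hypothetical word box
score_region = shrink_box(word_box, 0.3)   # 0.3 is an assumed ratio
pixel = (40.0, 35.0)

# (a) The score map is positive only inside the shrunk box.
is_positive = inside(pixel, score_region)

# (b) For a positive pixel, the geometry target is its distance to each edge.
if is_positive:
    distances = edge_distances(pixel, word_box)
    print(distances)
```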
Figure 6.
Prediction results with the ICDAR2015 dataset.
Figure 7.
False positive results with the ICDAR2015 dataset, where the red rectangles indicate the failed cases.
Table 1.
Results with the ICDAR2013 dataset, where R, P, and F represent the recall, precision, and F1 score, respectively. An asterisk (*) indicates that the corresponding result was obtained in our own experiments. M and P denote MobileNetV2 and PVANet, respectively. E, C, and B are the EAST decoder, the CRAFT decoder, and our proposed balanced decoder, respectively. The inference time is the time needed to obtain a result for one image, where italic values indicate that the values are borrowed from other studies.
Method | R | P | F | Size | Time | FLOPS | Param |
---|---|---|---|---|---|---|---|
P + E (EAST) [12] * | 66.57 | 90.27 | 76.63 | 512 (short) | 63.3 ms | 5.2 G | 3.2 M |
PixelLink [3] | 83.60 | 86.40 | 84.50 | 512 × 512 | 207.69 ms | 175.5 G | 20.5 M |
Seglink [10] | 83.00 | 87.70 | 85.30 | 512 × 512 | 50 ms | - | - |
Mask TextSpotter [20] | 88.27 | 95.01 | 91.52 | 1000 (short) | 217.4 ms | - | - |
SPCNET [21] | 90.59 | 93.77 | 92.16 | 848 (short) | - | 470.1 G | 35.5 M |
CRAFT [2] | 92.40 | 97.67 | 94.96 | 960 (long) | 160.73 ms | 252.3 G | 20.8 M |
M + E | 69.57 | 91.38 | 78.99 | 512 (long) | 51.9 ms | 3.6 G | 2.6 M |
M + C | 69.97 | 91.10 | 79.15 | 512 (long) | 56.6 ms | 5.1 G | 3.0 M |
M + B (CAST) | 69.69 | 94.78 | 80.32 | 512 (long) | 53.3 ms | 4.7 G | 2.8 M |
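The F column follows from the R and P columns as their harmonic mean. The short check below recomputes it for the three MobileNetV2 rows of Table 1. The function name is illustrative, and because R and P are themselves rounded to two decimals, the recomputed values should be expected to agree with the table only to within about 0.01.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (same units as the inputs)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (R, P) pairs copied from Table 1.
rows = {
    "M + E": (69.57, 91.38),
    "M + C": (69.97, 91.10),
    "M + B (CAST)": (69.69, 94.78),
}

scores = {name: f1_score(r, p) for name, (r, p) in rows.items()}
for name, score in scores.items():
    print(f"{name}: F = {score:.2f}")
```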
Table 2.
Results with the ICDAR2015 dataset, where R, P, and F represent the recall, precision, and F1 score, respectively. M and P denote MobileNetV2 and PVANet, respectively. E, C, and B are the EAST decoder, the CRAFT decoder, and our proposed balanced decoder, respectively. The inference time is the time needed to obtain a result for one image, where italic values indicate that the values are borrowed from other studies.
Method | R | P | F | Size | Time | FLOPS | Param |
---|---|---|---|---|---|---|---|
Seglink [10] | 76.80 | 73.10 | 75.00 | 1280 × 768 | - | - | - |
P + E (EAST) [12] | 71.35 | 80.63 | 75.71 | 1280 (long) | 52.3 ms | 23.2 G | 3.2 M |
Mask TextSpotter [20] | 81.20 | 85.80 | 83.40 | 1000 (short) | 217.39 ms | - | - |
Ruan et al. [14] | 80.55 | 86.59 | 83.46 | 1280 × 704 | 90.09 ms | 166.0 G | 24.2 M |
PixelLink [3] | 82.00 | 85.50 | 83.70 | 1280 × 768 | 275.66 ms | 650.2 G | 20.5 M |
PSENet [19] | 86.90 | 84.50 | 85.70 | 2240 (long) | 625 ms | - | - |
CRAFT [2] | 84.30 | 89.80 | 86.90 | 2240 (long) | 430.01 ms | 1023.9 G | 20.8 M |
SPCNET [21] | 85.80 | 88.70 | 87.20 | 848 (short) | - | 470.1 G | 35.5 M |
CharNet H-88 [34] | 89.99 | 91.98 | 90.97 | 2280 (long) | 961.36 ms | 2402.9 G | 89.1 M |
M + E | 74.65 | 83.20 | 78.66 | 1280 (long) | 46.3 ms | 16.0 G | 2.6 M |
M + C | 76.36 | 84.99 | 80.44 | 1280 (long) | 49.9 ms | 22.5 G | 3.0 M |
M + B (CAST) | 76.79 | 85.84 | 81.06 | 1280 (long) | 53.2 ms | 20.8 G | 2.8 M |
Table 3.
Results with the ICDAR2017 MLT dataset, where R, P, and F represent the recall, precision, and F1 score, respectively. An asterisk (*) indicates that the corresponding result was obtained in our own experiments. M and P denote MobileNetV2 and PVANet, respectively. E, C, and B are the EAST decoder, the CRAFT decoder, and our proposed balanced decoder, respectively. The inference time is the time needed to obtain a result for one image, where italic values indicate that the values are borrowed from other studies.
Method | R | P | F | Size | Time | FLOPS | Param |
---|---|---|---|---|---|---|---|
P + E (EAST) [12] * | 51.83 | 66.18 | 58.13 | 2400 (long) | 277.7 ms | 113.7 G | 3.2 M |
SPCNET [21] | 66.90 | 73.40 | 70.00 | 848 (short) | - | 470.1 G | 35.5 M |
PSENet (ResNet152) [19] | 75.35 | 69.18 | 72.13 | original × 2 | - | - | - |
CRAFT [2] | 68.20 | 80.60 | 73.90 | 2560 (long) | 1178.80 ms | 1563.62 G | 20.8 M |
CharNet H-88 [34] | 70.97 | 81.27 | 75.77 | 2280 (long) | 1712.19 ms | 3123.82 G | 89.1 M |
M + E | 57.31 | 67.29 | 61.90 | 2400 (long) | 264.0 ms | 78.7 G | 2.6 M |
M + C | 57.45 | 70.80 | 63.43 | 2400 (long) | 273.4 ms | 110.4 G | 3.0 M |
M + B (CAST) | 58.38 | 70.40 | 63.83 | 2400 (long) | 279.2 ms | 102.1 G | 2.8 M |
Table 4.
Comparison of CAST with the most accurate model for each dataset, where F indicates the F1 score, and K× means that the corresponding model's value is K times that of CAST. A larger ratio is better for F and worse for the inference time, FLOPS, and number of parameters.
Method | Dataset | F | Time | FLOPS | Param |
---|---|---|---|---|---|
CRAFT [2] | ICDAR13 | 1.18× | 3.01× | 53.68× | 7.43× |
CharNet [34] | ICDAR15 | 1.12× | 18.07× | 115.52× | 31.86× |
CharNet [34] | ICDAR17 | 1.19× | 6.13× | 30.59× | 31.86× |
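The ratios in Table 4 appear to be each model's value divided by the corresponding CAST value. The sketch below recomputes the ICDAR13 row from the CRAFT and M + B (CAST) rows of Table 1. The dictionary keys and the helper name are illustrative, and the last printed digit can differ from the table depending on whether the ratio was rounded or truncated.

```python
# Values copied from the CRAFT and M + B (CAST) rows of Table 1.
cast = {"F": 80.32, "time_ms": 53.3, "gflops": 4.7, "params_m": 2.8}
craft = {"F": 94.96, "time_ms": 160.73, "gflops": 252.3, "params_m": 20.8}


def relative(other, baseline):
    """Return other / baseline for every metric in the baseline."""
    return {key: other[key] / baseline[key] for key in baseline}


ratios = relative(craft, cast)
for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

Read this way, CRAFT gains roughly 18% in F1 score on ICDAR13 at about three times the inference time and more than fifty times the FLOPS.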
Table 5.
Comparison of several decoders with the ICDAR2015 dataset, where R, P, and F represent the recall, precision, and F1 score, respectively. All results are obtained using the same backbone, MobileNetV2.
Method | R | P | F | FLOPS | Param |
---|---|---|---|---|---|
EAST decoder | 74.65 | 83.20 | 78.66 | 16.0 G | 2.6 M |
IRB decoder 1 | 75.20 | 84.75 | 79.69 | 21.1 G | 2.6 M |
IRB decoder 2 | 75.64 | 84.28 | 79.73 | 22.3 G | 2.7 M |
CRAFT decoder | 76.36 | 84.99 | 80.44 | 22.5 G | 3.0 M |
Balanced decoder | 76.79 | 85.84 | 81.06 | 20.8 G | 2.8 M |
Table 6.
Inference time comparison using a CPU with the ICDAR2015 dataset, where F represents the F1 score. Here ms and s represent milliseconds and seconds, respectively.
Method | F | Time | FLOPS | Param |
---|---|---|---|---|
PixelLink [3] | 83.70 | 1.89 s | 650.2 G | 20.5 M |
CRAFT [2] | 86.90 | 42.66 s | 1023.9 G | 20.8 M |
CharNet [34] | 90.97 | 1230 s | 2402.9 G | 89.2 M |
CAST | 81.06 | 352.90 ms | 20.8 G | 2.8 M |