Article

TextDC: Exploring Multidimensional Text Detection via a New Benchmark and Solution

1 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100190, China
2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
4 Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
5 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(1), 159; https://doi.org/10.3390/electronics12010159
Submission received: 8 December 2022 / Revised: 23 December 2022 / Accepted: 25 December 2022 / Published: 29 December 2022

Abstract

Text detection has been significantly boosted by the development of deep neural networks, but most existing methods focus on a single kind of text instance (i.e., overlaid text, layered text, or scene text). In this paper, we expand the text detection task from a single dimension to multiple dimensions, thus providing multi-type text descriptions for the scene and content analysis of videos. Specifically, we establish a new task to detect and classify text instances simultaneously, termed TextDC. As far as we know, existing benchmarks cannot meet the requirements of the proposed task. To this end, we collect a large-scale text detection and classification dataset, named Text3C, which is annotated with multilingual labels, location information, and text categories. Together with the collected dataset, we introduce a multi-stage and strict evaluation metric, which penalizes detection approaches for missing text instances, false-positive detections, inaccurate location boxes, and incorrect text categories, forming a new benchmark for the proposed TextDC task. In addition, we extend several state-of-the-art detectors by modifying the prediction head to solve the new task. Then, a generalized text detection and classification framework is designed and formulated. Extensive experiments using the updated methods are conducted on the established benchmark to verify the solvability of the proposed task, the challenges of the dataset, and the effectiveness of the solution.

1. Introduction

Text detection is one of the critical tasks in computer vision. It has received widespread attention and has been applied to various practical applications [1,2,3,4]. Existing text detectors consider the text instance from a single perspective, i.e., overlaid text, layered text, or scene text [5]. Overlaid text is usually employed to describe the scene or present people’s dialogue. It is embedded in the edge region of images or video frames, and its coverage process does not consider the background, as shown in Figure 1a. Layered text primarily defines the themes of scenes and events. It occurs in combination with a color-pure layer and covers the blank area of images or video frames, as shown in Figure 1b. Scene text is inherent in images and video scenes. It can unambiguously reflect the information about various objects in the current scene and the relative positions of adjacent scenes, as shown in Figure 1c.
However, a single-perspective treatment is no longer sufficient in the current era of streaming media, because images and video frames in complex scenes often carry several kinds of descriptions, such as the scene being portrayed, a summary of the content, a subjective point of view, and so on. In view of this, we propose a new task, termed TextDC, to detect and classify all types of text in images and videos at the same time. This is helpful for the multi-angle understanding, retrieval, and summarization of images and videos. The new task requires text detectors to predict the boundary and category (i.e., overlaid text, layered text, or scene text) of each text instance simultaneously. Thus, it challenges the traditional benchmarks that only contain a single type of text. To address this issue, we construct a comprehensive benchmark that meets the task requirements and reflects real-world scenarios. As far as we know, no existing dataset can support this task.
To support the research of the proposed TextDC task, we collect a large-scale multi-type text dataset with three instance categories, referred to as Text3C. Together with a new multi-stage evaluation protocol, we propose a promising and exciting benchmark to further the expansion of the text detection community. As shown in Figure 1d, each of the 595,691 text instances is labeled with a boundary box and text type (i.e., overlaid text, layered text, or scene text). The established dataset is divided into a training set with 29,534 video frames and a test set with 10,806 video frames, with a ratio of roughly 3:1. Our dataset not only contains the text categories contained in other datasets but is also more varied in terms of scenarios. It presents various challenges, including scale diversity, extreme aspect ratios, multiple directions, multiple languages, multiple categories, and unconstrained scenarios, thus posing a new challenge to state-of-the-art text detectors.
Existing text detection methods only determine whether a region contains text and assign a boundary box or pixel mask to label the corresponding instance. These methods solve a binary classification problem; thus, they do not have the ability to distinguish multiple text categories, namely, they cannot output overlaid text, layered text, or scene text for each instance. In this paper, rooted in an open text detection architecture, we extend some state-of-the-art algorithms to make them suitable for the proposed text detection and classification task, and subsequently establish a new open-source text detection and classification framework. We use and modify the open text detection and recognition framework, mmocr (https://github.com/open-mmlab/mmocr), accessed on 5 February 2022. Specifically, we first model the generality of the existing text detection methods and clarify their structural similarities (backbone stage for feature extraction and neck stage for feature fusion) and differences (prediction stage for boundary boxes). Then, we propose a well-designed multi-category feature layer that follows the neck stage so that the existing algorithms can solve the new task. The contributions of this paper include the following:
  • We establish a large-scale multi-type text detection benchmark, termed Text3C, which consists of 40,340 images, including 595,691 text instances, and addresses the challenges of overlaid text, layered text, and scene text.
  • We propose a new task to detect and classify text in images and video frames simultaneously, named TextDC, which can consider the current types of streaming media with multi-view representations.
  • We propose a simple and effective solution for the TextDC task. Based on an open text detection and recognition framework, we model the similarities (i.e., backbone and neck stages) while reserving the differences (prediction head) of some state-of-the-art methods, and then modify them by adding category-related feature layers to match the proposed new task.
  • Extensive experiments are conducted on the proposed dataset with a multi-stage evaluation protocol to demonstrate the usefulness of the proposed Text3C and its potential to promote community research.
This paper is organized as follows. The related benchmarks and methods are described in Section 2. Section 3 presents the proposed Text3C benchmark, including the dataset and protocols utilized, and Section 4 describes the proposed method. The experimental evaluation and comparison of different methods are presented in Section 5, Section 6 discusses the limitations, and the conclusion is given in Section 7.

2. Related Work

Text detection is a basic computer vision task. In the past decade, various detection algorithms have been proposed [1,2,3,4] to efficiently and effectively localize text regions using rectangles [5,8,9,10,11,12,13,14], quadrilaterals [7,15,16,17], and polygons [18,19,20]. In this section, we discuss the text benchmarks and detection methods from the perspective of text types.

2.1. Benchmarks

Text can be divided into three types according to different presentations: overlaid text, layered text, and scene text. Here, we briefly introduce the related datasets.
Overlaid text often appears in TV shows, movies, and news interviews to show characters’ dialogues and scene descriptions. Overlaid text detection faces many challenges including temporal diversity, complex backgrounds, similar colors, and low contrast. Tian et al. [5] established an overlaid text dataset that includes five web videos, i.e., Putin Speech, Curtains, Zippers, Wood Box, and Biscuit Joiner. The training set includes Putin Speech with 9839 frames and 14,492 ground truths. The test set consists of Curtains with 2760 frames and 3383 ground truths, Zippers with 5250 frames and 8520 ground truths, Wood Box with 4556 frames and 5661 ground truths, and Biscuit Joiner with 5265 frames and 9876 ground truths. Cai et al. [12] created a long-duration overlaid text dataset with three videos from TV dramas, V1, V2, and V3. The training set includes V3 with 20,506 frames and 14,358 ground truths. The testing set consists of V1 with 18,415 frames and 12,975 ground truths, and V2 with 18,149 frames and 13,862 ground truths. Both of the above datasets focus on overlaid text detection.
Layered text usually appears in news videos to present news events and provide the theme of the current video clip or the topic of an interview. To detect layered text accurately, various state-of-the-art detectors need to consider the scrolling of the text in the video region, the diversity of scales, the vertical layout characteristics, and the presentation of artistic fonts. Zayene et al. [6,21] collected four videos, including an Al Jazeera video with 228,700 frames and 1029 text lines, a France 24 video with 150,425 frames and 956 text lines, a Russia Today video with 181,500 frames and 1541 text lines, and an El Wataniya video with 300,525 frames and 1298 text lines. Each video consists of a training set and a test set. Together with new evaluation protocols, they established the first layered text dataset, named AcTiV. It spans talk shows, interviews, documentaries, weather reports, and sports scenes.
Scene text widely exists in scenes depicting people’s daily lives. This type of text may present multiple forms in images or videos with corresponding objects and backgrounds. Unlike the rectangular box annotation method used for overlaid text and layered text, scene text adopts three methods of annotation, i.e., rectangular boxes, quadrilateral boxes, and polygonal boxes, due to the flexibility and particularity of its presentation. Most of the early scene text datasets [8,9,10,11] were collected through focused shooting and, as a result, the text is presented horizontally and the label boxes are rectangular. These datasets have obvious characteristics: clear text, diversified scales, and central distribution. Text collected through focused shooting cannot properly reflect real-world scenarios from a human perspective. Researchers then began to use hand-held intelligent terminals (e.g., smart mobile phones and glasses) to capture scene text and then build unconstrained scene text datasets [7,15,16,17], which mainly use quadrilaterals as the annotation boxes. Due to the randomness of the text capture process, the detection process is faced with many challenges: motion blur, diverse scales, diverse languages, complex backgrounds, diverse colors, and flexible presentation methods. Due to the increasing demand for text representation and aesthetics, the forms of text presentation on some commodities, shops, and billboards have gradually evolved from horizontal and vertical to curved. In order to solve this problem, researchers designed a polygon strategy to label curved text [18,19,20]. The appearance of curved character datasets poses a huge challenge to detection and recognition algorithms, that is, the representation of curved forms and the correction of the curved text.
The datasets mentioned above only contain a single category of text and cannot support a comprehensive analysis of mainstream multi-label streaming media content such as short videos, interviews, and live streams. To fill this gap, we collected and constructed a large-scale detection dataset for multi-type texts, which can further the research of multi-type text detection.

2.2. Methods

To adapt to the changes in text presentation and detection scenarios, researchers proposed a large number of text detection algorithms for various datasets. Based on different types of text instances, existing methods can be divided into overlaid text detection, layered text detection, and scene text detection methods.
The overlaid text detection method tackles texts with a neat appearance and dense distribution. Thus, text localization can be properly processed using traditional manual design feature learning or current deep feature learning. Liu et al. [22,23] designed stroke-like features and a fusion strategy of sequence frames to extract the overlaid text in videos and cleverly exploited the information redundancy of successive video frames. Tian et al. [5] developed a tracking-based unified framework to localize overlaid text. Cai et al. [12] proposed a sampling-and-recovery model and a divide-and-conquer model to efficiently localize overlaid text.
The layered text detection method usually deals with news video frames with high resolution. Layered text has both the motion features of overlaid text and the appearance features of scene text, which adds many challenges to the detector. Shivakumara et al. [24] designed a fast Fourier Laplacian-based feature extraction method to learn the local features of text and group them into an entire text instance. Zayene et al. [6] collected a new layered text dataset and developed a method to find text regions by fusing stroke-width transform features and convolution features. Fang et al. [25] applied a region-based convolutional neural network to locate text as a whole. Cai et al. [26] utilized the spatial locations and componential representations of layered text to improve detection performance.
The scene text detection method mainly focuses on learning text features and text appearance representations. Text features are extracted using the backbone network and optimized by fusing multiple-scale features. The well-designed prediction head aims to represent the appearance of text by adapting it to rectangles, quadrilaterals, and polygons.
Most of the early text feature learning methods utilize a VGG backbone network [27] for multi-resolution feature extraction [28,29,30]. The representative TextBoxes [28] uses the multi-scale characteristics of the backbone network itself to learn the feature information of the corresponding scale text at different feature levels. With the proposal of the ResNet backbone network [31] and feature pyramid network [32], researchers often combine both to extract the multi-scale features of text objects. Wan et al. [33] fused text knowledge into a network with a ResNet backbone and pyramid features through the image-level text recognition task and then migrated the well-designed network to the detection algorithm to improve detection accuracy. The works in [34,35] adopted U-shaped networks similar to feature pyramids to construct multi-scale feature networks to generate text region predictions at different scales. Because of the unique long strip structure of text, some researchers believe that a convolution using a mainstream square receptive field will introduce background noise when extracting character features. Therefore, a variety of convolutions with long strip receptive fields [28,36] and deformable convolution operations [37] have been designed and, as a result, the convolution can focus on the text area and suppress background interference. Tang et al. [38] designed a feature-sampling-and-grouping framework by jointly modeling transformers and convolutions.
Text objects have evolved from the horizontally distributed texts of the ICDAR 2011 [8] and ICDAR 2013 [9] datasets, through the obliquely distributed texts of the ICDAR 2015 [7], MLT 2017 [15], and MLT 2019 [16] datasets, to the curved texts of the Total-Text [18] and CTW1500 [19] datasets. The representations of text appearance are becoming more and more diversified, and the text outline representation abilities of detection algorithms are constantly improving, from the early representations of rectangular boxes [28,39] and inclined rectangular boxes [29,30] to the present representations of arbitrary shapes [34,40,41,42,43,44]. The works in [42,43,44] generated a text mask based on pixel-level learning to determine the boundaries of text through geometric calculations. However, this kind of method requires large-scale datasets for pre-training and complex post-processing to obtain the final text position. The contour representations of Cartesian coordinate systems [45] and polar coordinate systems [41] not only require post-conversion processing but are also inaccurate for the localization of text with large differences in the aspect ratio and high curvature. Zhu et al. [40] converted the text detection problem to the Fourier domain and utilized the Fourier signal to fit any shape of text. The accurate positioning benefits from the theoretical modeling support that allows the Fourier signal to fit any closed interval. Huang et al. [46] considered the synergy of detection and recognition to improve text localization.
Existing methods regard text detection as a binary classification problem, that is, they judge whether a specific area is a text area. They cannot classify multiple types of text instances. For the text localization and classification task, we propose an effective and efficient solution to improve existing algorithms by adding a multi-category feature layer.

3. Text3C Benchmarks

We established a text detection dataset to simulate news and live videos that contain multiple types of text (i.e., overlaid text, layered text, and scene text), thereby helping researchers to further their efforts in this field.
In order to highlight the three different types of text instances, we named the proposed dataset Text3C. As shown in Figure 2, Text3C consists of a training set with 13 videos and a test set with 7 videos. In terms of language categories, it consists of English, Chinese, and Other. When the language and content of the text cannot be identified, it is classified as Other. Unlike existing mainstream datasets that only have one type of text annotation, the established dataset classifies text categories into overlaid text, layered text, and scene text according to the different text embedding methods and representation topics. Due to the variety of text categories and complex distribution scenarios, the text is presented in various forms so our text annotation consists of rectangular boxes, quadrilateral boxes, and polygonal boxes at the same time. As the dataset was derived from news and live video, it naturally contains diversified scenes and different text presentations. Our Text3C faces various detection challenges, e.g., multiple languages, multiple types, multiple directions, dense text distribution, various aspect ratios, multiple scales, and complex poses and backgrounds.

3.1. Collection

We collected a large-scale multi-type text dataset to support the text detection task. Specifically, we first invited 15 computer science students to collect a large number of news videos and live videos from the Internet (Bilibili (https://www.bilibili.com/), CCTV (https://www.cctv.com/), and YouTube (https://www.youtube.com/), accessed on 23 January 2022). Then, three domain experts were invited to choose high-quality videos. Finally, 20 videos were selected after two rounds of cross-screening. These videos were divided into a training set with 13 videos and 29,534 frames and a test set with 7 videos and 10,806 frames.

3.2. Annotation

For the text annotation, we only invited 20 people to label the corresponding videos to ensure the consistency of the labeling. Each video was checked by two rounds of cross-checking to ensure annotation quality. Based on the presentation forms of the text, we used rectangular, quadrilateral, and polygonal boxes to label the different text instances. In all, we annotated 595,691 text instances, including 396,096 text instances in 29,534 video frames of the training set and 199,595 text instances in 10,806 frames of the test set. After the spatial boxes were provided, the annotator first assigned language labels to the text instances of either Chinese, English, or Other, and then assigned type labels to the text instances of either overlaid text, layered text, or scene text. Due to the existence of multiple types, the established dataset forced the algorithms to localize and classify text instances simultaneously, which makes Text3C more valuable and suited to academic applications. To the best of our knowledge, the proposed Text3C benchmark is the first to highlight text types.

3.3. Analysis

To help researchers gain a comprehensive understanding of the video frame distribution, text type distribution, and language category distribution of the dataset, we visualized these statistical comparisons for the training and test sets in Figure 3. The number of video frames in the 13 subsets of the training set ranged from 462 to 5000. The number of video frames in the 7 subsets of the test set ranged from 152 to 4825. The frame distributions of both subsets were uniform and similar, as shown in Figure 3a,b. Based on the differences in the text representations, we labeled each text instance as overlaid text, layered text, or scene text, and the statistical distribution is shown in Figure 3c. Scene text presented a similar distribution in the training and test sets, but there were large differences for both overlaid and layered texts. Multi-type annotations provide researchers with the possibility of parsing video content from multiple perspectives, and the differential distribution poses a greater challenge to algorithms. In addition, we performed a statistical comparison of the text language categories in the training and test sets, as shown in Figure 3d. The small distribution difference between the two subsets means that the dataset can still support various text detection methods in modeling robust language representations.

3.4. Comparison

The statistical comparison of the proposed Text3C benchmark and the existing mainstream benchmarks is summarized in Table 1. #Image/frame denotes the number of frames in the training set, test set, and whole dataset. #Text instance denotes the number of text instances in the training set, test set, and whole dataset. It is clear that the number of instances in our dataset is larger than that of the other datasets, and the plentiful texts require algorithms with stronger discovery capabilities. The language attribute includes English, Chinese, and Other. Our dataset covers three language categories, which can better verify the robustness of the various algorithms for different language categories in the process of text detection. The presentation type of text includes overlaid text, layered text, and scene text. Unlike other datasets that contain only one presentation type for text annotations, our dataset provides three presentation types for text annotations. Holistic presentations not only provide researchers with new insights but also provide the text community with real experimental data to address text-video-related tasks from multiple perspectives.

3.5. Evaluation Protocol

To fairly compare the different methods on the proposed task, we introduced a novel evaluation protocol, which penalizes approaches for missing text instances, duplicated localizations of one text, false-positive localizations, and the false classification of text types. Previous evaluation criteria for text detection are loose; for example, under the ICDAR 2015 protocol [7], a detection box is counted as correct as long as its IoU with the ground-truth box is greater than 0.5. The new text detection and classification task has the same evaluation requirements as the common object detection task. Therefore, we introduced the multi-stage evaluation strategy of MS COCO [48] and designed three new metrics without category constraints, i.e., mean Recall ($mR$), mean Precision ($mP$), and mean F-measure ($mF$), to verify the detection performance of the different algorithms. Moreover, we also designed three new metrics with category constraints, i.e., $mR_c$, $mP_c$, and $mF_c$, to verify the detection and classification performance of the different algorithms on the TextDC task. In fact, our evaluation protocol is a multi-threshold generalization of the ICDAR 2015 protocol, that is, the single IoU constraint of 0.5 is replaced by averaging over all 10 IoU thresholds $\mathrm{IoU} \in [0.50, 0.95]$ with a uniform step size of 0.05.
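The following Python sketch illustrates the multi-threshold averaging behind $mP$, $mR$, and $mF$. It is not the official evaluation code: the greedy one-to-one matcher and the helper names (match_at_threshold, mean_prf) are our own simplifications, and the actual protocol additionally applies the category constraints and duplicate penalties described above.

```python
# Illustrative sketch of the multi-stage (MS COCO-style) detection metrics:
# average single-threshold precision/recall/F over IoU thresholds 0.50:0.05:0.95.
import numpy as np

def match_at_threshold(ious, thr):
    """Greedy one-to-one matching of predictions to ground truths at one IoU threshold.
    `ious` is a (num_pred, num_gt) matrix; returns (tp, fp, fn)."""
    num_pred, num_gt = ious.shape
    gt_used = np.zeros(num_gt, dtype=bool)
    tp = 0
    for p in range(num_pred):
        for g in np.argsort(-ious[p]):          # try best-overlapping ground truth first
            if not gt_used[g] and ious[p, g] >= thr:
                gt_used[g] = True
                tp += 1
                break
    return tp, num_pred - tp, num_gt - tp        # tp, fp, fn

def mean_prf(ious_per_image):
    """Mean precision/recall/F-measure over the 10 IoU thresholds in [0.50, 0.95]."""
    thresholds = np.arange(0.50, 1.00, 0.05)     # 0.50, 0.55, ..., 0.95
    ps, rs, fs = [], [], []
    for thr in thresholds:
        tp = fp = fn = 0
        for ious in ious_per_image:              # one IoU matrix per test image
            t, f, n = match_at_threshold(ious, thr)
            tp, fp, fn = tp + t, fp + f, fn + n
        p = tp / max(tp + fp, 1)
        r = tp / max(tp + fn, 1)
        ps.append(p)
        rs.append(r)
        fs.append(2 * p * r / max(p + r, 1e-8))
    return float(np.mean(ps)), float(np.mean(rs)), float(np.mean(fs))
```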

4. The Proposed Method

4.1. Generalized Network Architecture

A generalized text detection framework is displayed in Figure 4 to explain how we extended the popular text detection methods to adapt to the proposed TextDC task. Most algorithms are composed of a backbone network ($f_{bac}$) for feature learning, a neck stack ($f_{nec}$) for feature fusion, and a prediction head for text-related representation (i.e., position and category). The backbone network is mainly responsible for extracting the multi-scale features of the input image or video frame. ResNet [31] is the mainstream backbone used in current text detection approaches. In order to fairly compare the performance of the different methods on the proposed TextDC task, we uniformly adopted ResNet50 as the backbone for all the comparison approaches. The main function of the neck network is the fusion and intermediate processing of multi-scale feature maps. It receives the outputs of the backbone and outputs one or more feature maps. The processing of the backbone and neck stages can be formulated as $Y = f_{nec}(f_{bac}(X))$, where $X \in \mathbb{R}^{H \times W \times 3}$ represents the input image; $Y = \{Y_i\}_{i \in I}$ denotes the feature maps of the input image; $I = \{1, 2, \ldots, n\}$ is the feature map index set; $n$ is the number of feature maps generated by the neck network, which varies for different text detection algorithms; and $f_{bac}$ and $f_{nec}$ denote the backbone and neck stages, respectively.
After the backbone and neck stages, the representation features $Y$ are sent to the prediction head. As shown in the prediction head in Figure 4, we designed two branches: the upper branch is used to predict the categories of the text instances and the bottom branch retains the prediction head of the original algorithm. Specifically, the bottom branch generates heat maps or many proposal boxes $O$ by an activation function $f_{loc}$. Afterward, a post-processing function $f_{box}$ is employed to determine the text boundary boxes. The upper branch predicts three feature maps $C$ to represent three categories of text by an activation function $f_{cls}$. Thereafter, the post-processing function $f_{typ}$ matches the three category heat maps using the predicted text boxes to determine the text type.
To enable the generalized text detectors to solve the proposed task, we extended the prediction head to give it the ability to generate text categories. Specifically, the two branches send the image features $Y$ to the localization function $f_{loc}$ and the classification function $f_{cls}$ to generate the original prediction feature maps and the category prediction feature maps, formulated as $O = f_{loc}(Y)$ and $C = f_{cls}(Y)$, where $O = \{O_i\}_{i \in I}$ and $C = \{C_i \mid C_i \in \mathbb{R}^{\frac{H}{s_i} \times \frac{W}{s_i} \times 3}, i \in I\}$ represent the original prediction and category prediction feature maps, respectively, and $I = \{1, 2, \ldots, n\}$ is the feature map index set. The final position and category of the text are generated by $Pos = f_{box}(O)$ and $Cls = f_{typ}(C, Pos)$, where $f_{box}$ represents the post-processing function for obtaining the boundary boxes of the text instances and $f_{typ}$ represents the post-processing function for obtaining the text category map. $Cls = \{c_i \mid c_i \in \{0, 1, 2\}, i = 1, 2, \ldots, m\}$ and $Pos = \{pos_i \mid pos_i \in \mathbb{R}^{H \times W}, i = 1, 2, \ldots, m\}$ are the final category and location results of the text, $m$ denotes the number of detected text instances in the current image, and $pos_i$ is a binary mask for the $i$th detected text instance.
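As a concrete illustration of this data flow, the PyTorch-style sketch below wires a backbone, a neck, and the two prediction branches together. The class name TextDCDetector and the assumption that every component is a simple callable returning tensors (or lists of tensors) are ours; real detectors in mmocr organize these stages differently, so only the composition $Y = f_{nec}(f_{bac}(X))$, $O = f_{loc}(Y)$, $C = f_{cls}(Y)$ is meant to be faithful.

```python
# Minimal PyTorch-style sketch of the generalized two-branch pipeline (an illustration,
# not the exact implementation used in the paper).
import torch.nn as nn

class TextDCDetector(nn.Module):
    def __init__(self, backbone, neck, loc_head, cls_head):
        super().__init__()
        self.backbone = backbone   # f_bac: multi-scale feature extraction (e.g., ResNet50)
        self.neck = neck           # f_nec: feature fusion (e.g., an FPN), outputs {Y_i}
        self.loc_head = loc_head   # f_loc: original detector-specific prediction branch
        self.cls_head = cls_head   # f_cls: added branch producing three category maps

    def forward(self, x):                        # x: (B, 3, H, W) input images
        feats = self.neck(self.backbone(x))      # Y = f_nec(f_bac(X)), a list of feature maps
        O = [self.loc_head(y) for y in feats]    # original prediction maps / proposals
        C = [self.cls_head(y) for y in feats]    # per-level category score maps (3 channels)
        return O, C                              # f_box and f_typ are applied as post-processing
```

At inference time, the detector-specific $f_{box}$ converts $O$ into instance boxes or masks, and the category post-processing $f_{typ}$ (Section 4.3) assigns one of the three types to each instance.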

4.2. Classification Head

We designed a simple and effective text classification strategy to enable the state-of-the-art text detection methods to tackle the proposed TextDC task. Specifically, we added a new branch on the neck network to generate three classification feature maps that were used to match overlaid text, layered text, and scene text. It consisted of multiple convolutions for feature transformation and upsampling operations for aligning the resolutions of the prediction map and the ground-truth map.
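A minimal sketch of one possible realization of this branch is shown below. The layer configuration (two 3x3 convolutions, a 1x1 output convolution, and bilinear upsampling) and the hyperparameters (channel widths, upsampling factor) are our assumptions for illustration; the paper does not specify the exact layers.

```python
# One reasonable design for the added classification branch (an assumption, not the
# paper's exact configuration): convolutions for feature transformation, a 1x1 output
# convolution producing three category maps, and bilinear upsampling for alignment.
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    def __init__(self, in_channels=256, mid_channels=64, num_types=3, scale=4):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.out_conv = nn.Conv2d(mid_channels, num_types, 1)  # overlaid / layered / scene
        self.scale = scale

    def forward(self, y):
        c = self.out_conv(self.convs(y))
        # align the prediction resolution with the ground-truth category maps
        c = F.interpolate(c, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return c.sigmoid()   # per-pixel scores for the three text types
```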

4.3. Training and Inference

We completed the training by adding the supervision information $\hat{O}$ and $\hat{C}$ for $O$ and $C$. The supervision of $C$ is composed of three binary maps, and for each map, the text regions of the corresponding category (overlaid, layered, or scene) were set to 1 and the rest were set to 0. The supervision of $O$ is consistent with that of the original algorithm.
In the inference stage, we used two post-processing functions to convert the model predictions $O$ and $C$ into the final results, where $f_{box}$ is the box generator of the original algorithm and $f_{typ}$ is the well-designed category generator, which can be formulated as follows:
$$Cls_t = f_{typ}(C, Pos_t) = \arg\max_i \left[ \frac{1}{\mathbf{1}_{\frac{H}{s}}^{T} \, Pos'_t \, \mathbf{1}_{\frac{W}{s}}} \sum_{j=1}^{\frac{H}{s}} \sum_{k=1}^{\frac{W}{s}} Pos'_{t,jk} \, C_{i,jk} \right],$$
$$Pos'_t = f_r(Pos_t, C_i) \in \mathbb{R}^{\frac{H}{s} \times \frac{W}{s}},$$
where the function $f_r$ denotes the operation of converting $Pos_t$ into a binary mask $Pos'_t$ with the same size as the spatial dimensions of $C_i$ through bilinear interpolation, $t$ denotes the $t$th text instance in the detection results, $i$ denotes the $i$th category prediction feature map, and $\mathbf{1}_v$ denotes a vector of dimension $v$ whose elements are all equal to 1.
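The sketch below mirrors this post-processing in PyTorch: each detected instance mask is resized to the resolution of the category maps, used to take a masked average of each of the three maps, and the arg max over the three averages gives the instance's type. The function and variable names are ours, and the tensor shapes are assumptions.

```python
# Sketch of the category post-processing f_typ: masked mean per category map, then argmax.
import torch
import torch.nn.functional as F

def f_typ(C, pos_masks):
    """C: (3, H_s, W_s) category score maps; pos_masks: list of (H, W) binary instance masks.
    Returns one category index (0 = overlaid, 1 = layered, 2 = scene) per detected instance."""
    labels = []
    for pos in pos_masks:
        # f_r: resize the instance mask to the spatial size of C via bilinear interpolation
        pos_r = F.interpolate(pos[None, None].float(), size=C.shape[-2:],
                              mode="bilinear", align_corners=False)[0, 0]
        area = pos_r.sum().clamp(min=1e-6)               # normalization term 1^T Pos' 1
        scores = (C * pos_r).sum(dim=(1, 2)) / area      # masked mean of each category map
        labels.append(int(scores.argmax()))              # arg max over the three categories
    return labels
```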

4.4. Loss Function

The loss function of the proposed task can be formulated as follows:
$$L_{total} = L_{org}(O, \hat{O}) + \gamma \, L_{cls}(C, \hat{C}, \hat{M}),$$
where $L_{org}$ denotes the loss function of the original text detection algorithm, $L_{cls}$ denotes the loss function of our classification task, $C$ denotes the predicted category feature maps, and $\hat{C}$ is the ground-truth category map for $C$. $\hat{M}$ is a text mask, where all text regions were set to 1 and other regions were set to 0, with the aim of removing the background loss. $\gamma$ is a hyperparameter to balance the two losses. For the different text detection algorithms, we adjusted the value of $\gamma$ so that $L_{cls}$ and $L_{org}$ were roughly of the same order of magnitude. For the algorithms that apply heads to multi-scale feature maps, we calculated the mean of the losses at each scale. To reduce the complexity of the prediction head, we did not consider the background class. We calculated $L_{cls}$ using a masked cross-entropy loss, which can be formulated as follows:
$$L_{cls}(C, \hat{C}, \hat{M}) = -\frac{1}{3 \, \mathbf{1}_{\frac{H}{s}}^{T} \hat{M} \, \mathbf{1}_{\frac{W}{s}}} \sum_{j=1}^{\frac{H}{s}} \sum_{k=1}^{\frac{W}{s}} \hat{M}_{jk} \sum_{i=1}^{3} \left( \hat{C}_{i,jk} \log(C_{i,jk}) + (1 - \hat{C}_{i,jk}) \log(1 - C_{i,jk}) \right),$$
where $C_i \in \mathbb{R}^{\frac{H}{s} \times \frac{W}{s}}$ denotes the $i$th predicted category feature map, $\hat{C}_i \in \mathbb{R}^{\frac{H}{s} \times \frac{W}{s}}$ denotes the binary map of the $i$th category ground truth for $C_i$, and $s$ denotes the downsampling ratio.
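For concreteness, a minimal PyTorch sketch of this masked cross-entropy is given below. The function name masked_cls_loss and the exact reduction (normalizing by three times the mask area) follow our reading of the formula above and may differ from the released implementation.

```python
# Sketch of the masked binary cross-entropy classification loss L_cls: the loss is
# computed only inside the text mask M_hat and averaged over the three category maps.
import torch

def masked_cls_loss(C, C_hat, M_hat, eps=1e-6):
    """C, C_hat: (3, H_s, W_s) predicted / ground-truth category maps in [0, 1];
    M_hat: (H_s, W_s) binary text mask (1 inside any text region, 0 in the background)."""
    bce = -(C_hat * torch.log(C.clamp(min=eps)) +
            (1 - C_hat) * torch.log((1 - C).clamp(min=eps)))   # (3, H_s, W_s)
    bce = (bce * M_hat).sum()                                  # keep text pixels only
    return bce / (3 * M_hat.sum().clamp(min=1.0))              # normalize by 3 x mask area
```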

4.5. Label Generation

We designed two kinds of supervision labels, i.e., classification labels and localization labels, as shown in Figure 5. The comparison algorithms can be divided into regression-based methods and segmentation-based methods. The text box supervisions of the regression-based approaches can be normalized to the ground truth. The segmentation supervision of the segmentation-based approaches can be obtained by binarizing the regions inside and outside the text boundary. Some methods aim to segment adjacent texts accurately by shrinking the text region, so we also generated the kernel region of a text instance by indenting the text region of the segmentation map. In order to bridge the differences among the different detectors, we designed a simple and effective classification strategy that represents the text types with separate activation maps. To this end, we obtained the corresponding supervision information of the overlaid text, layered text, and scene text by separating and binarizing the ground-truth map.
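The sketch below shows one way the classification supervision could be rasterized from the box/polygon annotations. The helper name make_category_maps, the use of OpenCV's fillPoly, and the downsampling ratio s are our assumptions for illustration.

```python
# Sketch of classification label generation: rasterize each annotated box/polygon into the
# binary category map matching its type label (0 = overlaid, 1 = layered, 2 = scene), and
# accumulate the union text mask M_hat used for loss masking.
import numpy as np
import cv2

def make_category_maps(polygons, types, img_h, img_w, s=4):
    """polygons: list of (N_i, 2) point arrays in image coordinates;
    types: list of category indices aligned with polygons."""
    h, w = img_h // s, img_w // s
    C_hat = np.zeros((3, h, w), dtype=np.uint8)   # one binary map per text type
    M_hat = np.zeros((h, w), dtype=np.uint8)      # union text mask
    for poly, t in zip(polygons, types):
        pts = np.round(np.asarray(poly, dtype=np.float64) / s).astype(np.int32)
        cv2.fillPoly(C_hat[t], [pts], 1)          # category supervision C_hat
        cv2.fillPoly(M_hat, [pts], 1)             # text mask M_hat
    return C_hat.astype(np.float32), M_hat.astype(np.float32)
```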

5. Experiments

To match the proposed text detection and classification task, we extended the six state-of-the-art text detection approaches by adding the proposed category branch. Then, we evaluated them on the established Text3C dataset using two kinds of evaluation protocols.

5.1. Experiment Settings

Six comparison approaches were implemented based on the mmocr platform (https://github.com/open-mmlab/mmocr) accessed on 5 February 2022. All experiments were conducted on a single 3090 GPU. For a fair comparison, all comparison algorithms were trained on the training set and evaluated on the test set of the established Text3C dataset. In addition, we optimized all the networks using the Adam optimizer. The other parameters (e.g., learning rate, batch size, and so on) were the same as those of the corresponding algorithm.

5.2. Metrics

To evaluate the detection performance of the proposed algorithms, we adopted the TextDC protocols without category constraints (i.e., mean Recall ($mR$), mean Precision ($mP$), and mean F-measure ($mF$)) and the TextDC protocols with both localization and classification constraints (i.e., $mR_c$, $mP_c$, and $mF_c$) to verify the performance of the different updated approaches.

5.3. Performance Comparisons for Text Detection

To verify the rationality of the proposed detection protocols, we compared the performance of state-of-the-art text detectors under the two kinds of evaluation metrics (i.e., the ICDAR15 protocol [7] and the proposed TextDC protocol without category constraints). Specifically, we used the ICDAR 2015 evaluation standard (i.e., Precision (P), Recall (R), F-measure (F)) with a single threshold (i.e., IoU ≥ 0.5) and the proposed TextDC evaluation standard (i.e., mean Precision (mP), mean Recall (mR), mean F-measure (mF)) with multi-stage and strict constraints to evaluate the detection performance. Table 2 lists the comparison results of the six text detectors.
As shown in Table 2, the detection performance ranking of the different detection approaches under the proposed TextDC protocols was the same as that under the ICDAR15 evaluation protocols, which demonstrates that the proposed evaluation metrics can reflect the advantages and disadvantages of the different algorithms. For the ICDAR15 evaluation protocols with a single constraint, we found that FCENet [40] achieved the best detection performance with 89.3% precision, 68.8% recall, and a 77.7% F-measure. This shows that the method can search more text instances with loose constraints. For the proposed TextDC evaluation protocol with multi-stage constraints, we found that PSENet [43] obtained the best detection performance with a 63.1% mean precision, a 44.2% mean recall, and a 52.0% mean F-measure. In addition, this algorithm also achieved excellent detection performance under the ICDAR15 protocols. This shows that PSENet had better representation ability for text instances, that is, the predicted text boxes covered the corresponding text regions to the maximum extent, which was important for the subsequent text classification, tracking, and recognition. From the detection performance comparisons of the different methods under the two evaluation protocols, it could be seen that the established Text3C dataset was a challenging touchstone for the text detection algorithms.

5.4. Performance Comparisons for Text Detection and Classification

In order to evaluate the performance of the related algorithms on the proposed new TextDC task, we updated the ICDAR15 detection protocol by adding a classification constraint. Specifically, correctly detecting a text instance requires an IoU greater than 0.5 and accurate classification. Thus, three evaluation protocols were updated to $P_c$, $R_c$, and $F_c$. The proposed TextDC protocol also holds the category label in each stage constraint. The three evaluation protocols consisted of $mP_c$, $mR_c$, and $mF_c$. To evaluate the proposed TextDC task on the established Text3C dataset, we extended the six state-of-the-art text detection methods to predict both the text position and text category. The comparison methods were marked as TextDC (DBNet [42]), TextDC (TextSnake [49]), TextDC (DRRG [50]), TextDC (PAN [44]), TextDC (FCENet [40]), and TextDC (PSENet [43]). Table 3 shows two kinds of evaluation results for the different updated detectors.
As shown in Table 3, the ranking of the detection and classification performance of the six detectors under the proposed TextDC evaluation protocols was the same as that under the ICDAR15 evaluation protocols. This demonstrates that our evaluation metrics can properly reflect the advantages and disadvantages of the different approaches. TextDC (PSENet) achieved the best performance under both the ICDAR15 and TextDC protocols and obtained the greatest gap with the second algorithm. Specifically, for the ICDAR15 protocols, the $F_c$ of TextDC (PSENet) was 4.5% (65.1% vs. 60.6%) higher than that of TextDC (PAN), which was due to the fact that the precision of TextDC (PSENet) for text instances was significantly better than that of the other algorithms (79.0% vs. 69.4%). For TextDC, the $mF_c$ of TextDC (PSENet) was 9.4% (47.4% vs. 37.7%) higher than that of TextDC (FCENet), which was ranked second, and this was due to its huge advantage of 14.3% in $mP_c$ (57.6% vs. 43.3%). The obvious gap fully demonstrates that TextDC (PSENet) had a stronger representation ability for text instances and was more conducive to text classification.

5.5. Visualization Comparison

To intuitively observe and compare the detection and classification performance of the different algorithms on the proposed Text3C dataset, we visualized some of the sample results in Figure 6. From the perspective of detection, the performance of TextDC (FCENet) and TextDC (PSENet) was better than that of the other algorithms, especially for scene text, which naturally exists in the environment and is the most difficult to detect; this is thanks to their good feature learning on difficult text samples and their ability to distinguish adjacent texts. This is consistent with the numerical comparison of the different algorithms in Table 2. From the perspective of detection and classification as a whole, the comprehensive performance of TextDC (PSENet) was better than that of the other methods. This is consistent with the numerical comparison of the different algorithms in Table 3. Although TextDC (PSENet) achieved good performance, it still needs to be improved on large-scale text and dense adjacent text.

5.6. Joint Analysis

In order to comprehensively analyze the performance of the different methods, we used three evaluation metrics, i.e., the mean F-measure with category constraints ($mF_c$), classification precision ($CP$), and frames per second ($FPS$), to evaluate the comparison algorithms from three perspectives, as shown in Table 4. Each method adopted ResNet50 as the backbone and was trained for 30 epochs on the training set of our Text3C dataset.
As shown in Table 4, for the localization and classification performance, TextDC (PSENet) achieved the best performance with a 47.4% $mF_c$. From a text region classification perspective, TextDC (TextSnake) obtained the highest precision with a 92.3% $CP$, which shows that the extracted features are beneficial for the classification of text. From the perspective of the methods' speeds, TextDC (PAN) achieved the fastest speed and was much faster than TextDC (FCENet), by 18 FPS (39.6 vs. 21.6), because it contains a low-computational-cost segmentation head after a lightweight backbone network. The summary analysis shows that our Text3C dataset can properly reflect the detection and classification performance, running speed, and feature representation ability of the different algorithms.

6. Limitation and Discussion

Since the task needs to detect different text instances and identify their corresponding categories, it poses a great challenge to the various text detection algorithms. In this section, we first use some failure cases from the detection and classification task to reveal the limitations of the algorithm and the specific challenges of the dataset. Then, we provide a full analysis and discuss the optimization directions for the detection methods.

6.1. Limitation Analysis

In Figure 7, we show some failure cases of the TextDC (PSENet) algorithm, which had the best performance, to inspire our future efforts. In Figure 7a, from the perspective of text detection, we can see that the algorithm had insufficient detection capability for scene text, especially for irregular handwritten art fonts. From the perspective of text type classification, we can see that some scene texts with English or numeric content were wrongly classified as overlaid text. This may be because most of the English and numbers in the training set were overlaid text, resulting in the overfitting of the network learning. In Figure 7b, we can see that the algorithm accurately located simple text instances, but errors occurred when classifying text categories. The five lines of text in the middle should be layered text, but they were wrongly classified as overlaid text and scene text. This may be due to the large amount of layered text in the training set being concentrated in the edge regions, so the network was affected by the location information when learning text categories.

6.2. Discussion

Based on the analysis, we concluded that the main optimization direction of the proposed new task was in strengthening the learning of scene text, and, in particular, improving the robustness of irregular text instances. In addition, it was necessary to enhance the anti-interference ability of the categories for location information. More importantly, the category diversity features of the dataset hindered the detection performance of the different algorithms. Combining text categories and positions to enhance the robustness of text features will be our focus in future research.

7. Conclusions

We proposed a large-scale multi-type text detection benchmark, termed Text3C, with three annotation strategies (rectangular boxes, quadrilateral boxes, and polygonal boxes) and three text types (overlaid text, layered text, and scene text). The established benchmark is useful for the text detection community and opens up a promising direction for text detection. To accompany the text detection and classification task (TextDC), we proposed multi-stage constraint metrics for a fair and comprehensive evaluation and a simple but effective optimization method for the various state-of-the-art detection methods. Extensive experiments were conducted on the proposed Text3C dataset to verify the effectiveness of our updated methods and identify the challenges of the established benchmark.
Plentiful analyses and discussions are provided to inspire our future research directions. We plan to design a more robust feature learning network to overcome the problems of the poor localization of scene texts and the difficult classification of different types. In addition, we will further strengthen the association between the category learning and location learning of text objects, thereby improving overall performance and providing robust category features for subsequent tasks. Finally, to further increase practicability, we plan to optimize the algorithm for end-to-end text detection and recognition to provide more direct support, with multi-dimensional and unambiguous text contents, for the downstream multi-modal understanding of live broadcasts and news videos.

Author Contributions

Project administration, Y.T.; methodology, F.Z. and Y.T.; validation, Y.C.; investigation, F.Z. and Y.C.; writing—original draft, F.Z.; writing—review and editing, F.Z., Y.C. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by grants from the National Natural Science Foundation of China (No. 12071458, 71731009).

Data Availability Statement

The data can be downloaded from https://github.com/yuchengtianxia/Text3C, accessed on 10 December 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ye, Q.; Doermann, D.S. Text Detection and Recognition in Imagery: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1480–1500. [Google Scholar] [CrossRef] [PubMed]
  2. Yin, X.; Zuo, Z.; Tian, S.; Liu, C. Text Detection, Tracking and Recognition in Video: A Comprehensive Survey. IEEE Trans. Image Process. 2016, 25, 2752–2773. [Google Scholar] [CrossRef] [PubMed]
  3. Zhu, Y.; Yao, C.; Bai, X. Scene text detection and recognition: Recent advances and future trends. Front. Comput. Sci. 2016, 10, 19–36. [Google Scholar] [CrossRef]
  4. Long, S.; He, X.; Yao, C. Scene Text Detection and Recognition: The Deep Learning Era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
  5. Tian, S.; Yin, X.; Su, Y.; Hao, H. A Unified Framework for Tracking Based Text Detection and Recognition from Web Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 542–554. [Google Scholar] [CrossRef] [PubMed]
  6. Zayene, O.; Hennebert, J.; Touj, S.M.; Ingold, R.; Amara, N.E.B. A dataset for Arabic text detection, tracking and recognition in news videos- AcTiV. In Proceedings of the International Conference on Document Analysis and Recognition, Tunis, Tunisia, 23–26 August 2015; pp. 996–1000. [Google Scholar]
  7. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.K.; Bagdanov, A.D.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on Robust Reading. In Proceedings of the International Conference on Document Analysis and Recognition, Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
  8. Karatzas, D.; Mestre, S.R.; Mas, J.; Nourbakhsh, F.; Roy, P.P. ICDAR 2011 Robust Reading Competition—Challenge 1: Reading Text in Born-Digital Images (Web and Email). In Proceedings of the International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 1485–1490. [Google Scholar]
  9. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazán, J.; de las Heras, L. ICDAR 2013 Robust Reading Competition. In Proceedings of the International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493. [Google Scholar]
  10. Wang, K.; Belongie, S.J. Word Spotting in the Wild. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6311, pp. 591–604. [Google Scholar]
  11. Wang, K.; Babenko, B.; Belongie, S.J. End-to-end scene text recognition. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar]
  12. Cai, Y.; Wang, W.; Huang, S.; Ma, J.; Lu, K. Spatiotemporal text localization for videos. Multimed. Tools Appl. 2018, 77, 29323–29345. [Google Scholar] [CrossRef]
  13. Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S.J. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. arXiv 2016, arXiv:1601.07140. [Google Scholar]
  14. Yuan, T.; Zhu, Z.; Xu, K.; Li, C.; Mu, T.; Hu, S. A Large Chinese Text Dataset in the Wild. J. Comput. Sci. Technol. 2019, 34, 509–521. [Google Scholar] [CrossRef]
  15. Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification—RRC-MLT. In Proceedings of the International Conference on Document Analysis and Recognition, Kyoto, Japan, 9–15 November 2017; pp. 1454–1459. [Google Scholar]
  16. Nayef, N.; Liu, C.; Ogier, J.; Patel, Y.; Busta, M.; Chowdhury, P.N.; Karatzas, D.; Khlif, W.; Matas, J.; Pal, U.; et al. ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition—RRC-MLT-2019. In Proceedings of the International Conference on Document Analysis and Recognition, Sydney, NSW, Australia, 20–25 September 2019; pp. 1582–1587. [Google Scholar]
  17. Sun, Y.; Liu, J.; Liu, W.; Han, J.; Ding, E.; Liu, J. Chinese Street View Text: Large-Scale Chinese Text Reading With Partially Supervised Learning. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9085–9094. [Google Scholar]
  18. Ch’ng, C.K.; Chan, C.S.; Liu, C. Total-Text: Towards Orientation Robustness in Scene Text Detection. Int. J. Doc. Anal. Recognit. (IJDAR) 2020, 23, 31–52. [Google Scholar] [CrossRef]
  19. Liu, Y.; Jin, L.; Zhang, S.; Luo, C.; Zhang, S. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit. 2019, 90, 337–345. [Google Scholar] [CrossRef]
  20. Chng, C.K.; Ding, E.; Liu, J.; Karatzas, D.; Chan, C.S.; Jin, L.; Liu, Y.; Sun, Y.; Ng, C.C.; Luo, C.; et al. ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text—RRC-ArT. In Proceedings of the International Conference on Document Analysis and Recognition, Sydney, NSW, Australia, 20–25 September 2019; pp. 1571–1576. [Google Scholar]
  21. Zayene, O.; Seuret, M.; Touj, S.M.; Hennebert, J.; Ingold, R.; Amara, N.E.B. Text Detection in Arabic News Video Based on SWT Operator and Convolutional Auto-Encoders. In Proceedings of the IAPR Workshop on Document Analysis Systems, Santorini, Greece, 11–14 April 2016; pp. 13–18. [Google Scholar]
  22. Liu, X.; Wang, W. Extracting captions from videos using temporal feature. In Proceedings of the 18th International Conference on Multimedia 2010, Firenze, Italy, 25–29 October 2010; pp. 843–846. [Google Scholar]
  23. Liu, X.; Wang, W. Robustly Extracting Captions in Videos Based on Stroke-Like Edges and Spatio-Temporal Analysis. IEEE Trans. Multim. 2012, 14, 482–489. [Google Scholar] [CrossRef]
  24. Shivakumara, P.; Phan, T.Q.; Tan, C.L. A Laplacian Approach to Multi-Oriented Text Detection in Video. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 412–419. [Google Scholar] [CrossRef] [PubMed]
  25. Fang, S.; Xie, H.; Chen, Z.; Zhu, S.; Gu, X.; Gao, X. Detecting Uyghur text in complex background images with convolutional neural network. Multim. Tools Appl. 2017, 76, 15083–15103. [Google Scholar] [CrossRef]
  26. Cai, Y.; Wang, W. Robustly detect different types of text in videos. Neural Comput. Appl. 2020, 32, 12827–12840. [Google Scholar] [CrossRef]
  27. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  28. Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4161–4167. [Google Scholar]
  29. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.; Bai, X. Rotation-Sensitive Regression for Oriented Scene Text Detection. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918. [Google Scholar]
  30. Lyu, P.; Yao, C.; Wu, W.; Yan, S.; Bai, X. Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7553–7563. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  33. Wan, Q.; Ji, H.; Shen, L. Self-Attention Based Text Knowledge Mining for Text Detection. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5983–5992. [Google Scholar]
  34. Zhang, S.; Zhu, X.; Yang, C.; Wang, H.; Yin, X. Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1285–1294. [Google Scholar]
  35. Cai, Y.; Liu, C.; Cheng, P.; Du, D.; Zhang, L.; Wang, W.; Ye, Q. Scale-Residual Learning Network for Scene Text Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2725–2738. [Google Scholar] [CrossRef]
  36. Cai, Y.; Wang, W.; Chen, Y.; Ye, Q. IOS-Net: An inside-to-outside supervision network for scale robust text detection in the wild. Pattern Recognit. 2020, 103, 107304. [Google Scholar] [CrossRef]
  37. Yang, Q.; Cheng, M.; Zhou, W.; Chen, Y.; Qiu, M.; Lin, W. IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1071–1077. [Google Scholar]
  38. Tang, J.; Zhang, W.; Liu, H.; Yang, M.; Jiang, B.; Hu, G.; Bai, X. Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection. In Proceedings of the Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4553–4562. [Google Scholar]
  39. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An Efficient and Accurate Scene Text Detector. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2642–2651. [Google Scholar]
  40. Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier Contour Embedding for Arbitrary-Shaped Text Detection. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3123–3131. [Google Scholar]
  41. Wang, F.; Chen, Y.; Wu, F.; Li, X. TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 111–119. [Google Scholar]
  42. Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-Time Scene Text Detection with Differentiable Binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11474–11481. [Google Scholar]
  43. Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape Robust Text Detection With Progressive Scale Expansion Network. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9336–9345. [Google Scholar]
  44. Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8439–8448. [Google Scholar]
  45. Liu, Y.; Chen, H.; Shen, C.; He, T.; Jin, L.; Wang, L. ABCNet: Real-Time Scene Text Spotting With Adaptive Bezier-Curve Network. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9806–9815. [Google Scholar]
  46. Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition. In Proceedings of the Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4583–4593. [Google Scholar]
  47. Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1083–1090. [Google Scholar]
  48. Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  49. Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2018; Volume 11206, pp. 19–35. [Google Scholar]
  50. Zhang, S.; Zhu, X.; Hou, J.; Liu, C.; Yang, C.; Wang, H.; Yin, X. Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9696–9705. [Google Scholar]
Figure 1. Examples of different text-style categories. (a) Overlaid text detection from USTB-VidTEXT [5]. (b) Layered text detection from AcTiv [6]. (c) Scene text detection from ICDAR2015 [7]. (d) The proposed Text3C dataset, which contains all three text types and supports research on the challenges of multidimensional text detection. We mark the overlaid, layered, and scene text instances in orange, blue, and green, respectively. Best viewed in color.
Figure 2. Illustrations of our Text3C dataset, which includes 13 training videos and 7 test videos. Text types are labeled as overlaid text (OT), layered text (LT), or scene text (ST). Text languages include Chinese (CT), English (ET), and Other (OT). Text instances can be framed in a rectangular box (RB), quadrilateral box (QB), or polygonal box (PB). The red bar at the top of each image denotes the video category, text type, text language, and text box, in the format (category|type|language|box). Best viewed in color and zoomed in.
Figure 3. Statistical comparison of the established Text3C dataset. The dataset includes 13 training videos (a) and 7 test videos (b), and each video contains a different number of video frames. The examples of text types (c) and languages (d) show similar statistical distributions in the training and test sets.
Figure 4. The generalized framework of the detection and classification network. It consists of three components, i.e., backbone, neck, and head. The backbone network (f_bac) receives the input image (X) and extracts image features at different scales. The neck stack (f_nec) fuses and refines the multi-scale features and provides the combined features to the prediction head. The prediction head contains two branches: the upper branch predicts the category of the text via f_cls and f_typ, and the lower branch predicts the position of the text via f_loc and f_box (the same as the prediction head of the original detection algorithm). The final results comprise the position map and the corresponding category map.
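For readers who wish to prototype the two-branch head described in Figure 4, the following is a minimal PyTorch-style sketch. The module names, channel sizes, and output layout are illustrative assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of a two-branch prediction head (assumption: PyTorch,
# fused multi-scale features already produced by a backbone + neck).
import torch
import torch.nn as nn

class TextDCHead(nn.Module):
    """Two-branch head: per-pixel text localization map + text-type classification map."""
    def __init__(self, in_channels=256, num_types=3):
        super().__init__()
        # Lower branch (f_loc, f_box in the figure): text-region / kernel localization maps.
        self.loc_branch = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 1))   # e.g., one text-region map and one shrunk-kernel map
        # Upper branch (f_cls, f_typ): category map (overlaid / layered / scene).
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_types, 1))

    def forward(self, fused_features):
        position_map = self.loc_branch(fused_features)
        category_map = self.cls_branch(fused_features)
        return position_map, category_map

# Usage with a dummy fused feature map standing in for the neck output (f_nec).
head = TextDCHead()
feat = torch.randn(1, 256, 160, 160)
pos, cat = head(feat)
print(pos.shape, cat.shape)  # torch.Size([1, 2, 160, 160]) torch.Size([1, 3, 160, 160])
```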
Figure 5. The generation process of the localization labels (right component) and the classification labels (left component). The localization labels consist of the text-box ground truths, generated by normalizing the ground-truth boxes, and the segmentation- and kernel-region ground truths, generated by segmenting and shrinking the ground-truth map. The classification label consists of three individual segmentation maps, one per text type.
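As a companion to Figure 5, the sketch below shows one way such labels can be rasterized: each text polygon is filled into a segmentation map and into one classification map per text type, and a shrunk copy of the polygon forms the kernel map. The shrink formula, the libraries (shapely, OpenCV), and the type encoding are assumptions for illustration, not the exact procedure used in the paper.

```python
# Illustrative label generation (assumptions: shapely for polygon shrinking,
# OpenCV for rasterization; shrink rule stands in for the paper's exact recipe).
import numpy as np
import cv2
from shapely.geometry import Polygon

TYPE_TO_ID = {"overlaid": 0, "layered": 1, "scene": 2}  # hypothetical type encoding

def generate_labels(h, w, annotations, shrink=0.4):
    """annotations: list of (polygon_points as Nx2 sequence, text_type)."""
    seg_map = np.zeros((h, w), np.uint8)                    # full text regions
    kernel_map = np.zeros((h, w), np.uint8)                 # shrunk text kernels
    cls_maps = np.zeros((len(TYPE_TO_ID), h, w), np.uint8)  # one map per text type
    for pts, text_type in annotations:
        poly = np.asarray(pts, np.int32)
        cv2.fillPoly(seg_map, [poly], 1)
        cv2.fillPoly(cls_maps[TYPE_TO_ID[text_type]], [poly], 1)
        # Shrink the polygon towards its interior to obtain the kernel region.
        shp = Polygon(pts)
        offset = shp.area * (1 - shrink ** 2) / max(shp.length, 1e-6)
        shrunk = shp.buffer(-offset)
        if not shrunk.is_empty and shrunk.geom_type == "Polygon":
            kpoly = np.asarray(shrunk.exterior.coords, np.int32)
            cv2.fillPoly(kernel_map, [kpoly], 1)
    return seg_map, kernel_map, cls_maps
```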
Figure 6. Visualization results of different methods on the proposed Text3C dataset. The illustrated detection and classification results were predicted by (a) TextDC(DBNet), (b) TextDC(TextSnake), (c) TextDC(DRRG), (d) TextDC(PAN), (e) TextDC(FCENet), and (f) TextDC(PSENet), respectively. Notably, overlaid, layered, and scene text instances are indicated by orange, blue, and green boxes, respectively. Best viewed in color and zoomed in.
Figure 7. Failure cases of the TextDC(PSENet) algorithm on the proposed Text3C dataset. (a) Detection and classification errors on scene text instances. (b) Classification errors on layered text instances. Notably, overlaid, layered, and scene text instances are indicated by orange, blue, and green boxes, respectively. Best viewed in color and zoomed in.
Table 1. The statistical comparison of the Text3C dataset and current mainstream datasets. We use ✓ to indicate available features and - for unavailable features. For the language attribute, we use Chn, Eng, and Oth to denote Chinese, English, and Other languages, respectively. Notably, the dataset AcTiV-D contains Arabic-language videos, which we group as Other. For the type attribute, we use OT, LT, and ST to indicate overlaid text, layered text, and scene text, respectively. Bold indicates the largest value.
Dataset | #Image/Frame (Train / Test / All) | #Text Instance (Train / Test / All) | Language (Chn, Eng, Oth) / Type (OT, LT, ST)
ICDAR 2013 [9] | 229 / 233 / 462 | 848 / 1095 / 1943 | - - - -
ICDAR 2015 [7] | 1000 / 500 / 1500 | 11,886 / 5230 / 17,116 | - - -
MSRA-TD500 [47] | 300 / 200 / 500 | 1068 / 651 / 1719 | - - - -
COCO-Text [13] | 43,686 / 20,000 / 63,686 | 118,309 / 27,550 / 145,859 | - - -
AcTiV-D [20] | 1480 / 363 / 1843 | 4133 / 1000 / 5133 | - - - -
USTB-VidTEXT [5] | 9839 / 17,831 / 27,670 | 14,492 / 27,440 / 41,932 | - - - -
UCAS-STLData [12] | 36,564 / 20,506 / 57,070 | 26,837 / 14,358 / 41,195 | - - - -
ICDAR 2013 VT [9] | 9790 / 5487 / 15,277 | 67,800 / 26,134 / 93,934 | - - -
Text3C (Ours) | 29,534 / 10,806 / 40,340 | 396,096 / 199,595 / 595,691 | ✓ ✓ ✓ / ✓ ✓ ✓
Table 2. Comparison results of the state-of-the-art text detectors on the established Text3C dataset, evaluated with the ICDAR15 protocols [7] and the proposed TextDC protocols without category constraints. Bold indicates the best performance.
Method | ICDAR15 Protocol (P / R / F) | TextDC Protocol (mP / mR / mF)
DBNet [42] | 0.747 / 0.249 / 0.374 | 0.439 / 0.147 / 0.220
TextSnake [49] | 0.789 / 0.526 / 0.631 | 0.452 / 0.301 / 0.361
DRRG [50] | 0.761 / 0.569 / 0.651 | 0.488 / 0.312 / 0.381
PAN [44] | 0.783 / 0.608 / 0.685 | 0.440 / 0.312 / 0.365
FCENet [40] | 0.893 / 0.688 / 0.777 | 0.564 / 0.435 / 0.491
PSENet [43] | 0.881 / 0.617 / 0.726 | 0.631 / 0.442 / 0.520
Table 3. Comparison results of the detection and classification performance on the proposed Text3C dataset, evaluated with the ICDAR15 protocols with category constraints [7] and the proposed TextDC protocols. Bold indicates the best performance.
Method | ICDAR15 Protocol (P_c / R_c / F_c) | TextDC Protocol (mP_c / mR_c / mF_c)
TextDC(DBNet [42]) | 0.507 / 0.169 / 0.254 | 0.320 / 0.107 / 0.160
TextDC(TextSnake [49]) | 0.690 / 0.460 / 0.552 | 0.406 / 0.270 / 0.325
TextDC(DRRG [50]) | 0.646 / 0.482 / 0.552 | 0.425 / 0.272 / 0.332
TextDC(PAN [44]) | 0.694 / 0.539 / 0.606 | 0.399 / 0.283 / 0.331
TextDC(FCENet [40]) | 0.676 / 0.521 / 0.588 | 0.433 / 0.333 / 0.377
TextDC(PSENet [43]) | 0.790 / 0.553 / 0.651 | 0.576 / 0.403 / 0.474
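To make the category-constrained evaluation of Table 3 concrete, the sketch below counts a detection as a true positive only when it overlaps an unmatched ground-truth box above an IoU threshold and, when category constraints are enabled, also predicts the correct text type. This is an illustrative simplification; the exact TextDC protocol follows the authors' definition.

```python
# Illustrative sketch (assumption): IoU matching with an optional category constraint.
from shapely.geometry import Polygon

def iou(poly_a, poly_b):
    a, b = Polygon(poly_a), Polygon(poly_b)
    inter = a.intersection(b).area
    union = a.union(b).area
    return inter / union if union > 0 else 0.0

def prf(preds, gts, iou_thr=0.5, use_category=True):
    """preds/gts: lists of (polygon_points, text_type); returns precision, recall, F-measure."""
    matched_gt, tp = set(), 0
    for p_pts, p_type in preds:
        for i, (g_pts, g_type) in enumerate(gts):
            if i in matched_gt:
                continue
            if iou(p_pts, g_pts) >= iou_thr and (not use_category or p_type == g_type):
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```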
Table 4. Comparison of detection, classification, and speed of TextDC with state-of-the-art methods on the established Text3C dataset. Bold indicates the best performance.
Method | Epoch | Backbone | mF_c | CP | FPS
TextDC(DBNet [42]) | 30 | ResNet50 | 0.160 | 0.784 | 11.1
TextDC(TextSnake [49]) | 30 | ResNet50 | 0.325 | 0.923 | 2.1
TextDC(DRRG [50]) | 30 | ResNet50 | 0.332 | 0.845 | 3.1
TextDC(PAN [44]) | 30 | ResNet50 | 0.331 | 0.907 | 39.6
TextDC(FCENet [40]) | 30 | ResNet50 | 0.377 | 0.805 | 21.6
TextDC(PSENet [43]) | 30 | ResNet50 | 0.474 | 0.920 | 6.8