Article

Trademark Text Recognition Combining SwinTransformer and Feature-Query Mechanisms

by Boxiu Zhou, Xiuhui Wang *, Wenchao Zhou and Longwen Li
Department of Computer Science and Technology, China Jiliang University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(14), 2814; https://doi.org/10.3390/electronics13142814
Submission received: 8 June 2024 / Revised: 12 July 2024 / Accepted: 16 July 2024 / Published: 17 July 2024
(This article belongs to the Section Artificial Intelligence)

Abstract:
The task of trademark text recognition is a fundamental component of scene text recognition (STR), which currently faces a number of challenges, including the presence of unordered, irregular or curved text, as well as text that is distorted or rotated. In applications such as trademark infringement detection and analysis of brand effects, the diversification of artistic fonts in trademarks and the complexity of the product surfaces where the trademarks are located pose major challenges for relevant research. To tackle these issues, this paper proposes a novel recognition framework named SwinCornerTR, which aims to enhance the accuracy and robustness of trademark text recognition. Firstly, a novel feature-extraction network based on SwinTransformer with EFPN (enhanced feature pyramid network) is proposed. By incorporating SwinTransformer as the backbone, efficient capture of global information in trademark images is achieved through the self-attention mechanism and enhanced feature pyramid module, providing more accurate and expressive feature representations for subsequent text extraction. Then, during the encoding stage, a novel feature point-retrieval algorithm based on corner detection is designed. The OTSU-based fast corner detector is presented to generate a corner map, achieving efficient and accurate corner detection. Furthermore, in the encoding phase, a feature point-retrieval mechanism based on corner detection is introduced to achieve priority selection of key-point regions, eliminating character-to-character lines and suppressing background interference. Finally, we conducted extensive experiments on two open-access benchmark datasets, SVT and CUTE80, as well as a self-constructed trademark dataset, to assess the effectiveness of the proposed method. Our results showed that the proposed method achieved accuracies of 92.9%, 92.3% and 84.8%, respectively, on these datasets. These results demonstrate the effectiveness and robustness of the proposed method in the analysis of trademark data.

1. Introduction

Scene Text Recognition (STR) has garnered significant attention as an important research area in image visual understanding tasks, owing to its wide applications in fields such as autonomous driving [1], intelligent navigation [2,3] and key entity recognition [4,5]. It also makes substantial contributions to multimodal analysis [6] and text understanding [7,8]. Trademarks are a common occurrence in natural scenarios, and are often designed to incorporate artistic text and graphics to enhance brand identity and recognition. These texts contain decorative elements and unconventional designs, which result in a high degree of visual variation, making them more difficult to recognise.
For irregular text in natural scenes, Shi et al. [9] first proposed the use of spatial transformer networks (STNs) and thin-plate spline (TPS) transforms to fit non-rigid transformations such as perspective and bending, aimed at correcting irregular text. Building on this model, the authors also proposed the attentional scene text recognizer with flexible rectification (ASTER) [10], which employs bidirectional decoding to effectively improve accuracy. To obtain more text features, Sheng et al. [11] introduced a no-recurrence sequence-to-sequence text recognizer (NRTR), which relies primarily on the attention mechanism and uses a complete Transformer structure to encode and decode the input image. This model demonstrated the effectiveness of the Transformer structure in text recognition. Despite the success of many current methods in irregular text recognition, they overlook the challenges of recognising different styles of fonts, overlapping fonts and artistic characters, rendering them ineffective when applied directly to trademark text recognition.
To address the aforementioned challenges in trademark text recognition, we propose a novel trademark text-recognition network, SwinCornerTR. The main contributions of this paper are as follows.
  • A novel feature-extraction network based on EFPN-SwinTransformer is proposed. By incorporating SwinTransformer as the backbone network, efficient capture of global information in trademark images is achieved through the self-attention mechanism and enhanced feature pyramid module. This provides more accurate and expressive feature representations for subsequent text extraction, leading to improved text-extraction accuracy and efficiency.
  • A novel feature point-retrieval algorithm based on corner detection is designed. The OTSU-based fast corner detector is presented to generate a corner map. It utilises OTSU’s method for selecting an adaptive threshold and subsequently conducts corner detection according to this threshold, thereby achieving both efficient and accurate corner identification. In addition, in the encoding phase, a feature point-retrieval mechanism based on corner detection is introduced to achieve priority selection of key-point regions, eliminating character-to-character lines and suppressing background interference.

2. Methodology

Figure 1 illustrates the system architecture. Initially, the input image is processed by an OTSU-FAST (features from accelerated segment test) corner detector, producing a corner map. The original image and the corner map are then separately inputted into the SwinTransformer [12] for feature extraction. The extracted raw image features are processed using a self-attention mechanism to capture local features, which are then combined with the corner point features using a Feature-Query Attention mechanism to capture global features. The encoder produces the feature map, which is then used by the decoder to extract a series of characters.
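As a high-level orientation, the following sketch shows how these components could be composed in code; the class and its constructor arguments (backbone, corner_detector, encoder, decoder) are hypothetical placeholders standing in for the modules of Figure 1, not the authors' implementation.

```python
import torch.nn as nn

class SwinCornerTRSketch(nn.Module):
    """Hypothetical composition of the Figure 1 pipeline (placeholder modules, not the authors' code)."""
    def __init__(self, backbone, corner_detector, encoder, decoder):
        super().__init__()
        self.corner_detector = corner_detector   # OTSU-FAST corner-map generator
        self.backbone = backbone                 # SwinTransformer + EFPN feature extractor
        self.encoder = encoder                   # self-attention plus Feature-Query Attention
        self.decoder = decoder                   # decoder that emits the character sequence

    def forward(self, image):
        corner_map = self.corner_detector(image)   # corner map of the input image
        x = self.backbone(image)                   # features X of the original image
        q = self.backbone(corner_map)              # features Q of the corner map
        memory = self.encoder(q, x)                # fuse features via FQA(Q, X, X)
        return self.decoder(memory)                # predicted character sequence
```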

2.1. Feature Extraction Based on EFPN-SwinTransformer

The current widely used CRNN-based (convolutional recurrent neural network) text-recognition system has low accuracy for arbitrary shapes and text with complex backgrounds. The main reason is that traditional CNN models, in the stage of encoding and extracting image features, cannot effectively capture the spatial information of images. For text images with complex shapes and highly variable backgrounds, the extracted features are insufficient to capture both the global information and detailed features of the text. Additionally, traditional models use LSTM (long short-term memory) to convert features extracted by CNN into text, but LSTM struggles to fully capture long-term dependencies and to relate features that span multiple characters. This often results in image features being translated into similar-looking but incorrect characters during the decoding stage, thereby reducing the accuracy of text recognition. To address the aforementioned challenges, this paper utilises the SwinTransformer as the backbone network, capitalising on its hierarchical structure and sliding window multi-head self-attention mechanism. Initially, the input image is processed by the SwinTransformer to extract fundamental patterns and textures. After feature extraction, the hierarchical feature map is input into our proposed EFPN module, thereby enhancing the feature-representation capability. By analysing image textures at various scales, the model achieves a comprehensive understanding and synthesis of texture information. These extracted features are subsequently utilised in self-attention blocks for further processing.
The structure of SwinTransformer is illustrated in Figure 2, which adopts a hierarchical design and contains four stages. In a process known as Patch Merging, each stage expands the receptive field by gradually decreasing the resolution of the input feature maps layer by layer. This process merges neighboring small patches into a large patch, so that the large patch encompasses the contents of the four small patches, thereby increasing the receptive field and providing more contextual information. This method effectively captures multi-scale features because each large patch combines features from multiple small patches, enabling the model to simultaneously consider both detailed and overall features at different scales. After each Patch Merging, the downsampling rate is doubled, and the feature maps after Stage 2, Stage 3 and Stage 4 are $H/8 \times W/8 \times 2C$, $H/16 \times W/16 \times 4C$ and $H/32 \times W/32 \times 8C$, respectively.
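The Patch Merging step can be written compactly; the sketch below follows the standard SwinTransformer formulation (concatenate each 2 × 2 neighbourhood of patches, normalise, then project the 4C channels down to 2C) and is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Halve the spatial resolution and double the channels, as in SwinTransformer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):               # x: (B, H, W, C), with H and W even
        x0 = x[:, 0::2, 0::2, :]        # top-left patch of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]        # bottom-left
        x2 = x[:, 0::2, 1::2, :]        # top-right
        x3 = x[:, 1::2, 1::2, :]        # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```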
The SwinTransformer Block module is shown in Figure 3. The SwinTransformer module amalgamates both the W-MSA (window-based multiheaded self-attention) and the SW-MSA (shifted window-based multiheaded self-attention) mechanisms. Each MSA module is preceded by a Layer Normalisation (LN) layer, while the latter stages incorporate an additional two layers of Multi-Layer Perceptrons (MLPs). Unlike the standard Multi-Headed Self-Attention (MSA) module in the Transformer Block, the Window-based Multi-Headed Self-Attention (W-MSA) computes the self-attention of local non-overlapping windows instead of global self-attention. This results in SwinTransformer having only linear complexity, while the standard Transformer has quadratic complexity, thereby reducing algorithmic complexity and computation. However, different window information cannot be communicated across windows, which may lead to reduced modeling capabilities. To enhance cross-window connectivity, Shifted Window-based Multi-Headed Self-Attention (SW-MSA) achieves the exchange of information among otherwise non-communicating windows by moving the windows, thus maintaining global information.
An example of SwinTransformer module window segmentation is shown in Figure 4, where the feature map is segmented into non-overlapping windows of size N × N by the Window Multi-head Self-Attention layer (W-MSA). As illustrated in Figure 4a, attention computation is performed within each window, with the specific operations of sliding windows depicted in Figure 4b,c. The initially obtained image blocks after segmentation are shifted and concatenated to yield the final actual window-segmentation approach. For instance, as shown in Figure 4b,c, the number of SW-MSA windows is reduced from 9 to 4, thereby reducing computational load while maintaining global information.
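The window partition and the cyclic shift behind SW-MSA reduce to a few tensor reshapes; the snippet below is a generic sketch of these standard operations (the window size of 4 and the 8 × 8 toy feature map are arbitrary illustration choices).

```python
import torch

def window_partition(x, win):
    """Split a (B, H, W, C) feature map into non-overlapping (win x win) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win, win, C)

def shifted_windows(x, win):
    """Cyclically shift the feature map so SW-MSA windows straddle the W-MSA borders."""
    shifted = torch.roll(x, shifts=(-win // 2, -win // 2), dims=(1, 2))
    return window_partition(shifted, win)

x = torch.randn(1, 8, 8, 96)              # toy feature map
print(window_partition(x, 4).shape)       # torch.Size([4, 4, 4, 96])
print(shifted_windows(x, 4).shape)        # same window count; window contents are shifted
```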
Trademark text often contains artistic fonts, necessitating high-resolution feature maps to capture the intricate details and stylistic elements of the fonts. To enhance the feature-representation capability of the SwinTransformer, we have incorporated an Enhanced Feature Pyramid Network (EFPN) into its architecture. The standard FPN utilises a top-down fusion approach, which has limitations in feature integration. To better preserve lower-level information, we have introduced a bottom-up feature enhancement branch to the standard FPN, as illustrated in Figure 5. Initially, the multi-scale feature maps are fed into the EFPN. Each input layer undergoes lateral convolution with a 1 × 1 convolutional kernel to perform channel matching, transforming the dimensions from (96, 192, 384, 768) to (256, 256, 256, 256). Higher resolution feature maps are 2 × up sampled, matched and merged with the size of the next lower resolution feature map. The 3 × 3 convolution is used to reduce the aliasing effect that occurs during up-sampling after pixel-by-pixel addition. The new branching path is similar to the original path but operates in the opposite direction, performing bottom-up downsampling.
In addition, since the new branch has the same number of feature mapping channels as the top-down path, the 1 × 1 convolution is omitted, simplifying the model and reducing the computational burden. By adding a bottom-up path, effective multi-scale feature fusion can be achieved, resulting in richer feature representations. This approach not only aids in capturing subtle stylistic variations and local details in the fonts, but also optimises feature representation to better accommodate complex text-recognition tasks.
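A possible reading of the EFPN described above is sketched below in PyTorch: lateral 1 × 1 convolutions map the Swin outputs (96, 192, 384, 768 channels) to 256, a top-down path upsamples and adds, 3 × 3 convolutions smooth the sums, and a bottom-up branch downsamples and adds without extra 1 × 1 convolutions. The use of stride-2 3 × 3 convolutions for the bottom-up downsampling is an assumption, not a detail taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class EFPN(nn.Module):
    """Top-down FPN plus a bottom-up enhancement branch (sketch, not the authors' code)."""
    def __init__(self, in_dims=(96, 192, 384, 768), out_dim=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in in_dims)
        self.smooth = nn.ModuleList(nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in in_dims)
        self.down = nn.ModuleList(nn.Conv2d(out_dim, out_dim, 3, stride=2, padding=1)
                                  for _ in in_dims[:-1])

    def forward(self, feats):                      # feats ordered high to low resolution
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down path: upsample the coarser map and add it to the finer one
        td = [lat[-1]]
        for f in reversed(lat[:-1]):
            td.insert(0, f + F.interpolate(td[0], size=f.shape[-2:], mode='nearest'))
        td = [s(f) for s, f in zip(self.smooth, td)]   # 3x3 conv to reduce aliasing
        # bottom-up enhancement branch: downsample the finer map and add it to the coarser one
        bu = [td[0]]
        for i, f in enumerate(td[1:]):
            bu.append(f + self.down[i](bu[-1]))
        return bu                                   # enhanced multi-scale features
```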

2.2. Corner Detection Based on Feature-Query Mechanism

In our study, we observed that the text fonts used in trademark images differ from those used in documents. Generally, the text fonts in trademarks tend to be more distinctive, with complex backgrounds and varying degrees of deformation. Furthermore, the issue of artistic characters is particularly prominent in this context. In order to improve our understanding and knowledge of the overlapping characters, ligature characters and decorative artistic characters found in trademark text, we propose a Feature-Query-based encoder. This encoder is designed to better focus on the key features and stroke inflection points of the text, thus reducing the interference from complex backgrounds on text recognition and enhancing the robustness of trademark text recognition.
We propose an improved FAST corner detector, based on the OTSU method, to generate corner maps. Compared with the Harris detector, it offers faster computation, effectively reducing the feedforward time while maintaining high accuracy. The complete OTSU-FAST algorithm is presented in Algorithm 1.
Algorithm 1 Corner-detection algorithm based on Feature-Query mechanism.
1: The input image undergoes grayscale processing and noise filtering.
2: The image is divided into M × N parts, and the OTSU algorithm is applied to each segmented part to compute an adaptive threshold T for the sub-image. The number of pixels in the sub-part whose grayscale value is less than the threshold T is denoted as $N_0$, and the number of pixels greater than the threshold T is denoted as $N_1$; their relationship with M and N is expressed in Equation (1). The proportions of foreground and background pixels in the sub-part are denoted as $\varepsilon_0$ and $\varepsilon_1$, respectively, given by Equations (2)–(4). The average grayscale values of the foreground and background pixels in the sub-part are denoted as $\varphi_0$ and $\varphi_1$, respectively, and the overall average grayscale value of the image sub-part is denoted as $\varphi$, defined by Equation (5).

$$N_0 + N_1 = M \times N \tag{1}$$

$$\varepsilon_0 = \frac{N_0}{M \times N} \tag{2}$$

$$\varepsilon_1 = \frac{N_1}{M \times N} \tag{3}$$

$$\varepsilon_0 + \varepsilon_1 = 1 \tag{4}$$

$$\varphi = \varepsilon_0 \varphi_0 + \varepsilon_1 \varphi_1 \tag{5}$$
3: The OTSU algorithm is used to compute the threshold T of the sub-image. By iterating over the candidate grayscale thresholds within the sub-region of the image, the threshold T that maximises the inter-class variance g is identified; this T then serves as the threshold for that particular image sub-region. The inter-class variance g can be simplified and expressed as Equation (6).

$$g = \varepsilon_0 \varepsilon_1 (\varphi_0 - \varphi_1)^2 \tag{6}$$
4: The FAST corner-detection algorithm is applied to each segmented sub-region of the image. For each candidate corner, the absolute difference in grayscale between the centre point $f_0$ and each of the 16 candidate points $f_x$ distributed along the circumference of the surrounding circle is calculated, as defined by Equation (7).

$$C_x = |f_x - f_0| \tag{7}$$
5: If $C_x \geq T$, the candidate point is marked as a valid candidate point; otherwise, the point is excluded. By individually comparing the 16 candidate points, if more than nine consecutive points are marked as valid, the centre point is designated as a corner.
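As a rough prototype of Algorithm 1, the sketch below uses OpenCV's Otsu thresholding and FAST detector as stand-ins; the tile grid, the Gaussian pre-filter and the direct reuse of the Otsu value T as the FAST intensity-difference threshold are assumptions for illustration, not details taken from the paper.

```python
import cv2
import numpy as np

def otsu_fast_corner_map(image, tiles=(4, 4)):
    """Sketch of Algorithm 1: a per-tile Otsu threshold drives FAST corner detection."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)              # step 1: noise filtering
    h, w = gray.shape
    th, tw = h // tiles[0], w // tiles[1]
    corner_map = np.zeros_like(gray)
    for i in range(tiles[0]):
        for j in range(tiles[1]):
            sub = gray[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            # steps 2-3: Otsu returns the adaptive threshold T for this sub-image
            T, _ = cv2.threshold(sub, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            # steps 4-5: FAST's segment test compares |f_x - f_0| against the threshold
            fast = cv2.FastFeatureDetector_create(threshold=max(int(T), 1))
            for kp in fast.detect(sub, None):
                x, y = map(int, kp.pt)
                corner_map[i * th + y, j * tw + x] = 255   # mark detected corner
    return corner_map
```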
The threshold selection based on OTSU has enhanced the adaptability of corner detection in various images, as it takes into account the global information of the entire image. As shown in Figure 6, the first row presents the corner map obtained using the improved OTSU-FAST detection method, while the second row displays the corner map resulting from the original FAST detection method. The original FAST algorithm exhibits significant sensitivity to noise, frequently misidentifying noise and discontinuous regions as corners. This results in a failure to accurately highlight the essential textual features.
Upon obtaining the corner map of the input image, both the original image and the corner map are processed through the feature-extraction module, resulting in the generation of the original feature map $X \in \mathbb{R}^{\frac{W}{4} \times \frac{H}{4} \times C}$ and the corner feature map $Q \in \mathbb{R}^{\frac{W}{4} \times \frac{H}{4} \times C}$, respectively. In this context, H, W and C denote the height, width and feature dimension of the feature map, respectively. The original feature map acquires the image feature X through a multi-head self-attention mechanism. This is subsequently combined with the corner feature map Q using the Feature-Query Attention. The Feature-Query Mechanism based on corner detection can be formulated as follows:
$$\mathrm{FQA}(Q, X, X) = \mathrm{softmax}\!\left(\frac{Q X^{T}}{\sigma}\right) X$$
Here, the corner feature Q is used as the query, and the image feature X is used as both the keys and the values. FQA denotes the Feature-Query Attention, and σ is a scaling factor. The softmax function is applied to obtain the weights over the values.
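For reference, the formula above is a cross-attention in which the flattened corner features act as queries over the image features; the following single-head sketch (with toy tensor sizes and σ = √C as an assumed scaling choice) illustrates the computation rather than the authors' implementation.

```python
import torch

def feature_query_attention(Q, X, sigma):
    """FQA(Q, X, X) = softmax(Q X^T / sigma) X: corner features query image features."""
    # Q: (B, N, C) flattened corner feature map; X: (B, N, C) image features (keys and values)
    attn = torch.softmax(Q @ X.transpose(-2, -1) / sigma, dim=-1)   # (B, N, N) attention weights
    return attn @ X                                                 # (B, N, C) fused features

B, N, C = 2, 196, 256                      # toy sizes: 14x14 spatial positions, 256 channels
Q, X = torch.randn(B, N, C), torch.randn(B, N, C)
out = feature_query_attention(Q, X, sigma=C ** 0.5)
print(out.shape)                           # torch.Size([2, 196, 256])
```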

3. Experiments

The experiments in this paper were conducted on a GeForce RTX 3090 (NVIDIA, Santa Clara, CA, USA) and implemented using PyTorch 2.3.1. During the training process, the image size was adjusted to 224 × 224, the dimensions of the encoder and decoder outputs were set to 512, the number of attention heads h was 8 and the numbers of self-attention layers in the encoder and decoder were 10 and 5, respectively. Training was conducted using the Adam optimiser [13], with an initial learning rate of 3 × 10⁻⁴. The number of epochs was set to 20, and the batch size was set to 128. The learning rate was reduced to 3 × 10⁻⁵ at the 11th epoch.
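One way to realise this schedule in PyTorch is sketched below; the stand-in model and the use of MultiStepLR with gamma = 0.1 after ten epochs are assumptions that reproduce the stated 3 × 10⁻⁴ to 3 × 10⁻⁵ drop, not the authors' training script.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)    # stand-in for the SwinCornerTR network (placeholder)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# multiply the learning rate by 0.1 once ten epochs have completed: 3e-4 -> 3e-5 from epoch 11
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

for epoch in range(20):        # 20 epochs; the batch size of 128 would be set in the DataLoader
    # ... one pass over the training data (forward, loss, backward, optimizer.step) goes here ...
    scheduler.step()
    print(f"epoch {epoch + 1}: lr = {optimizer.param_groups[0]['lr']:.0e}")
```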

3.1. Datasets and Evaluation Criteria

The SwinCornerTR model was trained on two widely recognised STR training datasets, Mjsynth [14] and SynthText [15], in addition to a proprietary trademark text dataset, Trademark, which we developed. The model was evaluated on two standard benchmarks as well as the Trademark dataset. The benchmarks include a regular scene text dataset, Street View Text (SVT) [16] and an irregular scene text dataset, CUTE80 (CT80) [17]. The Trademark dataset consists of a training set with 4000 images and a test set with 1500 images. The dataset includes fonts with overlapping characters, ligatures, distortions and decorative elements to ensure a thorough evaluation of the model’s performance across a variety of trademark scenarios.
To evaluate the performance of the model, we adopt three evaluation metrics: accuracy, precision and recall. Accuracy is the most basic evaluation metric, representing the ratio between the number of correctly recognised characters and the total number of characters. The other criteria are defined as:
$$\mathrm{precision} = \frac{TP}{TP + FP}$$

$$\mathrm{recall} = \frac{TP}{TP + FN}$$
where TP denotes positive samples that are correctly classified, FP denotes negative samples that are incorrectly classified as positive and FN denotes positive samples that are incorrectly classified as negative.
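The two metrics follow directly from these counts; the brief sketch below uses illustrative numbers rather than values from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# illustrative counts, not results from the paper
p, r = precision_recall(tp=938, fp=62, fn=55)
print(f"precision = {p:.4f}, recall = {r:.4f}")
```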

3.2. Comparative Experiments

To verify the advantages of SwinCornerTR in trademark text-recognition tasks, we conducted a series of evaluations. We selected several current mainstream and advanced scene text-recognition methods for comparison and applied them to our constructed dataset, Trademark. The validation results are presented in Table 1. The CRNN model achieves an accuracy of only 59.43%, attributed to its feature-extraction network, which consists of a simple stack of CNNs lacking an attention mechanism and possessing limited receptive fields, thereby leading to suboptimal recognition performance. Although SATRN achieves a precision of 91.22% and a recall of 90.96%, its accuracy is merely 78.93%, indicating that the overall recognition performance of the model remains inadequate. Compared to other advanced scene text-recognition methods, SwinCornerTR achieves optimal performance on the trademark dataset, with a recognition accuracy of 84.88%. It improves the accuracy by 25.45% compared to CRNN (CNN + RNN) and 5.95% compared to SATRN (CNN-Transformer). This signifies the superior efficacy of SwinCornerTR in trademark text-recognition tasks when compared to extant methodologies. The rationale appears to lie in the integration of the SwinTransformer as the principal neural network framework. Through harnessing its self-attention mechanism and an augmented feature pyramid module, the system adeptly captures comprehensive global information from trademark images. Consequently, this facilitates the generation of more precise and informative feature representations, which in turn significantly enhance both the precision and efficacy of the ensuing text-extraction processes.
In order to thoroughly evaluate the model’s generalisation performance, validation was performed on two benchmark datasets: SVT, a regular scene text set, and CUTE, an irregular scene text set. Meanwhile, we compared the experimental results with other advanced text-recognition methods. The detailed experimental results are listed in Table 2. As can be observed from the table, SwinCornerTR achieves competitive results on both regular and irregular text datasets. LevOCR and SwinCornerTR perform comparably on the SVT dataset, while ABINet outperforms SwinCornerTR by 0.6% on the same dataset. However, SwinCornerTR surpasses both LevOCR and ABINet on the irregular CUTE dataset, improving accuracy by 0.6% and 3.1%, respectively. These results highlight the excellent performance of our model in processing different types of scene texts, especially the significant improvement in irregular text scenes. This indicates that our approach has a stronger generalisation ability for a wide range of scene text-recognition tasks. The most plausible explanation is the use of Otsu’s method to determine an adaptive threshold within the proposed approach; corner detection is then executed according to this threshold, which leads to both efficient and precise corner detection. In addition, during the encoding phase, the feature point-retrieval strategy based on corner detection gives priority to salient key-point regions, eliminating inter-character lines and suppressing interference from background clutter.
Furthermore, in order to ascertain the stability of the results, we randomly selected 500 trademark text images as a sample and employed the model to predict each image, while recording the accuracy rate for every prediction. The standard deviation was calculated to evaluate the fluctuations in the accuracy rates. It was found that the standard deviation remained relatively stable at around 0.06, indicating that the model’s accuracy is quite consistent. Through this approach, we are able to assess the performance of the model more objectively and determine, based on the magnitude of the variance, whether further optimisation of either the model or the dataset is required.

3.3. Ablation Experiments

In order to validate the effectiveness of the added modules, we conducted ablation studies on SwinCornerTR, using CNN as the baseline backbone. The results of these experiments were evaluated on the Trademark dataset, and the experimental results are shown in Table 3. In conducting ablation experiments, we compared different backbones. The introduction of the SwinTransformer with a standard FPN achieves an improvement of 1.3% compared to using a CNN as the baseline backbone. Additionally, the combination of the improved EFPN with the SwinTransformer results in a further improvement of 0.5%. Furthermore, the model incorporating the Feature-Query Mechanism achieved an additional 2.5% performance improvement relative to the baseline model. Ultimately, we fused the SwinTransformer with the Feature-Query Mechanism; our model achieved an overall performance enhancement of 3.5% compared to the baseline. These results demonstrate the effectiveness of our model structure. The SwinTransformer, employed as the backbone, excels in trademark text recognition, enabling more effective extraction of image text features. Furthermore, the Feature-Query Mechanism effectively focuses on the key points of text features and stroke inflection points, thereby enhancing recognition capability and demonstrating superiority in handling complex trademark text.
Given that corner detection is crucial for the effectiveness of the proposed method, we proposed an improved FAST corner detector based on OTSU to generate a corner map. To verify its superiority, we compared it with the original FAST and the widely used Harris detector. The comparison results, shown in Table 4, clearly demonstrate that the accuracy of our proposed OTSU-FAST detector improved by 0.6% compared to the original FAST corner detector and by 0.9% compared to the Harris corner detector. Our method also achieves a significant improvement in detection speed compared to the Harris detector. This not only enhances the accuracy of corner detection but also reduces the model’s feedforward time, making the overall system more efficient.

3.4. Performance of Text Recognition

A series of representative samples were selected for comparative analysis. The text in these samples exhibited complex characteristics, including overlapping strokes, ligatures, distortions and artistic decorations. The recognition results are shown in Figure 7, where GT represents the ground-truth recognition results.
In example a, the character is a ligature and is correctly identified by both SATRN and the method proposed in this paper, while CRNN correctly identifies only four out of the nine characters. In example b, although the font is not visibly deformed, the letter ‘E’ opens to the left, which differs from the standard writing direction in the document. SATRN and CRNN misidentify it as ‘z’ and ‘i’, respectively. In examples c, d and e, the text contains decorative elements. Particularly in examples d and e, the characters are heavily overlapped, yet SATRN and CRNN can only recognise some of the more distinct characters. In example e, both methods only recognise ‘A’ and fail to identify the overlapping ‘L’. In example f, the font is highly overlapped and decorative, with a complex background. Despite these challenges, our model still correctly recognises the text, whereas CRNN fails to correctly recognise both the characters and their count.
SwinCornerTR demonstrates excellent performance in these examples, which can be attributed to the superior ability of the SwinTransformer backbone in extracting text features. Additionally, the introduction of a Feature-Query Mechanism based on corner detection effectively focuses on the key points of the text features, thereby enhancing the model’s robustness. Overall, these examples demonstrate the outstanding performance of the proposed method in handling complex text scenarios, including ligatures, distortions and artistic decorations.

4. Conclusions

Significant progress has been made in scene text recognition, especially in relatively regular scenes. However, existing text-recognition methods do not perform well when faced with trademark texts that exhibit diverse shapes and variations. To address this issue, this paper proposes SwinCornerTR, a framework for trademark text recognition. This framework utilises the advanced SwinTransformer as its backbone and introduces a Feature-Query Mechanism based on corner detection. In addition, we have constructed a trademark text dataset named Trademark. Experimental results demonstrate that our method achieves optimal performance on the trademark dataset. In summary, the SwinCornerTR framework offers an effective solution for trademark text recognition.
Nonetheless, despite these successes, the current method is not without limitations, and identifying these constraints points to directions for improvement. Future work may concentrate on developing models that better handle images with severe distortion and text that blends closely into its background. Such progress is likely to involve refinements to the algorithms, training protocols or datasets employed by the models, leading to greater accuracy and robustness across a broader range of practical applications.

Author Contributions

Conceptualisation, X.W. and B.Z.; methodology, X.W., B.Z. and W.Z.; software, B.Z. and L.L.; validation, B.Z. and W.Z.; formal analysis, X.W. and B.Z.; investigation, X.W.; resources, X.W. and B.Z.; data curation, B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, X.W. and B.Z.; visualisation, B.Z. and W.Z.; supervision, X.W.; project administration, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Research and Development Program of China (No. 2021YFC3340402).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We thank Zhiduoduo Technology Co., Ltd. for providing the experimental data used in this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, C.; Yuefeng, T.; Du, K.; Ding, W.; Wang, B.; Liu, J.; Wang, W. Character-level Street View Text Spotting Based on Deep Multi-Segmentation Network for Smarter Autonomous Driving. IEEE Trans. Artif. Intell. 2021, 3, 297–308. [Google Scholar] [CrossRef]
  2. Rong, X.; Li, B.; Muñoz, J.; Xiao, J.; Arditi, A.; Tian, Y. Guided Text Spotting for Assistive Blind Navigation in Unfamiliar Indoor Environments. In Proceedings of the Advances in Visual Computing: 12th International Symposium, ISVC 2016, Las Vegas, NV, USA, 12–14 December 2016; Volume 10073, pp. 11–22. [Google Scholar] [CrossRef]
  3. Wang, H.C.; Finn, C.; Paull, L.; Kaess, M.; Rosenholtz, R.; Teller, S.; Leonard, J. Bridging text spotting and SLAM with junction features. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; pp. 3701–3708. [Google Scholar] [CrossRef]
  4. Wang, J.; Liu, C.; Jin, L.; Tang, G.; Zhang, J.; Zhang, S.; Wang, Q.; Wu, Y.; Cai, M. Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2738–2745. [Google Scholar] [CrossRef]
  5. Zhang, P.; Xu, Y.; Cheng, Z.; Pu, S.; Lu, J.; Qiao, L.; Niu, Y.; Wu, F. TRIE: End-to-End Text Reading and Information Extraction for Document Understanding. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1413–1422. [Google Scholar] [CrossRef]
  6. Wu, J.; Du, J.; Wang, F.; Yang, C.; Jiang, X.; Hu, J.; Yin, B.; Zhang, J.; Dai, L. A Multimodal Attention Fusion Network with a Dynamic Vocabulary for TextVQA. Pattern Recognit. 2021, 122, 108214. [Google Scholar] [CrossRef]
  7. Yang, Z.; Lu, Y.; Wang, J.; Yin, X.; Florencio, D.; Wang, L.; Zhang, C.; Zhang, L.; Luo, J. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 19–25 June 2021; pp. 8747–8757. [Google Scholar] [CrossRef]
  8. Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; Rohrbach, M. Towards VQA Models That Can Read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8309–8318. [Google Scholar] [CrossRef]
  9. Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust Scene Text Recognition with Automatic Rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4168–4176. [Google Scholar] [CrossRef]
  10. Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048. [Google Scholar] [CrossRef] [PubMed]
  11. Sheng, F.; Chen, Z.; Xu, B. NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 781–786. [Google Scholar] [CrossRef]
  12. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. SwinTransformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  13. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  14. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. arXiv 2014, arXiv:1406.2227. [Google Scholar]
  15. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic Data for Text Localisation in Natural Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar] [CrossRef]
  16. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar] [CrossRef]
  17. Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 2014, 41, 8027–8048. [Google Scholar] [CrossRef]
  18. Li, H.; Wang, P.; Shen, C.; Zhang, G. Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8610–8617. [Google Scholar] [CrossRef]
  19. Lee, J.; Park, S.; Baek, J.; Oh, S.J.; Kim, S.; Lee, H. On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 2326–2335. [Google Scholar] [CrossRef]
  20. Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7094–7103. [Google Scholar] [CrossRef]
  21. Bhunia, A.; Sain, A.; Kumar, A.; Ghose, S.; Chowdhury, P.; Song, Y.Z. Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14920–14929. [Google Scholar] [CrossRef]
  22. Zhang, X.; Zhu, B.; Yao, X.; Sun, Q.; Li, R.; Yu, B. Context-Based Contrastive Learning for Scene Text Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 3353–3361. [Google Scholar] [CrossRef]
  23. Da, C.; Wang, P.; Yao, C. Levenshtein OCR. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 322–338. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed SwinCornerTR network.
Figure 2. Architecture of the proposed SwinTransformer.
Figure 3. The SwinTransformer Block.
Figure 4. A demonstration of SwinTransformer’s window partitioning.
Figure 5. Architecture of the proposed EFPN.
Figure 6. Comparison of corner map.
Figure 7. Examples of trademark text-recognition result, where different scenarios denoted by ’a’ to ’f’ represent distinct typical situations.
Table 1. Accuracy comparison with other STR methods on the Trademark dataset.

Method | Accuracy (%) | Precision (%) | Recall (%)
CRNN [9] | 59.43 | 84.34 | 83.55
ASTER [10] | 68.54 | 86.97 | 85.79
NRTR [11] | 70.32 | 87.81 | 86.74
SAR [18] | 74.84 | 89.56 | 89.64
SATRN [19] | 78.93 | 91.22 | 90.96
SwinCornerTR | 84.88 | 93.84 | 94.47
Table 2. Accuracy comparison with other STR methods on the benchmark datasets.

Method | SVT (%) | CUTE (%)
CRNN [9] | 80.8 | -
ASTER [10] | 89.5 | 79.5
NRTR [11] | 91.5 | 80.9
SAR [18] | 84.5 | 83.3
SATRN [19] | 91.3 | 87.8
ABINet [20] | 93.5 | 89.2
JVSR [21] | 92.2 | 89.7
ABINet+ConCLR [22] | 94.3 | 91.3
LevOCR [23] | 92.9 | 91.7
SwinCornerTR | 92.9 | 92.3
Table 3. Results of the SwinCornerTR ablation study.

Method | SwinTransformer | EFPN | Feature-Query | Accuracy (%)
Baseline | - | - | - | 81.3
Baseline | + | - | - | 82.6
Baseline | + | + | - | 83.1
Baseline | - | - | + | 83.8
Baseline | + | + | + | 84.8
Table 4. Results for different corner detectors.

Corner Detector | Accuracy (%) | Mean Detection Time (ms)
Harris | 83.9 | 25
FAST | 84.2 | 1
OTSU-FAST | 84.8 | 6

