HIFA-LPR: High-Frequency Augmented License Plate Recognition in Low-Quality Legacy Conditions via Gradual End-to-End Learning

Lee, Sung-Jin; Yun, Jun-Seok; Lee, Eung Joo; Yoo, Seok Bong

doi:10.3390/math10091569

¹ Department of Artificial Intelligence Convergence, Chonnam National University, Gwangju 61186, Korea

² Department of Radiology, MGH and Harvard Medical School, Boston, MA 02115, USA

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2022, 10(9), 1569; https://doi.org/10.3390/math10091569

Submission received: 5 April 2022 / Revised: 28 April 2022 / Accepted: 4 May 2022 / Published: 6 May 2022

(This article belongs to the Special Issue Advances in Machine Learning and Mathematical Modeling for Optimization Problems)

Abstract

Scene text detection and recognition, such as automatic license plate recognition, is a technology utilized in various applications. Although numerous studies have been conducted to improve recognition accuracy, accuracy decreases when low-quality legacy license plate images are input into a recognition module due to low image quality and a lack of resolution. To obtain better recognition accuracy, this study proposes a high-frequency augmented license plate recognition model in which the super-resolution module and the license plate recognition module are integrated and trained collaboratively via a proposed gradual end-to-end learning-based optimization. To optimally train our model, we propose a holistic feature extraction method that effectively prevents generating grid patterns from the super-resolved image during the training process. Moreover, to exploit high-frequency information that affects the performance of license plate recognition, we propose a license plate recognition module based on high-frequency augmentation. Furthermore, we propose a gradual end-to-end learning process based on weight freezing with three steps. Our three-step methodological approach can properly optimize each module to provide robust recognition performance. The experimental results show that our model is superior to existing approaches in low-quality legacy conditions on UFPR and Greek vehicle datasets.

Keywords:

gradual end-to-end learning; single-image super-resolution; automatic license plate recognition; low-quality legacy conditions; holistic feature extraction; high-frequency augmentation

MSC:

68T45

1. Introduction

1.1. License Plate Recognition in a Real-World Scenario

Scene text detection and recognition is a task that detects text regions and recognizes letters and numbers in image frames. This task can be utilized in various smart parking and driving applications, such as illegal parking detection and traffic sign recognition. When this task is applied to image frames in real-world scenarios, satisfactory performance is not assured because the resolution of the input images varies. This is a particularly critical issue for the license plate (LP) recognition task. As shown in Figure 1, the plate regions detected in vehicle images may have small resolutions depending on the distance between the camera and the vehicle. If the detected region images are input directly to the LP recognition module, recognition accuracy degrades severely, as shown in the result of the low-resolved LP recognition approach in Figure 1. To address this problem, a bicubic-interpolation-based approach can be considered. However, this approach also yields low recognition accuracy because the resized images remain of low quality, as shown in the result of the bicubic-interpolation-based LP recognition approach in Figure 1. To acquire high-quality LP images, super-resolution (SR) techniques can be considered. Nevertheless, conventional SR modules may not be suitable for enhancing recognition accuracy because they focus only on improving image quality, not recognition. For this reason, such an approach yields insufficient recognition performance, as shown in the result of the super-resolved LP recognition approach in Figure 1. Hence, in the real world, an integrated model is needed that improves LP recognition accuracy by restoring LP images in a way that benefits LP recognition.

1.2. High-Frequency Augmented License Plate Recognition Model

To tackle the issue described in Section 1.1, we propose a high-frequency augmented license plate recognition (HIFA-LPR) model. The HIFA-LPR model can improve the input image resolution optimally from the point of view of LP recognition and robustly classify the LP characters in the resulting super-resolved image, as shown in the last row of Figure 1. To this end, we suggest gradual end-to-end learning so that LP recognition accuracy is robust to input data with various image qualities and resolutions. Most SR modules are trained with small image patches of about 8 × 8 pixels. However, this patch-based SR approach is not suitable for end-to-end learning because of the grid patterns generated by the SR module. Since grid patterns cut off characters, patch-based SR approaches cannot preserve the character information that is closely related to LP recognition accuracy. Hence, we propose a holistic image feature extraction method that prevents grid pattern generation while preserving character information in the SR module.

In recognition tasks, high-frequency information, such as edge, contrast, and texture, is closely related to recognition performance. For this reason, we propose an LP recognition module based on high-frequency augmentation to exploit enhanced high-frequency information. Our LP recognition module mainly consists of high-frequency augmentation blocks (HABs). We utilize the property of the discrete cosine transform (DCT) that coefficients correspond to higher frequencies toward the right and bottom of the DCT domain. In the HAB, we extract the desired high-frequency components using this property and augment the high-frequency content of the feature map.

The training process of HIFA-LPR consists of three steps based on a weight freezing technique. First, the SR and the LP recognition modules are independently trained for stabilizing the training process. Second, to properly restore images in terms of LP recognition, the SR module is trained with SR loss and recognition loss while the LP recognition module weights are frozen. Third, the recognition module is trained with super-resolved LP images for enriching LP recognition accuracy. By using the weight freezing technique, we enhance the collaborative correlations between each module. To verify HIFA-LPR, we perform experiments with SR and recognition performance in low-quality legacy conditions using the UFPR [1] and Greek vehicle datasets [2]. The UFPR dataset is organized with Brazilian LP images and character labels for detecting LP and recognizing LP characters. The Greek vehicle dataset is organized with Greek LP images for LP detection only. To utilize this dataset for low-resolution (LR) recognition, we build the LP recognition dataset by manually annotating each LP character in the LP images. The contributions of this study can be summarized as follows:

A gradual end-to-end learning-based optimization method that collaboratively learns the SR and LP recognition modules is designed. We suggest this method in three steps based on the weight freezing technique.
An LP recognition module based on high-frequency augmentation is proposed to improve the recognition performance using HABs. The HAB extracts the desired high-frequency components using the DCT principle and augments the high frequency of the feature map.
A novel holistic image feature extraction method is proposed to prevent the generation of grid patterns in the SR module. This enables the utilization of more complete character information than patch-based SR, in which the character area is cut off.
To evaluate the performance of the proposed HIFA-LPR model, we build the LP recognition dataset by manually annotating 2415 characters in 345 images from the Greek vehicle dataset. Our model is superior to existing state-of-the-art works in low-quality legacy conditions. Even when the LP image resolution is only 19 × 6, our model provides relatively robust recognition performance.

2. Related Works

2.1. Single-Image Super-Resolution

Single-image SR is a method to predict a high-resolution (HR) image from the corresponding LR image. However, single-image SR is an ill-posed problem because there are various methods of degradation while reducing the image quality from HR to LR. To address this problem, several studies have been conducted with deep-learning-based methods. A super-resolution convolutional neural network (SRCNN) [3] proposed the SR method based on convolutional neural networks for the first time, and it showed innovative restoration performance. A very deep convolutional network (VDSR) [4] proposed a deeper SR neural network with a residual learning strategy. An efficient sub-pixel convolutional neural network (ESPCNN) [5] proposed a pixel-shuffling layer that can learn an up-sampling module. Using this layer, ESPCNN solved the limitation of feature map magnification in the neural network. A deep back-projection network (DBPN) [6] proposed an iterative up-sampling and down-sampling module that repeatedly stacks the image upscaling and downscaling layers. A residual channel attention network (RCAN) [7] proposed a channel attention mechanism that helps to create a deep model. A dual regression network (DRN) [8] proposed a closed circuit and added an LR domain loss function that calculates the difference from the input image. A residual dense network (RDN) [9] proposed a neural network that can learn the hierarchical representation of all feature maps through the residual density structure. A second-order attention network (SAN) [10] showed outstanding performance by strongly improving the representation of image feature maps and learning the interdependencies between feature maps. Meta-transfer learning for zero-shot SR (MZSR) [11] proposed a flexible algorithm for restoring images that are blurred under actual blur conditions by training on various kernels. 
Shifted windows using image restoration (SwinIR) [12] proposed an SR method for image restoration that uses a shifted windows (Swin) transformer, which performs reliably on high-level vision tasks. With this method, SwinIR outperformed previous state-of-the-art SR methods.

However, these SR modules only focus on image reconstruction. Since the SR modules cannot appropriately restore the image in terms of LP recognition, the recognition performance is degraded. To obtain better recognition performance, we propose a stepwise gradual end-to-end learning method using combined loss and weight freezing. Specifically, the above SR modules [3,4,5,6,7,8,9,10,11,12] generate grid patterns during the end-to-end training process due to patch-extraction-based approaches. The extracted patches that include insufficient character information hinder the optimization of the LP recognition module. To address this issue, we propose the holistic-feature-extraction-based SR module that takes the whole LP image as input. Since our method uses full character information, our SR module can be trained to improve LP character recognition compared with existing patch-extraction-based SR modules.

2.2. License Plate Recognition

LP recognition is the task to recognize the LP characters in the vehicle image. Various studies have been conducted to boost LP character recognition performance. OpenALPR [13] is the LP recognition API and it is based on OpenCV and TesseractOCR [14]. Lee et al. [15] mentioned that LP recognition performance is improved with the SR mode when the LP image is too small to recognize with the recognition module, which has a fixed input size. This shows that the SR method can improve the LP recognition performance with LR images. A super-resolved recognition method [16] was proposed for LR image character recognition and the data augmentation algorithm for left-right reversal. Wang et al. [17] proposed a method that can exploit a synthetic data generation approach based on a generative adversarial network (GAN) for a data generation procedure to obtain a large representative LP dataset. Hamdi et al. [18] proposed double GAN for image enhancement with LP images. They performed SR training used for constructive LP denoising and SR to increase the LP recognition accuracy when an LR image was used for recognition. Wang et al. [19] proposed a convolutional recursive neural network followed by the connectionist temporal classification for LP character recognition. Combining with multitask cascaded convolutional neural network detection, they proposed a recognition module that can detect the LP region and classify characters of LP.

LPRNet [20] proposed the LP character recognition module with the end-to-end method for automatic LP recognition (ALPR) without preliminary character segmentation. Moreover, this method is lightweight enough to run on a variety of platforms, including embedded devices. Laroca et al. [1] proposed the LP dataset with 4500 fully annotated images focused on usual and different real-world scenarios to help address the inadequacy of an LP database and to address the low-recognition problem. Nguyen et al. [21] proposed the LP detection and LP recognition module which is embedded with the spatial transformer to increase the accuracy of LP character detection and recognition with the CCPD dataset [22]. Xu et al. [23] proposed the location-aware 2D attention-based recognition module that recognizes both single-line and double-line plates with perspective deformation. Vasek et al. [24] proposed the LP recognition neural network with the CNN method in LR frames. Lee et al. [25] proposed the GAN-based SR method that can be adopted in LP-recognition-challenged environments. Zhang et al. [26] proposed the multitask generative adversarial network (MTGAN) which combines the SR and recognition modules in a one-step end-to-end learning method. Li et al. [27] proposed the unified deep neural network for the LP image localizing and recognizing the characters at once in a single forward pass. This method operates the LP detection and recognition jointly by a single network to avoid intermediate error accumulation and accelerate the LP processing speed. However, these methods are not the end-to-end training method for LP recognition that cannot guarantee the training stability for stable performance. Moreover, these methods cannot make the recognition module be optimized so that the loss function of LP recognition approaches the global minimum. Zhang et al. [28] proposed the robust LP recognition module in the wild situation. 
This method also proposed the GAN-based LP generation engine to reduce the exhausting human annotation work. However, there is significant performance degradation when these methods are applied to the image frames in real-world scenarios because the input images have various image resolutions and qualities.

Even if an SR-based approach [15,16,17,24,25] is applied to low-quality legacy images, it produces insufficient performance because the SR module has no relationship with the recognition module. Moreover, other approaches [18,26] attempted to address this using a single-step end-to-end learning method. However, methods based on single-step end-to-end learning cannot optimally strengthen the collaborative correlations between the SR and LP recognition modules. Meanwhile, these LP recognition modules [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28] do not have any module that augments the high frequency of characters, which is the main clue for LP recognition. For this reason, the LP recognition performance of these modules is not satisfactory when recognizing low-quality characters.

To tackle this issue, in this study, we propose a gradual end-to-end learning method with three steps: Step 1: the independent training of the SR module and LP recognition module; Step 2: SR module training with LP recognition module weight freezing; Step 3: LP recognition module training with SR module weight freezing.

Furthermore, we propose an LP recognition module based on high-frequency augmentation. Our LP recognition module mainly consists of HABs, which reinforce the high frequencies of precise character components such as edge, texture, and contrast. Owing to the HABs, our LP recognition module can provide robust recognition performance even when low-quality legacy LP images are input.

3. Proposed Method

In this section, we present our methodological contributions. First, a holistic image feature-extraction SR module is proposed to guarantee a stable end-to-end learning process, unlike DBPN [6]. Second, an LP recognition module based on high-frequency augmentation is proposed to strengthen the high-frequency component for LP recognition accuracy, unlike the state-of-the-art object recognition module, Yolov5 [29]. Our LP recognition module mainly consists of HAB. The proposed HAB extracts only the desired frequency by using our new DCT-based frequency mask. Finally, a gradual end-to-end learning process with three steps based on weight freezing is suggested to strengthen the collaborative correlations with each module.

3.1. Architecture of Each Module in HIFA-LPR Model

In this section, we introduce the HIFA-LPR model architecture. The HIFA-LPR model consists of a holistic-feature-extraction-based SR module and an LP recognition module based on high-frequency augmentation.

Holistic-Feature-Extraction-based SR Module. The existing SR methods [3,4,5,6,7,8,9,10,11,12] randomly extract patches with about 8 × 8 pixels by cropping the LR image, due to a lack of computing power. The extracted patches pass through the SR module and LP recognition module in an end-to-end learning process. However, character information is not fully considered because of the truncated character information. It causes severe training performance degradation of the recognition module in the end-to-end learning process. On the other hand, our holistic-feature-extraction-based SR ensures the training stability of the LP recognition module because our module utilizes the character position and full character information by using the whole image. Due to a lack of computing power, we set the training batch size to 1 during the end-to-end learning process.
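The loss of character information under patch cropping can be illustrated with a minimal sketch. The plate size, the seven equally sized character boxes, and the helper name `fully_visible` are hypothetical choices made only for illustration; they are not values from the datasets or part of the proposed module.

```python
def fully_visible(char_boxes, window):
    """Return indices of characters whose boxes lie entirely inside the window.

    Boxes and window are (x0, y0, x1, y1) tuples in pixels (hypothetical layout).
    """
    wx0, wy0, wx1, wy1 = window
    return [
        k for k, (x0, y0, x1, y1) in enumerate(char_boxes)
        if x0 >= wx0 and y0 >= wy0 and x1 <= wx1 and y1 <= wy1
    ]


# Hypothetical 28 x 8 plate with seven characters, each 4 pixels wide.
PLATE = (0, 0, 28, 8)
CHAR_BOXES = [(4 * k, 0, 4 * k + 4, 8) for k in range(7)]

# An 8 x 8 patch starting at x = 2 cuts through the characters on both sides.
patch_chars = fully_visible(CHAR_BOXES, (2, 0, 10, 8))

# Taking the whole plate as input keeps every character intact.
whole_chars = fully_visible(CHAR_BOXES, PLATE)
```

In this toy layout, only one character survives untruncated in the patch, whereas the whole-image input retains all seven, which is the situation the holistic feature extraction is meant to preserve.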

In this study, considering end-to-end learning for super-resolved character recognition, we propose the holistic-feature-extraction-based SR that takes the whole LP image as input. The DBPN [6] is benchmarked to improve the quality of LP images. Our holistic feature extraction consists of a 3 × 3 convolution layer and a 1 × 1 convolution layer. Unlike the original DBPN method, our SR method takes the whole LP image and performs SR by extracting holistic features and repeatedly upscaling and downscaling the holistic features through the convolution layers, as shown in Figure 2. The SR module extracts the feature map of the input image. The extracted feature maps pass through the up-blocks and down-blocks to obtain feature maps of 64 channels per block. The SR module connects the feature maps obtained for each block so that feature maps are passed to the final output layer, and then, the super-resolved RGB 3-channel image is acquired. This SR module can magnify the image resolution and improve the image quality. However, training stability cannot be guaranteed since patch-based SR models generate grid patterns in the super-resolved image. Such generated grid patterns cause restoration performance degradation. As shown in Figure 3a, since the patch-based end-to-end training process loses character information, it causes severe training performance degradation. It is a critical issue in terms of LP recognition. In addition, LP recognition model training is impossible with the patch that has a small part of the input image. On the other hand, our holistic image feature extraction consisting of a 3 × 3 convolution layer and a 1 × 1 convolution layer preserves character information in the super-resolved image while alleviating annoying grid patterns, as shown in Figure 3b. By using this method, our LP recognition module can be trained with a stable training process. Algorithm 1 shows the pseudocode of the holistic-feature-extraction-based SR module for scale factor of ×4. 
Our holistic-feature-extraction-based SR module utilizes character position and character information by using the whole image, unlike DBPN.

Algorithm 1. The pseudocode of the holistic-feature-extraction-based SR module for scale factor ×4.

LR: LR images

W: Width of LR image

H: Height of LR image

F^(W,H)(k): Feature map with W × H pixels at stage k

G_k: Convolutional layers at stage k

P_k: Deconvolutional layers at stage k

N: Number of training images

I: Number of iterations in the SR module

set * = Convolution operation

set Conv_3 = 3 × 3 convolution layer

set Conv_1 = 1 × 1 convolution layer

do:

1: For i = 1 to N do:

2: //holistic feature extraction process of the single image

3: F^(W,H)(0) = Conv_1(Conv_3(LR_i^(W,H)))

4: For k = 1 to I do:

5: F^(4W,4H)(k) = F^(W,H)(k-1) * P_k

6: F^(W,H)(k) = F^(4W,4H)(k) * G_k

7: F^(4W,4H)(k) = F^(W,H)(k) * P_k

8: if k > 1:

9: F^(4W,4H)(k) = concatenation(F^(4W,4H)(k-1), F^(4W,4H)(k))

10: F^(W,H)(k) = F^(4W,4H)(k) * G_k

11: F^(4W,4H)(k) = F^(W,H)(k) * P_k

12: F^(W,H)(k) = F^(4W,4H)(k) * G_k

13: if k > 1:

14: F^(W,H)(k) = concatenation(F^(W,H)(k-1), F^(W,H)(k))

15: F^(W,H) = F^(W,H)(k)

16: end for

17: F^(4W,4H) = F^(W,H) * P

18: F^(W,H) = F^(4W,4H) * G

19: F^(4W,4H) = F^(W,H) * P

20: //Convert SR feature map to SR image

21: S_i^(4W,4H) = Conv_3(F^(4W,4H))

22: return S_i^(4W,4H)

23: end for

LP Recognition Module based on High-frequency Augmentation. In the LP recognition task, high-frequency information, such as edge, contrast, and texture, affects LP recognition performance. Hence, to improve LP recognition performance, high-frequency components should be appropriately augmented. To this end, we propose the LP recognition module which is benchmarked and improved from Yolov5 [29]. Since there is no correlation between adjacent numbers or characters in a single LP, our proposed LP recognition module independently detects and recognizes each character in the LP instead of a whole-character-based recognition. To this end, we adopt and improve Yolov5, which is known to provide state-of-the-art accuracy in the object detection field. We note that we utilize high-frequency components to promote LP recognition accuracy, unlike Yolov5.

To augment the high-frequency component, it is necessary to extract only the desired high frequency. Therefore, we utilize the DCT, which has the principle that low-frequency components are concentrated on the upper left and high-frequency components are concentrated on the lower right in the DCT spectrum. A two-dimensional image of size N × M can be transformed into the frequency domain through DCT, as shown in Equation (1).

F(u, v) = α(u) β(v) \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} f(i, j) γ(i, j, u, v)

(1)

γ(i, j, u, v) = \cos\left(\frac{π (2i + 1) u}{2N}\right) \cos\left(\frac{π (2j + 1) v}{2M}\right)

(2)

α(u) = \begin{cases} \sqrt{\frac{1}{N}}, & u = 0 \\ \sqrt{\frac{2}{N}}, & u \neq 0 \end{cases}

(3)

β(v) = \begin{cases} \sqrt{\frac{1}{M}}, & v = 0 \\ \sqrt{\frac{2}{M}}, & v \neq 0 \end{cases}

(4)

f(i, j) = \sum_{u=0}^{N-1} \sum_{v=0}^{M-1} α(u) β(v) F(u, v) γ(i, j, u, v)

(5)

In Equation (1), f(i, j) is the pixel value at position (i, j) of the image, and F(u, v) is the DCT coefficient at position (u, v). Equations (2)–(4) define the cosine basis function and the normalization constants, respectively. As shown in Equation (5), the frequency-domain signal can be transformed back into the spatial domain using a two-dimensional inverse DCT (IDCT).
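As a minimal sketch of Equations (1)–(5), the forward and inverse transforms can be written directly from their definitions. The function names `norm`, `basis`, `dct2`, and `idct2` are illustrative assumptions, the sums are taken over the N × M valid pixel positions, and the direct quadruple loop is chosen for clarity rather than speed.

```python
import math


def norm(k, size):
    """Normalization constants alpha(u) / beta(v) of Equations (3) and (4)."""
    return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)


def basis(i, j, u, v, n, m):
    """Cosine basis function gamma(i, j, u, v) of Equation (2)."""
    return (math.cos(math.pi * (2 * i + 1) * u / (2 * n))
            * math.cos(math.pi * (2 * j + 1) * v / (2 * m)))


def dct2(f):
    """Forward 2D DCT of an N x M list-of-lists image, as in Equation (1)."""
    n, m = len(f), len(f[0])
    return [[norm(u, n) * norm(v, m)
             * sum(f[i][j] * basis(i, j, u, v, n, m)
                   for i in range(n) for j in range(m))
             for v in range(m)]
            for u in range(n)]


def idct2(coeffs):
    """Inverse 2D DCT, as in Equation (5)."""
    n, m = len(coeffs), len(coeffs[0])
    return [[sum(norm(u, n) * norm(v, m) * coeffs[u][v]
                 * basis(i, j, u, v, n, m)
                 for u in range(n) for v in range(m))
             for j in range(m)]
            for i in range(n)]


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0]]
restored = idct2(dct2(image))  # matches `image` up to floating-point error
```

A constant image places all of its energy in the coefficient F(0, 0), which is the low-frequency corner that the frequency mask later suppresses.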

By using the DCT principle, we can dynamically extract high-frequency components via a frequency mask 𝓜. The frequency mask 𝓜 consists of binary values and is determined by the hyper-parameter ε as given below.

𝓜(x, y) = \begin{cases} 0, & y < -x + 2εh \\ 1, & \text{otherwise} \end{cases},

(6)

where h denotes the height of the input image and x, y denote the horizontal and vertical coordinates of 𝓜, respectively. The hyper-parameter ε ranges from 0 to 1. Since the frequency of a component in the DCT domain increases along the zigzag direction from the top left to the bottom right, we can extract the desired high-frequency components. Figure 4 shows examples of 𝓜 according to the hyper-parameter ε.
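The mask of Equation (6) can be sketched in a few lines. The function name `frequency_mask` and the row-major list-of-lists layout (rows indexed by y, columns by x) are assumptions made for illustration.

```python
def frequency_mask(height, width, eps):
    """Binary mask of Equation (6): 0 where y < -x + 2 * eps * h, else 1."""
    threshold = 2.0 * eps * height
    return [[0 if y < -x + threshold else 1 for x in range(width)]
            for y in range(height)]


# eps = 0.5 on a 4 x 4 spectrum zeroes the upper-left triangle (x + y < 4).
mask_half = frequency_mask(4, 4, 0.5)
for row in mask_half:
    print(row)
```

With ε = 0 every coefficient is kept, and with ε = 1 on a square spectrum every coefficient is removed, so ε controls how much of the low-frequency corner is discarded.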

Our LP recognition module mainly consists of HABs, which reinforce the high frequency of the feature map. As shown in Figure 5, the HAB receives the feature map as input. Then, the feature map is transformed into the DCT domain using 2D-DCT. To extract the high frequency in the DCT domain, the element-wise product of the transformed feature map and the frequency mask 𝓜 determined by the hyper-parameter ε is computed. The extracted high-frequency feature map is transformed back into the spatial domain by 2D-IDCT. The obtained high-frequency feature map is added to the original feature map.
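The four HAB operations (2D-DCT, masking, 2D-IDCT, and addition) can be sketched for a single-channel feature map as follows. This is a simplified illustration, not the authors' implementation: it uses the separable matrix form of the orthonormal DCT, and the helper names `dct_matrix`, `matmul`, `transpose`, and `hab` are assumptions.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C with C[u][i] = alpha(u) * cos(pi (2i+1) u / 2n)."""
    return [[(math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
             for i in range(n)]
            for u in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hab(feature, eps):
    """High-frequency augmentation of one h x w feature map (single channel)."""
    h, w = len(feature), len(feature[0])
    c_h, c_w = dct_matrix(h), dct_matrix(w)
    # 2D-DCT in separable form: F = C_h * f * C_w^T
    spectrum = matmul(matmul(c_h, feature), transpose(c_w))
    # Element-wise product with the binary mask of Equation (6)
    threshold = 2.0 * eps * h
    masked = [[0.0 if y < -x + threshold else spectrum[y][x]
               for x in range(w)]
              for y in range(h)]
    # 2D-IDCT: f_high = C_h^T * F_masked * C_w
    high = matmul(matmul(transpose(c_h), masked), c_w)
    # Add the high-frequency map back to the original feature map
    return [[feature[y][x] + high[y][x] for x in range(w)] for y in range(h)]


flat = [[3.0] * 4 for _ in range(4)]
augmented_flat = hab(flat, 0.25)  # a constant map has no high frequencies to add
```

A constant feature map passes through unchanged once the DC coefficient is masked, whereas edges and fine texture are amplified because their coefficients lie outside the masked corner.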

Our LP recognition module based on high-frequency augmentation is illustrated in Figure 6. The Focus layer downscales the input image with as little information loss as possible for fast recognition by transforming input image space into depth. By transporting to a convolution batch normalization leaky ReLU (CBL) and a cross-stage partial (CSP) layer, feature maps obtain richer gradient combinations while maintaining lower computations. Moreover, by splitting the gradient flow, CSP reduces the computation of the architecture with the residual unit, which maintains the gradient of the neural network. Then, the feature map passes through the spatial pyramid pooling (SPP) layer to generate a fixed one-dimensional array as input to the fully connected layer to predict. The up-sample block performs up-sampling of the feature map to expand the feature map size, and it allows for small-object detection. Before the concatenation of each feature map, the proposed HAB is employed to exploit enhanced high-frequency information. Finally, the LP recognition module can obtain three different-sized feature maps to detect and recognize the characters of the LP image. The character recognition loss (localization, classification, and confidence losses) is obtained by comparing the result of the LP recognition module with the ground truth label. For training for LP recognition, we define 10 numerical classes for 0–9 and 26 character classes for A–Z. Algorithm 2 shows the pseudocode of the LP recognition module based on high-frequency augmentation.

Algorithm 2. The pseudocode of the LP recognition module based on high-frequency augmentation.

I: Input LP image

M: Frequency mask as in Equation (6)

set Conv = Convolution layer

set CBL = Convolution, batch normalization, leaky ReLU layers

set Up-sample = Bicubic interpolation

set • = Element-wise product

set CSP1 = Combination of CBL and a residual unit layer

set CSP2 = Combination of convolution and a residual unit layer

do:

1: //Convert the input image to four depth maps by focus layer

2: D₁, D₂, D₃, D₄ = channel slice(I)

3: F = concatenation(D₁, D₂, D₃, D₄)

4: //Feature extraction.

5: F = CBL(F)

6: //CSP layer

7: F₁, F₂ = channel slice(F)

8: F₁ = CBL(F₁)

9: F = concatenation(F₁, F₂)

10: F = CSP1_3(F)

11: //Getting three feature maps with different sizes

12: F₁ = CBL(F)

13: F₂ = CBL(CSP1_1(CBL(F₁)))

14: F₃ = CBL(CSP2_1(CBL(SPP(CBL(CBL(F₂))))))

15: F₃ = Up-sample(F₃)

16: //High-frequency augmentation

17: F₃_dct = DCT(F₃) as in Equation (1)

18: F₃_H = F₃_dct • M

19: F₃_H = IDCT(F₃_H) as in Equation (5)

20: F₃_Aug = F₃ + F₃_H

21: //Concatenate the F₂ and F₃_Aug

22: C₁ = concatenation(CBL(F₂), F₃_Aug)

23: F₁ = CBL(F₁)

24: //High-frequency augmentation

25: F₁_dct = DCT(F₁) as in Equation (1)

26: F₁_H = F₁_dct • M

27: F₁_H = IDCT(F₁_H) as in Equation (5)

28: F₁_Aug = F₁ + F₁_H

29: C₁_dct = DCT(C₁) as in Equation (1)

30: C₁_H = C₁_dct • M

31: C₁_H = IDCT(C₁_H) as in Equation (5)

32: C₁_Aug = C₁ + C₁_H

33: //Concatenate F₁_Aug and C₁_Aug

34: C₂ = concatenation(CBL(F₁_Aug), Up-sample(CBL(CSP2_1(C₁_Aug))))

35: //Getting first LP recognition feature map

36: P₁ = Conv(CBL(CSP2_1(C₂)))

37: //Concatenate the C₁ and C₂

38: C₃ = concatenation(CBL(CSP2_1(C₁)), Conv(CBL(CSP2_1(C₂))))

39: //Getting second LP recognition feature map

40: P₂ = Conv(CBL(CSP2_1(C₃)))

41: //Concatenate the high-frequency augmented F₃ and the C₃

42: C₄ = concatenation(F₃, CBL(CSP2_1(C₃)))

43: //Getting third LP recognition feature map

44: P₃ = Conv(CBL(CSP2_1(C₄)))

45: //Calculating the confidence of each output feature map P₁, P₂, P₃.

46: Output = Non-maximum suppression (P₁, P₂, P₃)
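The space-to-depth slicing performed by the Focus layer at the start of Algorithm 2 can be sketched for a single-channel image as follows. The function name `focus_slice` is an assumption, and the slice order follows the common Yolov5 convention rather than anything stated in the text.

```python
def focus_slice(image):
    """Split an H x W image into four (H/2) x (W/2) maps by pixel parity.

    Stacking the four maps along the channel axis halves the spatial size
    without discarding any pixel.
    """
    return [
        [row[0::2] for row in image[0::2]],  # even rows, even columns
        [row[0::2] for row in image[1::2]],  # odd rows, even columns
        [row[1::2] for row in image[0::2]],  # even rows, odd columns
        [row[1::2] for row in image[1::2]],  # odd rows, odd columns
    ]


image = [[4 * r + c for c in range(4)] for r in range(4)]  # pixel values 0..15
d1, d2, d3, d4 = focus_slice(image)
```

Every input pixel appears in exactly one of the four maps, which is why the downscaling loses as little information as possible.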

3.2. Gradual End-to-End Learning Process Based on Weight Freezing

A schematic diagram of the gradual end-to-end learning method is shown in Figure 7. We implemented the training process based on weight freezing in Steps 1, 2, and 3 as follows. The gradual end-to-end learning process based on weight freezing is one of our methodological contributions.

Step 1 process. Step 1 requires the pretrained weights of both modules to guarantee optimized performance. As shown in Figure 7a, we independently train the SR and LP recognition modules using SR loss and LP recognition loss, respectively. The SR loss Loss_SR reduces the L1 difference in pixel values between SR images and HR images. The Loss_SR is defined as

Loss_{SR} = \sum_{i=1}^{N} | HR_{i} - f(LR_{i}) |,

(7)

where N is the number of training images, LR_i is the i-th LR training image, f(LR_i) is the SR result of LR_i, and HR_i is the i-th HR training image corresponding to the SR image f(LR_i).
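A minimal sketch of Equation (7) follows, assuming that the absolute value denotes the sum of absolute pixel differences over each image pair. The function name `sr_l1_loss` and the nested-list image representation are illustrative assumptions.

```python
def sr_l1_loss(hr_images, sr_images):
    """Sum of absolute pixel differences between HR targets and SR outputs."""
    total = 0.0
    for hr, sr in zip(hr_images, sr_images):
        for hr_row, sr_row in zip(hr, sr):
            total += sum(abs(h - s) for h, s in zip(hr_row, sr_row))
    return total


hr_batch = [[[1.0, 2.0], [3.0, 4.0]]]
sr_batch = [[[1.0, 1.0], [5.0, 4.0]]]
loss = sr_l1_loss(hr_batch, sr_batch)  # |0| + |1| + |-2| + |0| = 3.0
```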

The LP recognition loss Loss_recognition consists of localization, confidence, and classification losses [29]. The loss function is defined as

Loss_{recognition} = λ_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{ij}^{obj} [(x_{i} - \hat{x}_{i})^{2} + (y_{i} - \hat{y}_{i})^{2}]

+ λ_{coord} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{ij}^{obj} [(\sqrt{w_{i}} - \sqrt{\hat{w}_{i}})^{2} + (\sqrt{h_{i}} - \sqrt{\hat{h}_{i}})^{2}]

+ \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{ij}^{obj} (C_{i} - \hat{C}_{i})^{2} + λ_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{ij}^{noobj} (C_{i} - \hat{C}_{i})^{2}

+ \sum_{i=0}^{S^{2}} 1_{i}^{obj} \sum_{c \in classes} (p_{i}(c) - \hat{p}_{i}(c))^{2},

(8)

where S² denotes the number of grid cells used for recognition and B denotes the number of bounding boxes per grid cell. The indicator 1_ij^obj takes a value of one if an LP character is assigned to the j-th bounding box of the i-th grid cell and zero otherwise, and 1_ij^noobj is its complement. λ_coord and λ_noobj are constants that weight the localization term and the no-object confidence term, respectively. The first and second lines denote the localization loss, which computes the error of the bounding box position for accurate box detection. x_i and y_i denote the horizontal and vertical coordinates of the bounding box in the i-th grid cell, and w_i and h_i denote its width and height. The LP recognition module calculates the sum of squared errors for x_i, y_i, w_i, and h_i between the predicted bounding box and the ground truth. C_i denotes the confidence in the i-th grid cell; it is a value between zero and one that is determined when a character is detected in the box. When no character is detected, the confidence error is weighted by λ_noobj. p_i(c) denotes the confidence of class c in the i-th grid cell.

Step 2 process. Step 2 couples the independently trained SR module with the LP recognition module by means of weight freezing. The SR and LP recognition losses are summed so that the two modules are optimized collaboratively. To generate images that are well suited to LP recognition, the summed loss is backpropagated to the SR module, as shown in Figure 7b. If the parameters of the LP recognition module changed during this process, the end-to-end learning could not proceed properly; therefore, the weights of the LP recognition module are frozen during training. In this way, the SR module is trained to reduce both the SR and LP recognition losses, and we obtain SR weights that produce super-resolved images that improve LP recognition accuracy. The summed loss is calculated as

$$
Loss_{Total} = \alpha \times Loss_{SR} + Loss_{recognition},
\tag{9}
$$

where α is a hyper-parameter that scales the SR loss. In Step 2, α is set to 0.1 to equalize the scales between Loss_SR and Loss_recognition.

Step 3 process. Step 3 swaps the frozen module: the SR module is frozen and the LP recognition module is trained, as shown in Figure 7c. Although the SR module now reconstructs images that improve LP recognition accuracy, the LP recognition module has not yet adapted to these enhanced super-resolved images. To address this issue, the super-resolved images are used to train the LP recognition module using Loss_Total with α = 0. As a result, the LP recognition module achieves higher recognition accuracy than existing approaches.

Algorithm 3 gives pseudocode for the gradual end-to-end learning method based on weight freezing.

Algorithm 3. Pseudocode of the gradual end-to-end learning method based on weight freezing.

L: LR image
H: HR image
N: Number of training images
W_SR: SR weights
W_LP: LP recognition weights

set SR = SR module of gradual end-to-end learning method
set LP = LP recognition module of gradual end-to-end learning method

(Step 1) Independent training of SR and LP for getting initial weights

Independent training of SR
for p = 1 to N do
    S = SR(L)
    Loss_SR = |H − S|₁ as in (7)
    grad_SR = Backpropagate(SR, Loss_SR)
    W_SR ← W_SR − grad_SR
end for
return W_SR

Independent training of LP
for q = 1 to N do
    H = Bicubic_interpolation(H, 256)
    label_pred = LP(H)
    Loss_recognition = Loss(label_pred, label_GT) as in (8)
    grad_LP = Backpropagate(LP, Loss_recognition)
    W_LP ← W_LP − grad_LP
end for
return W_LP

(Step 2) Freeze LP with W_LP and train SR
for p = 1 to N do
    SR loss calculation
    S = SR(L)
    Loss_SR = |H − S|₁ as in (7)
    Recognition loss calculation
    S = Bicubic_interpolation(S, 256)
    label_pred = LP(S)
    Loss_recognition = Loss(label_pred, label_GT) as in (8)
    Total loss calculation
    Loss_total = α × Loss_SR + Loss_recognition as in (9)
    SR weight update via total loss backpropagation
    grad_SR = Backpropagate(SR, Loss_total)
    W_SR ← W_SR − grad_SR
end for
return W_SR

(Step 3) Freeze SR with W_SR and train LP
for p = 1 to N do
    SR loss calculation
    S = SR(L)
    Loss_SR = |H − S|₁ as in (7)
    LP loss calculation
    S = Bicubic_interpolation(S, 256)
    label_pred = LP(S)
    Loss_recognition = Loss(label_pred, label_GT) as in (8)
    LP weight update via LP loss backpropagation (Loss_total with α = 0)
    grad_LP = Backpropagate(LP, Loss_recognition)
    W_LP ← W_LP − grad_LP
end for
return W_LP
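The three steps of Algorithm 3 can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the SR module is reduced to a single scalar weight `a` (S = a · L), the LP recognition module to a single scalar weight `b` (prediction = b · S), the recognition loss to a squared error, and the gradients are written out by hand. The data, step size, and all function names are hypothetical. What the sketch preserves is the freezing pattern: Step 2 updates only `a` using the total loss of Equation (9), and Step 3 updates only `b` with α = 0.

```python
def sign(v):
    return (v > 0) - (v < 0)

# Toy data (assumed): LR input x, HR target h = 2*x, recognition label y = 3*h.
DATA = [(x, 2.0 * x, 6.0 * x) for x in (0.5, 1.0, 1.5)]
STEP_SIZE = 0.01

def pipeline_loss(a, b):
    """Mean squared recognition error of the full pipeline LP(SR(x))."""
    return sum((b * a * x - y) ** 2 for x, _, y in DATA) / len(DATA)

def step1(a, b, epochs=200):
    """Train SR (weight a) and LP (weight b) independently."""
    for _ in range(epochs):
        for x, h, y in DATA:
            a -= STEP_SIZE * (-sign(h - a * x) * x)   # gradient of |h - a*x|
            b -= STEP_SIZE * (2.0 * (b * h - y) * h)  # LP is trained on HR images
    return a, b

def step2(a, b, alpha=0.1, epochs=200):
    """Freeze LP (b is never updated) and train SR with the total loss."""
    for _ in range(epochs):
        for x, h, y in DATA:
            grad_sr = -sign(h - a * x) * x
            grad_rec = 2.0 * (b * a * x - y) * b * x  # flows through the frozen LP
            a -= STEP_SIZE * (alpha * grad_sr + grad_rec)
    return a, b

def step3(a, b, epochs=200):
    """Freeze SR (a is never updated) and train LP on SR outputs (alpha = 0)."""
    for _ in range(epochs):
        for x, _, y in DATA:
            s = a * x
            b -= STEP_SIZE * (2.0 * (b * s - y) * s)
    return a, b

a1, b1 = step1(1.0, 1.0)
a2, b2 = step2(a1, b1)
a3, b3 = step3(a2, b2)
```

In a PyTorch implementation, the same effect is usually obtained by disabling gradients for the frozen module's parameters rather than by omitting the update by hand; that detail is an assumption about a typical setup, not something stated in the paper.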

Optimizer. Our HIFA-LPR model utilizes the adaptive moment estimation (Adam) optimizer [30] to search for a minimum of our loss function Loss_Total with the following iterative operation:

$$
m^{(n+1)} = \beta_{1} m^{(n)} + (1 - \beta_{1}) \nabla Loss_{Total}(W^{(n)}),
\tag{10}
$$

$$
v^{(n+1)} = \beta_{2} v^{(n)} + (1 - \beta_{2}) \nabla Loss_{Total}(W^{(n)}) \bullet \nabla Loss_{Total}(W^{(n)}),
\tag{11}
$$

$$
W^{(n+1)} = W^{(n)} - \frac{h}{\sqrt{v^{(n+1)} + \omega}} m^{(n+1)},
\tag{12}
$$

where ∇Loss_Total denotes the gradient of our loss function Loss_Total, β₁ and β₂ denote the exponential decay rates for the moment estimates, m⁽ⁿ⁾ denotes the estimate of the first moment of the gradient, v⁽ⁿ⁾ denotes the estimate of the second moment, W⁽ⁿ⁾ denotes the weight vector before the update, W⁽ⁿ⁺¹⁾ denotes the weight vector updated by the Adam optimizer, h denotes the step size of the optimization process, • denotes the element-wise product, and ω denotes a small constant that prevents division by zero in Equation (12). The step size h is an important parameter because it balances the speed and the convergence of the optimization.
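The update in Equations (10)–(12) can be sketched in a few lines of plain Python. The function name `adam_step` and the list-based weight representation are illustrative assumptions. The sketch follows the equations as written here, with the second-moment recursion on v, ω placed inside the square root, and no bias-correction step; the original Adam paper [30] additionally applies bias correction, which is omitted here to stay with the form above.

```python
import math

def adam_step(w, m, v, grad, h=1e-4, beta1=0.9, beta2=0.999, omega=1e-8):
    """One update of Equations (10)-(12) applied element-wise to lists."""
    # Equation (10): first-moment estimate.
    m_new = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    # Equation (11): second-moment estimate (element-wise squared gradient).
    v_new = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Equation (12): weight update with step size h.
    w_new = [wi - h / math.sqrt(vi + omega) * mi
             for wi, mi, vi in zip(w, m_new, v_new)]
    return w_new, m_new, v_new

# One step on f(w) = w^2 at w = 1, where the gradient is 2*w = 2.
w1, m1, v1 = adam_step([1.0], [0.0], [0.0], [2.0], h=0.1)
```

The default arguments mirror the values reported in Section 4.1 (β₁ = 0.9, β₂ = 0.999, ω = 10⁻⁸, learning rate 10⁻⁴); the larger step size in the usage line is chosen only to make the single update visible.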

4. Experiments and Analysis

4.1. Experimental Setup

Dataset. To verify the effectiveness of our model, we utilize the UFPR [1] and Greek vehicle [2] datasets. The UFPR dataset, which consists of Brazilian vehicle LPs, is organized into 1800 training images, 900 validation images, and 1800 test images. To accurately measure the performance of our framework, we reorganize the UFPR dataset [1] into 3600 training images and 900 validation images. We organize the Greek vehicle dataset [2] into 280 training images and 65 validation images, which we annotated ourselves. To simulate low-quality legacy conditions, we downscale the HR images to LR images using the built-in resize function of MATLAB for each scale factor (×3, ×4).

Metric. To analyze the performance of SR modules, we use the peak signal-to-noise ratio (PSNR), which evaluates the difference between the original image and the super-resolved image. To quantify the recognition accuracy, the mean average precision (mAP) is utilized.
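For reference, PSNR can be computed from the mean squared error between the original and super-resolved images. The sketch below is a minimal plain-Python version; the function name `psnr`, the flattened pixel lists, and the peak value of 255 (8-bit images) are assumptions for illustration, since the paper does not state these details.

```python
import math

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# A restored image that is off by one gray level at every pixel (MSE = 1).
value = psnr([100, 100, 100, 100], [101, 99, 101, 99])
```

A higher PSNR means a smaller pixel-wise error, which is why a module tuned for recognition rather than pixel fidelity can score lower on PSNR while still improving mAP.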

Environments. Our framework is implemented in PyTorch 1.8.0 with Python 3.8.3, CUDA 11.2, and cuDNN 8.2.0. Our experiments are performed on an AMD Ryzen 5 5600X 6-core CPU, 32 GB of memory, and an NVIDIA RTX 3080 GPU. The 2D-DCT and 2D-IDCT are implemented using the built-in functions torch.fft.rfft and torch.fft.irfft, respectively. Our HIFA-LPR model is trained by the Adam optimizer [30] with β₁ = 0.9 and β₂ = 0.999, as defined in Equations (10)–(12). The training batch size is set to 16, the number of epochs to 200, ω to 10⁻⁸, and the learning rate to 10⁻⁴.
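As a reminder of what the 2D-DCT and 2D-IDCT pair computes, the following is a naive plain-Python sketch of the orthonormal DCT-II and its inverse, applied separably along rows and columns. It is a swapped-in formulation for illustration only: the paper reports an implementation based on torch.fft, and all function names here are hypothetical.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k in range(n)))
    return out

def transpose(block):
    return [list(col) for col in zip(*block)]

def apply_2d(block, transform):
    """Apply a 1D transform along every row, then along every column."""
    rows = [transform(row) for row in block]
    cols = [transform(col) for col in transpose(rows)]
    return transpose(cols)

def dct_2d(block):
    return apply_2d(block, dct_1d)

def idct_2d(coeffs):
    return apply_2d(coeffs, idct_1d)

# A constant block has all of its energy in the lowest-frequency coefficient.
flat = [[2.0] * 4 for _ in range(4)]
flat_coeffs = dct_2d(flat)
ramp = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
ramp_back = idct_2d(dct_2d(ramp))
```

Coefficients with small indices correspond to low frequencies and those with large indices to high frequencies, which is the property a frequency-domain mask relies on.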

4.2. Experimental Results

We compare our model with other approaches that combine SR modules [8,11,12] with LP recognition modules [14,29]. Among the recent LP recognition modules, we exclude those that do not provide source code [21,23,25,26,27,28]. We set the LP recognition module trained with LR LP images as the baseline. For a fair comparison, each SR module is trained and validated on the same datasets. In addition, Yolov5 [29] is trained with the corresponding HR LP images. Then, each LP recognition module is fine-tuned on the corresponding SR results, as in Step 3 of our gradual end-to-end learning method. Table 1 and Table 2 compare our HIFA-LPR model with other existing approaches. According to the training step, our HIFA-LPR model is denoted as HIFA-LPR (Step 1), HIFA-LPR (Step 2), or HIFA-LPR (Step 3). Our model outperforms the existing approaches in both tables. In particular, for a scale factor of ×3, HIFA-LPR (Step 3) improves the PSNR by 0.8 dB and the mean average precision (mAP) by 19.7 percentage points over HIFA-LPR (Step 1). For a scale factor of ×4, HIFA-LPR (Step 3) improves the PSNR by 0.14 dB and the mAP by 26.5 percentage points over HIFA-LPR (Step 1). These experimental results indicate that our proposed gradual end-to-end learning method is superior to individual learning.
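The reported gains can be reproduced directly from the HIFA-LPR rows of Table 1 and Table 2. The short Python check below copies the PSNR and mAP values of Step 1 and Step 3 from those tables; the dictionary layout and the function name `gains` are illustrative.

```python
# (PSNR in dB, mAP in %) for HIFA-LPR on the UFPR validation set,
# copied from Table 1 (x3) and Table 2 (x4).
RESULTS = {
    "x3": {"step1": (26.40, 62.7), "step3": (27.20, 82.4)},
    "x4": {"step1": (23.77, 34.4), "step3": (23.91, 60.9)},
}

def gains(scale):
    """Return (PSNR gain in dB, mAP gain in percentage points) of Step 3 over Step 1."""
    psnr1, map1 = RESULTS[scale]["step1"]
    psnr3, map3 = RESULTS[scale]["step3"]
    return round(psnr3 - psnr1, 2), round(map3 - map1, 1)

gain_x3 = gains("x3")
gain_x4 = gains("x4")
```

Note that the mAP gains are absolute differences between percentages, that is, percentage points rather than relative improvements.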

Figure 8 shows experimental results from SR modules with the scale factor ×4 on the UFPR dataset. As shown in Figure 8b–d, the SR modules cannot properly improve the image quality of the LR image in Figure 8a. As shown in Figure 8e, SwinIR presents better image quality than the other SR modules. As illustrated in Figure 8h, our HIFA-LPR model provides enhanced edges and textures, which are closely related to LP recognition accuracy. Figure 9 compares the LP recognition results of our model and existing approaches. As shown in Figure 9b–e, the LP recognition results obtained from existing approaches include many missing characters as well as several incorrect predictions. Although the result of HIFA-LPR (Step 3) includes one missing character, it outperforms the other approaches in terms of recognition accuracy, as shown in Figure 9h. In addition, Figure 10 shows the mAP results for the numeric classes (0–9) for the bicubic method and HIFA-LPR (Step 1, Step 2, and Step 3). We focus on the numeric classes only for detailed observation. The mAP for the numeric classes of HIFA-LPR (Step 3) is 21.6% higher than that of HIFA-LPR (Step 1). The mAP increases as the gradual end-to-end learning process progresses step by step, which demonstrates the effectiveness of the gradual end-to-end learning method.

Table 3 and Table 4 compare our HIFA-LPR model with other approaches on the Greek vehicle dataset. As shown in Table 3, for a scale factor of ×3, the mAP of HIFA-LPR (Step 3) is 1.5 percentage points higher than that of SwinIR, while its PSNR is 2.93 dB lower. As shown in Table 4, although the PSNR of HIFA-LPR (Step 3) is 3.6 dB lower than that of SwinIR, its mAP is 1.4 percentage points higher. As shown in Figure 11b–e, the SR modules again cannot properly enhance the quality of the LR image in Figure 11a. In contrast, our HIFA-LPR model produces enhanced edge components, as shown in Figure 11h, which helps improve the performance of LP recognition. Figure 12 shows the results of the LP recognition module using SR results for scale factor ×4 on the Greek vehicle dataset. As illustrated in Figure 12, HIFA-LPR (Step 3) accurately predicts all characters, while the other approaches include several missing characters and incorrect predictions. These experimental results indicate that the proposed HIFA-LPR model achieves high recognition performance because it performs SR in a way that improves recognition accuracy, despite the PSNR degradation. In addition, Figure 13 shows the mAP results for the numeric classes (0–9) for the bicubic method and HIFA-LPR (Step 1, 2, and 3). The mAP for the numeric classes of HIFA-LPR (Step 3) is 18.6% higher than that of HIFA-LPR (Step 1).

4.3. Ablation Study

The effect of holistic feature extraction. In this section, we verify the effectiveness of the holistic image feature-extraction-based SR module by comparison with the patch-extraction-based SR module. For a fair comparison, each module is trained and validated using the same datasets and LP recognition module. As shown in Table 5, the holistic-extraction-based SR presents better mAP results than patch-extraction-based SR due to the elimination of grid patterns during the gradual end-to-end learning process.

The effect of the LP recognition module based on high-frequency augmentation. In this section, we investigate the effect of our LP recognition module. As mentioned above, the HAB extracts the desired high-frequency components according to the hyper-parameter ε and augments the extracted high-frequency components. We compare our LP recognition module with Tesseract-OCR [14] and Yolov5 [29], the latter being our baseline module. Since OpenALPR [13] only provides a cloud demo service for a single image, we provide only visual recognition results for OpenALPR, as shown in Figure 14. For a fair comparison, each LP module is trained on the same SR results. As shown in Table 6, our LP recognition module with high-frequency augmentation outperforms the other modules because the HAB effectively exploits high-frequency components, which are closely related to LP recognition. As shown in Figure 14, our LP recognition module recognizes all characters, whereas the other modules produce missing or misrecognized characters.

Extension to other countries’ LP images. Our HIFA-LPR model can be extended to other countries’ LP images that include characters of different languages, such as Korean, by building LP datasets and redefining the character classes. To verify this, we conduct an additional experiment on a Korean LP dataset. Compared with SwinIR-based LP recognition, our model increases the PSNR by about 1.7 dB and the mAP by about 16.4%. Our model also surpasses the other approaches on the Korean LP dataset.

Limitations. Our HIFA-LPR model may have limitations in practical applications. First, our experiments assume that the LR image is downscaled with a known bicubic kernel. Hence, the SR restoration performance may degrade in the real world, where the blur kernel is unknown. Second, our HIFA-LPR model requires more training time because of the gradual stepwise learning. However, the inference time of our model is the same as that of other combination methods. The inference time of the proposed HIFA-LPR model is 3.2 ms for a 42 × 42 pixel LR input image, which is upscaled by a scale factor of ×4 to the 168 × 168 pixel input of the LP recognition module. Moreover, the training time of our HIFA-LPR model can be reduced by optimization techniques such as network weight compression or weight sharing.

5. Conclusions

This study focuses on LP recognition in low-quality legacy conditions. To this end, we propose the HIFA-LPR model together with a gradual end-to-end learning method based on weight freezing. This method consists of three steps. In Step 1, the SR and LP recognition modules are trained independently to stabilize the training process. In Step 2, the SR module is trained with a combined loss function while the LP recognition module weights are frozen, which strengthens the collaboration between the modules. In Step 3, the LP recognition module is trained with super-resolved images while the SR module weights are frozen, which yields higher LP recognition accuracy. To optimally train our model, we propose a holistic feature extraction method that prevents grid pattern generation in the training process. To exploit high-frequency information, we propose an LP recognition module based on high-frequency augmentation. Our LP recognition module extracts only the desired high frequencies and enhances the high-frequency component of the feature map. The experimental results show that our HIFA-LPR model provides the best mAP among the compared approaches. Although our HIFA-LPR model is intended for LP recognition, it can be extended to other object recognition tasks by building the related datasets and redefining the object classes. In addition, our HIFA-LPR model can be considered for real-world scene text recognition applications such as smart parking and autonomous driving. In smart parking, our method can be applied to parking in designated areas and to the enforcement of illegal parking rules. Our HIFA-LPR model can also be applied to character recognition on traffic signs to assist autonomous driving.

Author Contributions

Conceptualization, S.B.Y.; methodology, S.-J.L., J.-S.Y. and S.B.Y.; software, S.-J.L.; validation, J.-S.Y.; formal analysis, S.-J.L. and J.-S.Y.; investigation, E.J.L. and S.B.Y.; resources, S.-J.L. and S.B.Y.; data curation, S.-J.L. and J.-S.Y.; writing—original draft preparation, S.-J.L. and J.-S.Y.; writing—review and editing, E.J.L. and S.B.Y.; visualization, S.B.Y.; supervision, S.B.Y.; project administration, S.B.Y.; funding acquisition, S.B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2020R1G1A1100798). This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network). This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Laroca, R.; Severo, E.; Zanlorensi, L.; Oliveira, L.; Gonçalves, G.; Schwartz, W.; Menotti, D. A robust real-time automatic license plate recognition based on the YOLO detector. In Proceedings of the 2018 International Joint Conference on Neural Networks, Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–10.
2. Anagnostopoulos, C.N.E.; Anagnostopoulos, I.E.; Psoroulas, I.D.; Loumos, V.; Kayafas, E. License plate recognition from still images and video sequences: A survey. IEEE Trans. Intell. Transp. Syst. 2008, 9, 377–391.
3. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
4. Kim, J.W.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
5. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
6. Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1664–1673.
7. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 286–301.
8. Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-loop matters: Dual regression networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5407–5416.
9. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481.
10. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11065–11074.
11. Soh, J.W.; Cho, S.; Cho, N.I. Meta-transfer learning for zero-shot super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, Seattle, WA, USA, 14–19 June 2020; pp. 3516–3525.
12. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE International Conference on Computer Vision, Montréal, QC, Canada, 11–17 October 2021; pp. 1833–1844.
13. OpenALPR. Available online: https://www.openalpr.com (accessed on 29 October 2021).
14. Tesseract Open Source OCR Engine. Available online: https://github.com/tesseract-ocr/tesseract (accessed on 25 October 2021).
15. Lee, S.J.; Kim, T.J.; Lee, C.H.; Yoo, S.B. Image super-resolution for improving object recognition accuracy. JKIICE 2021, 25, 774–784.
16. Lee, S.J.; Yoo, S.B. Super-resolved recognition of license plate characters. Mathematics 2021, 9, 2494.
17. Wang, X.; Man, J.; You, M.; Shen, C. Adversarial generation of training examples: Applications to moving vehicle license plate recognition. arXiv 2017, arXiv:1707.03124.
18. Hamdi, A.; Chan, Y.K.; Koo, V.C. A new image enhancement and super resolution technique for license plate recognition. Heliyon 2021, 7, 8341.
19. Wang, W.; Yang, J.; Chen, M.; Wang, P. A light CNN for end-to-end car license plates detection and recognition. IEEE Access 2019, 7, 173875–173883.
20. Zherzdev, S.; Gruzdev, A. LPRNet: License plate recognition via deep neural networks. arXiv 2018, arXiv:1806.10447.
21. Nguyen, D.L.; Putro, M.D.; Vo, X.T.; Jo, K.H. Triple detector based on feature pyramid network for license plate detection and recognition system in unusual conditions. In Proceedings of the International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021; pp. 1–6.
22. Xu, Z.; Yang, W.; Meng, A.; Lu, N.; Huang, H.; Ying, C.; Huang, L. Towards end-to-end license plate detection and recognition: A large dataset and baseline. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 255–271.
23. Xu, H.; Guo, Z.H.; Wang, D.H.; Zhou, X.D.; Shi, Y. 2D License plate recognition based on automatic perspective rectification. In Proceedings of the IEEE International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; pp. 202–208.
24. Vasek, V.; Franc, V.; Urban, M. License plate recognition and super-resolution from low-resolution videos by convolutional neural networks. In Proceedings of the British Machine Vision Conferences, Newcastle, UK, 3–6 September 2018; p. 132.
25. Lee, Y.; Yun, J.; Hong, Y.; Lee, J.; Jeon, M. Accurate license plate recognition and super-resolution using a generative adversarial networks on traffic surveillance video. In Proceedings of the International Conference on Consumer Electronics Asia, Jeju, Korea, 24–26 June 2018; pp. 1–4.
26. Zhang, M.; Liu, W.; Ma, H. Joint license plate super-resolution and recognition in one multi-task Gan framework. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 1443–1447.
27. Li, H.; Wang, P.; Shen, C. Toward end-to-end car license plate detection and recognition with deep neural networks. IEEE Trans. Intell. Transp. Syst. 2019, 20, 1126–1136.
28. Zhang, L.; Wang, P.; Li, H.; Li, Z.; Shen, C.; Zhang, Y. A robust attentional framework for license plate recognition in the wild. IEEE Trans. Intell. Transp. Syst. 2021, 22, 6967–6976.
29. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 2 October 2021).
30. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.

Figure 1. Examples of LP recognition approaches in a real-world scenario. Our proposed HIFA-LPR model outperforms conventional approaches. “_” denotes a missing character recognition. Red character denotes incorrect prediction. Blue character denotes correct prediction. “GT” denotes ground truth. “Pred” denotes the prediction result.

Figure 2. Architecture of holistic-feature-extraction-based SR module.

Figure 3. (a) Diagram of the patch-based end-to-end training process. (b) Diagram of the holistic-feature-based end-to-end training process. In output images, red boxes denote the detected character region.

Figure 4. Examples of 𝓜 according to ε. As the hyper-parameter ε increases, more low-frequency components are masked. We empirically set ε to 0.2.

Figure 5. Architecture of HAB.

Figure 6. Flowchart of LP recognition module based on high-frequency augmentation.

Figure 7. Flowchart of gradual end-to-end learning method. (a) Step 1: Independent training of the SR module and LP recognition module. (b) Step 2: SR module training while freezing LP recognition module. (c) Step 3: LP recognition module training while freezing SR module. “GT” denotes ground truth. “Pred” denotes the prediction result.

Figure 8. SR results on the LP image of the UFPR dataset for scale factor (×4): (a) input low-quality legacy image (24 × 8), (b) bicubic (96 × 32), (c) MZSR, (d) DRN, (e) SwinIR, (f) HIFA-LPR (Step 1), (g) HIFA-LPR (Step 2), (h) HIFA-LPR (Step 3), (i) HR image.

Figure 9. LP recognition results on (a) input low-quality legacy image (24 × 8), (b) bicubic image (96 × 32), (c) MZSR result, (d) DRN result, (e) SwinIR result, (f) HIFA-LPR (Step 1) result, (g) HIFA-LPR (Step 2) result, (h) HIFA-LPR (Step 3) result, (i) HR image. “_” denotes a missing character. Red character denotes an incorrect prediction. Blue character denotes a correct prediction. “GT” denotes ground truth. “Pred” denotes the prediction result.

Figure 10. mAP comparison results for only numbers (0–9) on the UFPR dataset for scale factor (×4). (a–d) represent the mAP results on bicubic results, HIFA-LPR (Step 1) results, HIFA-LPR (Step 2) results, and HIFA-LPR (Step 3) results, respectively.

Figure 11. SR results on the LP image of the Greek vehicle dataset for scale factor (×4): (a) input low-quality legacy image (24 × 8), (b) Bicubic (96 × 32), (c) MZSR, (d) DRN, (e) SwinIR, (f) HIFA-LPR (Step 1), (g) HIFA-LPR (Step 2), (h) HIFA-LPR (Step 3), (i) HR image.

Figure 12. LP recognition results on the Greek vehicle dataset with (a) input low-quality legacy image (28 × 8), (b) bicubic result (112 × 32), (c) MZSR result, (d) DRN result, (e) SwinIR result, (f) HIFA-LPR (Step 1) result, (g) HIFA-LPR (Step 2) result, (h) HIFA-LPR (Step 3) result, (i) HR image. “_” denotes a missing character. Red character denotes an incorrect prediction. Blue character denotes a correct prediction. “GT” denotes ground truth. “Pred” denotes the prediction result.

Figure 13. mAP comparison results for only numbers (0–9) on the Greek vehicle dataset for scale factor (×4). (a–d) represent the mAP results on bicubic results, HIFA-LPR (Step 1) results, HIFA-LPR (Step 2) results, and HIFA-LPR (Step 3), respectively.

Figure 14. LP recognition results on the same SR result (80 × 60). (a) Tesseract-OCR result, (b) OpenALPR result, (c) Yolov5 result, (d) HIFA-LPR result, (e) HR image. “_” denotes a missing character. Red character denotes an incorrect prediction. Blue character denotes a correct prediction. “GT” denotes ground truth. “Pred” denotes the prediction result.

Table 1. Comparison between the HIFA-LPR model and other existing approaches for scale factor (×3) on the UFPR dataset.

PSNR and mAP on UFPR Validation Dataset
SR Module	Recognition Module	PSNR (dB)	mAP (%)
LR baseline	Tesseract-OCR [14]	-	17.8
Bicubic	Tesseract-OCR [14]	24.10	29.1
MZSR [11]	Tesseract-OCR [14]	17.64	22.7
DRN [8]	Tesseract-OCR [14]	25.11	33.3
SwinIR [12]	Tesseract-OCR [14]	25.66	36.4
LR baseline	Yolov5 [29]	-	49.9
Bicubic	Yolov5 [29]	24.10	50.1
MZSR [11]	Yolov5 [29]	17.64	43.7
DRN [8]	Yolov5 [29]	25.11	58.5
SwinIR [12]	Yolov5 [29]	25.66	61.1
HIFA-LPR (Step 1)		26.40	62.7
HIFA-LPR (Step 2)		27.20	67.7
HIFA-LPR (Step 3)		27.20	82.4

Table 2. Comparison between the HIFA-LPR model and other existing approaches for scale factor (×4) on the UFPR dataset.

PSNR and mAP on UFPR Validation Dataset
SR Module	Recognition Module	PSNR (dB)	mAP (%)
LR baseline	Tesseract-OCR [14]	-	12.7
Bicubic	Tesseract-OCR [14]	22.18	18.8
MZSR [11]	Tesseract-OCR [14]	17.99	11.2
DRN [8]	Tesseract-OCR [14]	22.74	19.7
SwinIR [12]	Tesseract-OCR [14]	23.33	21.8
LR baseline	Yolov5 [29]	-	32.4
Bicubic	Yolov5 [29]	22.18	35.8
MZSR [11]	Yolov5 [29]	17.99	27.3
DRN [8]	Yolov5 [29]	22.74	36.7
SwinIR [12]	Yolov5 [29]	23.33	38.1
HIFA-LPR (Step 1)		23.77	34.4
HIFA-LPR (Step 2)		23.91	42.2
HIFA-LPR (Step 3)		23.91	60.9

Table 3. Comparison between the HIFA-LPR model and other existing approaches for scale factor (×3) on the Greek vehicle dataset.

PSNR and mAP on Greek Vehicle Validation Dataset
SR Module	Recognition Module	PSNR (dB)	mAP (%)
LR baseline	Tesseract-OCR [14]	-	48.2
Bicubic	Tesseract-OCR [14]	21.33	60.3
MZSR [11]	Tesseract-OCR [14]	16.07	47.2
DRN [8]	Tesseract-OCR [14]	25.08	70.4
SwinIR [12]	Tesseract-OCR [14]	25.58	74.7
LR baseline	Yolov5 [29]	-	72.1
Bicubic	Yolov5 [29]	21.33	92.1
MZSR [11]	Yolov5 [29]	16.07	91.8
DRN [8]	Yolov5 [29]	25.08	95.9
SwinIR [12]	Yolov5 [29]	25.58	96.8
HIFA-LPR (Step 1)		21.43	91.9
HIFA-LPR (Step 2)		22.65	94.4
HIFA-LPR (Step 3)		22.65	98.3

Table 4. Comparison between the HIFA-LPR model and other existing approaches for scale factor (×4) on the Greek vehicle dataset.

PSNR and mAP on Greek Vehicle Validation Dataset
SR Module	Recognition Module	PSNR (dB)	mAP (%)
LR baseline	Tesseract-OCR [14]	-	27.4
Bicubic	Tesseract-OCR [14]	21.33	31.1
MZSR [11]	Tesseract-OCR [14]	16.07	22.3
DRN [8]	Tesseract-OCR [14]	25.08	52.1
SwinIR [12]	Tesseract-OCR [14]	25.58	55.1
LR baseline	Yolov5 [29]	-	57.1
Bicubic	Yolov5 [29]	19.63	77.3
MZSR [11]	Yolov5 [29]	15.96	74.1
DRN [8]	Yolov5 [29]	23.83	87.5
SwinIR [12]	Yolov5 [29]	23.61	89.1
HIFA-LPR (Step 1)		20.01	75.2
HIFA-LPR (Step 2)		20.60	80.6
HIFA-LPR (Step 3)		20.60	90.5

Table 5. Comparison between the patch extraction-based SR and the holistic extraction-based SR for scale factor (×4) on the UFPR dataset.

PSNR and mAP on UFPR Validation Dataset
SR Module	Recognition Module	PSNR (dB)	mAP (%)
Patch-extraction-based SR	Yolov5 [29]	23.77	34.4
Holistic-extraction-based SR	Yolov5 [29]	24.65	44.7

Table 6. Comparison between the LP recognition module based on high-frequency augmentation and other modules for scale factor (×4) on the UFPR dataset.

mAP on UFPR Validation Dataset
Method	mAP (%)
Tesseract-OCR [14]	19.1
Yolov5 [29]	34.4
LP recognition module without high-frequency augmentation	57.2
LP recognition module with high-frequency augmentation	60.9

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
