4.1. Setup
We used the EDSR model with 16 residual blocks [12] as the backbone for our experiments and modified the open-source code [26] implemented using the TensorFlow library. The network was trained from scratch (the weights were initialized using the Xavier method) using the Adam optimizer (the learning rate, momentum terms, and batch size were fixed across all experiments), with an upscale factor of 4 and a total of T = 100,000 training steps, on a single RTX 3090 GPU. Finally, the weights with the highest PSNR were saved.
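To make the training procedure concrete, the following is a minimal sketch of such a loop in TensorFlow. Here, `model` is assumed to be the 16-block EDSR backbone (Xavier initialization corresponds to `tf.keras.initializers.GlorotUniform` when building it), `validate` is a hypothetical helper computing the validation PSNR, and the learning-rate and momentum values are common illustrative defaults, not values confirmed by this section.

```python
import tensorflow as tf

# Minimal training-loop sketch (assumptions: `validate` computes validation
# PSNR; the optimizer settings below are illustrative placeholders).
def train(model, dataset, validate, steps=100_000):
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=1e-4,            # placeholder, not the paper's value
        beta_1=0.9, beta_2=0.999)      # placeholder momentum terms
    best_psnr = -1.0
    for step, (lr_img, hr_img) in enumerate(dataset.take(steps)):
        with tf.GradientTape() as tape:
            sr_img = model(lr_img, training=True)
            loss = tf.reduce_mean(tf.abs(hr_img - sr_img))  # L1 pixel loss
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if (step + 1) % 1000 == 0:
            psnr = validate(model)
            if psnr > best_psnr:       # keep the weights with the highest PSNR
                best_psnr = psnr
                model.save_weights("best_edsr.weights.h5")
```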
We used two training datasets, Old book Chinese character OCR (OBCC-OCR) [27] and DIV2K [28], with 1108 and 1000 images, respectively, and two test datasets, IAM-HistDB [29] and DIVA-HisDB [30], consisting of 127 and 120 images, respectively. The OBCC-OCR, IAM-HistDB, and DIVA-HisDB datasets contain only scanned text images of handwritten or printed historical manuscripts written in various languages, including Chinese, German, English, and Latin (see Figure 2). The DIV2K dataset, consisting of images of various categories, was used to show the usefulness of semantic SR. A total of 10% of the images in the OBCC-OCR and DIV2K datasets were used for validation, and another 10% were used for testing. The original images were used as HR images and were downscaled using bicubic interpolation with antialiasing to create the LR images. The HR images were randomly cropped into patches of size 96 × 96, rotated, and flipped for training; the cropping size had little influence on the experimental results. In Equation (1), the depth of the feature maps was 512.
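As an illustration of this data preparation, the sketch below builds one LR-HR training pair in TensorFlow; the ×4 scale follows the setup above, while the ordering of cropping, flipping, and rotation is an assumption.

```python
import tensorflow as tf

SCALE, PATCH = 4, 96  # x4 upscaling, 96 x 96 HR patches

def make_pair(hr_img):
    # hr_img: a float32 HR image tensor of shape [H, W, 3].
    hr = tf.image.random_crop(hr_img, [PATCH, PATCH, 3])   # random 96x96 crop
    hr = tf.image.random_flip_left_right(hr)               # random flip
    hr = tf.image.rot90(hr, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    # Bicubic downscaling with antialiasing creates the LR input.
    lr = tf.image.resize(hr, [PATCH // SCALE, PATCH // SCALE],
                         method="bicubic", antialias=True)
    return lr, hr
```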
For performance analysis, we computed the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) to evaluate the quality of the SR images. The PSNR and SSIM values reported in the tables below are averaged over all images in each test dataset.
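Both metrics are available directly in TensorFlow; a minimal sketch of the per-dataset evaluation might look as follows.

```python
import tensorflow as tf

def evaluate(sr_batch, hr_batch, max_val=255.0):
    # Per-image PSNR and SSIM, averaged over the test dataset as in the tables.
    psnr = tf.image.psnr(sr_batch, hr_batch, max_val=max_val)
    ssim = tf.image.ssim(sr_batch, hr_batch, max_val=max_val)
    return tf.reduce_mean(psnr), tf.reduce_mean(ssim)
```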
4.2. Results and Discussion
Table 1, where the EDSR model was trained using the OBCC-OCR dataset, shows that the semantic loss improved the quality of the text SR images. The improvement was modest (there was little difference in text readability between the SR images generated with and without the semantic loss), but it was consistent. Without the semantic loss, as shown in Figure 3 and Figure 4, text-unrelated textures may appear inside the letters, or the letters may be more severely distorted, reducing text readability. This indicates that the semantic loss provided the SR network with text-specific semantic information, learned from other images in the same category, that is useful for upscaling text images. However, when the EDSR model was trained using the DIV2K dataset, as shown in Table 2, the quality of the SR images deteriorated with the semantic loss. This is because the training images do not belong to the same semantic category, and the semantic loss incorrectly enforces the SR images of different categories to be semantically similar.
When the semantic loss was not used, the SR results on IAM-HistDB and DIVA-HisDB were better with the DIV2K training dataset than with the OBCC-OCR training dataset. This is because, without the semantic loss, the model trained on OBCC-OCR over-fits to that dataset and becomes ineffective for the other text image datasets, even though they are semantically equivalent. Furthermore, the OBCC-OCR images do not have textures as sufficient or varied as the DIV2K images, so textures that do not exist in the OBCC-OCR images (OBCC-OCR, IAM-HistDB, and DIVA-HisDB are all text image datasets but can include locally or partially different textures) can be recovered more effectively using the DIV2K training dataset. The same holds for the DIV2K SR results.
In most results, the visual enhancement of the SR images with the semantic loss could not be clearly observed with the naked eye, as shown in Figure 3 and Figure 4. To visualize the enhancement indirectly, we computed the difference between the SR images generated with and without the semantic loss, as shown in Figure 5. The difference images represent the additional information obtained by adopting the semantic loss. They were bright in the text regions, as expected, and this held true even when the DIV2K training dataset was used. However, we could deduce that the text-specific information was not correctly learned when using the DIV2K training dataset, because the difference was unclear in some text regions, especially colored or light text regions. The difference was more noticeable when using the OBCC-OCR training dataset, indicating that the text-specific information was well learned, improving the quality of the text SR images.
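A sketch of how such difference images can be produced is given below; the use of the absolute per-pixel difference and max-normalization is our assumption about the visualization, not a procedure stated in the text.

```python
import tensorflow as tf

def difference_image(lr_img, model_with_sem, model_without_sem):
    # Run the same LR input through both trained models.
    sr_sem = model_with_sem(lr_img[tf.newaxis], training=False)[0]
    sr_base = model_without_sem(lr_img[tf.newaxis], training=False)[0]
    diff = tf.abs(sr_sem - sr_base)             # assumed: absolute difference
    return diff / (tf.reduce_max(diff) + 1e-8)  # normalize so text regions appear bright
```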
Figure 6 shows the SR results of non-text images in DIV2K when the OBCC-OCR training dataset was used. Overall, the image quality was poor because many of the image textures in such images are absent from OBCC-OCR. Furthermore, we discovered that the network trained with the semantic loss produced visual artifacts in the non-text SR images even though the PSNR and SSIM values in Table 1 increased. This is understandable given that the proposed method may degrade the SR results of non-text images by optimizing the SR process for text images.
As shown in Figure 7, we discovered that the effect of the semantic loss was more visible when the image resolution was low. By introducing the semantic loss into the OBCC-OCR results, the PSNR values increased by 0.3 dB when producing 200 × 300 SR images from 50 × 75 LR images, but only by 0.03 dB when producing 1500 × 2400 SR images from 375 × 600 LR images. This is presumably because, at lower resolutions, semantic information is lost more slowly than pixel information, so the semantic loss can contribute relatively more to the reconstruction.
Given that the proposed method's performance depends on the weight of the semantic loss in Equation (2), we set it to 0.006 in our experiments, as mentioned before. This is because, as shown in Table 3, both smaller and larger weights yielded lower PSNR and SSIM values; at the largest settings, the PSNR and SSIM values were even lower than when the semantic loss was not used. From these results, we believe that a small weight makes it difficult to learn semantic information, whereas a large weight prevents the recovery of local details at the pixel level.
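For reference, a hedged sketch of how such a weighted objective can be composed is shown below, assuming Equation (2) adds the semantic loss of Equation (1), scaled by the weight 0.006, to a pixel-wise L1 term; `semantic_loss` is a placeholder for Equation (1).

```python
import tensorflow as tf

LAMBDA = 0.006  # the weight chosen via Table 3

def total_loss(sr_img, hr_img, semantic_loss):
    # Assumed form of Equation (2): pixel-wise L1 plus the weighted
    # semantic loss (Equation (1), supplied as `semantic_loss`).
    pixel = tf.reduce_mean(tf.abs(sr_img - hr_img))
    return pixel + LAMBDA * semantic_loss(sr_img, hr_img)
```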
Finally, we compared the proposed method with the multi-class GAN-based one [5]. To ensure fairness, the EDSR model was used as the generator in the GAN-based method, a single discriminator was trained for text images, and the hyperparameters common to both methods (such as the learning rate, momentum terms, and batch size) were set to the same values. (We did not fine-tune the hyperparameters for each method; however, they were set similarly to most SR studies, fine-tuning them does not significantly affect the performance of either method, and it is not the main concern of this study.) In the GAN-based method, the weighting factors and the margin were set as given in [5]. This comparison shows which approach better guides the EDSR model in learning text-related semantic information. As shown in Table 4, the proposed method outperformed the GAN-based method when different datasets were used in the training and test phases. For the DIVA-HisDB dataset, the GAN-based method performed worse than even the naive EDSR model (i.e., the model trained without the semantic loss in Table 1). However, for the OBCC-OCR dataset, the GAN-based method outperformed the proposed method. This suggests that the GAN-based method over-fits to the training dataset and is ineffective for extracting and learning semantic information from it.