Article

Efficient Comic Content Extraction and Coloring Composite Networks

Department of Computer Engineering, Gachon University, 1342 Seongnamdaero, Sujeong-gu, Seongnam-si 461-701, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2641; https://doi.org/10.3390/app15052641
Submission received: 21 January 2025 / Revised: 16 February 2025 / Accepted: 24 February 2025 / Published: 28 February 2025

Abstract

Comics are widely loved by fans around the world as a form of visual art and cultural communication. With the development of digitalization, automated comic content detection and segmentation and comic coloring systems have become important research directions for digital archiving, automatic translation, and visual content analysis. This paper proposes a composite network for efficient content extraction and colorization, comprising a comic extraction module and a comic colorization module based on an improved Generative Adversarial Network. It addresses the limited functionality and weak performance of previous models. In various performance comparison experiments, our model shows excellent and robust performance.

1. Introduction

Comics are popular all over the world; they combine images and text to convey information while encouraging immersive imagination. Compared to monotonous text, the combination of visual elements and text makes information dissemination more efficient, and comics are widely used in entertainment, education, advertising, and other fields. The rapid development of digital network technology has led to the rapid dissemination of digital content such as text, pictures, videos, and music in a wide range of contexts. The presentation of comics has also undergone a radical change: from the past, when paper comics dominated the mainstream market, to the present, where digital comics are the main mode of distribution.
Comics are a complex document type that integrates images and text, and the analysis, recognition, and extraction of their internal components is an interesting and challenging task. Usually, comics consist of multiple pages with mixed-layout features. Each page typically contains multiple content boxes, each holding dialog boxes, characters, scenes, and other content of different shapes. In some comics, the characters or dialog elements overflow the content box and penetrate the gaps between panels, called "gutters". This free-form document layout makes intelligent recognition and extraction very difficult. Traditional comic segmentation technology uses basic features such as image edges, colors, and textures to segment elements in comic pages. Usually, the input image is converted to grayscale and binarized, and edge detection algorithms (such as Sobel and Canny) are used to detect the edges of dialog boxes, frames, and characters. Contour tracking or the Hough transform is used to determine the position of the frames, and the positions of dialog boxes, characters, and other graphic elements are determined from the geometric shape of the contours. Although these traditional methods can segment comics with regular frames to a certain extent, they perform poorly on complex and irregular comic pages, are easily affected by noise or complex backgrounds, and cannot accurately detect and segment the diverse elements in comics, such as the characters and dialogue bubbles in each frame. With the development of artificial intelligence and the widespread application of convolutional networks [1], more and more researchers have begun to use deep learning methods to process comics. Kondo and Hasegawa [2] used a CNN-based model to classify cartoon characters in illustrations and identify fake paintings. Qin et al. [3] used Faster R-CNN [4] to perform face detection on characters in comics, showing relatively good performance. Xin and Ma [5] used the YOLOv3 [6] model to detect dialogue bubbles in comics and recognize text, further promoting the development of comics detection. Soykan et al. [7] designed an identity-aware semi-supervised framework that exploits the inherent similarity between facial and body representations to achieve comic character re-identification. Hinami et al. [8] designed a context-aware comic translation framework that translates Japanese comics by sensing the text content in the dialogue bubbles, showing good results. Sharma and Kukreja [9] proposed a CNN-SVM hybrid model to classify comic content, making it easier for readers to read efficiently. Most of these studies involve only the simple detection of comics or the recognition of comic characters, and rarely involve accurate extraction of the overall content of comics. Extraction of comic content is crucial for the secondary re-creation of comics, for reducing unnecessary duplication of content creation, and for coloring comics. Therefore, in this paper, we design a module that can accurately extract comic content, helping comic practitioners reduce their workload and improve efficiency.
The task of coloring comic images is very important and challenging in comic projects. Traditional manual methods are time-consuming, labor-intensive, and monotonous. Comic images often contain complex lines and details, which require coloring algorithms to accurately understand the image semantics and keep the edges clear, something standard convolutional networks struggle to do. However, with the iteration of technology, the emergence of generative adversarial networks (GANs) [10] has given researchers hope. Shimizu et al. [11] proposed a semi-automatic GAN-based colorization method that learns the drawing style of a specific comic from a small amount of training data; this method solves the problem of comic character coloring to a certain extent. Liu et al. [12] proposed a feature zoom colorization network, Zoom-GAN, to solve the multi-scale object colorization problem; a zoom instance normalization (Zoom-IN) module was designed in the generator to improve the colorization performance of small object areas under different scene conditions. Parmar et al. [13] proposed pix2pix-zero, an image-to-image translation method that preserves the content of the original image without manual hints and converts sketches into color images.
As a unique art form, comics are complex and time-consuming to create, especially in the process of content segmentation and coloring. With the development of computer vision and image processing technology, automated comic content segmentation and coloring technology have become possible. This paper aims to explore methods for comic content segmentation and automatic coloring, proposing a novel and efficient composite network for the automatic coloring and content extraction of black-and-white comics to reduce the burden on comic creators and improve their creation efficiency. Our main contributions are as follows:
  • We designed a comic feature extraction network based on ResNet [14], incorporating an RoIAlign layer and a subdivision layer to improve the detection and extraction of comic content detail features.
  • We created a GAN-based comic coloring network module. Using U-Net as the generator and PatchGAN as the discriminator, the perception of image details is enhanced, the overall color performance of the image is improved, and the generated color comic images are comparable to those manually colored by comic artists.
  • We created a dataset, KComics5000, based on Korean comics, consisting of 5000 comic pages.

2. Related Work

Comic content segmentation and extraction: In the traditional field of comic segmentation and extraction, specific areas in comics are extracted manually through dedicated software, or comic images are segmented using traditional image processing techniques such as edge detection, threshold segmentation, and region growing. These methods work well when processing comics with simple backgrounds, but the segmentation results are often not ideal for complex backgrounds and diverse characters. In recent years, a large number of deep-learning-based models have been widely used in image detection, recognition, and other fields, achieving good results. Deep convolutional neural networks (CNNs) can learn the hierarchical structure of image features and have been proven to outperform algorithms that use carefully designed hand-crafted features in many visual recognition and localization tasks. Deep neural network models are widely deployed in various industries, but in the comics industry, there are relatively few studies that use deep neural networks to analyze and process comic publications; existing work mainly detects simple borders and dialog boxes in comics and recognizes the dialogue content. For example, Arai and Tolle [15] used an improved connected component labeling (CCL) algorithm as a comic blob extraction function to detect speech bubbles in comics and recognize the text in the bubbles, thereby automatically extracting Japanese characters from comics. Rigaud et al. [16] combined topology and spatial positioning, using the text positions in the comics as coordinates to achieve pixel-level segmentation of the speech bubbles. Nguyen et al. [17] proposed the Sequencity612 comics dataset and used the YOLOv2 [18,19] model to detect specific characters in the comics. Dutta et al. [20] developed a CNN model that connects edge and semantic information to effectively detect narrative text boxes and speech bubbles of various shapes, together with their tails. Dubray and Laubrock [21] designed a fully convolutional network [22] based on the U-Net [23] architecture to predict pixel-based image segmentation and segment speech bubbles in comics. These studies involve only the simple detection of dialog bubbles or comic frames (rectangular boxes); few researchers have studied the detection of full comic contents, and even fewer have focused on the detailed segmentation of the characters appearing in them. Addressing these shortcomings, we designed a module to automatically detect the content of comics and segment it meticulously.
Comic automatic coloring: With the booming development of the comics industry, the coloring stage of comic creation, which is highly repetitive and requires little creativity, is gradually becoming a bottleneck that restricts efficiency. The traditional manual coloring method is time-consuming and labor-intensive, and it is difficult to meet the growing creative needs of the industry. Therefore, automated comic coloring is as important as comic content detection and segmentation in comic projects. Chen et al. [24] matched the local features between the target image and the reference image based on an active learning framework and used mixed-integer quadratic programming (MIQP) to refine the matching results by considering the context structure, achieving the goal of coloring specific characters in comics. Dou et al. [25] proposed a novel dual-color space-guided generative adversarial network (DCSGAN), which considers the complementary information contained in the RGB and HSV color spaces and generates vivid color comic character images through pixel-level supervision. Liu et al. [26] extracted the color style of a reference image and used adaptive instance normalization (AdaIN) to fuse the extracted color style features into the deep hierarchical representation of the sketch, achieving comic character coloring. Although these methods achieve partial coloring of comics and improve coloring speed to a certain extent, they suffer from incomplete coloring of the global content of comics, monotonous coloring, and poor overall naturalness in the image. Here, we design a composite network model that combines comic content detection and segmentation with automatic coloring. By detecting the features of the various elements in the comic, we determine accurate feature areas and then perform accurate coloring, achieving quality that is not inferior to hand coloring.

3. Methods

This section mainly introduces the proposed composite network for automatic colorization and content extraction of black-and-white comics. As shown in Figure 1, the network consists of a comic extraction module and a comic colorization module.

3.1. Comics Intelligent Detection and Recognition Framework

In the comic content extraction module, as shown in Figure 2, we designed a comic content segmentation network model. The model uses ResNet50 as the backbone network; its skip connections, which alleviate the gradient vanishing problem in deep network training, are used to accurately extract features $F_i$ from comics. A region proposal network generates candidate regions for comic frames, dialog bubbles, characters, and so on. Drawing on the Mask R-CNN framework [27], an RoIAlign layer is added to retain the original feature data without quantization, obtaining more faithful RoI features and improving the segmentation of characters and irregular dialog bubbles in comics. However, the standard Mask R-CNN is not ideal for edge prediction in image segmentation tasks: the generated mask is only 28 × 28, and the stretched mask cannot accurately match the detailed features of the characters in the comics, making fine feature extraction difficult.
Therefore, we added an adaptive subdivision layer to the model. As shown in Figure 3, the adaptive subdivision layer predicts segmentation labels by adaptively selecting points in the image plane, improving the anti-aliasing of object edges. This strategy is inspired by the classic adaptive subdivision technique in computer graphics [28], which can effectively improve the quality of the masks generated by image segmentation.
Here, we use a multilayer perceptron for the subdivision to overcome the local receptive field limitation and improve global feature extraction; this is combined with upsampling so that information across scales can blend, improving the quality of comic target extraction. This can be expressed as:
$$P_i = \mathrm{ReLU}(W_i F_i + b_i) + \mathrm{Upsample}(P_{i+1})$$
where $W_i$ is the MLP weight matrix, $b_i$ is the bias term, and $F_i$ is the feature extracted by ResNet.
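For concreteness, this refinement step can be sketched in PyTorch as follows; the module name, the channel sizes, and the use of a 1 × 1 convolution to realize the per-point MLP are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubdivisionRefine(nn.Module):
    """Sketch of one refinement level:
    P_i = ReLU(W_i F_i + b_i) + Upsample(P_{i+1})."""

    def __init__(self, feat_channels: int, mask_channels: int = 1):
        super().__init__()
        # A 1x1 conv applies the same MLP (W_i, b_i) at every spatial point.
        self.mlp = nn.Conv2d(feat_channels, mask_channels, kernel_size=1)

    def forward(self, feat_i: torch.Tensor, coarse_next: torch.Tensor) -> torch.Tensor:
        # Local term: ReLU(W_i F_i + b_i)
        local = F.relu(self.mlp(feat_i))
        # Upsample the coarser prediction P_{i+1} to this level's resolution and fuse.
        up = F.interpolate(coarse_next, size=feat_i.shape[-2:],
                           mode="bilinear", align_corners=False)
        return local + up
```

Stacking several such levels, each doubling the mask resolution, yields progressively sharper edges than a fixed 28 × 28 mask.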
The total loss function of the comic extraction module is:
$$L = L_{\mathrm{RPN}} + L_{\mathrm{ROI\_CLS}} + L_{\mathrm{ROI\_BOX}} + L_{\mathrm{MASK}}$$
where $L_{\mathrm{RPN}}$ is the candidate box loss, $L_{\mathrm{ROI\_CLS}}$ is the target classification loss, $L_{\mathrm{ROI\_BOX}}$ is the bounding box loss, and $L_{\mathrm{MASK}}$ is the instance segmentation loss.

3.2. Comic Coloring Module

Comic coloring is an essential part of the comic project. Here, we designed a GAN-based coloring network module, as shown in Figure 4. In this network module, the U-Net structure is used as the generator for comic image coloring. In addition, the comic mask features extracted by the comic extraction module are introduced into the generator to reduce the amount of computation while enhancing feature generation. In the discriminator, we use PatchGAN: the generated colored image is treated as multiple patches, and a multi-receptive-field method is used for discrimination, which enhances attention to image details and improves the discriminator's ability.
Generator $G$ aims to generate a realistically colored image from the black-and-white comic image $x$ and the mask feature $m$ provided by the extraction module. The output of the generator is:
$$\hat{y} = G(x, m)$$
where $x$ is the black-and-white comic image, $m$ is the mask feature calculated by the comic extraction module, and $\hat{y}$ is the color image output by the generator.
The discriminator $D$ determines whether a local region is real; given an input image $y$ (real image) or $\hat{y}$ (generated image), it outputs a real/fake value for each local patch. We define the discriminator's judgment of an input image $I$ as:
$$D(I) = \sigma(f(I))$$
where $\sigma$ is the sigmoid activation function, which outputs the probability that a local patch of the image is real (1) or fake (0), and $f(I)$ is the feature extracted by the discriminator $D$ from the input image $I$.
This network module uses binary cross-entropy loss to distinguish between real and fake images, and its adversarial loss is expressed as:
$$\mathcal{L}_{\mathrm{GAN}}(G, D) = \mathbb{E}[\log D(x, y)] + \mathbb{E}[\log(1 - D(x, G(x)))]$$
where G is a generator and D is a discriminator in the GAN.
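Putting the generator, discriminator, and adversarial loss together, a minimal PyTorch sketch might look like the following; the channel concatenation used to condition $G$ and $D$ on $x$ and $m$ is an assumption, since the exact fusion operation is not spelled out here.

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G, x, m, y):
    """Sketch of the BCE adversarial objective. D outputs a map of patch
    logits (PatchGAN style); each patch is labeled real (1) or fake (0).
    Concatenating x, m, and y along channels is an assumed conditioning."""
    y_hat = G(torch.cat([x, m], dim=1))                      # y_hat = G(x, m)
    real_logits = D(torch.cat([x, y], dim=1))                # D(x, y)
    fake_logits = D(torch.cat([x, y_hat.detach()], dim=1))   # D(x, G(x)), no grad to G
    # Discriminator: maximize E[log D(x,y)] + E[log(1 - D(x, G(x)))]
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    # Generator: fool D into scoring the fake patches as real
    g_logits = D(torch.cat([x, y_hat], dim=1))
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```

The binary cross-entropy on logits is numerically stabler than applying a sigmoid followed by a log, which is why the sketch keeps `D` returning raw logits.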

4. Results

4.1. Dataset and Implementation Details

In the selection of datasets, we selected the commonly used eBDtheque [29], Manga109 [30], and DCM [31], and created a dataset based on Korean comics, KComics5000. Samples of the datasets are shown in Figure 5.
eBDtheque: The eBDtheque database is a selection of one hundred comic pages from America, Japan (manga), and Europe. It contains 850 panels, 1092 dialogue bubbles, 1550 comic characters, and more.
Manga109: Manga109 consists of 109 volumes of manga drawn by professional Japanese manga artists. Commercially available to the public from the 1970s to the 2010s, these comics cover a wide range of target audiences and genres. Each comic page has multiple comic frames (content frames), and the content in each frame is different.
KComics5000: We collected a total of 5000 Korean comic pages, including more than 25,000 panels, 20,000 dialogue bubbles, and more than 10,000 comic characters.
DCM: The dataset consists of 772 annotated images from 27 Golden Age comic books. These images were collected for free from the Digital Comic Museum’s free public domain collection of digitized comic books. Ground truth bounding boxes were made for all panels and all characters (body + face), whether they are small or large, human-like or animal-like.
Before training, the comics must be pre-processed. As shown in Figure 6, we label the frames (boxes), dialogue bubbles, and characters appearing on each comic page. The size of a comic page image differs from general image sizes; here, we use a standard 840 × 640 resolution for a single-page image, which is convenient for later model training. Then, 80% of the dataset is randomly selected for model training and 20% is used for model testing. All experiments are trained and tested on a server PC equipped with an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an AMD Threadripper 2950X CPU (Advanced Micro Devices, Inc., Santa Clara, CA, USA), and 64 GB of RAM.
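The random 80/20 split described above can be sketched as follows; the fixed seed and the shuffle-then-slice scheme are illustrative assumptions, not details from the paper.

```python
import random

def split_dataset(paths, train_frac=0.8, seed=0):
    """Shuffle the page list with a fixed seed, then slice off the first
    train_frac as the training set and keep the rest for testing."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)   # seeded, so the split is reproducible
    k = int(len(paths) * train_frac)
    return paths[:k], paths[k:]
```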

4.2. Overall Performance

Currently, most of the research on comics is limited to single content detection, text extraction, or simple coloring processing, and almost no system model can comprehensively and systematically process the full content of comics. To solve these problems, we proposed a composite network for automatic coloring and content extraction of black-and-white comics. As shown in Figure 7, by inputting black-and-white comics into the network, the characters and dialogue bubbles in the comics can be accurately extracted, and the color image quality output can be comparable to that of manual coloring conducted by comic staff.
The composite network for automatic colorization and content extraction of black-and-white comics proposed in this study includes a comic extraction module and a comic colorization module, so it is necessary to conduct comparative verification experiments separately. The comparative experiment of the comic extraction network module verifies the comprehensive detection capabilities of frames (boxes), dialogue bubbles, and characters in comics. The comic extraction network module we proposed is compared with other excellent models. As shown in Table 1, our proposed module performs more efficiently.
As shown in Figure 8, the comic content extraction module we designed can show excellent performance on various comics, especially in extracting complex character feature details in comics.
In the comic coloring verification experiment, we use FID and SSIM as performance indicators for the generated color comic images. FID (Fréchet Inception Distance) measures the similarity between the feature distribution of the original images and that of the images produced by the generative model; the smaller the distance, the better the generative model, i.e., the generated images have high clarity and rich diversity. The following equation is used to calculate the FID:
$$\mathrm{FID}(x, g) = \left\| \mu_x - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_x + \Sigma_g - 2\left( \Sigma_x \Sigma_g \right)^{1/2} \right)$$
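A NumPy sketch of this FID computation is shown below. The trace term $\mathrm{Tr}((\Sigma_x \Sigma_g)^{1/2})$ is evaluated via the equivalent symmetric form $\mathrm{Tr}((\Sigma_g^{1/2} \Sigma_x \Sigma_g^{1/2})^{1/2})$, and the inputs are assumed to be the Inception-feature means and covariances of the two image sets.

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # clamp tiny negative eigenvalues from numerics
    return (v * np.sqrt(w)) @ v.T

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID = ||mu_x - mu_g||^2 + Tr(Sx + Sg - 2 (Sx Sg)^(1/2)).
    Tr((Sx Sg)^(1/2)) equals Tr((Sg^(1/2) Sx Sg^(1/2))^(1/2)), which keeps
    every intermediate matrix symmetric PSD."""
    diff = mu_x - mu_g
    sg_half = _sqrtm_psd(sigma_g)
    covmean = _sqrtm_psd(sg_half @ sigma_x @ sg_half)
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```

Identical feature distributions give an FID of zero, and the value grows as either the means or the covariances drift apart.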
SSIM (structural similarity index) is a widely used image quality evaluation indicator. It is based on the assumption that the human eye extracts structural information from an image, and it measures the similarity between two images. It is calculated using the following equation:
$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \, [c(x, y)]^{\beta} \, [s(x, y)]^{\gamma}$$
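As an illustration, a global (single-window) version of this formula can be computed as follows; production implementations use a sliding Gaussian window, and the stabilizing constants follow the common convention from the SSIM literature, which is an assumption here.

```python
import numpy as np

def ssim(x, y, alpha=1.0, beta=1.0, gamma=1.0, L=255.0):
    """Global SSIM = l^alpha * c^beta * s^gamma over two equal-sized images.
    l, c, s compare luminance, contrast, and structure respectively."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    C3 = C2 / 2.0
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)          # luminance
    c = (2 * np.sqrt(vx) * np.sqrt(vy) + C2) / (vx + vy + C2)  # contrast
    s = (cov + C3) / (np.sqrt(vx) * np.sqrt(vy) + C3)          # structure
    return l ** alpha * c ** beta * s ** gamma
```

Two identical images score 1.0; any luminance, contrast, or structural difference pulls the score below 1.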
As shown in Table 2, in the image quality evaluation based on FID and SSIM, our proposed comic coloring network module outperforms the other models.
As shown in Figure 9, the coloring results of black-and-white comics produced by our composite network for comic coloring and content extraction demonstrate excellent performance in both the overall coloring of comics and the coloring of character details.

5. Discussion and Conclusions

Comic processing technology has made significant progress in recent years, shifting from traditional hand-painted craft to digital processing, which has greatly improved the efficiency of creation and the quality of work. However, in the tasks of accurate comic content extraction and high-quality comic coloring, performance has remained unsatisfactory. To solve these problems, in this paper, we proposed a new composite network for the automatic coloring and content extraction of black-and-white comics, which includes a redesigned comic extraction module and a comic coloring module based on a modified GAN. We verified each module separately. In the verification experiment of the comic extraction module, compared with other excellent networks, our proposed module achieved the highest scores of 95.3%, 98.8%, and 98.5% in Recall, Precision, and F1 score, respectively. In comic coloring, our module obtained excellent scores of 24.58 and 0.94 in the image quality evaluation based on FID and SSIM. Through many comparative experiments with other excellent research models, our proposed network proved more efficient and robust in terms of the accuracy of comic content detection and segmentation and the quality of generated color images.

Author Contributions

Conceptualization, Q.M. and Y.-I.C.; methodology, software, Q.M.; validation, Q.M. and Y.-I.C.; formal analysis, Q.M.; investigation, Q.M.; resources, Q.M. and Y.-I.C.; data curation, Q.M. and Y.-I.C.; writing—original draft preparation, Q.M.; writing—review and editing, Q.M. and Y.-I.C.; visualization, Q.M.; supervision, Q.M. and Y.-I.C.; project administration, Q.M. and Y.-I.C.; funding acquisition, Y.-I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Korea Agency for Technology and Standards in 2022, project number 1415180835.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets utilized in this article are open-source and publicly available for researchers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  2. Kondo, K.; Hasegawa, T. CNN-based criteria for classifying artists by illustration style. In Proceedings of the 2020 2nd International Conference on Image, Video and Signal Processing, Singapore, 20–22 March 2020; pp. 93–98. [Google Scholar]
  3. Qin, X.; Zhou, Y.; He, Z.; Wang, Y.; Tang, Z. A Faster R-CNN based method for comic characters face detection. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 1074–1080. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, USA, 7–12 December 2015; Volume 28. [Google Scholar]
  5. Xin, H.; Ma, C. Comic text detection and recognition based on deep learning. In Proceedings of the 2021 3rd International Conference on Applied Machine Learning (ICAML), Changsha, China, 23–25 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 20–23. [Google Scholar]
  6. Xianbao, C.; Guihua, Q.; Yu, J.; Zhu, Z. An improved small object detection method based on YOLOv3. Pattern Anal. Appl. 2021, 24, 1347–1355. [Google Scholar] [CrossRef]
  7. Soykan, G.; Yuret, D.; Sezgin, T.M. Identity-aware semi-supervised learning for comic character re-identification. arXiv 2023, arXiv:2308.09096. [Google Scholar]
  8. Hinami, R.; Ishiwatari, S.; Yasuda, K.; Matsui, Y. Towards fully automated manga translation. Proc. AAAI Conf. Artif. Intell. 2021, 35, 12998–13008. [Google Scholar] [CrossRef]
  9. Sharma, V.; Kukreja, V. Visual Narratives Unveiled: Comic Genre Classification through CNN-SVM Fusion. In Proceedings of the 2024 4th International Conference on Innovative Practices in Technology and Management (ICIPTM), Noida, India, 21–23 February 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  10. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  11. Shimizu, Y.; Furuta, R.; Ouyang, D.; Taniguchi, Y.; Hinami, R.; Ishiwatari, S. Painting style-aware manga colorization based on generative adversarial networks. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1739–1743. [Google Scholar]
  12. Liu, Y.; Guo, Z.; Guo, H.; Xiao, H. Zoom-GAN: Learn to colorize multi-scale targets. Vis. Comput. 2023, 39, 3299–3310. [Google Scholar] [CrossRef]
  13. Parmar, G.; Kumar Singh, K.; Zhang, R.; Li, Y.; Lu, J.; Zhu, J.Y. Zero-shot image-to-image translation. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–11. [Google Scholar]
  14. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  15. Arai, K.; Tolle, H. Method for real time text extraction of digital manga comic. Int. J. Image Process. (IJIP) 2011, 4, 669–676. [Google Scholar]
  16. Rigaud, C.; Burie, J.C.; Ogier, J.M. Text-independent speech balloon segmentation for comics and manga. In Graphic Recognition, Current Trends and Challenges: 11th International Workshop, GREC 2015, Nancy, France, 22–23 August 2015; Revised Selected Papers 11; Springer: Cham, Switzerland, 2017; pp. 133–147. [Google Scholar]
  17. Nguyen, N.; Rigaud, C.; Burie, J. Comic characters detection using deep learning. In Proceedings of the 2nd International Workshop on coMics Analysis, Processing, and Understanding, 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, 9–15 November 2017; pp. 41–46. [Google Scholar]
  18. Han, X.; Chang, J.; Wang, K. You only look once: Unified, real-time object detection. Procedia Comput. Sci. 2021, 183, 61–72. [Google Scholar] [CrossRef]
  19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  20. Dutta, A.; Biswas, S.; Das, A.K. CNN-based segmentation of speech balloons and narrative text boxes from comic book page images. Int. J. Doc. Anal. Recognit. (IJDAR) 2021, 24, 49–62. [Google Scholar] [CrossRef]
  21. Dubray, D.; Laubrock, J. Deep CNN-based speech balloon detection and segmentation for comic books. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1237–1243. [Google Scholar]
  22. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  24. Chen, S.Y.; Zhang, J.Q.; Gao, L.; He, Y.; Xia, S.; Shi, M.; Zhang, F.L. Active colorization for cartoon line drawings. IEEE Trans. Vis. Comput. Graph. 2020, 28, 1198–1208. [Google Scholar] [CrossRef] [PubMed]
  25. Dou, Z.; Wang, N.; Li, B.; Wang, Z.; Li, H.; Liu, B. Dual color space guided sketch colorization. IEEE Trans. Image Process. 2021, 30, 7292–7304. [Google Scholar] [CrossRef] [PubMed]
  26. Liu, X.; Wu, W.; Li, C.; Li, Y.; Wu, H. Reference-guided structure-aware deep sketch colorization for cartoons. Comput. Vis. Media 2022, 8, 135–148. [Google Scholar] [CrossRef]
  27. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  28. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9799–9808. [Google Scholar]
  29. Guérin, C.; Rigaud, C.; Mercier, A.; Ammar-Boudjelal, F.; Bertet, K.; Bouju, A.; Burie, J.C.; Louis, G.; Ogier, J.M.; Revel, A. eBDtheque: A representative database of comics. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1145–1149. [Google Scholar]
  30. Li, Y.; Aizawa, K.; Matsui, Y. Manga109Dialog: A Large-Scale Dialogue Dataset for Comics Speaker Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  31. Nguyen, N.V.; Rigaud, C.; Burie, J.C. Digital comics image indexing based on deep learning. J. Imaging 2018, 4, 89. [Google Scholar] [CrossRef]
Figure 1. Comic coloring and extraction composite networks. The framework includes a comic extraction module and a comic coloring module; the comic features detected by the extraction module are integrated into the coloring module to enhance its detail generation. The Korean dialogue translates as follows: ‘대공녀님과 함께할 수 있어 그저 영광일 뿐입니다.’ (It is an honor to be with you, Your Grace.); ‘난’ (I); ‘소저의 안내는 제가 말을 겁니다.’ (I will speak as your guide.)
Figure 2. Comic extraction module. The module uses ResNet as the backbone network and adds an RPN, RoIAlign, and subdivision layers. The Korean dialogue translates as follows: ‘대공녀님과 함께할 수 있어 그저 영광일 뿐입니다.’ (It is an honor to be with you, Your Grace.); ‘난’ (I); ‘소저의 안내는 제가 말을 겁니다.’ (I will speak as your guide.)
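The RoIAlign step named in the Figure 2 caption can be illustrated compactly. Below is a minimal NumPy sketch of the idea (not the paper's implementation): each output bin samples the feature map at its continuous-valued bin centre with bilinear interpolation instead of quantising to the nearest cell. Real RoIAlign averages several sample points per bin and runs on multi-channel features; one sample per bin is used here for brevity.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at continuous coordinates (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

def roi_align(feat, box, out_size=2):
    """Pool region box = (y0, x0, y1, x1) into an out_size x out_size grid,
    sampling each bin centre without quantisation (one sample per bin)."""
    y0, x0, y1, x1 = box
    bin_h = (y1 - y0) / out_size
    bin_w = (x1 - x0) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cy = y0 + (i + 0.5) * bin_h  # continuous bin centre
            cx = x0 + (j + 0.5) * bin_w
            out[i, j] = bilinear_sample(feat, cy, cx)
    return out
```

Avoiding the coordinate quantisation of RoIPool is what preserves the pixel-accurate alignment that the subdivision layers later refine.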
Figure 3. Adaptive subdivision. A coarse prediction on a 4 × 4 grid is upsampled twice using bilinear interpolation; at each step, the N most ambiguous points (black points) are re-predicted to restore detail on the finer mesh.
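The adaptive subdivision of Figure 3 follows a PointRend-style loop: upsample the coarse prediction, select the points whose scores are most ambiguous, and re-predict only those. A minimal NumPy sketch of the two core steps, under the assumption that the "ambiguous" points are those whose foreground probability lies closest to 0.5:

```python
import numpy as np

def upsample2x(p):
    """Double the resolution of a 2-D probability map with bilinear interpolation."""
    h, w = p.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    dy = (ys - y0)[:, None]; dx = (xs - x0)[None, :]
    return (p[np.ix_(y0, x0)] * (1 - dy) * (1 - dx) + p[np.ix_(y0, x1)] * (1 - dy) * dx +
            p[np.ix_(y1, x0)] * dy * (1 - dx) + p[np.ix_(y1, x1)] * dy * dx)

def most_uncertain_points(p, n):
    """(row, col) indices of the n points whose probability is closest to 0.5;
    these are the points the fine prediction head would re-evaluate."""
    idx = np.argsort(np.abs(p.ravel() - 0.5))[:n]
    return np.stack(np.unravel_index(idx, p.shape), axis=1)
```

Only the selected points are sent through the fine head, which is what keeps the refinement cheap compared with predicting every pixel of the finer mesh.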
Figure 4. Comic coloring module. The module is based on an improved GAN: U-Net serves as the generator, into which the comic feature mask information computed by the extraction module is introduced, and a PatchGAN discriminator performs the discriminative computation for comic coloring.
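Two details of the coloring module can be made concrete. Conditioning the U-Net generator amounts to stacking the extraction module's feature masks as extra input channels, and a PatchGAN discriminator emits one real/fake score per image patch rather than a single scalar for the whole image. A schematic NumPy sketch (the names and the `score_fn` stand-in for the discriminator's convolution stack are illustrative, not from the paper):

```python
import numpy as np

def make_generator_input(sketch, masks):
    """Condition the generator: stack the grayscale sketch (H, W) with the
    K per-class feature masks (K, H, W) as extra channels -> (1 + K, H, W)."""
    return np.concatenate([sketch[None, ...], masks], axis=0)

def patch_scores(img, patch=4, score_fn=np.mean):
    """PatchGAN-style output: one score per patch x patch region instead of
    one score per image (score_fn replaces the learned conv stack)."""
    h, w = img.shape
    gh, gw = h // patch, w // patch
    grid = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return score_fn(grid, axis=(1, 3))
```

Scoring patches rather than whole images pushes the adversarial loss toward local texture and detail, which is why PatchGAN discriminators are a common choice for colorization.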
Figure 5. Comic datasets. (Translations of the non-English dialogue in the comics are as follows: the Manga109 sample reads, “the child was in a traffic accident on the way home from playing with her parents, and both of her parents protected her at the time… The last place her parents took her to was this amusement park… She probably can’t forget the last day they spent together… She escaped from the facility and came here… Shinichiro…? Even your sister will be sad, so don’t cry… Ugh, don’t cry! You’re such a crybaby, if you’re such a crybaby, your mom and dad in heaven will be sad and cry too! Stop it you idiot!! Please don’t get your face close. Oops, I know that much!! What? You feel weird right now. You should get stronger than you are now! You’re weak, but I don’t want to be lectured by someone like that! Why am I so nervous! Good morning, I… I must have gotten my face close yesterday! Yes, you made breakfast, right? Hey, are you even listening to me? I’m off. Next time, I’ll make my lunch too. If it’s Seri’s, it’s fine to make it every day, right? Tetsuta praised me…! Another close-up!! What’s going on, did he mess up or come to complain, this bastard, no, the opposite was good.”; the KComics5000 sample reads: “That’s right. The Cheongunsim method is not a technique that can be used when fighting, so isn’t it a technique that only the elders can learn? Do you think so? The teacher taught me that it would be a loss to learn it when you are young and can hold a sword. I won’t say which is right. Each person has a different purpose for training. There is no turning back. Eeeeek! Think of this as part of the practice. It will be more helpful than training at the main temple. But if it is as the private school says, it could be dangerous! I want to go to the Tang gate and show off my skills! If that is true, then you should leave. You can just take Mingdu with you. Oops! Sorry. My hand slipped. My hand slipped and I tripped. But why are there so many rocks here? It is an honor to be with you, Your Grace. I will be your guide. If you will, I will accompany you and show you places worth seeing. It’s okay. I don’t want to take up your time. Don’t worry. Today, all my time belongs to you, Grand Duchess. Don’t you want to join Seonhwadan? Seonhwadan? They are my escort warriors, and I’m going to create the strongest escort warriors in the martial arts world. Then I’ll have them escort Yonggaga. Then you’ll be around Yonggaga, so you’ll end up doing the same thing. Right?”)
Figure 6. Annotated description of the dataset. The Korean dialogue translates as follows: ‘대공녀님과 함께할 수 있어 그저 영광일 뿐입니다.’ (It is an honor to be with you, Your Grace.); ‘난’ (I); ‘소저의 안내는 제가 말을 겁니다.’ (I will speak as your guide.)
Figure 7. Processing pipeline for black-and-white comics. The Korean dialogue translates as follows: ‘대공녀님과 함께할 수 있어 그저 영광일 뿐입니다.’ (It is an honor to be with you, Your Grace.); ‘난’ (I); ‘소저의 안내는 제가 말을 겁니다.’ (I will speak as your guide.)
Figure 8. Comic content detection and segmentation. (Translations of the non-English dialogue in the comics are as follows: There was one rule made in the village. After that incident…. The rule? … What is that rule! … Only Me: But… Naruto! It’s a rule that must never be told to you. What the heck is going on? Even if you want to deny the past, what you gained in the past is also yours. If you can save the people with that, then that in itself can be considered your mission. Understood. I will prepare to get down again. Before that, is there something you must gain? What is it? Gain, Ban-yoon. But as for me now…! I received a report that there were people who were trying to harm the Grand Duchess, so I came here in a hurry with a worried heart. That happened, but the brave man here helped me. Honey, please retreat with your men. If you will, I will accompany you and show you the places worth seeing. It is okay. I do not want to take up your time. Don’t worry. Today, all my time belongs to you, Grand Duchess. This guy is seriously crazy! Hey~ Get this and come to your senses! Don’t even think about stopping me. My determination will never change! A sketch… I’m sure I’ve seen it somewhere… Are you serious? Bindo, Baeksang, and that child are all prepared for this. Rather than being obsessed with the mind and becoming a madman, it would be better to die. Today, Jang Moon-in has surprised me many times. Bindo has nothing left to be surprised about. I am sorry. Then trust me and cut taxes. Kid…? Are you okay, kid? You got hit on the back of your head by a man in a black coat back then. How did you get that head wound? What are you talking about? I’m a sophomore in high school… Have you ever been in love? I’ve never been in love before.)
Figure 9. Comparison between black-and-white comics and colored comics. (The dialogue translations are the same as for Figure 8.)
Table 1. Colored comic image performance comparison.
Method                      Recall   Precision   F1
Arai and Tolle [15]         69.8     62.3        63.6
Qin et al. [3]              78.7     75.1        75.4
Nguyen et al. [17]          83.2     84.0        82.7
Dutta et al. [20]           84.6     85.1        85.3
Xin and Ma [5]              88.5     87.9        88.4
Dubray and Laubrock [21]    90.7     91.6        91.1
Ours                        95.3     98.8        98.5
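As a reminder of how the three columns relate, precision, recall, and F1 all derive from true-positive, false-positive, and false-negative counts, with F1 the harmonic mean of precision and recall. A minimal helper (returns fractions, whereas the table reports percentages):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from detection counts:
    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it always lies between precision and recall and is pulled toward the smaller of the two, which is why a balanced detector scores best on it.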
Table 2. Colored comic image performance comparison.
Model              FID       SSIM
Pix2Pix            103.25    0.53
S2PV3              79.01     0.61
DCSGAN             54.36     0.69
Zoom-GAN           34.75     0.87
VAE-GAN            27.09     0.91
Liu et al. [12]    25.16     0.92
Ours               24.58     0.94
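Of the two columns, FID compares Inception feature statistics of generated and real images (lower is better) and cannot be sketched without a pretrained network, while SSIM has a closed form. A single-window SSIM in NumPy (the reported score is normally averaged over local Gaussian windows, so this shows only the core formula):

```python
import numpy as np

def global_ssim(x, y, L=1.0):
    """Single-window SSIM between images x and y; L is the dynamic range
    of the pixel values (1.0 for [0, 1] images). Identical images score 1."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # standard stabilising constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
```

SSIM factors the comparison into luminance, contrast, and structure terms, so a colorizer that preserves line-art structure scores close to 1 even when exact pixel values differ.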

Share and Cite

Man, Q.; Cho, Y.-I. Efficient Comic Content Extraction and Coloring Composite Networks. Appl. Sci. 2025, 15, 2641. https://doi.org/10.3390/app15052641
