Deep Learning in Video and Image Processing: Challenges, Solutions, and Future Directions

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: 31 December 2025 | Viewed by 25968

Special Issue Editors

Dr. Abdussalam Elhanashi

Guest Editor

Department of Information Engineering, University of Pisa, Via Girolamo Caruso, 16, 56122 Pisa, Italy
Interests: deep learning; machine learning; video processing; image processing; Internet of Things; cybersecurity; embedded systems

Prof. Dr. Sergio Saponara

Guest Editor

Department of Information Engineering, University of Pisa, Via Girolamo Caruso, 16, 56122 Pisa, Italy
Interests: automotive electronics; embedded HPC (high-performance computing); enabling technologies IoT (Internet of Things)

Special Issue Information

Dear Colleagues,

The Special Issue “Challenges and Solutions in Real-Time Deep Learning and Machine Learning for Video and Image Processing on Edge Platforms” focuses on advancing the integration of deep learning and machine learning (ML) techniques with video and image processing directly on edge devices. This collection of papers aims to address the unique challenges of executing computationally intensive ML algorithms in real time on resource-constrained devices, such as those with limited processing power, memory, and energy budgets. The purpose is to explore innovative solutions that enhance the efficiency, accuracy, and reliability of ML applications in real-world scenarios. The scope covers a broad spectrum of topics including, but not limited to, algorithm optimization, hardware–software co-design, energy-efficient ML models, and real-time data processing techniques. This Special Issue will significantly contribute to the existing literature by bridging the gap between theoretical ML advancements and practical edge computing implementations. While current research predominantly focuses on cloud-based solutions or offline processing, this Special Issue emphasizes the need for immediate, localized processing, which is crucial for latency-sensitive applications. Examples of real-world applications include surveillance systems that require instant anomaly detection, medical imaging for real-time diagnosis, autonomous vehicles needing immediate object recognition and decision-making, smart cameras in urban traffic management, augmented reality devices for interactive user experiences, industrial automation for monitoring and control, wildlife monitoring for real-time tracking, disaster response systems for rapid situational analysis, smart home devices for enhanced security and convenience, and wearable technology for health monitoring and personalized feedback.
By presenting cutting-edge research and practical case studies, this Special Issue will serve as a valuable resource for researchers, engineers, and practitioners aiming to develop and deploy efficient ML solutions on edge platforms, ultimately advancing the field of real-time video and image processing.

Dr. Abdussalam Elhanashi
Prof. Dr. Sergio Saponara
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

real-time machine learning
edge computing
video processing
image processing
algorithm optimization
hardware–software co-design
energy-efficient ML
latency-sensitive applications
computational efficiency
resource-constrained devices
anomaly detection
real-time diagnostics
autonomous vehicles
object recognition
urban traffic management
augmented reality
industrial automation
wildlife monitoring
disaster response systems
smart home devices
wearable technology
health monitoring
personalized feedback
real-time data processing
ML model deployment
edge AI
smart cameras
low-power computing
data privacy
real-world ML applications

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (13 papers)

Research

17 pages, 3628 KB

Open Access Article

A Unified Self-Supervised Framework for Plant Disease Detection on Laboratory and In-Field Images

by Xiaoli Huan, Bernard Chen and Hong Zhou

Electronics 2025, 14(17), 3410; https://doi.org/10.3390/electronics14173410 - 27 Aug 2025

Viewed by 772

Abstract

Early and accurate detection of plant diseases is essential for ensuring food security and maintaining sustainable agricultural productivity. However, most deep learning models for plant disease classification rely heavily on large-scale annotated datasets, which are expensive, labor-intensive, and often impractical to obtain in real-world farming environments. To address this limitation, we propose a unified self-supervised learning (SSL) framework that leverages unlabeled plant imagery to learn meaningful and transferable visual representations. Our method integrates three complementary objectives—Bootstrap Your Own Latent (BYOL), Masked Image Modeling (MIM), and contrastive learning—within a ResNet101 backbone, optimized through a hybrid loss function that captures global alignment, local structure, and instance-level distinction. GPU-based data augmentations are used to introduce stochasticity and enhance generalization during pretraining. Experimental results on the challenging PlantDoc dataset demonstrate that our model achieves an accuracy of 77.82%, with macro-averaged precision, recall, and F1-score of 80.00%, 78.24%, and 77.48%, respectively—on par with or exceeding most state-of-the-art supervised and self-supervised approaches. Furthermore, when fine-tuned on the PlantVillage dataset, the pretrained model attains 99.85% accuracy, highlighting its strong cross-domain generalization and practical transferability. These findings underscore the potential of self-supervised learning as a scalable, annotation-efficient, and robust solution for plant disease detection in real-world agricultural settings, especially where labeled data is scarce or unavailable. Full article
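The macro-averaged metrics quoted in this abstract can be illustrated with a small sketch. It is not the authors' evaluation code; the function name `macro_metrics` and the toy confusion matrix are assumptions made for illustration. It also shows why a macro F1-score can fall below both macro precision and macro recall: it is the mean of the per-class F1-scores, not the harmonic mean of the two macro averages.

```python
def macro_metrics(confusion):
    """Return (accuracy, macro precision, macro recall, macro F1).

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    # Macro averaging: every class contributes equally, regardless of its size.
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Hypothetical two-class confusion matrix, not data from the paper.
toy_confusion = [[8, 2], [1, 9]]
accuracy, macro_p, macro_r, macro_f1 = macro_metrics(toy_confusion)
```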

(This article belongs to the Special Issue Deep Learning in Video and Image Processing: Challenges, Solutions, and Future Directions)

14 pages, 24112 KB

Open Access Article | Editor’s Choice

ImpactAlert: Pedestrian-Carried Vehicle Collision Alert System

by Raghav Rawat, Caspar Lant, Haowen Yuan and Dennis Shasha

Electronics 2025, 14(15), 3133; https://doi.org/10.3390/electronics14153133 - 6 Aug 2025

Viewed by 657

Abstract

The ImpactAlert system is a chest-mounted system that detects objects that are likely to hit a pedestrian and alerts that pedestrian. The primary use cases are visually impaired pedestrians or pedestrians who need to be warned about vehicles or other pedestrians coming from unseen directions. This paper argues for the need for such a system, describes the design and algorithms of ImpactAlert, and reports experiments carried out in varied urban environments, ranging from densely crowded to semi-urban, in the United States, India and China. ImpactAlert makes use of a LiDAR camera found on a commercial wireless phone and processes the data over several frames to evaluate the time to impact and speed of potential threats. When ImpactAlert determines that a threat meets the criteria set by the user, it sends warning signals through an output device to warn the pedestrian. The output device can be an audible warning and/or a low-cost smart cane that vibrates when danger approaches. Our experiments in urban and semi-urban environments show that ImpactAlert (i) can avoid nearly all false negatives (when an alarm should be sent and it isn’t) and (ii) enjoys a low false positive rate. The net result is an effective low-cost system to alert pedestrians in an urban environment. Full article
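The idea of evaluating time to impact over several frames can be sketched as follows. This is an illustrative simplification, not the ImpactAlert algorithm: the function names, the constant-closing-speed assumption, and the default thresholds `max_tti_s` and `min_speed_mps` are all hypothetical.

```python
def time_to_impact(ranges_m, frame_dt_s):
    """Estimate closing speed (m/s) and time to impact (s) from per-frame distances.

    ranges_m holds the measured distance to one tracked object, oldest frame first.
    A constant closing speed over the window is assumed.
    """
    if len(ranges_m) < 2:
        raise ValueError("need at least two range samples")
    elapsed_s = frame_dt_s * (len(ranges_m) - 1)
    closing_speed = (ranges_m[0] - ranges_m[-1]) / elapsed_s
    if closing_speed <= 0:
        # The object is stationary relative to the user or moving away.
        return closing_speed, float("inf")
    return closing_speed, ranges_m[-1] / closing_speed


def should_alert(ranges_m, frame_dt_s, max_tti_s=3.0, min_speed_mps=1.0):
    """Apply user-set criteria (hypothetical defaults) to decide whether to warn."""
    speed, tti = time_to_impact(ranges_m, frame_dt_s)
    return speed >= min_speed_mps and tti <= max_tti_s


# An object approaching from 10 m to 7 m over four frames sampled 0.25 s apart.
speed, tti = time_to_impact([10.0, 9.0, 8.0, 7.0], 0.25)
```

Averaging over several frames rather than two makes the speed estimate less sensitive to noise in a single range reading, which is one plausible reason to process the data over a window.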

19 pages, 1555 KB

Open Access Article

MedLangViT: A Language–Vision Network for Medical Image Segmentation

by Yiyi Wang, Jia Su, Xinxiao Li and Eisei Nakahara

Electronics 2025, 14(15), 3020; https://doi.org/10.3390/electronics14153020 - 29 Jul 2025

Viewed by 546

Abstract

Precise medical image segmentation is crucial for advancing computer-aided diagnosis. Although deep learning-based medical image segmentation is now widely applied in this field, the complexity of human anatomy and the diversity of pathological manifestations often necessitate the use of image annotations to enhance segmentation accuracy. In this process, the scarcity of annotations and the lightweight design requirements of associated text encoders collectively present key challenges for improving segmentation model performance. To address these challenges, we propose MedLangViT, a novel language–vision multimodal model for medical image segmentation that incorporates medical descriptive information through lightweight text embedding rather than text encoders. MedLangViT innovatively leverages medical textual information to assist the segmentation process, thereby reducing reliance on extensive high-precision image annotations. Furthermore, we design an Enhanced Channel-Spatial Attention Module (ECSAM) to effectively fuse textual and visual features, strengthening textual guidance for segmentation decisions. Extensive experiments conducted on two publicly available text–image-paired medical datasets demonstrated that MedLangViT significantly outperforms existing state-of-the-art methods, validating the effectiveness of both the proposed model and the ECSAM. Full article

26 pages, 3625 KB

Open Access Article

Deep-CNN-Based Layout-to-SEM Image Reconstruction with Conformal Uncertainty Calibration for Nanoimprint Lithography in Semiconductor Manufacturing

by Jean Chien and Eric Lee

Electronics 2025, 14(15), 2973; https://doi.org/10.3390/electronics14152973 - 25 Jul 2025

Viewed by 653

Abstract

Nanoimprint lithography (NIL) has emerged as a promising low-cost technique for sub-10 nm patterning; yet, robust process control remains difficult because of time-consuming physics-based simulators and the scarcity of labeled SEM data. We propose a data-efficient, two-stage deep-learning framework that directly reconstructs post-imprint SEM images from binary design layouts and simultaneously delivers calibrated pixel-by-pixel uncertainty. First, a shallow U-Net is trained with conformalized quantile regression (CQR) to output 90% prediction intervals with statistically guaranteed coverage. Moreover, per-level errors on a small calibration dataset are designed to drive an outlier-weighted and encoder-frozen transfer fine-tuning phase that refines only the decoder, with its capacity explicitly focused on regions of spatial uncertainty. On independent test layouts, our proposed fine-tuned model significantly reduces the mean absolute error (MAE) from 0.0365 to 0.0255 and raises the coverage from 0.904 to 0.926, while cutting the labeled data and GPU time by 80% and 72%, respectively. The resultant uncertainty maps highlight spatial regions associated with error hotspots and support defect-aware optical proximity correction (OPC) with fewer guard-band iterations. Extending the current perspective beyond OPC, the model-agnostic and modular design of the pipeline allows flexible integration into other critical stages of the semiconductor manufacturing workflow, such as imprinting, etching, and inspection. In these stages, such predictions are critical for achieving higher precision, efficiency, and overall process robustness in semiconductor manufacturing, which is the ultimate motivation of this study. Full article
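The calibration step behind conformalized quantile regression can be sketched in a few lines. This is the standard split-conformal recipe for CQR, shown under simplifying assumptions; it is not the authors' pipeline, and the names `cqr_offset` and `calibrated_interval` are hypothetical. Given lower and upper quantile predictions on a held-out calibration set, it finds the offset by which every interval is widened (or narrowed, if negative) to reach the target coverage.

```python
import math


def cqr_offset(lower, upper, targets, coverage=0.9):
    """Return the conformal offset from calibration-set predictions.

    lower/upper are the predicted quantile bounds, targets the true values.
    The conformity score is positive when a target falls outside its interval.
    """
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, targets))
    n = len(scores)
    # Finite-sample rank; rounding guards against floating-point noise in the product.
    rank = math.ceil(round((n + 1) * coverage, 9))
    if rank > n:
        return float("inf")  # too few calibration points for this coverage level
    return scores[rank - 1]


def calibrated_interval(lo, hi, offset):
    """Adjust one predicted interval by the calibration offset."""
    return lo - offset, hi + offset


# Hypothetical calibration data: nine pixels with predicted interval [0, 1].
offset = cqr_offset([0.0] * 9, [1.0] * 9,
                    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.2, 1.5])
```

In a per-pixel setting such as the one described in the abstract, the same offset would be applied to the interval predicted at every pixel of a new image.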

23 pages, 3645 KB

Open Access Article

Color-Guided Mixture-of-Experts Conditional GAN for Realistic Biomedical Image Synthesis in Data-Scarce Diagnostics

by Patrycja Kwiek, Filip Ciepiela and Małgorzata Jakubowska

Electronics 2025, 14(14), 2773; https://doi.org/10.3390/electronics14142773 - 10 Jul 2025

Viewed by 501

Abstract

Background: Limited availability of high-quality labeled biomedical image datasets presents a significant challenge for training deep learning models in medical diagnostics. This study proposes a novel image generation framework combining conditional generative adversarial networks (cGANs) with a Mixture-of-Experts (MoE) architecture and color histogram-aware loss functions to enhance synthetic blood cell image quality. Methods: RGB microscopic images from the BloodMNIST dataset (eight blood cell types, resolution 3 × 128 × 128) underwent preprocessing with k-means clustering to extract the dominant colors and UMAP for visualizing class similarity. Spearman correlation-based distance matrices were used to evaluate the discriminative power of each RGB channel. A MoE–cGAN architecture was developed with residual blocks and LeakyReLU activations. Expert generators were conditioned on cell type, and the generator’s loss was augmented with a Wasserstein distance-based term comparing red and green channel histograms, which were found most relevant for class separation. Results: The red and green channels contributed most to class discrimination; the blue channel had minimal impact. The proposed model achieved 0.97 classification accuracy on generated images (ResNet50), with 0.96 precision, 0.97 recall, and a 0.96 F1-score. The best Fréchet Inception Distance (FID) was 52.1. Misclassifications occurred mainly among visually similar cell types. Conclusions: Integrating histogram alignment into the MoE–cGAN training significantly improves the realism and class-specific variability of synthetic images, supporting robust model development under data scarcity in hematological imaging. Full article
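A Wasserstein distance between two color-channel histograms, the kind of term this abstract adds to the generator loss, can be computed in closed form in one dimension. The sketch below is an illustration, not the authors' loss: `wasserstein_1d` and `color_term` are hypothetical names, both histograms are assumed to share the same equally spaced bins, and a training implementation would need a differentiable version of the histogram.

```python
def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must contain positive mass")
    return [h / total for h in hist]


def wasserstein_1d(hist_a, hist_b, bin_width=1.0):
    """1-Wasserstein distance between two histograms on identical bins.

    In one dimension it equals the area between the two cumulative distributions.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(hist_a), normalize(hist_b)
    distance = 0.0
    cdf_gap = 0.0
    for mass_a, mass_b in zip(p, q):
        cdf_gap += mass_a - mass_b
        distance += abs(cdf_gap) * bin_width
    return distance


def color_term(real_red, fake_red, real_green, fake_green):
    """Sum of red- and green-channel distances, the two channels the abstract singles out."""
    return wasserstein_1d(real_red, fake_red) + wasserstein_1d(real_green, fake_green)
```

For example, moving all mass from the first to the last of three bins gives a distance of 2 bin widths, whereas a bin-wise difference would not distinguish that case from a shift of a single bin.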

13 pages, 3074 KB

Open Access Article

Wavelet-Based Fusion for Image Steganography Using Deep Convolutional Neural Networks

by Amal Khalifa and Yashi Yadav

Electronics 2025, 14(14), 2758; https://doi.org/10.3390/electronics14142758 - 9 Jul 2025

Viewed by 609

Abstract

Steganography has long served as a powerful tool for covert communication, particularly through image-based techniques that embed secret information within innocuous cover images. With the increasing adoption of deep learning, researchers have sought more secure and efficient methods for image steganography. This study builds upon and extends the DeepWaveletFusion approach by integrating convolutional neural networks (CNNs) with the discrete wavelet transform (DWT) to enhance both embedding and recovery performance. The proposed method, DeepWaveletFusionToo, is a lightweight architecture that employs a custom-built DWT image dataset and leverages the mean squared error (MSE) loss function during training, significantly reducing model complexity and computational cost. Experimental results demonstrate that DeepWaveletFusionToo achieves improved imperceptibility compared to its predecessor and delivers competitive recovery accuracy over existing deep learning-based steganographic techniques, establishing its simplicity and effectiveness. Full article

19 pages, 1891 KB

Open Access Article

Comparative Study on Energy Consumption of Neural Networks by Scaling of Weight-Memory Energy Versus Computing Energy for Implementing Low-Power Edge Intelligence

by Ilpyung Yoon, Jihwan Mun and Kyeong-Sik Min

Electronics 2025, 14(13), 2718; https://doi.org/10.3390/electronics14132718 - 5 Jul 2025

Cited by 1 | Viewed by 1973

Abstract

Energy consumption has emerged as a critical design constraint in deploying high-performance neural networks, especially on edge devices with limited power resources. In this paper, a comparative study is conducted for two prevalent deep learning paradigms—convolutional neural networks (CNNs), exemplified by ResNet18, and transformer-based large language models (LLMs), represented by GPT3-small, Llama-7B, and GPT3-175B. By analyzing how the scaling of memory energy versus computing energy affects the energy consumption of neural networks with different batch sizes (1, 4, 8, 16), it is shown that ResNet18 transitions from a memory energy-limited regime at low batch sizes to a computing energy-limited regime at higher batch sizes due to its extensive convolution operations. On the other hand, GPT-like models remain predominantly memory-bound, with large parameter tensors and frequent key–value (KV) cache lookups accounting for most of the total energy usage. Our results reveal that reducing weight-memory energy is particularly effective in transformer architectures, while improving multiply–accumulate (MAC) efficiency significantly benefits CNNs at higher workloads. We further highlight near-memory and in-memory computing approaches as promising strategies to lower data-transfer costs and enhance power efficiency in large-scale deployments. These findings offer actionable insights for architects and system designers aiming to optimize artificial intelligence (AI) performance under stringent energy budgets on battery-powered edge devices. Full article
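The shift from a memory-limited to a computing-limited regime as batch size grows can be reproduced with a first-order model. The sketch below is not the paper's methodology: it assumes weights are fetched once per batch, ignores activations and the KV cache, and uses purely illustrative workload sizes and energy costs (`pj_per_byte`, `pj_per_mac`) that do not come from the paper.

```python
def energy_breakdown(weight_bytes, macs_per_sample, batch, pj_per_byte, pj_per_mac):
    """Return (weight-memory energy, computing energy) in picojoules for one batch."""
    memory_pj = weight_bytes * pj_per_byte              # weights read once, shared by the batch
    compute_pj = batch * macs_per_sample * pj_per_mac   # MAC work scales with batch size
    return memory_pj, compute_pj


def memory_share(weight_bytes, macs_per_sample, batch, pj_per_byte, pj_per_mac):
    """Fraction of the modeled energy spent on fetching weights."""
    memory_pj, compute_pj = energy_breakdown(
        weight_bytes, macs_per_sample, batch, pj_per_byte, pj_per_mac)
    return memory_pj / (memory_pj + compute_pj)


def crossover_batch(weight_bytes, macs_per_sample, pj_per_byte, pj_per_mac):
    """Batch size at which computing energy equals weight-memory energy."""
    return (weight_bytes * pj_per_byte) / (macs_per_sample * pj_per_mac)


# Illustrative workloads: a CNN reuses each weight many times per sample,
# while a transformer decoding one token uses each weight roughly once.
cnn_like = dict(weight_bytes=10_000_000, macs_per_sample=2_000_000_000)
llm_like = dict(weight_bytes=7_000_000_000, macs_per_sample=7_000_000_000)
costs = dict(pj_per_byte=640.0, pj_per_mac=1.0)

for batch in (1, 4, 8, 16):
    print(batch,
          round(memory_share(batch=batch, **cnn_like, **costs), 3),
          round(memory_share(batch=batch, **llm_like, **costs), 3))
```

Under these assumed numbers the CNN-like workload crosses over at a batch size of about 3, while the transformer-like workload stays memory-dominated across the whole range of batch sizes studied, which matches the qualitative trend the abstract reports.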

17 pages, 6326 KB

Open Access Article

Meta Network for Flow-Based Image Style Transfer

by Yihjia Tsai, Hsiau-Wen Lin, Chii-Jen Chen, Hwei-Jen Lin and Chen-Hsiang Yu

Electronics 2025, 14(10), 2035; https://doi.org/10.3390/electronics14102035 - 16 May 2025

Viewed by 628

Abstract

Style transfer aims to produce synthesized images that retain the content of one image while adopting the artistic style of another. Traditional style transfer methods often require training separate transformation networks for each new style, limiting their adaptability and scalability. To address this challenge, we propose a flow-based image style transfer framework that integrates Randomized Hierarchy Flow (RH Flow) and a meta network for adaptive parameter generation. The meta network dynamically produces the RH Flow parameters conditioned on the style image, enabling efficient and flexible style adaptation without retraining for new styles. RH Flow enhances feature interaction by introducing a random permutation of the feature sub-blocks before hierarchical coupling, promoting diverse and expressive stylization while preserving the content structure. Our experimental results demonstrate that the proposed framework, Meta FIST, achieves superior content retention, style fidelity, and adaptability compared to existing approaches. Full article

24 pages, 1713 KB

Open Access Article

A Performance Analysis of You Only Look Once Models for Deployment on Constrained Computational Edge Devices in Drone Applications

by Lucas Rey, Ana M. Bernardos, Andrzej D. Dobrzycki, David Carramiñana, Luca Bergesio, Juan A. Besada and José Ramón Casar

Electronics 2025, 14(3), 638; https://doi.org/10.3390/electronics14030638 - 6 Feb 2025

Cited by 7 | Viewed by 4139

Abstract

Advancements in embedded systems and Artificial Intelligence (AI) have enhanced the capabilities of Unmanned Aircraft Vehicles (UAVs) in computer vision. However, the integration of AI techniques onboard drones is constrained by their processing capabilities. In this sense, this study evaluates the deployment of object detection models (YOLOv8n and YOLOv8s) on both resource-constrained edge devices and cloud environments. The objective is to carry out a comparative performance analysis using a representative real-time UAV image processing pipeline. Specifically, the NVIDIA Jetson Orin Nano, Orin NX, and Raspberry Pi 5 (RPI5) devices have been tested to measure their detection accuracy, inference speed, and energy consumption, and the effects of post-training quantization (PTQ). The results show that YOLOv8n surpasses YOLOv8s in its inference speed, achieving 52 FPS on the Jetson Orin NX and 65 FPS with INT8 quantization. Conversely, the RPI5 failed to satisfy the real-time processing needs in spite of its suitability for low-energy consumption applications. An analysis of both the cloud-based and edge-based end-to-end processing times showed that increased communication latencies hindered real-time applications, revealing trade-offs between edge (low latency) and cloud processing (quick processing). Overall, these findings contribute to providing recommendations and optimization strategies for the deployment of AI models on UAVs. Full article
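The edge-versus-cloud trade-off described here comes down to a per-frame latency budget. The following sketch is illustrative only: the 52 FPS figure is taken from the abstract, while the cloud inference time, the network delays, the 30 FPS target, and all function names are assumptions. Frames are treated as processed one at a time, without pipelining.

```python
def frame_time_ms(fps):
    """Convert a throughput in frames per second into milliseconds per frame."""
    return 1000.0 / fps


def end_to_end_ms(inference_ms, uplink_ms=0.0, downlink_ms=0.0):
    """Total per-frame delay: send the image, run the model, return the detections."""
    return uplink_ms + inference_ms + downlink_ms


def meets_real_time(total_ms, target_fps):
    """True if one frame is fully processed within the frame period of the target rate."""
    return total_ms <= frame_time_ms(target_fps)


# Edge: inference runs on the device, so there is no network transfer.
edge_ms = end_to_end_ms(frame_time_ms(52))  # 52 FPS on the Jetson Orin NX (from the abstract)

# Cloud: faster inference, but hypothetical uplink and downlink delays are added.
cloud_ms = end_to_end_ms(inference_ms=5.0, uplink_ms=30.0, downlink_ms=10.0)

print(meets_real_time(edge_ms, target_fps=30), meets_real_time(cloud_ms, target_fps=30))
```

With these assumed delays, the cloud path misses a 30 FPS budget despite its much shorter inference time, which is the kind of trade-off the abstract points to.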

22 pages, 9016 KB

Open Access Article

Leveraging Transformer-Based OCR Model with Generative Data Augmentation for Engineering Document Recognition

by Wael Khallouli, Mohammad Shahab Uddin, Andres Sousa-Poza, Jiang Li and Samuel Kovacic

Electronics 2025, 14(1), 5; https://doi.org/10.3390/electronics14010005 - 24 Dec 2024

Viewed by 6443

Abstract

The long-standing practice of document-based engineering has resulted in the accumulation of a large number of engineering documents across various industries. Engineering documents, such as 2D drawings, continue to play a significant role in exchanging information and sharing knowledge across multiple engineering processes. However, these documents are often stored in non-digitized formats, such as paper and portable document format (PDF) files, making automation difficult. As digital engineering transforms processes in many industries, digitizing engineering documents presents a crucial challenge that requires advanced methods. This research addresses the problem of automatically extracting textual content from non-digitized legacy engineering documents. We introduced an optical character recognition (OCR) system for text detection and recognition that leverages transformer-based generative deep learning models and transfer learning approaches to enhance text recognition accuracy in engineering documents. The proposed system was evaluated on a dataset collected from ships’ engineering drawings provided by a U.S. agency. Experimental results demonstrated that the proposed transformer-based OCR model significantly outperformed pretrained off-the-shelf OCR models. Full article

19 pages, 1024 KB

Open Access Article

A Hessian-Based Deep Learning Preprocessing Method for Coronary Angiography Image Analysis

by Yanjun Li, Takaaki Yoshimura, Yuto Horima and Hiroyuki Sugimori

Electronics 2024, 13(18), 3676; https://doi.org/10.3390/electronics13183676 - 16 Sep 2024

Cited by 3 | Viewed by 1758

Abstract

Leveraging its high accuracy and stability, deep-learning-based coronary artery detection technology has been extensively utilized in diagnosing coronary artery diseases. However, traditional algorithms for localizing coronary stenosis often fall short when detecting stenosis in branch vessels, which can pose significant health risks due to factors like imaging angles and uneven contrast agent distribution. To tackle these challenges, we propose a preprocessing method that integrates Hessian-based vascular enhancement and image fusion as prerequisites for deep learning. This approach enhances fuzzy features in coronary angiography images, thereby increasing the neural network’s sensitivity to stenosis characteristics. We assessed the effectiveness of this method using the latest deep learning networks, such as YOLOv10, YOLOv9, and RT-DETR, across various evaluation metrics. Our results show that our method improves $AP_{50}$ accuracy by 4.84% and 5.07% on RT-DETR R101 and YOLOv10-X, respectively, compared to images without special pre-processing. Furthermore, our analysis of different imaging angles on stenosis localization detection indicates that the left coronary artery zero is the most suitable for detecting stenosis with an $AP_{50}$ (%) value of 90.5. The experimental results have revealed that the proposed method is effective as a preprocessing technique for deep-learning-based coronary angiography image processing and enhances the model’s ability to identify stenosis in small blood vessels. Full article
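The $AP_{50}$ metric reported in this abstract counts a detection as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching criterion follows; the function names and the `(x1, y1, x2, y2)` box convention are assumptions, and the full average-precision computation over ranked detections is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_50(predicted_box, truth_box):
    """True if the prediction would count as a true positive at the 0.5 IoU threshold."""
    return iou(predicted_box, truth_box) >= 0.5
```

Small boxes, such as those around stenoses in branch vessels, lose IoU quickly with even a few pixels of offset, so this threshold is comparatively strict for them.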

16 pages, 9901 KB

Open Access Article

A Generative Approach for Document Enhancement with Small Unpaired Data

by Mohammad Shahab Uddin, Wael Khallouli, Andres Sousa-Poza, Samuel Kovacic and Jiang Li

Electronics 2024, 13(17), 3539; https://doi.org/10.3390/electronics13173539 - 6 Sep 2024

Viewed by 1532

Abstract

Shipbuilding drawings, crafted manually before the digital era, are vital for historical reference and technical insight. However, their digital versions, stored as scanned PDFs, often contain significant noise, making them unsuitable for use in modern CAD software like AutoCAD. Traditional denoising techniques struggle with the diverse and intense noise found in these documents, which also does not adhere to standard noise models. In this paper, we propose an innovative generative approach tailored for document enhancement, particularly focusing on shipbuilding drawings. For a small, unpaired dataset of clean and noisy shipbuilding drawing documents, we first learn to generate the noise in the dataset based on a CycleGAN model. We then generate multiple paired clean–noisy image pairs using the clean images in the dataset. Finally, we train a Pix2Pix GAN model with these generated image pairs to enhance shipbuilding drawings. Through empirical evaluation on a small Military Sealift Command (MSC) dataset, we demonstrated the superiority of our method in mitigating noise and preserving essential details, offering an effective solution for the restoration and utilization of historical shipbuilding drawings in contemporary digital environments. Full article

27 pages, 52132 KB

Open Access Article | Feature Paper

Temporally Coherent Video Cartoonization for Animation Scenery Generation

by Gustavo Rayo and Ruben Tous

Electronics 2024, 13(17), 3462; https://doi.org/10.3390/electronics13173462 - 31 Aug 2024

Viewed by 3848

Abstract

The automatic transformation of short background videos from real scenarios into other forms with a visually pleasing style, like those used in cartoons, holds application in various domains. These include animated films, video games, advertisements, and many other areas that involve visual content creation. A method or tool that can perform this task would inspire, facilitate, and streamline the work of artists and people who produce this type of content. This work proposes a method that integrates multiple components to translate short background videos into other forms that contain a particular style. We apply a fine-tuned latent diffusion model with an image-to-image setting, conditioned with the image edges (computed with holistically nested edge detection) and CLIP-generated prompts to translate the keyframes from a source video, ensuring content preservation. To maintain temporal coherence, the keyframes are translated into grids and the style is interpolated with an example-based style propagation algorithm. We quantitatively assess the content preservation and temporal coherence using CLIP-based metrics over a new dataset of 20 videos translated into three distinct styles. Full article
